For scientists · Concordance Engine

Where it helps

Pre-submission self-check

Before sending a manuscript out, run every reported p-value through verify_statistics_pvalue from the raw inputs. Catches transposed numerators, wrong-tail conventions, and copy-paste drift between draft revisions.

Peer review

Reviewing a chemistry paper claiming a stoichiometric mechanism? Paste the equation into verify_chemistry. If it doesn't balance, the equation as written is incomplete or mistranscribed, and you have a concrete comment to write.
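
To make the idea of a balance check concrete, here is a minimal Python sketch that counts atoms on each side of an equation. It is an illustration only, not the engine's implementation: the names count_atoms and is_balanced are hypothetical, and the parser assumes plain formulas with no parentheses, hydrates, or charges.

```python
import re
from collections import Counter

# One chemical formula: element symbols, each with an optional count (e.g. H2O).
FORMULA = r"(?:[A-Z][a-z]?\d*)+"
# One term: optional integer coefficient followed by a formula (e.g. "2 H2O").
TERM = re.compile(rf"\s*(\d*)\s*({FORMULA})\s*")

def count_atoms(side: str) -> Counter:
    """Total atoms per element on one side of an equation (hypothetical helper)."""
    totals = Counter()
    for term in side.split("+"):
        match = TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        coefficient = int(match.group(1) or 1)
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(2)):
            totals[element] += coefficient * int(count or 1)
    return totals

def is_balanced(equation: str) -> bool:
    """True when both sides of 'reactants -> products' hold the same atoms."""
    left, right = equation.split("->")
    return count_atoms(left) == count_atoms(right)

print(is_balanced("2 H2 + O2 -> 2 H2O"))  # balanced
print(is_balanced("H2 + O2 -> H2O"))      # oxygen does not balance
```

A real checker also needs charge bookkeeping and nested groups such as Ca(OH)2, which this sketch deliberately leaves out.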

Reproducibility audits

Ask Claude (with the engine connected) to walk a paper's results: "for each table reporting a p-value, recompute from the supplied n and effect size; flag mismatches." That's a real reproducibility pass, not a vibe check.

Teaching

Have students hand-derive a t-statistic, then verify against the engine. The mismatch is the lesson — what assumption did they get wrong?

A worked example

You're reviewing a methods section that reports:

"A two-sample t-test (n₁ = n₂ = 30, mean diff = 1.0, pooled SD = 1.0) yielded p < 0.001."

Run it. The test depends only on the difference in means, so any pair of means 1.0 apart will do:

verify_statistics_pvalue({
  "test": "two_sample_t",
  "n1": 30, "n2": 30,
  "mean1": 5.0, "mean2": 4.0,
  "sd1": 1.0, "sd2": 1.0,
  "tail": "two",
  "claimed_p": 0.001
})

→ {"status": "MISMATCH",
   "detail": "claimed p=0.001, recomputed p=0.000297 (diff 7.0e-04)",
   "data": {"recomputed_t": 3.873, "df": 58.0,
            "recomputed_p": 0.000297, "tail": "two-sided"}}

The author wasn't lying; they were reporting a bound. The MISMATCH arises because the call passes 0.001 as a point value, while the recomputed p is about 0.0003. So "p < 0.001" is true and "p ≈ 0.001" is not. Now you have a precise reviewer comment instead of a hunch.
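
The arithmetic behind this check can be reproduced by hand. The sketch below is a standard-library Python illustration, not the engine's code: the helper names pooled_t, t_pdf, and two_sided_p are hypothetical, and the tail probability is obtained by numerically integrating the Student's t density with Simpson's rule rather than by a statistics library.

```python
import math

def pooled_t(mean1, mean2, sd1, sd2, n1, n2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value: 1 minus the mass in (-|t|, |t|), by Simpson's rule."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # the density is symmetric about zero
    return max(0.0, 1.0 - central_mass)

t_stat, df = pooled_t(5.0, 4.0, 1.0, 1.0, 30, 30)
p = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df}, two-sided p = {p:.6f}")
print("p < 0.001 holds:", p < 0.001)
```

With the inputs from the methods section this gives t ≈ 3.873 on 58 degrees of freedom, and a two-sided p comfortably below 0.001, in line with the engine output above.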

What it doesn't replace

The engine doesn't read your data. It can't tell you whether your experimental design is sound, whether your effect size is meaningful, whether you should have pre-registered, or whether your sample is representative. Those are scientific judgments. The engine catches the layer below judgment — the arithmetic that has to be right before any of the judgment matters.

A correct p-value computed from a confounded experiment is still wrong in the way that matters. The engine is necessary, not sufficient.

Domains covered for science workflows

Statistics: 12 test types including paired-t, Fisher exact, Mann-Whitney, Wilcoxon, regression-coefficient t. Multiple-comparisons correction (Bonferroni, BH/FDR). CI bound recomputation from raw inputs.
Chemistry: equation balance with charge handling. Catches copy-paste errors and stoichiometric mistakes.
Physics: dimensional consistency. Named conservation laws (energy, momentum, charge, mass).
Mathematics: sympy-backed equality, derivative, integral, limit, solve, matrix algebra, inequalities, infinite series, ordinary differential equations.
Biology: Hardy-Weinberg, Mendelian-ratio chi-squared, primer Tm/GC, molarity arithmetic, dose-response monotonicity, power analysis.
Computer science: static termination, functional correctness from test cases, runtime / space complexity, run-twice determinism for stochastic claims.
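
As an illustration of the multiple-comparisons corrections listed under Statistics, the following Python sketch computes Bonferroni and Benjamini-Hochberg adjusted p-values. It shows the textbook procedures under hypothetical function names and makes no claim about how the engine implements them.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", bonferroni(raw))
print("BH/FDR:    ", benjamini_hochberg(raw))
```

For these four p-values, Bonferroni leaves two results under 0.05 while Benjamini-Hochberg leaves all four, which is the kind of difference worth checking when a paper names one correction but reports numbers consistent with the other.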

The math is wrong more often than the prose admits.
