Statcheck-style audits of psychology papers find at least one statistical reporting error in roughly half of them, and in about one paper in eight an error changes the significance verdict. Statisticians who replay computations for the FDA or a journal find similar rates. Most of those errors are not fraud; they are arithmetic. The Concordance Engine catches that arithmetic.
Before sending a manuscript out, run every reported p-value through verify_statistics_pvalue from the raw inputs. This catches transposed numerators, wrong-tail conventions, and copy-paste drift between draft revisions.
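To see why a wrong-tail convention matters, here is a minimal sketch using only the Python standard library. It is an illustration, not the engine's implementation. The helper name p_from_z is hypothetical, and it uses a z statistic under a normal null to keep the arithmetic short.

```python
from statistics import NormalDist

def p_from_z(z: float, tail: str = "two") -> float:
    """p-value for a z statistic under a standard normal null (hypothetical helper)."""
    # Area beyond |z| in a single tail. For the one-sided case this assumes
    # the observed effect lies in the hypothesized direction.
    upper = 1.0 - NormalDist().cdf(abs(z))
    if tail == "two":
        return 2.0 * upper
    if tail == "one":
        return upper
    raise ValueError(f"unknown tail convention: {tail!r}")

z = 1.8
p_one = p_from_z(z, "one")
p_two = p_from_z(z, "two")
print(f"one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

For z = 1.8 the one-sided value is about 0.036 and the two-sided value about 0.072, so the same statistic lands on opposite sides of 0.05 depending on the convention.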
Reviewing a chemistry paper claiming a stoichiometric mechanism? Paste the equation into verify_chemistry. If it doesn't balance, the proposed mechanism is incomplete — and you have a concrete comment to write.
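The balance check itself is plain atom counting. The sketch below shows the idea in Python. It makes no claim about how verify_chemistry works internally, the helper names are hypothetical, and it assumes simple formulas with no parentheses, charges, or state labels.

```python
import re
from collections import Counter

def atom_counts(side: str) -> Counter:
    """Count atoms on one side of an equation such as '2 H2 + O2'."""
    totals = Counter()
    for term in side.split("+"):
        # Optional integer coefficient, then a formula starting with an element symbol.
        match = re.fullmatch(r"\s*(\d*)\s*([A-Z][A-Za-z0-9]*)\s*", term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        coefficient = int(match.group(1) or 1)
        formula = match.group(2)
        pieces = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
        # Reject formulas with characters the element pattern skipped over.
        if "".join(element + count for element, count in pieces) != formula:
            raise ValueError(f"cannot parse formula: {formula!r}")
        for element, count in pieces:
            totals[element] += coefficient * int(count or 1)
    return totals

def is_balanced(equation: str) -> bool:
    """True when both sides of 'reactants -> products' have the same atoms."""
    left, right = equation.split("->")
    return atom_counts(left) == atom_counts(right)

print(is_balanced("2 H2 + O2 -> 2 H2O"))  # True
print(is_balanced("H2 + O2 -> H2O"))      # False: one oxygen atom is unaccounted for
```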
Ask Claude (with the engine connected) to walk a paper's results: "for each table reporting a p-value, recompute from the supplied n and effect size; flag mismatches." That's a real reproducibility pass, not a vibe check.
Have students hand-derive a t-statistic, then verify against the engine. The mismatch is the lesson — what assumption did they get wrong?
You're reviewing a methods section that reports:
"A two-sample t-test (n₁ = n₂ = 30, mean diff = 1.0, pooled SD = 1.0) yielded p < 0.001."
Run it:
verify_statistics_pvalue({
  "test": "two_sample_t",
  "n1": 30, "n2": 30,
  "mean1": 5.0, "mean2": 4.0,
  "sd1": 1.0, "sd2": 1.0,
  "tail": "two",
  "claimed_p": 0.001
})
→ {"status": "MISMATCH",
   "detail": "claimed p=0.001, recomputed p=0.000297 (diff 7.0e-04)",
   "data": {"recomputed_t": 3.873, "df": 58.0,
            "recomputed_p": 0.000297, "tail": "two-sided"}}
The author wasn't lying; they were rounding. The reported "p < 0.001" is true, but the call above encodes the claim as the point value 0.001, and "p ≈ 0.001" is not true: the recomputed value is about 0.0003. Now you have a precise reviewer comment instead of a hunch.
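The same numbers can be cross-checked by hand or with a few lines of Python. The sketch below assumes a pooled-variance (Student's) t-test, as the methods sentence states, and uses only the standard library. It integrates the t density numerically with Simpson's rule rather than calling a library CDF. The function names are illustrative and say nothing about how the engine computes its result.

```python
import math

def pooled_t(n1, n2, mean1, mean2, sd1, sd2):
    """Student's two-sample t statistic with pooled variance, plus its df."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return 2 * (0.5 - central_area)

# Inputs from the methods sentence: n = 30 per group, means 5.0 and 4.0, SD 1.0.
t, df = pooled_t(30, 30, 5.0, 4.0, 1.0, 1.0)
p = two_sided_p(t, df)
print(f"t = {t:.3f}, df = {df}, two-sided p = {p:.6f}")
```

Here t is the square root of 15, about 3.873, on 58 degrees of freedom, and the two-sided p-value comes out well below 0.001. If SciPy is installed, scipy.stats.t.sf gives the tail area directly and is the more usual choice.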
The engine doesn't read your data. It can't tell you whether your experimental design is sound, whether your effect size is meaningful, whether you should have pre-registered, or whether your sample is representative. Those are scientific judgments. The engine catches the layer below judgment — the arithmetic that has to be right before any of the judgment matters.
A correct p-value computed from a confounded experiment is still wrong in the way that matters. The engine is necessary, not sufficient.