Field-level evaluation and a release gate for LLM document extraction.
Everyone ships "extract the fields from invoices with an LLM" now. Almost nobody can answer the two questions that decide whether it survives production: how accurate is it, field by field, and did yesterday's prompt change make anything worse? This project is the missing harness: a corpus with exact ground truth, a scorer that measures each field honestly, and a gate that blocks a release when any metric regresses.
- A synthetic corpus with truth by construction. 50 invoices across 5 layout families: a US office supplier, a bilingual German freight company (1.480,00 EUR, 14.05.2026), a minimal freelancer invoice with leading-zero numbers, an ALL-CAPS industrial invoice with planted distractors (a P.O. Box in the remit address, a quote number in the notes), and six one-off documents that follow no house layout at all, including a French facture and an Australian tax invoice in AUD. Every document is rendered from structured truth, so the golden set is exact: amounts are computed, then printed, then recorded. No hand labelling, no labelling errors.
- Two extractors behind one interface. A regex baseline written in good faith against the four house layouts, and Claude with a prompt derived from the same schema the scorer reads. Swapping extractors changes one flag, which is the point: the harness outlives any single approach.
- A normalizer, because raw string comparison lies. "May 14, 2026" and "2026-05-14" are the same date; "$1,308.13" and "1308.13" are the same amount; "3.522,40 EUR" is not 3.52. Golden and predicted values both pass through the same canonicalization before comparison, so the scores measure extraction, not formatting taste.
- Field-level scoring. Accuracy per field, plus precision and recall on nullable fields (due date, PO number, tax), where an extractor can fail by inventing a value that is not on the page. A wrong value counts as both a false positive and a false negative: it invented something and it missed the truth.
- A release gate with an exit code. Candidate vs baseline, metric by metric. Any drop beyond the threshold blocks the release. It cannot be gamed by shrinking the eval set or deleting a metric: both fail outright.
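The normalizer idea from the list above can be sketched in a few lines. This is an illustration, not the project's normalizer: the names `normalize_money` and `normalize_date`, the two-digit decimal heuristic, and the three date formats are all assumptions. Slash dates such as 05/04/2026 are left out on purpose because they cannot be resolved without knowing the convention.

```python
import re
from datetime import datetime
from decimal import Decimal


def normalize_money(raw: str) -> str:
    """Canonicalize a printed amount to a plain string such as '1308.13'."""
    # Drop currency symbols, codes and spaces; keep digits and separators.
    digits = re.sub(r"[^\d.,]", "", raw)
    if not any(ch.isdigit() for ch in digits):
        raise ValueError(f"no amount in {raw!r}")
    # Assumption: the rightmost separator is the decimal mark only when
    # exactly two digits follow it; every other separator is grouping.
    pos = max(digits.rfind("."), digits.rfind(","))
    if pos >= 0 and len(digits) - pos - 1 == 2:
        whole, frac = digits[:pos], digits[pos + 1:]
    else:
        whole, frac = digits, "00"
    whole = re.sub(r"[.,]", "", whole) or "0"
    return str(Decimal(f"{whole}.{frac}"))


# Assumed subset of unambiguous formats: ISO, US long form, German dotted.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d.%m.%Y")


def normalize_date(raw: str) -> str:
    """Canonicalize a printed date to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Running both golden and predicted values through the same two functions is what makes "$1,308.13" and "1308.13" compare equal while "3.522,40 EUR" stays 3522.40 rather than 3.52.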
python data/generate.py # regenerate the corpus (already committed)
python main.py run --extractor rules
Real output of that command on the committed corpus:
extractor: rules documents: 50
documents with every field correct: 45/50 (0.900)
field accuracy precision recall
invoice_number 0.960 - -
invoice_date 0.960 - -
due_date 0.980 1.000 0.964
vendor_name 1.000 - -
po_number 0.980 1.000 0.967
currency 0.980 - -
tax 0.980 1.000 0.972
total 0.960 - -
layout breakdown (share of field comparisons correct)
acme 1.000 ##############################
brightside 1.000 ##############################
meridian 1.000 ##############################
nordwind 1.000 ##############################
oneoff 0.792 ########################
failures (10 of 10 shown)
inv-007 oneoff invoice_number expected 'f-2026-83', got None
inv-007 oneoff invoice_date expected '2026-04-05', got '2026-05-04'
inv-007 oneoff due_date expected '2026-05-05', got None
inv-007 oneoff po_number expected 'ar-7741', got None
inv-007 oneoff tax expected '450.00', got None
inv-007 oneoff total expected '2700.00', got '700.00'
inv-025 oneoff currency expected 'AUD', got 'USD'
inv-032 oneoff total expected '4800.00', got None
inv-033 oneoff invoice_date expected '2026-06-02', got None
inv-047 oneoff invoice_number expected '88301124', got None
To run the LLM extractor instead, export ANTHROPIC_API_KEY=... and run
python main.py run --extractor claude, then python main.py gate runs/claude.json to compare it against the committed rules baseline.
The rules baseline is not a strawman. On the four layouts it was written against it scores a clean 1.000, which is exactly how these systems look in the demo: you write rules against the documents you have, they pass, you ship. Then the fifth layout arrives.
Every line in that failures list is a classic:
- inv-007 is a French facture. 05/04/2026 reads as May 4 under the US convention and April 5 under the French one, and nothing on the line says which; the rules guess wrong. Its total, 2 700,00, uses a space as the thousands separator, so the naive token regex captures 700,00 and reports a total of 700.00 with full confidence. A number that parses cleanly and is silently wrong by 2,000 is a much worse failure than a crash.
- inv-025 charges in A$. The symbol contains a dollar sign, the rules map dollar signs to USD, and the currency field is quietly wrong while every amount on the invoice is right.
- inv-032 says "Amount payable" where every house layout says some variant of "total". No label match, no total.
- inv-033 is a handwritten-style contractor invoice whose date has no label at all. Rules need labels.
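The space-separated total is easy to reproduce. The patterns and the sample line below are hypothetical stand-ins, not the project's actual rules or document text; they only show how a reasonable-looking amount regex drops the leading group without raising anything.

```python
import re

# Hypothetical naive pattern: '.' or ',' as thousands separators, two decimals.
NAIVE_AMOUNT = re.compile(r"\d{1,3}(?:[.,]\d{3})*[.,]\d{2}")

# Variant that also accepts a space or no-break space between groups.
SPACE_AWARE_AMOUNT = re.compile(r"\d{1,3}(?:[ \u00a0.,]\d{3})*[.,]\d{2}")

# Made-up line in the style of a French invoice total.
line = "Montant : 2 700,00 EUR"

naive = NAIVE_AMOUNT.search(line).group()
space_aware = SPACE_AWARE_AMOUNT.search(line).group()

print(naive)        # the leading "2 " is silently dropped
print(space_aware)  # the full amount is captured
```

The space-aware variant is itself only a sketch: it can glue an unrelated number that happens to precede the amount, which is one more reason to measure the result per field rather than trust the pattern.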
Two more things the numbers show. Per-field accuracy never falls below 0.96, yet 10% of documents carry at least one error: per-field averages and document-level correctness are different questions, and an invoice processor pays the document-level price. And the corpus regenerates from a seed, so all of this is reproducible byte for byte; one test asserts the gap is still visible, because if it ever closes, the corpus has stopped doing its job.
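A minimal sketch of the two scores being contrasted here, assuming each document is a dict of already-normalized field values. The function names, the data shape, and the convention of reporting 1.0 when a denominator is empty are assumptions, not the project's scorer.

```python
from typing import Optional

Doc = dict[str, Optional[str]]


def score_nullable_field(golden: list[Doc], predicted: list[Doc], field: str) -> dict[str, float]:
    """Accuracy, precision and recall for one nullable field."""
    tp = fp = fn = correct = 0
    for truth_doc, guess_doc in zip(golden, predicted):
        truth, guess = truth_doc[field], guess_doc[field]
        if truth == guess:
            correct += 1
            if truth is not None:
                tp += 1
            continue
        # A wrong value lands in both branches: it invented something
        # (false positive) and it missed the truth (false negative).
        if guess is not None:
            fp += 1
        if truth is not None:
            fn += 1
    return {
        "accuracy": correct / len(golden),
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 1.0,
    }


def perfect_rate(golden: list[Doc], predicted: list[Doc]) -> float:
    """Share of documents where every field matches."""
    return sum(g == p for g, p in zip(golden, predicted)) / len(golden)


golden = [{"due_date": "2026-05-05"}, {"due_date": None},
          {"due_date": "2026-06-01"}, {"due_date": "2026-07-01"}]
predicted = [{"due_date": None}, {"due_date": None},
             {"due_date": "2026-06-10"}, {"due_date": "2026-07-01"}]

scores = score_nullable_field(golden, predicted, "due_date")
```

In this toy set one miss and one wrong value give two false negatives but only one false positive, so recall falls further than precision.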
python main.py gate runs/claude.json --baseline data/baseline.json
The gate compares every metric and fails loudly. Here it is blocking a candidate (a simulated regression, edited from a real run file) that improved the headline number while quietly losing the due-date field:
baseline: rules (perfect rate 0.900)
candidate: claude-new-prompt (simulated) (perfect rate 0.920)
checked 15 metric(s) against the baseline
GATE FAILED: 2 regression(s)
- due_date accuracy dropped 0.980 -> 0.900 (max allowed drop 0.02)
- due_date recall dropped 0.964 -> 0.850 (max allowed drop 0.02)
That is the whole argument for gating on fields instead of averages: this candidate looks like a win in any summary dashboard. Exit code 0 releases, 1 blocks, so in CI the gate is a single line after the run:
python main.py run --extractor claude --out runs/candidate.json
python main.py gate runs/candidate.json # fails the build on regression
When a candidate legitimately improves, its run file becomes the new
data/baseline.json, and the ratchet only turns one way.
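The gate logic described in this section can be sketched as a pure function. The run-file layout (a `documents` count and a flat `metrics` mapping) and the function name are assumptions for illustration; the project's file format may differ. The numbers are the ones from the simulated regression above.

```python
MAX_DROP = 0.02  # same threshold as "max allowed drop" in the report above


def gate(baseline: dict, candidate: dict, max_drop: float = MAX_DROP) -> list[str]:
    """Return the reasons to block; an empty list means release."""
    problems = []
    # A smaller eval set fails outright, whatever the scores say.
    if candidate["documents"] < baseline["documents"]:
        problems.append("eval set shrank")
    for name, base_value in baseline["metrics"].items():
        # A metric that disappeared fails outright as well.
        if name not in candidate["metrics"]:
            problems.append(f"{name} missing from candidate")
            continue
        cand_value = candidate["metrics"][name]
        # Small epsilon so a drop of exactly max_drop is not blocked
        # by floating-point noise.
        if base_value - cand_value > max_drop + 1e-9:
            problems.append(
                f"{name} dropped {base_value:.3f} -> {cand_value:.3f} "
                f"(max allowed drop {max_drop})"
            )
    return problems


baseline = {"documents": 50, "metrics": {
    "perfect rate": 0.900, "due_date accuracy": 0.980, "due_date recall": 0.964}}
candidate = {"documents": 50, "metrics": {
    "perfect rate": 0.920, "due_date accuracy": 0.900, "due_date recall": 0.850}}

problems = gate(baseline, candidate)
exit_code = 1 if problems else 0  # a CLI would pass this to sys.exit()
```

Iterating over the baseline's metrics rather than the candidate's is the design choice that matters: the candidate cannot pass by omitting a metric it would lose on.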
| field | type | nullable | the catch |
|---|---|---|---|
| invoice_number | text | no | leading zeros must survive ("0048", not 48) |
| invoice_date | date | no | four formats, two day/month conventions |
| due_date | date | yes | "Net 30" is not a due date; null unless printed |
| vendor_name | text | no | the issuer, not the customer on the "For:" line |
| po_number | text | yes | a P.O. Box and a quote number sit nearby as bait |
| currency | currency | no | A$ is not USD |
| tax | money | yes | "TAX: EXEMPT" means null, not 0.00 |
| total | money | no | three thousands-separator conventions |
The schema lives in one list (app/schema.py). The generator writes
golden values for it, the LLM prompt is built from its descriptions, the
normalizer knows its types and the scorer knows which fields are
nullable. Add a field there and the whole pipeline follows.
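One way a single schema list can drive both the prompt and the scorer is sketched below. `FieldSpec`, `SCHEMA`, `build_prompt` and the prompt wording are hypothetical; the actual contents of app/schema.py are not shown in this document. Only three of the eight fields are included, with descriptions taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: str          # "text", "date", "money" or "currency"
    nullable: bool
    description: str


SCHEMA = [
    FieldSpec("invoice_number", "text", False,
              'leading zeros must survive ("0048", not 48)'),
    FieldSpec("due_date", "date", True,
              '"Net 30" is not a due date; null unless printed'),
    FieldSpec("tax", "money", True,
              '"TAX: EXEMPT" means null, not 0.00'),
]


def build_prompt(schema: list[FieldSpec]) -> str:
    """Render the field list into extraction instructions."""
    lines = ["Extract these fields as JSON. Use null for a nullable field "
             "that is not printed on the document."]
    for spec in schema:
        kind = "nullable" if spec.nullable else "required"
        lines.append(f"- {spec.name} ({spec.type}, {kind}): {spec.description}")
    return "\n".join(lines)


# The scorer only needs to know which fields get precision and recall.
NULLABLE_FIELDS = [spec.name for spec in SCHEMA if spec.nullable]
```

Because the prompt and the nullable list are both derived from `SCHEMA`, adding an entry changes what the model is asked for and what the scorer measures in the same commit.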
The corpus is synthetic so the repo is self-contained and the truth is
exact. For a real engagement the harness stays and the data changes:
drop your documents in data/docs/, write data/golden.json for a
labelled sample (a few dozen documents is enough to start), adjust the
schema list, and the runner, scorer, report and gate work unchanged.
PDFs and scans enter through whatever text layer you already have; the
eval does not care how the text was obtained.
python -m pytest
39 tests, no API key, no network: the corpus generator is checked for determinism, the normalizer policies are pinned case by case, the scorer arithmetic is verified against hand-computed fixtures, the gate is shown to block each kind of regression and to reject a shrunken eval set, and the LLM pipeline runs end to end against a scripted model, including a scripted API failure that is scored as a miss instead of crashing the run.
This evaluates text extraction. OCR quality, table structure recovery and layout detection are separate problems that sit upstream; the honest claim here is: given the text of a document, this measures how well fields come out and refuses to let them get worse.