
vinimabreu/doc-eval


doc-eval

Field-level evaluation and a release gate for LLM document extraction.

Everyone ships "extract the fields from invoices with an LLM" now. Almost nobody can answer the two questions that decide whether it survives production: how accurate is it, field by field, and did yesterday's prompt change make anything worse? This project is the missing harness: a corpus with exact ground truth, a scorer that measures each field honestly, and a gate that blocks a release when any metric regresses.

[Figure: architecture]

What is in the box

  • A synthetic corpus with truth by construction. 50 invoices across 5 layout families: a US office supplier, a bilingual German freight company (1.480,00 EUR, 14.05.2026), a minimal freelancer invoice with leading-zero numbers, an ALL-CAPS industrial invoice with planted distractors (a P.O. Box in the remit address, a quote number in the notes), and six one-off documents that follow no house layout at all, including a French facture and an Australian tax invoice in AUD. Every document is rendered from structured truth, so the golden set is exact: amounts are computed, then printed, then recorded. No hand labelling, no labelling errors.
  • Two extractors behind one interface. A regex baseline written in good faith against the four house layouts, and Claude with a prompt derived from the same schema the scorer reads. Swapping extractors changes one flag, which is the point: the harness outlives any single approach.
  • A normalizer, because raw string comparison lies. "May 14, 2026" and "2026-05-14" are the same date; "$1,308.13" and "1308.13" are the same amount; "3.522,40 EUR" is not 3.52. Golden and predicted values both pass through the same canonicalization before comparison, so the scores measure extraction, not formatting taste.
  • Field-level scoring. Accuracy per field, plus precision and recall on nullable fields (due date, PO number, tax), where an extractor can fail by inventing a value that is not on the page. A wrong value counts as both a false positive and a false negative: it invented something and it missed the truth.
  • A release gate with an exit code. Candidate vs baseline, metric by metric. Any drop beyond the threshold blocks the release. It cannot be gamed by shrinking the eval set or deleting a metric: both fail outright.
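The counting rule for nullable fields can be sketched as follows. This is a minimal illustration, not the repository's scorer: the name score_nullable and the list-based inputs are assumptions, values are taken to be already normalized, and None stands for "not on the page". Reporting 1.0 when a denominator is zero is also an assumption.

```python
from typing import Optional


def score_nullable(golden: list[Optional[str]],
                   predicted: list[Optional[str]]) -> dict[str, float]:
    """Accuracy, precision and recall for one nullable field (hypothetical helper)."""
    if len(golden) != len(predicted):
        raise ValueError("golden and predicted must have the same length")
    tp = fp = fn = correct = 0
    for want, got in zip(golden, predicted):
        if want == got:
            correct += 1
            if want is not None:
                tp += 1          # a real value, recovered exactly
        else:
            if got is not None:
                fp += 1          # a value was reported that is not the truth
            if want is not None:
                fn += 1          # the true value was not recovered
    return {
        "accuracy": correct / len(golden),
        # Assumption: with nothing predicted (or nothing to find), report 1.0.
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 1.0,
    }


# Four documents: one hit, one correct null, one missed value, one wrong value.
golden = ["2026-05-05", None, "2026-06-01", "2026-07-01"]
predicted = ["2026-05-05", None, None, "2026-07-02"]
scores = score_nullable(golden, predicted)
print(scores)  # accuracy 0.5, precision 0.5, recall 1/3
```

The wrong value in the last document lowers both precision and recall, which is the double counting described above; the missed value in the third document lowers recall only.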

Quickstart (no API key needed)

python data/generate.py # regenerate the corpus (already committed)
python main.py run --extractor rules

Real output of that command on the committed corpus:

extractor: rules documents: 50
documents with every field correct: 45/50 (0.900)
field            accuracy  precision  recall
invoice_number   0.960     -          -
invoice_date     0.960     -          -
due_date         0.980     1.000      0.964
vendor_name      1.000     -          -
po_number        0.980     1.000      0.967
currency         0.980     -          -
tax              0.980     1.000      0.972
total            0.960     -          -
layout breakdown (share of field comparisons correct)
 acme 1.000 ##############################
 brightside 1.000 ##############################
 meridian 1.000 ##############################
 nordwind 1.000 ##############################
 oneoff 0.792 ########################
failures (10 of 10 shown)
 inv-007 oneoff invoice_number expected 'f-2026-83', got None
 inv-007 oneoff invoice_date expected '2026-04-05', got '2026-05-04'
 inv-007 oneoff due_date expected '2026-05-05', got None
 inv-007 oneoff po_number expected 'ar-7741', got None
 inv-007 oneoff tax expected '450.00', got None
 inv-007 oneoff total expected '2700.00', got '700.00'
 inv-025 oneoff currency expected 'AUD', got 'USD'
 inv-032 oneoff total expected '4800.00', got None
 inv-033 oneoff invoice_date expected '2026-06-02', got None
 inv-047 oneoff invoice_number expected '88301124', got None

To run the LLM extractor instead, set your key with export ANTHROPIC_API_KEY=... and run python main.py run --extractor claude. Then run python main.py gate runs/claude.json to compare the result against the committed rules baseline.

The finding: the generalization gap, measured

The rules baseline is not a strawman. On the four layouts it was written against it scores a clean 1.000, which is exactly how these systems look in the demo: you write rules against the documents you have, they pass, you ship. Then the fifth layout arrives.

Every line in that failures list is a classic:

  • inv-007 is a French facture. 05/04/2026 reads as May 4 under the US convention and April 5 under the French one, and nothing on the line says which; the rules guess wrong. Its total, 2 700,00, uses a space as the thousands separator, so the naive token regex captures 700,00 and reports a total of 700.00 with full confidence. A number that parses cleanly and is silently wrong by 2,000 is a much worse failure than a crash.
  • inv-025 charges in A$. The symbol contains a dollar sign, the rules map dollar signs to USD, and the currency field is quietly wrong while every amount on the invoice is right.
  • inv-032 says "Amount payable" where every house layout says some variant of "total". No label match, no total.
  • inv-033 is a handwritten-style contractor invoice whose date has no label at all. Rules need labels.
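The thousands-separator failure on inv-007 is easy to reproduce in isolation. The sketch below is illustrative only: the patterns NAIVE and AMOUNT, the helper parse_amount, and the "Total TTC" label are assumptions, not the repository's rules or normalizer, and the parser assumes amounts are printed with exactly two decimals or none.

```python
import re
from decimal import Decimal

# Naive token pattern: digits, one separator, two decimals.
NAIVE = re.compile(r"\d+[.,]\d{2}")

# Hypothetical wider pattern: ".", ",", a space or a no-break space may sit
# between three-digit groups, followed by an optional two-digit decimal part.
AMOUNT = re.compile(
    r"(?<!\d)(?:\d{1,3}(?:[ \u00a0\u202f.,]\d{3})+|\d+)(?:[.,]\d{2})?(?!\d)"
)


def parse_amount(token: str) -> Decimal:
    """Canonicalize a captured amount, assuming two decimals or none."""
    kept = re.sub(r"[^\d.,]", "", token)  # drops spaces, symbols, currency codes
    whole, cents = kept, "00"
    if len(kept) >= 3 and kept[-3] in ".,":
        whole, cents = kept[:-3], kept[-2:]
    whole = whole.replace(".", "").replace(",", "")
    if not (whole.isdigit() and cents.isdigit()):
        raise ValueError(f"not an amount: {token!r}")
    return Decimal(f"{whole}.{cents}")


line = "Total TTC 2 700,00"                 # hypothetical line from a facture
naive = NAIVE.search(line).group()          # "700,00": parses cleanly, wrong
wide = AMOUNT.search(line).group()          # "2 700,00"
print(parse_amount(naive), parse_amount(wide))  # 700.00 2700.00
```

Allowing spaces inside a number has a cost: a quantity printed directly before an amount could be absorbed into it, so a pattern like this needs its own test cases before it replaces the naive one.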

Two more things the numbers show. Per-field accuracy never falls below 0.96, yet 10% of documents carry at least one error: per-field averages and document-level correctness are different questions, and an invoice processor pays the document-level price. And the corpus regenerates from a seed, so all of this is reproducible byte for byte; one test asserts the gap is still visible, because if it ever closes, the corpus has stopped doing its job.
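The gap between the two views can be recomputed directly from the failures list in the report above. This is a small arithmetic check rather than project code; the variable names are illustrative, and the (document, field) pairs are copied from the output shown earlier.

```python
from collections import Counter

N_DOCS = 50          # documents in the corpus
N_FIELDS = 8         # fields in the schema
N_ONEOFF = 6         # one-off documents

# (document, field) pairs copied from the failures list in the report.
FAILURES = [
    ("inv-007", "invoice_number"), ("inv-007", "invoice_date"),
    ("inv-007", "due_date"), ("inv-007", "po_number"),
    ("inv-007", "tax"), ("inv-007", "total"),
    ("inv-025", "currency"), ("inv-032", "total"),
    ("inv-033", "invoice_date"), ("inv-047", "invoice_number"),
]

errors_per_field = Counter(field for _, field in FAILURES)
worst_field_accuracy = min(
    (N_DOCS - n) / N_DOCS for n in errors_per_field.values()
)

docs_with_error = {doc for doc, _ in FAILURES}
perfect_rate = (N_DOCS - len(docs_with_error)) / N_DOCS

# All ten failures are in the one-off family: 6 documents x 8 fields.
oneoff_share = (N_ONEOFF * N_FIELDS - len(FAILURES)) / (N_ONEOFF * N_FIELDS)

print(worst_field_accuracy)     # 0.96
print(perfect_rate)             # 0.9
print(round(oneoff_share, 3))   # 0.792
```

Ten wrong cells spread over five documents leave every field at 0.96 or better while one document in ten is wrong somewhere, which is the distinction the paragraph above draws.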

The release gate

python main.py gate runs/claude.json --baseline data/baseline.json

The gate compares every metric and fails loudly. Here it is blocking a candidate (a simulated regression, edited from a real run file) that improved the headline number while quietly losing the due-date field:

baseline: rules (perfect rate 0.900)
candidate: claude-new-prompt (simulated) (perfect rate 0.920)
checked 15 metric(s) against the baseline
GATE FAILED: 2 regression(s)
 - due_date accuracy dropped 0.980 -> 0.900 (max allowed drop 0.02)
 - due_date recall dropped 0.964 -> 0.850 (max allowed drop 0.02)

That is the whole argument for gating on fields instead of averages: this candidate looks like a win in any summary dashboard. The gate exits with code 0 to release and 1 to block, so in CI it is two commands:

python main.py run --extractor claude --out runs/candidate.json
python main.py gate runs/candidate.json # fails the build on regression

When a candidate legitimately improves, its run file becomes the new data/baseline.json, and the ratchet only turns one way.
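The comparison logic of such a gate fits in a few lines. The sketch below is a guess at the shape, not the repository's implementation: the run-file layout (a "documents" count and a flat "metrics" mapping), the function name gate and the metric key names are all assumptions. The numbers are the ones from the simulated regression above.

```python
def gate(baseline: dict, candidate: dict, max_drop: float = 0.02) -> list[str]:
    """Return one message per problem; an empty list means release (hypothetical)."""
    problems = []
    # A smaller eval set fails outright, whatever the scores say.
    if candidate["documents"] < baseline["documents"]:
        problems.append(
            f"eval set shrank {baseline['documents']} -> {candidate['documents']}"
        )
    for name, base in baseline["metrics"].items():
        # A metric that disappears fails outright too.
        if name not in candidate["metrics"]:
            problems.append(f"{name} missing from candidate")
            continue
        cand = candidate["metrics"][name]
        # Small tolerance so a drop of exactly max_drop survives float rounding.
        if base - cand > max_drop + 1e-9:
            problems.append(
                f"{name} dropped {base:.3f} -> {cand:.3f} "
                f"(max allowed drop {max_drop})"
            )
    return problems


baseline = {"documents": 50, "metrics": {
    "perfect_rate": 0.900, "due_date accuracy": 0.980, "due_date recall": 0.964}}
candidate = {"documents": 50, "metrics": {
    "perfect_rate": 0.920, "due_date accuracy": 0.900, "due_date recall": 0.850}}

problems = gate(baseline, candidate)
exit_code = 1 if problems else 0   # a CLI wrapper would pass this to sys.exit
for message in problems:
    print(" -", message)
```

Iterating over the baseline's metrics, rather than the candidate's, is what makes a deleted metric visible: the candidate cannot pass by reporting less.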

The schema

| field | type | nullable | the catch |
|---|---|---|---|
| invoice_number | text | no | leading zeros must survive ("0048", not 48) |
| invoice_date | date | no | four formats, two day/month conventions |
| due_date | date | yes | "Net 30" is not a due date; null unless printed |
| vendor_name | text | no | the issuer, not the customer on the "For:" line |
| po_number | text | yes | a P.O. Box and a quote number sit nearby as bait |
| currency | currency | no | A$ is not USD |
| tax | money | yes | "TAX: EXEMPT" means null, not 0.00 |
| total | money | no | three thousands-separator conventions |

The schema lives in one list (app/schema.py). The generator writes golden values for it, the LLM prompt is built from its descriptions, the normalizer knows its types and the scorer knows which fields are nullable. Add a field there and the whole pipeline follows.
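One way a single schema list can drive both the prompt and the scorer is sketched below. The actual contents of app/schema.py are not shown here, so everything in this block is an assumption: the FieldSpec dataclass, the SCHEMA name, the prompt wording and both helper functions are hypothetical. Only the field names, types, nullability and descriptions come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One schema entry (hypothetical shape)."""
    name: str
    type: str
    nullable: bool
    description: str


SCHEMA = [
    FieldSpec("invoice_number", "text", False, 'leading zeros must survive ("0048", not 48)'),
    FieldSpec("invoice_date", "date", False, "four formats, two day/month conventions"),
    FieldSpec("due_date", "date", True, '"Net 30" is not a due date; null unless printed'),
    FieldSpec("vendor_name", "text", False, 'the issuer, not the customer on the "For:" line'),
    FieldSpec("po_number", "text", True, "a P.O. Box and a quote number sit nearby as bait"),
    FieldSpec("currency", "currency", False, "A$ is not USD"),
    FieldSpec("tax", "money", True, '"TAX: EXEMPT" means null, not 0.00'),
    FieldSpec("total", "money", False, "three thousands-separator conventions"),
]


def build_prompt(schema: list[FieldSpec]) -> str:
    """Derive the extraction instructions from the schema descriptions."""
    lines = ["Extract these fields from the invoice text and return JSON:"]
    for spec in schema:
        rule = "null if not printed" if spec.nullable else "required"
        lines.append(f"- {spec.name} ({spec.type}, {rule}): {spec.description}")
    return "\n".join(lines)


def nullable_fields(schema: list[FieldSpec]) -> list[str]:
    """Fields that get precision and recall in addition to accuracy."""
    return [spec.name for spec in schema if spec.nullable]


print(nullable_fields(SCHEMA))  # ['due_date', 'po_number', 'tax']
print(build_prompt(SCHEMA))
```

With this shape, adding a field means appending one entry: the prompt gains a line and the scorer learns whether to compute precision and recall for it, without either being edited by hand.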

Pointing it at your documents

The corpus is synthetic so the repo is self-contained and the truth is exact. For a real engagement the harness stays and the data changes: drop your documents in data/docs/, write data/golden.json for a labelled sample (a few dozen documents is enough to start), adjust the schema list, and the runner, scorer, report and gate work unchanged. PDFs and scans enter through whatever text layer you already have; the eval does not care how the text was obtained.

Tests

python -m pytest

39 tests, no API key, no network: the corpus generator is checked for determinism, the normalizer policies are pinned case by case, the scorer arithmetic is verified against hand-computed fixtures, the gate is shown to block each kind of regression and to reject a shrunken eval set, and the LLM pipeline runs end to end against a scripted model, including a scripted API failure that is scored as a miss instead of crashing the run.

Scope notes

This evaluates text extraction. OCR quality, table structure recovery and layout detection are separate problems that sit upstream; the honest claim here is: given the text of a document, this measures how well fields come out and refuses to let them get worse.
