LongExtractionBench

Seven production extraction systems evaluated on the same 225 long, content-rich documents that stress extraction systems.

1. Executive Summary

We evaluated seven production extraction systems on the same 225 documents: four dedicated document-extraction platforms (Reducto Deep Extract, Extend - MAX, LlamaExtract - Agentic, Datalab Extract - Balanced) plus three frontier LLMs called directly (GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro). The corpus is deliberately hard: documents average 358 pages and roughly 88,700 ground-truth fields each. Every system was run in its strongest available configuration: the frontier models were run with maximum thinking/reasoning enabled across the board, and Reducto, Extend, LlamaExtract, and Datalab Extract were run in their highest-accuracy extraction modes.

The headline is not a single accuracy number; it is a gap in who can finish the job at all. Results are reported along four separate dimensions, kept deliberately distinct: success performance (accuracy on completed documents), failure metrics (accepted then could not finish), incompatibility (refused up front as unsupported in kind, independent of size), and latency.

What we found

  1. Recall is the great separator. Precision and leaf accuracy cluster high across systems; recall ranges from 33.8% to 99.6% and tracks completeness, exactly what long, dense documents stress.
  2. Direct frontier LLM baselines had substantially lower completion rates on long documents. Gemini 3.1 Pro and Claude Opus 4.8 completed only 112 and 116 of 225 documents. Their high accuracy figures (96.2% and 91.7% leaf) are a conditional metric, true only on the short documents they managed to finish.
  3. The strongest dedicated platforms beat raw frontier models on robustness and completeness. Reducto, Extend, and LlamaExtract finish more of the corpus and achieve higher recall than any frontier model.

2. Why This Benchmark Exists

Much of the highest-value work in enterprise data extraction lives in long, content-rich, table-heavy documents: statistical releases, census tabulations, financial and regulatory filings, healthcare datasets, scientific reference tables. These documents routinely run from hundreds to thousands of pages and pack tens of thousands of individual values into dense, repeating tables, and the numbers locked inside them feed real downstream workflows, from financial models to regulatory reporting to operational decisions. Extracting them is a substantial challenge: the length alone strains context windows and output limits, tabular structure has to be held together across page breaks and shifting layouts, and dropped or misread rows can quietly corrupt an entire analysis. This is exactly the environment where extraction systems degrade or fail outright, and handling it is the capability this benchmark measures.

The question this benchmark answers is simple: which platform performs best across accuracy, robustness, and latency when the documents are long and dense, rather than on a curated set of simple ones? We measure not just "how accurate is it when it works," but "how often does it work at all, and how does it fail when it doesn't."

3. Independence & Governance

Governance principle | Description
Roles | Reducto commissioned this benchmark. micro1 sourced the document set. Reducto created the methodology for drafting ground truth and for running and grading the models. micro1 employed human annotation reconciliation and verification on the ground-truth data, conducted independent technical diligence on the benchmark methodology, and publishes these results. Reducto did not add, remove, or modify any documents in the set to obtain favorable results.
No extraction vendor touches the ground truth. | The answer key is drafted by frontier models reading the raw document directly, then finalized by human review. No document-extraction product is invoked anywhere in ground-truth creation, and no parsed text, OCR, or layout from any extraction system is ever fed in.
Failures are recorded, not hidden. | Every document in the benchmark set is run with every system. When a system fails or refuses a document, that outcome is recorded as a failure or incompatibility. A system's success-accuracy is always reported alongside its completion rate so the two can never be confused.

4. How Ground Truth Was Built

Ground truth was created independently from the raw documents, without any extraction vendor or product in the loop, and was human-reviewed and reconciled before acceptance.

The candidate labels were drafted by frontier models reading each source document directly: GPT-5.5 and Claude Opus 4.7. No document-extraction product, and no OCR, parsed text, or layout from any extraction system, was part of the process. Human annotators then reconciled the disagreements between the two models and additionally reviewed a sample of the labels the models agreed on, confirming that agreement reflected correctness rather than a shared mistake, before the result was accepted as ground truth.
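The reconciliation flow described above can be sketched as follows. This is a minimal illustration, assuming each model's draft is a flat mapping from a field path to a value; the function name `triage_labels`, the `sample_rate` parameter, and the field-path format are hypothetical and are not taken from the benchmark's harness.

```python
import random

def triage_labels(draft_a, draft_b, sample_rate=0.1, seed=0):
    """Split two model drafts into a reconciliation queue and an agreement audit sample.

    draft_a, draft_b: dict mapping a field path (str) to the drafted value.
    Returns (disagreements, audit_sample, agreed).
    """
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    disagreements = []  # fields the drafts disagree on: always sent to a human
    agreed = {}         # fields where both drafts give the same value
    for path in sorted(set(draft_a) | set(draft_b)):
        a, b = draft_a.get(path), draft_b.get(path)
        if path in draft_a and path in draft_b and a == b:
            agreed[path] = a
        else:
            disagreements.append((path, a, b))
    # A sample of agreed fields is also reviewed, to catch shared mistakes.
    k = min(len(agreed), max(1, round(len(agreed) * sample_rate))) if agreed else 0
    audit_sample = rng.sample(sorted(agreed), k)
    return disagreements, audit_sample, agreed
```

A field present in only one draft is treated as a disagreement here; that choice is an assumption of the sketch.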

5. The Corpus

Documents: 225
Pages / document: 358
GT fields / document: ≈88,700
Schema fields / document: ≈44

The dataset is a mixture of short releases and documents in the thousands of pages. Density is the defining property: a few dozen schema fields expand into tens of thousands of ground-truth values per document because those fields repeat across long tables. The corpus spans government and public-sector statistics, census and demographic tabulations, labor series, financial filings and asset-backed-securities reports, healthcare and Medicare datasets, regulatory/permit filings, and scientific reference tables; it is predominantly English-language public documents.

6. How We Score

The grader is deterministic: no LLM, no fuzzy matching beyond an explicit cosmetic normalizer. Each document yields three numbers in [0, 100]:

Precision

Of the array rows a system returned, the fraction that match a real ground-truth row, paired by the key the grader infers for that array rather than by position. Penalizes hallucinated, duplicated, or extra rows.

Recall

Of the ground-truth array rows, the fraction the system returned and correctly matched. Penalizes missed rows and near-duplicates that fail to match.

Leaf accuracy

Of the ground-truth leaf values a system could be scored on (i.e. those on correctly matched rows), the fraction that match after cosmetic normalization. This measures cell-level correctness given that a row was returned.
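As a rough illustration of the three definitions above, the sketch below scores a single array given a known key field. The `normalize` rules (trim, lowercase, drop thousands separators) are an assumption, since the grader's actual cosmetic normalizer is not listed here, and `score_array` is a hypothetical name.

```python
def normalize(value):
    """Hypothetical cosmetic normalizer: trim, lowercase, drop thousands separators."""
    return str(value).strip().lower().replace(",", "")

def score_array(predicted, truth, key):
    """Return (precision, recall, leaf_accuracy) in [0, 100] for one array of row dicts."""
    truth_by_key = {normalize(row[key]): row for row in truth}
    matched = {}  # normalized key -> predicted row; the first prediction per key wins
    for row in predicted:
        k = normalize(row.get(key, ""))
        if k in truth_by_key and k not in matched:
            matched[k] = row  # extra or duplicate rows stay unmatched and cost precision
    precision = 100.0 * len(matched) / len(predicted) if predicted else 0.0
    recall = 100.0 * len(matched) / len(truth) if truth else 0.0
    # Leaf accuracy looks only at ground-truth values on matched rows.
    total = correct = 0
    for k, pred_row in matched.items():
        for field, gt_value in truth_by_key[k].items():
            total += 1
            if field in pred_row and normalize(pred_row[field]) == normalize(gt_value):
                correct += 1
    leaf = 100.0 * correct / total if total else 0.0
    return precision, recall, leaf

# Made-up rows: one duplicate, one hallucinated row, one missed row, one wrong cell.
truth = [{"id": "A-1", "amount": "1,200"}, {"id": "A-2", "amount": "350"}, {"id": "A-3", "amount": "75"}]
predicted = [{"id": "a-1", "amount": "1200"}, {"id": "A-2", "amount": "305"},
             {"id": "A-2", "amount": "350"}, {"id": "Z-9", "amount": "1"}]
precision, recall, leaf = score_array(predicted, truth, key="id")
```

In this toy case two of four returned rows match (precision 50), two of three ground-truth rows are recovered (recall about 66.7), and three of the four leaf values on matched rows are correct (leaf accuracy 75). It shows how a system can post high leaf accuracy while missing rows.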

Row matching is key-based, never positional. For each ground-truth array the grader infers the field(s) that most uniquely identify a row, then matches predicted rows to ground-truth rows by that key. Documents are scored individually and then averaged with equal weight.
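One way to read "the field(s) that most uniquely identify a row" is to pick the field with the most distinct values across the ground-truth rows. The sketch below does that for single-field keys only and then averages per-document scores with equal weight. `infer_key` and `macro_average` are hypothetical names, and the grader's real inference rule, including how it handles composite keys and ties, is not specified here.

```python
def infer_key(truth_rows):
    """Pick the single field whose values are most distinct across ground-truth rows.

    Assumption: single-field keys only; ties go to the alphabetically first field.
    """
    fields = sorted({f for row in truth_rows for f in row})
    if not fields:
        raise ValueError("cannot infer a key from rows with no fields")

    def distinct_count(field):
        return len({str(row.get(field)) for row in truth_rows if field in row})

    return max(fields, key=distinct_count)  # max() keeps the first field on ties

def macro_average(per_document_scores):
    """Average per-document scores with equal weight, regardless of document size."""
    if not per_document_scores:
        raise ValueError("no documents to average")
    return sum(per_document_scores) / len(per_document_scores)
```

Equal weighting means a short release counts as much as a document of several thousand pages, so a system cannot raise its average simply by doing well on the largest tables.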

7. Results: The Four Dimensions

7.1 Coverage: who finished the job

Coverage spreads across a wide range rather than splitting into neat tiers: one full-coverage system (Reducto), a cluster that finishes in the high 80s to low 90s (Extend, LlamaExtract, and GPT-5.5), Datalab Extract (Balanced) at 74%, and the two lowest-coverage frontier models at roughly 50%. The spread is the whole story: the documents the lower-coverage systems drop are the hard ones, so every accuracy number below must be read against each system's completion count.
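The coverage percentages quoted above follow directly from the completed-document counts in the results table in 7.2. A small arithmetic check, with the counts copied from that table and illustrative variable names:

```python
CORPUS_SIZE = 225

# Completed-document counts as reported in the success-performance table (section 7.2).
completed = {
    "Reducto Deep Extract": 225,
    "Extend - MAX": 210,
    "LlamaExtract - Agentic": 203,
    "GPT-5.5": 198,
    "Datalab Extract - Balanced": 166,
    "Claude Opus 4.8": 116,
    "Gemini 3.1 Pro": 112,
}

# Coverage is simply completed / corpus size, expressed as a percentage.
coverage = {name: round(100 * n / CORPUS_SIZE, 1) for name, n in completed.items()}

for name, pct in coverage.items():
    print(f"{name:28s} {pct:5.1f}%")
```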

7.2 Success performance: accuracy on completed documents

These figures describe quality given that the system returned a result. They do not penalize non-completion and are only interpretable jointly with coverage above.

System | Completed (of 225) | Precision | Recall | Leaf accuracy
Reducto Deep Extract | 225 | 99.6% | 99.6% | 99.3%
Extend - MAX | 210 | 86.4% | 92.7% | 92.8%
LlamaExtract - Agentic | 203 | 80.0% | 77.5% | 88.9%
GPT-5.5 | 198 | 95.8% | 52.7% | 96.2%
Datalab Extract - Balanced | 166 | 92.8% | 33.8% | 90.9%
Claude Opus 4.8 | 116 | 92.0% | 70.7% | 91.7%
Gemini 3.1 Pro | 112 | 95.8% | 48.6% | 96.2%

Apples-to-apples: the common-completed subset

To remove the coverage confound, all seven systems are re-scored on only the documents every system completed, the 61-document intersection. Caveat: this subset consists primarily of the easy/short documents the weakest systems could finish, so it understates the gap on hard documents.
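A minimal sketch of how such a common-completed subset could be built, assuming each system's completed documents are available as a set of document IDs. The function names and the ID format are hypothetical.

```python
from functools import reduce

def common_completed(completed_by_system):
    """Return the document IDs that every system completed (set intersection)."""
    if not completed_by_system:
        return set()
    return reduce(set.intersection, (set(ids) for ids in completed_by_system.values()))

def rescore_on_subset(scores_by_doc, subset):
    """Equal-weight mean of one system's per-document scores, restricted to the subset."""
    kept = [scores_by_doc[doc] for doc in sorted(subset)]
    if not kept:
        raise ValueError("empty subset")
    return sum(kept) / len(kept)
```

Because the intersection can only shrink as systems are added, the weakest system effectively decides which documents survive, which is why the subset skews toward easy documents.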

System | Precision | Recall | Leaf accuracy
Reducto Deep Extract | 99.0% | 99.6% | 98.7%
Extend - MAX | 85.9% | 98.4% | 97.2%
LlamaExtract - Agentic | 75.8% | 89.2% | 91.0%
GPT-5.5 | 95.4% | 77.8% | 98.8%
Datalab Extract - Balanced | 96.5% | 64.4% | 95.2%
Claude Opus 4.8 | 92.3% | 73.2% | 91.8%
Gemini 3.1 Pro | 98.1% | 77.2% | 96.6%

Even on this easier subset, frontier-model recall sits at 73–78%, so some rows go unextracted on documents these models complete. Datalab Extract (Balanced) is lower at 64.4%. Reducto still leads recall (99.6%), and leaf accuracy is effectively a tie at the top (GPT-5.5 98.8%, Reducto 98.7%). The gap narrows on easier documents without fully closing - the main differentiator is robustness on hard documents rather than cell-level accuracy on easy ones.

7.3 Failure metrics: defeated by the document, not refused for kind

A failure is a document a system could not complete for capacity or operational reasons: output truncation, input-context overflow, or timeout.

System | Failures | Failure rate
Reducto Deep Extract | 0 | 0.0%
Extend - MAX | 8 | 3.6%
LlamaExtract - Agentic | 22 | 9.8%
GPT-5.5 | 27 | 12.0%
Datalab Extract - Balanced | 59 | 26.2%
Gemini 3.1 Pro | 81 | 36.0%
Claude Opus 4.8 | 109 | 48.4%

The frontier LLMs break far more often than the strongest dedicated platforms (GPT-5.5 on 12% of the corpus, Gemini on 36%, Claude on nearly one document in two at 48%), and the mode of failure differs by model. GPT-5.5's failures are exclusively input-side: the document plus schema overruns its context window before generation begins. Claude and Gemini mostly break mid-generation, cut off by their output-token ceilings. The dedicated platforms split: Extend and LlamaExtract break only on a small slice of the longest documents (3.6% and 9.8%) and Reducto had no recorded failures in this run, while Datalab Extract (Balanced) failed on 26.2% of the corpus.

7.4 Incompatibility: refused as unsupported in kind

An incompatibility is a document a system refused up front as unsupported in kind: a malformed or non-representable schema/request rejected before processing (e.g. INVALID_ARGUMENT), independent of document size. It is conceptually distinct from a failure and we never merge the two. Crucially, a size-driven up-front rejection (context-window overflow, page cap) is a failure, not an incompatibility: the identical request on a smaller document would have been accepted.
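The distinction can be sketched as a small decision rule. `RunOutcome`, its fields, and the reason strings are hypothetical stand-ins, since real provider error codes differ. Only the rule itself comes from the text: rejections that are independent of size are incompatibilities, and capacity or operational problems, including size-driven up-front rejections, are failures.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reason codes for rejections that do not depend on document size.
KIND_DRIVEN = {"invalid_schema", "unsupported_request"}

@dataclass
class RunOutcome:
    completed: bool
    reason: Optional[str] = None  # set only when the run did not complete

def classify(outcome: RunOutcome) -> str:
    """Map one run to 'success', 'failure', or 'incompatibility'."""
    if outcome.completed:
        return "success"
    if outcome.reason in KIND_DRIVEN:
        return "incompatibility"
    # Truncation, context overflow, page caps, and timeouts are failures.
    # Assumption of this sketch: unrecognized reasons are also counted as failures.
    return "failure"
```

For example, `classify(RunOutcome(False, "page_cap"))` returns `"failure"` even though the rejection happens before processing, because the same request on a smaller document would have been accepted.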

System | Incompatibilities | Incompatibility rate
Reducto Deep Extract | 0 | 0.0%
LlamaExtract - Agentic | 0 | 0.0%
Claude Opus 4.8 | 0 | 0.0%
GPT-5.5 | 0 | 0.0%
Datalab Extract - Balanced | 0 | 0.0%
Extend - MAX | 7 | 3.1%
Gemini 3.1 Pro | 32 | 14.2%

Incompatibility is concentrated almost entirely in Gemini: 32 documents (about one in seven) are rejected up front even though they are well within its size limits, which makes these genuine schema/request rejections rather than size problems. Extend refuses 7 on non-retryable schema grounds. Every other system refuses nothing up front.

7.5 Failure modes by category

Every document a system accepted and then could not finish, broken out by failure mode. These are failures only: up-front refusals (schema / request rejected) are incompatibilities, not failures, and are shown separately under incompatibility (7.4 above). Each row sums to that system's total failures.

System | Output truncation | Input exceeds context window | Other (timeout / server)
Reducto Deep Extract | 0 | 0 | 0
Extend - MAX | 0 | 0 | 8
LlamaExtract - Agentic | 0 | 18 | 4
GPT-5.5 | 0 | 27 | 0
Datalab Extract - Balanced | 0 | 2 | 57
Claude Opus 4.8 | 67 | 41 | 1
Gemini 3.1 Pro | 66 | 15 | 0

The single dominant mode across frontier models is output truncation (67 documents for Claude, 66 for Gemini): asked to emit a very large structured object, the model is cut off by its output-token ceiling. These are capability limits on long documents, not accuracy mistakes, and they hit the same documents a dedicated platform processes without incident. "Input exceeds context window" bundles both token-context overflow and hard page caps (LlamaExtract and Gemini at 1,000 pages, Claude at 600): both are the same failure at root, the input document being too large for the model to take in before generation begins.

Datalab Extract (Balanced) fails differently from other systems. Its 59 failures are almost all in the rightmost column: 28 documents whose schema was too complex for the Balanced tier (a configuration/capability limit of the mode tested, not document length), 21 server-side pipeline errors, and 8 timeouts. Only 2 failures were size-driven page caps (documents beyond 7,000 pages).
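The breakdown above can be checked against the failure-mode table: the three sub-causes should add up to the "Other" column, and adding the page-cap failures should give the system's total. The figures are copied from the paragraph above; the dictionary keys are illustrative labels.

```python
# Datalab Extract (Balanced) failure breakdown, as stated in the text.
other_causes = {
    "schema_too_complex_for_tier": 28,
    "server_side_pipeline_error": 21,
    "timeout": 8,
}
page_cap_failures = 2  # counted under "Input exceeds context window"

other_total = sum(other_causes.values())
total_failures = other_total + page_cap_failures

print(other_total, total_failures)  # 57 59
```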

7.6 Latency

System | Avg pages | n | Mean | p50 | p90 | p95
Reducto Deep Extract | 367 | 225 | 540s | 306s | 1,440s | 1,969s
Extend - MAX | 309 | 210 | 738s | 276s | 1,859s | 3,685s
LlamaExtract - Agentic | 112 | 203 | 524s | 270s | 1,241s | 2,079s
GPT-5.5 | 115 | 198 | 327s | 314s | 498s | 590s
Datalab Extract - Balanced | 110 | 166 | 1,609s | 1,110s | 3,179s | 4,928s
Claude Opus 4.8 | 86 | 116 | 420s | 366s | 841s | 980s
Gemini 3.1 Pro | 121 | 112 | 187s | 187s | 289s | 358s

These distributions are heavy-tailed, so the mean and p50 diverge sharply. Read the median for a typical document and the p90/p95 for the worst case.

Latency is measured only on successful runs (n is each system's success count, and the Avg pages column is the average length of those documents). Failures and timeouts contribute no latency and are excluded, not capped: a failed run persists no latency value, and a timed-out run leaves no output to time. A system that gives up on a hard document therefore is not penalized by this metric.
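A minimal sketch of a latency summary computed this way, assuming each run is recorded as a (status, seconds) pair. The nearest-rank percentile is an assumption, since the percentile method used by the benchmark is not stated, and the function names are hypothetical.

```python
import math
from statistics import mean

def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def latency_summary(runs):
    """Summarize latency over successful runs only; failed runs are excluded, not capped.

    runs: iterable of (status, seconds) pairs; seconds may be None for failed runs.
    """
    ok = sorted(sec for status, sec in runs if status == "success")
    if not ok:
        return {"n": 0}
    return {
        "n": len(ok),
        "mean": mean(ok),
        "p50": nearest_rank(ok, 50),
        "p90": nearest_rank(ok, 90),
        "p95": nearest_rank(ok, 95),
    }
```

Since failed runs are filtered out before any statistic is taken, a system that drops its slowest documents ends up with a smaller `n` and a lower tail, which is the effect the paragraph above warns about.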

Because of this, latency must be read against coverage: no two systems are timed over the same workload. The highest-coverage systems are timed on far longer documents on average (Reducto 367 pages, Extend 309 pages) than any frontier model (86–121 pages), because the long documents that produce long latencies are exactly the ones the frontier models failed to complete. Gemini's low, tight latency is therefore not evidence of speed on hard documents. Datalab Extract (Balanced) is the slowest system by a wide margin (mean 1,609s, p95 4,928s) even though its completed documents average only 110 pages - so unlike Reducto and Extend, its latency is not explained by being handed longer documents.

8. Findings & Implications

  1. Recall is the greatest differentiator. Precision and leaf accuracy are high almost everywhere; recall (getting all the rows out of a dense document) is what separates the field.
  2. Completeness is a real challenge. Most systems are reasonably accurate on what they return. The benchmark is decided by how much of a dense corpus a system can return at all.
  3. In this benchmark, direct frontier LLM calls were less robust than dedicated extraction systems on long, dense documents. Called directly, they break on the defining property of this corpus (size) via truncation, context limits, and page caps, and in some cases reject the schema outright. Their strong accuracy on the documents they finish does not carry over to the longer documents.
  4. The strongest dedicated extraction platforms add real robustness over raw LLMs (Extend and LlamaExtract finish 90%+ of the corpus vs. ~50% for the lowest coverage frontier models).

9. Limitations

This benchmark aims to be a fair, reproducible comparison, but its scope and provenance carry caveats that should be read alongside the results:

  • Sponsorship. Reducto commissioned this benchmark. Reducto is also one of the systems under test and the top performer on these documents.
  • Methodology provenance. Reducto authored parts of the methodology and the ground truth generation harness.
  • Model-assisted ground truth. Ground truth was drafted with frontier models and then reconciled by hand; it was not written from scratch by humans for every document.
  • Label bias. AI-drafted labels can carry the biases of the models that produced them, including blind spots shared with the systems being graded - a model may be scored generously where it errs in the same way the labeler did.
  • Different classes of system. Direct frontier-LLM API calls are not equivalent to dedicated extraction platforms. This compares two different classes of system, with different interfaces, defaults, and intended use, on a single task formulation.
  • Scope. Results are specific to this corpus, schema setup, provider configurations, and run date. Provider models and services change over time; numbers may not reproduce on a later run.
  • Sampled label checks. Agreement-field sampling provides evidence that the labels are sound, not a mathematical guarantee that the full corpus is correct. Unsampled documents could contain label errors that this process would not surface.
  • Reproducibility. The evaluation harness, grading methodology, provider adapters, and benchmark implementation are available in the accompanying GitHub repository. Due to licensing and commercial restrictions, the full benchmark corpus is not publicly released. A representative subset of 50 benchmark tasks is provided to illustrate the methodology and enable inspection of the evaluation pipeline.