
Prospera: Tax Intelligence Benchmark

The standard for evaluating tax reasoning in AI systems

Introduction

Tax return preparation is one of the most demanding tests of an AI agent's ability to perform long-horizon, structured reasoning. A complete federal return requires reading dozens of source documents, applying hundreds of tax code rules across interconnected forms and schedules, and carrying computed values forward through a chain of dependent calculations. A tax return is a single end-to-end task where every intermediate result feeds into the next, and a small mistake early on can cascade through the entire filing.

This dataset is made up of tasks authored by domain experts, including tax professionals and CPAs. Every task presents a complete taxpayer scenario with realistic W-2s, 1099s, investment statements, and life events. The agent is given the full set of source documents and must produce a complete Form 1040 along with all required supporting schedules and forms. There are no hints about which forms to file or which calculations to perform. The agent must determine the correct filing strategy entirely on its own.

Each task is scored against 20 or more expert-authored criteria, where each criterion checks a specific tax computation on a specific line of a specific form. This granular scoring lets us see not just whether a model gets the final answer right, but exactly which computations it handles correctly and where it breaks down. We evaluated three frontier models with high-effort reasoning enabled: Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro.
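As an illustration of line-level scoring, a single criterion can be thought of as a (form, line, expected value) triple checked against the filed return. The field names and tolerance below are hypothetical, not the benchmark's actual schema:

```python
# Illustrative sketch of a line-level scoring criterion. The dictionary
# layout, field names, and tolerance are assumptions for illustration,
# not the benchmark's actual schema.

def check_criterion(filed_return, criterion, tolerance=1.0):
    """Pass if the filed value on the given form/line matches the
    expected value within a small rounding tolerance."""
    filed = filed_return.get(criterion["form"], {}).get(criterion["line"])
    if filed is None:
        return False
    return abs(filed - criterion["expected"]) <= tolerance

# Example: a criterion checking that Schedule B line 4 totals the
# interest reported on the taxpayer's 1099-INTs.
filed = {"Schedule B": {"4": 1250.0}, "1040": {"1a": 84000.0}}
crit = {"form": "Schedule B", "line": "4", "expected": 1250.0}
print(check_criterion(filed, crit))  # True
```

Because each criterion targets one line, a run's reward can be reported as the fraction of criteria passed rather than a single pass/fail bit.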

Key Findings

GPT-5.4 leads performance

Highest mean reward (0.368) and pass@3 (28%) across all models.

Nearly half of criteria fail

44% of evaluation criteria remain unsolved across all models.

Speed varies significantly

Gemini is fastest (6.1m), while Opus is slowest (15.4m).

More reasoning = better results

GPT-5.4 uses more steps (~34 turns) to achieve higher accuracy.

Performance ceiling exists

No model exceeds a reward of 0.89 on any run.

Complexity breaks models

Failures increase sharply with multi-step and multi-form dependencies.
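To make the aggregate metrics above concrete, here is one common way mean reward and pass@k are computed from per-run scores. The per-task rewards and the full-pass condition (reward == 1.0) are invented for illustration; the benchmark's exact pass condition may differ:

```python
# Hypothetical per-task rewards over 3 runs each; the numbers are
# invented for illustration, not taken from the benchmark.
rewards = {
    "task_a": [0.45, 0.30, 0.50],
    "task_b": [1.00, 0.80, 0.95],
    "task_c": [0.10, 0.00, 0.20],
}

# Mean reward: average over every individual run.
all_runs = [r for runs in rewards.values() for r in runs]
mean_reward = sum(all_runs) / len(all_runs)

# pass@3: fraction of tasks where at least one of the 3 runs fully
# passes (here, a reward of 1.0; the real pass condition may differ).
pass_at_3 = sum(any(r == 1.0 for r in runs)
                for runs in rewards.values()) / len(rewards)

print(round(mean_reward, 3), round(pass_at_3, 2))  # 0.478 0.33
```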

The Error Cascade

All three models handle direct transcription comparatively well. Reading a wage amount from a W-2, pulling interest income from a 1099-INT, or identifying a filing status from the scenario description are tasks that models complete correctly roughly half the time. The challenge emerges when these transcribed values need to flow through calculations. Our criteria are authored at different stages of the tax computation pipeline, which lets us observe exactly where performance degrades as the reasoning chain grows longer.

We classify criteria into three stages. Stage 1 covers direct transcription, where the model reads a value from a source document and places it on the correct line of the correct form. At this stage, models pass about 50% of criteria. Stage 2 covers intermediate calculations like adjusted gross income, net business income, itemized deduction totals, and schedule summaries. Here performance drops to around 40%. Stage 3 covers final computed values: tax liability, refund amounts, credits, and penalties that depend on long chains of prior calculations. At this stage, pass rates fall to roughly 23%.

This waterfall pattern reveals something important. Models have gotten good enough to understand tax concepts and read source documents. The bottleneck is not comprehension but computation over long horizons. When a task requires chaining ten or more dependent calculations across multiple forms, the probability of getting every link right drops sharply. Each intermediate error compounds, making the final values increasingly unreliable.
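The compounding effect can be sketched with a simple independence assumption (real errors are correlated across steps, so this is only a first-order approximation): if each step succeeds with probability p, a chain of n dependent calculations is fully correct with probability p**n.

```python
# First-order sketch of error compounding under an independence
# assumption; the per-step accuracy is illustrative, not measured.
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability every link in a chain of n independent steps is correct."""
    return p_step ** n_steps

# Even a 90%-reliable step becomes unreliable over a ten-step chain.
print(round(chain_success(0.90, 10), 3))  # 0.349
```

This is why final-value criteria, which sit at the end of the longest chains, show the lowest pass rates.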

How models approach the problem

Examining the agent trajectories reveals three distinct strategies for tackling a complete tax return.

GPT-5.4

Uses an iterative approach, breaking problems into small steps. It averages 34 turns and ~16 bash calls to run calculations, verify results, and fetch IRS rules. This yields high single-run scores but is less reproducible, with results sensitive to early steps.

Claude Opus 4.6

Uses a depth first approach with extensive upfront reasoning (~16k characters). It generates a Python script to compute the entire return, averaging 9 turns and producing highly consistent results. The downside is that errors cannot be iteratively corrected.
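A stripped-down sketch of that single-script style, computing a simple return in one pass. The standard deduction, bracket boundary, and rates below are simplified placeholders, not asserted to be current tax parameters:

```python
# Simplified single-pass return computation in the style described above.
# The deduction amount, bracket boundary, and rates are placeholder
# values for illustration, not real tax parameters.

def compute_return(wages, interest, withholding,
                   standard_deduction=14600.0):
    agi = wages + interest                        # total income -> AGI (simplified)
    taxable = max(0.0, agi - standard_deduction)  # taxable income
    # Toy two-bracket schedule: 10% up to 11,600, then 12% above.
    if taxable <= 11600:
        tax = 0.10 * taxable
    else:
        tax = 0.10 * 11600 + 0.12 * (taxable - 11600)
    refund = withholding - tax                    # positive -> refund, negative -> owed
    return {"agi": agi, "taxable": taxable,
            "tax": round(tax, 2), "refund": round(refund, 2)}

result = compute_return(wages=52000.0, interest=400.0, withholding=4200.0)
print(result)
```

The appeal of this strategy is consistency: the whole dependency chain lives in one deterministic script. Its weakness, as noted above, is that a wrong intermediate value is baked in with no later turn to catch it.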

Gemini 3.1 Pro

Fastest model, finishing in about one-third the time of the slowest. It often skips computation, relying on internal reasoning instead of code. Averaging 18 turns per run, it handles simple returns well but struggles with complex, calculation-heavy tasks.

What This Means

State of the art models can read and extract information from complex scenarios with reasonable accuracy, but reliably chaining those extractions through long sequences of dependent computations remains an open challenge. In tax preparation, this manifests as a characteristic error cascade: models get the source data right about half the time, handle intermediate calculations less reliably, and produce correct final values only about one in five attempts. The gap between reading comprehension and sustained computational reasoning is where the most significant capability limitations lie.

This type of evaluation, with expert-authored criteria placed at each stage of a computation pipeline, provides a more granular signal than simple pass/fail benchmarks. It reveals not just whether a model gets the right answer, but exactly where in the reasoning chain it breaks down. For model developers, this means targeted improvements in arithmetic reliability and cross-form value propagation could yield disproportionate gains. For practitioners considering AI for tax preparation, the results suggest that current models are useful assistants for data extraction and simple returns, but are not yet reliable enough for complex filings without human review at every stage of the computation.