Realm Quantis: Financial reasoning benchmark

An evaluation of frontier models on finance reasoning and spreadsheet-grounded analysis

Introduction

Realm Quantis is a finance-reasoning benchmark built around the actual work product that practitioners deliver: IFRS reconciliation workbooks, systematic hedge-fund backtests, venture-capital term sheet analyses, and treasury cash-flow forecasts. Each of the 103 tasks is grounded in the same source materials a human analyst would open — named-range Excel workbooks, broker PDFs, earnings call transcripts, monetary-policy decisions — so that benchmark performance maps directly onto the question finance teams care about: can I trust this output?

The results reveal a clear and consequential split. Frontier models are genuinely capable in back- and middle-office functions — process-intensive work where the premium is on accurate extraction, consistent rule application, and well-structured reporting. They are materially weaker in the front-office functions that drive P&L: capital allocation, portfolio construction, and investment decision-making. That asymmetry matters because the cost of an error is not symmetric. A mistake in a reconciliation memo is caught in review; a mistake in a capital allocation recommendation has direct financial consequences. We ran every task three times against GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro — 927 rollouts in all — to surface both the capability gaps and the structural differences in how each model reasons through a finance problem.

Three findings shape the practical implications. The back and middle office are defensible today — models scoring 70–80% on treasury and operational finance tasks can materially accelerate workflows where accuracy and throughput matter most. The front office is a different story: models are not yet reliable as analysts on capital allocation questions, and should be treated as research accelerators rather than decision support systems. And the reasoning architecture matters more than the headline score — GPT-5.5, Opus, and Gemini post similar aggregate numbers through fundamentally different approaches, which will determine their robustness as task complexity increases.

Headline results

The three models score similarly — and none clears 50% on tasks that demand a judgment call

GPT-5.5 leads on mean reward, but the headline number for all three models is below 50% on rubrics graded against exact numeric criteria and signed recommendations. pass@3 gives the operational picture: the fraction of tasks where at least one of three independent rollouts clears half the rubric weight. Even the best model clears that bar less than half the time. For finance professionals evaluating whether to deploy these tools in judgment-heavy workflows, that is the number that matters — not average performance across a benchmark, but the probability that any single run of the model produces a usable answer on a hard task.
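
As a concrete reference, here is a minimal sketch of how the two headline metrics could be computed from per-rollout rewards. The record schema and field names are illustrative assumptions, not the benchmark's actual data format; the 0.5 pass bar follows the definition above.

```python
from collections import defaultdict

def summarize(rollouts, pass_threshold=0.5, k=3):
    """Mean reward and pass@k per model from per-rollout reward records.

    `rollouts` is assumed to be a list of dicts like
    {"task_id": str, "model": str, "reward": float in [0, 1]},
    with k independent rollouts per (model, task) pair.
    """
    by_pair = defaultdict(list)
    for r in rollouts:
        by_pair[(r["model"], r["task_id"])].append(r["reward"])

    stats = defaultdict(lambda: {"rewards": [], "passes": 0, "tasks": 0})
    for (model, _task), rewards in by_pair.items():
        s = stats[model]
        s["rewards"].extend(rewards)
        s["tasks"] += 1
        # pass@k: at least one of the k rollouts clears half the rubric weight
        if any(rw >= pass_threshold for rw in rewards[:k]):
            s["passes"] += 1

    return {
        model: {
            "mean_reward": sum(s["rewards"]) / len(s["rewards"]),
            f"pass@{k}": s["passes"] / s["tasks"],
        }
        for model, s in stats.items()
    }
```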

Finding 1 - A fundamental breakdown in decision-making

Models can extract and calculate, but cannot make autonomous decisions

Every finance deliverable chains together the same sequence: find the right number, apply the right transformation, project it forward, then take a position. Quantis makes this chain explicit by grading each rubric criterion into one of four capability buckets. The pattern in the data is not a gradual decline — it is a step change at the final link.

  • Extraction / reporting
    Reading the right value from a workbook or document, with correct units and sign. This is the highest-scoring bucket (~50% on average) because it is the most constrained and the most similar to tasks on which these models have been trained at scale.
  • Calculation
    Applying the correct transformation: growth rates, margins, WACC, Sharpe ratios, portfolio statistics (a short sketch of two of these follows this list). Roughly 10 percentage points lower than extraction, reflecting the accumulated error from choosing the right formula and applying it correctly to the right inputs.
  • Forecast / scenario
    Propagating assumptions through bull/base/bear or stress scenarios. Consistently the weakest of the analytical buckets, roughly equal to calculation overall but with higher variance. The issue is not computing a single scenario correctly; it is keeping assumptions consistent across multiple parallel paths while preserving internal coherence.
  • Decision / recommendation
    The final write-up committing to a buy, sell, allocate, or proceed call that follows from the analysis. This is the worst positive bucket at ~25–30%. It is critical to understand what this failure mode means: models are not simply wrong about the facts. On many of these tasks, the intermediate calculations are partially correct. The failure is that the conclusion does not follow from those calculations. A recommendation to proceed accompanies analysis that shows the deal doesn't clear the hurdle. A suggested allocation is inconsistent with the risk constraints the model itself computed. This is not error propagation — it is a structural inability to translate completed analysis into a coherent final judgment.
  • Guardrail avoidance
    Avoiding specific flagged-wrong values (the "negative criteria" in each rubric). At ~75–80%, this is the strongest bucket, but roughly 1 in 4 negative criteria are still tripped — confident hallucinations of plausible-looking but verifiably wrong numbers.
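
To make the calculation bucket concrete, the sketch below shows two of the transformations these rubrics grade: a weighted average cost of capital and an annualized Sharpe ratio. The function signatures, default annualization, and inputs are illustrative assumptions, not taken from the benchmark's tasks.

```python
import numpy as np

def wacc(equity_value, debt_value, cost_of_equity, cost_of_debt, tax_rate):
    """Weighted average cost of capital from market values of equity and debt."""
    total = equity_value + debt_value
    return (equity_value / total) * cost_of_equity + \
           (debt_value / total) * cost_of_debt * (1 - tax_rate)

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a series of periodic returns."""
    excess = np.asarray(returns, dtype=float) - risk_free_rate / periods_per_year
    return excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year)
```

Getting either formula right is routine; the calculation errors the benchmark surfaces come from applying such formulas to the wrong inputs, units, or periods.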

The implication for finance practitioners is direct: the current generation of frontier models is a plausible assistant for the analytical portions of a workflow — the research, the data ingestion, the calculation. It is not yet reliable as a decision engine. An analyst using these tools should expect to add value primarily at the final step, reviewing model output not just for numerical accuracy but for logical consistency between analysis and conclusion. That is a meaningful reduction in research burden but not an elimination of professional judgment.

Finding 2 - Capital allocation is the hard ceiling

Models are weakest where money is actually at risk

The performance gap across finance domains is not incidental — it tracks directly onto the distinction between functions that execute rules and functions that allocate capital. Treasury operations (cash-flow forecasting, FX hedging, intercompany settlements) and insurance (reserve adequacy, regulatory reporting) sit at the high end of the benchmark, with cross-model averages above 70%. These are functions where a professional's job is largely to apply a defined methodology to structured inputs, produce a precisely formatted output, and flag exceptions. Models are good at this. Venture capital and trading/hedge-fund tasks sit at the low end, averaging 22.8% and 31.3%, respectively, roughly 40-50 percentage points below treasury. These are functions where the job is to form a view, size a position, and commit to a recommendation that will be acted upon with real capital.

  • Venture Capital
    Weakest bucket is forecast / scenario (0.34 avg; GPT-5.5 0.61, Opus 0.41, Gemini 0.00). A VC analyst's core output is a set of entry, base, and downside scenarios that together justify a valuation and drive the investment committee memo. Models consistently fail to propagate assumptions through those scenarios coherently.
  • Trading / Hedge Funds
    Weakest bucket is decision / recommendation (0.22 avg; GPT-5.5 0.23, Opus 0.20, Gemini 0.25). A portfolio manager's deliverable is not an analysis — it is a position: a specific weight, a defined risk budget, a concrete trade. Models reach the analysis but fail to commit to the trade in a way that follows from it.
  • Corporate finance
    Weakest bucket is decision / recommendation (0.28 avg; GPT-5.5 0.30, Opus 0.24, Gemini 0.31). CFO-level outputs — M&A go/no-go recommendations, capital structure decisions, board-ready forecasts — require translating completed analysis into a defensible position. This is where models break down.
  • Insurance
    Weakest bucket is extraction / reporting (0.65 avg; GPT-5.5 1.00, Opus 0.60, Gemini 0.35). High variance across models on a small task count; the aggregate score masks significant inconsistency.
  • Treasury function
    Weakest bucket is forecast / scenario (0.74 avg; GPT-5.5 0.68, Opus 0.83, Gemini 0.71). Even in the strongest domain, multi-horizon scenario propagation introduces meaningful error.

Finding 3 - Divergent reasoning architectures

Similar scores, fundamentally different approaches — with implications that grow over time

Across 927 rollouts, GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro post similar aggregate scores. But the trajectories that produced those scores look nothing alike, and the differences matter for how these models will behave on more complex, longer-horizon tasks.

GPT-5.5 favors large, monolithic Python heredocs — it reads the data, transforms it, and prints the answer in a single extended code block. This is efficient on well-defined tasks but brittle when assumptions prove wrong partway through: the model tends to commit to a computation path and execute it rather than iterating. Claude Opus 4.7 splits work between Python and its native read/edit tools, mixing programmatic and natural-language reasoning steps in a way that is more transparent but also more variable. Gemini 3.1 Pro writes Python most frequently, but in shorter, more iterative invocations — inspect the file, compute one step, verify, compute the next — closer to how an analyst would actually work through an unfamiliar dataset.

The tool-usage charts track four metrics per model:

  • Tool calls: total tool invocations across the trajectory (read, write, bash, web search, etc.).
  • Code invocations: how often the agent invoked Python or Node (via python -c, a heredoc, or a script file). A proxy for whether the agent actually computed, or reasoned in its head.
  • Web usage: web searches plus webfetch calls, used to ground rates, market data, or methodology references that aren't in the attached files.
  • Package installs: how often the agent runs pip / apt install for missing libraries (pandas, openpyxl, scipy, pdfplumber). Higher means more aggressive environment setup.

The shell-behavior chart makes the strategic difference concrete. GPT-5.5's shell calls are dominated by python_inline — long heredocs that load data, compute, and output in one pass, averaging 5.5 code invocations per run. Gemini writes Python much more often (13.3 invocations per run) but in shorter, incremental bursts. Opus sits between them, with a notably higher share of native file-read and edit operations that bypass Python entirely. These are not cosmetic differences: they reflect fundamentally different reasoning strategies under uncertainty. GPT's one-shot approach is fast and efficient when the computation is well-specified, but it leaves little room to correct assumptions mid-task. Gemini's incremental approach is more robust to data surprises but generates higher overhead. Opus's hybrid approach trades efficiency for interpretability.
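
As an illustration of how counts like these could be derived from a trajectory log, here is a minimal sketch. The event schema and the regular expressions are assumptions for the example, not the OpenCode harness's actual log format.

```python
import re

# Illustrative patterns for spotting code invocations and package installs
# in shell commands; real trajectories may encode these differently.
CODE_RE = re.compile(r"python3?\s+-c|python3?\s*<<|\bnode\b|\.py\b")
INSTALL_RE = re.compile(r"\bpip3?\s+install\b|\bapt(-get)?\s+install\b")

def shell_metrics(trajectory):
    """Tally tool calls, code invocations, web usage, and installs.

    Each event is assumed to look like {"tool": "bash", "command": "..."} or
    {"tool": "websearch" | "webfetch" | "read" | "write", ...}.
    """
    counts = {"tool_calls": 0, "code_invocations": 0, "web": 0, "installs": 0}
    for event in trajectory:
        counts["tool_calls"] += 1
        if event["tool"] in ("websearch", "webfetch"):
            counts["web"] += 1
        elif event["tool"] == "bash":
            cmd = event.get("command", "")
            if CODE_RE.search(cmd):
                counts["code_invocations"] += 1
            if INSTALL_RE.search(cmd):
                counts["installs"] += 1
    return counts
```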

At the scale of Quantis tasks — well-specified, bounded, 103 expert-authored problems — these differences don't strongly predict final score. On real-world finance workflows that are longer-horizon, require adapting to unexpected data, or involve ambiguous briefs, the differences will compound. Teams building production finance AI should not assume that a model's benchmark performance on narrow tasks predicts how it behaves on the open-ended analysis work that takes most of a professional's day.

Environment: the agent's world

Each Quantis task drops a model into its own fresh sandbox with a real-world finance brief — an IFRS reconciliation workbook to balance, a hedge-fund backtest to run, a venture term sheet to evaluate, a treasury cash-flow forecast to build — together with the source files a human analyst would actually start from: workbooks with named ranges, broker reports, term sheets, monetary-policy decisions, call transcripts. The 103 tasks ship 144 attachments in all.

The agent runs on the OpenCode harness with a generic toolkit: a shell, file read/write, a code editor, web search, and webfetch. It is free to install packages, write Python or Node scripts, fetch external data, and iterate until it commits a final written answer. Reading the attachments well requires picking the right tool for each format — pandas for CSV, openpyxl for XLSX, pdftotext for PDFs, python-docx for Word documents — which is itself part of what the benchmark measures.
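
A minimal sketch of the kind of format dispatch an agent ends up writing is below; the helper and its choice of libraries are illustrative (it uses pdfplumber for PDFs, one option alongside the pdftotext mentioned above), not part of the harness.

```python
from pathlib import Path

def load_attachment(path):
    """Pick a reader for an attachment based on its file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        import pandas as pd
        return pd.read_csv(path)
    if suffix == ".xlsx":
        import openpyxl
        # data_only=True returns cached formula results; named ranges are
        # reachable via wb.defined_names
        return openpyxl.load_workbook(path, data_only=True)
    if suffix == ".pdf":
        import pdfplumber
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if suffix == ".docx":
        import docx  # python-docx
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    raise ValueError(f"no reader registered for {suffix}")
```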

Once the agent completes the task, an LLM judge — GPT-5.5 mini — scores the answer against an expert-authored rubric. The rubric is a list of weighted criteria, most positive ("states free cash flow as $14.2M ± $0.5M for FY26"), some negative ("does not report the wrong WACC of 8.5%"). The judge reads the submission and decides, criterion by criterion, whether each item is satisfied. The run reward is the signed-score ratio of earned weight to total positive weight, clamped to [0, 1].
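
Concretely, the scoring rule described above could be written as the sketch below; the criterion schema is an illustrative assumption, not the judge's actual output format.

```python
def run_reward(criteria):
    """Signed earned weight over total positive weight, clamped to [0, 1].

    Each criterion is assumed to be {"weight": float, "awarded": bool}:
    positive criteria carry weight > 0 and are awarded when satisfied;
    negative criteria carry weight < 0 and are "awarded" (the penalty
    applies) when the flagged-wrong value appears in the submission.
    """
    earned = sum(c["weight"] for c in criteria if c["awarded"])
    positive_total = sum(c["weight"] for c in criteria if c["weight"] > 0)
    return max(0.0, min(1.0, earned / positive_total))
```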

Implications: What does this mean for finance organizations deploying AI?

Quantis is hard because it is concrete. The rubric grades exact numbers, signed scenarios, and final recommendations — there is no credit for sounding plausible. All three frontier models demonstrate real capability on this benchmark: they read messy spreadsheets without hand-holding, write code to compute, and produce coherent finance memos. But three structural findings should shape how practitioners deploy them.

The back and middle office are defensible today. Models scoring 70–80% on treasury, insurance, and operational finance tasks are capable of materially accelerating workflows where the premium is on accuracy, consistency, and throughput — reconciliation, reporting, regulatory submissions, financial statement analysis. The remaining error rate still requires human review, but the productivity gain is real and the risk of undetected error is manageable in contexts with downstream checks.

The front office is a different story. Mean reward below 35% on trading and VC tasks, and an average decision/recommendation score of ~22% in trading specifically, means models are not yet reliable as analysts on capital allocation questions. The failure mode is not primarily computational — it is the inability to translate completed analysis into a coherent investment recommendation. In a context where a wrong recommendation can move capital, the risk profile is asymmetric in ways that benchmark averages obscure. Institutions deploying AI in investment-facing workflows should treat current models as research accelerators, not as decision support systems, and should build review processes accordingly.

The reasoning architecture matters more than the headline score. GPT-5.5, Opus, and Gemini score within a few percentage points of each other on Quantis. How they arrive at those scores — one-shot computation vs. iterative reasoning vs. hybrid approaches — will determine which model is most robust as task complexity increases. Finance teams building AI infrastructure should evaluate models on the actual task distribution they face, including long-horizon, open-ended, and ambiguous briefs, not only on narrow benchmark performance.

Example rollouts (Trading / Hedge Funds domain): each model's full rollout on a sample task, including the run metadata, the LLM judge's per-criterion grading, the submitted answer artifact, and the full tool-call trajectory.