
Realm Warren: Legal reasoning benchmark

As part of our Realm benchmark series, we built the standard for evaluating legal reasoning in AI systems.

Introduction

We built Realm Warren to evaluate legal reasoning as a skill that develops across multiple steps. The benchmark is set in realistic litigation, transactional, and compliance contexts. A task may ask the model to draft a motion and respond to new arguments, write a memo identifying potential claims and revise it as the discovery record changes, or prepare a compliance analysis that must be updated for a new jurisdiction or a shift in the relevant timeframe. Each task tests whether a model can produce a single legal work product and whether it can adapt that work as circumstances evolve. That is the core of long-horizon legal reasoning.

The benchmark spans federal and state law. Tasks cover established fields like IP, securities, criminal law, civil rights, and state civil procedure, alongside specialized or evolving areas like NCAA disputes, veterans benefits, and government contracts. The breadth is designed to give a representative view of model capability across well-developed doctrinal areas with abundant public materials and areas with messy records, fragmented standards, and changing law.

Realm Warren scores the final output. Its rubrics decompose performance through IRAC: issue spotting, rule identification, factual application, and legal conclusion. Polished writing can mask weak reasoning; the IRAC decomposition exposes it. Looking beneath the final answer lets us see whether a model is reasoning like a lawyer or just writing like one.

Environment: the agent's world

Each task drops the model into a sandboxed container with a read-only file system of legal materials — statutes, case law, client documents, and (where relevant) image exhibits — and a minimal set of tools.

The model has access to a small toolkit: a shell for navigating the file system and running standard utilities, file-read and file-edit tools for working with documents, an image viewer for visual exhibits, and a web search tool for cases the local library doesn't cover. The model writes its memo and submits the final answer when done.
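
A rough sketch of what this toolkit might look like as a tool manifest is below. Apart from view_image, which appears later in this report, the tool names and fields are placeholders rather than Warren's actual configuration.

```python
# Illustrative tool manifest for the sandboxed agent. Names and descriptions
# are hypothetical (only view_image is confirmed elsewhere in this report);
# they mirror the toolkit described above, not the benchmark's real config.
TOOLS = [
    {"name": "bash",       "description": "Run shell commands against the read-only case file system."},
    {"name": "read_file",  "description": "Read a statute, opinion, or client document."},
    {"name": "edit_file",  "description": "Create or edit the working draft of the memo."},
    {"name": "view_image", "description": "Open a visual exhibit (photo, screenshot, design mark)."},
    {"name": "web_search", "description": "Search for authority the local library does not cover."},
    {"name": "submit",     "description": "Submit the final work product for grading."},
]
```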

Every criterion in the rubric is checked by an LLM judge against the submitted memo; per-criterion scores are weighted and summed to produce the final reward (0–1).
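
In code, that aggregation is just a weighted average of per-criterion judge scores. A minimal sketch, assuming binary judge scores and normalization by total weight (the field names and the normalization are our assumptions, not the benchmark's published formula):

```python
def weighted_reward(criteria):
    """Combine per-criterion LLM-judge scores into a single 0-1 reward.

    `criteria` is a list of dicts like {"score": 0 or 1, "weight": float}.
    Field names and the normalization by total weight are assumptions made
    for illustration; the benchmark's exact aggregation may differ.
    """
    total_weight = sum(c["weight"] for c in criteria)
    if total_weight == 0:
        return 0.0
    return sum(c["score"] * c["weight"] for c in criteria) / total_weight


# Example: three criteria, the heavily weighted one missed.
print(weighted_reward([
    {"score": 1, "weight": 1.0},   # issue spotted
    {"score": 0, "weight": 3.0},   # controlling rule missed
    {"score": 1, "weight": 1.0},   # conclusion stated
]))  # -> 0.4
```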

How three models compare

Claude Opus 4.7 and GPT-5.5 are statistically indistinguishable on mean weighted reward — Opus leads by 0.007, well within either model's 95% confidence interval. Gemini 3.1 Pro is clearly separated below both, at roughly two-thirds of either's score. Confidence intervals were computed by bootstrapping the benchmark (10,000 resamples of tasks with replacement, paired across models).
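
A sketch of that procedure (function and variable names are ours; only the resampling scheme, tasks drawn with replacement and paired across models, comes from the setup above):

```python
import random

def paired_bootstrap_ci(rewards_a, rewards_b, n_resamples=10_000, alpha=0.05):
    """Bootstrap a confidence interval for the mean-reward gap between two models.

    rewards_a and rewards_b are per-task rewards, aligned so that index i is the
    same task for both models. Each resample draws tasks with replacement and
    keeps the pairing, so per-task difficulty cancels out of the comparison.
    """
    n = len(rewards_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        mean_a = sum(rewards_a[i] for i in idx) / n
        mean_b = sum(rewards_b[i] for i in idx) / n
        diffs.append(mean_a - mean_b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the resulting interval for the Opus-minus-GPT-5.5 gap straddles zero, the two models are statistically indistinguishable on this benchmark, which is how the 0.007 gap should be read.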

Key findings

The IRAC chain breaks after issue spotting.

Models reliably identify the legal issue in play, but score much lower on selecting the controlling rule, applying it to the facts, and reaching a supported conclusion. Application failures are sharpest when the legally significant fact is an absence from the record, such as missing service or lack of notice.

Performance fades on later deliverables.

All three models perform better on the first half of a multi-stage task than the second half. They front-load their effort and struggle to revise an earlier conclusion when new facts, contrary evidence, or new authority arrive.

Skipping visual exhibits is costly.

When models decline to open an image, they often invent details about its content, and the rubric penalizes those inventions. Opening the image is no guarantee of a correct read, but skipping it reliably loses points.

IRAC: scoring the primitives of legal reasoning

IRAC stands for Issue, Rule, Application, and Conclusion. It is the framework U.S.-educated lawyers commonly use to structure legal arguments in briefs, memos, and exams, making it one of the closest points of consensus on the basic steps of legal reasoning. For evaluation, IRAC turns "legal reasoning" from a broad label into a sequence of measurable steps: whether the model identified the issue, selected the governing rule, applied that rule to the facts, and reached a supported conclusion.

Each Warren task contains 35 to 60 rubric criteria, each scoring a different part of the final deliverable. The distribution across categories is issue 4% · rule 33% · application 48% · conclusion 9% · other 6%. "Other" captures important deliverable-specific criteria that do not map cleanly onto IRAC, such as "mentions adding a delay notification clause to Section 2.1."
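
As an illustration, a single criterion might be represented with an IRAC category tag and a weight, and the per-category figures reported below can then be computed by grouping on that tag. The field names and the example criterion here are hypothetical:

```python
from collections import defaultdict

# A hypothetical rubric criterion; the real schema may differ.
criterion = {
    "id": "warren_26_app_07",
    "category": "application",   # one of: issue, rule, application, conclusion, other
    "weight": 2.0,
    "text": "Identifies that the Notice of Hearing was delivered only seven days "
            "before the hearing, short of the 10-day notice requirement.",
}

def category_breakdown(graded_criteria):
    """Average judge scores per IRAC category.

    Each element is a criterion dict like the one above, with a "score" field
    added by the LLM judge after grading the submitted memo.
    """
    by_category = defaultdict(list)
    for c in graded_criteria:
        by_category[c["category"]].append(c["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in by_category.items()}
```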

Within the IRAC buckets, issue spotting is the clear high point across all three models. Claude Opus 4.7 scores highest at 0.68, followed by GPT-5.5 at 0.63 and Gemini 3.1 Pro at 0.56.

Performance drops sharply once the task moves from recognizing the legal issue to doing the reasoning work behind it. Rule identification and factual application cluster much lower, with scores generally in the 0.20 to 0.37 range. Conclusion scores are similarly weak. This suggests that models can often identify the general legal area implicated by a task, but struggle more with selecting the controlling rule, applying it to the record, and reaching a supported conclusion.

The rest of this section focuses on the two most important failure modes behind that drop: Rule and Application.

Rule failures

Rule failures occur when a model identifies the right legal neighborhood but misses the specific governing authority, standard, exception, or procedural requirement. For example:

  • Task 13 — NYPD Stop and Frisk Gun Search, Opus 4.7 run. The answer analyzed Fourth Amendment excessive-force claims and cited Graham v. Connor, including the factors used to assess objective reasonableness. It did not, however, identify that the Southern District of New York applies a balancing test for excessiveness, nor state that excessiveness is determined by balancing the nature and quality of the intrusion against the governmental interests at stake.
  • Task 5 — "The Chernobyl Project" Independent Film, GPT-5.5 run. The answer mentioned Louisiana's right-of-publicity statute, La. Stat. Ann. §§ 51:470.1 to 51:470.11, but did not identify the key statutory limitation: under the Allen Toussaint Legacy Act, use of an identity in an audiovisual work is exempt unless it creates an unauthorized performance — meaning the use of a digital replica to substitute for a professional performer in a work where the performer did not actually appear. The answer discussed digital-replica and likeness issues generally, but did not articulate that limitation in the relevant statutory context.

A caveat on Rule scores: rule identification is not always citation matching. In common law, multiple cases may support the same holding or legal test, so a rubric can risk overweighting one doctrinal path. Realm Warren limits this risk by requiring specific case law only when the authority is seminal, controlling, or necessary. Most Rule rubrics instead cite primary authorities, such as statutes, regulations, or procedural rules, and focus on substance: whether the model understands the governing standard, exception, limitation, burden, or procedural requirement.

Application failures

Application failures occur when the model states a relevant rule but fails to connect it to the facts that decide the issue. For example:

  • Task 26 — Building Code Enforcement Action, Opus 4.7 run. The criterion required the answer to identify that the Notice of Hearing for VIO2025-00001 was delivered only seven days before the hearing and therefore did not satisfy the 10-day notice requirement under Lee County, Florida Administrative Code 2-14, Rule 1.09(a)(1)(b) (2025). The answer discussed service of the VIO2025-00001 notice, but stated that the July 25, 2025 personal delivery for the August 1, 2025 hearing "fully complies" with the statute and did not flag the 10-day notice defect.

The Application weakness is especially pronounced when the controlling fact is the one missing from the record.

For example:

  • Task 15 — Citation - BBQ Boat, GPT-5.5 run. This Motion for Rehearing task required the model to identify that the defendant's Motion to Dismiss was not served on the County, and that the court overlooked that defect in its ruling. The answer identified related procedural concerns, including that the motion was not noticed for the hearing and that the County lacked counsel at the hearing. But it did not state that the Motion to Dismiss was not served on the County, did not rely on Fla. R. Civ. P. 1.080(a), and did not argue that the court overlooked the service defect as a basis for rehearing.

Application failures driven by a missing fact matter because many legal arguments turn on absence: lack of notice, missing service, failure to exhaust remedies, absent consent, unsigned agreements, procedural defects, or the nonexistence of a required element. A benchmark that rewards only affirmative fact extraction misses a central part of legal reasoning.

Models can often produce plausible legal conclusions, but those conclusions are most useful when the underlying issue, rule, and factual application are also correct. Realm Warren therefore treats fluency as insufficient. What matters is whether the model reaches a conclusion through the correct reasoning path.

Long-horizon reasoning: the fading tail

Legal work is cumulative and long-horizon by design. In litigation, a motion often has to respond to the other side's latest filing, and client-facing advice changes as discovery reshapes the record. In compliance, an approach that works in one jurisdiction may need revision in another, and a memo may need to change when previously relied-upon authority is narrowed or overturned.

Warren prompts simulate this by introducing new facts, arguments, jurisdictions, or authorities midway through the task. The benchmark then tests whether models can update their analysis and carry that revision into later deliverables.

To measure this, we grouped rubrics by whether they score the first-half deliverable or the second-half deliverable. The chart shows a consistent second-half drop across all three models. Claude Opus 4.7 falls from roughly 0.52 in the first half of multi-stage tasks to about 0.40 in the second half. GPT-5.5 drops from about 0.47 to 0.33. Gemini 3.1 Pro starts lower, around 0.31, and declines slightly to about 0.29.

Average weighted reward on the first half of each task's deliverables vs the second half, across multi-stage tasks. Opus and GPT-5.5 each shed roughly 0.12–0.14 in reward across the boundary; Gemini's smaller absolute drop reflects a much lower starting point.
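
The split itself is mechanical: each criterion carries a stage tag, and weighted scores are averaged within each half. A sketch, with assumed field names:

```python
def stage_means(criteria):
    """Mean weighted score for first-half vs. second-half criteria.

    Assumes each criterion dict carries a "stage" field ("first" or "second"),
    a judge "score", and a "weight"; the field names are illustrative.
    """
    halves = {"first": [0.0, 0.0], "second": [0.0, 0.0]}  # [weighted sum, total weight]
    for c in criteria:
        halves[c["stage"]][0] += c["score"] * c["weight"]
        halves[c["stage"]][1] += c["weight"]
    return {stage: (num / den if den else 0.0) for stage, (num, den) in halves.items()}
```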

In the NYPD Stop and Frisk Gun Search task, Stage 1 asks the model to identify viable constitutional claims arising from the May 1, 2024 stop-and-frisk encounter. Stage 2 reassesses those claims against new discovery — the UF-250 form, BOLO notice, and deposition testimony — and analyzes available NYPD disciplinary actions. All three models drop sharply on the post-discovery stage: Opus 0.56 → 0.38, GPT-5.5 0.42 → 0.20, Gemini 0.27 → 0.14.

For example:

  • Task 19 — CA Housing Discrimination Litigation. This task initially asks the model to prepare a legal memorandum identifying viable state and federal housing discrimination claims for a tenant whose rental application was denied after questions about marital status, family structure, and her child. The task then introduces discovery that complicates the original theory, including rental-application data and deposition testimony. This long-horizon design tests whether the model can revise its earlier claim assessment as the record evolves.

    In the GPT-5.5 run, the model discussed California FEHA marital-status claims but failed to recognize that the discovery record weakened the claim by making it difficult to show the landlord's decision was substantially motivated by marital-status discrimination. Opus 4.7 showed a similar pattern. It identified most relevant claims from the pre-discovery record, but failed to revise its conclusions once contrary evidence was introduced.

The drop reflects a reasoning-continuity gap. Long-horizon legal reasoning requires a model to preserve the legally relevant parts of the record, update its analysis as the task develops, and reuse prior conclusions in later work. A model that performs well on a short legal question may still struggle when asked to reason across a sequence of dependent legal tasks.

How models navigate visual exhibits

Many Warren tasks include visual exhibits — design marks, asset grids, screenshots referenced in notices, photographs attached to client letters. The choices a model makes about which of those files to open are visible in the trajectory, and they show up in the score.

The clearest pattern shows up when a model decides not to look. The judge consistently catches it. We see runs that make confident, specific claims about an image the model never opened — asserting a timecode, describing a screenshot, citing what an exhibit "shows" — and get marked down for inventing facts about content the model didn't see.

The cost of that habit is measurable. On image-tagged criteria, all three models average negative scores when they skip view_image (≈ -0.28 for Opus, -0.10 for GPT-5.5, -0.35 for Gemini), and modestly positive scores when they open the visuals (+0.19, +0.17, and +0.06, respectively). Even the highest-scoring image-aware runs hit only ~0.4 on image criteria, so opening the image is not sufficient for a correct answer. Skipping the image, though, often leads to unsupported claims about content that was available to inspect.
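
Measuring this from the trajectories is straightforward: split runs by whether a view_image call appears, then average the scores on image-tagged criteria. A sketch, with an assumed run structure:

```python
def image_criterion_scores(runs):
    """Average score on image-tagged criteria, split by whether the run opened any image.

    Each run is assumed to look like:
      {"tool_calls": ["bash", "read_file", "view_image", ...],
       "criteria": [{"tags": ["image"], "score": -0.5}, ...]}
    The structure and field names are illustrative.
    """
    opened, skipped = [], []
    for run in runs:
        scores = [c["score"] for c in run["criteria"] if "image" in c["tags"]]
        if not scores:
            continue
        bucket = opened if "view_image" in run["tool_calls"] else skipped
        bucket.extend(scores)
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {"opened_image": mean(opened), "skipped_image": mean(skipped)}
```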

Summary

The headline result is not that legal reasoning can be reduced to a single score. Legal work often allows multiple valid paths to the same conclusion, especially in common law domains where different authorities may support the same rule. We treat the sub-40% result as a diagnostic signal, not a final verdict on model capability. It shows where models tend to break down when asked to reason through evolving legal work products.

Three failure modes drive the result.

First, the IRAC reasoning chain often breaks after issue spotting. Models can identify the general legal area, but struggle to select the controlling rule and apply it to the factual record. Application failures are especially pronounced when the legally significant fact is an absence or omission, such as missing service, lack of notice, or failure to exhaust. The result is a conclusion that sounds plausible but is unsupported.

Second, models front-load their effort and fail to revise. They perform better on the initial deliverable, then degrade when later work requires updating an earlier conclusion based on new facts, contrary evidence, changed procedural posture, or new authority.

Third, multimodal records increase hallucination risk. In some runs, we found evidence that the model did not open the image at all, but still answered the question as if it had reviewed the visual record. This leads to misstated or unsupported facts, especially when legally significant information appears only in image-based materials.

The broader implication is that high-value legal work places heavy demands on models because that work is long-horizon by design. Realm Warren surfaces the incremental capabilities that need to improve: maintaining context, revising conclusions under changed facts, applying rules to incomplete or messy records, and grounding analysis in multimodal sources. These failure modes can guide fine-tuning, targeted evaluation, and agentic design, including when to force record checks, trigger rule verification, require contradiction review, or add safeguards before carrying an earlier conclusion into a later deliverable.

We also see this benchmark as a starting point for future research. Further work should examine how legal evaluations can account for multiple valid doctrinal paths, alternative authorities, and the different ways lawyers reach the same conclusion. The goal isn't to make legal reasoning artificially rigid, but to build evaluations that are objective and structured enough to diagnose failure, while still flexible enough to reflect how lawyers actually reason.

Below are three full runs from the NYPD Stop and Frisk Gun Search task — one per model. Each run includes the per-criterion grading the LLM judge produced, the submitted memo, and the full trajectory of model thinking, tool calls, and tool results. Use the dropdown to switch between models.