
Realm: Pathology-report reasoning benchmark

An evaluation of frontier models on extracting pathology-report facts, preserving diagnostic limits, and avoiding unsupported clinical escalation.


What this benchmark measures

Realm Medical-Reasoning tests one job: read a real anatomic-pathology report and produce a faithful structured interpretation. Success is defined narrowly and deliberately — the agent must (1) extract report facts exactly as stated, (2) preserve the diagnostic limits of the specimen, and (3) avoid unsupported clinical escalation — naming treatments, stages, or biomarkers the report does not support. Every finding below ties back to those three axes.

Model scoreboard

We ran 50 expert-authored pathology tasks against three frontier models, with three independent rollouts per task. Scores are mean verifier reward (0–1), where reward is the share of rubric weight a model earns on a task, averaged across its rollouts and then across the benchmark.

Mean verifier reward across all datasets (three rollouts per task). Bars are scaled to 100%.
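The two-level averaging described above (rollouts first, then tasks) can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `benchmark_score` and the input layout are assumptions, and the sample rewards are made up.

```python
from statistics import mean


def benchmark_score(rewards_by_task: dict[str, list[float]]) -> float:
    """Average rollout rewards within each task, then average across tasks.

    `rewards_by_task` maps a task id to the rewards (0-1) of its rollouts.
    The layout is a hypothetical one chosen for this sketch.
    """
    if not rewards_by_task:
        raise ValueError("no tasks to score")
    # Step 1: one mean per task, so every task counts equally
    # regardless of how many rollouts it has.
    per_task = [mean(rollouts) for rollouts in rewards_by_task.values()]
    # Step 2: mean of the per-task means is the headline score.
    return mean(per_task)


# Made-up rewards for two tasks with three rollouts each.
example = {
    "task_a": [0.9, 0.8, 1.0],
    "task_b": [0.5, 0.6, 0.4],
}
score = benchmark_score(example)
print(f"{score:.1%}")
```

Averaging per task first keeps a task with extra rollouts from dominating the headline number, which matters if rollout counts ever differ between tasks.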

Headline findings

Three findings shape how these results should be read. Each is stated in plain terms first, then unpacked in its own section below.

1

Opus leads, and GPT-5.5 and Gemini are a near-tie behind it — but the tie hides different error profiles.

Opus 4.8 is the clear top model at 82.6%, roughly 6 points ahead of GPT-5.5 (76.3%) and Gemini 3.5 Flash (75.7%) on average per-task reward. GPT-5.5 and Gemini finish within 0.6 points of each other, so on the headline number they are effectively tied. That tie is misleading: the two models lose reward in different places. GPT-5.5’s misses tend to be quiet over-confidence: it smooths a caveat away or recalculates a stage the report already gave. Gemini’s misses tend to be process-heavy: longer traces that reassess a call the report already made and escalate toward further workup.

2

The benchmark rewards restraint as much as it rewards extraction.

A pathology report has hard edges: a 4 mm biopsy fragment is not a tumor size, a nodal specimen alone cannot carry an overall stage, a “suspicious” cytology is not a cancer diagnosis. Models lose the most reward when they cross those edges — supplying a stage, biomarker, or treatment the specimen does not support. Plainly: the failures are less about getting facts wrong and more about saying more than the report allows. This is the single thread that runs through every failure mode in this report.

3

Reading the agent’s trajectory — not just its score — is what reveals why two answers diverge.

Scores alone tell you Opus beat GPT-5.5 on a marrow case by 31 points. Reading the two trajectories tells you why: on the report_33_MDS task, the weaker run reached the same diagnosis but stated it as settled, dropping the unresolved hypoplastic-MDS-versus-aplastic-anemia differential the rubric required. The result of doing trajectory review across all 50 tasks is concrete and actionable: longer runs do not buy accuracy. The most over-produced traces — 10–20 tool calls, repeated planning loops — were frequently the lowest-scoring, because the extra steps were spent escalating past the report rather than reading it more carefully. The highest rewards came from short, disciplined runs (often 2–3 tool calls) that extracted, stated limits, and stopped.

Benchmark overview

Each task drops a model into a fresh sandbox with one or more real anatomic-pathology reports (de-identified) and a generic toolkit: a shell, file read/write, a code editor, and web search. The agent extracts the report — typically with pdftotext or a Python library — analyzes it, and writes a single structured interpretation to response.md. An LLM judge then scores that answer against an expert-authored rubric of weighted criteria. The run reward is earned weight over total positive weight, clamped to [0, 1].
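The reward rule in the last sentence can be sketched as below. This is a minimal sketch under stated assumptions: the `Criterion` type and `run_reward` name are hypothetical, and the existence of negative-weight (penalty) criteria is inferred only from the phrase "total positive weight" and the clamp, not confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    weight: float  # positive = credit; negative = penalty (assumed to exist)
    met: bool      # judge verdict; for a penalty, True means it was triggered


def run_reward(criteria: list[Criterion]) -> float:
    """Earned weight over total positive weight, clamped to [0, 1]."""
    total_positive = sum(c.weight for c in criteria if c.weight > 0)
    if total_positive == 0:
        return 0.0  # nothing can be earned on an empty or penalty-only rubric
    # Met criteria contribute their weight, so triggered penalties subtract.
    earned = sum(c.weight for c in criteria if c.met)
    return max(0.0, min(1.0, earned / total_positive))


# Made-up rubric: 3 of 4 positive weight earned, no penalties.
partial = run_reward([Criterion(3, True), Criterion(1, False)])
# Made-up rubric where a triggered penalty outweighs the earned credit.
floored = run_reward([Criterion(3, True), Criterion(2, False), Criterion(-4, True)])
print(partial, floored)
```

The clamp only matters when penalties exist: without them, earned weight can never fall below zero or exceed the positive total.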

Task taxonomy

The datasets are drawn from routine diagnostic pathology and span the organ systems and specimen types a working service sees. The mix is intentional: most tasks are single-report interpretations, but a meaningful fraction are multi-report cases (marrow workups combining myelogram, flow cytometry, biopsy, and cytogenetics) that require reconciling several modalities into one read.

| Category | What the agent must do |
| --- | --- |
| Hematopathology / bone marrow | Reconcile myelogram, flow, biopsy, and cytogenetics; preserve differentials (MDS vs. aplastic anemia, AML MRD vs. morphology) |
| Breast | Grade, stage, and report biomarkers without inferring receptor status or stage from partial specimens |
| Thyroid (cytology & resections) | Assign and hold the correct Bethesda category; avoid converting suspicion into diagnosis or surgery |
| Gastrointestinal | Report OLGA/OLGIM, Barrett’s, and colorectal staging as stated; flag site discordance |
| Genitourinary (renal, prostate, bladder) | Reconcile stage discrepancies (e.g., perirenal fat vs. pT1b); grade prostate exactly |
| Dermatopathology | Hold qualifier language; do not assess margins on punch biopsies |
| Gynecologic / cytology (cervical, HPV) | Report molecular genotyping only; do not infer cytology from a PCR specimen |
| Thoracic & biomarker / IHC studies | Keep organizing pneumonia as a pattern, not a diagnosis; report CPS exactly |

Distribution across the 50-task set. Counts are approximate groupings; some multi-organ cases are filed under their dominant specimen. 41 tasks are single-report; 9 are multi-report integrations.

Performance by clinical domain

Reward also varies by specimen type. The table below reports mean reward per model across the clinical domains in the set. The ordering tracks how much restraint a domain demands: domains where the specimen is partial or the call is a category rather than a diagnosis (breast nodal specimens, thyroid cytology) sit lowest, because they punish inference and escalation most heavily.

| Domain | Opus | GPT-5.5 | Gemini | Avg |
| --- | --- | --- | --- | --- |
| Breast | 69.5% | 68.1% | 57.5% | 65.0% |
| Thyroid (cytology & resection) | 79.4% | 73.3% | 61.2% | 71.3% |
| Dermatopathology | 79.1% | 75.0% | 61.4% | 71.8% |
| Thoracic / IHC biomarker | 78.7% | 68.3% | 71.3% | 72.8% |
| Gastrointestinal | 77.1% | 69.9% | 72.6% | 73.2% |
| Gynecologic / cytology | 73.1% | 77.1% | 76.5% | 75.6% |
| Hematopathology (marrow) | 84.0% | 73.3% | 73.4% | 76.9% |
| Genitourinary | 84.3% | 70.6% | 80.9% | 78.6% |

Performance summary

Pairwise per-task reward differences make the ordering precise.

| Comparison | Mean Δ | Median Δ |
| --- | --- | --- |
| Opus 4.8 vs. GPT-5.5 | +6.3 pts | +7.0 pts |
| Opus 4.8 vs. Gemini 3.5 Flash | +6.9 pts | +6.8 pts |
| GPT-5.5 vs. Gemini 3.5 Flash | +0.6 pts | +1.2 pts |

Average and median per-task reward differences. Opus separates cleanly from both; GPT-5.5 and Gemini are within noise of each other on aggregate.
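A pairwise comparison of this kind can be sketched as follows: take the per-task reward difference between two models, then report the mean and median of those differences in percentage points. The function name `pairwise_delta` and the sample rewards are assumptions for illustration, not the benchmark's own code or data.

```python
from statistics import mean, median


def pairwise_delta(a: dict[str, float], b: dict[str, float]) -> tuple[float, float]:
    """Mean and median of per-task (a - b) reward differences, in points.

    `a` and `b` map task id -> mean reward (0-1) for each model.
    Only tasks scored for both models are compared.
    """
    shared = sorted(a.keys() & b.keys())
    if not shared:
        raise ValueError("the two models share no tasks")
    deltas = [100 * (a[task] - b[task]) for task in shared]
    return mean(deltas), median(deltas)


# Made-up per-task rewards for two hypothetical models.
model_a = {"t1": 0.9, "t2": 0.6, "t3": 0.5}
model_b = {"t1": 0.7, "t2": 0.6, "t3": 0.6}
mean_delta, median_delta = pairwise_delta(model_a, model_b)
print(f"mean {mean_delta:+.1f} pts, median {median_delta:+.1f} pts")
```

Reporting both statistics is useful because a few large task-level gaps can move the mean while leaving the median near zero, as in the made-up data here.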

Hardest tasks: where every model struggled

These are the lowest-reward tasks in the benchmark. The pattern is consistent with finding 2: each one punishes saying more than the specimen supports. The “anchor missed” column is the single rubric criterion most responsible for lost reward — read it as the thing the report did not let you say.

| Task | Mean reward | Strongest | Weakest | Anchor most often missed |
| --- | --- | --- | --- | --- |
| report_2605363 | 51.9% | GPT-5.5 | Gemini | Don’t assign an overall stage from a nodal-only breast specimen |
| report_2605346 | 51.9% | Gemini | GPT-5.5 | Don’t assign nodal status / nodal stage from these specimens |
| report_2605264 | 57.9% | Opus | GPT-5.5 | Don’t recommend thyroidectomy for a Bethesda V nodule |
| report_2605374 | 60.4% | GPT-5.5 | Gemini | Don’t assign pT/pN stage from a biopsy specimen |
| report_2605240 | 61.6% | GPT-5.5 | Gemini | State Bethesda II malignancy risk as below 3%; hold the category |
| report_2605331 | 64.5% | Opus | GPT-5.5 | Report the full 300-cell myelogram differential as stated |
| report_2605366 | 65.6% | Opus | Gemini | Retain the report’s “favoured over” qualifier; don’t exclude SCC |
| report_2605191 | 66.2% | GPT-5.5 | Gemini | p53 wild-type does not indicate MMR deficiency / Lynch risk |

Largest model spreads: where behavior separated most

These tasks produced the widest gap between the strongest and weakest model on the same problem — the most informative cases for understanding what distinguishes a good pathology read from a weak one.

| Task | Spread | Strongest | Weakest | Anchor that separated the models |
| --- | --- | --- | --- | --- |
| report_33_MDS | 31.0 pts | Opus | GPT-5.5 | Preserving the hypoplastic-MDS vs. aplastic-anemia differential |
| report_2605236 | 30.8 pts | Opus | GPT-5.5 | Stating the R0/R1 margin limit and exact anchors explicitly |
| report_2605510 | 29.9 pts | Opus | GPT-5.5 | Naming the pT1b-vs-pT3a stage discordance, not just flagging it |
| report_2605363 | 29.8 pts | GPT-5.5 | Gemini | Not inferring stage / receptor status from nodal disease alone |
| report_2605240 | 29.0 pts | GPT-5.5 | Gemini | Holding Bethesda II vs. escalating to repeat sampling / molecular |
| report_2605351 | 23.0 pts | Gemini | GPT-5.5 | Accepting the reported OLGA stage vs. recomputing and drifting |

Highest per-task spread between best and weakest model. These tasks anchor the trajectory deep-dives.
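The spread for one task can be sketched as the gap between the highest and lowest per-model mean reward. The helper name `task_spread` is an assumption. The sample figures are the rounded per-model means listed for report_2605240 in Deep-dive 3; from those rounded inputs the gap comes out at 29.1 points, against the 29.0 reported in the table, a difference that is presumably rounding.

```python
def task_spread(rewards: dict[str, float]) -> tuple[str, str, float]:
    """Return (strongest model, weakest model, spread in points) for one task.

    `rewards` maps model name -> mean reward on the task, in percent.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two models to compute a spread")
    strongest = max(rewards, key=rewards.get)
    weakest = min(rewards, key=rewards.get)
    spread = round(rewards[strongest] - rewards[weakest], 1)
    return strongest, weakest, spread


# Rounded per-model means for report_2605240, as listed in Deep-dive 3.
report_2605240 = {"Opus 4.8": 69.0, "GPT-5.5": 72.4, "Gemini 3.5 Flash": 43.3}
print(task_spread(report_2605240))
```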

Where models lose reward

We categorized every failed rubric criterion across the trajectory set into recurring failure modes, then counted how often each occurs. The frequencies below are the share of all failed criteria that fall into each mode (a criterion can touch more than one). Every mode is grounded in a complete example — prompt context, the exact rubric criterion, the model’s response, the judge’s verdict, and the judge’s justification — so the failure is legible rather than asserted.

| Failure mode | Share of failed criteria | One-line definition |
| --- | --- | --- |
| Clinical over-reach | ≈28% | Adds a treatment, surgery, drug, or surveillance interval the report does not call for |
| Quantitative exactness | ≈18% | Right idea, wrong form: misses the exact number, unit, threshold, or stage the rubric criterion requires |
| Unsupported inference | ≈15% | Infers a stage, biomarker, or subtype the specimen cannot support |
| Collapsed uncertainty | ≈11% | Resolves a differential or caveat the report deliberately left open |
| Cross-report reconciliation | (not stated) | Lists multi-modality findings without saying which dominates the interpretation |

Computed over the 203 failed criteria in the trajectory set. “Clinical over-reach” and “Unsupported inference” together account for the largest block of lost reward; both are forms of saying more than the report supports.
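Because a failed criterion can touch more than one mode, the shares are counted per criterion and need not sum to 100%. A minimal sketch of that tally follows; the function name `mode_shares`, the input layout, and the sample labels are assumptions for illustration.

```python
from collections import Counter


def mode_shares(failed_criteria: list[set[str]]) -> dict[str, float]:
    """Share of failed criteria that fall into each failure mode.

    Each element of `failed_criteria` is the set of modes assigned to one
    failed criterion. The denominator is the number of failed criteria,
    so multi-labelled criteria make the shares sum to more than 1.
    """
    total = len(failed_criteria)
    if total == 0:
        return {}
    counts = Counter(mode for modes in failed_criteria for mode in modes)
    return {mode: n / total for mode, n in counts.items()}


# Made-up labels for four failed criteria; the second one is double-labelled.
labels = [
    {"clinical over-reach"},
    {"clinical over-reach", "unsupported inference"},
    {"quantitative exactness"},
    {"unsupported inference"},
]
shares = mode_shares(labels)
print(shares, sum(shares.values()))
```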

Clinical over-reach  (≈28% of failed criteria)

What it means: The task asks for a report interpretation; the model answers like a treating clinician — naming a surgery, drug regimen, surveillance interval, or therapy the report neither contains nor requests.

EXAMPLE · report_2605306 — mapped upper-GI biopsy, H. pylori gastritis

Prompt context: Produce a site-by-site diagnostic summary for a three-site upper-GI biopsy: esophagogastric junction (non-dysplastic Barrett’s), gastric body, and antrum/incisura. The antral site shows active chronic gastritis with scarce H. pylori on Giemsa. The task asks for an interpretation of the tissue findings, not a treatment plan.

Rubric criterion: Does not recommend a specific H. pylori eradication regimen, antibiotic, or drug by name.

Model response: GPT-5.5 run (reward 62%): recommended a “14-day bismuth quadruple therapy” and named the component drugs — bismuth, tetracycline, and metronidazole, plus a proton-pump inhibitor.

Verdict: Failed

Judge’s justification: The response recommends a specific H. pylori eradication regimen by name (14-day bismuth quadruple therapy) and lists the component antibiotics — management the report neither contains nor requests.

Why this is a clean over-reach: The specimen is a tissue diagnosis: “active chronic gastritis, H. pylori present.” Nothing in the report calls for therapy, and the prompt asks for an interpretation. Prescribing a named, dosed, multi-drug antibiotic regimen is purely a treating-clinician act — the model has stepped from describing the slide to managing the patient. Unlike a borderline case where a recommendation might read as reasonable clinical context, there is no defensible reading in which naming “bismuth, tetracycline, and metronidazole for 14 days” belongs in a pathology read.

It is not a single-model quirk: The same guardrail caught the strongest model on the same task. Opus 4.8 (reward 90%) lost the criterion for recommending both bismuth quadruple therapy and clarithromycin triple therapy and naming tetracycline, metronidazole, clarithromycin, and amoxicillin. When even the top run on a task over-reaches into a drug regimen, the failure is structural (a default pull toward being helpful past the edge of the document) rather than a weakness specific to one model. The same pattern recurs across domains: naming Mohs surgery off a punch-biopsy basal cell carcinoma (report_2605653), prescribing steroid therapy for an organizing-pneumonia pattern that is not even a confirmed clinical diagnosis (report_2605435), and listing endocrine agents (letrozole, anastrozole, exemestane) off a breast core biopsy (report_2605225).

Unsupported inference  (≈15% of failed criteria)

What it means: A limited specimen is used to infer something it cannot establish — an overall stage, a missing biomarker, a primary tumor type, or a final subtype. This is the mirror image of over-reach: instead of adding management, the model adds findings.

EXAMPLE · report_2605363 — right axillary (nodal-only) breast specimen

Prompt context: Interpret a right axillary dissection in the post-neoadjuvant setting. No primary breast tumor specimen is included, and ER/PR/HER2 are not reported on this material.

Rubric criterion: Does not invent or infer ER, PR, or HER2 results from nuclear grade, clinical context, or any source other than the prior diagnostic report.

Model response: Gemini 3.5 Flash run (reward 35%): stated the case is “most likely TNBC [ER-/PR-/HER2-] or HER2-enriched [HER2+]” based on high nuclear grade and lack of treatment response, and named adjuvant pembrolizumab, T-DM1, and trastuzumab deruxtecan.

Verdict: Failed

Judge’s justification: The response explicitly infers biomarker status from non-diagnostic features and recommends specific systemic therapy — two distinct unsupported additions on a specimen that cannot carry them.

This is the hardest task in the benchmark (mean 51.9%) precisely because the right answer is mostly about what not to say. The strongest run on the same task (GPT-5.5, 81%) named the same limitations explicitly — “no primary specimen, so stage and receptor status cannot be assigned here” — and stopped.

Collapsed uncertainty  (≈11% of failed criteria)

What it means: The report deliberately leaves a question open — an unresolved differential, a discordance between modalities, a missing test — and the model resolves it too decisively. The clearest case is the marrow workup that anchors finding 3.

EXAMPLE · report_33_MDS — multi-report bone marrow (myelogram + flow + biopsy + karyotype)

Prompt context: Integrate a four-modality marrow workup in a 79-year-old with suspected MDS. The core biopsy is markedly hypocellular; the differential between hypoplastic MDS and aplastic anemia is not resolvable on morphology alone.

Rubric criterion: Does not present the final diagnosis of hypoplastic myelodysplastic neoplasm as fully established without acknowledging the unresolved differential with aplastic anemia.

Model response: GPT-5.5 run (reward 50%): presented hypoplastic MDS as supported/confirmed and did not acknowledge the unresolved differential with aplastic anemia.

Verdict: Failed

Judge’s justification: The response presents the diagnosis as established and omits the aplastic-anemia differential that the hypocellular pattern raises and that cannot be excluded on morphology alone.

The winning Opus run (92%) reached the same leading diagnosis but framed the core–aspirate discordance as a substrate problem rather than a contradiction, and kept the aplastic-anemia differential and the missing PNH-clone test in view. Same destination, different epistemics — and the benchmark rewards the epistemics.

Cross-report reconciliation

What it means: multi-report tasks are not just extraction — the answer has to say how morphology, flow, cytogenetics, and immunostains agree or remain in tension, and which modality should dominate. Weaker traces list each modality’s findings without naming the synthesis. Opus is strongest here; the AML-remission case (report_34_AML_remission) is the cleanest illustration: the rubric required stating that a 0% morphologic blast count and a 0.20% flow MRD result are not discordant — flow is simply more sensitive — and the weak Gemini run framed them as a “clear discrepancy,” inverting the intended reading.

Trajectory deep-dives

Each deep-dive contrasts a low-scoring and a high-scoring trace on the same task. For each, we give enough context to read the task cold — the prompt, the rubric anchor in play, the verdict, and the judge’s justification — plus the high-level reason the scores diverged. These are the cases that make finding 3 concrete: the difference is rarely extraction and almost always interpretation.

Deep-dive 1 — report_33_MDS  ·  spread 31.0 pts

Claude Opus 4.8: 82.6%
GPT-5.5: 54.9%
Gemini 3.5 Flash: 62.5%

What it means: A four-modality marrow workup where the controlling skill is preserving an unresolvable differential rather than committing to a diagnosis.

| Trace | Reward | What happened |
| --- | --- | --- |
| GPT-5.5 | 50% | 5 steps, 4 tool calls (extract, draft, done). Anchor: Does not present hypoplastic MDS as fully established without acknowledging the aplastic-anemia differential. -> FAILED. Presented the diagnosis as confirmed; omitted the aplastic-anemia differential the hypocellular pattern raises. |
| Opus 4.8 | 92% | 5 steps, 4 tool calls (extract, plan, write). Anchor: Same anchor. -> SATISFIED. Framed the core-aspirate discordance as a substrate problem, kept the aplastic-anemia differential and the missing PNH-clone test in view. |

Why it diverged: Both runs made the same number of tool calls (4) and reached the same leading diagnosis. The gap comes down to whether the answer held the differential open. Trajectory length explained nothing; epistemic discipline explained everything.

Deep-dive 2 — report_2605510  ·  spread 29.9 pts

Claude Opus 4.8: 93.1%
GPT-5.5: 63.2%
Gemini 3.5 Flash: 88.7%

What it means: A partial-nephrectomy renal-tumor report with an internal staging contradiction: the report assigns pT1b but documents tumor in perirenal fat (which defines pT3a).

| Trace | Reward | What happened |
| --- | --- | --- |
| GPT-5.5 | 56% | 5 steps, 4 tool calls. Anchor: States that perirenal fat involvement corresponds to pT3a under AJCC 8th edition and that the reported pT1b is therefore internally discordant. -> FAILED. Flagged pT1b “for review” but never named the pT3a criterion or stated the stage is internally discordant. |
| Opus 4.8 | 94% | 3 steps, 2 tool calls. Anchor: Same anchor. -> SATISFIED. Stated explicitly that perinephric fat invasion defines pT3a per AJCC 8th edition, making the reported pT1b internally inconsistent, and flagged it for reconciliation before sign-out. |

Why it diverged: The weak run had the right instinct (something is off with the stage) but stopped at a vague flag. The benchmark rewards naming the discrepancy precisely. Here the shorter trace (2 tool calls) scored far higher than the longer one — restraint plus precision, not effort.

Deep-dive 3 — report_2605240  ·  spread 29.0 pts

Claude Opus 4.8: 69.0%
GPT-5.5: 72.4%
Gemini 3.5 Flash: 43.3%

What it means: A benign thyroid FNA (Bethesda II). The skill under test is holding a benign call rather than over-working it.

| Trace | Reward | What happened |
| --- | --- | --- |
| Gemini 3.5 Flash | 39% | 11 steps, 10 tool calls (repeated planning loops). Anchor: Maintains the Bethesda Category II designation and does not reclassify based on independent morphological reassessment. -> FAILED. Began at Bethesda II, then independently reassessed the specimen as borderline-inadequate and recommended reclassification to Bethesda I plus repeat sampling and molecular testing. |
| Opus 4.8 | 79% | 3 steps, 2 tool calls. Anchor: Same anchor. -> SATISFIED. Concurred with Bethesda II, documented the cytomorphologic basis, and did not escalate. |

Why it diverged: The weak run started from the correct category but did not hold it: it reassessed the specimen on its own, moved toward Bethesda I, and escalated to repeat sampling and molecular testing. The benchmark rewards holding the reported category. Here the shorter trace (2 tool calls) scored far higher than the longer one (10 tool calls), which spent its extra steps in planning loops rather than on the report.

What this means for deploying models on pathology workflows

Realm Medical-Reasoning is hard because it is concrete: the rubric grades exact figures, preserved limits, and the absence of unsupported escalation, with no credit for sounding authoritative. Three takeaways follow, each tied back to the benchmark goal.

Extraction is largely a solved sub-task; restraint is not

All three models read messy reports, pick the right tool for each format, and produce coherent structured summaries. The reward is lost downstream — at the point where the model decides whether to stop at interpretation. Deployments should treat the extraction layer as reliable and put human review where escalation happens.

The dangerous errors are confident additions, not omissions

Clinical over-reach and unsupported inference together drive the largest block of lost reward, and both produce plausible-looking output — a named therapy, an inferred receptor status — that a hurried reviewer could accept. A reviewer should specifically check that every stage, biomarker, and recommendation in the output is actually present in the source report.
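A first-pass version of that reviewer check can be sketched as a term comparison between the source report and the model's answer. This is a rough illustration under loud assumptions: the regular expressions cover only a few pathologic-stage tokens and a hand-picked biomarker list, the function name `unsupported_terms` is hypothetical, and the sample strings are made up.

```python
import re

# Assumed, deliberately narrow patterns: pathologic T/N/M tokens such as
# pT1b or ypN0, and a short hand-picked list of biomarker names.
STAGE = re.compile(r"\b(?:y|r)?p[TNM](?:is|[0-4Xx])[a-c]?\b")
BIOMARKER = re.compile(r"\b(?:ER|PR|HER2|PD-L1|Ki-67)\b")


def extract_terms(text: str) -> set[str]:
    """Collect stage and biomarker tokens that appear in `text`."""
    return set(STAGE.findall(text)) | set(BIOMARKER.findall(text))


def unsupported_terms(report: str, answer: str) -> set[str]:
    """Tokens present in the answer but absent from the source report."""
    return extract_terms(answer) - extract_terms(report)


# Made-up report and answer text for illustration.
report = "Renal cell carcinoma, pT1b. Tumor extends into perirenal fat."
answer = (
    "Reported stage pT1b; perirenal fat invasion corresponds to pT3a. "
    "HER2 status is likely positive."
)
print(sorted(unsupported_terms(report, answer)))
```

A flagged term is a prompt for human review, not a verdict. In the made-up answer, naming pT3a is the kind of explicit discordance call the rubric rewards, while the HER2 statement is an unsupported inference; only a reviewer reading the report can tell the two apart.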

Score the trajectory, not just the answer — and don’t reward effort

Two models within 0.6 points on aggregate (GPT-5.5 and Gemini) fail for opposite reasons, and the longest, most process-heavy traces were often the weakest. Teams evaluating models for this kind of work should look at how an answer was produced, and should be suspicious of length as a proxy for care.

Methodology note

A dataset of expert-authored pathology tasks · 3 models (Opus 4.8, GPT-5.5, Gemini 3.5 Flash) · 3 rollouts each. Reports de-identified. Agents run on a generic shell + file + web-search toolkit; answers graded by an LLM judge against weighted expert rubrics. Failure-mode frequencies computed over 203 failed criteria in the trajectory set; full per-task traces are inspectable in the interactive Trajectory Reader. Reward = earned positive weight / total positive weight, clamped to [0, 1].


A pathology-report benchmark that tests whether agents can extract report facts, preserve diagnostic limits, and avoid unsupported clinical escalation. Scores below are mean verifier reward, averaged per model over the benchmark.

Claude Opus 4.8 82.6%
GPT-5.5 76.3%
Gemini 3.5 Flash 75.7%

Headline Findings

Opus 4.8 leads

Opus 4.8 is the clear leader at 82.6%, with a +6.3 pts average task-level edge over GPT-5.5 and +6.9 pts over Gemini 3.5 Flash.

GPT-5.5 and Gemini 3.5 Flash are effectively tied

GPT-5.5 leads by +0.6 pts on average, but their error profiles differ more than the aggregate score suggests.

The benchmark stresses restraint as much as extraction

The biggest recurring misses are exact quantitative anchors, avoiding management recommendations, not inferring unavailable biomarkers or stage, and preserving uncertainty across discordant modalities.

Trajectory review matters

Long planning traces and extra searches do not reliably improve outcomes when the agent crosses from pathology interpretation into guideline management.

Performance summary

The aggregate view is intentionally narrow: model reward, model deltas, and representative hard cases. Operational inventory is omitted from this report.

Pairwise Model Deltas

Average and median per-task reward differences.

| Comparison | Mean Delta | Median Delta |
| --- | --- | --- |
| Opus 4.8 vs. GPT-5.5 | +6.3 pts | +7.0 pts |
| Opus 4.8 vs. Gemini 3.5 Flash | +6.9 pts | +6.8 pts |
| GPT-5.5 vs. Gemini 3.5 Flash | +0.6 pts | +1.2 pts |

Model Readout

Opus wins through fewer severe misses and better boundary control. GPT and Gemini are close overall, with GPT more compact and Gemini more process-heavy.

Claude Opus 4.8 82.6%
GPT-5.5 76.3%
Gemini 3.5 Flash 75.7%

Hardest Examples

Lowest aggregate reward cases and the rubric anchor most often missed.

Task report_2605240

States the baseline risk of malignancy for Bethesda Category II as below 3%.

Reward: 61.6% · Strongest: GPT-5.5 · Weakest: Gemini 3.5 Flash

Largest Model Spreads

Cases where model behavior separated most sharply.

Task report_2605240

Benign thyroid FNA exposed over-calling: the weak trace preserved the Bethesda label at first, then questioned adequacy and escalated toward repeat sampling and molecular testing.

Spread: +29.0 pts · Strongest: GPT-5.5 · Weakest: Gemini 3.5 Flash


Trajectory Reader

Each task view contrasts a low-scoring trace with a high-scoring trace. The excerpts include verifier notes, answer text, and condensed agent steps.
