1. Why we built this

Contract redlining is judgment-dense and strategically complex. It is closer to poker than to math. There are early-game moves, countermoves, tradeoffs, and endgames, all undertaken with incomplete information. Party leverage, business priorities, counterparty tolerance, and the value of each move are often uncertain. There is rarely a single right move.

That creates two challenges for benchmark design. First, the benchmark has to reflect the complexity of the workflow itself. A useful redline is not just a legally correct clause edit. It is a move in a negotiation, shaped by deal context, party posture, timing, and the need to preserve momentum toward execution.

Second, the judgment behind a strong redline is often only implicit in the attorney's work product. A golden response may show what an attorney changed, but not why the issue mattered or how the attorney weighed it against the rest of the negotiation. A useful benchmark therefore has to preserve attorney output in its native form while also converting the most important redlining judgments into structured evaluation criteria.

The Crosby-micro1 RedlineBench is designed around those challenges. It uses multi-turn SaaS MSA negotiations, document-native redlines, attorney-authored golden responses, and rubrics tied to the decisions attorneys considered most important at each stage. By collecting data in the real workflow of contract negotiation, the benchmark evaluates models as negotiation participants, not merely drafting assistants.

2. Summary of Findings

GPT-5.5 has the highest overall turn-weighted rubric score, but the spread across models is narrow, suggesting that the benchmark remains challenging across the frontier model set.

No.	Finding	Detail
1	Issue prioritization	Issue prioritization is a shared weakness. Models struggle to identify the issues attorneys collectively treat as most important, especially when initiating redlines on a clean template.
2	Over-acceptance	Models exhibit a systematic over-acceptance bias when forced to accept or reject counterparty redlines. This pattern suggests that models lack a genuine understanding of the commercial stakes behind redlined terms and instead default to agreement regardless of substance.
3	Surgicalness	Claude Fable 5 leads on surgicalness. Among the models, Fable 5 comes closest to attorney drafting behavior, with the lowest reliance on block edits and the shortest average edit length.
4	The gap	Current models remain meaningfully short of attorney-grade redlining. The gap is not limited to legal correctness. Models remain weaker on strategic issue selection, vendor-side commercial judgment, drafting precision, and adaptive position management across turns.

3. Designing RedlineBench

3.1 Simulating Multi-turn SaaS MSA Negotiations

The benchmark is structured as a multi-turn simulation rather than a set of isolated redlining tasks. Each negotiation proceeds through four alternating attorney turns, requiring each side to respond to the evolving contract, prior counterparty redlines, and its own legal and commercial objectives. This design allows attorneys to develop granular rubrics that capture how positions shift, tradeoffs are managed, and deal strategy develops over the course of a negotiation, rather than offering only a static view of redlining judgment.

The SaaS MSA scenarios operationalize this design through three simulated technology transactions involving AgentCo, a Series A HR technology company offering an AI-powered product called TalentFlow, and larger enterprise counterparties. The scenarios share a common commercial foundation but vary the initiating document, negotiating posture, and deal stakes to test how redlining decisions change under different transaction conditions.

No.	Scenario	Description
1	Scenario 1	Scenario 1 begins with LargeCo sending its SaaS MSA template to AgentCo. AgentCo reviews the customer-side template and initiates the first round of redlines.
2	Scenario 2	Scenario 2 reverses the paper. AgentCo sends its own SaaS MSA template, and LargeCo initiates the first round of redlines.
3	Scenario 3	Scenario 3 increases both the complexity and commercial pressure of the transaction. The deal is scaled from a pilot to a production deployment that is approximately ten times larger in size, and AgentCo is instructed that the contract is a must-win opportunity. Instead of receiving a clean SaaS MSA, AgentCo receives a services agreement from GiantCo and must adapt it to fit the SaaS transaction while avoiding excessive redlining that could jeopardize the deal.

Across these scenarios and turns, the attorney-authored redlines and corresponding rubrics create the basis for evaluating model outputs against the legal, commercial, and strategic judgments reflected in the simulated negotiations.

3.2 Evaluation Dimensions

The five evaluation dimensions provide the organizing framework for scoring model-generated redline outputs using attorney-authored rubrics. Each rubric is designed to correspond to the attorney-authored golden response redline, translating that response into evaluation criteria for model outputs. The dimensions reflect the core considerations in contract redlining: legal correctness, adherence to commercial context, negotiation quality, counterparty acceptance prediction, and deal-closing orientation. Together, they allow the benchmark to evaluate model behavior with greater granularity and identify specific failure modes across key aspects of the redlining task.

No.	Dimension	Failure modes it captures
1	Legal correctness	Misstates the law; introduces unenforceable language; creates ambiguity or conflicts elsewhere in the contract.
2	Adherence to commercial context	Contradicts explicit business instructions, such as budget caps, go-live dates, or critical deal breakers; proposes fallbacks outside stated guardrails.
3	Negotiation quality	Over- or under-aggressive relative to leverage and stage; concedes key terms too easily; over-lawyers immaterial issues; misses trade-off opportunities.
4	Counterparty acceptance prediction	Proposes positions that are obvious non-starters to counterparty; fails to recognize when language is already favorable; accepts extreme positions without justification.
5	Deal-closing orientation	Optimizes for "winning" every term rather than closing; unnecessarily prolongs the markup with minor, low-impact edits.

3.3 Rubric Distribution

The simulated negotiations produced a rubric dataset consisting of criteria authored by participating attorneys as they worked through each negotiation turn. Because the rubrics are tied to the redlines attorneys identified as most important at the moment of decision, the dataset reflects how senior practitioners framed, weighted, and categorized the key legal, commercial, and strategic issues arising throughout the benchmark.

Rubrics are distributed across evaluation dimension, negotiation turn, party side, and scenario:

No.	Dimension of split	Distribution
1	Party side	48% provider-side / AgentCo, 52% customer-side / LargeCo or GiantCo.
2	Scenario	39.1% Scenario 1; 28.6% Scenario 2; 32.3% Scenario 3.
3	Evaluation dimension	25.7% legal correctness, 33.4% adherence to commercial context, 17.0% negotiation quality, 10.2% counterparty acceptance prediction, 13.7% deal-closing orientation.
4	Negotiation turn	14.7% Turn 1; 34.7% Turn 2; 28.7% Turn 3; 21.9% Turn 4.

Overall, the distribution shows coverage across the benchmark's core evaluation dimensions and negotiation conditions, while preserving the natural variation that arises across scenarios, party positions, and stages of negotiation.

3.4 Branching Design and Turn-Weighted Scoring

The benchmark uses a branching design to reduce the influence of any single attorney's redlining preferences. Multiple attorneys independently initiate positions in each scenario and later turns branch as additional attorneys respond to prior redlines. This creates multiple attorney-authored rubric sets for the same scenario and stage of negotiation, allowing model performance to be evaluated across a range of practitioner judgments rather than against a single negotiation path.

Scores are reported on a turn-weighted basis. This prevents turns with more rubric items from having an outsized effect on aggregate results and accounts for the fact that later turns often involve fewer open issues as the negotiation narrows. Together, the branching design and turn-weighted scoring make the evaluation less dependent on individual attorney subjectivity or the uneven distribution of rubric items across negotiation stages.
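The turn-weighted idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: it assumes each rubric result is a hypothetical `(scenario, turn, passed)` tuple, that items are unweighted within a cell, and that every (scenario, turn) cell counts equally in the headline number.

```python
from collections import defaultdict


def turn_weighted_score(results):
    """Average pass rate over (scenario, turn) cells, each cell weighted equally.

    `results` is an iterable of (scenario, turn, passed) tuples, one per
    rubric item. The tuple layout is an assumption made for this sketch.
    """
    cells = defaultdict(list)
    for scenario, turn, passed in results:
        cells[(scenario, turn)].append(1.0 if passed else 0.0)
    if not cells:
        raise ValueError("no rubric results to score")
    # First average inside each cell, then average the cell rates.
    cell_rates = [sum(v) / len(v) for v in cells.values()]
    return sum(cell_rates) / len(cell_rates)


# Toy data: one cell with four rubric items, one cell with a single item.
toy = [
    (1, 1, True), (1, 1, False), (1, 1, False), (1, 1, False),  # 25% pass
    (1, 4, True),                                               # 100% pass
]
print(turn_weighted_score(toy))  # 0.625
```

Pooling the same five items without cell weighting would give 2/5 = 0.4, so the rubric-heavy cell would pull the aggregate toward its own pass rate. Averaging cells first removes that effect.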

3.5 Attorney Consensus Intensity

The branching design also allows us to measure where attorneys substantively converge. We define consensus clusters within the same scenario and turn by grouping rubric items that address the same contract section or clause family and move in the same general redlining direction. Each cluster receives a Substantive Consensus Intensity score based on attorney recurrence, rubric weight, and directional consistency.

The Substantive Consensus Intensity heatmap shows the strongest consensus among attorneys when they are initiating opening moves on a clean template. For example, high-consensus clusters appear around Section 6.1, Fees, in Scenario 2, Turn 1 and Section 15.12, Limitation of Liability, in Scenario 3, Turn 1, where multiple attorneys independently identified the same section and redlining direction as material to the initial negotiation posture.

Consensus becomes more diffuse in later turns as attorneys respond to counterparty redlines and make context-specific decisions about concessions, tradeoffs, and deal strategy. In Scenario 1, for example, Exhibit A, the service level agreement, becomes heavily negotiated around uptime commitments, but attorney rubrics show less agreement on whether that issue is central to the turn-level decision.

The movement from stronger early-turn consensus to more diffuse later-turn judgment reflects the context density of redlining work. The task is not only to identify contract issues, but to determine whether, when, and how those issues matter in an evolving negotiation. By grounding evaluation in that complexity, the benchmark tests whether model outputs are useful in realistic redlining workflows, not merely whether they can spot isolated issues.

FIG. 1 Each cell shows the cluster intensity (attorney recurrence × rubric weight × directional consistency) for that (scenario, section, turn). The higher the value, the more attorneys converged on the same redline.
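As a rough illustration of the product in the FIG. 1 caption, the sketch below scores one cluster. The source does not define the three factors precisely, so everything here is an assumption: recurrence is taken as the number of distinct attorneys in the cluster, rubric weight as the mean item weight, and directional consistency as the share of items that take the most common direction. The `RubricItem` fields and direction labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RubricItem:
    attorney: str     # hypothetical attorney identifier
    direction: str    # hypothetical direction label, e.g. "raise_cap"
    weight: float     # rubric weight assigned by the attorney


def consensus_intensity(items):
    """Assumed form: recurrence x mean weight x directional consistency."""
    if not items:
        return 0.0
    recurrence = len({it.attorney for it in items})
    mean_weight = sum(it.weight for it in items) / len(items)
    # Share of items that agree with the most common redlining direction.
    top_count = Counter(it.direction for it in items).most_common(1)[0][1]
    consistency = top_count / len(items)
    return recurrence * mean_weight * consistency


cluster = [
    RubricItem("attorney_a", "raise_cap", 3.0),
    RubricItem("attorney_b", "raise_cap", 2.0),
    RubricItem("attorney_c", "lower_cap", 1.0),
]
print(consensus_intensity(cluster))  # roughly 4.0 (3 attorneys x 2.0 x 2/3)
```

Under these assumptions a cluster scores high only when several attorneys raise the issue, weight it heavily, and push in the same direction. Dropping any one factor lowers the product.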

4. Agent's Environment

The table below lists each component of the agent's environment and its role.

No.	Component	Role
1	/app/contract.docx	Both input and output. Turn 1 = clean template; turn N >= 2 = counterparty's previous-turn docx. Edited in place via the scripts; Harbor's verifier reads the same path after the agent exits.
2	/app/grounding/	Per-task playbook for the agent's side + scenario brief. The counterparty's playbook is deliberately absent.
3	SKILL.md + four scripts	Party- and turn-agnostic tool surface, invoked from the shell.
4	instruction.md	The system prompt, baked per-task with representation block ("You are AgentCo Legal, Side A") and turn framing.

FIG. 2 Agent's Sandbox.

5. Evaluation Methodology

We use two evaluation modes:

No.	Mode	Description
1	Turn-level	Turn-level evaluation measures model performance on attorney-anchored negotiation turns. The model receives a contract state with attorney-produced redlines and returns a redlined .docx. We convert that output into markdown tied to the model's document edits, so LLM judges can score it consistently. A three-LLM-judge panel scores the markdown against attorney-authored rubrics for the golden response. Each rubric item receives a pass/fail score by majority vote. We report results by overall performance, party side, turn, evaluation dimension, and substantive consensus intensity.
2	Behavioral	Behavioral evaluation measures how models redline by comparing model redlined .docx files against attorney-authored golden response .docx files.
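The majority-vote step in turn-level evaluation can be sketched as follows. The function names and the dictionary layout are hypothetical. Only the rule itself (three judges, pass/fail by majority) comes from the description above.

```python
def majority_pass(votes):
    """Return True when more than half of the judge votes are passes."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid ties")
    return sum(bool(v) for v in votes) > len(votes) / 2


def score_rubrics(judge_votes):
    """Map each rubric id to its majority verdict.

    `judge_votes` is a dict of rubric_id -> list of boolean votes,
    one vote per judge (layout assumed for this sketch).
    """
    return {rid: majority_pass(v) for rid, v in judge_votes.items()}


panel = {
    "rubric_01": [True, True, False],   # two of three judges pass it
    "rubric_02": [False, True, False],  # one of three judges passes it
}
print(score_rubrics(panel))  # {'rubric_01': True, 'rubric_02': False}
```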

No LLM judges are used for behavioral evaluation. All behavioral metrics are computed directly from the redlined .docx files. Scores were calculated by averaging the results from three model rollouts on the benchmark, with the exception of Claude Fable 5, which has only one rollout because of access constraints.
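To make "computed directly from the redlined .docx files" concrete, here is one way tracked changes could be read without any LLM. It is a sketch under assumptions, not the benchmark's extraction code: it relies only on the fact that a .docx is a zip archive whose `word/document.xml` marks tracked insertions with `w:ins` and tracked deletions with `w:del`. The helper name and the in-memory sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tracked_changes(docx_bytes):
    """Return (kind, author, text) tuples: all insertions, then all deletions."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    changes = []
    # Inserted text lives in w:t runs, deleted text in w:delText runs.
    for kind, tag, text_tag in (("insert", "ins", "t"), ("delete", "del", "delText")):
        for el in root.iter(f"{{{W}}}{tag}"):
            text = "".join(t.text or "" for t in el.iter(f"{{{W}}}{text_tag}"))
            changes.append((kind, el.get(f"{{{W}}}author"), text))
    return changes


# Minimal in-memory sample: "30" deleted and "45" inserted by one author.
sample_xml = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t>Fees are due in </w:t></w:r>'
    '<w:del w:author="Side A"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:author="Side A"><w:r><w:t>45</w:t></w:r></w:ins>'
    '<w:r><w:t> days.</w:t></w:r></w:p></w:body></w:document>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", sample_xml)

print(tracked_changes(buf.getvalue()))
# [('insert', 'Side A', '45'), ('delete', 'Side A', '30')]
```

The sample archive contains only the one XML part the parser reads, so it is not a complete Word document. A real pipeline would also need to handle changes in headers, footnotes, and paragraph marks.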

6. Turn-level Findings

6.1 Overall score

GPT-5.5 ranks first on the turn-weighted, cross-scenario score at 50.5%, followed by Claude Fable 5* at 47.3%, Gemini 3.5 Flash at 45.1%, and Claude Opus 4.8 at 44.4%. The narrow spread suggests that GPT-5.5 performs marginally better overall, but that the benchmark is similarly challenging for all models, with no model separating decisively from the field.

FIG. 3 The 12 (scenario × turn) cells are averaged equally so later turns don't dominate the headline.

6.2 Score by side

All of the models share a weakness on the negotiating side that requires more complex commercial judgment.

Every model scores lower on Side A, the AgentCo vendor-side role, than on Side B, the large enterprise buyer role represented by LargeCo or GiantCo. The gap is smaller for GPT-5.5 and Gemini 3.5 Flash, at −4.7 and −4.3 points (48.2% vs. 52.9% and 43.0% vs. 47.3%), and larger for Claude Fable 5 and Claude Opus 4.8, at −8.1 and −8.4 points (43.2% vs. 51.3% and 40.3% vs. 48.7%). The model results align with the real-world difficulty of vendor-side redlining, where the attorney's objectives are often in tension: protect the company's playbook position, negotiate from a weaker leverage point, and avoid redlines that could slow or jeopardize deal closure.

Model	AgentCo (Side A)	Customer (Side B)	A - B
GPT-5.5	48.2%	52.9%	-4.7 pts
Claude Fable 5*	43.2%	51.3%	-8.1 pts
Gemini 3.5 Flash	43.0%	47.3%	-4.3 pts
Claude Opus 4.8	40.3%	48.7%	-8.4 pts

FIG. 4 Score by side, comparing AgentCo (Side A) against LargeCo / GiantCo (Side B).

6.3 Score by turn

The models struggle the most with opening redline strategy.

The turn-level score chart shows that Turn 1 is the lowest-scoring stage for every model: GPT-5.5 scores 30.3%, Claude Fable 5 22.6%, Gemini 3.5 Flash 21.9%, and Claude Opus 4.8 17.9%. The earlier Substantive Consensus Intensity analysis shows that Turn 1 is also where attorney-authored rubrics have the strongest consensus around key opening moves. Models therefore perform worst at the stage where senior attorneys most consistently agree on what matters. Scores rise sharply in later turns, clustering mostly in the 50%-60% range once the negotiation record has developed. This suggests that models are better at responding within an established negotiation context than independently identifying, prioritizing, and executing the initial redline strategy.

Model	Turn 1	Turn 2	Turn 3	Turn 4
GPT-5.5	30.3%	55.3%	58.5%	58.3%
Claude Fable 5*	22.6%	54.8%	53.7%	58.8%
Gemini 3.5 Flash	21.9%	50.7%	55.7%	52.2%
Claude Opus 4.8	17.9%	50.0%	51.4%	58.8%

FIG. 5 Turn-level scores by model, pooled across scenarios.

6.4 Score v. substantive consensus intensity

The consensus-intensity curve reinforces the turn-level finding that models struggle with the issues attorneys most consistently identify as important.

The consensus-intensity curve shows that higher attorney consensus does not translate into higher model performance. At low and mid levels of consensus intensity, scores fluctuate substantially across models, but as consensus intensity rises, model scores tend to compress and drift downward into the 30% to 40% range. This suggests that models are not merely losing points on idiosyncratic attorney preferences; they also struggle on issues where attorneys more consistently agree. In other words, high-consensus rubric items are often not “easy” issues. They may reflect the most material redlining decisions, where attorneys agree the issue matters but models still fail to execute the judgment required to address it correctly.

FIG. 6 Each rubric item is joined to the consensus intensity of its (scenario, section, turn) cluster. Each line shows the rolling weighted pass rate as the rubric set is swept from low to high consensus.
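The rolling curve in FIG. 6 can be approximated with a fixed-size sliding window over rubric items sorted by consensus intensity. The window size, the tuple layout, and the choice to plot each window at its middle item's intensity are all assumptions for this sketch. The source does not specify them.

```python
def rolling_weighted_pass_rate(rubrics, window):
    """Sweep rubric items from low to high consensus and score each window.

    `rubrics` is a list of (consensus_intensity, weight, passed) tuples
    with positive weights (layout assumed). Returns a list of
    (intensity_at_window_middle, weighted_pass_rate) points.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    ordered = sorted(rubrics, key=lambda r: r[0])
    curve = []
    for end in range(window, len(ordered) + 1):
        chunk = ordered[end - window:end]
        total_weight = sum(w for _, w, _ in chunk)
        passed_weight = sum(w for _, w, p in chunk if p)
        middle_intensity = chunk[len(chunk) // 2][0]
        curve.append((middle_intensity, passed_weight / total_weight))
    return curve


toy_rubrics = [
    (0.1, 1.0, True),
    (0.5, 2.0, False),
    (0.9, 1.0, True),
    (1.5, 2.0, False),
]
for intensity, rate in rolling_weighted_pass_rate(toy_rubrics, window=3):
    print(f"{intensity:.1f} -> {rate:.2f}")
```

With the toy data the two windows give pass rates of 0.50 and 0.20, which mirrors the downward drift at higher consensus described above. The numbers themselves are invented.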

6.5 Score by evaluation dimension

The evaluation-dimension breakdown reinforces the turn-level finding that models perform better once the negotiation issues are already defined.

Scores are highest on deal-closing orientation, where rubrics often assess whether a model appropriately accepts, rejects, or preserves an existing redline in light of deal momentum. All four models score above 80% on this dimension: Claude Opus 4.8 reaches 86.2%, GPT-5.5 84.4%, Claude Fable 5 83.2%, and Gemini 3.5 Flash 82.5%.

This result should not be read to mean that models are broadly strong at closing negotiations end to end. Rather, models perform well on a narrower form of deal-closing judgment: when forced into a binary choice between accepting or rejecting a counterparty's redlines, they show an overwhelming bias toward acceptance.

Model	Legal	Commercial	Negotiation	Counterparty	Deal-closing
GPT-5.5	49.0%	49.9%	50.9%	51.0%	84.4%
Claude Fable 5*	44.9%	47.0%	45.2%	48.3%	83.2%
Gemini 3.5 Flash	45.2%	51.5%	45.0%	57.1%	82.5%
Claude Opus 4.8	44.2%	44.7%	41.2%	45.4%	86.2%

FIG. 7 Weighted pass rate per evaluation dimension, pooled across every trial and weighted by rubric weight.

Acceptance Bias — why deal-closing scores cluster so high

Once a counterparty has redlined, rubric criteria fork into two kinds of judgment call:

“Accepts X” rubrics. Expected behavior: keep the counterparty's edit.
“Rejects X” rubrics. Expected behavior: push back on the counterparty's edit.

If models are trained toward agreement (“yes-man” behavior), we would expect them to pass “Accepts” rubrics more reliably than “Rejects” rubrics, because they accept both what they should accept and what they should reject. Only Scenario 3 explicitly instructs the model toward agreement (must-win deal), so the bias should be larger there by design. Scenarios 1 and 2 are the natural test case.

The bias is large and consistent. Every model passes 80-99% of accept-rubrics across every turn, but only 6-50% of reject-rubrics. In Scenarios 1 and 2, where there's no explicit agreement incentive, this gap shows the “yes-man” posture: models accept counterparty positions they should be pushing back on. In Scenario 3, the gap is even larger because accepting more freely is rewarded by the scenario instruction. Across both scenario groups, GPT-5.5 has the smallest gaps, staying closest to attorney-style pushback patterns. Claude Opus 4.8 has the largest gaps, passing 98% of accept-rubrics but only 6.6% of reject-rubrics in Scenario 3, Turn 2.
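The accept/reject gap is simply the difference between two pass rates. The sketch below assumes each rubric outcome has already been labeled as an "accept" or "reject" rubric. The labels, tuple layout, and toy outcomes are hypothetical, and the pass rates are unweighted for simplicity.

```python
def acceptance_gap(outcomes):
    """Return (accept_pass_rate, reject_pass_rate, gap).

    `outcomes` is a list of (kind, passed) tuples, where kind is
    "accept" or "reject" (layout assumed for this sketch).
    """
    rates = {}
    for kind in ("accept", "reject"):
        hits = [passed for k, passed in outcomes if k == kind]
        if not hits:
            raise ValueError(f"no {kind} rubrics in outcomes")
        rates[kind] = sum(hits) / len(hits)
    return rates["accept"], rates["reject"], rates["accept"] - rates["reject"]


# Toy outcomes: every accept-rubric passes, one of four reject-rubrics passes.
toy_outcomes = (
    [("accept", True)] * 4
    + [("reject", True)]
    + [("reject", False)] * 3
)
print(acceptance_gap(toy_outcomes))  # (1.0, 0.25, 0.75)
```

A model that accepted everything would score 100% on accept-rubrics and 0% on reject-rubrics, so a large positive gap is the signature of over-acceptance rather than of good closing judgment.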

7. Surgicalness

Attorneys redline more surgically than the models, changing only the phrases and words necessary to accomplish their goals. To measure model surgicalness, we use two indicators: the form of the edit and the size of the edit.

The first indicator compares inline edits, which make targeted changes within existing text, against block edits, which replace larger units of drafting. Attorney redlines are nearly evenly split, with 48.6% inline edits and 51.4% block edits.

Every model relies more heavily on block edits. GPT-5.5 is the least surgical by this measure, with 81.0% block edits and 19.0% inline edits. Gemini 3.5 Flash and Claude Opus 4.8 fall in between, at 72.8% and 68.2% block edits, respectively. Claude Fable 5 comes closest to the attorney baseline at 62.3% block edits, but that share is still about eleven percentage points above the attorney rate of 51.4%.

Actor	Inline share	Block share
Expert (attorneys)	48.6%	51.4%
GPT-5.5	19.0%	81.0%
Claude Fable 5*	37.7%	62.3%
Gemini 3.5 Flash	27.2%	72.8%
Claude Opus 4.8	31.8%	68.2%

Indicator 1: fraction of tracked-change events that are inline (touches less than 30% of paragraph) vs. block (greater than or equal to 30% of paragraph). Pooled across tasks per actor.
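The 30% rule for Indicator 1 can be written as a small classifier. How "touches" is measured is not stated, so this sketch assumes it is the number of changed characters divided by the paragraph's character length. The function names and toy numbers are hypothetical.

```python
BLOCK_THRESHOLD = 0.30  # from the indicator definition: >= 30% of paragraph


def classify_edit(changed_chars, paragraph_chars):
    """Label one tracked-change event as "inline" or "block"."""
    if paragraph_chars <= 0:
        raise ValueError("paragraph length must be positive")
    touched = changed_chars / paragraph_chars
    return "block" if touched >= BLOCK_THRESHOLD else "inline"


def inline_share(edits):
    """Fraction of edits labeled inline; `edits` is a list of
    (changed_chars, paragraph_chars) pairs pooled across tasks."""
    if not edits:
        raise ValueError("no edits to classify")
    labels = [classify_edit(c, p) for c, p in edits]
    return labels.count("inline") / len(labels)


toy_edits = [(10, 100), (50, 100), (5, 200), (90, 100)]
print(inline_share(toy_edits))  # 0.5
```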

The second indicator measures edit size by comparing edits per touched paragraph and average characters per edit. Attorneys make more edits per touched paragraph than any model, averaging 3.10 edits, but each edit is much shorter, averaging 101 characters. Models make fewer edits per touched paragraph, between 1.06 and 1.20, while producing substantially longer edits, from 318 to 518 characters on average. Claude Fable 5 is the least verbose model by average edit length at 318 characters, while Gemini 3.5 Flash is the most verbose at 518.1 characters, followed by GPT-5.5 at 509.9 and Claude Opus 4.8 at 464.1.

Actor	Redlines / task	Edits / touched paragraph	Avg edit length (chars)
Expert (attorneys)	29.5	3.10	101
GPT-5.5	27.0	1.11	509.9
Claude Fable 5*	21.6	1.20	318
Gemini 3.5 Flash	15.1	1.06	518.1
Claude Opus 4.8	10.8	1.06	464.1

Indicator 2: edit frequency and size per edit.
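Indicator 2 reduces to two ratios. The sketch below assumes each tracked change is available as a hypothetical `(paragraph_id, edit_text)` pair and that edit length is the character count of the changed text. The source does not say whether deleted text counts toward length.

```python
def edit_size_stats(edits):
    """Return (edits_per_touched_paragraph, avg_chars_per_edit).

    `edits` is a list of (paragraph_id, edit_text) pairs
    (layout assumed for this sketch).
    """
    if not edits:
        raise ValueError("no edits to summarize")
    touched_paragraphs = {pid for pid, _ in edits}
    edits_per_touched = len(edits) / len(touched_paragraphs)
    avg_chars = sum(len(text) for _, text in edits) / len(edits)
    return edits_per_touched, avg_chars


# Toy redline: two short edits in one paragraph, one in another.
toy_redline = [
    ("p1", "thirty (30)"),
    ("p1", "forty-five (45)"),
    ("p2", "net"),
]
per_paragraph, avg_chars = edit_size_stats(toy_redline)
print(f"{per_paragraph:.2f} edits per touched paragraph, {avg_chars:.1f} chars per edit")
```

Several short edits inside one paragraph push the first ratio up and the second down, which is the attorney pattern in the table above. A single wholesale rewrite of the paragraph does the opposite.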

8. Implications for Training Legal AI Agents

Legal AI agents for contract redlining should be trained as negotiation participants, not drafting assistants. A useful agent must decide which issues matter, how aggressively to pursue them, when to concede, and how to preserve deal momentum across turns.

The strongest training signal is the need for better issue prioritization. Models performed worst when initiating redlines on a clean template, even where attorneys showed strong consensus about what mattered. Future training regimes must reward strategic issue selection, not just accurate edits.

Over-acceptance bias compounds the problem. Models don't just struggle to raise the right issues. They also fail to hold the line once challenged. Training must instill principled pushback and side-specific commercial reasoning, not just compliance.

The benchmark also highlights the importance of surgical drafting. Models tend to make fewer, larger edits than attorneys. The tendency toward wholesale block edits appears linked to a lack of issue prioritization, though further research is needed to confirm this connection. If the link holds, surgical drafting will likely improve as models develop stronger issue prioritization. Training should also emphasize smaller, more precise interventions.

More broadly, this research shows why legal AI evaluation must move beyond static clause review. In real redlining, usefulness depends on whether an output advances the negotiation in context. Multi-turn, document-native benchmarks offer a more realistic way to measure whether legal AI can support the judgment-dense work of commercial contract negotiation.

*We were unable to generate multiple rollouts for Fable 5 given the access constraints.

Access the public dataset

Crosby-micro1 RedlineBench