Crosby-micro1 RedlineBench
The standard for evaluating AI systems on contract redlining and SaaS negotiation
1. Why we built this
Contract redlining is judgment-dense and strategically complex. It is closer to poker than to math. There are early game moves, countermoves, tradeoffs, and end games, all undertaken with incomplete information. Party leverage, business priorities, counterparty tolerance, and the value of each move are often uncertain. There is rarely a single right move.
That creates two challenges for benchmark design. First, the benchmark has to reflect the complexity of the workflow itself. A useful redline is not just a legally correct clause edit. It is a move in a negotiation, shaped by deal context, party posture, timing, and the need to preserve momentum toward execution.
Second, the judgment behind a strong redline is often only implicit in the attorney's work product. A golden response may show what an attorney changed, but not why the issue mattered or how the attorney weighed it against the rest of the negotiation. A useful benchmark therefore has to preserve attorney output in its native form while also converting the most important redlining judgments into structured evaluation criteria.
The Crosby-micro1 RedlineBench is designed around those challenges. It uses multi-turn SaaS MSA negotiations, document-native redlines, attorney-authored golden responses, and rubrics tied to the decisions attorneys considered most important at each stage. By collecting data in the real workflow of contract negotiation, the benchmark evaluates models as negotiation participants, not merely drafting assistants.
2. Summary of Findings
GPT-5.5 has the highest overall turn-weighted rubric score, but the spread across models is narrow, suggesting that the benchmark remains challenging across the frontier model set.
3. Designing RedlineBench
3.1 Simulating Multi-turn SaaS MSA Negotiations
The benchmark is structured as a multi-turn simulation rather than a set of isolated redlining tasks. Each negotiation proceeds through four alternating attorney turns, requiring each side to respond to the evolving contract, prior counterparty redlines, and its own legal and commercial objectives. This design allows attorneys to develop granular rubrics that capture how positions shift, tradeoffs are managed, and deal strategy develops over the course of a negotiation, rather than offering only a static view of redlining judgment.
The SaaS MSA scenarios operationalize this design through three simulated technology transactions involving AgentCo, a Series A HR technology company offering an AI-powered product called TalentFlow, and larger enterprise counterparties. The scenarios share a common commercial foundation but vary the initiating document, negotiating posture, and deal stakes to test how redlining decisions change under different transaction conditions.
Across these scenarios and turns, the attorney-authored redlines and corresponding rubrics create the basis for evaluating model outputs against the legal, commercial, and strategic judgments reflected in the simulated negotiations.
3.2 Evaluation Dimensions
The five evaluation dimensions provide the organizing framework for scoring model-generated redline outputs using attorney-authored rubrics. Each rubric is designed to correspond to the attorney-authored golden response redline, translating that response into evaluation criteria for model outputs. The dimensions reflect the core considerations in contract redlining: legal correctness, adherence to commercial context, negotiation quality, counterparty acceptance prediction, and deal-closing orientation. Together, they allow the benchmark to evaluate model behavior with greater granularity and identify specific failure modes across key aspects of the redlining task.
3.3 Rubric Distribution
The simulated negotiations produced a rubric dataset consisting of criteria authored by participating attorneys as they worked through each negotiation turn. Because the rubrics are tied to the redlines attorneys identified as most important at the moment of decision, the dataset reflects how senior practitioners framed, weighted, and categorized the key legal, commercial, and strategic issues arising throughout the benchmark.
Rubrics are distributed across evaluation dimension, negotiation turn, party side, and scenario:
Overall, the distribution shows coverage across the benchmark's core evaluation dimensions and negotiation conditions, while preserving the natural variation that arises across scenarios, party positions, and stages of negotiation.
3.4 Branching Design and Turn-Weighted Scoring
The benchmark uses a branching design to reduce the influence of any single attorney's redlining preferences. Multiple attorneys independently initiate positions in each scenario and later turns branch as additional attorneys respond to prior redlines. This creates multiple attorney-authored rubric sets for the same scenario and stage of negotiation, allowing model performance to be evaluated across a range of practitioner judgments rather than against a single negotiation path.
Scores are reported on a turn-weighted basis. This prevents turns with more rubric items from having an outsized effect on aggregate results and accounts for the fact that later turns often involve fewer open issues as the negotiation narrows. Together, the branching design and turn-weighted scoring make the evaluation less dependent on individual attorney subjectivity or the uneven distribution of rubric items across negotiation stages.
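A minimal sketch of turn-weighted aggregation, under stated assumptions: each (scenario, turn) cell is first reduced to a pass rate, and the cells are then averaged equally. Weighting items by rubric weight inside a cell, the tuple layout, and the function name `turn_weighted_score` are illustrative choices, not the benchmark's published implementation.

```python
from collections import defaultdict


def turn_weighted_score(results):
    """Average per-cell pass rates so every (scenario, turn) cell counts equally.

    results: iterable of (scenario, turn, rubric_weight, passed) tuples.
    Within a cell, items are weighted by rubric weight (an assumption).
    """
    earned = defaultdict(float)
    possible = defaultdict(float)
    for scenario, turn, weight, passed in results:
        cell = (scenario, turn)
        possible[cell] += weight
        if passed:
            earned[cell] += weight
    cell_scores = [earned[c] / possible[c] for c in possible if possible[c] > 0]
    if not cell_scores:
        raise ValueError("no scorable cells")
    return sum(cell_scores) / len(cell_scores)


# Hypothetical data: Turn 1 has four rubric items, Turn 2 has one.
example = [
    ("S1", 1, 1.0, True),
    ("S1", 1, 1.0, False),
    ("S1", 1, 1.0, False),
    ("S1", 1, 1.0, False),
    ("S1", 2, 1.0, True),
]
score = turn_weighted_score(example)                # (0.25 + 1.0) / 2 = 0.625
pooled = sum(p for *_, p in example) / len(example)  # 2 / 5 = 0.4
print(score, pooled)
```

In this toy case the pooled pass rate is pulled toward the item-heavy Turn 1, while the turn-weighted score treats both turns equally.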
3.5 Attorney Consensus Intensity
The branching design also allows us to measure where attorneys substantively converge. We define consensus clusters within the same scenario and turn by grouping rubric items that address the same contract section or clause family and move in the same general redlining direction. Each cluster receives a Substantive Consensus Intensity score based on attorney recurrence, rubric weight, and directional consistency.
The Substantive Consensus Intensity heatmap shows the strongest consensus among attorneys when they are initiating opening moves on a clean template. For example, high-consensus clusters appear around Section 6.1, Fees, in Scenario 2, Turn 1, and Section 15.12, Limitation of Liability, in Scenario 3, Turn 1, where multiple attorneys independently identified the same section and redlining direction as material to the initial negotiation posture.
Consensus becomes more diffuse in later turns as attorneys respond to counterparty redlines and make context-specific decisions about concessions, tradeoffs, and deal strategy. In Scenario 1, for example, Exhibit A, the service level agreement, becomes heavily negotiated around uptime commitments, but attorney rubrics show less agreement on whether that issue is central to the turn-level decision.
The movement from stronger early-turn consensus to more diffuse later-turn judgment reflects the context density of redlining work. The task is not only to identify contract issues, but to determine whether, when, and how those issues matter in an evolving negotiation. By grounding evaluation in that complexity, the benchmark tests whether model outputs are useful in realistic redlining workflows, not merely whether they can spot isolated issues.
FIG. 1 Each cell shows the cluster intensity (attorney recurrence × rubric weight × directional consistency) for a given (scenario, section, turn). The higher the value, the more strongly attorneys converged on the same redline.
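The product of attorney recurrence, rubric weight, and directional consistency can be sketched as follows. How each factor is measured is not specified in the text, so this sketch assumes recurrence is the number of distinct attorneys in the cluster, rubric weight is the mean item weight, and directional consistency is the share of items that agree with the majority direction. `RubricItem`, its fields, and the direction labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RubricItem:
    attorney: str
    section: str
    direction: str  # hypothetical label, e.g. "raise_cap" or "lower_cap"
    weight: float


def consensus_intensity(items):
    """Recurrence x mean weight x directional consistency for one cluster.

    All three factor definitions are assumptions made for illustration.
    """
    if not items:
        return 0.0
    recurrence = len({item.attorney for item in items})
    mean_weight = sum(item.weight for item in items) / len(items)
    majority_count = Counter(item.direction for item in items).most_common(1)[0][1]
    consistency = majority_count / len(items)
    return recurrence * mean_weight * consistency


# Hypothetical cluster: three attorneys address the same section,
# two of them pushing in the same direction.
cluster = [
    RubricItem("attorney_1", "15.12", "raise_cap", 2.0),
    RubricItem("attorney_2", "15.12", "raise_cap", 4.0),
    RubricItem("attorney_3", "15.12", "lower_cap", 3.0),
]
intensity = consensus_intensity(cluster)  # 3 attorneys * 3.0 mean weight * 2/3
print(round(intensity, 2))
```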
4. Agent's Environment
Each component of the agent’s environment and its role
Agent's Sandbox:
FIG. 2 Agent's Sandbox.
5. Evaluation Methodology
We use two evaluation modes:
No LLM judges are used for behavioral-level evaluation. All metrics in those sections are computed directly from the redlined .docx files. Scores are averaged over three model rollouts on the benchmark, except for Claude Fable 5, which has only one rollout because of access constraints.
6. Turn-level Findings
6.1 Overall score
GPT-5.5 ranks first on the turn-weighted, cross-scenario score at 50.5%, followed by Claude Fable 5* at 47.3%, Gemini 3.5 Flash at 45.1%, and Claude Opus 4.8 at 44.4%. The narrow spread suggests that GPT-5.5 performs marginally better overall, but that the benchmark is similarly challenging for all models, with no model separating decisively from the field.
FIG. 3 The 12 (scenario × turn) cells are averaged equally so later turns don't dominate the headline.
6.2 Score by side
All of the models share a weakness on the negotiating side that requires more complex commercial judgment.
Every model scores lower on Side A, the AgentCo vendor-side role, than on Side B, the large enterprise buyer role represented by LargeCo or GiantCo. The gap is smaller for GPT-5.5 and Gemini 3.5 Flash, at −4.7 and −4.3 points (48.2% vs. 52.9% and 43.0% vs. 47.3%), and larger for Claude Fable 5 and Claude Opus 4.8, at −8.1 and −8.4 points (43.2% vs. 51.3% and 40.3% vs. 48.7%). The model results align with the real-world difficulty of vendor-side redlining, where the attorney's objectives are often in tension: protect the company's playbook position, negotiate from a weaker leverage point, and avoid redlines that could slow or jeopardize deal closure.
FIG. 4 Score by side, comparing AgentCo (Side A) against LargeCo / GiantCo (Side B).
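The side gaps quoted in this section follow directly from the reported side-level scores. A quick arithmetic check, using only the percentages stated above (the dictionary layout and variable names are illustrative):

```python
# (Side A score, Side B score) in percent, as reported in Section 6.2.
side_scores = {
    "GPT-5.5": (48.2, 52.9),
    "Gemini 3.5 Flash": (43.0, 47.3),
    "Claude Fable 5": (43.2, 51.3),
    "Claude Opus 4.8": (40.3, 48.7),
}

# Gap = Side A minus Side B, in percentage points, rounded to one decimal.
gaps = {model: round(a - b, 1) for model, (a, b) in side_scores.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f} points")
```

Every gap is negative, which matches the observation that all four models score lower in the vendor-side role.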
6.3 Score by turn
The models struggle the most with opening redline strategy.
The turn-level score chart shows that Turn 1 is the lowest-scoring stage for every model: GPT-5.5 scores 30.3%, Claude Fable 5 22.6%, Gemini 3.5 Flash 21.9%, and Claude Opus 4.8 17.9%. The earlier Substantive Consensus Intensity analysis shows that Turn 1 is also where attorney-authored rubrics have the strongest consensus around key opening moves. Models therefore perform worst at the stage where senior attorneys most consistently agree on what matters. Scores rise sharply in later turns, clustering mostly in the 50%-60% range once the negotiation record has developed. This suggests that models are better at responding within an established negotiation context than independently identifying, prioritizing, and executing the initial redline strategy.
FIG. 5 Turn-level scores by model, pooled across scenarios.
6.4 Score v. substantive consensus intensity
The consensus-intensity curve reinforces the turn-level finding that models struggle with the issues attorneys most consistently identify as important.
The consensus-intensity curve shows that higher attorney consensus does not translate into higher model performance. At low and mid levels of consensus intensity, scores fluctuate substantially across models, but as consensus intensity rises, model scores tend to compress and drift downward into the 30% to 40% range. This suggests that models are not merely losing points on idiosyncratic attorney preferences; they also struggle on issues where attorneys more consistently agree. In other words, high-consensus rubric items are often not “easy” issues. They may reflect the most material redlining decisions, where attorneys agree the issue matters but models still fail to execute the judgment required to address it correctly.
FIG. 6 Each rubric is joined to the consensus intensity of its (scenario, section, turn) cluster. Lines show the rolling weighted pass rate as the rubric set is swept from low to high consensus.
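One way to compute a rolling weighted pass rate of this kind is sketched below. The window size, the use of the window's mean intensity as the x-coordinate, and the tuple layout are assumptions; the text does not state how the curve was smoothed.

```python
def rolling_weighted_pass_rate(rubrics, window=3):
    """Sweep rubrics from low to high consensus and score each window.

    rubrics: list of (consensus_intensity, rubric_weight, passed) tuples,
             with positive weights.
    Returns a list of (mean_intensity_in_window, weighted_pass_rate).
    """
    ordered = sorted(rubrics, key=lambda r: r[0])
    curve = []
    for end in range(window, len(ordered) + 1):
        chunk = ordered[end - window:end]
        total = sum(weight for _, weight, _ in chunk)
        earned = sum(weight for _, weight, passed in chunk if passed)
        center = sum(intensity for intensity, _, _ in chunk) / window
        curve.append((center, earned / total))
    return curve


# Hypothetical rubric outcomes, deliberately unsorted.
rubrics = [
    (3.0, 2.0, False),
    (1.0, 1.0, True),
    (4.0, 1.0, False),
    (2.0, 1.0, True),
]
curve = rolling_weighted_pass_rate(rubrics, window=2)
for center, rate in curve:
    print(f"intensity ~{center:.1f}: pass rate {rate:.2f}")
```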
6.5 Score by evaluation dimension
The evaluation-dimension breakdown reinforces the turn-level finding that models perform better once the negotiation issues are already defined.
Scores are highest on deal-closing orientation, where rubrics often assess whether a model appropriately accepts, rejects, or preserves an existing redline in light of deal momentum. All four models score above 80% on this dimension: Claude Opus 4.8 reaches 86.2%, GPT-5.5 84.4%, Claude Fable 5 83.2%, and Gemini 3.5 Flash 82.5%.
This result should not be read to mean that models are broadly strong at closing negotiations end to end. Rather, models perform well on a narrower form of deal-closing judgment: when forced into a binary choice between accepting or rejecting a counterparty's redlines, they show an overwhelming bias toward acceptance.
FIG. 7 Weighted pass rate per evaluation dimension, pooled across every trial and weighted by rubric weight.
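A weighted pass rate per dimension, pooled across trials, can be sketched as the sum of passed rubric weight divided by the sum of all rubric weight in that dimension. The row layout, the dimension keys, and the function name are hypothetical.

```python
from collections import defaultdict


def weighted_pass_rate_by_dimension(rows):
    """rows: iterable of (dimension, rubric_weight, passed), pooled over trials."""
    earned = defaultdict(float)
    possible = defaultdict(float)
    for dimension, weight, passed in rows:
        possible[dimension] += weight
        if passed:
            earned[dimension] += weight
    return {d: earned[d] / possible[d] for d in possible if possible[d] > 0}


# Hypothetical rubric outcomes for two of the five dimensions.
rows = [
    ("deal_closing_orientation", 3.0, True),
    ("deal_closing_orientation", 1.0, False),
    ("legal_correctness", 2.0, True),
    ("legal_correctness", 2.0, False),
]
rates = weighted_pass_rate_by_dimension(rows)
print(rates)
```

Unlike the turn-weighted headline score, this pooling lets heavier rubric items and item-rich turns contribute more to each dimension's rate.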
Acceptance Bias — why deal-closing scores cluster so high
Once a counterparty has redlined, rubric criteria fork into two kinds of judgment call:
- “Accepts X” rubrics. Expected behavior: keep the counterparty's edit.
- “Rejects X” rubrics. Expected behavior: push back on the counterparty's edit.
If models are trained toward agreement (“yes-man” behavior), we'd see them pass “Accepts” rubrics more reliably than “Rejects” rubrics, accepting both what they should accept and what they should reject. Only Scenario 3 explicitly instructs the model toward agreement (must-win deal), so the bias should be larger there by design. Scenarios 1 and 2 are the natural test case.
The bias is large and consistent. Every model passes 80-99% of accept-rubrics across every turn, but only 6-50% of reject-rubrics. In Scenarios 1 and 2, where there's no explicit agreement incentive, this gap shows the “yes-man” posture: models accept counterparty positions they should be pushing back on. In Scenario 3, the gap is even larger because accepting more freely is rewarded by the scenario instruction. Across both panels, GPT-5.5 has the smallest gaps, staying closest to attorney-style pushback patterns. Claude Opus 4.8 has the largest gaps, passing 98% of accept-rubrics but only 6.6% of reject-rubrics in Scenario 3, Turn 2.
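The accept/reject gap can be sketched as the difference between two pass rates. This version is unweighted and uses a hypothetical `(kind, passed)` layout; the benchmark may weight rubric items differently.

```python
def acceptance_gap(outcomes):
    """Return (accept_rate, reject_rate, gap) for one model in one setting.

    outcomes: list of (kind, passed), where kind is "accept" or "reject"
              depending on the behavior the rubric expects.
    """
    def pass_rate(kind):
        hits = [passed for k, passed in outcomes if k == kind]
        if not hits:
            raise ValueError(f"no {kind!r} rubrics in outcomes")
        return sum(hits) / len(hits)

    accept_rate = pass_rate("accept")
    reject_rate = pass_rate("reject")
    return accept_rate, reject_rate, accept_rate - reject_rate


# Hypothetical outcomes: 9 of 10 accept-rubrics pass, 2 of 10 reject-rubrics pass.
outcomes = (
    [("accept", True)] * 9 + [("accept", False)]
    + [("reject", True)] * 2 + [("reject", False)] * 8
)
accept_rate, reject_rate, gap = acceptance_gap(outcomes)
print(f"accept {accept_rate:.0%}, reject {reject_rate:.0%}, gap {gap:.0%}")
```

A large positive gap is the signature described above: the model keeps counterparty edits reliably but rarely pushes back when the rubric expects it to.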
7. Surgicalness
Attorneys redline more surgically than the models, changing only the phrases and words necessary to accomplish their goals. To measure the surgicalness of models, we use two indicators: the form of the edit and the size of the edit.
The first indicator compares inline edits, which make targeted changes within existing text, against block edits, which replace larger units of drafting. Attorney redlines are nearly evenly split, with 48.6% inline edits and 51.4% block edits.
Every model relies more heavily on block edits. GPT-5.5 is the least surgical by this measure, with 81.0% block edits and 19.0% inline edits. Gemini 3.5 Flash and Claude Opus 4.8 fall in between, at 72.8% and 68.2% block edits, respectively. Claude Fable 5 comes closest to the attorney baseline, but its share of block edits is still about eleven percentage points above the attorney rate.
Indicator 1: fraction of tracked-change events that are inline (touching less than 30% of the paragraph) vs. block (touching 30% or more of the paragraph). Pooled across tasks per actor.
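The 30% threshold can be applied per tracked-change event as in the sketch below. Measuring how much of a paragraph an edit touches by character count is an assumption, as are the function names and the input layout.

```python
def classify_edit(chars_changed, paragraph_chars, threshold=0.30):
    """Label one tracked change as "inline" or "block".

    An edit is inline when it touches less than `threshold` of the paragraph,
    measured here in characters (an assumption).
    """
    if paragraph_chars <= 0:
        raise ValueError("paragraph_chars must be positive")
    return "inline" if chars_changed / paragraph_chars < threshold else "block"


def inline_share(edits):
    """edits: list of (chars_changed, paragraph_chars), pooled for one actor."""
    labels = [classify_edit(changed, total) for changed, total in edits]
    return labels.count("inline") / len(labels)


# Hypothetical tracked changes: two small tweaks and two large rewrites.
edits = [(12, 400), (29, 100), (30, 100), (350, 400)]
share = inline_share(edits)
print(f"inline share: {share:.1%}")
```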
The second indicator measures edit size by comparing edits per touched paragraph and average characters per edit. Attorneys make more edits per touched paragraph than any model, averaging 3.10 edits, but each edit is much shorter, averaging 101 characters. Models make fewer edits per touched paragraph, between 1.06 and 1.20, while producing substantially longer edits, from 318 to 518 characters on average. Claude Fable 5 is the least verbose model by average edit length at 318 characters, while Gemini 3.5 Flash is the most verbose at 518.1 characters, followed by GPT-5.5 at 509.9 and Claude Opus 4.8 at 464.1.
Indicator 2: edit frequency and size per edit.
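Both quantities in this indicator can be derived from a flat list of tracked changes. In this sketch the `(paragraph_id, chars_in_edit)` layout and the function name are hypothetical, and edit length is taken as the character count of the changed text.

```python
def edit_size_stats(edits):
    """Return (edits per touched paragraph, average characters per edit).

    edits: list of (paragraph_id, chars_in_edit) for one actor.
    """
    if not edits:
        raise ValueError("no edits to summarize")
    touched_paragraphs = {paragraph_id for paragraph_id, _ in edits}
    edits_per_paragraph = len(edits) / len(touched_paragraphs)
    avg_chars = sum(chars for _, chars in edits) / len(edits)
    return edits_per_paragraph, avg_chars


# Hypothetical edits: three small changes in one paragraph, one larger change in another.
edits = [("p1", 40), ("p1", 80), ("p1", 120), ("p2", 160)]
per_paragraph, avg_chars = edit_size_stats(edits)
print(per_paragraph, avg_chars)
```

Read together, the two numbers separate the attorney pattern (many short edits per paragraph) from the model pattern (roughly one long edit per paragraph).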
8. Implications for Training Legal AI Agents
Legal AI agents for contract redlining should be trained as negotiation participants, not drafting assistants. A useful agent must decide which issues matter, how aggressively to pursue them, when to concede, and how to preserve deal momentum across turns.
The strongest training signal is the need for better issue prioritization. Models performed worst when initiating redlines on a clean template, even where attorneys showed strong consensus about what mattered. Future training regimes must reward strategic issue selection, not just accurate edits.
Over-acceptance bias compounds the problem. Models don't just struggle to raise the right issues. They also fail to hold the line once challenged. Training must instill principled pushback and side-specific commercial reasoning, not just compliance.
The benchmark also highlights the importance of surgical drafting. Models tend to make fewer, larger edits than attorneys. The tendency toward wholesale block edits appears linked to a lack of issue prioritization, though further research is needed to confirm this connection. If that link holds, surgical drafting will likely improve as models develop stronger issue prioritization. Training should emphasize smaller, more precise interventions.
More broadly, this research shows why legal AI evaluation must move beyond static clause review. In real redlining, usefulness depends on whether an output advances the negotiation in context. Multi-turn, document-native benchmarks offer a more realistic way to measure whether legal AI can support the judgment-dense work of commercial contract negotiation.
*We were unable to generate multiple rollouts for Fable 5 given the access constraints.