Powering Role-Specific Enterprise AI Agents at Box

To support AI agents operating across distinct business functions, Box leveraged micro1’s Cortex offering to evaluate performance in real enterprise workflows

Deploy customized AI agents for your use case

Introduction

Box, an AI content management platform, was expanding its AI agent ecosystem to support enterprises across functions such as legal, finance, HR, and clinical operations. These agents needed to operate directly on documents within Box and perform complex, judgment-driven tasks within real enterprise workflows.

Box AI is an enterprise-grade AI layer built directly into Box that helps organizations securely understand, analyze, and act on their most critical content. Box AI harnesses the power of leading AI models—including those from OpenAI, Google, Anthropic, and others—and applies them to documents where they already live, enabling users to ask questions, generate summaries, extract key information, and automate insights across large volumes of unstructured data. All Box AI interactions are permissions-aware and governed by Box’s industry-leading security, compliance, and privacy controls, ensuring that Box AI accesses only content a user is authorized to see and that customer data is never used to train AI models without explicit consent.

As a model-agnostic platform, Box required more than standard benchmarks or generic test prompts. The team needed high-quality, expert-designed evaluation data that reflected real customer use cases, enabled meaningful model comparisons, and ensured agent reliability across diverse enterprise scenarios.

Solution

micro1 partnered with Box to design and deliver enterprise-realistic evaluation data grounded in expert judgment and real-world workflows, enabling Box to continue to benchmark, refine, and scale its AI agent platform with confidence.

Process

1. Expanded Box’s evaluation dataset with expert-designed data points across specialized domains

2. Modeled tasks directly on real enterprise agent workflows involving multi-document synthesis and complex reasoning

3. Covered high-value enterprise use cases involving analysis, review, and report generation

4. Ensured all tasks reflected the expectations, constraints, and decision-making patterns of real professionals by leveraging top expert knowledge

As part of the engagement, micro1:

  • Recruited and managed domain experts across specialized fields to ensure domain-accurate evaluation data
  • Produced enterprise-realistic prompts, expert-created source documents, and structured scoring rubrics for every task
  • Worked in close collaboration with Box’s technical team, enabling fast iteration and rapid turnaround cycles
  • Created a customized deployment of our human data platform to fit Box’s human data needs
  • Delivered data and evaluations designed to be reusable across models and agent configurations
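The deliverables above — enterprise-realistic prompts, expert-created source documents, and weighted scoring rubrics — can be sketched as a minimal data shape. This is an illustrative assumption only; the field names, rubric structure, and example task below are hypothetical, not Box’s or micro1’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    # One scoring dimension, e.g. "cites correct clauses" (illustrative)
    description: str
    weight: float        # relative importance; weights sum to 1.0
    score: float = 0.0   # expert-assigned grade in [0.0, 1.0]

@dataclass
class EvalTask:
    # One enterprise-realistic evaluation task (hypothetical schema)
    domain: str                 # e.g. "legal", "finance", "HR"
    prompt: str                 # enterprise-realistic instruction
    source_documents: list = field(default_factory=list)
    rubric: list = field(default_factory=list)  # list of RubricCriterion

    def weighted_score(self) -> float:
        # Weighted sum of criterion scores; the task itself is model-agnostic,
        # so the same rubric can grade output from any model.
        return sum(c.weight * c.score for c in self.rubric)

# Hypothetical example task with expert-assigned scores
task = EvalTask(
    domain="legal",
    prompt="Summarize indemnification obligations across the attached contracts.",
    source_documents=["msa.pdf", "sow.pdf"],
    rubric=[
        RubricCriterion("Identifies all indemnification clauses", 0.5, 1.0),
        RubricCriterion("Synthesizes across both documents", 0.3, 0.5),
        RubricCriterion("Flags conflicting terms", 0.2, 0.0),
    ],
)
print(round(task.weighted_score(), 2))  # 0.5*1.0 + 0.3*0.5 + 0.2*0.0 = 0.65
```

Keeping the rubric as structured data rather than free-form notes is what makes the evaluations reusable across models and agent configurations.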

Outcome

micro1 helped Box build and evaluate enterprise AI agents with greater realism, consistency, and confidence across workflows.

Key Results

Expanded an expert-grounded evaluation dataset for role-specific enterprise agents

Enabled meaningful benchmarking across models using consistent, high-quality evaluation data

Improved agent reliability on complex, multi-document enterprise tasks

Accelerated iteration cycles by aligning evaluation directly with real customer workflows

Enterprise Impact

Box gained a scalable, repeatable evaluation dataset that supports the ongoing development of reliable AI agents, grounded in expert judgment rather than abstract benchmarks.

Why This Worked

Expert-Designed, Not Synthetic

Every task, prompt, and rubric was created by domain experts, ensuring evaluations reflected real enterprise expectations.

Workload-First Evaluation

Instead of isolated prompts, micro1 modeled end-to-end enterprise workloads that agents must perform in production.

Model-Agnostic by Design

The evaluation framework allowed Box to benchmark and iterate across models without re-engineering its data foundation.
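The idea of model-agnostic benchmarking can be sketched as follows — a minimal, hypothetical loop in which the evaluation data stays fixed and only the model callable is swapped; the stub models and trivial grader here are illustrative assumptions, not the actual framework:

```python
from statistics import mean

def benchmark(models, tasks, grade):
    """Score every model against the same fixed evaluation set.

    models: dict mapping model name -> callable(prompt, docs) -> answer
    tasks:  shared evaluation tasks (reused across all models)
    grade:  callable(task, answer) -> float in [0.0, 1.0], standing in
            for an expert rubric (hypothetical)
    """
    results = {}
    for name, model in models.items():
        # The tasks never change; only the model being evaluated does.
        results[name] = mean(
            grade(t, model(t["prompt"], t["docs"])) for t in tasks
        )
    return results

# Stub models and grader, for illustration only
tasks = [{"prompt": "Summarize the contract.", "docs": ["msa.pdf"]}]
models = {
    "model_a": lambda prompt, docs: "summary covering all clauses",
    "model_b": lambda prompt, docs: "partial summary",
}
grade = lambda task, answer: 1.0 if "all clauses" in answer else 0.4

print(benchmark(models, tasks, grade))  # {'model_a': 1.0, 'model_b': 0.4}
```

Because the tasks and rubrics live outside any one model's API, comparing a new model means adding one entry to `models` rather than rebuilding the data foundation.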

Conclusion

By partnering with micro1, Box strengthened the human-grounded evaluation of its enterprise AI agents, ensuring reliable performance across roles, domains, and real customer workflows while preserving flexibility across models and agent architectures.
