June 19, 2026

Enterprise AI Won't Scale Without Human Data

Ava Fitoussy

Member of Technical Staff at micro1

There's a version of the AI story that goes like this: models get bigger, benchmarks improve and eventually the systems become capable enough that human oversight becomes more of a formality than a necessity. You swap in the new model, performance goes up, and life gets easier.

That story isn't wrong exactly. Models have gotten dramatically better. But it's missing something, and the gap becomes obvious the moment you try to actually deploy an AI agent inside a real organization.

The missing piece is human data. Not in the training sense most people think of, but as the ongoing infrastructure that tells you whether any of this is actually working.

What people mean when they say "human data" and why that definition is too narrow

Most people still have a fixed idea of what "human data" means, and it's a bit dated. They picture annotation pipelines: labelers working through queues, ranking outputs, flagging errors, generating the signal that goes into fine-tuning and RLHF. Foundation models keep improving partly because that pipeline keeps running.

But something has been added on top of it that doesn't fit into the same frame. As enterprises move from using models to deploying agents, a new and distinct need has emerged alongside the training work: figuring out whether the system you just deployed actually does what you think it does, in your environment, on your workflows, with your users. That's not a fine-tuning problem. It's an evaluation problem. And it requires a different kind of human judgment than labeling queues typically provide.

Agents are a fundamentally different evaluation problem

A single-turn language model is relatively straightforward to evaluate. You give it a prompt, it produces an output, you assess the output. The unit of analysis is one response. An agent is different. It makes a sequence of decisions: choosing which tool to use, deciding what information to retrieve, determining what step comes next, and synthesizing across multiple sources before producing a result. By the time it returns an answer, it has made dozens of micro-decisions that aren't visible in the final output.

This creates failure modes that don't exist in simpler systems. An agent can reach the right answer through the wrong process. It can complete 90% of a workflow correctly and fail on the one step that matters most. It can perform well on the kinds of tasks that are easy to test while quietly struggling on the ones that are harder to catch.

A human expert looking at the final output often can't tell any of this happened. The answer looks fine. The agent looks capable. The problem only surfaces later, in production, when it costs something.
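One way to make those hidden steps reviewable is to score the trajectory alongside the final answer. Here is a minimal sketch in Python. The names (Step, Trajectory, review), the "critical" flag, and the per-step verdicts are all hypothetical, and the sketch assumes an expert has already marked each step as correct or not.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str               # hypothetical step label, e.g. "retrieve_contract"
    correct: bool           # expert reviewer's verdict on this step
    critical: bool = False  # True if an error here invalidates the whole run


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer_correct: bool = False


def review(traj: Trajectory) -> dict:
    """Summarize a run at the step level, not just by its final answer."""
    total = len(traj.steps)
    step_accuracy = sum(s.correct for s in traj.steps) / total if total else 0.0
    critical_failures = [s.name for s in traj.steps if s.critical and not s.correct]
    any_step_wrong = any(not s.correct for s in traj.steps)
    return {
        "final_answer_correct": traj.final_answer_correct,
        "step_accuracy": step_accuracy,
        "critical_failures": critical_failures,
        # A run passes only if the answer is right AND no critical step failed.
        "passed": traj.final_answer_correct and not critical_failures,
        "right_answer_wrong_process": traj.final_answer_correct and any_step_wrong,
    }


# Nine routine steps done correctly, one critical step missed,
# and a final answer that still looks fine.
steps = [Step(f"step_{i}", correct=True) for i in range(9)]
steps.append(Step("flag_uncapped_liability", correct=False, critical=True))
report = review(Trajectory(steps=steps, final_answer_correct=True))
print(report)
```

In this made-up run the agent gets 90% of the steps right and returns a correct-looking answer, yet the review fails it because the one step that mattered was wrong. An output-only check would have passed it.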

The gap between general capability and enterprise readiness

Foundation models are trained to be broadly useful. That's their job, and they're increasingly good at it. But broad usefulness and enterprise readiness are not the same thing.

A law firm deploying a contract review agent doesn't need a system that can discuss any topic intelligently. It needs a system that catches specific clause types, flags particular risk patterns, and behaves predictably under the regulatory environment that governs its practice. A financial services company needs an agent that reflects its compliance requirements. A healthcare organization needs one that can be trusted with clinical reasoning.

None of that comes automatically from scaling a foundation model. It comes from defining what "good" looks like in your context and building the feedback loops that let you measure against that definition.

Even if you build on top of the best model available, the intelligence you layer on top of it through evaluation and domain-specific data is yours. The model is your pre-training substrate. What you build above it is the IP: the benchmarks tuned to your workflows, the expert feedback loops, the failure modes you've identified and corrected for. That's what makes your agent the best-performing system in your specific domain, not because you trained a better base model, but because you understand your space well enough to measure against it precisely.

Two companies can run the same foundation model and end up with agents that perform completely differently. One has invested in understanding what good looks like for their team and their workflows. The other hasn't. The model is a commodity input. The evaluation layer is where differentiation lives.

This is what human data actually does in a production enterprise AI system. It's not primarily about improving the model. It's about building the institutional knowledge layer.

The compounding loop most companies are missing

Satya Nadella wrote something this past week that I keep coming back to. He framed the real opportunity in enterprise AI not as picking the best model, but as building a learning loop where human capital and token capital compound together. His argument is that you can offload a task, or even a job, but you can never offload your learning. The firms that win will be the ones that turn their workflows, accumulated judgment, and domain knowledge into AI systems that get better with each use.

That framing recasts human data entirely. It's not a cost center or a necessary tax you pay to get a model working. It's the raw material of institutional knowledge. Every expert evaluation, every corrected output, every judgment call captured in a private benchmark is a deposit into a compounding asset that belongs to the organization, not to the model vendor.

Nadella's point about sovereignty is also worth thinking about. A company should be able to swap out the underlying model without losing the expertise encoded in its evaluation and learning systems. If your AI strategy lives entirely inside a vendor's product, you don't own much. If you've built the private evals, the feedback loops, and the benchmarks tuned to your actual workflows, those travel with you regardless of which foundation model you're running on top of.
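To make "those travel with you" concrete, here is a rough sketch of a private benchmark that knows nothing about which model sits underneath the agent. Everything in it is hypothetical: EvalCase, run_benchmark, the two stub agents, and the string checks, which stand in for expert-written criteria. The only assumption is that an agent can be called with a prompt and returns text.

```python
from dataclasses import dataclass
from typing import Callable

# Any agent, on any vendor's model, is just "prompt in, text out" here.
Agent = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # expert-defined pass/fail rule for this case


def run_benchmark(agent: Agent, cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the agent and record pass/fail per case."""
    return {case.case_id: case.check(agent(case.prompt)) for case in cases}


# The benchmark is owned by the organization, not by the model vendor.
cases = [
    EvalCase("indemnity", "Review clause 7.",
             lambda out: "indemnification" in out.lower()),
    EvalCase("governing_law", "Review clause 12.",
             lambda out: "governing law" in out.lower()),
]


# Stub agents standing in for the same workflow on two different models.
def agent_on_model_a(prompt: str) -> str:
    return "Flag: uncapped indemnification. Governing law: Delaware."


def agent_on_model_b(prompt: str) -> str:
    return "Flag: uncapped indemnification."


results_a = run_benchmark(agent_on_model_a, cases)
results_b = run_benchmark(agent_on_model_b, cases)
print(results_a)
print(results_b)
```

Swapping the model changes one argument. The cases, and the judgment encoded in their checks, stay exactly where they were.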

Evaluation is where the leverage actually is

There's a common misconception that once a foundation model is powerful enough, evaluation becomes less important. The opposite has turned out to be true.

As models have gotten better, the bottleneck has shifted. The question is no longer whether a model can perform a task in principle. It's whether your specific deployment of that model performs reliably in your specific environment, on your specific task distribution, for your specific users.

That's an empirical question. You can't answer it from the model's benchmark scores; you have to measure it directly.

This is why more and more of the human data work at serious AI companies has shifted from training toward evaluation. Building the benchmarks, running the red teams, doing the expert review passes: these are increasingly where resources go, because that's where the decisions get made.

Can we ship this to 10,000 users? Did this model update actually improve things or did it introduce new failure modes? Is the agent behaving consistently across the edge cases that matter for our industry? Human evaluation is how you find out.

The real competitive advantage

The AI landscape has no shortage of people chasing improvements at the model layer. More parameters, better architectures, more compute, newer training techniques. That work will keep producing better foundation models, and everyone will benefit from it.

The less obvious opportunity is at the evaluation layer. Understanding your specific task distribution well enough to build meaningful benchmarks. Establishing systematic ways to get expert judgment on agent behavior. Knowing with real confidence whether your agents are improving, regressing, or staying flat as the underlying models evolve.
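Telling improving from regressing from flat can be sketched as a comparison of two runs over the same private benchmark. The function name compare_runs, the case IDs, and the 2-point tolerance below are illustrative assumptions, not a standard. A real setup would pick its own threshold and would likely account for run-to-run noise.

```python
def compare_runs(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    tolerance: float = 0.02,  # assumed threshold for calling a change "flat"
) -> dict:
    """Compare per-case pass/fail results from two model versions."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no benchmark cases")

    base_rate = sum(baseline[c] for c in shared) / len(shared)
    cand_rate = sum(candidate[c] for c in shared) / len(shared)
    delta = cand_rate - base_rate

    if delta > tolerance:
        verdict = "improving"
    elif delta < -tolerance:
        verdict = "regressing"
    else:
        verdict = "flat"

    return {
        "baseline_pass_rate": base_rate,
        "candidate_pass_rate": cand_rate,
        "verdict": verdict,
        # Cases that used to pass and now fail: the new failure modes.
        "new_failures": [c for c in shared if baseline[c] and not candidate[c]],
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }


# Hypothetical results before and after a model update.
before = {"clause_a": True, "clause_b": True, "clause_c": False, "clause_d": True}
after = {"clause_a": True, "clause_b": False, "clause_c": True, "clause_d": True}

summary = compare_runs(before, after)
print(summary)
```

In this invented example the aggregate pass rate does not move, so the verdict is "flat", but one case that used to pass now fails. The headline number alone would hide that, which is why the per-case lists matter as much as the verdict.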

Nadella put it well when he described this as a "hill climbing machine." The loop compounds. Every improved workflow generates better training signal. Better signal accelerates the accumulation of tacit knowledge that's unique to the firm. That knowledge is genuinely proprietary in a way that access to a frontier model is not, because every company on earth can access the same frontier model.

The organizations that build this infrastructure early will have a clearer picture of what their systems actually do, and a corpus of institutional knowledge that took real time and real expertise to build. In an environment where AI is moving fast and failures are expensive, that advantage is more durable than it might look.