Most agent demos won't survive a real codebase: here's what does

I’ve watched a lot of agent demos. Most of them are genuinely good: a chat box, a couple of tool calls, a clean answer that lands on the first try. The demo is the part everyone remembers. It’s also the easy part.

The hard part starts when you try to put the thing into a real company’s systems. Not because the model gets worse, but because a demo runs in a world you fully control and production doesn’t. Below are three walls a recent agent project hit on its way to going live. None of them were about the agent. All of them decided whether it shipped.

It couldn’t reach the data

In the demo, everything sat in one place. The agent, the data it queried, the credentials, all on one machine and all reachable.

Production split that apart. The agent ran on an internet-facing server. The data it needed lived behind an internal API that, by design, refuses calls from the public internet. So the first time the agent tried to do the one thing the demo made look trivial, it got nothing back. The model was fine. It simply had no path to the information.

The fix wasn’t a better prompt. It was setting up an API gateway and configuring it to bridge the two networks safely: authenticating the agent’s requests, exposing only the handful of endpoints it was allowed to touch, and refusing everything else. That’s infrastructure work, and no demo ever shows it. Skip it and you have a very articulate program that can’t see anything.
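To make the gateway's job concrete, here is a minimal sketch of the rule it enforces: authenticate the caller, allow a short list of endpoints, refuse everything else. This is not the gateway we configured. The routes, the key, and the `authorize` function are all hypothetical, and a real gateway expresses this in its own configuration rather than in application code.

```python
import hmac
from typing import Tuple

# Hypothetical allowlist: the only (method, path) pairs the agent may call.
ALLOWED_ROUTES = {
    ("GET", "/orders"),
    ("GET", "/customers"),
}

# Hypothetical shared secret; a real setup would load this from a secret store.
AGENT_API_KEY = "example-key-not-a-real-secret"


def authorize(method: str, path: str, api_key: str) -> Tuple[int, str]:
    """Return an HTTP-style (status, reason) for one incoming agent request."""
    # Authenticate first, using a constant-time comparison.
    if not hmac.compare_digest(api_key.encode(), AGENT_API_KEY.encode()):
        return 401, "unauthenticated"
    # Deny by default: anything not explicitly listed is refused.
    if (method.upper(), path) not in ALLOWED_ROUTES:
        return 403, "route not allowed"
    return 200, "forward to internal API"
```

The important design choice is the default. The list names what is allowed, so a new internal endpoint stays unreachable until someone deliberately adds it.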

Setting that up is real work, so we didn’t keep it to ourselves. We registered a shared organization on the gateway that other teams could route their own services through, and documented the connector so the next team that needed to reach the same internal systems didn’t have to rediscover how.

Shipping it once isn’t shipping it

The first deployment was done by hand. That works exactly once. By the third change you’ve forgotten a step, pushed the wrong config, or deployed something nobody can reproduce, and now you’re debugging the deployment instead of the feature. The agent’s deployment had to run the same way every time.

The company already had a pipeline for exactly that, built by another team. That should have been the easy win, and it wasn’t. The tool was hard to use and barely documented, so getting our agent onto it meant tracking down the people who built it and pulling the knowledge out of their heads, because almost none of it was written down. Automating the deployment turned out to be less about pipelines than about archaeology.

That kind of work, integrating with the half-documented platform the rest of the company already runs on, never shows up in a demo. But inside a real organization it’s most of what “deploy the agent” actually means.

The UI had to know who was asking

The last wall showed up when a chat component went into an application’s interface. A box on a screen, the most demo-like piece of the whole project. Except a real UI sits in front of real users, and the agent behind it can’t just answer anyone who types into it. It has to know who’s asking and whether they’re allowed to ask.

That meant the chat component had to authenticate against the agent, and the agent had to trust that authentication. Rather than wire it up once and move on, we built a small auth service that handles it cleanly, and built it so other teams plugging their own UI into the same agent could reuse it instead of reinventing it. Same story as the gateway: the model didn’t change, the system around it grew.
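One common way to get that kind of trust is a short-lived signed token: the auth service vouches for the user, and the agent checks the signature before answering. The sketch below assumes that pattern and uses only the Python standard library. It is not the service we built; `SIGNING_KEY`, `issue_token`, and `verify_token` are hypothetical names, and a production system would more likely use an established token standard and proper key management.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical signing key shared by the auth service and the agent.
SIGNING_KEY = b"example-signing-key"


def issue_token(user_id: str, ttl_seconds: int = 300,
                now: Optional[float] = None) -> str:
    """Auth service side: sign a short-lived claim about who is asking."""
    issued_at = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued_at + ttl_seconds}).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Agent side: return the user id if the token is authentic and unexpired."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # Malformed token.
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # Signature does not match: forged or tampered.
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if claims["exp"] < current:
        return None  # Token has expired.
    return claims["sub"]
```

Any UI that can obtain a token from the auth service can talk to the agent the same way, which is what makes the service reusable: the agent only has to know how to verify, not who built the front end.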

The agent was the easy part

Look at the three walls together. A network gateway, shared so other teams could route through it. An internal deployment pipeline we had to reverse-engineer to use. A reusable auth service. Not one of them is “AI work.” All of it is ordinary full-stack systems work: networking, deployment, identity. The kind a demo is specifically designed to hide, because a demo’s job is to show the 20% that’s fun and skip the 80% that’s load-bearing.

This is why so many impressive agent demos never become anything. The demo proves the model can do the task. It says nothing about whether the task can survive a network boundary, a second deploy, or a real user. The features that make it to production get there because someone treated the agent as one component in a larger system and did the unglamorous work of fitting it in.

If you’re building these: judge an agent feature by its blast radius into the rest of your system, not by how clean the demo looked. The demo is a hypothesis. Everything around it is the experiment that tells you whether the hypothesis holds.

And if you’re hiring someone to ship AI features, that’s the actual job. The prompt is the part anyone can copy. The person who can build the boring layer around it, and do it so the next team doesn’t have to, is the one whose work is still running a year later.