June 12, 2026

Your agent's bill isn't the model. It's the query path.

The 2026 AI cost reckoning trains your attention on the token meter. When I traced where an agent's spend actually went, the model was the easy part. The leak was the warehouse the agent queries.

The 2026 AI cost reckoning arrived on schedule and got loud this week. Uber capped its engineers at $1,500 a month per AI coding tool after reportedly burning its annual AI budget in four months. GitHub Copilot moved to usage-based credits. The model labs shipped a new top capability tier and priced it at fifty dollars a million output tokens, which is the kind of number that makes a finance team ask questions it wasn’t asking last quarter. Simon Willison flagged the Uber cap as a real-world cost-governance datapoint, and he’s right that it is one.

All of it points your attention at the same place. The token meter. How many tokens is this thing burning, at what per-token rate, and how do we make that number smaller. That’s where the news is, and it’s where most cost conversations about agents end up, because it’s the legible part. The labs publish the rates. The dashboards show the burn. It feels like the thing to optimize.

I went looking for where an agent’s cost was actually going, and the token meter turned out to be the wrong place to look.

The thing I assumed, and the thing that was true

I’ll be honest about the starting assumption, because the gap between it and reality is the whole point. I went in believing the cost story was a model-serving story. The agent does expensive reasoning, the reasoning is tokens, the tokens are the bill. So the obvious move was to attack model serving: cheaper model where possible, fewer round-trips, tighter prompts.

That work was real and it mostly landed. Calling the model provider directly instead of routing through a heavier serving layer did what it was supposed to, and the model-serving line came down. If the story ended there, this would be a post about prompt efficiency, and you’ve read that post a hundred times.

The story didn’t end there, because when the model-serving line came down, the total didn’t come down nearly as much as it should have. The money was leaking somewhere I hadn’t been looking, one layer below the model, in the data the agent reaches for to do its job.

A cache that worked for dashboards and was bypassed by the agent

Here is the part that surprised me, and that I think generalizes.

The agent queries a data warehouse. That warehouse already had a cache in front of it, and the cache had been doing its job quietly for a long time, because the original consumers were dashboards. Dashboards ask the same tidy, low-cardinality questions over and over. Show me this metric for this period. The same shape, repeatedly, keys cleanly into a cache, hits warm, and barely touches the underlying compute. The cache looked healthy because for the workload it was built for, it was healthy.

The agent asks a different kind of question. It asks high-cardinality questions, “these specific several dozen things, filtered and joined against those,” where the exact set changes from call to call because it depends on what the user asked and what the agent decided it needed. Those queries don’t key the way the dashboard’s aggregates do. They miss the cache almost every time, fall straight through to the warehouse, and scan large volumes. Per call. On a surface that runs a lot of calls.
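A toy sketch of that difference, assuming a simple exact-match result cache. This is not the actual warehouse cache: the key function, the metric and period names, and the entity counts are all made up for illustration. It replays two synthetic workloads and compares hit rates.

```python
import hashlib
import random


def cache_key(metric, period, entity_ids=()):
    """Key a query by its exact shape, the way a simple result cache might."""
    raw = f"{metric}|{period}|{','.join(sorted(entity_ids))}"
    return hashlib.sha256(raw.encode()).hexdigest()


def hit_rate(queries):
    """Replay queries against an unbounded exact-match cache; return the hit fraction."""
    seen = set()
    hits = 0
    for query in queries:
        key = cache_key(*query)
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / len(queries)


rng = random.Random(0)  # seeded so the run is repeatable
metrics = ["revenue", "signups", "churn"]      # hypothetical metric names
periods = ["7d", "30d", "90d"]                 # hypothetical periods
entities = [f"e{i}" for i in range(5000)]      # hypothetical entity ids

# Dashboard-style traffic: only 9 possible query shapes, asked over and over.
dashboard = [(rng.choice(metrics), rng.choice(periods)) for _ in range(1000)]

# Agent-style traffic: a different set of 36 specific entities on each call.
agent = [
    (rng.choice(metrics), rng.choice(periods), tuple(rng.sample(entities, 36)))
    for _ in range(1000)
]

dashboard_rate = hit_rate(dashboard)
agent_rate = hit_rate(agent)
print(f"dashboard hit rate: {dashboard_rate:.3f}")
print(f"agent hit rate:     {agent_rate:.3f}")
```

The dashboard workload can only ever produce nine distinct keys, so almost every query after the first few is a hit. The agent workload almost never produces the same key twice, so an exact-match cache has nothing to serve it.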

So the picture was a cache reporting reasonable health on its original traffic while the agent’s traffic was bypassing it almost entirely and scanning the warehouse over and over. The cost wasn’t in the model thinking. It was in the agent repeatedly asking a database expensive questions that nothing was absorbing.

And there was a second, dumber problem sitting next to the first. The agent was sharing warehouse compute with the dashboards instead of running on its own. So beyond the cache misses, the agent’s heavy scans and the dashboards’ regular load were contending for the same resource, each making the other slower and more expensive. Two workloads with completely different access patterns, stapled to one piece of compute, neither sized for the other.

Neither of these shows up on the token meter. You can stare at token usage all day and never see “the agent’s queries miss a cache built for a different access pattern” or “the agent is squatting on compute meant for dashboards.” Those live in a different system entirely, and if your mental model of agent cost stops at the model, you will never go look at that system.

The general lesson

The reason I think this generalizes is that the conditions that produced it are extremely common. You have a data layer that predates the agent. It was built for human-driven, regular-shaped access, dashboards and reports and scheduled queries. It has caching and capacity tuned for that. Then you put an agent in front of it, and the agent introduces an access pattern nobody designed for: irregular, high-cardinality, user-driven, bursty, and indifferent to the assumptions baked into the cache.

The agent doesn’t know any of this. It’s doing exactly the right thing given what it was asked. The model is innocent. The cost is an emergent property of a new access pattern hitting old infrastructure, and it is invisible from inside any single agent run. You only see it when you stop looking at the run and start looking at the distribution of what the agent is doing to the system underneath it.

That reframes the optimization. The token-level work is still worth doing. Caching the static part of a heavy system prompt, so you stop paying for the same couple-dozen-tool schema on every single turn, is real money, and I did that work too. But the token work has a floor, and the labs publish the rate, so everyone is optimizing the same legible thing. The query-path work is where the surprising money is, because almost nobody is looking at it, and because the failure modes are specific to your data layer and your agent’s access pattern, so there’s no published rate to anchor on. It’s yours to find.
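A rough arithmetic sketch of why the token work has a floor. Every number here is hypothetical: the token counts, the input rate, and the cached-prefix discount are placeholders, not any provider's published pricing, and the sketch ignores any surcharge a provider might apply when first writing to the cache.

```python
def input_cost_per_turn(static_tokens, dynamic_tokens, rate_per_mtok, cached_discount=0.0):
    """Dollar cost of one turn's input tokens.

    cached_discount is the fraction taken off the static prefix when it is
    served from a prompt cache (0.0 means no caching).
    """
    static_cost = static_tokens * rate_per_mtok * (1.0 - cached_discount)
    dynamic_cost = dynamic_tokens * rate_per_mtok
    return (static_cost + dynamic_cost) / 1_000_000


# Hypothetical values, chosen only to make the arithmetic easy to follow.
STATIC_TOKENS = 12_000    # system prompt plus tool schemas, identical every turn
DYNAMIC_TOKENS = 1_500    # the part that changes from turn to turn
RATE_PER_MTOK = 5.0       # assumed dollars per million input tokens
CACHED_DISCOUNT = 0.9     # assumed discount on a cached prefix

uncached = input_cost_per_turn(STATIC_TOKENS, DYNAMIC_TOKENS, RATE_PER_MTOK)
cached = input_cost_per_turn(STATIC_TOKENS, DYNAMIC_TOKENS, RATE_PER_MTOK, CACHED_DISCOUNT)
floor = input_cost_per_turn(0, DYNAMIC_TOKENS, RATE_PER_MTOK)

print(f"uncached: ${uncached:.4f} per turn")
print(f"cached:   ${cached:.4f} per turn")
print(f"floor:    ${floor:.4f} per turn")
```

Under these assumptions caching cuts the per-turn input cost by about 80 percent, but it cannot go below the cost of the dynamic tokens. The saving is real and bounded, and none of it touches what the agent's queries cost downstream.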

What I’d actually tell someone

If you’re putting an agent in front of a real data system, do the boring profiling before you touch the model contract. Pull the agent’s actual queries, not the ones you imagine it makes, and look at the cache hit rate on its access pattern specifically, separate from whatever the cache reports overall. The overall number will lie to you if the agent is a minority of traffic riding on top of a well-behaved majority. Then check whether the agent has its own compute or is sharing with a workload that has nothing in common with it. Those two checks found more cost than any prompt I rewrote.
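A minimal sketch of the first check, assuming you can export a query log with a consumer label, a cache-hit flag, and bytes scanned per query. The `QueryLogRow` fields and the sample numbers are hypothetical; a real warehouse's query history will have its own column names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QueryLogRow:
    consumer: str        # hypothetical label, e.g. "dashboard" or "agent"
    cache_hit: bool
    bytes_scanned: int


def profile(rows):
    """Aggregate hit rate and bytes scanned per consumer, plus an overall bucket."""
    stats = defaultdict(lambda: {"queries": 0, "hits": 0, "bytes_scanned": 0})
    for row in rows:
        for bucket in (row.consumer, "overall"):
            s = stats[bucket]
            s["queries"] += 1
            s["hits"] += int(row.cache_hit)
            s["bytes_scanned"] += row.bytes_scanned
    return {
        name: {**s, "hit_rate": s["hits"] / s["queries"]}
        for name, s in stats.items()
    }


# Made-up log: the agent is 10% of queries, riding on a well-behaved majority.
rows = (
    [QueryLogRow("dashboard", True, 0)] * 855
    + [QueryLogRow("dashboard", False, 10**8)] * 45
    + [QueryLogRow("agent", True, 0)] * 5
    + [QueryLogRow("agent", False, 5 * 10**9)] * 95
)

report = profile(rows)
for name in ("overall", "dashboard", "agent"):
    r = report[name]
    print(f"{name:10s} hit_rate={r['hit_rate']:.2f} bytes_scanned={r['bytes_scanned']:,}")
```

In this made-up log the overall hit rate is 86 percent, which looks fine. Split by consumer, the agent's hit rate is 5 percent and it accounts for roughly 99 percent of the bytes scanned.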

And keep the token work, but put it in its right place. It’s the legible optimization, the one with a published rate and a clear ceiling on what it can save. The query path is the illegible one, and the illegible costs are where the leaks live, because legible costs get optimized by everyone and illegible ones get optimized by whoever bothers to look.

The honest framing

I should be precise about what’s mine in this story, because it matters. I didn’t discover this by being smarter than the agent in the moment. I built a system that watches my own infrastructure spend and lets me trace it back to a cause, and that system is what surfaced the finding: the cache the agent’s queries were bypassing, the warehouse the agent was hammering, the compute it was sharing that it shouldn’t have been. The work that compounds isn’t the individual fix. It’s having a way to see the distribution instead of the anecdote, so the next surprising cost has somewhere to show up before it becomes a quarterly line item someone in finance asks about.

The token meter is the thing the whole industry is staring at this week, because it’s the part everyone can see. The bill that actually bites is one layer down, in the query path the agent triggers, and it stays invisible exactly as long as you keep looking at the model.

#agents #ai #infrastructure #building
