The most common reaction to a scary LLM bill is to blame the model.
“This model is too expensive.”
Maybe. But I have seen enough AI cost reviews to know the model is often just the most visible line on the invoice. The expensive part is usually the workflow around it: repeated context, unnecessary calls, bad routing, missing caches, retry loops, and tasks sent to a premium model because nobody defined a cheaper path.
The model is the taxi meter.
The workflow is the person asking the driver to circle the block seventeen times.
Cost Is a Trace
An LLM bill is not just a finance artifact. It is an architecture trace.
It tells you which workflows run too often, which tasks are over-modeled, which prompts carry too much baggage, which users retry because outputs are unclear, and which systems lack deterministic steps that should have happened before the model was called.
If every request sends the full ticket history, entire document, previous conversation, tool logs, policy text, and a motivational speech from the product manager, the cost problem is not mysterious.
You built a context buffet and gave every request a plate.
Route by Difficulty
Not every AI task deserves the best model.
Classification, formatting, extraction, routing, and simple checks can often use smaller models, deterministic code, or existing search. Complex reasoning, ambiguous decisions, synthesis, and high-risk actions may deserve stronger models.
The mistake is treating “use AI” as one routing decision.
It should be several: Can deterministic logic handle this? Can a small model handle it? Does the task need retrieval? Does it require a stronger model? Should a human review it before anything expensive happens?
That is not just cost optimization. It is system design.
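To make that concrete, here is a minimal sketch of the cascade in Python (retrieval omitted for brevity). The Task shape, model names, and routing rules are illustrative placeholders, not a prescription for any particular stack.

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str            # e.g. "classify", "extract", "synthesize"
    text: str
    high_risk: bool = False

SIMPLE_KINDS = {"classify", "extract", "format", "route"}

def call_model(model: str, prompt: str) -> str:
    # Placeholder for your real model client.
    return f"[{model}] {prompt[:40]}"

def route(task: Task) -> str:
    # 1. Deterministic path first: if a rule or lookup answers it, no model runs.
    if task.kind == "route" and task.text.strip().lower() in {"refund", "billing"}:
        return "queue:" + task.text.strip().lower()

    # 2. Simple, well-defined tasks go to a small model.
    if task.kind in SIMPLE_KINDS:
        return call_model("small-model", task.text)

    # 3. High-risk work pauses for human review before anything expensive runs.
    if task.high_risk:
        return "pending-human-review"

    # 4. Only what remains pays for the stronger model.
    return call_model("large-model", task.text)
```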
Cache Boring Intelligence
AI workflows often regenerate the same intelligence repeatedly.
The same repo summary. The same customer profile. The same policy explanation. The same document outline. The same “what is this ticket about” step repeated across planning, coding, review, and handoff.
If the intermediate result is stable enough, cache it. If it is too large, summarize it. If it is stale-sensitive, version it. If it is user-specific, scope it.
Caching is not glamorous, which is how you know it might work.
Do not pay a model to rediscover context your system already knows.
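A rough sketch of what that looks like in practice: the cache key carries the user scope, a prompt version, and a content hash, so a changed document misses the cache instead of serving stale intelligence. The summarize() call here is a stand-in for whatever model client your system already has.

```python
import hashlib

PROMPT_VERSION = "v3"  # bump this when the summarization prompt changes

_cache: dict[str, str] = {}

def summarize(document: str) -> str:
    # Placeholder for the actual model call.
    return document[:200]

def cache_key(user_id: str, document: str) -> str:
    # Scope by user, version by prompt, and hash the content so a changed
    # document misses the cache instead of returning a stale summary.
    digest = hashlib.sha256(document.encode()).hexdigest()[:16]
    return f"{user_id}:{PROMPT_VERSION}:{digest}"

def cached_summary(user_id: str, document: str) -> str:
    key = cache_key(user_id, document)
    if key not in _cache:
        _cache[key] = summarize(document)
    return _cache[key]
```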
Watch the Retry Loop
Retry logic is where cost goes to become folklore.
The model returns a weak answer. The system retries with more context. Then retries with a stronger instruction. Then retries with the bigger model. Then a human clicks “try again” because the UI made that easier than giving useful feedback.
Now the workflow has spent money and still has no idea what went wrong.
Retries need budgets, stop conditions, and failure labels. If the system cannot distinguish between bad input, missing context, model uncertainty, and tool failure, retrying is just burning cash with optimism.
This connects directly to AI reliability and weirdness budgets: if failures are not classified, they cannot be improved.
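Here is a minimal sketch of that discipline. The budget, the labels, and the classify_failure() checks are all illustrative; the point is that every retry leaves a record of why it happened, so the next fix targets the actual failure mode.

```python
from enum import Enum

class Failure(Enum):
    BAD_INPUT = "bad_input"              # retrying will not help
    MISSING_CONTEXT = "missing_context"  # fix retrieval, then retry
    MODEL_UNCERTAIN = "model_uncertain"  # a retry or stronger model may help
    TOOL_FAILURE = "tool_failure"        # retry the tool, not the model

MAX_RETRIES = 2  # a budget, not a vibe

def call_model(prompt: str) -> str:
    return ""  # placeholder for the real client

def classify_failure(output: str) -> Failure | None:
    # Placeholder check; real systems validate schema, citations, tool results, etc.
    return Failure.MODEL_UNCERTAIN if not output.strip() else None

def run_with_budget(prompt: str) -> tuple[str | None, list[Failure]]:
    failures: list[Failure] = []
    for _ in range(MAX_RETRIES + 1):
        output = call_model(prompt)
        failure = classify_failure(output)
        if failure is None:
            return output, failures
        failures.append(failure)
        if failure in (Failure.BAD_INPUT, Failure.MISSING_CONTEXT):
            break  # stop condition: repeating the same call cannot fix these
    return None, failures  # the labels tell you what to improve next
```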
The Takeaway
Yes, model pricing matters.
But runaway LLM cost is usually a workflow smell.
Before downgrading every model or yelling at the finance dashboard, inspect the path: routing, context, caching, retries, deterministic steps, and review loops.
Your AI system spends money where your workflow lacks discipline.