Every production workflow needs a place for the weird cases to go.
AI workflows need it even more.
A ticket is ambiguous. A document contradicts itself. A tool fails halfway through. A user asks for something the system is not allowed to do. The model is uncertain, but the orchestration layer was written by an optimist and the retry loop keeps trying like a golden retriever with a keyboard.
This is how work disappears into logs, Slack threads, and “we will look into it.”
Your AI workflow needs a dead letter queue.
Failure Should Have an Address
In message systems, a dead letter queue catches messages that cannot be processed successfully.
AI workflows need the same concept: a designed place for failed, unsafe, ambiguous, or unprocessable tasks to land.
Not hidden in logs.
Not retried forever.
Not summarized as “encountered an issue” and then abandoned like a chair in a conference room.
The workflow should preserve the input, context, attempted actions, failure reason, tool outputs, model uncertainty, and recommended next step.
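A minimal sketch of what such a record might look like. The field names here are assumptions for illustration, not the schema of any particular framework:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DeadLetterRecord:
    task_id: str
    original_input: str          # the task exactly as it arrived
    context: dict                # retrieved docs, conversation state, etc.
    attempted_actions: list      # tool calls and model steps taken so far
    tool_outputs: list           # raw outputs, including the failing one
    failure_reason: str          # a reason code, not free text
    model_uncertainty: float     # e.g. self-reported confidence, 0.0 to 1.0
    recommended_next_step: str   # e.g. "add_context", "human_approval"
    failed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

The point is that everything needed to diagnose and repair the task travels with it, instead of being scattered across logs.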
Not Every Failure Deserves a Retry
Retrying is useful when failure is temporary.
Timeout? Retry. Rate limit? Retry later. Flaky integration? Maybe retry with a cap.
But many AI workflow failures are not temporary. They are structural: missing requirements, unclear authority, unsafe action, contradictory sources, low confidence, or no valid tool path.
Retrying those failures is just doing the wrong thing with persistence.
The dead letter queue says: stop, preserve evidence, classify the failure, and route it to repair.
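That routing decision can be sketched in a few lines. The reason codes and the retry cap below are assumptions for illustration:

```python
# Transient failures may resolve on retry; structural ones will not.
TRANSIENT = {"timeout", "rate_limit", "flaky_integration"}
STRUCTURAL = {"missing_context", "policy_conflict", "unsafe_action",
              "contradictory_sources", "low_confidence", "no_valid_tool_path"}
MAX_RETRIES = 3

def route_failure(reason: str, attempt: int) -> str:
    """Return 'retry' or 'dead_letter' for a failed task."""
    if reason in STRUCTURAL:
        return "dead_letter"   # retrying cannot fix a structural problem
    if reason in TRANSIENT and attempt < MAX_RETRIES:
        return "retry"         # temporary: try again, but with a cap
    return "dead_letter"       # retries exhausted, or unknown reason
```

Note that unknown reasons fall through to the queue rather than the retry loop: when in doubt, preserve the evidence.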
Classify the Broken Work
A dead letter queue is not a trash can.
It is a learning system.
Each failed item should get a reason code: missing context, policy conflict, tool failure, prompt ambiguity, retrieval failure, permission boundary, unsupported task type, human approval required.
Those reason codes become engineering backlog, eval cases, runbook updates, and product requirements.
If twenty tasks land in the queue because acceptance criteria are unclear, you do not have an AI problem. You have a requirements problem that is showing up on your model bill.
Failed AI work should become evidence. If it only becomes noise, the system cannot improve.
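Turning queue contents into evidence can be as simple as counting reason codes. The queue items below are made up for illustration:

```python
from collections import Counter

def failure_report(queue: list[dict]) -> list[tuple[str, int]]:
    """Count dead-letter items per reason code, most common first."""
    return Counter(item["failure_reason"] for item in queue).most_common()

queue = [
    {"task_id": "t1", "failure_reason": "prompt_ambiguity"},
    {"task_id": "t2", "failure_reason": "prompt_ambiguity"},
    {"task_id": "t3", "failure_reason": "tool_failure"},
]
# failure_report(queue) → [("prompt_ambiguity", 2), ("tool_failure", 1)]
```

A report like this is what turns "the AI keeps failing" into a ranked backlog.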
Make Repair Boring
The queue should support repair.
A human should be able to inspect the failed item, add missing context, change the task type, approve a risky action, mark the source as stale, or convert the failure into a test case.
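A sketch of what those repair operations might look like. The action names are assumptions, and a real queue would persist these changes rather than mutate a dict in place:

```python
def repair(item: dict, action: str, **kwargs) -> dict:
    """Apply a human repair action to a dead-letter item."""
    if action == "add_context":
        item["context"].update(kwargs["extra_context"])
        item["status"] = "ready_for_retry"
    elif action == "approve_action":
        item["approved_by"] = kwargs["reviewer"]
        item["status"] = "ready_for_retry"
    elif action == "convert_to_test_case":
        item["status"] = "archived_as_eval"  # becomes a regression eval
    else:
        raise ValueError(f"unknown repair action: {action}")
    return item
```

Each path either sends the task back to the workflow with what it was missing, or converts the failure into a permanent test case.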
The goal is not to celebrate failure. The goal is to make failure useful.
This is closely related to weirdness budgets: you cannot manage what you do not capture.
The Takeaway
AI workflows will fail.
The question is whether failure becomes a controlled lane or a haunted retry loop.
Give failed work a place to land. Preserve context. Classify the cause. Route it for repair.
A dead letter queue is not a pessimistic feature.
It is how production systems learn without losing the evidence.