How We Made AI Systems Deterministic
The final article in a three-part series on how our team built more reliable agents on top of nondeterministic LLMs.
In the previous articles, we explored two learnings from building Friday, an AI agent orchestration platform that I cofounded.
First, AI systems behave more reliably when they plan work before executing it — when the system understands the steps involved ahead of time, many common failures become easier to avoid.
Second, once agents begin running work across tools and environments, reliability becomes a systems problem. Execution introduces partial failures, evolving context, and coordination challenges that look more like distributed systems than prompts.
Those observations led to a practical question: how should an AI system represent work so it can run reliably over time?
The Code Generation Trap
Our first instinct was the same one many teams have today. We asked the model to generate the system directly.
Large language models are remarkably good at writing code. Give them an example and describe what you want, and they can often produce code that compiles. So our early approach looked straightforward. A user would describe the work they wanted done. The model would generate the TypeScript required to build the workspace and its associated jobs. That code would then execute inside a sandboxed worker environment.
At first, this approach seemed promising. Most of the time, the generated code worked. But the remaining cases quickly became painful. Sometimes the model would hallucinate step identifiers that didn’t exist. Other times it would wrap the generated code in Markdown fences despite being explicitly instructed not to. Small formatting variations would break downstream parsing.
We added retries. If the generated code failed validation, the system would ask the model to regenerate it. Two retries solved many cases, but those retries were hiding a deeper problem.
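A validate-and-retry loop of this kind might look roughly like the sketch below. It is not our production code: the function names, the step("...") reference convention, and the specific checks are assumptions made for illustration, and the model call is replaced by a synchronous stand-in.

```typescript
// Illustrative only: the names and checks below are assumptions, not Friday's code.
const FENCE = "`".repeat(3); // the Markdown code-fence marker

// Returns a list of problems; an empty list means the output passed validation.
function validateGenerated(code: string, knownStepIds: Set<string>): string[] {
  const problems: string[] = [];
  if (code.includes(FENCE)) {
    problems.push("output is wrapped in Markdown fences");
  }
  // Assumed convention: generated code refers to steps as step("some-id").
  const stepRef = /step\("([^"]+)"\)/g;
  let match: RegExpExecArray | null;
  while ((match = stepRef.exec(code)) !== null) {
    if (!knownStepIds.has(match[1])) {
      problems.push(`unknown step identifier: ${match[1]}`);
    }
  }
  return problems;
}

// `generate` stands in for a model call; a real one would be asynchronous.
function generateWithRetries(
  generate: (attempt: number) => string,
  knownStepIds: Set<string>,
  maxRetries: number = 2,
): { code: string; attempts: number } {
  let problems: string[] = [];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const code = generate(attempt);
    problems = validateGenerated(code, knownStepIds);
    if (problems.length === 0) {
      return { code, attempts: attempt + 1 };
    }
  }
  throw new Error(`generation failed validation: ${problems.join("; ")}`);
}
```

The loop only reports whether some attempt passed the checks. It says nothing about whether two passing attempts are the same program.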
The system was relying on nondeterministic code generation to produce infrastructure. When your orchestration layer depends on outputs that can vary between runs, debugging becomes extremely difficult. A workspace that works today might generate slightly different code tomorrow.
Eventually we realized something uncomfortable. We were asking the model to do the wrong job.
What Models Are Good At
Large language models excel at certain kinds of problems. They are extremely good at understanding intent — they can classify information, interpret vague instructions, and map human language into structured meaning. But they are not good at producing deterministic infrastructure.
Code generation feels appealing because it gives the model a lot of freedom. But that same freedom introduces the very nondeterminism we were trying to eliminate.
Once we reframed the problem, the architecture started to change. Instead of asking the model to generate the system itself, we began asking it to make decisions about the system. Everything else could be handled by deterministic code.
The Compiler Pattern
This shift led us toward a pattern that looks surprisingly similar to compiler architecture. In a traditional compiler, the front end parses human-written input into a structured representation. The back end then transforms that structure into executable code. We applied the same idea to agent orchestration.
The model acts as the front end. Its job is to understand intent and produce structured data that describes the work. From there, the rest of the system behaves like a compiler: a deterministic pipeline transforms that structured representation into the configuration required to run the workspace.
The model no longer generates infrastructure directly. It generates typed descriptions of what the infrastructure should do. The compiler handles the rest.
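As a rough illustration of what a typed description can mean in practice, the sketch below treats the model's output as untrusted JSON and narrows it into a small typed structure before anything else touches it. The JobDescription and WorkspaceDescription shapes and the parseWorkspaceDescription function are hypothetical; the real schema is richer than this.

```typescript
// Hypothetical shape of the model's output: data that describes work, not code.
interface JobDescription {
  title: string; // human-readable, written by the model
  description: string;
}

interface WorkspaceDescription {
  jobs: JobDescription[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Narrow untrusted JSON from the model into the typed description, or throw.
function parseWorkspaceDescription(raw: string): WorkspaceDescription {
  const data: unknown = JSON.parse(raw);
  if (!isRecord(data)) {
    throw new Error("expected a JSON object");
  }
  const jobsValue = data["jobs"];
  if (!Array.isArray(jobsValue)) {
    throw new Error("expected 'jobs' to be an array");
  }
  const jobs = jobsValue.map((job: unknown, index: number): JobDescription => {
    if (!isRecord(job)) {
      throw new Error(`jobs[${index}] must be an object`);
    }
    const title = job["title"];
    const description = job["description"];
    if (typeof title !== "string" || typeof description !== "string") {
      throw new Error(`jobs[${index}] needs string 'title' and 'description'`);
    }
    // Copy only the known fields so nothing unexpected flows downstream.
    return { title, description };
  });
  return { jobs };
}
```

Everything after this boundary works with WorkspaceDescription values rather than raw model text, so downstream stages never need to parse free-form output.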
Structured Generation
One of the most useful patterns that emerged from this approach was splitting generation into multiple constrained stages.
First, the model generates human-readable job names and descriptions — things language models are naturally good at producing. The system then converts those names into stable identifiers programmatically.
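The conversion from names to identifiers can be an ordinary string function. The sketch below is one possible version, with hypothetical names (toIdentifier, assignIdentifiers) and an assumed slug format; the rule for handling duplicate titles is also an assumption.

```typescript
// Hypothetical helper: derive a stable identifier from a model-written job title.
// Example: "Fetch Open Issues!" -> "fetch-open-issues"
function toIdentifier(title: string): string {
  const slug = title
    .toLowerCase()
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[^a-z0-9]+/g, "-") // collapse everything else into single hyphens
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
  return slug.length > 0 ? slug : "job";
}

// Assign identifiers in input order, suffixing duplicates so each one is unique.
function assignIdentifiers(titles: string[]): { id: string; title: string }[] {
  const used = new Set<string>();
  return titles.map((title) => {
    const base = toIdentifier(title);
    let id = base;
    for (let n = 2; used.has(id); n++) {
      id = `${base}-${n}`;
    }
    used.add(id);
    return { id, title };
  });
}
```

Because the identifier is computed rather than generated, the same title always maps to the same identifier, and the model never has to invent one.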
Next, the model generates the relationships between those jobs, but it is constrained by a schema that only allows references to identifiers that already exist. If the model produces an invalid reference, the schema validation fails immediately.
Instead of discovering hallucinated step references during execution, the system catches them during planning. This approach turns what would normally be runtime failures into simple validation errors. In effect, the type system becomes a guardrail around the model’s output.
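The reference check itself can be small. The sketch below hand-rolls it without a schema library; the Dependency shape and validateDependencies name are hypothetical, and a schema library could express the same constraint by restricting each reference to the set of known identifiers.

```typescript
// Hypothetical relationship emitted by the model: job `to` runs after job `from`.
interface Dependency {
  from: string;
  to: string;
}

// Returns one message per invalid reference; an empty list means the plan is valid.
function validateDependencies(knownIds: string[], dependencies: Dependency[]): string[] {
  const known = new Set(knownIds);
  const errors: string[] = [];
  dependencies.forEach((dep, index) => {
    if (!known.has(dep.from)) {
      errors.push(`dependencies[${index}].from: unknown job "${dep.from}"`);
    }
    if (!known.has(dep.to)) {
      errors.push(`dependencies[${index}].to: unknown job "${dep.to}"`);
    }
    if (dep.from === dep.to) {
      errors.push(`dependencies[${index}]: job "${dep.from}" cannot depend on itself`);
    }
  });
  return errors;
}
```

A hallucinated identifier such as "send-email" in a plan that only defines "fetch-issues" and "summarize" shows up here as a single, precise error message before any job has run.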
Parallel Enrichment
Once the system produces a structured blueprint of the workspace, additional analysis can happen in parallel.
Different reasoning passes enrich the workspace plan in different ways. One pass may classify signals or triggers. Another may determine the appropriate agent behavior. A third may analyze how information should flow between jobs.
Because these steps operate on structured data instead of generated code, they can run independently and concurrently. Total latency is then bounded by the slowest pass rather than the sum of all passes, and the pipeline stays easier to reason about.
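In TypeScript, independent passes like these map naturally onto Promise.all. The sketch below uses three stand-in passes in place of model calls; the pass names, the Blueprint and Enrichment shapes, and the placeholder logic inside each pass are all assumptions.

```typescript
// Illustrative sketch: the pass names and their stand-in logic are assumptions.
interface Blueprint {
  jobs: { id: string; title: string }[];
}

interface Enrichment {
  triggers: string[];
  agentByJob: Record<string, string>;
  dataFlow: { from: string; to: string }[];
}

// Each pass reads the same blueprint and returns only its own slice of the result.
// In a real system each would wrap a model call; here they are simple stand-ins.
async function classifyTriggers(blueprint: Blueprint): Promise<string[]> {
  return blueprint.jobs.length > 0 ? ["manual"] : [];
}

async function chooseAgents(blueprint: Blueprint): Promise<Record<string, string>> {
  const agentByJob: Record<string, string> = {};
  for (const job of blueprint.jobs) {
    agentByJob[job.id] = "general-purpose";
  }
  return agentByJob;
}

async function analyzeDataFlow(blueprint: Blueprint): Promise<{ from: string; to: string }[]> {
  const flow: { from: string; to: string }[] = [];
  for (let i = 1; i < blueprint.jobs.length; i++) {
    flow.push({ from: blueprint.jobs[i - 1].id, to: blueprint.jobs[i].id });
  }
  return flow;
}

// No pass depends on another's output, so all three can be in flight at once.
async function enrichBlueprint(blueprint: Blueprint): Promise<Enrichment> {
  const [triggers, agentByJob, dataFlow] = await Promise.all([
    classifyTriggers(blueprint),
    chooseAgents(blueprint),
    analyzeDataFlow(blueprint),
  ]);
  return { triggers, agentByJob, dataFlow };
}
```

Since every pass takes the blueprint as input and none writes to it, adding a fourth pass means adding one more entry to the Promise.all call rather than reordering the pipeline.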
The Compiler
At the end of the pipeline sits the compiler itself.
The compiler takes the fully enriched workspace description and deterministically generates the configuration required to execute it. This stage contains no model calls. It is a pure function — the same input always produces the same output.
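A stripped-down sketch of such a stage is shown below. The EnrichedWorkspace and WorkspaceConfig shapes are hypothetical, and sorting jobs and dependencies into a canonical order is an assumed design choice that makes the output independent of the order in which the model happened to list things.

```typescript
// Hypothetical input: the enriched workspace description from earlier stages.
interface EnrichedJob {
  id: string;
  title: string;
  agent: string;
  dependsOn: string[];
}

interface EnrichedWorkspace {
  workspaceName: string;
  jobs: EnrichedJob[];
}

// Hypothetical output: the configuration an executor would consume.
interface JobConfig {
  id: string;
  agent: string;
  runAfter: string[];
}

interface WorkspaceConfig {
  version: 1;
  workspace: string;
  jobs: JobConfig[];
}

// Pure: no model calls, clock, randomness, or I/O, and the input is not mutated.
function compileWorkspace(input: EnrichedWorkspace): WorkspaceConfig {
  const jobs: JobConfig[] = input.jobs
    .map((job) => ({
      id: job.id,
      agent: job.agent,
      runAfter: [...job.dependsOn].sort(), // copy before sorting
    }))
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  return { version: 1, workspace: input.workspaceName, jobs };
}
```

With a function like this, a regression test can be as simple as serializing the output for a fixed input and comparing it with a stored snapshot.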
Because the compiler is deterministic, it becomes easy to test and debug. Workspaces can be inspected before execution, and engineers can reason about the transformation process the same way they would reason about any other piece of software infrastructure.
In many ways, this stage is intentionally boring. And that turns out to be exactly what you want in the part of the system responsible for orchestration.
The Lesson
Looking back, the biggest lesson from this process was surprisingly simple.
Most teams building with large language models are fighting the properties of the tool. They ask models to produce deterministic outputs. They expect them to follow exact syntax. They attempt to prevent hallucinations through instructions alone.
But language models are inherently probabilistic systems. Reliable architectures embrace that fact instead of trying to suppress it.
Models are excellent at understanding ambiguity and making contextual decisions. Traditional software systems are excellent at determinism and execution. The most reliable AI systems combine both — they let models reason about intent, and they let deterministic systems handle everything that requires precision.
That combination turned out to be the key to making agent systems behave reliably.
Check out the first two articles in this series, “What Does It Mean for AI to Do Work” and “Building AI Agent Systems is a Management Problem,” or try out Friday AI today.
This article was originally published on March 22, 2026 on Medium.



