Building Reliable AI Agents: The 12-Factor Approach Explained
Many developers have been on the journey of building AI agents. The initial phase is often marked by rapid progress using existing libraries. It's common to reach an 80% quality threshold, enough to get stakeholders excited and secure more resources for the project. However, pushing past that 80% quality bar proves to be a significant challenge.
You might find yourself deep within a call stack, trying to reverse-engineer how a prompt is constructed or how tools are passed in. This frustration often leads to discarding the initial work and starting from scratch.
A crucial realization is that not every problem is a good fit for an agent. For example, you might try to build a DevOps agent to handle a project's build process. After hours of meticulously detailing every step in the prompt, you might realize that a simple bash script could have accomplished the same task in under two minutes.
The Emergence of New Patterns
After discussions with over a hundred founders, builders, and engineers, certain patterns in building reliable agents became clear. Most production-grade "agents" were not purely agentic; they were sophisticated software systems. The most successful applications were not greenfield rewrites but were built by applying small, modular concepts to existing code.
This approach doesn't require a deep AI background; it's rooted in solid software engineering principles. Just as Heroku once defined the principles for building cloud-native applications, a similar set of principles is needed for building AI-native applications. This led to the creation of the "12 Factors of AI Agents," a concept that has resonated with many in the development community.
This article is not an anti-framework manifesto. Instead, think of it as a wish list—a set of feature requests for how frameworks can better serve the needs of builders who require high reliability while still moving fast. Let's rethink from first principles how we can apply decades of software engineering wisdom to the practice of building truly reliable agents.
Factor 1: Structured Output is Magic
The most powerful capability of Large Language Models (LLMs) isn't about loops or tools; it's about turning unstructured text into structured data. The ability to transform a simple sentence into a clean JSON object is a foundational piece you can integrate into your applications today.
For example, an LLM can take this input:
"I need to schedule a meeting with the design team for tomorrow at 3 PM to review the new mockups."
And convert it into this structured output:
```json
{
  "intent": "schedule_meeting",
  "participants": ["design team"],
  "time": "tomorrow at 3 PM",
  "topic": "review new mockups"
}
```
What you do with that JSON is what the other factors are for, but this transformation is the first, most critical step.
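In code, that step can be as small as asking the model for JSON in a known shape and parsing it. Here is a minimal sketch, where callLLM is a hypothetical stand-in for whatever model client you actually use:

```javascript
// callLLM is a hypothetical helper: send a prompt to your model client, get text back
async function extractIntent(message) {
  const prompt = `Extract the user's intent from the message below.
Respond with ONLY a JSON object with keys: intent, participants, time, topic.

Message: "${message}"`;

  const raw = await callLLM(prompt); // assumed: returns the model's text completion
  return JSON.parse(raw);            // from here on, it's just data your code can act on
}

const parsed = await extractIntent(
  "I need to schedule a meeting with the design team for tomorrow at 3 PM to review the new mockups."
);

if (parsed.intent === "schedule_meeting") {
  // deterministic code takes over: check calendars, create the event, and so on
}
```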
Factor 4: "Tool Use" is Harmful (As a Concept)
This is a controversial but crucial point. The idea of "tool use" as a magical concept where an ethereal entity interacts with its environment is making our systems harder to build. In reality, what's happening is that our LLM is outputting JSON. We then feed that JSON to deterministic code that performs an action.
If you can get the LLM to generate structured output, you can pass it into a standard programming construct like a loop or a switch statement.
A loop might look like this:
```javascript
// Feed each tool call from the LLM's structured output into your own deterministic code
for (const toolCall of response.tool_calls) {
  const result = await runTool(toolCall); // your dispatch function, not the framework's
  context.add(result);                    // append the result to the context you own
}
```
A switch statement provides even more control:
```javascript
// Route the LLM's structured output to plain, deterministic handlers
switch (response.tool_name) {
  case "create_user":
    await createUser(response.arguments);
    break;
  case "send_email":
    await sendEmail(response.arguments);
    break;
  default:
    // You, not the framework, decide what happens with unexpected output
    throw new Error(`Unknown tool: ${response.tool_name}`);
}
```
There is nothing special about "tools." It's just JSON and code.
Owning Your Control Flow and State
We've been writing Directed Acyclic Graphs (DAGs) in software for a long time. Every if
statement is a directed graph. Code is a graph. DAG orchestrators like Airflow and Prefect have given us reliability guarantees by breaking processes into nodes.
The promise of agents was that you wouldn't have to write the DAG. You just give the LLM a goal, and it finds its way there. This is often modeled as a simple loop where the LLM determines the next step until it decides the task is complete.
However, this naive approach often fails, especially on longer workflows, because reliability degrades as the context window grows. While you can now feed millions of tokens into models like Gemini, you will almost always get tighter, more reliable results by carefully controlling and limiting the tokens you put in the context window.
A better abstraction for an agent includes four parts, as sketched below:

1. A Prompt: Instructions on how to select the next step.
2. A Switch Statement: To handle the model's JSON output.
3. A Context Builder: A system for constructing the context window.
4. A Loop: To determine when, where, and how to exit.
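Here is a minimal sketch of those four parts working together. PROMPT, buildContext, determineNextStep, deploy, and runTool are all hypothetical pieces you would own yourself:

```javascript
// A prompt, a context builder, a switch, and a loop you control
async function runAgent(state) {
  while (true) {
    const context = buildContext(state);                    // 3. context builder: you decide what the model sees
    const step = await determineNextStep(PROMPT, context);  // 1. prompt: how to pick the next step (returns JSON)

    switch (step.intent) {                                   // 2. switch: handle the model's JSON output
      case "done":
        return state;                                        // 4. loop: you decide when, where, and how to exit
      case "deploy_frontend":
        state.results.push(await deploy("frontend", step.arguments));
        break;
      default:
        state.results.push(await runTool(step));             // everything else is just another tool call
    }
  }
}
```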
By owning your control flow, you can implement essential features like breaking out of the loop, summarizing context as you go, and using an LLM as a judge to evaluate outputs. It also lets you separate the agent's execution state (current step, retries) from its business state (messages, data, approvals).
You can put your agent behind a REST API, and when a request comes in, you load the context. If the agent needs to call a long-running tool, you can interrupt the workflow, serialize the context to a database, and resume later when a callback provides the result. The agent doesn't even need to know it was paused. Agents are just software, so let's build them that way.
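One way that pause-and-resume flow can look, assuming hypothetical saveState and loadState helpers backed by your database, plus the runAgent loop sketched above:

```javascript
// When the next step is a long-running tool, stop the loop and persist everything
async function handleLongRunningStep(state, step) {
  const jobId = await startLongRunningTool(step); // assumed helper: kicks off the job, returns an id
  await saveState(jobId, state);                  // serialize execution state + business state to the DB
  // The process can exit now; nothing is held in memory
}

// Later, a callback or webhook delivers the result and the agent picks up where it left off
async function handleCallback(jobId, result) {
  const state = await loadState(jobId);           // rebuild the context from the DB
  state.results.push(result);
  return runAgent(state);                         // the agent never knows it was paused
}
```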
Factor 2: Own Your Prompts and Context
One of the first things most developers learn is the need to own their prompts. While prompt-generation primitives are a good starting point, you will eventually need to write every single token by hand to surpass a certain quality bar.
LLMs are pure functions. The only thing that determines the reliability of your agent is the quality of the tokens you get out, which is determined by the tokens you put in. The more you can experiment with your prompt structure, the more likely you are to find a highly effective solution.
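Owning the prompt can be as unglamorous as a template literal checked into your repo, where every token is yours to edit, diff, and A/B test. A small illustrative sketch:

```javascript
// A hand-owned prompt: plain string templating, versioned with your code
function buildPrompt(task, context) {
  return `You are the deployment assistant for our internal platform.

Your job is to decide the single next step for the task below.
Respond with ONLY a JSON object: { "intent": "...", "arguments": { ... } }

Task: ${task}

What has happened so far:
${context}`;
}
```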
You should also own how you build your context window. Instead of the standard messages format, you can represent the entire state of the world in a single user message and ask the LLM what to do next.
A custom trace might look like this:
```
EVENT: User wants to deploy the new feature.
STEP 1: Propose deployment plan.
  - ACTION: human_in_loop(prompt="Plan: deploy frontend, then backend. Approve?")
  - RESULT: "No, do the backend first."
STEP 2: Deploy backend.
  - ACTION: deploy(service="backend")
  - RESULT: Success.
STEP 3: Deploy frontend.
  - ACTION: deploy(service="frontend")
  - RESULT: In progress...
```
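A trace like this can be rendered from a plain list of events you keep in your own state. A minimal sketch, assuming each event has the shape { type, description, action, result }:

```javascript
// Serialize your own event list into one dense, readable user message
function buildContext(events) {
  const lines = [];
  let step = 0;
  for (const e of events) {
    if (e.type === "event") {
      lines.push(`EVENT: ${e.description}`);
    } else {
      step += 1;
      lines.push(`STEP ${step}: ${e.description}`);
      lines.push(`  - ACTION: ${e.action}`);
      lines.push(`  - RESULT: ${e.result}`);
    }
  }
  return lines.join("\n");
}
```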
If you aren't optimizing the density and clarity of the information you pass to an LLM, you are missing out on potential quality gains. Everything in making agents good is context engineering.
Handling Errors and Human Interaction
When a model makes a mistake, like calling an API incorrectly, a common approach is to feed the error back into the context and let it try again. This often leads to the agent getting stuck in a loop, losing context, and failing.
A better approach is to own your context window. When an error occurs, don't just blindly append the full stack trace. Summarize it. If a subsequent tool call is valid, clear the previous errors. You decide what to tell the model to get a better result.
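Concretely, that can mean processing failures before they ever reach the context window. A minimal sketch, assuming summarizeError and escalateToHuman are helpers you write yourself:

```javascript
// You decide how failures show up in the context window
async function executeTool(state, toolCall) {
  try {
    const result = await runTool(toolCall);
    state.events.push({ type: "tool_result", description: toolCall.name, result });
    state.consecutiveErrors = 0;
    state.events = state.events.filter((e) => e.type !== "error"); // a valid call clears earlier errors
  } catch (err) {
    state.consecutiveErrors += 1;
    if (state.consecutiveErrors > 3) {
      await escalateToHuman(err); // don't let the agent spin forever
    } else {
      state.events.push({
        type: "error",
        description: `${toolCall.name} failed`,
        result: summarizeError(err), // a summary, not the full stack trace
      });
    }
  }
}
```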
Furthermore, it's critical to design for human interaction. Instead of forcing the model to choose between a tool call and a free-form message to the user, make contacting a human just another intent it can express in its structured output. This lets the model signal things like "I'm done," "I need clarification," or "I need to escalate to a manager."
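For example, the model's output might carry an intent field that your code routes on, with notifyUser, askHuman, escalate, and dispatchTool as assumed helpers:

```javascript
// A hypothetical response shape where contacting a human is just another intent
const response = {
  intent: "request_clarification", // could also be "done_for_now", "escalate_to_manager", or a tool name
  message: "Which environment should I deploy to: staging or production?",
};

switch (response.intent) {
  case "done_for_now":
    await notifyUser(response.message); // send a summary back to the user
    break;
  case "request_clarification":
    await askHuman(response.message);   // route the question over Slack, email, or SMS
    break;
  case "escalate_to_manager":
    await escalate(response.message);
    break;
  default:
    await dispatchTool(response);       // everything else is a normal tool call
}
```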
Meet users where they are. People don't want to manage seven different ChatGPT-style tabs. Let them interact with your agents via email, Slack, Discord, or SMS.
The Power of Small, Focused Agents
The most effective systems use "micro-agents." The overall workflow is a mostly deterministic DAG, with very small, focused agentic loops (3-10 steps) embedded within it.
For instance, consider a bot that manages deployments. Most of the pipeline is deterministic CI/CD code. But at the point where a GitHub PR is merged and tests are passing, an agent takes over. It might propose deploying the front end. A human can intervene via natural language and say, "Actually, do the back end first." The agent turns this into a JSON command, and the workflow proceeds. Once the agent's task is complete, it hands control back to the deterministic system to run end-to-end tests.
This approach gives you manageable context windows and clear responsibilities.
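In code, the micro-agent is just one step inside an otherwise ordinary pipeline. A minimal sketch, reusing the hypothetical runAgent loop from earlier and assuming the deterministic steps around it:

```javascript
// A mostly deterministic pipeline with one small agentic loop in the middle
async function deployPipeline(pr) {
  await runCI(pr);                    // deterministic CI/CD code
  await mergePR(pr);                  // deterministic

  // Micro-agent: 3-10 steps, a narrow job, and a small context window
  const finalState = await runAgent({
    goal: "Deploy this change",
    events: [{ type: "event", description: `PR ${pr.id} merged, tests passing` }],
    results: [],
  });

  await runEndToEndTests(finalState); // hand control back to deterministic code
}
```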
Final Thoughts: Engineer for Reliability
Even as LLMs get smarter, the principles of engineering for reliability will remain critical. You can start with a mostly deterministic workflow and gradually sprinkle in LLMs, allowing them to handle bigger and more complex tasks over time.
A developer from a major AI product team shared a similar perspective: find something right at the boundary of what the model can do reliably, and then engineer a system to make it reliable anyway. That is how you create something magical that is better than what everyone else is building.
Key Takeaways:

* Agents are Software: You already have the skills. If you've written a switch statement and a while loop, you can build a reliable agent.
* Own Your State and Control Flow: It gives you the flexibility needed to build robust systems.
* Find the Bleeding Edge: Create better systems by carefully curating what you put into the model and how you control what comes out.
* Agents are Better with People: Find ways to let agents and humans collaborate seamlessly.
There are hard parts to building agents, but the tools we use should take away the other hard parts of software development, so we can focus our time on the hard AI parts: getting the prompts, the flow, and the tokens right.