As AI agents move from experimental demos to production systems, engineering teams are running into a familiar problem: reliability. Large language models are inherently probabilistic. A prompt that succeeds once may fail the next time, even under identical conditions. To compensate, developers often surround agent workflows with retries, fallbacks, and branching logic designed to catch unpredictable model behavior.
Over time, this creates an architectural mess. Business logic becomes tightly intertwined with inference strategies, making systems harder to maintain, test, and evolve. New research from Asari AI, MIT CSAIL, and Caltech argues that scalable agent systems require a cleaner separation of responsibilities.
Their proposed solution introduces a programming model that separates what an agent should do from how uncertainty is handled at inference time. The result is a more modular approach to building AI agents that are both reliable and easier to scale.
The reliability challenge in production agents
Most enterprise AI agents are built using a “program-in-control” model, where developers define a structured workflow and invoke LLMs for specific subtasks. This approach offers predictability and auditability, which are critical for real-world applications.
However, managing uncertainty within this structure is difficult. Developers frequently bake strategies like retries, best-of-N sampling, or self-verification directly into the workflow. Each new strategy requires restructuring the agent’s control flow, increasing complexity and technical debt.
As agents grow more sophisticated, this tight coupling between logic and inference makes experimentation costly. Teams often avoid improving reliability simply because doing so would require rewriting large portions of code.
When logic and inference get tangled
At the heart of the problem is entanglement. Agent code typically mixes two concerns: the core workflow that defines the task, and the inference-time strategy that handles uncertainty. These are fundamentally different layers, yet they often live in the same functions.
Adding a single inference-time improvement such as beam search can mean wrapping the entire workflow in loops or managing complex state machines. Switching strategies later becomes just as painful, discouraging iteration and locking teams into suboptimal designs.
The researchers behind the new framework argue that this pattern limits scalability. If changing how an agent reasons requires changing what the agent does, long-term maintenance becomes unsustainable.
A new model for agent execution
To address this, the researchers introduce a programming model called Probabilistic Angelic Nondeterminism, along with a Python framework named ENCOMPASS. The idea is straightforward: developers write the “happy path” of an agent as if every LLM call succeeds, while explicitly marking points where uncertainty exists.
These markers, implemented via a branchpoint() primitive, signal where execution could diverge. At runtime, the framework constructs a search tree over possible execution paths and applies a chosen inference strategy externally.
This means developers can swap strategies like depth-first search, beam search, or Monte Carlo tree search without modifying the underlying business logic. The workflow stays readable, linear, and easy to reason about.
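To make that concrete, here is a minimal, self-contained sketch of the idea, written from scratch rather than against ENCOMPASS's actual API. The workflow yields its points of uncertainty as candidate lists, and interchangeable runners decide how each one is resolved; every name below is an illustrative assumption.

```python
def translate_file(source: str):
    """Happy-path workflow, written as if every model call succeeds."""
    # Point of uncertainty: several plausible translations of the file.
    draft = yield [f"{source}:draft{i}" for i in range(3)]
    # Another point of uncertainty: several candidate repairs of the draft.
    repaired = yield [f"{draft}:fix{i}" for i in range(2)]
    return repaired

def run_greedy(make_workflow):
    """Cheap strategy: take the first candidate at every branch point."""
    wf = make_workflow()
    candidates = next(wf)
    try:
        while True:
            candidates = wf.send(candidates[0])
    except StopIteration as finished:
        return finished.value

def run_best_of_n(make_workflow, score):
    """Costlier strategy: pick the highest-scoring candidate at every branch point."""
    wf = make_workflow()
    candidates = next(wf)
    try:
        while True:
            candidates = wf.send(max(candidates, key=score))
    except StopIteration as finished:
        return finished.value

# The same workflow runs under either strategy; only the runner changes.
# `len` stands in for a real scorer, such as a test harness.
print(run_greedy(lambda: translate_file("Account.java")))
print(run_best_of_n(lambda: translate_file("Account.java"), score=len))
```

In the framework itself, those decision points are expressed with the branchpoint() primitive and the runtime builds the search tree; what the sketch illustrates is that the workflow body never mentions the strategy.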
Treating inference as a search problem
By framing inference as a search over execution paths, ENCOMPASS turns reliability into a runtime concern rather than an application-level one. The agent remains defined by code, while the system explores different ways of executing that code under uncertainty.
This separation enables fine-grained control over performance and cost. A low-risk internal tool might use a cheap, greedy strategy, while a customer-facing workflow could apply a more exhaustive search — all without duplicating or rewriting logic.
The result is a cleaner architecture that aligns with established software engineering principles like modularity and separation of concerns.
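In practice, that separation can turn strategy choice into deployment configuration. The profile names and budgets below are invented for illustration and are not part of the framework:

```python
# Hypothetical deployment profiles: the agent code is identical in every
# case; only the externally chosen inference strategy and budget differ.
INFERENCE_PROFILES = {
    "internal_tool":   {"strategy": "greedy",      "max_model_calls": 5},
    "batch_migration": {"strategy": "beam_search", "beam_width": 3, "max_model_calls": 50},
    "customer_facing": {"strategy": "best_first",  "max_model_calls": 200},
}

def profile_for(use_case: str) -> dict:
    """Pick a cost/quality trade-off without touching the workflow itself."""
    return INFERENCE_PROFILES[use_case]
```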
Real-world impact on legacy workflows
The researchers demonstrated this approach using a Java-to-Python code translation agent. The task involved translating files, generating inputs, and validating outputs through execution and testing.
In a traditional implementation, adding advanced inference strategies required building explicit state machines and manually managing workflow state. This obscured the logic and made the system difficult to maintain.
Using ENCOMPASS, the team achieved the same functionality by inserting branch points before LLM calls. Advanced strategies like fine-grained beam search could be applied at both file and method levels without restructuring the code. The results showed improved accuracy and better scaling behavior as inference cost increased.
Cost efficiency through smarter strategies
Inference cost remains a major concern for organizations deploying AI at scale. The research shows that smarter search strategies can outperform brute-force approaches like repeated self-refinement while using fewer resources.
In experiments involving reflective agent patterns, best-first search achieved outcomes comparable to or better than multiple rounds of refinement, at a lower overall cost. This highlights how the choice of inference strategy directly impacts both performance and budget.
By externalizing inference logic, teams can tune cost and accuracy independently, adapting to different use cases without rewriting applications.
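As a rough sketch of what that looks like, the budget-aware best-first runner below uses stand-in functions in place of real model calls and real test results, both of which are assumptions made for the example:

```python
import heapq
import random

random.seed(0)  # deterministic demo output

def propose(candidate: str) -> str:
    """Stand-in for one LLM refinement call."""
    return candidate + random.choice("abc")

def score(candidate: str) -> float:
    """Stand-in for an objective check, e.g. the fraction of tests that pass."""
    return candidate.count("a") / len(candidate)

def best_first(seed: str, budget: int) -> str:
    """Spend a fixed number of model calls on the most promising candidates."""
    frontier = [(-score(seed), seed)]            # max-heap via negated scores
    best = seed
    for _ in range(budget):                      # each iteration costs one model call
        neg, current = heapq.heappop(frontier)
        child = propose(current)
        if score(child) > score(best):
            best = child
        heapq.heappush(frontier, (-score(child), child))
        heapq.heappush(frontier, (neg, current))  # the parent can be expanded again
    return best

print(best_first("seed", budget=20))
```

A repeated self-refinement loop would spend the same budget extending a single chain; best-first spreads it across whichever candidates currently look most promising.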
Engineering considerations and limitations
While the framework reduces complexity, it doesn’t eliminate the need for thoughtful agent design. Developers still need to identify meaningful branch points and define reliable success metrics.
Scoring execution paths works well when objective validation is possible, such as running unit tests or checking structured outputs. In subjective tasks like summarization or creative generation, evaluation remains challenging.
There are also practical concerns around managing side effects. Because the framework may explore multiple execution paths, developers must ensure that external actions like API calls or database writes are handled safely to avoid duplication.
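One common way to handle this, sketched below rather than prescribed by the framework, is to record external actions during exploration and execute them only after a single winning path has been selected:

```python
from dataclasses import dataclass, field

@dataclass
class DeferredEffects:
    """Collects external actions during exploration instead of running them."""
    actions: list = field(default_factory=list)

    def record(self, description: str, fn, *args):
        self.actions.append((description, fn, args))

    def commit(self):
        # Runs exactly once, for the single path the search finally selects.
        for description, fn, args in self.actions:
            print(f"executing: {description}")
            fn(*args)

def translate_and_save(effects: DeferredEffects) -> str:
    result = "translated_module.py"
    # The search may replay this path many times; the write below only
    # happens if this path is the one that gets committed.
    effects.record("write output file", print, f"saving {result}")
    return result

effects = DeferredEffects()
translate_and_save(effects)   # exploration: nothing external happens yet
effects.commit()              # winning path chosen: effects run once
```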
What this means for scalable AI systems
As AI agents become embedded in enterprise workflows, architectural discipline will matter more than clever prompts. Hardcoding probabilistic behavior into application logic creates systems that are brittle, opaque, and difficult to govern.
Separating workflow logic from inference strategy enables cleaner testing, easier auditing, and more consistent upgrades. It also supports better governance, allowing teams to adjust AI behavior globally without touching individual agents.
This research suggests that the future of scalable AI agents lies not in making models smarter alone, but in building software architectures that manage uncertainty deliberately and transparently.

