Design trade-offs for conversational AI agents

Building reliable agentic systems involves the same trade-offs that affect any distributed system. You have to navigate a complex interplay between performance, latency, resilience, cost, and complexity.

Conversational agents bring specific restrictions to bear because you need to support a responsive user interface. You can’t make a user wait for several minutes while a question passes through your exotic network of collaborating agents that gradually reason and verify their way to a reliable response. You have a narrow window in which to return a response, and that places tight constraints on your design choices.

Your biggest constraint is time

Agent applications are among the most IO-bound systems you will ever build. The vast majority of processing time is spent waiting for responses from external services, mainly the iterative calls to the large language models (LLMs) that provide the reasoning. After that, there are calls to the data stores that support retrieval augmented generation (RAG) and to the tool implementations that integrate with remote services and databases.

All of this contributes to a vast well of latency that you have little direct control over. You can try to execute tool calls in parallel, use faster models, and throw effort into optimising your RAG knowledge bases. This will buy you a second here and a second there, but it won’t change the fundamental performance characteristics of conversational agents: they are always going to be slow as hell.
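
To make the parallelism point concrete, here is a minimal sketch, with made-up async tool functions standing in for real integrations, of running independent tool calls concurrently so the waiting time is bounded by the slowest call rather than the sum of all of them:

```python
import asyncio

# Hypothetical async tools standing in for real integrations such as a
# vector store lookup or a downstream API call.
async def search_knowledge_base(query: str) -> str:
    await asyncio.sleep(1.2)  # simulate network latency
    return f"KB results for '{query}'"

async def fetch_account_details(account_id: str) -> str:
    await asyncio.sleep(0.8)  # simulate a slow downstream service
    return f"Account details for {account_id}"

async def gather_context(query: str, account_id: str) -> dict:
    # Independent calls run concurrently, so wall-clock time is the
    # slowest call (~1.2s) rather than the total (~2s).
    kb, account = await asyncio.gather(
        search_knowledge_base(query),
        fetch_account_details(account_id),
    )
    return {"knowledge": kb, "account": account}

print(asyncio.run(gather_context("billing dispute", "ACME-42")))
```

Bear in mind that calls can only run in parallel when one result doesn’t determine the next step of reasoning, which is why this buys you seconds rather than changing the overall picture.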

This forces you into making some very careful choices. Every extra reasoning step involves a call to an expensive and slow LLM. More exotic reasoning capabilities such as query expansion or prompt routing need to be considered in the light of an already strained latency budget. A network of specialised agents collaborating to solve a problem might sound exciting, but it will bring performance and latency challenges.

This doesn’t matter for many agentic applications. For batch or unattended processes, time is much less of a constraint because nobody is waiting on a response. For conversational experiences you have no more than a few seconds before interest starts to wane. Ten seconds may be the absolute maximum wait before people lose faith in a busy UI spinner. That’s not very long when you are working with LLMs.

There are UX tricks you can play to hold the user’s attention. You should stream tokens directly to the user as they are received from the model. You can also stream “thinking” messages returned from LLMs as your agent reasons its way to a response. This requires some careful instruction in the system prompt, as you don’t want an agent to reveal too much about its internal workings or the tools available to it. Long-running operations in tools appear as “dead air” to users, so if you cannot avoid them then you should at least consider streaming hard-coded “work in progress” messages to break up the waiting time.
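
To make the streaming idea concrete, here is a rough sketch that fills the “dead air” of a slow tool call with progress messages and then streams tokens as they arrive. The model and tool functions are stand-ins, not a real client library:

```python
import asyncio
from typing import AsyncIterator

async def fake_model_stream(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a streaming LLM response (e.g. a token-by-token SSE feed).
    for token in "Here is the summary you asked for ...".split():
        await asyncio.sleep(0.1)
        yield token + " "

async def slow_tool() -> str:
    await asyncio.sleep(3)  # simulate a long-running tool call
    return "tool result"

async def respond(prompt: str) -> AsyncIterator[str]:
    # 1. Break up the wait for a slow tool with hard-coded progress messages.
    task = asyncio.create_task(slow_tool())
    while not task.done():
        yield "[still working on it...]\n"
        await asyncio.sleep(1)
    _ = task.result()
    # 2. Stream the model's tokens to the user as they are received.
    async for token in fake_model_stream(prompt):
        yield token

async def main() -> None:
    async for chunk in respond("Summarise my open tickets"):
        print(chunk, end="", flush=True)

asyncio.run(main())
```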

These interventions don’t speed up processing, but they do help to create a more fluid and interactive experience for the user. Long delays punctuated by final answers don’t promote a sense of collaboration in the same way as showing the agent’s reasoning. The overall effect is the same - it can still take 15-20 seconds for the answer to finally land - but the user is more likely to feel that they are taking part in a conversation.

Cost and token usage

LLMs are getting cheaper all the time, but they are never likely to be “cheap”. You have to pay by the word (or token) and engineers aren’t always aware of just how quickly the costs can stack up. After all, LLMs are completely stateless services that can only understand what you tell them. It often turns out that you need a whole lot of tokens to tell them properly.

A typical tool-using “ReAct” agent will work through a number of iterations as it reasons its way to a response. Each iteration requires a request to an LLM where you send the full context, which can include the system prompt, the user request, and any results from tool calls. It should also include “memory”: recent conversational interactions and any longer-term memories that may be relevant.
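
The sketch below illustrates the point with made-up message roles and a crude token estimate rather than any particular provider’s API. Because the model is stateless, the whole context is rebuilt and resent on every iteration, and it grows as tool results accumulate:

```python
def build_context(system_prompt: str, memory: list[dict],
                  user_request: str, tool_results: list[str]) -> list[dict]:
    # The model remembers nothing between calls, so the *entire* context
    # is assembled and sent on every reasoning iteration.
    messages = [{"role": "system", "content": system_prompt}]
    messages += memory  # recent turns plus any relevant long-term memories
    messages.append({"role": "user", "content": user_request})
    for result in tool_results:  # grows with every tool call
        messages.append({"role": "tool", "content": result})
    return messages

def rough_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) for m in messages) // 4  # ~4 characters per token

tool_results: list[str] = []
for step in range(3):  # three reasoning iterations for one user question
    context = build_context("You are a support agent...", [],
                            "Why was I billed twice?", tool_results)
    print(f"iteration {step}: ~{rough_tokens(context)} tokens sent")
    tool_results.append("tool call output " + "x" * 2_000)  # a verbose tool result
```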

This can give rise to a pretty chunky context that has to be sent with every iteration of reasoning. The costs can really add up and the impact on response time can be ruinous. There is even a risk of running out of runway and bumping up against model token limits. Although this has been alleviated by models that support ever-growing context windows, there remains a potentially difficult trade-off between context size, performance, and cost.
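
A back-of-the-envelope calculation shows how quickly this compounds. The per-token prices below are assumed purely for illustration; real prices vary by model and provider:

```python
# Rough cost of a single multi-step question, with assumed prices.
context_tokens = 8_000        # system prompt + memory + tool results
output_tokens = 500           # generated per iteration
iterations = 5                # reasoning steps for one user question
price_in = 3 / 1_000_000      # assumed $ per input token
price_out = 15 / 1_000_000    # assumed $ per output token

cost = iterations * (context_tokens * price_in + output_tokens * price_out)
print(f"~${cost:.2f} per question")  # roughly $0.16 with these assumptions
```

Scale that across a busy user base and it is easy to see how the costs stack up.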

Some agentic systems offset cost by automatically routing simpler queries to cheaper models. However, you need a mechanism to judge what constitutes a “simpler” request, which may itself require another call to an LLM. This approach only helps for very straightforward use cases, as in reality most conversational agents tend to lean on the advanced reasoning available in more recent, and more expensive, models.
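
A minimal routing sketch might look like the following. The model names, classifier prompt, and call_llm stub are placeholders rather than any specific provider’s API:

```python
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-reasoning-model"

def call_llm(model: str, prompt: str) -> str:
    # Stub standing in for a real chat-completion call.
    return "SIMPLE" if "one word" in prompt else f"[{model}] answer"

def classify_complexity(question: str) -> str:
    # Note the trade-off: deciding where to route is itself an extra LLM call.
    prompt = (
        "Answer with exactly one word, SIMPLE or COMPLEX. Is this question "
        "a simple factual lookup, or does it need multi-step reasoning or "
        f"tool use?\n\nQuestion: {question}"
    )
    return call_llm(model=CHEAP_MODEL, prompt=prompt).strip().upper()

def route(question: str) -> str:
    model = CHEAP_MODEL if classify_complexity(question) == "SIMPLE" else STRONG_MODEL
    return call_llm(model=model, prompt=question)

print(route("What are your opening hours?"))
```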

A larger context may help you to provide more information to a model, but it will cost more and take longer to execute. Bear in mind that this extra information is only useful if it is relevant. Some contextual information might add little more than “noise” that can undermine the reasoning capabilities of a model. Irrelevant memories and overly verbose tool results are often the main culprits for inflating context, though it can be challenging to ensure the right balance of relevance and size.

How do you judge what’s relevant without compromising on detail? Techniques that trim context are often blunt instruments that risk losing nuance. More sophisticated techniques for summarising long context tend to require extra model calls. Most solutions in this area rely on asynchronous operations to manage memory, which inevitably involves a trade-off with complexity.
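
One common pattern is to keep the most recent turns verbatim and fold older turns into a summary, running the summarisation off the critical path so the user-facing response isn’t blocked. This is a sketch only; summarise_with_llm is a placeholder for a real model call:

```python
import asyncio

async def summarise_with_llm(text: str) -> str:
    await asyncio.sleep(0.5)  # stands in for a slow summarisation call
    return f"Summary of {len(text)} characters of earlier conversation."

async def compact_memory(turns: list[str], keep_recent: int = 4) -> list[str]:
    # Keep recent turns verbatim; fold everything older into one summary
    # so the context stops growing without bound.
    if len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = await summarise_with_llm("\n".join(older))
    return [summary] + recent

async def main() -> None:
    turns = [f"turn {i}: ..." for i in range(10)]
    # Run compaction as a background task while the current question
    # is being answered with the existing context.
    compaction = asyncio.create_task(compact_memory(turns))
    # ... answer the current question here ...
    print(await compaction)

asyncio.run(main())
```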

Trading off complexity and maintainability

Agentic systems usually start simply enough, but they are prone to a creeping complexity that accelerates over time. The Pareto principle tends to apply here: 80% of the functionality of an agent can be implemented with 20% of the overall effort. The majority of the work lies in an occasionally frustrating process of optimisation as you iterate towards an agent that is reliable enough for production.

The actual code that handles agentic reasoning and calls to an LLM is often the smallest part of an enterprise-grade agent architecture. You also need solutions for concerns such as data ingestion and enrichment, chunking and embedding, prompt management, guardrails, access control, tracing, logging, the user experience, and deployment automation. All of this gives rise to a complex system with a lot of different moving parts.

Given this background, improving the performance and reliability of agents inevitably involves a trade-off with complexity and maintainability. Techniques such as query rewriting, few-shot prompting, and prompt routing can help to improve the reliability of agents, but they all add weight to the system in terms of its maintainability and cost.

It is easy to fall into the trap of devoting a huge amount of optimisation effort to marginal gains. You should only pursue optimisation in response to genuine problems observed in production. In most cases it makes more sense to start simple and build in the capacity for change.

For example, implementing specialised and collaborating agents can help you to accomplish more complex tasks, while improving the scalability and resilience of an agentic system. This comes at the cost of considerable complexity. You need to consider how these agents will communicate and coordinate while avoiding repetition, conflicting actions, and resource contention. The overall effect may be to greatly inflate both processing costs and the maintenance burden.
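
To give a flavour of what even the simplest version involves, here is a toy supervisor sketch with hypothetical specialist agents. A production system would also need shared state, conflict resolution, and retries:

```python
from typing import Callable

# Hypothetical specialists; in a real system each would be an agent with
# its own prompts, tools, and model calls.
def billing_agent(task: str) -> str:
    return f"[billing] handled: {task}"

def research_agent(task: str) -> str:
    return f"[research] handled: {task}"

SPECIALISTS: dict[str, Callable[[str], str]] = {
    "billing": billing_agent,
    "research": research_agent,
}

def supervisor(task: str) -> str:
    # In practice the routing decision is usually another LLM call,
    # adding a further round trip of latency and cost.
    key = "billing" if "invoice" in task.lower() else "research"
    return SPECIALISTS[key](task)

print(supervisor("Why is there a duplicate invoice on my account?"))
```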

Reliability and how to measure it

Any optimisation decisions should be driven by a keen understanding of the trade-offs, ideally supported by hard evidence. Metrics such as token usage, error rates, and latency are simple enough to capture, but you also need to evaluate how correct your agents are. Are they using the right data sources, returning appropriate answers, and behaving consistently, or are they inventing facts?
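
Capturing those simpler operational metrics can be as basic as attaching a trace to each request. The field names below are illustrative rather than any particular observability standard:

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    question: str
    started: float = field(default_factory=time.monotonic)
    input_tokens: int = 0
    output_tokens: int = 0
    tool_errors: int = 0

    def finish(self) -> dict:
        # Summarise the request for your logging or metrics pipeline.
        return {
            "latency_s": round(time.monotonic() - self.started, 2),
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "tool_errors": self.tool_errors,
        }

trace = AgentTrace("Why was I billed twice?")
trace.input_tokens += 8_000   # taken from the model response metadata
trace.output_tokens += 500
print(trace.finish())
```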

Evaluating correctness is not straightforward for conversational agents. Approaches based on frameworks like RAGAS or “LLM as judge” can track the accuracy of responses, but they are predicated on having a ground truth data set of questions and answers. This can be difficult to gather in practice, and it may struggle to capture the ebb and flow of a reactive and collaborative conversational exchange.
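
A bare-bones “LLM as judge” loop might look like the sketch below. The judge prompt, the 1-5 scale, and the stubs are assumptions for illustration, not the RAGAS methodology:

```python
# A tiny ground-truth set; in practice this is the hard part to gather.
GROUND_TRUTH = [
    {"question": "What is our refund window?", "expected": "30 days from purchase."},
]

def judge(question: str, expected: str, actual: str, call_llm) -> int:
    prompt = (
        "Score how well the actual answer matches the expected answer on a "
        "scale of 1 (wrong) to 5 (fully correct). Reply with the number only.\n"
        f"Question: {question}\nExpected: {expected}\nActual: {actual}"
    )
    return int(call_llm(prompt).strip())

def evaluate(agent, call_llm) -> float:
    scores = [
        judge(item["question"], item["expected"], agent(item["question"]), call_llm)
        for item in GROUND_TRUTH
    ]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    stub_agent = lambda q: "Refunds are available for 30 days."  # stands in for the agent
    stub_judge = lambda prompt: "4"                              # stands in for the judge model
    print(evaluate(stub_agent, stub_judge))                      # 4.0
```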

Capturing user sentiment can help to determine how “useful” an agent is. Simple voting buttons can provide an easy form of feedback, albeit a fairly noisy one that is prone to participation bias. A more reliable metric might be how long users spend working with conversational agents. Do they successfully complete tasks using agents? How often do they become embroiled in extended conversations with agents? What proportion of the target user base engage with the agents?

These metrics may provide a more genuine measure of the performance of a conversational agent. The focus here is less on achieving absolute accuracy, and more on being able to serve as an assistant and guide that can help to make users more productive and support them in making better decisions.

It may go without saying here, but exposing agents to real-world users is absolutely critical. There is a tendency among engineers and development stakeholders to ask relatively “soft” questions of agents that are aligned to the core use cases. Real-world users can open up more dynamic edge cases and unforeseen challenges that provide a more rigorous test of agents.

Ideally an evaluation should always be tied back to value, i.e. the real-world benefits that can be directly attributed to the agent. It’s easy to be overwhelmed by the sheer novelty of the technology and lose sight of exactly what value the agent is bringing to the table. Is it saving users time? Is it helping them to be more accurate? Is it reducing costs? Are people actually using these agents, or are they just shallow parlour tricks…?