Tracing, Evaluating, and Scaling AI Agents in Production

Building AI agents that scale? Learn how to trace, evaluate, and control LLM systems in production with real examples and sharp engineering insights.

Author

Boudhayan Ghosh, Technical Content Writer

Date

Aug 14, 2025

Editor’s Note: This blog is adapted from a talk by Rajesh Kumar Mishra, Senior Engineering Manager at Cloudera, delivered during the Build with AI meetup hosted by GeekyAnts at NIMHANS. Rajesh explored the evolving landscape of LLM agents, focusing on observability in AI workflows and how to evaluate agent-driven systems at scale. Drawing from his work on cloud-native data platforms, he shared practical insights and live demos that highlighted the challenges of understanding what AI agents are doing.

From Metro Rides to Multi-Agent Systems

Hi, I am Rajesh Kumar Mishra, also known as RUM or Ram on LinkedIn. I lead the engineering team at Cloudera, primarily focused on our data engineering platform and services across both public and private cloud deployments.
I have spent years working in large-scale systems, but the past year has marked a clear shift: a paradigm change driven by AI, LLMs, and the emerging world of agent-based applications. This talk is the second chapter in my ongoing exploration of observability in AI systems, building on an earlier session I presented at ThoughtWorks.
I like to begin with a question. How many of you think your job is at risk in the AI era? Fewer hands go up than expected. That confidence is reassuring, because I do not believe jobs are vanishing—they are evolving. The tools are different, but our desire for structure and determinism remains the same.
Humans, after all, lean toward deterministic systems. If I take the metro, I expect to reach my destination on time. But in traffic, even the best predictive models fall short. That unpredictability, that non-determinism, is the nature of AI today. Much as I become harder to predict after a couple of drinks, LLMs can generate different outputs for the same prompt depending on context, timing, or even randomness. That unpredictability makes observability critical.

Why Agent Observability Matters

Until the recent AI boom, software was built on the principle of predictability. You gave an input—say, five—and you always knew what the output would be. Whether you were tired, focused, or distracted, the software responded the same way. But in the LLM world, that changes. You give the same input, and the system might respond differently every time.
LLMs, I have realised, are a bit like people: sometimes they are sharp, sometimes they drift. That is why observability becomes so important. You need to trace what the system is doing, monitor its responses, and evaluate whether it made the right call.
At Cloudera, my team works across public and private cloud deployments, and I spend a significant part of my time on systems that support platform-as-a-service workloads. This talk is focused on observability and evaluation in agent systems—especially the kind built on top of LLMs.

A Brief Recap and the Road Ahead

This was actually my second talk on the subject. The first one—Chapter One—was presented at ThoughtWorks earlier this year. But you do not need to have seen that one to follow along here. I have tried to follow what I think of as the Markov property of talks: it does not matter where you have come from—what matters is your current state. So we start fresh.
We are in the middle of a major shift. The sooner we embrace it, the easier it becomes to adapt. We are entering the AI era, and tools like LLMs and agents are going to reshape how we work. That shift is not just technical—it is cultural. The way we think about software is changing.

Understanding LLM Apps and Agents

Let us take a moment to understand what we mean by LLM apps. If you have used tools like ChatGPT or Gemini, you already know how they work. You give them a task, they respond. Whether or not you like the response, you still keep using them—because they often get close to what you need. Hallucinations happen, but they are part of the game.
Now compare that to a real-world example. If you have ever worked with a property consultant, you know the drill. You tell them what you need, and they come back with options. You review them and make a decision. That is what an LLM app is like—it gives suggestions. But it is still you who takes action.
An agent is different. An agent is like your housemaid. You give the task—make sambar—and they handle the rest. They know the ingredients, the timing, the seasoning. They are autonomous. In AI terms, agents are powered by LLMs, but they operate independently. They decide which tools to use and in what order.

LLMs, Agents, and Tools

In a typical agent setup, the LLM acts as the brain. Prompts become the input instructions. The tools are the arms that carry out tasks—whether it is data lookup, analysis, or visualisation. And the router coordinates everything. The creativity lies in how you write your prompt and how clearly the task is defined.
The key difference between LLM apps and agents is control. An LLM app will respond to your prompt, but the final decision is yours. An agent will take the prompt, interpret the task, and execute it without asking you again. That is where complexity begins.
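To make that division of roles concrete, here is a minimal sketch in Python. The `call_llm` stub, the tool names, and the prompt format are illustrative assumptions rather than the setup used in the talk; the point is simply the structure: the LLM as the brain, the tools as the arms, and a router coordinating between them.

```python
# Minimal single-agent sketch: the LLM decides, the tools act, the router coordinates.
# `call_llm` is a placeholder for whatever model client you actually use.

def call_llm(prompt: str) -> str:
    # Stub: in a real system this would call a hosted or local model.
    return "analysis"

def lookup_tool(query: str) -> str:
    return f"rows matching '{query}'"

def analysis_tool(data: str) -> str:
    return f"summary of {data}"

TOOLS = {"lookup": lookup_tool, "analysis": analysis_tool}

def router(task: str) -> str:
    # The LLM (the "brain") picks a tool; the router validates and dispatches.
    choice = call_llm(f"Task: {task}\nPick one tool from {list(TOOLS)}").strip()
    if choice not in TOOLS:
        raise ValueError(f"LLM chose an unknown tool: {choice}")
    return TOOLS[choice](task)  # the "arms" do the work

print(router("summarise last quarter's sales"))
```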

Multi-Agent Systems and the Need for Coordination

Now imagine not just one agent, but many. Each with a different skill set. That is where multi-agent systems come in. You have a central LLM and multiple agents working together to achieve a broader goal. But for that to work, there needs to be coordination. The agents must cooperate, understand their tasks, and execute them in a meaningful sequence.
Jaya had just spoken about a system where multiple agents were involved in generating and validating answers. These setups work, but only when agents pass control smoothly from one to another. If the handoff is broken, the entire task can fail. That is why evaluating each step of an agent’s behaviour becomes so important.
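A sketch of what a clean handoff can look like, using hypothetical agent names and payloads: the coordinator passes one agent's output to the next and refuses to continue if the handoff comes back empty.

```python
# Illustrative handoff between two agents: the output of one becomes the input of
# the next, and the coordinator stops if a handoff produces nothing usable.

def research_agent(question: str) -> dict:
    return {"question": question, "draft": "LLM-generated answer draft"}

def validation_agent(payload: dict) -> dict:
    payload["validated"] = bool(payload.get("draft"))
    return payload

def run_pipeline(question: str) -> dict:
    state = research_agent(question)
    if not state.get("draft"):
        raise RuntimeError("Handoff failed: research agent produced no draft")
    return validation_agent(state)

print(run_pipeline("Why do AI projects fail?"))
```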

Why AI Projects Fail

Let me ask a rough question. Say there are a hundred AI projects running today. How many of those do you think will fail? Some say 50 percent. Others say 90 or more. As per Gartner, the number is about 85 percent. That is not a small number.
One major reason is herd mentality. People see a trend and jump in without understanding the requirements. AI projects are not like traditional software builds. If you do not understand the flow, the training data, or the infrastructure, your chances of failure increase.
The biggest problem is not the lack of data. It is feeding the wrong data. If you take unfiltered logs or irrelevant datasets and push them into your model, you are going to end up with either a rogue model or a hallucinating one. Both are dangerous.
Even large companies with massive datasets often do not know what to feed. If the data is not properly selected, the system will either train forever or converge into something meaningless. That is why so many AI experiments never reach production.

Observability in Model Development

Model development does not stop with training. You have to monitor it continuously. During development, we all deal with hyperparameter tuning. But even after deployment, that monitoring has to continue. The model needs to adapt to variations in the data.
We need tools that can trace what the model is doing—tools that help us monitor behaviour, catch mistakes, and identify drifts. That becomes even more critical when you are using hosted services. If you are running on a cloud LLM, you also need to track your token consumption, or you will burn through budgets quickly.
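One lightweight way to get that traceability is to wrap every model call and record what it cost. The sketch below is an illustration under stated assumptions: the model call is stubbed out and the token counts are crude word-count proxies, where a real system would read usage figures from the provider's response metadata.

```python
# A small tracing sketch: record every model call with its token usage so drift
# and cost can be inspected later.
import time

TRACE = []

def traced_call(model: str, prompt: str) -> str:
    response = "stubbed model output"          # placeholder for the real model call
    TRACE.append({
        "ts": time.time(),
        "model": model,
        "prompt_tokens": len(prompt.split()),  # crude proxy for illustration only
        "completion_tokens": len(response.split()),
    })
    return response

traced_call("agent-model", "Plot monthly sales as a bar chart")
print(sum(t["prompt_tokens"] + t["completion_tokens"] for t in TRACE), "tokens so far")
```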

Thinking Like a Team Captain

Let me give a simple analogy. If you were building a volleyball team, what would you look for in your players? Agility, reaction time, and awareness—qualities that help the team function as a unit. AI agents are no different. You need to evaluate how well each tool performs, how each agent interacts, and whether the overall plan is being followed.
As developers, we often stop at output. Input goes in, response comes out, done. But in an agent-based system, the process matters just as much as the result. The way the tools were invoked, the sequence they followed, and the decisions the router made—all of that needs to be checked.

Use Case: Building a Data Analyst Agent

Let me walk you through a specific use case where we built a single-agent system to perform basic data analysis. The task was simple: receive a sales dataset, interpret it, extract specific insights, and generate a visual output.
To achieve this, we built a router that could coordinate three tools. The first was a data lookup tool to access the raw file. The second was an analysis tool that would process the dataset. The third was a visualisation tool that would generate the graph. All three tools had to be called in the correct order for the system to function.
The challenge was that this order was not always followed. Sometimes the router would jump to visualisation before the data was analysed. Sometimes it skipped a step. These mistakes were not caused by broken code, but by how the LLM interpreted the task. That is the risk when working with probabilistic models.
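A simple guard against exactly that failure mode is to compare the trace of tool calls against the expected plan. The sketch below uses the three tool names from this use case, but the trace format and the check itself are illustrative, not the code from the demo.

```python
# Plan check for the data-analyst agent: the trace of tool calls is compared
# against the expected sequence, so skipped or reordered steps surface immediately.

EXPECTED_PLAN = ["lookup", "analysis", "visualisation"]

def check_plan(tool_trace: list[str]) -> bool:
    # The trace must contain exactly the expected tools, in the expected order.
    return tool_trace == EXPECTED_PLAN

print(check_plan(["lookup", "analysis", "visualisation"]))  # True: correct order
print(check_plan(["lookup", "visualisation"]))              # False: analysis skipped
```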

Evaluation: Understanding What Went Wrong

To evaluate the system, we looked at four dimensions.
First, we examined route evaluation—whether the right tools were being selected for the task. Second was skill evaluation—whether each tool was performing correctly. Third was plan evaluation—whether the task sequence was being followed. And finally, we looked at path evaluation, especially in cases where multiple agents were interacting and handing off responsibilities.
In this case, we focused on tool calling accuracy. We used a dashboard to track the percentage of errors. In one run, we observed a 12.5% error rate in tool selection. That meant the LLM was choosing the wrong tool in one out of every eight cases.
The evaluation itself was done using another model. After the traces were collected, we passed them to a judge model to score and annotate the results. If the right tools were used in the correct order, it received a score of one. If there was a mistake—such as choosing a line chart instead of a bar chart—the score dropped to zero.
This tracing mechanism allowed us to identify exactly where the system was going wrong, without needing to manually inspect every output.
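Once each trace carries a binary score, the dashboard metric is just arithmetic. The sketch below uses dummy scores, with one failure in eight runs reproducing the 12.5% figure mentioned above.

```python
# Aggregating judge scores into the dashboard metric: each trace gets a 1
# (correct tools, correct order) or 0 (wrong tool or broken sequence), and the
# error rate is the share of zeros. The scores below are dummy data.

scores = [1, 1, 1, 0, 1, 1, 1, 1]  # one failure out of eight runs

error_rate = scores.count(0) / len(scores)
print(f"Tool-selection error rate: {error_rate:.1%}")  # 12.5%
```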

How We Used LLM-as-Judge for Evaluation

To evaluate the output of our agent system, we used a second LLM as a judge. After each agent interaction was traced, the collected data was passed to this judge model, which reviewed the sequence and provided feedback. It would analyse whether the right tool was called, whether the plan was followed, and whether the final output matched the prompt.
In the successful cases, the trace showed the correct flow: lookup → analysis → visualisation. These received a clean score of one, along with an explanation. In the failed cases, the model had either called the wrong tool or misconfigured a visualisation—such as generating a line chart instead of a bar chart when the prompt clearly asked for a bar chart. These were scored zero.
This approach worked well because the judge LLM could read the traces at scale and provide a second opinion—one that was consistent and explainable. That removed the need for manual validation and made the system more scalable.
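Here is a minimal sketch of that judge loop, with the judge call stubbed out and the prompt wording invented for illustration: the trace is serialised into a grading prompt, and the returned score and explanation are stored alongside the trace.

```python
# LLM-as-judge sketch: a second model grades each agent trace with a score and
# a short explanation. `call_judge` is a stub, not a real API call.
import json

def call_judge(prompt: str) -> str:
    # Stub for the judge model; a real call would go to GPT-4 or similar.
    return json.dumps({"score": 0, "reason": "line chart produced where a bar chart was requested"})

def judge_trace(trace: dict) -> dict:
    prompt = (
        "You are grading an agent run. Score 1 if the right tools were called in the "
        "right order and the output matches the request, otherwise 0. Explain briefly.\n"
        f"Trace: {json.dumps(trace)}"
    )
    verdict = json.loads(call_judge(prompt))
    return {**trace, "judge_score": verdict["score"], "judge_reason": verdict["reason"]}

print(judge_trace({"tools": ["lookup", "analysis", "visualisation"], "chart": "line"}))
```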

Designing for Cost and Control

In the demo, we used GPT-3.5 as the agent and GPT-4 as the judge. But the principles apply to any combination. What matters is having traceability, a capped cost model, and the ability to monitor token usage. For example, you can set a hard limit of 10,000 tokens for an agent loop. Once that cap is hit, the process stops automatically. This prevents runaway costs and infinite loops.
One thing I learned was that using large-scale agent frameworks like LangChain or CrewAI can make it harder to control these fine details. These tools abstract away the logic, which sounds convenient, but often comes at the cost of visibility. For observability, it helps to write your own orchestration logic. That way, you know exactly how many tokens are consumed, which model is being invoked, and what decisions the router is making.
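Here is a rough sketch of such a cap, assuming a hypothetical `agent_step` function and made-up per-step token counts; only the 10,000-token budget comes from the talk.

```python
# Hard token cap on an agent loop: every iteration adds its usage to a running
# total, and the loop stops as soon as the budget is exhausted.

TOKEN_BUDGET = 10_000

def agent_step(state: dict) -> tuple[dict, int]:
    # Placeholder for one plan/act iteration; returns new state and tokens used.
    return state, 1_200

def run_with_cap(state: dict) -> dict:
    used = 0
    while not state.get("done"):
        state, step_tokens = agent_step(state)
        used += step_tokens
        if used >= TOKEN_BUDGET:
            state["done"] = True  # stop automatically instead of looping forever
            state["stopped_reason"] = "token budget exhausted"
    return state

print(run_with_cap({"task": "analyse sales data"}))
```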

