You Can’t Improve What You Can’t See

We’ve spent the last year or two proving that agents work. They can hold a conversation, call a tool, chain a few steps together to get to an answer, and tackle a wide category of problems. The demos are great and we get quite a few pilots running, but conversion into production is still low. When agents do go live, the gains seen in a small-scale PoC often just don’t get realised; or rather, maybe you don’t actually know what’s going on.

When you put your shiny new agent(s) into production at scale, across thousands of real conversations a day, someone on the business side helpfully asks some questions: is it actually any good, is it performing as well as the PoV, are users happy, am I getting value for money? Do you know the answer, or can you only say ‘well, we’re spending this many tokens, so it must be good’?

Those questions are a large part of the work my team and I have been tackling lately, because the honest answer, for most agentic systems running in the wild today, is “we think so.” And “we think so” is not a happy path to growth!

The next leap in agentic services isn’t just better models (indeed most use cases do not need the latest and greatest models to do well), or one more tool integration; it’s being able to see, in real detail, what your agents are actually doing and whether they’re doing it well. You cannot improve what you can’t see.

Logs tell you it’s alive, not that it’s good

Traditional observability was built to answer “is the system up?”: latency, error rates, traces, etc. All of that is still necessary for platform health for sure, but it doesn’t tell you about your ‘AI health’. An agent can be healthy by every infrastructure metric and still be quietly giving customers confident, fluent, completely wrong answers while you get a nice green dashboard.

The observability layer for an agentic platform has to operate on a completely different level and set of data. We’re no longer asking ‘did the request succeed?’ but ‘was the response correct, was it grounded, did it actually resolve what the customer came for, and would a human have done it better?’

Understanding your Agents

Moving beyond infrastructure-level observability, there are a few key things to track to understand how your agents are doing.

Prompt effectiveness. Every prompt is a hypothesis, and the real world, as we know, often manages to break a lot of our expectations! We should be treating prompts like code: versioned, A/B tested, measured on real traffic. Which variant resolves faster? Which one hallucinates less? What may seem a “small wording tweak” can make a significant difference to performance, and a prompt that is great for one model may not do so well on another. Being able to see that data and understand the results is critical.

Accuracy and grounding. Did the answer match reality, and was it backed by something the agent could actually cite? An agent that spews nonsense will rapidly fall out of favour with your users, and even worse could cause a headache for your company.

Hallucinations. In my experience, this is the one thing most people I talk to outside of the IT space have heard about: AIs make things up. Being able to evaluate whether your AI is hallucinating, and to address it, is fundamental to success.

Satisfaction. This is the metric we all know, CSAT scores, and ultimately the one that will signal success or failure. If your users are not satisfied, they won’t come back to use the agentic system again. There are a lot of ways we can track this, from the language users respond with, to the conversion rate of orders, hand-offs to human agents, and successful resolution of incidents. The key is a way to bring these together and get a grounded view.

Getting this data lets you move your agentic system from a black box to a service where you can see, in real time, how it is improving (or not improving) your business.
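As a rough illustration of bringing those four signals together, here is a minimal Python sketch that rolls per-interaction records up into rates per prompt version. The `Interaction` fields, the `summarise` function and the sample data are all assumptions made for this example; a real platform would populate them from its own evaluators and feedback channels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Interaction:
    prompt_version: str   # hypothetical label, e.g. "v1" or "v2"
    resolved: bool        # did the conversation resolve the user's request?
    grounded: bool        # was the answer backed by a cited source?
    hallucinated: bool    # did an evaluator flag invented content?
    csat: int             # satisfaction score from 1 (poor) to 5 (great)


def summarise(interactions):
    """Aggregate per-interaction signals into rates per prompt version."""
    groups = defaultdict(list)
    for item in interactions:
        groups[item.prompt_version].append(item)

    summary = {}
    for version, items in groups.items():
        n = len(items)
        summary[version] = {
            "count": n,
            "resolution_rate": sum(i.resolved for i in items) / n,
            "grounding_rate": sum(i.grounded for i in items) / n,
            "hallucination_rate": sum(i.hallucinated for i in items) / n,
            "mean_csat": sum(i.csat for i in items) / n,
        }
    return summary


# Illustrative records only, not real traffic.
sample = [
    Interaction("v1", True, True, False, 4),
    Interaction("v1", False, False, True, 2),
    Interaction("v2", True, True, False, 5),
    Interaction("v2", True, True, False, 4),
]
report = summarise(sample)
for version, metrics in sorted(report.items()):
    print(version, metrics)
```

With real traffic you would want far more than a handful of records per variant before trusting any difference between versions.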

Closing the loop

Measurement on its own does nothing other than inform you: we must feed our findings back into the system for continual improvement. Every conversation can become a labelled example, every failure a test case, and evaluations should run continuously rather than as a static exercise before a release.

This is a big shift to the next stage: real-time improvement based on what the real world throws at your system, which can become self-learning and self-healing. The agents you ran yesterday can be making the agents you run tomorrow demonstrably better, and you should have the numbers to prove it.
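To make “every failure is a test case” a little more concrete, here is a minimal Python sketch. It assumes conversations are stored as dictionaries with hypothetical `question`, `agent_answer` and `human_correction` keys, and it scores answers by exact match after normalisation, which is a deliberate simplification; a real evaluation would more likely use a rubric or a judge model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str


def normalise(text):
    """Lower-case and trim so trivial formatting differences are ignored."""
    return text.strip().lower()


def harvest_cases(conversations):
    """Turn human-corrected failures into regression cases."""
    cases = []
    for convo in conversations:
        correction = convo.get("human_correction")
        if correction and normalise(correction) != normalise(convo["agent_answer"]):
            cases.append(EvalCase(convo["question"], correction))
    return cases


def run_evals(agent, cases):
    """Return the pass rate and the failing cases for an agent callable."""
    failures = [
        case for case in cases
        if normalise(agent(case.question)) != normalise(case.expected)
    ]
    pass_rate = 1.0 if not cases else 1 - len(failures) / len(cases)
    return pass_rate, failures


# Illustrative data: one answer was corrected by a human, one was not.
conversations = [
    {"question": "What is the refund window?",
     "agent_answer": "14 days", "human_correction": "30 days"},
    {"question": "Do you ship to Norway?",
     "agent_answer": "Yes", "human_correction": None},
]
cases = harvest_cases(conversations)


def old_agent(question):
    return "14 days"   # stand-in for yesterday's agent


def new_agent(question):
    return "30 days"   # stand-in for the improved agent


old_rate, old_failures = run_evals(old_agent, cases)
new_rate, new_failures = run_evals(new_agent, cases)
print(f"old pass rate: {old_rate:.0%}, new pass rate: {new_rate:.0%}")
```

Running the same harvested cases against every new prompt or model version is what gives you the numbers to prove that tomorrow’s agent is better than yesterday’s.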

Agents improving agents

By capturing rich data on what works (and what doesn’t), you can start to look at self-optimisation and improvement.

An agent that reads the eval results and proposes better prompts.

A routing layer that learns which model handles which intent best and sends traffic accordingly.

A system that notices a particular query type is drifting and flags it before a human would have spotted it.

Only by capturing all this data across your whole ecosystem can we get to this point.
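The drift example above can be sketched very simply. The Python below assumes you already have a chronological list of quality scores (between 0 and 1) per intent; the `detect_drift` function, the window size and the threshold are illustrative choices for this sketch rather than recommended values.

```python
from statistics import mean


def detect_drift(scores_by_intent, window=3, max_drop=0.10):
    """Flag intents whose recent mean quality fell below the earlier baseline.

    scores_by_intent maps an intent name to a chronological list of
    quality scores in [0, 1]. Intents with too little history are skipped.
    """
    flagged = {}
    for intent, scores in scores_by_intent.items():
        if len(scores) < 2 * window:
            continue  # not enough data to compare baseline and recent
        baseline = mean(scores[:-window])
        recent = mean(scores[-window:])
        drop = baseline - recent
        if drop > max_drop:
            flagged[intent] = round(drop, 3)
    return flagged


# Illustrative history, not real measurements.
history = {
    "billing": [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6],
    "returns": [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
    "new_intent": [0.5, 0.4],
}
alerts = detect_drift(history)
print(alerts)
```

Here only the billing intent is flagged: returns is stable, and the new intent has too little history to judge. A production version would want a proper statistical test, but even a crude comparison like this can surface a problem before a human spots it.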

Cost versus performance is a real-time decision

Not every job needs your most expensive model. Indeed, in by far the majority of scenarios I have seen, there is maybe only a 1-5% improvement in performance between a smaller, cheaper model and a newer, more expensive one. A 500% increase in price for a 5% increase in performance is something that would rarely fly, but you need the data to know! I’m not saying there aren’t cases where only a frontier model will do, but those are a special category of problems.

Without per-interaction visibility into cost and quality together, you could be vastly overspending or, conversely, not know that a bit more money could drastically increase quality.

With A/B testing, the ability to seamlessly swap between models and tuned prompts, and indeed automatic model routing, you are empowered to get the performance per price you need for your solution.
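One simple way to express that trade-off in code is to pick the cheapest model whose measured quality is within an acceptable gap of the best one. The Python sketch below assumes you have a quality score and an average cost per interaction for each model; the model names, the numbers and the `pick_model` function are hypothetical and exist only for illustration.

```python
def pick_model(stats, max_quality_gap=0.05):
    """Pick the cheapest model whose quality is close enough to the best.

    stats maps a model name to {"quality": score in [0, 1],
    "cost": average cost per interaction}.
    """
    best_quality = max(s["quality"] for s in stats.values())
    eligible = {
        name: s for name, s in stats.items()
        if best_quality - s["quality"] <= max_quality_gap
    }
    return min(eligible, key=lambda name: eligible[name]["cost"])


# Hypothetical measurements, not real benchmark data.
measured = {
    "small-model": {"quality": 0.88, "cost": 0.002},
    "frontier-model": {"quality": 0.91, "cost": 0.012},
}

# A tolerant use case accepts the small quality gap and saves money.
choice = pick_model(measured)

# A use case that needs the very best quality tightens the gap.
strict_choice = pick_model(measured, max_quality_gap=0.01)

print(choice, strict_choice)
```

The acceptable gap is a business decision, not a technical one, which is exactly why the quality and cost data need to sit side by side.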

Humans aren’t just a fallback

There’s a lazy framing where the human-in-the-loop is just a safety net for when the agent fails. That should not be the framing, in my view. There are still many occasions where a human touch will be needed, but it should be an empowered human: one who understands what has happened up to the handoff, why the handoff was needed, and what gaps remain. The metrics can help reduce the need for future handoffs, and ensure that when handoffs are needed the human has the data they need.

Every correction, override and escalation is a training signal of enormous value, and it should be captured.

Memory is part of the observable surface

Empowering your agents, both AI and human, and being able to observe that empowerment too, is the final piece of the puzzle I want to look at.

Agents can be stateful: they can carry memory across turns and sessions, pass data between agents, and draw on a wealth of data or memory about previous transactions. It’s worth pointing out that memory is as capable of being subtly wrong as any other component, so correctness is key.

If you can’t see what an agent remembered, why it remembered it, and how that shaped a later decision, you don’t really understand your own system. Consistent, inspectable memory has to be in scope from day one to get the full picture.

In the case of a human in the loop, understanding the agent’s memory, the previous interactions and so on helps to empower that human agent to make the right decisions.
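As a sketch of what inspectable memory could look like, the Python below records what was remembered, where it came from and why, and logs every read against the decision it fed into. The `AuditedMemory` class and its field names are assumptions made for this example, not a description of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    key: str
    value: str
    source: str   # where the fact came from, e.g. a conversation ID
    reason: str   # why the agent chose to remember it


@dataclass
class AuditedMemory:
    entries: dict = field(default_factory=dict)
    reads: list = field(default_factory=list)

    def remember(self, key, value, source, reason):
        """Store a value together with its provenance."""
        self.entries[key] = MemoryEntry(key, value, source, reason)

    def recall(self, key, decision):
        """Return the stored value and log which decision it influenced."""
        entry = self.entries.get(key)
        self.reads.append(
            {"key": key, "decision": decision, "hit": entry is not None}
        )
        return entry.value if entry else None

    def explain(self, decision):
        """List the memory entries that were read for a given decision."""
        return [
            self.entries[read["key"]]
            for read in self.reads
            if read["decision"] == decision and read["hit"]
        ]


memory = AuditedMemory()
memory.remember(
    "preferred_channel", "email",
    source="conversation-123", reason="user stated preference",
)
channel = memory.recall("preferred_channel", decision="send-order-update")
missing = memory.recall("loyalty_tier", decision="send-order-update")
trail = memory.explain("send-order-update")
print(channel, missing, [entry.source for entry in trail])
```

The same trail that helps you debug an agent’s decision is what a human needs to see at handoff: what the agent believed, and where that belief came from.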

The line between a demo and a platform

Pull all of this together: prompt effectiveness, accuracy, hallucination tracking, satisfaction, a real feedback loop, agents tuning agents, cost-aware routing, humans as signal, observable memory. With all this, we have described the gap between a thing that demos well and a platform you can actually run a business on.

Visibility and understanding are perhaps generally viewed as among the less glamorous parts of any platform, including agentic systems, but for me they are as important as the LLMs themselves. The teams that win the next round won’t be the ones with the flashiest agents and biggest models. They’ll be the ones who can see what their agents are doing clearly enough to make them better, every single day.

You can’t improve what you can’t see, so we’re building not just to see everything across all your systems, but to understand what it means and use it to drive improvement.
