Adam's Thoughts

You Can’t Improve What You Can’t See

We’ve spent the last year or two proving that agents work. They can hold a conversation, call a tool, chain a few steps together to get to an answer, and tackle a wide category of problems. The demos are great and we get quite a few pilots running, but the conversion into production is still low, and when agents do go live the gains seen in a small-scale PoC often just don’t get realised; or rather, maybe you don’t actually know what’s going on.

When you put your shiny new agent(s) into production at scale, across thousands of real conversations a day, helpfully someone on the business side asks some questions: is it actually any good, is it performing as well as the PoV, are users happy, am I getting value for money? And do you know the answer, or can you just say ‘well we’re spending this many tokens so it must be good’?

Those questions are a large part of the work my team and I have been tackling lately, because the honest answer, for most agentic systems running in the wild today, is “we think so.” And “we think so” is not a happy path to growth!

The next leap in agentic services isn’t just better models (indeed most use cases do not need the latest and greatest models to do well), or one more tool integration; it’s being able to see, in real detail, what your agents are actually doing and whether they’re doing it well. You cannot improve what you can’t see.

Logs tell you it’s alive, not that it’s good

Traditional observability was built to answer “is the system up?”: latency, error rates, traces, etc. All of that is still necessary for platform health for sure, but it doesn’t tell you about your ‘AI health’. An agent can be healthy by every infrastructure metric and still be quietly giving customers confident, fluent, completely wrong answers while you get a nice green dashboard.

The observability layer for an agentic platform has to operate on a completely different level and set of data. We’re no longer looking at ‘did the request succeed’, but ‘was the response correct, was it grounded, did it actually resolve what the customer came for, and would a human have done it better?’

Understanding your Agents

Moving from infrastructure level observability, there are a few key things to track to understand how your agents are doing.

Prompt effectiveness. Every prompt is a hypothesis, and the real world, as we know, often manages to break a lot of our expectations! We should be treating prompts like code: versioned, A/B tested, measured on real traffic. Which variant resolves faster? Which one hallucinates less? What may seem a “small wording tweak” can make a significant difference to performance, and a prompt that is great for one model may not do so well on another. Being able to see that data and understand the results is critical.
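As a rough illustration of treating prompts like code, here is a minimal sketch that groups logged interactions by prompt version and compares them. The `Interaction` fields and the `summarise` function are hypothetical names of my own, and it assumes something upstream (a human reviewer or an evaluator) has already labelled each interaction as resolved and/or hallucinated.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One logged conversation (hypothetical schema, for illustration only)."""
    prompt_version: str   # e.g. "v1", "v2"
    resolved: bool        # did the conversation resolve the user's request?
    hallucinated: bool    # was a hallucination flagged by review/evaluation?
    turns: int            # number of turns taken


def summarise(interactions):
    """Aggregate per-prompt-version metrics from labelled interactions."""
    stats = {}
    for i in interactions:
        s = stats.setdefault(
            i.prompt_version,
            {"n": 0, "resolved": 0, "hallucinated": 0, "turns": 0},
        )
        s["n"] += 1
        s["resolved"] += i.resolved          # bools count as 0/1
        s["hallucinated"] += i.hallucinated
        s["turns"] += i.turns
    return {
        version: {
            "n": s["n"],
            "resolution_rate": s["resolved"] / s["n"],
            "hallucination_rate": s["hallucinated"] / s["n"],
            "avg_turns": s["turns"] / s["n"],
        }
        for version, s in stats.items()
    }
```

With real traffic you would also want enough samples per variant before drawing conclusions; this sketch only does the bookkeeping.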

Accuracy and grounding. Did the answer match reality, and was it backed by something the agent could actually cite? An agent that spews nonsense will rapidly fall out of favour with your users, and even worse could cause a headache for your company.

Hallucinations. I’d say, in my experience, this is the one thing most people I talk to outside of the IT space have heard about – AIs make things up. Being able to evaluate whether your AI is hallucinating, and being able to address it, is fundamental to success.

Satisfaction. This is the metric we all know – CSAT scores – and ultimately the one that will signal success or failure. If your users are not satisfied, they won’t come back to use the agentic system again. There are a lot of ways we can track this, from the language users respond with, to the conversion rate of orders, hand-offs to human agents, and successful resolution of incidents. The key is a way to bring these together and get a grounded view.

Getting this data lets you move your agentic system from a black box to a service where you can see, in real time, how it is improving (or not) your business.

Closing the loop

Measurement on its own, though, does nothing other than inform you: we must feed our findings back into the system for continual improvement. Every conversation can become a labelled example, every failure a test case, and evaluations run continuously rather than as a static exercise before a release.
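To make "every failure is a test case" a bit more tangible, here is a small sketch. Everything in it is an assumption for illustration: `EvalCase`, `case_from_failure` and `run_evals` are made-up names, the agent is just any callable that takes a question and returns text, and the substring check is a deliberately crude stand-in for a real grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_fact: str  # text a correct answer must contain


def case_from_failure(question: str, corrected_answer: str) -> EvalCase:
    """Turn a production failure plus its human correction into a test case."""
    return EvalCase(question=question, expected_fact=corrected_answer)


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the agent passes (1.0 if there are none)."""
    if not cases:
        return 1.0
    passed = sum(
        1
        for case in cases
        if case.expected_fact.lower() in agent(case.question).lower()
    )
    return passed / len(cases)
```

Run on a schedule against the live prompt and model combination, the pass rate becomes one of the numbers you can track over time rather than a one-off gate before release.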

This is a big shift to the next stage of real-time improvement, based on what the real world throws at your system, and it can become self-learning and self-healing. The agents you ran yesterday can be making the agents you run tomorrow demonstrably better, and you should have the numbers to prove it.

Agents improving agents

By capturing rich data on what works (and doesn’t), you can look at self optimisation and improvement.

An agent that reads the eval results and proposes better prompts.

A routing layer that learns which model handles which intent best and sends traffic accordingly.

A system that notices a particular query type is drifting and flags it before a human would have spotted it.

Only by capturing all this data across your whole ecosystem can we get to this point.
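As one example of the kind of automation above, here is a minimal sketch of flagging a drifting query type. It assumes you already have a chronological list of eval scores (0 to 1) for that query type; the `is_drifting` name, the window sizes and the threshold are all arbitrary choices of mine, not recommendations.

```python
from statistics import mean


def is_drifting(scores, baseline_n=50, recent_n=20, max_drop=0.10):
    """Flag drift when the recent average falls well below the baseline.

    scores: chronological eval scores (0..1) for one query type.
    Returns False when there is not yet enough data to compare.
    """
    if len(scores) < baseline_n + recent_n:
        return False
    baseline = mean(scores[:baseline_n])   # earliest scores as the reference
    recent = mean(scores[-recent_n:])      # most recent scores
    return (baseline - recent) > max_drop
```

A real system would likely use a proper statistical test and a rolling baseline, but even a simple comparison like this can surface a problem before a human would spot it.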

Cost versus performance is a real-time decision

Not every job needs your most expensive model. Indeed, in by far the majority of scenarios I have seen, there is maybe only a 1-5% improvement in performance between a smaller, cheaper model and a newer, more expensive one. A 500% increase in price for a 5% increase in performance is something that would rarely fly, but you need the data to know! I’m not saying there aren’t cases where only a frontier model will do, but those are a special category of problems.

Without per-interaction visibility into cost and quality together, you are potentially either vastly overspending or, conversely, not realising that a bit more money could drastically increase quality.

With A/B testing, the ability to seamlessly swap between models and tuned prompts, and indeed automatic model routing, you are empowered to ensure you get the performance per price you need for your solution.
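A very simple version of cost-aware routing might look like the sketch below: pick the cheapest model that still clears the quality bar you need, based on measured data. `ModelStats` and `pick_model` are hypothetical names, and any quality or cost figures you feed in would come from your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    quality: float      # mean eval score (0..1) measured on real traffic
    cost_per_1k: float  # cost per 1,000 interactions, in your currency


def pick_model(candidates, min_quality):
    """Cheapest model meeting the quality bar; best-quality model otherwise."""
    eligible = [m for m in candidates if m.quality >= min_quality]
    if not eligible:
        # Nothing meets the bar: fall back to the highest-quality option.
        return max(candidates, key=lambda m: m.quality)
    return min(eligible, key=lambda m: m.cost_per_1k)
```

The point is less the selection logic and more that it is only possible once quality and cost are captured together, per interaction and per use case.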

Humans aren’t just a fallback

There’s a lazy framing where the human-in-the-loop is just a safety net for when the agent fails – that should not be the framing in my view. There are still many occasions where a human touch will always be needed, but it should be an empowered one: a human who understands what has happened up to that handoff, why the handoff was needed, and what gaps remain. The metrics can help reduce the need for future handoffs, and ensure that when they are needed the human has the data they need.

Every correction, override and escalation is a training signal of enormous value, and it should be captured.

Memory is part of the observable surface

Empowering your agents, both AI and human, and being able to observe that empowerment too, is the final piece of this puzzle I wanted to look at.

Agents can be stateful – they can carry memory across turns and sessions, pass data between agents, and can pull on a wealth of data or memory around previous transactions. It’s worth pointing out that memory is as capable of being subtly wrong as any other component as well, so correctness is key.

If you can’t see what an agent remembered, why it remembered it, and how that shaped a later decision, you don’t really understand your own system. Consistent, inspectable memory has to be in scope from day one to get the full picture.

In the case of a human in the loop, understanding that agent memory, the previous interactions, etc. helps to empower that human agent to make the right decisions.

The line between a demo and a platform

Pull all of this together: prompt effectiveness, accuracy, hallucination tracking, satisfaction, a real feedback loop, agents tuning agents, cost-aware routing, humans as signal, observable memory. With all this, we have described the gap between a thing that demos well and a platform you can actually run a business on.

Visibility and understanding are perhaps generally viewed as among the less glamorous parts of any platform, including agentic systems, but for me they are as important as the LLMs themselves. The teams that win the next round won’t be the ones with the flashiest agents and biggest models. They’ll be the ones who can see what their agents are doing clearly enough to make them better, every single day.

You can’t improve what you can’t see, so we’re building not just to see everything across all your systems, but to understand what it means and use it to drive improvement.

June 8, 2026
(Another) Week Open Source Got Owned

It’s been a rough week for the open source security ecosystem, the kind of week that makes you step back and really question some fundamental assumptions about the trust model that underpins modern software development. And if you’re running any kind of CI/CD pipeline, AI agent framework, or Kubernetes environment, it might have made your week rather tough too!

You might have seen the recent Veritasium video delving into the XZ hack that, if it wasn’t for some geeky levels of attention, could have been one of the worst security incidents in history! I think it did a great job of showing to the masses what was otherwise perhaps only known in the security community – but these kinds of attacks seem to be getting more frequent, and in the era of GenAI, it is getting harder and harder for human reviewers to defend against them.

Security turning against you

On March 19th, a threat actor group calling themselves TeamPCP compromised Aqua Security’s Trivy – one of the most widely used open source vulnerability scanners in the industry; a tool that thousands of organisations use to scan for vulnerabilities and trust implicitly in their pipelines.

The attack was sophisticated but followed a familiar playbook. TeamPCP had actually gained an initial foothold weeks earlier, back in late February, exploiting a misconfiguration in Trivy’s GitHub Actions environment to steal a privileged access token. Aqua discovered the first compromise around March 1st and rotated credentials – but the rotation wasn’t comprehensive enough, and the attackers retained residual access and waited a bit.

On March 19th they struck and force-pushed all but the latest version tags in the trivy-action GitHub Action repository to point at malicious commits, published a backdoored Trivy binary as v0.69.4, and compromised the setup-trivy action as well. Effectively, if you were referencing Trivy by tag, as most people do, there was a good chance your CI/CD pipeline was silently running a credential stealer. The malware harvested SSH keys, cloud provider credentials, Kubernetes configs, GitHub tokens – basically anything it could find on the CI runner – encrypted it all with AES-256 and RSA-4096, and exfiltrated it to a typosquatted domain that looked enough like the real thing to avoid a casual glance at network logs.

But here’s the most concerning part, and I suspect we are still at the tip of the iceberg here – they managed to steal GitHub access tokens to who knows how many other repositories, and now we are seeing the next wave(s) of this attack.

The cascade hits Checkmarx

By March 23rd, TeamPCP had moved on. This time the target was Checkmarx’s KICS – another infrastructure-as-code security scanner. In the space of about four hours, they hijacked 35 version tags and pushed credential-stealing payloads using a new typosquatted domain, checkmarx.zone. Same playbook, different target, different exfiltration domain – specifically designed so that anyone who had updated their monitoring after the Trivy incident and was looking for the Trivy indicators of compromise would miss this one entirely.

The evidence suggests that credentials harvested from the Trivy attack provided the access needed to compromise the Checkmarx actions. One poisoned action harvests credentials that enable the poisoning of further actions. It’s a self-sustaining supply chain worm, and the attack surface expands with every victim.

They also went after Checkmarx’s VS Code extensions on the OpenVSX marketplace, and defaced all internal repositories in Aqua Security’s aquasec-com GitHub organisation with “TeamPCP Owns Aqua Security” messages – seemingly just to show off how much access they had accumulated.

There have even been targeted attacks on specific geographies: a payload was discovered that checks if the victim system is in an Iranian timezone, and if it is, deploys a wiper that destroys Kubernetes clusters and runs rm -rf / on non-containerised hosts. Non-Iranian systems get the standard backdoor instead.

And then LiteLLM

Which brings us to today, March 24th, and the one that hits closest to home for anyone working in the AI platform space.

LiteLLM – a great tool that serves as an API gateway to over 100 LLM providers, with millions of monthly downloads – has been compromised in exactly the same fashion. Versions 1.82.7 and 1.82.8 on PyPI contain credential-stealing payloads. The 1.82.8 variant uses a .pth file, which is particularly nasty because .pth files execute automatically on every Python process startup, not just when you import the library. Simply having the package installed is enough.
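To show why the .pth mechanism matters: Python’s site module reads .pth files in site-packages at interpreter startup, and any line in them that starts with import is executed. The sketch below lists such lines so a human can review them. `executable_pth_lines` and `scan_environment` are names I’ve made up, and a hit is not proof of compromise, since legitimate packages (editable installs, for example) use the same mechanism.

```python
import sysconfig
from pathlib import Path


def executable_pth_lines(directory):
    """Return (file name, line) pairs for .pth lines Python would execute."""
    findings = []
    for pth in sorted(Path(directory).glob("*.pth")):
        for line in pth.read_text(errors="replace").splitlines():
            # site.py executes lines beginning with "import" + space or tab;
            # other lines are treated as paths to add to sys.path.
            if line.startswith(("import ", "import\t")):
                findings.append((pth.name, line.strip()))
    return findings


def scan_environment():
    """Scan the current interpreter's site-packages directories."""
    paths = sysconfig.get_paths()
    directories = {paths["purelib"], paths["platlib"]}
    results = []
    for directory in sorted(directories):
        results.extend(executable_pth_lines(directory))
    return results
```

Calling `scan_environment()` in each virtual environment or container image gives you a short list to eyeball; anything unfamiliar is worth a closer look.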

The entire package has now been pulled from PyPI – not just the compromised versions, but everything.

The target selection here is strategic in a way that should really make people sit up. LiteLLM is, by definition, the package that has access to every LLM API key in your organisation. If you’re using it as your gateway to OpenAI, Anthropic, Azure, Bedrock, and everything else, then a compromise of LiteLLM means every single one of those keys is potentially in the attacker’s hands. And because LiteLLM is a transitive dependency for a growing number of AI agent frameworks and MCP servers, plenty of people who never installed it directly were still pulling it in through something else, potentially not even realising. Thankfully it is now quarantined, and the attack window was open only for a short time.

What does this all mean?

I’ve been thinking about this a lot today, and a few things stand out.

First, the pattern here is clear: TeamPCP is deliberately targeting the security and infrastructure tools that organisations trust implicitly – vulnerability scanners, IaC security tools, and API gateways that by their nature have access to yet more credentials.

Second, we have a fundamental problem with how GitHub Actions tags work. Mutable version tags can be force-pushed to point at arbitrary commits, and most of the ecosystem references actions by tag rather than by SHA hash. This has been a known risk since the tj-actions/changed-files compromise back in March 2025, a full year ago, and yet the industry still hasn’t widely adopted SHA pinning. It’s the sort of thing where it’s easy to say “we should do this”, but in practice it rarely gets prioritised until something like this happens.
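Checking for tag references is easy to automate. Here is a minimal sketch that scans the text of a workflow file for `uses:` references that are not pinned to a full 40-character commit SHA. It is a line-based regex check rather than a real YAML parser, and `unpinned_actions` is a hypothetical helper name, so treat it as a starting point only.

```python
import re

# Matches "uses: owner/repo@ref" lines, with or without a leading "- ".
USES = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text):
    """Return action references that are not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith(("./", "docker://")):
            continue  # local actions and container images are out of scope here
        _, _, version = ref.partition("@")
        if not FULL_SHA.fullmatch(version):
            unpinned.append(ref)
    return unpinned
```

Running this over everything under `.github/workflows/` in CI gives you a list to work through; note that a SHA pin only helps if the commit you pinned was trustworthy in the first place.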

Ultimately though, supply chain security is really hard. Pinning to specific versions helps, but only if you pinned before the compromise. Lockfiles help, but only if you don’t blindly update them. A single compromised maintainer account can cascade through thousands of downstream projects in minutes, and the window between compromise and detection, while shrinking, is still wide enough for lots of damage to occur, as we are seeing. We build our platforms on top of stacks of open source dependencies, and the trust model assumes that each one of those is maintained and published securely. As we’ve seen this week, that assumption can be shattered in moments.

Thankfully in my case, I avoided issues both by SHA pinning and by keeping away from the latest and greatest release! Although I firmly believe in regular LCM, I always stay a few versions behind. However, that only covers what is under your direct control, and nested dependencies can mean you have exposure you haven’t even realised!

For those of us building platforms in this space (and I count myself very much in this group), this is a moment to pause and audit. Check your CI/CD pipelines. Check what versions of Trivy, KICS, and LiteLLM are installed in your environments. Pin your GitHub Actions to known, trusted commit SHAs. And if any system has had contact with the compromised versions, rotate everything – SSH keys, cloud credentials, API keys, the lot. But also, audit your dependencies – is that trusted package of yours pulling in something that could itself be compromised?
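For the Python side of that audit, a quick check of what is installed in a given environment can be sketched with the standard library’s `importlib.metadata`. The version list below contains only the LiteLLM versions named in this post; `COMPROMISED` and `check_installed` are my own hypothetical names, and the `get_version` parameter exists purely to make the function easy to test.

```python
from importlib import metadata

# Versions named in this post; extend from vendor advisories as needed.
COMPROMISED = {"litellm": {"1.82.7", "1.82.8"}}


def check_installed(compromised=COMPROMISED, get_version=metadata.version):
    """Return {package: version} for installed packages on the bad list."""
    hits = {}
    for package, bad_versions in compromised.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment
        if installed in bad_versions:
            hits[package] = installed
    return hits
```

Because transitive dependencies are the real risk, this needs running inside every environment, image and CI runner, not just against your top-level requirements file. A clean result now also doesn’t tell you whether a bad version was installed earlier and since replaced.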

It’s been a bad week for open source security, and as I opened with, I don’t think it’s going to get any easier in our brave new world! But incidents like this will hopefully force the industry to take a deeper look at how we can improve supply chain integrity, and perhaps something good comes out of it. Of course we said the same thing last time, and the time before that…

March 24, 2026
Clouds got a bit dark this week

It’s not been a good month for hyperscalers, or beds!

There have been two recent cloud outages: Microsoft’s global Front Door outage, which impacted not just customer services but Microsoft themselves, and this week’s AWS outage in US East 1, where a simple DNS issue on one service cascaded to take down huge numbers of services in the region, even ones not directly related, including, most critically, my own bed. It got me thinking a bit…

Yes, I’m one of those with the now somewhat famous smart mattress topper that has made itself stupidly cloud centric. I don’t even have the ‘AI powered insights’ subscription, I just use it in dumb mode where I just have to turn it on at night, set the temp, and turn it off in the morning. I can’t even set a timer without the cloud! But I couldn’t turn it off the other morning unless I just pulled the plug…

When designing highly resilient services, looking at potential points of failure and understanding their impact is crucial. I’ve spoken in the past about how on the Olympics we worked to a 5x 9s SLA, and built in layers and layers of redundancy, removing single points of failure, and at every level ensuring there were layers of contingency, and then testing every single one of them – repeatedly. The same should apply in designing your IT services, and the scale of that effort depends on the criticality of your solutions, and of course the impact to your brand.

From multi-AZ and multi-region approaches to multi-cloud, there are levels of resilience that can be built into cloud environments to mitigate the risks of outages. And of course let’s not forget your own data centres as well – hybrid approaches can also help in such scenarios, for example with distributed critical systems where the cloud is used for scalability but the core remains on prem (of course outages in your own data centre are also a thing, not just the realm of hyperscalers).

The Front Door outage the other week had global impact, so no matter how well engineered your solution was, it was going down if you relied on Front Door. But a multi-CDN approach, where Application Gateways are fronted by multiple CDNs, would have remained up. It is a trade-off: increased cost and complexity versus a higher SLA.
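The cost-versus-SLA trade-off is easier to reason about with some quick arithmetic. The sketch below converts an availability figure into minutes of downtime per year, and combines redundant components under the (strong) assumption that they fail independently. The function names are mine, and any figures you plug in are illustrative inputs, not anyone’s published SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability (e.g. 0.99999)."""
    return (1 - availability) * MINUTES_PER_YEAR


def parallel(*availabilities):
    """Availability of redundant components: down only if all are down.

    Assumes failures are independent, which shared dependencies
    (DNS, identity, a common control plane) can easily break.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1 - a)
    return 1 - all_down


def serial(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result
```

Five nines works out at a little over five minutes of downtime a year, which is why every layer needed its own contingency. Two independent front ends at three nines each would, on paper, get you to roughly six nines for that layer, but only if nothing is shared between them.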

The Amazon outage, which also impacted the AWS control plane, perhaps made it hard for people to understand the full impact and decide whether invoking DR was necessary (or indeed whether they were even able to). For clients in active multi-region set-ups, it certainly didn’t work as planned, for some of them at least!

But for my bed? Is it a mission critical system? Definitely not (well… perspective is everything!). But one would argue the brand damage here could be quite major (saying that, a lot of people are talking about it, and perhaps are curious…). However, from a design perspective, I think the solution they have built simply makes no sense whatsoever – the unit in my bedroom has by all accounts a quad core ARM processor… yet all it does is connect back to AWS USE1 for instructions. Indeed, for users who do have the subscription, it apparently sends a full 16GB of data a day!! (Anyone in the Edge AI/ML world would perhaps find it quite startling that, when all it’s doing is collating sample points from sensors, there is zero logic being applied at the edge here.)

I can see no good reason why certain functions, like turning it on or off, can’t just be processed locally – sure, I can turn it on when I’m the other side of the world too… but I’m not sure that’s useful! Limiting functions like a timer is a business decision, but again technically also odd! Not only do these decisions impact costs for the company (they are processing lots more data than they need to in the cloud when edge processing could do plenty of that lifting), but we’ve now also seen the impact of an outage on an architecture that, perhaps rightly so, is not highly redundant and multi-region active-active.

So… my point? When designing a cloud service, you need to think hard about ‘what would happen if…’ and balance the risks and costs against the very rare chance of a major outage. Yes, it’s been a bad month and outages like these are very rare, but they do happen. Being prepared for when they do is crucial, and needs to start with a fundamental understanding of your business, its applications and users, and the impact on them. A distributed cloud solution can keep your edge working in an outage (and optimise data flows with edge compute!), a resilient hybrid/multi-cloud solution can reduce the impact on your critical services, and it might not be as hard to achieve as you think!

October 22, 2025
So what’s all this about Sovereignty?
Sovereign Cloud is certainly not a new topic, but one that in recent months has made a lot more noise than usual, especially in Europe.

I’ve worked in infra and cloud for many years, and sovereign cloud itself isn’t anything that new – indeed with the rise of the hyperscalers in the 2010s, the topic of ‘who has my data/compute’ has always been there, as has the question of sovereignty.

So let’s start with the ‘simple’ question – what does sovereign mean (to you)? Because in my experience over the last 10 years or so of companies wanting sovereign cloud, it means something different to everyone.
- No one wants to give up control of their data to someone else, so therefore I need sovereign cloud? Not really!!
- I want to ensure that my data stays in my country, therefore I need sovereign cloud? Also not really (a bit more nuanced perhaps).
- I don’t want anyone outside of Europe knowing anything about my estate? The Hyperscalers are working to address this one now too.
- I simply don’t want a non-European company having anything to do with my cloud? That’s a tricky one for the hyperscalers, but they are making moves to address it now, and enter the realm that was traditionally served by companies like OVH for example.
All of these are ‘starting points’ a company might have, and as you drill down some of them may have more grounding than others in to why a business thinks they need sovereign cloud. It’s no secret that I am a Microsoft guy mainly so this post will be quite Microsoft centric, although I have worked extensively with AWS, GCP and Alibaba Cloud in my time, but the story isn’t that different for any of the hyperscalers really.

The next point I want to look at is if a company is so concerned about sovereignty, why are they looking at public cloud in the first place? Historically, private clouds don’t offer the breadth of services (especially in areas like AI capabilities), certainly can’t offer the scalability (well, you can keep buying more hardware, but that takes time), and of course tend to come with significant upfront costs. All of these things are changing, and a recent Gartner study shows that a decent proportion of CIOs are now investing in Private Cloud, bucking a trend of decreasing investment over the previous years. Private Cloud has come a long way from when it started – we just called it virtualisation back then, implementing clusters of ESX3 or Windows Server 2012 + Hyper-V, consolidating physical infrastructure and trying to get the most out of your servers. I recall one of my first VMware projects, taking racks of servers down to a single blade centre and thinking ‘how cool is this’. I digress though!

So companies are looking at private cloud, and things like Azure Local (and Azure Stack in the past), AWS Outposts and Google Distributed Cloud are all trying to let the hyperscalers play in this market (although coming from a hybrid perspective mostly), and VMware have positioned themselves in the same space too, and not just as a virtualisation platform. I’ve seen a huge pick up in interest in Azure Local for example, both as a hybrid solution, but also as a disconnected solution where a customer wants to use the Azure API they know already, and take advantage of the scale of Microsoft, while remaining entirely disconnected from public Azure.

But let’s go back to the question – what does someone really mean when they say sovereign? In my experience, with a few guardrails, public cloud actually does satisfy most companies’ needs – at least up till now. The primary concern was always around data residency, and who can have access to your data. But this was a problem long solved with guardrails and encryption to a level that would satisfy easily 90%+ of customers. The support for ‘bring your own key’, and more recently ‘bring your own HSM’ has further strengthened that by ensuring that you could easily render any data useless as well. Despite the recent noise, Microsoft Cloud for Sovereignty for example has existed for years, mostly as a set of policies. Of course, the further down this rabbit hole you go, the more expensive things get!

As we look forward, the desire to have more European centric solutions certainly changes the field for the future, with concerns raised that a foreign court could order a company to hand over information of a European company, or at least shut it down if that isn’t possible (and indeed in a well implemented public sovereign cloud, the hyperscalers cannot ever access your data). That is why all of the major hyperscalers have made announcements in the last few months around how they are going to be ‘more European’ in some way or another. In the Microsoft world, that is the new Data Guardian solution that ensures a European Sovereign Cloud customer will be exclusively managed and supported in Europe, and support for your own hardware HSMs. Then they have gone a step further in France and Germany, letting local companies run a subset version of Azure Cloud (like Azure China for example, or for those of us who were there ten years ago, the original Azure Germany instance!). These offerings are aiming to be direct competitors of companies like OVH (once they get all their government certifications), but trying to offer ‘more’ – an API compatible cloud that can co-exist with Azure and offer the broader catalogue of Azure Services. The question is, would that be enough to tempt someone over who was convinced they needed OVH in the first place?

For sure we are going to see a growth in sovereign cloud demands moving forward now, as we enter a new era of trust. I touched on AI being a major reason for cloud, with the costs of entry prohibitive otherwise; what I hope this drive to more European sovereignty doesn’t lead to is a ‘two tier’ cloud, where the major hyperscalers offer ‘less’ in their sovereign clouds than in their non-sovereign ones. I don’t see this happening myself, and of course it also represents an opportunity for the smaller European players to become more significant – till now their use cases with enterprise have always been niche, fitting more with SMEs who couldn’t invest in their own private cloud.
July 21, 2025
From the archives… It’s getting Cloudy – part 3/3 of my look back

Originally posted 16th June 2022

Moving back to the UK and starting in Cloud Engineering was another big change for me – up till this point I hadn’t worked in a formal Agile team, taking a more classic project approach, and I hadn’t done a huge amount on public Azure cloud either! What I did have was a track record of building and growing highly successful teams and delivering complex infrastructure projects.

Coming in to Cloud Services gave me the chance to learn from my team and develop my skills in Agile and SCRUM, while helping to bring fresh eyes and knowledge. A key part of all of my journey has been the opportunity to surround myself with great people and learn from them, while hopefully imparting some knowledge back! Here I had another highly multinational team spanning a good portion of the globe to work with, along with the challenges and possibilities of a fully remote team. I guess this gave me a good head start for what would come in 2020, having had 4 years of experience of working from home and managing a team spread across multiple countries and time zones who didn’t get to meet face to face very often!

It was now that I finally fully immersed myself in cloud, developing my skills from on-prem infra and virtualisation into cloud, as well as containerisation and Kubernetes. Getting to work across a multitude of customers with different requirements, as well as hugely differing views of cloud and what it meant to them, helped hone my own skills in how to view and shape cloud strategy for customers. Here we really embraced the Infrastructure as Code, data driven approach, building on the knowledge the team brought together from their own previous projects and roles to take it to a new level of automation and repeatability, allowing for rapid delivery of complex, secure cloud environments.

Over my time in the team my job title changed a few times as I took on wider scopes of responsibilities, including working with AWS and GCP. I had the opportunity to engage with a wide variety of customers across multiple industries, learning the challenges each sees with the cloud, and perhaps most importantly the problems they needed solving that cloud might be able to help with. It can be far too easy to look at what we can do with the technology available and sometimes lose track of the actual problems we needed to solve in the first place. Starting without clearly defined criteria of the real needs and problem to be solved can easily result in a solution that just keeps growing in scope and complexity and in the end doesn’t solve the initial problem.

Taking those skills I moved to our CTO team, as part of the Manufacturing Industry CTO. Here I got to look at a new industry for me, and the industry (as well as customer) specific challenges around Cloud, Edge and IoT. It’s here I started to delve in to Digital Platforms and how to take the next step in a digital transformation journey, embracing data and APIs at the heart.

And that brings me to my final role in Atos – in 2021 I took the opportunity to move back to Major Events to head up their Cloud, Infrastructure and DevOps team. This period of my career lasted just over a year, but I hope in that year I managed to make a big impact, bringing back the knowledge and ideas I had gained. I’ve said this for each part of my journey, but in each case it is true – yet again I had a great team to work with, share ideas with and learn from. It also gave me the chance to work on two further Olympic deliveries – Tokyo 2020 and Beijing 2022.

For me it was great to see the evolution since Rio 2016 and the continuation of what was started there, with the move not just to cloud, but also the shift in the applications to microservices and containerisation. Beijing also gave me the opportunity to work with a new cloud provider to me, Alibaba Cloud. The data driven, IaC approach we had adopted made it easy to consume cloud, whether it be Alibaba or anyone else, ensuring a secure and standardised approach to deployment in a multi-cloud environment.

So that’s a quick look at my path through Atos, and some of the technological evolution I’ve seen over that time. As I’ve kept this focused more on the changing technologies, there are a few areas I didn’t touch on, such as my tenures in the Scientific Community and Expert Communities. These bodies gave me the opportunity to work with some of the most brilliant people in Atos, provide my own thought leadership to our strategy and contribute to white papers; another highly formative experience for me, and yet again really based around the people.

But this brings to an end my 17-year journey in Atos: from configuring a Cisco 7200 router via a console cable to automated deployment of complex multi-cloud environments. I am hugely grateful to each and every person who I have got to work with and learn from in that time, and I hope I have left a bit of a legacy here and there!

April 15, 2025