SRE4AI: AI for SRE Is Not SRE for AI

Most software, when it breaks, has the decency to tell you. It throws an error. It returns a 500. It pages someone at 3 a.m. A stack trace points straight at the wound.

A system built on a large language model usually does none of that. It fails the way it succeeds - fluently. Same confident tone, same clean formatting, same 200 status, same healthy latency. The answer is wrong and not one signal in the stack says so. No error rate moves. No dashboard turns red. The only thing in the entire system capable of noticing is the human reading the answer.

That’s the uncomfortable property of GenAI in production: a wrong answer looks exactly like a right one. The failure leaves no trace in any of the places you’ve been trained to look.

That gap - between “the system is healthy” and “the system is right” - is what I want to talk about. It’s where most AI projects quietly die. And it’s where two ideas that sound like the same idea turn out to point in opposite directions.

Two phrases that aren’t the same thing

People say “AI” and “SRE” in the same breath now and assume they mean one thing. They mean two, and the difference is not academic.

AI for SRE is AI as the tool in the operator’s hands. It’s the wave you’ve already met: anomaly detection that flags a weird metric before you would, a copilot that reads ten thousand log lines and hands you the three that matter, natural-language queries over your observability stack, an assistant that drafts the incident summary while you’re still reading the page. It’s real and it’s useful. But notice what’s being kept alive: an ordinary software system. A payments API. A cluster. A database. The AI never has to be trustworthy itself - it just has to make a human a little faster. If it’s wrong, the engineer catches it. The reliability target hasn’t moved an inch.

SRE for AI inverts the whole relationship. Here the model, the agent, the pipeline is the production system. The thing serving customers. The thing that has to be observable and accountable when it starts going wrong. The thing that has to be trustworthy is now the AI itself.

AI for SRE makes people faster at running software. SRE for AI makes software you’d actually dare to run. One is a tool you can buy. The other is work you have to do.

Where the old playbook breaks

I spent years inside companies where reliability had a playbook, and the playbook worked. The trouble is that it assumes things AI no longer respects.

It assumes determinism. Same input, same output, every time - the quiet contract that every debugging session is built on. Ask a model the same question twice and you can get two different answers. Reproducibility stopped being free.
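One rough way to put a number on that lost reproducibility is to ask the same question several times and measure how often the answers agree. The sketch below assumes a hypothetical `ask` callable that wraps whatever model client you use, and it compares answers by exact match after light normalization, which is a deliberately crude stand-in for a real semantic comparison.

```python
from collections import Counter
from typing import Callable


def agreement_rate(ask: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Ask the same prompt `runs` times; return the share of the most common answer.

    `ask` is a hypothetical wrapper around your model client (an assumption,
    not a real library call). 1.0 means every run gave the same answer.
    """
    # Normalize lightly so trivial whitespace or casing changes do not count.
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs
```

A deterministic service would score 1.0 every time. Anything lower is a measure of how much of the old contract is gone for that prompt.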

It assumes up or down is a binary you can build a dashboard around. AI quality is a spectrum. A model can be fully up and fully wrong at once. Availability stopped meaning correctness, and most monitoring only watches the one that no longer tells the truth.

And it assumes cost is roughly fixed - some CPU per request, give or take. AI runs on tokens, and tokens are money, per request, every request. A bug that used to waste a few cycles now writes itself an invoice. An agent stuck in a loop isn’t a performance problem. It’s a bill.
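A minimal sketch of what a cost ceiling for a looping agent could look like. Everything here is illustrative: `TokenBudget`, `run_agent`, the `step` callable, and the limits and price are assumptions, not part of any real agent framework.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when one request exceeds its token or step budget."""


@dataclass
class TokenBudget:
    # Hypothetical limits; real values depend on your model and pricing.
    max_tokens: int
    price_per_1k_tokens: float
    used_tokens: int = 0

    def charge(self, tokens: int) -> None:
        """Record the tokens of one model call; stop the run once over budget."""
        self.used_tokens += tokens
        if self.used_tokens > self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used_tokens} tokens, budget was {self.max_tokens}"
            )

    @property
    def cost(self) -> float:
        """Money spent so far, in whatever currency the price is quoted in."""
        return self.used_tokens / 1000 * self.price_per_1k_tokens


def run_agent(step, budget: TokenBudget, max_steps: int = 20) -> int:
    """Call step() until it reports done, the budget is hit, or max_steps pass.

    `step` is a hypothetical callable returning (tokens_used, done).
    Returns the number of steps taken on success.
    """
    for i in range(max_steps):
        tokens, done = step()
        budget.charge(tokens)
        if done:
            return i + 1
    raise BudgetExceeded(f"no result after {max_steps} steps")
```

The point of the design is that the loop fails loudly and early, as an exception you can count and alert on, instead of showing up at the end of the month as a line item.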

There’s one more, and it’s the one I keep coming back to: you don’t own the whole stack anymore. I’ve written before about models that get quietly worse underneath you, with no deploy on your side and no way to roll back. That’s the same disease, seen from the operations chair. Your system’s behavior can shift while your code stays byte-for-byte identical.

So you take a discipline built for loud, deterministic, fully-owned systems and you point it at something silent, probabilistic, and rented. Of course it leaks.

What I think SRE for AI actually is

Reliability engineering has its canonical signals. Google’s four golden ones - latency, traffic, errors, saturation. The RED method, the USE method, whatever your team swears by. They’re all good, and the moment you point them at a model they all share the same blind spot: every one of them can read perfect while the answer is perfectly wrong.

So the job is to watch the things traditional operations never had to. Is the answer right? Is it grounded in the source it claims, or invented? Has the behavior drifted since last week? What did this cost? Did it try to do something it shouldn’t? None of those have a place on a standard dashboard, and all of them are now reliability signals.

Which means the health check changes shape. “99.9% available” turns into “95% of answers pass our evals.” Continuous evaluation becomes the heartbeat - the same question you already know the answer to, asked over and over, watching for the day the answer changes. Observability has to go semantic: you stop tracing only the request and start tracing the reasoning, the context that was retrieved, the tools that were called, the path the thing took to get to its wrong answer. You canary on quality, not just on errors. You treat guardrails - validation, injection defense, cost ceilings - as load-bearing infrastructure rather than a security side-quest. And you write runbooks for failures that will never throw a stack trace, because the worst time to figure out what to do about a hallucinating model in production is while it’s happening.
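As a rough illustration of that heartbeat, here is a sketch of a golden-set check that turns eval results into a pass rate and compares it to a quality target. The questions, the expected answers, the `ask` callable, and the 95% threshold are all placeholders, and the substring match is the simplest possible grader, standing in for whatever eval method you actually trust.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical golden set: questions whose correct answer is already known.
GOLDEN_SET: Sequence[Tuple[str, str]] = [
    ("What is the refund window in days?", "30"),
    ("Which region hosts EU customer data?", "eu-west-1"),
]


def pass_rate(
    ask: Callable[[str], str],
    golden: Sequence[Tuple[str, str]] = GOLDEN_SET,
) -> float:
    """Ask every golden question and return the fraction answered correctly.

    `ask` is a hypothetical wrapper around the production model endpoint.
    An answer passes if it contains the expected string (case-insensitive).
    """
    passed = sum(
        1 for question, expected in golden
        if expected.lower() in ask(question).lower()
    )
    return passed / len(golden)


def should_page(rate: float, slo: float = 0.95) -> bool:
    """Page when the eval pass rate drops below the quality SLO."""
    return rate < slo
```

Run on a schedule, `pass_rate` plays the role of the uptime probe, and `should_page` is the alert rule. The same function can compare a canary against the current version before traffic shifts.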

You can’t page on a stack trace that was never thrown. So you learn to page on the answer being wrong.

Why this is worth caring about now

The distance between an AI pilot and AI that runs reliably in production is not a model gap. It’s an operations gap. The model is rarely why these projects fail. The absence of any discipline around the model is.

Regulation is starting to say the same thing out loud. In May the EU provisionally agreed to push the AI Act’s high-risk deadlines out to 2027 and 2028 - not because the problem shrank, but because almost nobody was ready. The standards weren’t finished. The tooling wasn’t there. The discipline wasn’t there. A delay like that isn’t a reprieve - it’s a measure of how far behind everyone still is. When the deadlines do land, “show that your AI behaves, and prove it” stops being a virtue and becomes table stakes - and proving it is an operations problem, not a paperwork one.

What’s next

This is the first of a series, and on purpose it only drew the map. The rest walks it. I want to get into observability for AI and why those familiar signals start to mislead you the moment a model is involved. SLOs you can’t hang on uptime. Incident response when there’s no stack trace and you can’t roll back the model. Cost as a first-class reliability signal. Drift, and how to notice the model changing before your users do it for you.

Every one of those is the same argument from a different angle: AI is infrastructure now, and infrastructure has a discipline.

Classic SRE earned us the right to trust the silence. Green meant good. You could sleep. AI took that back - it can be broken and quiet at the same time, and it will let you sleep right through it.

The work now is making the failures loud again. On our terms, in our dashboards, before a customer does it for us.

That’s the job. The rest of this series is how.