
The Agentic Infrastructure Stack: What's Real and What's Hype

AI Agents, Infrastructure, Platform Engineering, Hype

Everyone is talking about AI agents for infrastructure. Having built several of them, here is an honest assessment of what works, what is hype, and what is next.

What Works Today

Anomaly detection plus known remediation. If the failure mode is known (OOMKill, crash loop, certificate expiry, connection pool exhaustion), an agent can detect it from metrics and logs and apply the fix with high confidence. In my experience, this covers the majority of production incidents.
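As a rough illustration of the "known failure mode, known fix" pattern, here is a minimal sketch. The failure-mode labels, the action names, and the 0.9 confidence threshold are all hypothetical placeholders, not any real tool's API; the point is only that the agent acts when the failure is in the table and the detector is confident, and otherwise does nothing.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from known failure modes to remediation actions.
# The action names are illustrative placeholders, not real commands.
KNOWN_REMEDIATIONS = {
    "OOMKilled": "raise_memory_limit",
    "CrashLoopBackOff": "rollback_last_deploy",
    "CertificateExpiring": "renew_certificate",
    "ConnectionPoolExhausted": "recycle_connection_pool",
}


@dataclass
class Signal:
    failure_mode: str   # label produced by an upstream detector (assumed)
    confidence: float   # detector confidence in [0, 1]


def choose_remediation(signal: Signal, min_confidence: float = 0.9) -> Optional[str]:
    """Return an action only for known failure modes detected with high confidence."""
    if signal.confidence < min_confidence:
        return None  # not confident enough: leave it to a human
    # Unknown failure modes fall through to None as well.
    return KNOWN_REMEDIATIONS.get(signal.failure_mode)
```

Returning `None` for anything outside the table is the important design choice here: the agent's scope is exactly the set of failure modes someone has already reasoned about.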

Report generation. Turning structured data (coverage reports, build logs, deployment history) into actionable recommendations is a strong AI use case. The input is well-defined, the output is structured, and the quality can be validated.
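One way to read "the quality can be validated" is that generated recommendations can be checked against the structured input they were derived from. The sketch below assumes a hypothetical shape for both sides: coverage data as a mapping from file path to coverage ratio, and each recommendation as a dict with `file` and `action` keys.

```python
def validate_recommendations(coverage, recommendations):
    """Return the recommendations that fail validation against the input data.

    A recommendation is rejected if it refers to a file that is not in the
    coverage report, or if it carries no concrete action.
    """
    invalid = []
    for rec in recommendations:
        path = rec.get("file")
        if path not in coverage or not rec.get("action"):
            invalid.append(rec)
    return invalid
```

A check like this catches the most common failure of generated reports, which is confidently referring to things that are not in the input.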

Migration scaffolding. Mapping resources between cloud providers is a lookup table problem enhanced by LLM reasoning for edge cases. Automated tools generate useful Terraform that saves hours of mechanical work.
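The lookup-table core of migration scaffolding can be sketched as follows. The AWS-to-GCP direction and the three resource pairs are assumptions chosen for illustration; the post does not name specific providers. Anything missing from the table is set aside as an edge case, which is where LLM reasoning or a human would take over.

```python
# Illustrative subset of a Terraform resource-type mapping (assumed direction: AWS to GCP).
AWS_TO_GCP = {
    "aws_s3_bucket": "google_storage_bucket",
    "aws_instance": "google_compute_instance",
    "aws_db_instance": "google_sql_database_instance",
}


def plan_migration(resource_types):
    """Split resource types into direct mappings and edge cases needing review."""
    mapped = {}
    needs_review = []
    for rtype in resource_types:
        if rtype in AWS_TO_GCP:
            mapped[rtype] = AWS_TO_GCP[rtype]
        else:
            needs_review.append(rtype)  # no direct equivalent in the table
    return mapped, needs_review
```

The mapping only covers resource types; translating each resource's arguments is the mechanical work the generated Terraform is meant to save.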

What Is Hype

Fully autonomous production management. No responsible team should let an AI agent make unconstrained changes to production infrastructure. The safety engineering required to prevent catastrophic mistakes is harder than the AI engineering. Every well-designed agent has multiple safety layers precisely because full autonomy is dangerous.
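To make "multiple safety layers" concrete, here is a minimal sketch of a policy gate that every proposed change would pass through before being applied. The three layers (an action allowlist, a blast-radius cap, and human approval for production) and all of the names and limits are hypothetical examples, not a description of any particular agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedChange:
    action: str
    environment: str
    affected_resources: int


# Hypothetical policy values for illustration only.
ALLOWED_ACTIONS = {"restart_pod", "scale_up", "renew_certificate"}
MAX_AFFECTED_RESOURCES = 5


def evaluate(change: ProposedChange, human_approved: bool = False) -> str:
    """Return 'apply', 'needs_approval', or 'reject' for a proposed change."""
    # Layer 1: only pre-approved action types are ever considered.
    if change.action not in ALLOWED_ACTIONS:
        return "reject"
    # Layer 2: cap the blast radius of any single change.
    if change.affected_resources > MAX_AFFECTED_RESOURCES:
        return "reject"
    # Layer 3: production changes always wait for a human.
    if change.environment == "production" and not human_approved:
        return "needs_approval"
    return "apply"
```

Note that the gate is ordinary deterministic code with no model in the loop, which is what makes it possible to reason about.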

"Just describe your infrastructure in English." Natural language to infrastructure code sounds amazing in demos. In practice, infrastructure decisions require precision that natural language does not provide. "Create a database" has a thousand implicit decisions: engine, version, instance size, backup policy, network placement, encryption, IAM permissions. The agent cannot guess all of them.
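The gap between "Create a database" and a deployable specification can be shown with a small sketch. The `DatabaseSpec` type below is hypothetical; its fields are simply the decisions listed above. Whatever the prompt leaves unset is a decision someone still has to make explicitly.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatabaseSpec:
    # Each field is one of the implicit decisions; None means "not decided yet".
    engine: Optional[str] = None
    version: Optional[str] = None
    instance_size: Optional[str] = None
    backup_policy: Optional[str] = None
    network_placement: Optional[str] = None
    encryption: Optional[str] = None
    iam_permissions: Optional[str] = None


def unresolved_decisions(spec: DatabaseSpec) -> List[str]:
    """List the decisions the request left open, in field order."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]
```

For a request like "create a Postgres database", `unresolved_decisions(DatabaseSpec(engine="postgres"))` still reports six open decisions. A reasonable agent would ask about them rather than guess.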

What Is Next

Hybrid human-AI incident response. The agent handles the first minute: correlation, enrichment, and an initial root cause analysis (RCA). Then it presents a context-rich summary to the human, who makes the final decision. The goal is not replacing SREs but giving them superpowers.
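A simplified sketch of that first-minute handoff follows. The `Alert` shape, the 60-second window, and the heuristic of treating the earliest alert as the suspected origin are all assumptions made for illustration; real correlation is considerably more involved. What matters is the last field: the summary never contains a decision.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    name: str
    timestamp: float  # seconds since epoch


def summarize(alerts, window_seconds: float = 60.0) -> dict:
    """Group alerts from the first `window_seconds` by service for human review."""
    if not alerts:
        return {"services": {}, "suspected_origin": None, "decision": "pending_human"}
    ordered = sorted(alerts, key=lambda a: a.timestamp)
    start = ordered[0].timestamp
    by_service = defaultdict(list)
    for alert in ordered:
        if alert.timestamp - start <= window_seconds:
            by_service[alert.service].append(alert.name)
    return {
        "services": dict(by_service),
        # Naive heuristic (assumption): the earliest alert points at the origin.
        "suspected_origin": ordered[0].service,
        # The agent never fills this in; a human does.
        "decision": "pending_human",
    }
```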

Continuous optimization. The next step is agents that continuously analyze and improve infrastructure configuration rather than only responding to incidents. Tracking whether each optimization actually improved performance over time closes the feedback loop.
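Closing the feedback loop can be as simple as comparing a metric before and after each change. The sketch below assumes a lower-is-better metric such as p95 latency and a hypothetical 5% threshold; comparing means is a deliberate simplification, and a real system would want a proper statistical test.

```python
from statistics import mean


def optimization_outcome(before, after, min_improvement: float = 0.05) -> str:
    """Compare a lower-is-better metric before and after a change.

    Returns 'keep', 'revert', or 'inconclusive'.
    """
    baseline = mean(before)
    current = mean(after)
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a relative change")
    # Positive means the metric went down, i.e. things got better.
    relative_improvement = (baseline - current) / baseline
    if relative_improvement >= min_improvement:
        return "keep"
    if relative_improvement <= -min_improvement:
        return "revert"
    return "inconclusive"
```

Recording the outcome next to each change gives the agent a history of which optimizations held up, not just which ones it attempted.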