Professional Project: Software Engineer Co-Op · Ribbon Communications · September 2025 to December 2025
Autonomous SRE Agent System
AI-Driven Fault Detection, Localization, and Mitigation on Kubernetes
< 30s
Failure detection latency
15+
Fault scenarios simulated
4-phase
Autonomous OODA loop
Overview
At Ribbon Communications, I was brought on to research and build an agentic AI system for Site Reliability Engineering: a system that could autonomously detect failures in a cloud-native microservice environment, localize the root cause, and initiate mitigation, without requiring a human to be paged first.
The work split into two tightly coupled deliverables: the agent itself, which formed the core of the system, and the fault injection framework I built to evaluate it (a chaos engineering harness that could simulate real-world failures on demand, measure the agent's detection latency precisely, and validate its diagnostic output against known ground truth).
The agent achieved sub-30-second failure detection across all tested failure scenarios, a substantial improvement over the existing reactive detection approach, and the fault injection framework became a reusable team asset that continued in use after my co-op ended.
The Problem
Site Reliability Engineering at scale is fundamentally a reaction-time problem. When a microservice fails in production (a pod crashes, an authentication token expires, a network partition degrades inter-service communication), the clock starts immediately. Every second between failure onset and detection is a second of degraded service. Every additional second between detection and mitigation is a second of impact on users.
Traditional SRE relies on a human-in-the-loop: monitoring dashboards surface anomalies, on-call engineers are paged, engineers diagnose the failure, engineers execute remediation. Even with excellent tooling, this loop takes minutes. For a telecommunications infrastructure company like Ribbon, where service continuity is not a product quality metric but a contractual obligation, that latency is unacceptable.
The promise of an autonomous SRE agent is to compress the detect-diagnose-mitigate loop from minutes to seconds, and to do it without a human in the critical path. An AI agent that can observe telemetry continuously, recognize failure signatures immediately, reason about probable root causes, and execute a remediation playbook can respond faster than any human on-call rotation.
The hard part is building a system that does this reliably. A false positive (the agent mistakenly identifying a failure and executing mitigation on a healthy service) can cause more damage than the original failure. A misdiagnosis (correct detection, wrong root cause) leads to the wrong remediation. The agent has to be right, not just fast.
Goals
- Design and implement an agentic AI pipeline capable of detecting failures in a cloud-native Kubernetes environment within 30 seconds of onset.
- Integrate with Microsoft's AIOpsLab framework as the observability and agent evaluation substrate.
- Build a fault injection framework that could simulate 15+ real-world failure scenarios in isolated Kubernetes namespaces, serving as both the agent's testing ground and a standalone team resource.
- Validate the agent's detection latency, classification accuracy, and self-healing capability across all injected failure types.
- Ensure zero risk of production interference: all fault injection runs in strictly isolated, non-production cluster namespaces with automated guard rails.
Technical Architecture
Agent Design: Observe, Orient, Decide, Act
The autonomous SRE agent is structured around a continuous four-phase loop that mirrors the OODA decision cycle used in operational systems: Observe, Orient, Decide, Act.
Observe. The agent continuously polls telemetry from the Kubernetes cluster through AIOpsLab's observability interface. The telemetry stream includes: pod health status, container restart counts, CPU and memory utilization per pod, inter-service request error rates, request latency distributions (p50/p95/p99), and Kubernetes event logs (crash loops, OOM kills, scheduling failures, node pressure events). Polling runs on a tight interval tuned to the 30-second detection target. The interval is short enough to catch fast-onset failures, and metrics are aggregated over a window so that transient noise can be separated from genuine anomalies.
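The idea of aggregating over a window to filter transient noise can be illustrated with a small rolling-window sketch. This is a minimal illustration, not the agent's actual code: the RollingWindow class, the window size of 3, and the 0.20 error-rate threshold are all assumptions chosen for the example.

```python
from collections import deque


class RollingWindow:
    """Keeps the most recent `size` samples of a single metric (hypothetical helper)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.samples = deque(maxlen=size)

    def add(self, value: float) -> None:
        self.samples.append(value)

    def is_anomalous(self, threshold: float) -> bool:
        # Judge only on a full window, and on the window mean,
        # so a single spike cannot trip the detector on its own.
        if len(self.samples) < self.size:
            return False
        return sum(self.samples) / self.size > threshold


window = RollingWindow(size=3)

for error_rate in [0.01, 0.40, 0.02]:  # one transient spike
    window.add(error_rate)
transient = window.is_anomalous(threshold=0.20)

for error_rate in [0.35, 0.40, 0.45]:  # sustained failure
    window.add(error_rate)
sustained = window.is_anomalous(threshold=0.20)
```

The single spike leaves the window mean below the threshold, while the sustained run pushes it above. A larger window filters more noise at the cost of slower detection.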
Orient. Raw telemetry is passed to an LLM-backed reasoning layer built with LangChain. The model receives a structured snapshot of current system state (formatted telemetry, recent event history, and the service dependency graph) and orients itself by identifying which signals are anomalous and which services are affected. This is the classification step: is this a pod crash, a network degradation, an authentication failure, a resource exhaustion event?
The LangChain pipeline is designed as a tool-use agent. Rather than producing a free-form diagnosis, the model invokes structured diagnostic tools to gather additional signal before rendering a classification: get_pod_logs(service, namespace), describe_pod_events(pod_name), check_service_endpoints(service), and query_error_rate(service, window_seconds). Grounding the diagnosis in real cluster data significantly reduces hallucination in the diagnostic output.
Decide. Based on the classification, the agent selects a remediation action from a predefined playbook. The playbook maps failure types to remediation procedures:
| Failure Classification | Remediation Action |
|---|---|
| Pod crash loop | Restart pod, analyze logs for root cause signal |
| Authentication token expiry | Trigger token refresh procedure for affected service |
| Memory exhaustion (OOM kill) | Scale up pod memory limits, alert for capacity review |
| Network partition / packet loss | Reroute traffic via healthy endpoints, quarantine affected node |
| Service dependency timeout | Enable circuit breaker, redirect to fallback service |
| Disk pressure on node | Evict low-priority pods, trigger volume cleanup |
The decision layer also performs a confidence check before acting. If the model's confidence in its classification falls below a threshold (measured by the consistency of the diagnostic tool outputs), the agent flags the anomaly for human review rather than executing an automated remediation. This prevents low-confidence misdiagnoses from triggering incorrect actions.
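The playbook lookup and confidence gate can be sketched as a plain mapping plus a threshold check. This is a simplified illustration under stated assumptions: the Fault enum, the action names, the 0.8 threshold, and the decide function are hypothetical and are not taken from the actual system.

```python
from enum import Enum


class Fault(Enum):
    POD_CRASH_LOOP = "pod_crash_loop"
    AUTH_TOKEN_EXPIRY = "auth_token_expiry"
    OOM_KILL = "oom_kill"
    NETWORK_PARTITION = "network_partition"
    DEPENDENCY_TIMEOUT = "dependency_timeout"
    DISK_PRESSURE = "disk_pressure"


# Hypothetical action identifiers mirroring the playbook table above.
PLAYBOOK = {
    Fault.POD_CRASH_LOOP: "restart_pod",
    Fault.AUTH_TOKEN_EXPIRY: "refresh_token",
    Fault.OOM_KILL: "scale_up_memory_limits",
    Fault.NETWORK_PARTITION: "reroute_traffic",
    Fault.DEPENDENCY_TIMEOUT: "enable_circuit_breaker",
    Fault.DISK_PRESSURE: "evict_low_priority_pods",
}

CONFIDENCE_THRESHOLD = 0.8  # assumed value; the real threshold is not stated


def decide(fault: Fault, confidence: float) -> str:
    """Return a playbook action, or escalate when confidence is too low."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "escalate_to_human"
    return PLAYBOOK[fault]


confident_action = decide(Fault.OOM_KILL, confidence=0.95)
uncertain_action = decide(Fault.OOM_KILL, confidence=0.50)
```

Keeping the playbook as data rather than branching logic means a new failure class needs only a new entry, and the gate applies uniformly to every action.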
Act. Remediation actions are executed through the Kubernetes API: pod restarts, resource limit updates, replica scaling, namespace-scoped network policy changes. All actions are logged with a full audit trail: timestamp, triggering anomaly, classification, confidence score, action taken, and post-action health check result.
AIOpsLab Integration
Microsoft's AIOpsLab framework provides the infrastructure for agent evaluation in cloud-native environments. It defines a standardized interface for injecting workloads, observing system behavior, and measuring agent responses. Integrating with AIOpsLab meant the agent's evaluation was methodologically rigorous: detection latency was measured from the moment a fault was injected to the moment the agent's first diagnostic action fired, using AIOpsLab's timestamped event log rather than wall-clock approximations.
AIOpsLab also provides a multi-service microservice benchmark environment (a reference application with realistic inter-service dependencies, traffic patterns, and failure modes) which the fault injection framework targets.
Fault Injection Framework
The fault injection framework is a Python-based harness I built to serve two purposes simultaneously: to give the SRE agent realistic failures to detect, and to give the team a reusable tool for testing any future reliability tooling against a controlled set of failure scenarios.
Scenario Library
The framework implements 15+ parameterized fault scenarios organized into four categories:
Pod-level faults:
- Pod crash (immediate kill via kubectl delete pod)
- Pod crash loop (repeated restart with increasing backoff simulation)
- OOM kill (memory pressure injection via resource limit reduction)
- Container image pull failure (corrupt image reference injection)
Network-level faults:
- Inter-service packet loss (injected via network chaos tooling on specific cluster network interfaces)
- Latency spike (artificial delay injection on service-to-service calls)
- Network partition (network policy rules blocking traffic between specific service pairs)
- DNS resolution failure (CoreDNS manipulation for target service)
Authentication faults:
- JWT token expiry (clock skew injection causing token validation failures)
- Invalid credential propagation (secret rotation simulation with stale references)
- Service account permission revocation
Resource faults:
- CPU throttling (cgroup limit reduction on target pod)
- Disk pressure (volume fill to trigger node disk pressure condition)
- Replica scale-down (forced reduction of deployment replicas below minimum viable count)
Each scenario is implemented as a repeatable, parameterized script. Any engineer can trigger a specific fault type against a target service with a single command, specifying duration and intensity. The framework automatically tears down the injected fault after the configured window and verifies that the cluster returns to a healthy baseline before marking the run complete.
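One way to guarantee that an injected fault is always removed is to wrap each scenario in a context manager. The sketch below assumes a hypothetical FaultSpec record and an in-memory FakeCluster standing in for the real Kubernetes calls; it shows the parameterize-inject-teardown shape only, and post-teardown health verification is left out.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultSpec:
    """Parameters for one injection run (hypothetical field names)."""
    fault_type: str
    target_service: str
    duration_seconds: int
    intensity: float  # 0.0 (mild) to 1.0 (severe)


class FakeCluster:
    """In-memory stand-in for the real cluster, for illustration only."""

    def __init__(self) -> None:
        self.active_faults = {}

    def apply(self, spec: FaultSpec) -> None:
        self.active_faults[spec.target_service] = spec

    def remove(self, spec: FaultSpec) -> None:
        self.active_faults.pop(spec.target_service, None)


@contextmanager
def injected(cluster: FakeCluster, spec: FaultSpec):
    cluster.apply(spec)
    try:
        yield spec
    finally:
        # Runs on normal exit and on exceptions, so no fault is orphaned.
        cluster.remove(spec)


cluster = FakeCluster()
spec = FaultSpec("packet_loss", "checkout", duration_seconds=60, intensity=0.3)

with injected(cluster, spec):
    active_during = "checkout" in cluster.active_faults
active_after = "checkout" in cluster.active_faults
```

The try/finally inside the context manager is the key design choice: even if the evaluation code raises mid-run, the fault is still removed.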
Safety Architecture
The most important design constraint on the fault injection framework was: it must be impossible to accidentally affect production. This is a non-negotiable requirement when working with a chaos engineering toolset.
The safety architecture operates at three levels:
- Namespace isolation. All fault injection runs execute exclusively within dedicated fault-injection-* namespaces in a separate non-production cluster. The injection scripts refuse to run if the target namespace does not match the approved pattern. This check is in the script itself, not just in documentation.
- Guard rail validation. Before any fault is injected, the framework validates that the target cluster context is the designated test cluster (checked against a known cluster ID), that the target namespace is on the approved list, and that no other injection run is currently active in that namespace. Any failed check aborts the run with an explicit error.
- Automatic teardown. Every injected fault has a maximum TTL. If the agent or the engineer does not manually terminate the fault, the framework automatically removes it after the TTL expires. This prevents orphaned fault states from persisting in the test environment.
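The pre-injection checks can be sketched as a single validation function that raises on the first failed guard rail. The cluster ID value, the GuardRailError class, and the validate_run signature below are assumptions for illustration; only the fault-injection-* namespace pattern comes from the description above.

```python
from fnmatch import fnmatchcase

APPROVED_NAMESPACE_PATTERN = "fault-injection-*"
TEST_CLUSTER_ID = "test-cluster-001"  # hypothetical ID of the designated test cluster


class GuardRailError(RuntimeError):
    """Raised when a pre-injection safety check fails."""


def validate_run(cluster_id: str, namespace: str, active_namespaces: set) -> None:
    """Abort with an explicit error unless every guard rail passes."""
    if cluster_id != TEST_CLUSTER_ID:
        raise GuardRailError(f"cluster {cluster_id!r} is not the designated test cluster")
    if not fnmatchcase(namespace, APPROVED_NAMESPACE_PATTERN):
        raise GuardRailError(f"namespace {namespace!r} does not match the approved pattern")
    if namespace in active_namespaces:
        raise GuardRailError(f"an injection run is already active in {namespace!r}")


# Passes silently: test cluster, approved namespace, no concurrent run.
validate_run("test-cluster-001", "fault-injection-auth", active_namespaces=set())

try:
    validate_run("test-cluster-001", "production", active_namespaces=set())
    blocked = False
except GuardRailError:
    blocked = True
```

Raising rather than returning a boolean makes it hard for a caller to ignore a failed check by accident.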
Evaluation Harness
Beyond injecting faults, the framework measures agent performance. For each scenario run:
- Detection latency. Time from fault injection to agent's first diagnostic action (millisecond precision, sourced from AIOpsLab event logs).
- Classification accuracy. Whether the agent's stated root cause matches the injected fault type (verified against the scenario's ground truth record).
- Remediation correctness. Whether the agent's chosen action matches the expected remediation for the classified fault type.
- Recovery time. Time from agent action to cluster health restoration.
Results are written to a structured JSON log per run, and a summary report is generated after each batch of scenarios showing aggregate performance across all fault types.
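The per-run record and batch summary might look roughly like the following. The RunResult field names and the summarize function are hypothetical, and the sample timestamps are placeholders for illustration only, not measured results from the project.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunResult:
    scenario: str
    injected_fault: str      # ground truth from the scenario record
    classified_fault: str    # root cause reported by the agent
    expected_action: str
    chosen_action: str
    injected_at_ms: int
    first_diagnostic_at_ms: int
    action_at_ms: int
    healthy_at_ms: int

    @property
    def detection_latency_ms(self) -> int:
        return self.first_diagnostic_at_ms - self.injected_at_ms

    @property
    def recovery_time_ms(self) -> int:
        return self.healthy_at_ms - self.action_at_ms


def summarize(results: List[RunResult]) -> Dict[str, float]:
    n = len(results)
    return {
        "runs": n,
        "mean_detection_latency_ms": sum(r.detection_latency_ms for r in results) / n,
        "classification_accuracy": sum(r.classified_fault == r.injected_fault for r in results) / n,
        "remediation_correctness": sum(r.chosen_action == r.expected_action for r in results) / n,
        "mean_recovery_time_ms": sum(r.recovery_time_ms for r in results) / n,
    }


# Placeholder values for illustration only; these are not measured results.
results = [
    RunResult("pod-crash", "pod_crash", "pod_crash",
              "restart_pod", "restart_pod", 0, 12_000, 15_000, 40_000),
    RunResult("latency-spike", "latency_spike", "packet_loss",
              "reroute_traffic", "reroute_traffic", 0, 18_000, 22_000, 52_000),
]
summary = summarize(results)
report = json.dumps(summary, indent=2)
```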
Key Technical Challenges
Challenge 1: Balancing Detection Speed Against Classification Quality
The 30-second detection target created a direct tension with classification accuracy. A faster polling interval catches failures sooner but produces noisier telemetry: transient spikes that look like failures but aren't. A slower polling interval with more aggregation produces cleaner signal but adds latency.
The first implementation used a fixed 5-second polling interval with no aggregation, which hit the detection target consistently but produced an unacceptable false positive rate: transient CPU spikes from legitimate traffic bursts were being misclassified as resource exhaustion events.
The solution was a two-tier detection architecture. The first tier runs at a 5-second interval and uses simple threshold-based anomaly detection: if error rate exceeds X% or pod restart count increases, flag it. The second tier runs immediately on any first-tier flag and performs the full LangChain diagnostic tool sequence (pulling logs, checking events, verifying error rates over a longer window) before classifying. The first tier is fast but low-fidelity. The second tier is slower but high-fidelity. The combination hits the 30-second target (first-tier detection is fast) while keeping classification accuracy high (second-tier classification is thorough).
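The two-tier structure can be sketched as a cheap threshold check that decides whether the expensive diagnosis runs at all. In this sketch the Snapshot fields, the 5% error-rate threshold (the text above gives only "X%"), and the stub_diagnose function standing in for the LangChain tool sequence are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Snapshot:
    service: str
    error_rate: float
    restart_count: int
    previous_restart_count: int


ERROR_RATE_THRESHOLD = 0.05  # assumed value for illustration


def tier_one_flag(snapshot: Snapshot) -> bool:
    """Fast, low-fidelity check run on every polling interval."""
    return (
        snapshot.error_rate > ERROR_RATE_THRESHOLD
        or snapshot.restart_count > snapshot.previous_restart_count
    )


def detect(snapshot: Snapshot, diagnose: Callable[[Snapshot], str]) -> Optional[str]:
    """Run the slow, high-fidelity diagnosis only when tier one flags."""
    if not tier_one_flag(snapshot):
        return None
    return diagnose(snapshot)


diagnose_calls = []


def stub_diagnose(snapshot: Snapshot) -> str:
    # Stand-in for the full diagnostic tool sequence.
    diagnose_calls.append(snapshot.service)
    return "pod_crash_loop"


healthy = detect(Snapshot("cart", 0.01, 2, 2), stub_diagnose)
failing = detect(Snapshot("auth", 0.01, 5, 2), stub_diagnose)
```

Only the flagged snapshot reaches the second tier, so the cost of the LLM-backed diagnosis is paid only when a cheap signal suggests something is wrong.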
Challenge 2: Preventing the Agent from Acting on Bad Diagnoses
An autonomous agent that can restart pods, modify resource limits, and change network policies in a Kubernetes cluster is genuinely dangerous if it acts on incorrect diagnoses. A misclassified authentication failure that triggers a pod restart on a healthy service causes an outage, not a recovery.
The solution was the confidence threshold mechanism in the Decide phase. Before any remediation action is executed, the agent's classification is scored by cross-referencing the outputs of multiple diagnostic tools. If the pod logs, the error rate query, and the Kubernetes events all point to the same failure type, confidence is high. If the diagnostic tools produce inconsistent signals (pod logs suggest OOM, but memory metrics are normal), confidence is low, and the agent routes the anomaly to a human-review queue rather than acting.
This required designing the diagnostic tools to produce outputs that could be compared for consistency. That was not a trivial task, since pod logs are unstructured text and metric queries return numerical timeseries. The consistency check compares the failure type implied by each tool's output (mapped to a fixed taxonomy of failure classes) rather than comparing raw outputs directly.
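A simple way to express the consistency check is to treat each tool's implied failure class as a vote and use the agreement ratio as the confidence score. This is a minimal sketch of that idea, assuming a majority-vote scoring rule; the score_confidence function, the tool names, and the class labels are hypothetical, and the real scoring may differ.

```python
from collections import Counter
from typing import Dict, Tuple


def score_confidence(implied_by_tool: Dict[str, str]) -> Tuple[str, float]:
    """Return the most common implied failure class and the share of tools agreeing."""
    votes = Counter(implied_by_tool.values())
    label, count = votes.most_common(1)[0]
    return label, count / len(implied_by_tool)


# All tools agree: high confidence.
consistent = score_confidence({
    "pod_logs": "oom_kill",
    "k8s_events": "oom_kill",
    "memory_metrics": "oom_kill",
})

# Logs suggest OOM but memory metrics look normal: lower confidence.
inconsistent = score_confidence({
    "pod_logs": "oom_kill",
    "k8s_events": "oom_kill",
    "memory_metrics": "healthy",
})
```

Mapping every tool's output to the same fixed taxonomy first is what makes this comparison possible, since the raw outputs (text logs versus numeric timeseries) are not directly comparable.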
Challenge 3: Building Fault Injection That Doesn't Break the Thing It's Testing
Chaos engineering tooling that can damage the environment it runs in is counterproductive. The fault injection framework needed to be aggressive enough to produce realistic, detectable failures, but safe enough that an engineer could run any scenario without reviewing its implementation first.
The safety architecture described above (namespace isolation, guard rail validation, automatic teardown) addresses the production risk. But there was a subtler problem: some faults, if left running too long, could corrupt state in the test environment that would affect subsequent runs. For example, a DNS manipulation fault that wasn't cleanly torn down could leave name resolution broken for the next scenario.
Each fault scenario was built with an explicit cleanup procedure that goes beyond simply reversing the injection. It actively validates that the affected subsystem has returned to a known-good baseline before declaring the teardown complete. For DNS faults, this means querying the affected service name and verifying resolution succeeds. For network partition faults, this means checking that inter-service requests succeed end-to-end. Reversing the injection command is not enough: teardown is complete only once the verification passes.
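Verified teardown can be sketched as a polling loop that repeats a baseline check until it passes or a deadline expires. The wait_for_baseline helper and the simulated dns_resolves check below are hypothetical; a real check would query the affected service name or issue an end-to-end request.

```python
import time
from typing import Callable


def wait_for_baseline(
    check: Callable[[], bool],
    timeout_seconds: float,
    interval_seconds: float,
) -> bool:
    """Poll `check` until it passes or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)


attempts = {"count": 0}


def dns_resolves() -> bool:
    # Simulated check: resolution starts succeeding on the third attempt.
    attempts["count"] += 1
    return attempts["count"] >= 3


recovered = wait_for_baseline(dns_resolves, timeout_seconds=5.0, interval_seconds=0.0)
```

A False return would mean the teardown cannot be marked complete, which is the signal to stop the batch rather than run the next scenario on a damaged baseline.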
Outcome and What It Demonstrates
The autonomous SRE agent achieved sub-30-second failure detection across all 15+ tested fault scenarios, the primary performance target. Classification accuracy was high across pod-level and authentication fault types, with network-level faults showing more variability due to the inherently ambiguous telemetry signals they produce (packet loss and latency spikes look similar at the metrics layer). The confidence threshold mechanism routed the ambiguous cases correctly to human review rather than acting on uncertain diagnoses.
The fault injection framework was recognized during project review as a significant team contribution. It became the standard tool for testing reliability tooling within the team and continued in use after my co-op ended.
From an engineering standpoint, the project demonstrates:
Autonomous AI Agent Architecture. Designing a system that observes, reasons, decides, and acts in a continuous loop (with explicit confidence gating before any irreversible action) is the architecture pattern used in production AI agents across SRE, security, and operations domains. This is not a chatbot with tools; it's a closed-loop autonomous system.
Production Safety Engineering. The safety architecture of the fault injection framework (namespace isolation enforced in code, guard rail validation before every run, automatic TTL-based teardown, and post-teardown baseline verification) reflects the engineering discipline required to build tooling that is genuinely safe to operate, not just documented as safe.
LLM Integration at the Infrastructure Layer. Most LLM integrations operate at the application layer. This project integrates LLM reasoning at the infrastructure layer: the model makes decisions that directly affect Kubernetes workloads. Designing the tool schema, the confidence mechanism, and the consistency check to make that integration safe required thinking carefully about where AI judgment is reliable and where it isn't.
Chaos Engineering Rigor. Building a fault injection framework with measurable, reproducible scenarios and a structured evaluation harness (not just "inject a fault and see what happens") is the difference between ad-hoc testing and systematic reliability validation.
Tech Stack Summary
| Layer | Technology |
|---|---|
| Agent Language | Python |
| Agent Framework | LangChain |
| LLM Provider | OpenAI API |
| Observability & Eval | Microsoft AIOpsLab |
| Container Orchestration | Kubernetes |
| Fault Injection | Custom Python harness + network chaos tooling |
| Cloud Platform | AWS |
| Cluster Management | kubectl · Kubernetes Python client |
| Output Format | Structured JSON evaluation logs |