
Professional Project: Software Engineer Co-Op · Ribbon Communications · September 2025 to December 2025

Enterprise Infrastructure Automation Platform

Multi-Stack CloudFormation IaC Framework for Active Directory Lab Provisioning

Stack

AWS CloudFormation · AWS CodePipeline · AWS CodeBuild · Python · JSON Schema

Domain

Cloud Infrastructure · Infrastructure as Code · CI/CD · DevOps

4 wks → 30 min

Provisioning time

5 stacks

Multi-stack IaC architecture

20+

AWS resources automated

Enterprise Infrastructure Automation Platform: system architecture

Overview

At Ribbon Communications, provisioning a new Active Directory lab environment for development or testing took approximately four weeks. Not four weeks of compute time, but four weeks of calendar time, consumed by manual configuration steps, cross-team coordination dependencies, and hand-off delays between engineers who each owned different parts of the infrastructure.

I replaced that process with a fully automated multi-stack Infrastructure as Code framework built on AWS CloudFormation and AWS CodePipeline. The framework spans 20+ AWS resources across five logical stacks (network, security, compute, data, and observability) with cross-stack dependency resolution, JSON schema validation on all input parameters, idempotent deployments, automated rollback on failure, and scheduled drift detection to catch configuration divergence between runs.

The result: provisioning time fell from four weeks to under thirty minutes. One configuration file, one pipeline trigger, zero manual steps.


The Problem

Every engineer at Ribbon who needed a new development or testing environment went through the same four-week ordeal. The process involved submitting a provisioning request, waiting for the infrastructure team to schedule time, manually configuring VPC settings, setting up Active Directory domain controllers, configuring security groups and IAM policies, provisioning EC2 instances and attaching the correct profiles, setting up RDS instances with the right parameter groups, and wiring up CloudWatch for logging and alerting. Each step ran in sequence, each required the previous step to be verified before proceeding, and most steps were owned by different people.

The four-week timeline had two components. The actual configuration work, done by an expert, might take two to three days of focused effort. The rest (three and a half weeks) was coordination overhead: scheduling, waiting for approvals, dependencies on other teams' availability, and the inevitable re-work when a step was misconfigured and had to be rolled back manually.

This wasn't a secret or an accepted cost. It was a recognized bottleneck that directly affected development velocity. Engineers waiting for environments couldn't test against realistic infrastructure. Teams developing reliability tooling (like the autonomous SRE agent) had to share environments, which introduced interference between workloads and made test results unreliable. The bottleneck was universally acknowledged and had persisted because no one had taken ownership of automating it end to end.

I inherited it as a side constraint of my primary SRE agent work (I needed consistent, reproducible environments to test the agent against) and decided the right move was to solve it properly rather than work around it.


Goals


Technical Architecture

The Multi-Stack Design Decision

The first and most consequential architectural decision was whether to implement the infrastructure as a single monolithic CloudFormation template or as a set of separate, coordinated stacks.

A single monolithic template is simpler to reason about in isolation. Everything is in one file, the dependency graph is implicit in the template's resource references, and there's no cross-stack coordination to manage. For small environments, it's often the right choice.

For an Active Directory lab environment spanning 20+ AWS resources across multiple domains (networking, IAM, compute, database, and observability), the monolith creates serious operational problems. A change to a security group rule requires updating and redeploying the entire template, including all compute and database resources, which CloudFormation may need to replace rather than update. A failure in any resource blocks the entire deployment. There's no way to isolate and retry just the failed layer. Different infrastructure layers change at different rates: network topology is stable for months; security policies update frequently; monitoring configuration changes with every new service. A monolith couples all these update frequencies together.

The multi-stack architecture splits the infrastructure along natural domain boundaries, with each stack owning a coherent set of resources and exposing well-defined outputs that downstream stacks consume:

Five-stack CloudFormation architecture: network foundation as the dependency root, security and compute in parallel, then data and observability

Stack 1: Network Foundation
VPC, public and private subnets across multiple availability zones, internet gateway, NAT gateway, route tables, and VPC endpoints. This stack is the dependency root. Every other stack depends on it. Its outputs include the VPC ID, subnet IDs, and route table IDs that all subsequent stacks reference. Network topology changes rarely, so this stack deploys infrequently once established.

Stack 2: Security and IAM
Security groups for each service tier (web, application, database, management), IAM roles, IAM policies, and EC2 instance profiles. This stack depends on Stack 1 (security groups need the VPC ID) and uses CloudFormation cross-stack references via Fn::ImportValue to pull the VPC ID from Stack 1's exports. IAM policies change more frequently than network topology (new services require new roles, policy changes are iterative), so isolating this layer means security changes don't touch network or compute resources.

Stack 3: Compute
EC2 instances for Active Directory domain controllers, application servers, and the jump host. Launch templates with the correct AMIs, instance types, user data scripts for AD domain configuration, and attachment to the instance profiles from Stack 2. This stack depends on Stacks 1 and 2, importing subnet IDs and instance profile ARNs. Compute is the most expensive layer to replace (CloudFormation resource replacement on EC2 instances causes downtime), so isolating it means compute only gets touched when compute configuration actually changes.

Stack 4: Data
RDS instances with the correct parameter groups and subnet groups, S3 buckets for environment artifacts and logs, and DynamoDB tables used by the AD management tooling. This stack depends on Stacks 1 and 2 for networking and security configuration. Data layer resources are the most sensitive to unintended replacement (an RDS replacement in CloudFormation deletes the existing instance), so this stack has the most conservative update policy.

Stack 5: Observability
CloudWatch log groups, metric filters, alarms, dashboards, and SNS topics for alert routing. This stack depends on Stacks 3 and 4, importing resource identifiers to configure log forwarding and alarm targets. Observability configuration changes frequently (new alarms for new services, dashboard updates, threshold tuning), and this stack can be updated completely independently without touching any of the infrastructure it monitors.

Cross-Stack Dependency Resolution

CloudFormation cross-stack references work through a two-part mechanism: a stack declares an Output with the Export property set to a named export key, and a downstream stack references it with Fn::ImportValue. This creates a hard dependency: CloudFormation prevents a stack from being deleted, or an exported value from being changed, while another stack imports it, and an importing stack fails to deploy if the export it references does not yet exist.
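To make the two halves of that mechanism concrete, here is a minimal sketch that builds the relevant template fragments as Python dictionaries and prints them as JSON. The logical IDs (LabVpc, WebSecurityGroup), the output key, and the EnvironmentName parameter are assumptions chosen for illustration, not names taken from the actual templates.

```python
import json

# Assumed export name pattern; EnvironmentName is a hypothetical stack parameter.
VPC_EXPORT = {"Fn::Sub": "${EnvironmentName}-network-vpc-id"}


def network_outputs() -> dict:
    """Outputs section of a hypothetical network template that exports the VPC ID."""
    return {
        "VpcId": {
            "Description": "ID of the lab VPC",
            "Value": {"Ref": "LabVpc"},  # LabVpc is an assumed logical ID
            "Export": {"Name": VPC_EXPORT},
        }
    }


def web_security_group() -> dict:
    """Security group in a downstream template that imports the exported VPC ID."""
    return {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": "Web tier",
            "VpcId": {"Fn::ImportValue": VPC_EXPORT},
        },
    }


if __name__ == "__main__":
    print(json.dumps({"Outputs": network_outputs()}, indent=2))
    print(json.dumps({"Resources": {"WebSecurityGroup": web_security_group()}}, indent=2))
```

The import resolves only when the name on the importing side matches the exporting side exactly, which is why both fragments share one definition here.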

The CodePipeline orchestration respects these dependencies by deploying stacks in order: Stack 1, then Stacks 2 and 3 in parallel (both depend only on Stack 1), then Stack 4 (depends on Stacks 1 and 2), then Stack 5 (depends on Stacks 3 and 4). This parallelization (Stacks 2 and 3 deploy simultaneously) shaves meaningful time off the total provisioning run.
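The deployment order above can be derived mechanically from a dependency map. The following sketch uses Python's standard graphlib module to group stacks into waves that can deploy in parallel. The short stack names and the map itself are transcribed from the pipeline order in the preceding paragraph and are illustrative, not the pipeline's actual configuration.

```python
from graphlib import TopologicalSorter

# stack -> stacks it must wait for, as stated in the pipeline order above
STACK_DEPENDENCIES = {
    "network": [],
    "security": ["network"],
    "compute": ["network"],
    "data": ["network", "security"],
    "observability": ["compute", "data"],
}


def deployment_waves(dependencies: dict) -> list:
    """Group stacks into waves; every stack in a wave has all prerequisites deployed."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the map contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output deterministic
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(deployment_waves(STACK_DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Deriving the waves from the map rather than hard-coding them means that adding a sixth stack only requires declaring what it depends on.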

Each stack's export names follow a strict naming convention: {environment}-{stack-name}-{resource-type}-{qualifier}. For example: prod-network-vpc-id, dev-security-web-sg-id, staging-compute-dc1-instance-id. This convention makes imports self-documenting and prevents name collisions when multiple environment instances (dev, staging, prod) coexist in the same AWS account.
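A small helper can enforce that convention in one place. This is a sketch under assumptions: the function names are hypothetical, and the rule that each component is lowercase alphanumeric with optional internal hyphens is inferred from the three example names rather than stated anywhere in the framework.

```python
import re

# Assumed component format: lowercase letters and digits, hyphen-separated.
_COMPONENT = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def export_name(environment: str, stack: str, resource_type: str, qualifier: str) -> str:
    """Build an export name as {environment}-{stack-name}-{resource-type}-{qualifier}."""
    parts = (environment, stack, resource_type, qualifier)
    for part in parts:
        if not _COMPONENT.match(part):
            raise ValueError(f"invalid export name component: {part!r}")
    return "-".join(parts)


def find_collisions(names: list) -> list:
    """Return export names that appear more than once, sorted."""
    seen, duplicates = set(), set()
    for name in names:
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return sorted(duplicates)


if __name__ == "__main__":
    print(export_name("prod", "network", "vpc", "id"))
    print(find_collisions(["vpc-id", "vpc-id", "dev-network-vpc-id"]))
```

Running find_collisions over the export names of every environment in an account gives an early warning before CloudFormation rejects a duplicate export.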

JSON Schema Validation

Every deployment is driven by a single environment configuration file. It's a JSON document that specifies all the parameters that vary between environments: environment name, AWS region, VPC CIDR block, subnet CIDR ranges, EC2 instance types, RDS instance class, AD domain name, and alert email addresses.

Before the CodePipeline run begins, a validation stage runs this configuration file against a JSON Schema definition. The schema enforces:

Validation failures abort the pipeline immediately, before any AWS API call is made. The error output identifies the specific field, the actual value provided, and the constraint it violated. An engineer misconfiguring the VPC CIDR format gets a clear error in the pipeline log, not a CloudFormation stack failure 15 minutes into deployment after a dozen resources have already been created.

This pre-flight validation pattern eliminates an entire class of mid-deployment failures that previously required manual rollback.
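The shape of such a pre-flight check can be sketched with the standard library alone. Note that this is a hand-rolled stand-in for illustration, not JSON Schema validation: the field names (environment, region, vpc_cidr) and the specific constraints are assumptions, with the 10.0.0.0/8 rule borrowed from the error message example later in this write-up.

```python
import ipaddress
from dataclasses import dataclass

REQUIRED_FIELDS = ("environment", "region", "vpc_cidr")  # hypothetical field names
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}
LAB_RANGE = ipaddress.ip_network("10.0.0.0/8")


@dataclass(frozen=True)
class Violation:
    field: str         # which field failed
    value: object      # the value that was provided
    constraint: str    # the rule it violated


def validate_config(config: dict) -> list:
    """Return every violation found; an empty list means the config may deploy."""
    violations = []
    for name in REQUIRED_FIELDS:
        if name not in config:
            violations.append(Violation(name, None, "required field is missing"))

    env = config.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        violations.append(Violation("environment", env, "must be one of dev, staging, prod"))

    cidr = config.get("vpc_cidr")
    if cidr is not None:
        try:
            network = ipaddress.ip_network(cidr, strict=True)
        except (ValueError, TypeError):
            violations.append(Violation("vpc_cidr", cidr, "must be a valid CIDR block"))
        else:
            if network.version != 4 or not network.subnet_of(LAB_RANGE):
                violations.append(Violation("vpc_cidr", cidr, "must fall inside 10.0.0.0/8"))
    return violations


if __name__ == "__main__":
    bad = {"environment": "dev", "region": "us-east-1", "vpc_cidr": "192.168.0.0/16"}
    for v in validate_config(bad):
        print(f"{v.field}: got {v.value!r}, {v.constraint}")
```

Collecting all violations before aborting, rather than stopping at the first, lets an engineer fix the whole file in one pass.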

Idempotent Deployments

CloudFormation's native update behavior is the foundation of idempotency: a stack update computes a change set between the submitted template and parameters and the stack's current ones, and applies only the delta. But idempotency requires more than just running update-stack: it requires the pipeline to correctly detect whether a stack exists and needs updating, needs creating from scratch, or is already in the target state and requires no action.

The CodeBuild stage that executes each stack deployment implements this check explicitly:

  1. Describe the stack to get its current status
  2. If the stack doesn't exist → create-stack
  3. If the stack exists and is in a stable state (CREATE_COMPLETE, UPDATE_COMPLETE) → compute a change set; if the change set is empty, skip; if non-empty, execute update-stack
  4. If the stack exists and is in a failed state (ROLLBACK_COMPLETE, UPDATE_ROLLBACK_COMPLETE) → delete the failed stack and recreate
  5. If the stack is in a transition state (CREATE_IN_PROGRESS, UPDATE_IN_PROGRESS) → wait for completion before proceeding
Idempotent deployment: per-stack state check branches to create, update, skip, recreate, or wait

This logic means running the pipeline against a fully provisioned, unchanged environment is a no-op: it exits cleanly after the change set check, touching no resources. Running it after a configuration file change applies only the changed parameters. Running it against a fresh account provisions everything from scratch. Same pipeline, same trigger: correct behavior in all three cases.
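The five-way branch can be expressed as a pure decision function, which keeps the logic testable without touching AWS. This is a sketch, not the production script: the function name and action labels are hypothetical, and it assumes the caller has already obtained the stack status (for example from a describe-stacks call, passing None when the stack does not exist) and knows whether the computed change set is empty.

```python
from typing import Optional

STABLE = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}
FAILED = {"ROLLBACK_COMPLETE", "UPDATE_ROLLBACK_COMPLETE"}
IN_PROGRESS = {"CREATE_IN_PROGRESS", "UPDATE_IN_PROGRESS"}


def decide_action(status: Optional[str], change_set_is_empty: bool = False) -> str:
    """Map a stack's current status to one of: create, update, skip, recreate, wait.

    status is None when the stack does not exist. change_set_is_empty is only
    consulted for stacks in a stable state.
    """
    if status is None:
        return "create"
    if status in IN_PROGRESS:
        return "wait"
    if status in FAILED:
        return "recreate"
    if status in STABLE:
        return "skip" if change_set_is_empty else "update"
    # Statuses not covered by the five cases above are surfaced, not guessed at.
    raise ValueError(f"unhandled stack status: {status}")


if __name__ == "__main__":
    for status in (None, "CREATE_COMPLETE", "ROLLBACK_COMPLETE", "UPDATE_IN_PROGRESS"):
        print(status, "->", decide_action(status, change_set_is_empty=False))
```

The sketch follows the five cases exactly as listed. Anyone adapting it should note that CloudFormation itself accepts updates to a stack in UPDATE_ROLLBACK_COMPLETE, so treating that status as stable is a reasonable alternative when deleting the stack would be costly.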

Automated Rollback

CloudFormation's --on-failure ROLLBACK flag handles single-stack rollback on creation failure. For update failures, the stack automatically rolls back to its previous state. But in a five-stack deployment, a failure in Stack 3 (Compute) leaves Stacks 1 and 2 in their new state, creating a version mismatch between already-deployed stacks and the failed stack.

The CodeBuild execution script handles this by tracking deployment state across stacks. If any stack deployment fails, the script records which stacks have already been updated in this pipeline run and triggers their rollback in reverse dependency order: Stack 3 first (the failed stack, CloudFormation handles its rollback), then Stack 2, then Stack 1. This returns the entire environment to its pre-run state, not just the individual failed stack.
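The ordering part of that logic is small enough to sketch directly. The class and method names below are hypothetical, and the sketch covers only which stacks to revert and in what order. How each stack is actually reverted is outside its scope.

```python
class DeploymentRun:
    """Tracks which stacks were successfully changed during one pipeline run."""

    def __init__(self) -> None:
        self._deployed: list[str] = []

    def record_success(self, stack: str) -> None:
        """Remember a stack that finished deploying in this run."""
        self._deployed.append(stack)

    def rollback_plan(self, failed_stack: str) -> list[str]:
        """Stacks to revert, in order.

        The failed stack comes first (CloudFormation rolls it back itself),
        followed by the stacks deployed earlier in this run, newest first.
        """
        return [failed_stack] + list(reversed(self._deployed))


if __name__ == "__main__":
    run = DeploymentRun()
    run.record_success("network")
    run.record_success("security")
    print(run.rollback_plan("compute"))
```

Because the plan is built only from stacks recorded in the current run, stacks that were skipped as unchanged never appear in it.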

Multi-stack rollback: when a stack fails mid-deployment, deployed stacks roll back in reverse dependency order

Scheduled Drift Detection

Infrastructure drift (the divergence between the declared IaC state and the actual resource state) happens when engineers make manual changes directly in the AWS console or via CLI outside the pipeline, when AWS applies automatic changes (certificate renewals, security patch updates), or when resources are modified by other automated processes.

A CloudWatch Events rule triggers a drift detection run against all five stacks on a daily schedule. The detection uses CloudFormation's native detect-stack-drift API, which compares each resource's current configuration against its last-known stack template. Detected drift generates a CloudWatch alarm and sends an SNS notification with a structured report: which stacks have drifted, which specific resources within those stacks, and what properties have changed.

Drift is not automatically remediated. Auto-remediation of infrastructure drift is dangerous, because the manual change may have been intentional and necessary. Instead, the alert surfaces the divergence to the team so a human can decide whether to accept it (update the template to match) or revert it (re-run the pipeline to restore declared state).
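As an illustration of the structured report, the sketch below condenses per-resource drift records into the three facts the notification carries: stack, resource, and changed properties. It assumes the input records have the shape returned by CloudFormation's DescribeStackResourceDrifts API (LogicalResourceId, ResourceType, StackResourceDriftStatus, PropertyDifferences). The function name, report keys, and sample stack names are hypothetical.

```python
def summarize_drift(drifts_by_stack: dict) -> list:
    """Flatten per-stack drift records into report entries, skipping in-sync resources."""
    report = []
    for stack_name, drifts in sorted(drifts_by_stack.items()):
        for drift in drifts:
            if drift["StackResourceDriftStatus"] == "IN_SYNC":
                continue
            report.append({
                "stack": stack_name,
                "resource": drift["LogicalResourceId"],
                "type": drift["ResourceType"],
                "status": drift["StackResourceDriftStatus"],
                "changed_properties": [
                    {
                        "path": diff["PropertyPath"],
                        "expected": diff["ExpectedValue"],
                        "actual": diff["ActualValue"],
                    }
                    # Deleted resources may carry no property differences.
                    for diff in drift.get("PropertyDifferences", [])
                ],
            })
    return report


if __name__ == "__main__":
    sample = {
        "dev-security": [{
            "LogicalResourceId": "WebSecurityGroup",
            "ResourceType": "AWS::EC2::SecurityGroup",
            "StackResourceDriftStatus": "MODIFIED",
            "PropertyDifferences": [{
                "PropertyPath": "/SecurityGroupIngress/0/CidrIp",
                "ExpectedValue": "10.0.0.0/8",
                "ActualValue": "0.0.0.0/0",
                "DifferenceType": "NOT_EQUAL",
            }],
        }],
    }
    for entry in summarize_drift(sample):
        print(entry)
```

The report deliberately stops at describing the divergence, which matches the decision to leave remediation to a human.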


Key Technical Challenges

Challenge 1: The Monolith vs. Multi-Stack Decision Under Time Pressure

The multi-stack architecture was the harder choice. A single monolithic template would have been faster to write, simpler to explain, and sufficient to hit the provisioning time target. The argument for splitting into five stacks was entirely about operational quality after deployment: independent update cycles, isolated failure blast radius, and the ability to modify security policy without touching compute.

Making this call as a co-op engineer on a four-month rotation required some confidence in the judgment. The system would outlive my time at Ribbon. If I built a monolith that was hard to maintain, the engineers who inherited it would bear the cost. If I over-engineered the split and created unnecessary complexity, they'd also bear that cost. The five-stack split maps directly to the five infrastructure concerns that genuinely change independently (network, security, compute, data, observability), which made it defensible rather than arbitrary. That alignment between stack boundaries and natural domain boundaries was the deciding criterion.

Challenge 2: Cross-Stack Reference Naming at Multiple Environment Scales

The initial cross-stack reference implementation used simple export names like vpc-id and web-sg-id. This worked fine for a single environment but immediately broke when the team needed two environments (dev and staging) in the same AWS account. CloudFormation export names are scoped to an account and region and must be unique within that scope. Two environments that both export vpc-id produce a name collision.

The fix was the {environment}-{stack-name}-{resource-type}-{qualifier} naming convention described above, implemented as a CloudFormation parameter substitution: the environment name is injected as a stack parameter, and every export name uses !Sub "${EnvironmentName}-network-vpc-id" to generate the namespaced export. This required updating every Fn::ImportValue reference in every downstream stack to use the same substitution pattern. The refactor was mechanical but demanded care, because any mismatched name makes the importing stack fail at deploy time.

Challenge 3: Validation Errors That Actually Help

The first version of the JSON schema validation stage produced errors like "instance" does not match "^(10|172|192)\.". Technically accurate, but requiring the engineer to understand the regex to interpret the error. For a validation layer meant to be operated by engineers who didn't write the framework, cryptic errors defeat the purpose.

The solution was a custom validation wrapper around the JSON Schema library that maps each schema constraint to a human-readable error message template: "VPC CIDR block '192.168.0.0/16' is not in the 10.0.0.0/8 range required for internal lab environments. Use a 10.x.x.x/16 block." Every constraint in the schema has a corresponding error message that tells the engineer what to fix and why the constraint exists. This added a modest amount of authoring overhead per constraint but substantially reduced the time-to-fix when validation failed.
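The mapping idea can be sketched independently of any particular schema library. Here each (field, constraint keyword) pair looks up a message template, with a fallback to the raw validator message when no template exists. The function name, the field name vpc_cidr, and the lookup key format are assumptions. Only the message text is taken from the example above.

```python
# (field, constraint keyword) -> human-readable template; keys are hypothetical.
MESSAGE_TEMPLATES = {
    ("vpc_cidr", "pattern"): (
        "VPC CIDR block '{value}' is not in the 10.0.0.0/8 range required "
        "for internal lab environments. Use a 10.x.x.x/16 block."
    ),
}


def friendly_message(field: str, keyword: str, value: object, raw_message: str) -> str:
    """Translate one validation failure into a message that says what to fix."""
    template = MESSAGE_TEMPLATES.get((field, keyword))
    if template is None:
        # No curated message yet: fall back to the raw text, prefixed by the field.
        return f"{field}: {raw_message}"
    return template.format(value=value)


if __name__ == "__main__":
    print(friendly_message("vpc_cidr", "pattern", "192.168.0.0/16", "does not match pattern"))
    print(friendly_message("region", "enum", "mars-1", "'mars-1' is not one of the allowed values"))
```

The fallback keeps the validator usable while templates are still being written, so an uncovered constraint degrades to a cryptic message instead of no message.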

Challenge 4: Teardown Was Left Manual

One deliberate limitation: environment teardown was not automated to the same standard as provisioning. Deleting a CloudFormation stack that has resources with data (RDS instances, S3 buckets with content) requires explicit decisions about data retention that are environment-specific and shouldn't be made by a pipeline without human confirmation.

The teardown process required engineers to manually trigger stack deletions in reverse dependency order. This was a conscious scope decision: automating teardown with the same rigor as provisioning would require a separate pipeline with interactive approval gates, data backup confirmation steps, and environment-specific retention policies, a substantial additional workload for a co-op rotation that already had two other major deliverables. The right call was to scope teardown as a documented manual process instead of building a partial automation that handled the easy cases and failed silently on the data-sensitive ones.


Outcome and What It Demonstrates

Provisioning time dropped from four weeks to under thirty minutes. The pipeline runs end-to-end without human intervention: it validates the configuration file, deploys all five stacks in dependency order, performs post-deployment health checks on key resources, and produces a structured deployment summary with all relevant resource identifiers and endpoints.

The framework was used to provision environments for the SRE agent testing work and became the standard provisioning mechanism for new Ribbon lab environments during the co-op term.

From an engineering standpoint, the project demonstrates:

IaC Architecture at Real Scale. Designing a five-stack CloudFormation framework with cross-stack dependencies, parameterized naming conventions, and coordinated pipeline orchestration is a materially different engineering problem than writing a single CloudFormation template. The architectural decisions (where to draw stack boundaries, how to manage cross-stack references across multiple environment instances, how to handle multi-stack rollback) are the decisions that matter in production infrastructure.

Pre-Flight Validation as a First-Class Concern. Building JSON schema validation with human-readable error messages as the first stage of a deployment pipeline reflects the understanding that failures caught before AWS API calls are categorically cheaper than failures caught mid-deployment. This is a production engineering principle, not a nice-to-have.

Idempotency by Design. The explicit stack-state detection logic (create, update, skip, or recreate depending on current state) makes the pipeline safe to run repeatedly without side effects. Idempotent infrastructure tooling is a requirement in any environment where the pipeline might run more than once against the same target, which is every real environment.

Engineering for Maintainability Over Cleverness. Several decisions reflect this: building the multi-stack architecture, documenting every cross-stack reference naming convention, writing human-readable validation errors, and scoping teardown as a deliberate manual process rather than a partial automation. All of these are engineering judgment calls that prioritize the system's operational quality over its build-time convenience. The automation had to work for engineers who didn't build it. Every design decision was evaluated against that standard.


Tech Stack Summary

Layer | Technology
IaC Framework | AWS CloudFormation
Pipeline Orchestration | AWS CodePipeline · AWS CodeBuild
Input Validation | JSON Schema (custom error wrapper)
Compute Resources | Amazon EC2 (Active Directory domain controllers, app servers)
Database Resources | Amazon RDS
Storage | Amazon S3
Observability | Amazon CloudWatch (logs, alarms, dashboards) · Amazon SNS
Drift Detection | CloudFormation Drift Detection · CloudWatch Events (scheduled)
Scripting | Python (pipeline logic, change set management, rollback orchestration)
Deployment Target | AWS (multi-environment: dev / staging / prod)