Infrastructure as Code Fundamentals

Learn Infrastructure as Code from the ground up: understand declarative vs imperative approaches, master tools like Terraform and CloudFormation, and implement best practices for managing cloud infrastructure at scale.

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable configuration files rather than manual processes or interactive configuration tools. Instead of clicking through cloud consoles or running ad-hoc scripts, you define your entire infrastructure in code that can be versioned, tested, and deployed automatically.

This approach has become fundamental to modern cloud engineering. Every major cloud platform supports IaC, and it's a core requirement for most cloud engineering roles. This guide covers what Infrastructure as Code is, why it matters, the most popular tools, and how to implement IaC effectively in professional environments.


Why Cloud Engineers Need Infrastructure as Code

Manual infrastructure management doesn't scale. As your infrastructure grows, clicking through consoles becomes error-prone, inconsistent, and impossible to track. IaC solves these problems by treating infrastructure the same way we treat application code.

  • Consistency and repeatability: Define infrastructure once, deploy it identically across environments (dev, staging, production).
  • Version control: Track every change to your infrastructure in Git, see who changed what and when, and roll back mistakes instantly.
  • Automation: Integrate infrastructure changes into CI/CD pipelines, eliminating manual deployment steps.
  • Documentation: Your IaC files become living documentation of your infrastructure. New team members can read the code to understand the system.
  • Disaster recovery: If your infrastructure is destroyed, you can rebuild it completely from code in minutes rather than days.
  • Cost optimization: Review infrastructure before deployment, estimate costs, and catch over-provisioned resources in code reviews.
  • Testing: Test infrastructure changes in isolated environments before applying them to production.
  • Collaboration: Multiple engineers can work on infrastructure simultaneously, merging changes through pull requests.

Core IaC Concepts

Declarative vs Imperative

Declarative (what you want):

# Terraform - declarative
resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.medium"
  count         = 3
}

You describe the desired end state. The tool figures out how to achieve it. Most modern IaC tools (Terraform, CloudFormation, Pulumi) are declarative.

Imperative (step-by-step instructions):

# Imperative - traditional script
aws ec2 run-instances --image-id ami-12345678 --count 3 --instance-type t3.medium
aws ec2 create-tags --resources $instance_id --tags Key=Name,Value=web   # You must capture and track $instance_id yourself
aws ec2 create-security-group --group-name web-sg --description "Web security group"

You specify exactly which actions to perform in order. Traditional scripting is imperative.

Declarative is usually better because:

  • The tool handles dependencies automatically
  • Running the same code multiple times is safe (idempotent)
  • You don't need to track current state manually

State Management

Most IaC tools maintain a state file that tracks:

  • What resources currently exist
  • How they map to your configuration
  • Metadata and relationships between resources

Example Terraform state:

{
  "version": 4,
  "resources": [
    {
      "type": "aws_instance",
      "name": "web",
      "instances": [
        {
          "attributes": {
            "id": "i-0123456789abcdef0",
            "instance_type": "t3.medium",
            "public_ip": "54.123.45.67"
          }
        }
      ]
    }
  ]
}

Critical state management rules:

  • Never edit state files manually
  • Store state remotely (S3, Terraform Cloud, Azure Storage) for team access
  • Lock state during operations to prevent conflicts
  • Back up state files regularly

Idempotency

Running the same IaC code multiple times produces the same result. If the infrastructure already matches your code, nothing changes.

# First run: creates 3 instances
terraform apply

# Second run: no changes needed
terraform apply  # Output: "No changes. Infrastructure is up-to-date."

This is a key advantage over imperative scripts, which often create duplicates or fail when run twice.

Modules and Reusability

Break infrastructure into reusable components:

infrastructure/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── eks/
│   └── rds/
└── environments/
    ├── prod/
    │   └── main.tf
    ├── staging/
    └── dev/

Each module is self-contained and parameterized. Environments use modules with different inputs.
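
For example, a production environment might compose modules like this. This is a minimal sketch: the input names (cidr_block, azs, environment, vpc_id) are illustrative and must match the variables each module actually declares.

# environments/prod/main.tf - composing modules (illustrative input names)
module "vpc" {
  source = "../../modules/vpc"

  cidr_block  = "10.0.0.0/16"
  azs         = ["us-east-1a", "us-east-1b"]
  environment = "prod"
}

module "rds" {
  source = "../../modules/rds"

  # Modules are wired together through outputs
  vpc_id      = module.vpc.vpc_id
  environment = "prod"
}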


Popular IaC Tools

Terraform

The most popular multi-cloud IaC tool, created by HashiCorp.

Pros:

  • Works with AWS, Azure, GCP, and 3,000+ providers
  • Large community and extensive documentation
  • Declarative HCL language (HashiCorp Configuration Language)
  • Strong state management and planning capabilities
  • Free to use; source-available under the BUSL since 2023 (the OpenTofu fork remains fully open source)

Cons:

  • Learning curve for HCL syntax
  • State file management can be complex
  • Some providers lag behind cloud platform features

Use case: Best for multi-cloud environments or when you need provider flexibility.

AWS CloudFormation

AWS's native IaC service.

Pros:

  • Deep AWS integration
  • No external state management (AWS handles it)
  • New AWS services and features typically gain CloudFormation support quickly, often at launch
  • Free to use (only pay for resources created)
  • StackSets for multi-account deployments

Cons:

  • AWS-only (not multi-cloud)
  • YAML/JSON can be verbose
  • Error messages can be cryptic
  • Stack updates and rollbacks can be slow for large stacks

Use case: Best for AWS-only environments where native integration is a priority.

Pulumi

Modern IaC using general-purpose programming languages.

Pros:

  • Write infrastructure in Python, TypeScript, Go, C#, or Java
  • Full programming language features (loops, conditionals, functions)
  • Strong typing and IDE support
  • Multi-cloud support
  • Easier for developers who already know these languages

Cons:

  • Smaller community than Terraform
  • More complex to learn if you're not a developer
  • Pulumi Cloud is the default state backend (free tier available), though self-managed backends such as S3 or local files are supported

Use case: Best for teams with strong programming backgrounds who want language flexibility.

Azure Resource Manager (ARM) and Bicep

ARM Templates: Azure's native IaC using JSON.

Bicep: Newer, simpler language that compiles to ARM templates.

Pros:

  • Native Azure integration
  • Bicep is much more readable than ARM JSON
  • No external state management
  • Free to use

Cons:

  • Azure-only
  • Smaller community compared to Terraform
  • Less mature than CloudFormation or Terraform

Use case: Best for Azure-only environments.

Ansible

Configuration management tool that can also provision infrastructure.

Pros:

  • Agentless (uses SSH)
  • Simple YAML syntax
  • Great for configuration management and application deployment
  • Can orchestrate both infrastructure and software

Cons:

  • Not purpose-built for IaC (better tools exist for infrastructure)
  • No native state management
  • Slower for large infrastructure deployments
  • Procedural playbook execution (tasks run in order), even though most modules are idempotent

Use case: Best for configuration management and orchestration, not primary infrastructure provisioning.


Getting Started with Terraform

Terraform is the most popular IaC tool and a safe bet for learning. Here's a practical introduction.

Installation

# macOS (Homebrew, official HashiCorp tap)
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Windows (Chocolatey)
choco install terraform

# Linux (Debian/Ubuntu) - apt-key is deprecated; use a dedicated keyring instead
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install terraform

# Verify installation
terraform --version

Basic Terraform Workflow

# 1. Initialize: Download provider plugins
terraform init

# 2. Plan: Preview changes before applying
terraform plan

# 3. Apply: Create/update infrastructure
terraform apply

# 4. Destroy: Remove all infrastructure
terraform destroy

Example: Creating an AWS EC2 Instance

Create main.tf:

# Configure the AWS provider
provider "aws" {
  region = "us-east-1"
}

# Create a VPC
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "main-vpc"
  }
}

# Create a subnet
resource "aws_subnet" "public" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name = "public-subnet"
  }
}

# Create a security group
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Allow HTTP and SSH"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # For production, restrict SSH to a trusted IP range
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Create an EC2 instance
resource "aws_instance" "web" {
  ami                    = "ami-0c55b159cbfafe1f0"  # Example AMI ID; look up a current one for your region
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]

  tags = {
    Name = "web-server"
  }
}

# Output the public IP
output "instance_public_ip" {
  value = aws_instance.web.public_ip
}

Run the workflow:

terraform init
terraform plan
terraform apply  # Type 'yes' to confirm

Terraform creates the VPC, subnet, security group, and EC2 instance in the correct order, handling dependencies automatically.

Variables and Outputs

variables.tf:

variable "region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.micro"
}

variable "environment" {
  description = "Environment name"
  type        = string
}

terraform.tfvars:

region        = "us-west-2"
instance_type = "t3.small"
environment   = "production"

outputs.tf:

output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "instance_ip" {
  description = "Public IP of the instance"
  value       = aws_instance.web.public_ip
}

Variables make your infrastructure reusable across environments. Outputs expose values for use in other modules or for reference.
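
Inside resource and provider blocks, variables are referenced as var.<name>. A minimal sketch, continuing the earlier EC2 example:

# main.tf - using the variables defined above
provider "aws" {
  region = var.region
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"  # Example AMI ID
  instance_type = var.instance_type

  tags = {
    Name        = "web-server"
    Environment = var.environment
  }
}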

Remote State with S3

Store Terraform state remotely for team collaboration:

backend.tf:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

This stores state in S3 and uses DynamoDB for state locking, preventing concurrent modifications.
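
Note that the bucket and lock table must exist before terraform init can use this backend; they are usually created once, manually or from a small separate bootstrap configuration. A hedged sketch of those bootstrap resources, reusing the names from the backend block above:

# bootstrap/main.tf - one-time resources for remote state (sketch)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state"
}

# Versioning lets you recover earlier state revisions
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table for the S3 backend; the hash key must be named "LockID"
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}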


IaC Best Practices

Structure Your Code

Organize by environment and module:

infrastructure/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   ├── rds/
│   └── s3/
├── environments/
│   ├── prod/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── backend.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── dev/
├── .gitignore
└── README.md

Each module is self-contained and testable. Environments are thin layers that compose modules.

Use Version Control

Treat infrastructure code like application code:

  • Commit all IaC files to Git
  • Use .gitignore to exclude sensitive files
  • Open pull requests for infrastructure changes
  • Require code reviews before merging

.gitignore for Terraform:

# Terraform files
.terraform/
*.tfstate
*.tfstate.backup
*.tfvars
.terraform.lock.hcl

# Crash logs
crash.log

# Sensitive files
*.pem
*.key
.env

Never Hardcode Secrets

Bad:

resource "aws_db_instance" "main" {
  password = "mysecretpassword123"  # NEVER DO THIS
}

Good:

resource "aws_db_instance" "main" {
  password = var.db_password  # Passed from environment variable
}

Use:

  • Environment variables: export TF_VAR_db_password=<secret>
  • A secrets manager such as AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault (see the sketch after this list)
  • Terraform Cloud encrypted variables
  • .tfvars files excluded from version control
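
As a sketch of the secrets-manager option, assuming a secret named prod/db-password already exists in AWS Secrets Manager:

# Read the secret at plan/apply time via a data source
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db-password"  # Hypothetical secret name
}

resource "aws_db_instance" "main" {
  # ... engine, instance_class, and other required arguments ...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

Keep in mind the resolved value still ends up in the state file, which is another reason to use an encrypted remote backend with restricted access.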

Plan Before Apply

Always run terraform plan before apply:

# Review changes
terraform plan -out=tfplan

# If changes look good, apply
terraform apply tfplan

In CI/CD, require human approval after plan:

# GitHub Actions example
- name: Terraform Plan
  run: terraform plan -out=tfplan

- name: Wait for Approval
  uses: trstringer/manual-approval@v1

- name: Terraform Apply
  run: terraform apply tfplan

Use State Locking

Prevent multiple people from modifying infrastructure simultaneously:

  • S3 + DynamoDB (Terraform with AWS)
  • Azure Storage (Terraform with Azure)
  • GCS (Terraform with GCP)
  • Terraform Cloud (automatic locking)

Without locking, two engineers running terraform apply at the same time can corrupt state.

Tag Everything

Apply consistent tags to all resources:

resource "aws_instance" "web" {
  ami           = var.ami
  instance_type = var.instance_type

  tags = {
    Name        = "web-server"
    Environment = var.environment
    Project     = "website"
    ManagedBy   = "Terraform"
    Owner       = "platform-team"
    CostCenter  = "engineering"
  }
}

Tags enable:

  • Cost tracking and allocation
  • Resource organization and filtering
  • Automated compliance checks
  • Disaster recovery planning
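
The AWS provider also supports default_tags, which applies a common tag set to every resource it creates, so individual resources only need resource-specific tags. A minimal sketch:

# Provider-level default tags; per-resource tags are merged on top
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Owner       = "platform-team"
    }
  }
}

resource "aws_instance" "web" {
  ami           = var.ami
  instance_type = var.instance_type

  tags = {
    Name = "web-server"  # Resource-specific tag only
  }
}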

Keep Modules Small and Focused

Each module should do one thing well:

Good:

  • vpc module: Creates VPC, subnets, route tables, internet gateway
  • eks module: Creates EKS cluster, node groups, IAM roles
  • rds module: Creates RDS instance, subnet group, security group

Bad:

  • everything module: Creates VPC, EKS, RDS, S3, and more in one file

Small modules are easier to test, understand, and reuse.

Document Your Infrastructure

Add README files to each module:

# VPC Module

Creates a VPC with public and private subnets across multiple availability zones.

## Usage

```hcl
module "vpc" {
  source = "./modules/vpc"

  cidr_block  = "10.0.0.0/16"
  azs         = ["us-east-1a", "us-east-1b"]
  environment = "production"
}
```

## Inputs

| Name | Description | Type | Default |
|------|-------------|------|---------|
| cidr_block | CIDR block for VPC | string | - |
| azs | Availability zones | list(string) | - |

## Outputs

| Name | Description |
|------|-------------|
| vpc_id | ID of the VPC |
| public_subnet_ids | IDs of public subnets |

Implement Drift Detection

Infrastructure can drift from the defined state due to manual changes. Detect drift regularly:

# Terraform
terraform plan  # Shows drift

# AWS CloudFormation
aws cloudformation detect-stack-drift --stack-name my-stack
aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id <id>

In CI/CD, run drift detection on a schedule (daily or weekly) and alert on changes.


Integrating IaC with CI/CD

Automated Deployment Pipeline

Typical workflow:

  1. Developer creates a feature branch, makes infrastructure changes
  2. CI runs terraform validate and terraform plan
  3. Pull request shows the planned changes
  4. Code review by team members
  5. Merge triggers terraform apply to staging
  6. Manual approval required for production
  7. Deploy to production with terraform apply

Example GitHub Actions Workflow

.github/workflows/terraform.yml:

name: Terraform

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Format
        run: terraform fmt -check

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        run: terraform plan -no-color
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    environment: production  # Requires approval
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

GitOps for Infrastructure

Similar to application GitOps, infrastructure changes are made exclusively through Git:

  1. All changes go through pull requests
  2. Main branch represents production state
  3. Automated systems apply changes from Git
  4. Manual console changes are prohibited

Tools like Atlantis automate Terraform operations in pull requests:

  • Comments terraform plan output in PR
  • Team members can review changes
  • Merge triggers automatic apply

Common IaC Problems and Solutions

Problem: "State file is locked"

Cause: Someone else is running Terraform, or a previous operation crashed.

Solution:

# Check who has the lock (S3 + DynamoDB)
aws dynamodb get-item --table-name terraform-locks --key '{"LockID":{"S":"<path>/terraform.tfstate"}}'

# Force unlock (use cautiously)
terraform force-unlock <lock-id>

Problem: "Resource already exists"

Cause: Resource was created manually or by another Terraform config.

Solution:

# Import existing resource into state
terraform import aws_instance.web i-0123456789abcdef0

# Or, remove from code and manage separately

Problem: "Dependencies are out of order"

Cause: Terraform can't determine the correct order to create resources.

Solution:

# Explicit dependency
resource "aws_eip" "web" {
  instance = aws_instance.web.id
  depends_on = [aws_internet_gateway.main]  # Force ordering
}

Problem: "Plan shows unexpected changes"

Cause: Drift from manual changes or state corruption.

Solution:

# Refresh state from actual infrastructure
# ("terraform refresh" is deprecated on newer versions; prefer "terraform apply -refresh-only")
terraform refresh

# Review what changed
terraform plan

# If acceptable, apply to resync
terraform apply

Problem: "Failed to destroy resource"

Cause: Dependencies prevent deletion (e.g., security group attached to instances).

Solution:

# Target specific resources
terraform destroy -target=aws_instance.web
terraform destroy -target=aws_security_group.web

# Or, remove protection from code
resource "aws_s3_bucket" "data" {
  # Remove or set to false
  lifecycle {
    prevent_destroy = false
  }
}

Problem: "Secrets in state file"

Cause: Sensitive values stored in state (passwords, API keys).

Solution:

  • Store state in encrypted remote backend (S3 with encryption, Terraform Cloud)
  • Restrict state file access via IAM/RBAC
  • Mark outputs as sensitive:
    output "db_password" {
      value     = aws_db_instance.main.password
      sensitive = true
    }
    
  • Never commit state files to Git

Learning Path: From Basics to Mastery

Week 1: Fundamentals

  • Install Terraform and configure AWS/Azure credentials
  • Create a simple EC2 instance or virtual machine
  • Learn init, plan, apply, destroy workflow
  • Understand resources, providers, and state

Week 2: Organization

  • Create reusable modules
  • Use variables and outputs
  • Set up remote state storage
  • Structure code by environment

Week 3: Real Infrastructure

  • Deploy a complete application stack:
    • VPC with public/private subnets
    • Load balancer
    • Auto-scaling group
    • RDS database
    • S3 bucket for assets
  • Practice updating and modifying infrastructure

Week 4: Automation

  • Integrate Terraform with Git
  • Set up CI/CD pipeline (GitHub Actions, GitLab CI)
  • Implement automated testing (terraform validate, tflint, checkov)
  • Add drift detection

Week 5: Advanced Concepts

  • Learn terraform import for existing resources
  • Practice disaster recovery (destroy and recreate)
  • Use workspaces for environment isolation
  • Implement modules from Terraform Registry

Month 2+: Production Patterns

  • Design multi-region infrastructure
  • Implement blue/green deployments with IaC
  • Practice Kubernetes manifest management (Helm, Kustomize)
  • Learn additional tools (Pulumi, CloudFormation)

Practical Exercises

Exercise 1: Simple Web Server

Deploy a web server with Terraform:

  • VPC with one public subnet
  • Security group allowing HTTP and SSH
  • EC2 instance running a simple web server
  • Output the instance's public IP

Exercise 2: Multi-Tier Application

Create a three-tier architecture:

  • VPC with public and private subnets
  • Public subnet: Load balancer
  • Private subnet: Application servers (auto-scaling group)
  • Private subnet: RDS database
  • S3 bucket for static assets

Exercise 3: Module Development

Extract the VPC configuration into a reusable module (a starting sketch follows this list):

  • Create modules/vpc directory
  • Define inputs (CIDR, AZs, tags)
  • Define outputs (VPC ID, subnet IDs)
  • Use the module in multiple environments
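
A hedged starting point for the module interface. The variable and output names are suggestions, and main.tf is assumed to define aws_vpc.main and a counted aws_subnet.public:

# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR block for the VPC"
  type        = string
}

variable "azs" {
  description = "Availability zones for the subnets"
  type        = list(string)
}

variable "tags" {
  description = "Tags applied to all resources"
  type        = map(string)
  default     = {}
}

# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "IDs of the public subnets"
  value       = aws_subnet.public[*].id  # Assumes count/for_each on aws_subnet.public in main.tf
}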

Exercise 4: CI/CD Integration

Set up automated deployment:

  • Store Terraform code in Git
  • Create GitHub Actions workflow
  • Run terraform plan on pull requests
  • Run terraform apply on merge to main
  • Require manual approval for production

Exercise 5: Disaster Recovery

Test infrastructure recovery:

  • Deploy a complete stack
  • Document the current state
  • Run terraform destroy
  • Rebuild from code with terraform apply
  • Verify all services are functional

The Bottom Line: IaC Is Non-Negotiable

Infrastructure as Code is no longer optional for cloud engineers. It's how modern infrastructure is managed. Manual provisioning through consoles doesn't scale, can't be tested, and creates endless opportunities for mistakes. Every major company managing cloud infrastructure uses IaC.

Start with Terraform if you're learning IaC for the first time: it's multi-cloud, has a massive community, and is the most requested IaC skill in cloud engineering job postings. Once you understand IaC principles, you can easily pick up other tools like CloudFormation, Pulumi, or ARM templates.

The sooner you start thinking in code, the faster you'll be able to build, test, and scale infrastructure. This isn't just a skill for DevOps engineers; it's core knowledge for anyone working with cloud platforms.

For more cloud engineering fundamentals, check out Git fundamentals, Linux fundamentals, and networking fundamentals.
