Infrastructure as Code Fundamentals
January 17, 2025
Learn Infrastructure as Code from the ground up: understand declarative vs imperative approaches, master tools like Terraform and CloudFormation, and implement best practices for managing cloud infrastructure at scale.
Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable configuration files rather than manual processes or interactive configuration tools. Instead of clicking through cloud consoles or running ad-hoc scripts, you define your entire infrastructure in code that can be versioned, tested, and deployed automatically.
This approach has become fundamental to modern cloud engineering. Every major cloud platform supports IaC, and it's a core requirement for most cloud engineering roles. This guide covers what Infrastructure as Code is, why it matters, the most popular tools, and how to implement IaC effectively in professional environments.
Why Cloud Engineers Need Infrastructure as Code
Manual infrastructure management doesn't scale. As your infrastructure grows, clicking through consoles becomes error-prone, inconsistent, and impossible to track. IaC solves these problems by treating infrastructure the same way we treat application code.
- Consistency and repeatability: Define infrastructure once, deploy it identically across environments (dev, staging, production).
- Version control: Track every change to your infrastructure in Git, see who changed what and when, and roll back mistakes instantly.
- Automation: Integrate infrastructure changes into CI/CD pipelines, eliminating manual deployment steps.
- Documentation: Your IaC files become living documentation of your infrastructure. New team members can read the code to understand the system.
- Disaster recovery: If your infrastructure is destroyed, you can rebuild it completely from code in minutes rather than days.
- Cost optimization: Review infrastructure before deployment, estimate costs, and catch over-provisioned resources in code reviews.
- Testing: Test infrastructure changes in isolated environments before applying them to production.
- Collaboration: Multiple engineers can work on infrastructure simultaneously, merging changes through pull requests.
Core IaC Concepts
Declarative vs Imperative
Declarative (what you want):
# Terraform - declarative
resource "aws_instance" "web" {
ami = "ami-12345678"
instance_type = "t3.medium"
count = 3
}
You describe the desired end state. The tool figures out how to achieve it. Most modern IaC tools (Terraform, CloudFormation, Pulumi) are declarative.
Imperative (step-by-step instructions):
# Imperative - traditional script
instance_id=$(aws ec2 run-instances --image-id ami-12345678 --count 3 --instance-type t3.medium \
  --query 'Instances[0].InstanceId' --output text)
aws ec2 create-tags --resources $instance_id --tags Key=Name,Value=web
aws ec2 create-security-group --group-name web-sg --description "Web security group"
You specify exactly which actions to perform in order. Traditional scripting is imperative.
Declarative is usually better because:
- The tool handles dependencies automatically (see the sketch after this list)
- Running the same code multiple times is safe (idempotent)
- You don't need to track current state manually
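For example, dependency handling falls out of references between resources. A minimal Terraform sketch (resource names and CIDR values are illustrative):

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id  # this reference creates an implicit dependency
  cidr_block = "10.0.1.0/24"
}

Because the subnet references the VPC's ID, Terraform knows to create the VPC first; you never spell out the ordering yourself.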
State Management
Most IaC tools maintain a state file that tracks:
- What resources currently exist
- How they map to your configuration
- Metadata and relationships between resources
Example Terraform state:
{
  "version": 4,
  "resources": [
    {
      "type": "aws_instance",
      "name": "web",
      "instances": [
        {
          "attributes": {
            "id": "i-0123456789abcdef0",
            "instance_type": "t3.medium",
            "public_ip": "54.123.45.67"
          }
        }
      ]
    }
  ]
}
Critical state management rules:
- Never edit state files manually
- Store state remotely (S3, Terraform Cloud, Azure Storage) for team access
- Lock state during operations to prevent conflicts
- Back up state files regularly
Idempotency
Running the same IaC code multiple times produces the same result. If the infrastructure already matches your code, nothing changes.
# First run: creates 3 instances
terraform apply
# Second run: no changes needed
terraform apply # Output: "No changes. Infrastructure is up-to-date."
This is a key advantage over imperative scripts, which often create duplicates or fail when run twice.
Modules and Reusability
Break infrastructure into reusable components:
infrastructure/
├── modules/
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── eks/
│ └── rds/
└── environments/
├── prod/
│ └── main.tf
├── staging/
└── dev/
Each module is self-contained and parameterized. Environments use modules with different inputs.
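For example, an environment's main.tf might compose modules like this. This is a minimal sketch: the input names (cidr_block, environment, vpc_id, subnet_ids) and the outputs referenced (vpc_id, private_subnet_ids) are assumptions about what the modules declare in their variables.tf and outputs.tf files.

# environments/prod/main.tf
module "vpc" {
  source      = "../../modules/vpc"
  cidr_block  = "10.0.0.0/16"
  environment = "prod"
}

module "rds" {
  source     = "../../modules/rds"
  vpc_id     = module.vpc.vpc_id              # assumes the vpc module outputs vpc_id
  subnet_ids = module.vpc.private_subnet_ids  # assumes a private_subnet_ids output
}

Staging and dev call the same modules with different input values, so the infrastructure logic is written once.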
Popular IaC Tools
Terraform
The most popular multi-cloud IaC tool, created by HashiCorp.
Pros:
- Works with AWS, Azure, GCP, and 3,000+ providers
- Large community and extensive documentation
- Declarative HCL language (HashiCorp Configuration Language)
- Strong state management and planning capabilities
- Free to use (source-available under the BUSL since 2023; the OpenTofu fork remains open source)
Cons:
- Learning curve for HCL syntax
- State file management can be complex
- Some providers lag behind cloud platform features
Use case: Best for multi-cloud environments or when you need provider flexibility.
AWS CloudFormation
AWS's native IaC service.
Pros:
- Deep AWS integration
- No external state management (AWS handles it)
- Support for new AWS services typically arrives quickly after release
- Free to use (only pay for resources created)
- StackSets for multi-account deployments
Cons:
- AWS-only (not multi-cloud)
- YAML/JSON can be verbose
- Error messages can be cryptic
- Stack updates and rollbacks can be slow
Use case: Best for AWS-only environments where native integration is a priority.
Pulumi
Modern IaC using general-purpose programming languages.
Pros:
- Write infrastructure in Python, TypeScript, Go, C#, or Java
- Full programming language features (loops, conditionals, functions)
- Strong typing and IDE support
- Multi-cloud support
- Easier for developers who already know these languages
Cons:
- Smaller community than Terraform
- More complex to learn if you're not a developer
- Pulumi Cloud is the default state backend (free tier available), though self-managed backends such as S3 are supported
Use case: Best for teams with strong programming backgrounds who want language flexibility.
Azure Resource Manager (ARM) and Bicep
ARM Templates: Azure's native IaC using JSON.
Bicep: Newer, simpler language that compiles to ARM templates.
Pros:
- Native Azure integration
- Bicep is much more readable than ARM JSON
- No external state management
- Free to use
Cons:
- Azure-only
- Smaller community compared to Terraform
- Less mature than CloudFormation or Terraform
Use case: Best for Azure-only environments.
Ansible
Configuration management tool that can also provision infrastructure.
Pros:
- Agentless (uses SSH)
- Simple YAML syntax
- Great for configuration management and application deployment
- Can orchestrate both infrastructure and software
Cons:
- Not purpose-built for IaC (better tools exist for infrastructure)
- No native state management
- Slower for large infrastructure deployments
- Imperative by default, though can be used declaratively
Use case: Best for configuration management and orchestration, not primary infrastructure provisioning.
Getting Started with Terraform
Terraform is the most popular IaC tool and a safe bet for learning. Here's a practical introduction.
Installation
# macOS
brew install terraform
# Windows (Chocolatey)
choco install terraform
# Linux
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform
# Verify installation
terraform --version
Basic Terraform Workflow
# 1. Initialize: Download provider plugins
terraform init
# 2. Plan: Preview changes before applying
terraform plan
# 3. Apply: Create/update infrastructure
terraform apply
# 4. Destroy: Remove all infrastructure
terraform destroy
Example: Creating an AWS EC2 Instance
Create main.tf:
# Configure the AWS provider
provider "aws" {
  region = "us-east-1"
}

# Create a VPC
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "main-vpc"
  }
}

# Create a subnet
resource "aws_subnet" "public" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name = "public-subnet"
  }
}

# Create a security group
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Allow HTTP and SSH"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Create an EC2 instance
resource "aws_instance" "web" {
  ami                    = "ami-0c55b159cbfafe1f0"
  instance_type          = "t3.micro"
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.web.id]

  tags = {
    Name = "web-server"
  }
}

# Output the public IP
output "instance_public_ip" {
  value = aws_instance.web.public_ip
}
Run the workflow:
terraform init
terraform plan
terraform apply # Type 'yes' to confirm
Terraform creates the VPC, subnet, security group, and EC2 instance in the correct order, handling dependencies automatically.
Variables and Outputs
variables.tf:
variable "region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "instance_type" {
description = "EC2 instance type"
type = string
default = "t3.micro"
}
variable "environment" {
description = "Environment name"
type = string
}
terraform.tfvars:
region = "us-west-2"
instance_type = "t3.small"
environment = "production"
outputs.tf:
output "vpc_id" {
description = "ID of the VPC"
value = aws_vpc.main.id
}
output "instance_ip" {
description = "Public IP of the instance"
value = aws_instance.web.public_ip
}
Variables make your infrastructure reusable across environments. Outputs expose values for use in other modules or for reference.
Remote State with S3
Store Terraform state remotely for team collaboration:
backend.tf:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
This stores state in S3 and uses DynamoDB for state locking, preventing concurrent modifications.
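The bucket and lock table must exist before you can initialize against them. A minimal bootstrap sketch, usually applied once from a separate configuration; the bucket and table names are placeholders and must match the values in backend.tf (S3 bucket names are globally unique, so yours will differ):

resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state"
}

# Versioning lets you recover an earlier state version if one is corrupted
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# DynamoDB table used only for state locking; Terraform expects a
# string attribute named LockID as the hash key
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}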
IaC Best Practices
Structure Your Code
Organize by environment and module:
infrastructure/
├── modules/
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── README.md
│ ├── eks/
│ ├── rds/
│ └── s3/
├── environments/
│ ├── prod/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── backend.tf
│ │ └── terraform.tfvars
│ ├── staging/
│ └── dev/
├── .gitignore
└── README.md
Each module is self-contained and testable. Environments are thin layers that compose modules.
Use Version Control
Treat infrastructure code like application code:
- Commit all IaC files to Git
- Use .gitignore to exclude sensitive files
- Open pull requests for infrastructure changes
- Require code reviews before merging
.gitignore for Terraform:
# Terraform files
.terraform/
*.tfstate
*.tfstate.backup
*.tfvars

# Crash logs
crash.log

# Sensitive files
*.pem
*.key
.env
Note that .terraform.lock.hcl is not listed: HashiCorp recommends committing the dependency lock file so every run uses the same provider versions.
Never Hardcode Secrets
Bad:
resource "aws_db_instance" "main" {
password = "mysecretpassword123" # NEVER DO THIS
}
Good:
resource "aws_db_instance" "main" {
password = var.db_password # Passed from environment variable
}
Use:
- Environment variables: export TF_VAR_db_password=<secret>
- AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault (see the sketch below)
- Terraform Cloud encrypted variables
- .tfvars files excluded from version control
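A minimal sketch of the Secrets Manager approach. The secret name prod/db/password is hypothetical and the secret must already exist; only the wiring pattern is the point:

# Read an existing secret at plan/apply time
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"  # hypothetical secret name
}

resource "aws_db_instance" "main" {
  # ... other arguments ...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

Keep in mind the resolved value still ends up in the state file, which is one more reason to encrypt remote state and restrict access to it.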
Plan Before Apply
Always run terraform plan before apply:
# Review changes
terraform plan -out=tfplan
# If changes look good, apply
terraform apply tfplan
In CI/CD, require human approval after plan:
# GitHub Actions example
- name: Terraform Plan
  run: terraform plan -out=tfplan

- name: Wait for Approval
  uses: trstringer/manual-approval@v1

- name: Terraform Apply
  run: terraform apply tfplan
Use State Locking
Prevent multiple people from modifying infrastructure simultaneously:
- S3 + DynamoDB (Terraform with AWS)
- Azure Storage (Terraform with Azure)
- GCS (Terraform with GCP)
- Terraform Cloud (automatic locking)
Without locking, two engineers running terraform apply at the same time can corrupt state.
Tag Everything
Apply consistent tags to all resources:
resource "aws_instance" "web" {
ami = var.ami
instance_type = var.instance_type
tags = {
Name = "web-server"
Environment = var.environment
Project = "website"
ManagedBy = "Terraform"
Owner = "platform-team"
CostCenter = "engineering"
}
}
Tags enable:
- Cost tracking and allocation
- Resource organization and filtering
- Automated compliance checks
- Disaster recovery planning
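To keep common tags consistent without repeating them on every resource, the AWS provider also supports default tags set once at the provider level. A minimal sketch; the tag keys and values here are illustrative:

provider "aws" {
  region = var.region

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Owner       = "platform-team"
    }
  }
}

Per-resource tags such as Name are merged on top of the defaults, with the resource-level value winning on conflicts.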
Keep Modules Small and Focused
Each module should do one thing well:
Good:
- vpc module: Creates VPC, subnets, route tables, internet gateway
- eks module: Creates EKS cluster, node groups, IAM roles
- rds module: Creates RDS instance, subnet group, security group
Bad:
- everything module: Creates VPC, EKS, RDS, S3, and more in one file
Small modules are easier to test, understand, and reuse.
Document Your Infrastructure
Add README files to each module:
# VPC Module

Creates a VPC with public and private subnets across multiple availability zones.

## Usage

```hcl
module "vpc" {
  source      = "./modules/vpc"
  cidr_block  = "10.0.0.0/16"
  azs         = ["us-east-1a", "us-east-1b"]
  environment = "production"
}
```

## Inputs

| Name | Description | Type | Default |
|------|-------------|------|---------|
| cidr_block | CIDR block for VPC | string | - |
| azs | Availability zones | list(string) | - |

## Outputs

| Name | Description |
|------|-------------|
| vpc_id | ID of the VPC |
| public_subnet_ids | IDs of public subnets |

Implement Drift Detection
Infrastructure can drift from the defined state due to manual changes. Detect drift regularly:
```bash
# Terraform
terraform plan  # Shows drift

# AWS CloudFormation
aws cloudformation detect-stack-drift --stack-name my-stack
aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id <id>
```
In CI/CD, run drift detection on a schedule (daily or weekly) and alert on changes.
Integrating IaC with CI/CD
Automated Deployment Pipeline
Typical workflow:
- Developer creates a feature branch and makes infrastructure changes
- CI runs terraform validate and terraform plan
- Pull request shows the planned changes
- Code review by team members
- Merge triggers terraform apply to staging
- Manual approval required for production
- Deploy to production with terraform apply
Example GitHub Actions Workflow
.github/workflows/terraform.yml:
name: Terraform

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Format
        run: terraform fmt -check

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        run: terraform plan -no-color
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    environment: production  # Requires approval
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GitOps for Infrastructure
Similar to application GitOps, infrastructure changes are made exclusively through Git:
- All changes go through pull requests
- Main branch represents production state
- Automated systems apply changes from Git
- Manual console changes are prohibited
Tools like Atlantis automate Terraform operations in pull requests:
- Comments terraform plan output in the PR
- Team members review the planned changes
- Merge triggers automatic apply
Common IaC Problems and Solutions
Problem: "State file is locked"
Cause: Someone else is running Terraform, or a previous operation crashed.
Solution:
# Check who has the lock (S3 + DynamoDB)
aws dynamodb get-item --table-name terraform-locks --key '{"LockID":{"S":"<path>/terraform.tfstate"}}'
# Force unlock (use cautiously)
terraform force-unlock <lock-id>
Problem: "Resource already exists"
Cause: Resource was created manually or by another Terraform config.
Solution:
# Import existing resource into state
terraform import aws_instance.web i-0123456789abcdef0
# Or, remove from code and manage separately
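If you're on Terraform 1.5 or newer, the import can also be declared in configuration with an import block and carried out by plan/apply. A minimal sketch, reusing the placeholder instance ID from above:

import {
  to = aws_instance.web
  id = "i-0123456789abcdef0"
}

Running terraform plan then shows the pending import alongside any changes needed for the configuration to match the real resource.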
Problem: "Dependencies are out of order"
Cause: Terraform can't determine the correct order to create resources.
Solution:
# Explicit dependency
resource "aws_eip" "web" {
  instance   = aws_instance.web.id
  depends_on = [aws_internet_gateway.main]  # Force ordering
}
Problem: "Plan shows unexpected changes"
Cause: Drift from manual changes or state corruption.
Solution:
# Refresh state from actual infrastructure
# (standalone refresh is deprecated in newer versions; prefer terraform apply -refresh-only)
terraform refresh
# Review what changed
terraform plan
# If acceptable, apply to resync
terraform apply
Problem: "Failed to destroy resource"
Cause: Dependencies prevent deletion (e.g., security group attached to instances).
Solution:
# Target specific resources
terraform destroy -target=aws_instance.web
terraform destroy -target=aws_security_group.web
# Or, remove protection from code
resource "aws_s3_bucket" "data" {
# Remove or set to false
lifecycle {
prevent_destroy = false
}
}
Problem: "Secrets in state file"
Cause: Sensitive values stored in state (passwords, API keys).
Solution:
- Store state in encrypted remote backend (S3 with encryption, Terraform Cloud)
- Restrict state file access via IAM/RBAC
- Mark outputs as sensitive:
  output "db_password" {
    value     = aws_db_instance.main.password
    sensitive = true
  }
- Never commit state files to Git
Learning Path: From Basics to Mastery
Week 1: Fundamentals
- Install Terraform and configure AWS/Azure credentials
- Create a simple EC2 instance or virtual machine
- Learn the init, plan, apply, destroy workflow
- Understand resources, providers, and state
Week 2: Organization
- Create reusable modules
- Use variables and outputs
- Set up remote state storage
- Structure code by environment
Week 3: Real Infrastructure
- Deploy a complete application stack:
- VPC with public/private subnets
- Load balancer
- Auto-scaling group
- RDS database
- S3 bucket for assets
- Practice updating and modifying infrastructure
Week 4: Automation
- Integrate Terraform with Git
- Set up CI/CD pipeline (GitHub Actions, GitLab CI)
- Implement automated testing (terraform validate, tflint, checkov)
- Add drift detection
Week 5: Advanced Concepts
- Learn terraform import for existing resources
- Practice disaster recovery (destroy and recreate)
- Use workspaces for environment isolation
- Implement modules from Terraform Registry
Month 2+: Production Patterns
- Design multi-region infrastructure
- Implement blue/green deployments with IaC
- Practice Kubernetes manifest management (Helm, Kustomize)
- Learn additional tools (Pulumi, CloudFormation)
Practical Exercises
Exercise 1: Simple Web Server
Deploy a web server with Terraform:
- VPC with one public subnet
- Security group allowing HTTP and SSH
- EC2 instance running a simple web server
- Output the instance's public IP
Exercise 2: Multi-Tier Application
Create a three-tier architecture:
- VPC with public and private subnets
- Public subnet: Load balancer
- Private subnet: Application servers (auto-scaling group)
- Private subnet: RDS database
- S3 bucket for static assets
Exercise 3: Module Development
Extract the VPC configuration into a reusable module:
- Create a modules/vpc directory
- Define inputs (CIDR, AZs, tags)
- Define outputs (VPC ID, subnet IDs)
- Use the module in multiple environments
Exercise 4: CI/CD Integration
Set up automated deployment:
- Store Terraform code in Git
- Create GitHub Actions workflow
- Run terraform plan on pull requests
- Run terraform apply on merge to main
- Require manual approval for production
Exercise 5: Disaster Recovery
Test infrastructure recovery:
- Deploy a complete stack
- Document the current state
- Run terraform destroy
- Rebuild from code with terraform apply
- Verify all services are functional
The Bottom Line: IaC Is Non-Negotiable
Infrastructure as Code is no longer optional for cloud engineers. It's how modern infrastructure is managed. Manual provisioning through consoles doesn't scale, can't be tested, and creates endless opportunities for mistakes. Every major company managing cloud infrastructure uses IaC.
Start with Terraform if you're learning IaC for the first time: it's multi-cloud, has a massive community, and is the most requested skill in cloud engineering job postings. Once you understand IaC principles, you can easily pick up other tools like CloudFormation, Pulumi, or ARM templates.
The sooner you start thinking in code, the faster you'll be able to build, test, and scale infrastructure. This isn't just a skill for DevOps engineers; it's core knowledge for anyone working with cloud platforms.
For more cloud engineering fundamentals, check out Git fundamentals, Linux fundamentals, and networking fundamentals.