Introduction
Welcome to A Guide to DevOps Engineering: Bridging the Gap – a book written specifically for junior DevOps engineers who want to accelerate their growth and learn the lessons that typically take years of experience to acquire.
Why This Book Exists
The journey from junior to senior DevOps engineer is filled with challenges that textbooks and tutorials rarely address. While there are countless resources teaching you how to use Kubernetes, Terraform, or CI/CD tools, few teach you when, why, and what can go wrong when you use them in production environments.
This book fills that gap.
Who This Book Is For
This book is designed for:
- Junior DevOps engineers (6-18 months of experience) who want to level up faster
- System administrators transitioning to DevOps roles
- Software developers expanding into infrastructure and operations
- Anyone who has deployed to production and realized there's so much more to learn
You should have basic familiarity with:
- Linux command line
- Git version control
- Docker containers (basic usage)
- At least one cloud provider (AWS, Azure, or GCP)
- Basic programming/scripting (Python, Bash, or similar)
What Makes This Book Different
Scenario-Based Learning
Instead of dry explanations, you'll follow Sarah, a junior DevOps engineer, as she encounters real-world challenges. You'll experience the problem from her perspective, understand the context, and learn both the immediate solution and the deeper principles.
Senior Engineer Thinking
Each chapter includes "The Senior's Perspective" – revealing the mental models, frameworks, and considerations that experienced engineers apply automatically but rarely articulate.
Lessons from Production
The scenarios in this book are based on real incidents, challenges, and "aha moments" that engineers experience in production environments. You'll learn from others' mistakes without having to make them all yourself.
Practical and Production-Ready
Every code example, configuration, and command is production-ready and follows industry best practices. No toy examples – this is the real deal.
Bridging Knowledge Gaps
The book explicitly addresses the "unknown unknowns" – the things you don't know to ask about because you haven't encountered them yet.
How This Book Is Structured
The book is organized into seven parts:
- Foundations – Core DevOps practices including deployments, logging, environments, and CI/CD
- Infrastructure as Code – Mastering Terraform, managing state, modules, and cost control
- Container Orchestration – Deep dive into Kubernetes, from basics to production patterns
- Observability and Reliability – Monitoring, alerting, tracing, SLOs, and debugging
- Security and Compliance – Container security, secrets management, access control, and compliance
- CI/CD Mastery – Advanced pipeline patterns, testing strategies, and GitOps
- Collaboration and Culture – Communication, on-call, automation decisions, and career growth
Each chapter follows a consistent structure:
- What You'll Learn – The chapter's learning objectives at a glance
- Sarah's Challenge – The scenario and context
- Understanding the Problem – Breaking down the concepts
- The Senior's Perspective – How experienced engineers think about this
- The Solution – Step-by-step walkthrough with code
- Lessons Learned – Key takeaways and when to apply them
- Reflection Questions – Help you apply concepts to your context
What You'll Learn
By the end of this book, you will:
- Understand the "why" behind DevOps best practices, not just the "how"
- Recognize common production issues before they become incidents
- Make architectural and tooling decisions with confidence
- Debug complex distributed systems systematically
- Implement security and compliance without sacrificing velocity
- Build reliable, observable, and maintainable infrastructure
- Communicate effectively with both technical and non-technical stakeholders
- Navigate your career growth intentionally
A Note on Tools and Technologies
This book uses specific tools (Kubernetes, Terraform, AWS, Prometheus, etc.) because concrete examples are more valuable than abstract concepts. However, the principles and mental models apply regardless of your specific tech stack.
If you use different tools:
- Azure instead of AWS? The cloud concepts still apply
- GitLab CI instead of Jenkins? The pipeline principles are the same
- Nomad instead of Kubernetes? The orchestration patterns translate
- Pulumi instead of Terraform? The IaC best practices remain relevant
Focus on the why and the thinking process, and you'll be able to apply these lessons to any technology.
How to Get the Most from This Book
For Cover-to-Cover Readers
The book is designed to be read sequentially. Each chapter builds on concepts from previous chapters, and Sarah's journey follows a logical progression.
For Reference Seekers
Need to solve a specific problem? Check the detailed table of contents and jump to the relevant chapter. Each chapter is self-contained enough to be useful on its own.
For Hands-On Learners
All code examples are available in the accompanying GitHub repository. Clone it, experiment, break things, and rebuild them. The best learning happens through doing.
For Discussion Groups
This book works great as a book club or team learning resource. The reflection questions at the end of each chapter are designed to spark discussions about how concepts apply to your specific environment.
Contributing to This Book
This book is open source! If you find errors, have suggestions, or want to contribute additional scenarios, please visit our GitHub repository. The DevOps community thrives on shared knowledge, and your contributions help other junior engineers on their journey.
A Personal Note
Every senior engineer was once a junior engineer who felt overwhelmed by the complexity of production systems. The difference isn't innate talent – it's experience, mentorship, and a lot of learning from mistakes.
This book is the mentorship and experience compressed into a format you can absorb in weeks or months instead of years. But remember: reading is just the first step. Apply these lessons, experiment, make mistakes in safe environments, and keep growing.
The gap between junior and senior isn't as wide as it seems. Let's bridge it together.
Ready to begin? Let's meet Sarah and start her first day dealing with a production incident.
About Sarah
Before we dive into the technical journey, let's get to know Sarah – the junior DevOps engineer you'll be following throughout this book.
Sarah's Background
Sarah Martinez is 27 years old and has been working as a DevOps engineer for about 8 months at TechFlow, a mid-sized SaaS company with approximately 150 employees. TechFlow provides a B2B project management platform used by thousands of companies worldwide.
Her Journey to DevOps
Sarah didn't start in DevOps. Like many in the field, she took a winding path:
- Computer Science degree from a state university (graduated 3 years ago)
- First job: Junior software developer at a small consultancy, building web applications
- Transition: After 2 years of development, she became curious about how applications get deployed, monitored, and scaled
- Current role: Joined TechFlow's platform team 8 months ago as their second DevOps engineer
What She Knows
Sarah has solid foundations in:
- Programming: Comfortable with Python and JavaScript; can write Bash scripts
- Linux: Daily user, knows common commands, can SSH and navigate servers
- Docker: Has containerized several applications, understands images and containers
- AWS basics: Can launch EC2 instances, create S3 buckets, and navigate the console
- Git: Proficient with branches, commits, pull requests, and merge conflicts
- CI/CD: Has set up basic GitHub Actions workflows
What She's Learning
Sarah is still getting comfortable with:
- Kubernetes: Deployed a few services but doesn't fully understand the networking model
- Terraform: Can modify existing code but struggles with state management and modules
- Monitoring: Knows she should monitor things, but unsure what metrics matter
- Incident response: Has been paged once and it was stressful
- Making decisions: Often second-guesses herself when choosing between approaches
Her Challenges
Like most junior engineers, Sarah faces common challenges:
- Imposter syndrome: Surrounded by senior engineers who seem to know everything
- Information overload: Every solution seems to require learning three new tools
- Production anxiety: Fears breaking things in production
- Unknown unknowns: Doesn't know what she doesn't know
- Time pressure: Balancing learning with delivering on sprint commitments
The TechFlow Environment
To understand Sarah's scenarios, it helps to know her company's technical landscape:
The Application
TechFlow runs a microservices architecture with:
- 12 core services (user management, projects, tasks, notifications, etc.)
- 3 frontend applications (web app, mobile API, admin panel)
- PostgreSQL databases (RDS on AWS)
- Redis for caching and session management
- RabbitMQ for async messaging
The Infrastructure
- Cloud Provider: AWS
- Orchestration: Kubernetes (EKS) with 3 clusters (dev, staging, production)
- IaC: Terraform for infrastructure, Helm for Kubernetes deployments
- CI/CD: GitHub for code, GitHub Actions for CI/CD pipelines
- Monitoring: Prometheus and Grafana (recently adopted)
- Logging: CloudWatch Logs (migrating to ELK stack)
The Team
Sarah works on the Platform Team:
- Marcus (Engineering Manager) – Former DevOps lead, now managing the team
- James (Senior DevOps Engineer) – 7 years of experience, Sarah's mentor, very patient
- Sarah (DevOps Engineer) – That's our protagonist!
- Priya (DevOps Engineer) – Joined 3 months after Sarah, also learning
The team also collaborates closely with:
- Development teams (3 teams, ~15 developers total)
- Product team (defining features and priorities)
- On-call rotation (all engineers participate)
Why Sarah?
Sarah represents the reality of junior DevOps engineers:
- She's capable but not yet confident
- She knows the basics but lacks production experience
- She's eager to learn but sometimes overwhelmed
- She makes mistakes and learns from them
- She asks questions even when she feels she should "already know"
- She's relatable – her challenges are probably your challenges too
Sarah's Goals
Throughout this book, Sarah aims to:
- Build confidence in making production decisions
- Develop systematic approaches to debugging and problem-solving
- Understand the "why" behind best practices, not just the "what"
- Learn to balance quick fixes with proper solutions
- Communicate technical concepts effectively
- Eventually mentor other junior engineers
Following Sarah's Journey
Each chapter presents a real scenario Sarah encounters at TechFlow. You'll see:
- Her initial reaction and uncertainty
- How she approaches the problem
- Guidance from James (the senior engineer)
- The solution and its reasoning
- Lessons she takes away
Sarah's journey isn't linear – she'll make mistakes, circle back to concepts, and gradually build competence. Just like real professional growth.
Your Journey Alongside Sarah
As you read Sarah's story:
- Reflect on your own experiences – Have you faced similar challenges?
- Notice the thought processes – How does Sarah's thinking evolve?
- Try the examples – All the code and configurations are real and runnable
- Ask "what if" – How would you handle different constraints or contexts?
Remember: Sarah is learning, and so are you. It's okay to not understand everything immediately. The goal is progress, not perfection.
Now that you know Sarah, let's talk about how to get the most out of this book.
Continue to How to Use This Book →
How to Use This Book
This book is designed to be flexible – whether you're reading cover-to-cover, looking for specific solutions, or using it as a team learning resource. Here's how to get the most value based on your goals and learning style.
Reading Strategies
The Complete Journey (Recommended for First Read)
Best for: Junior engineers who want comprehensive growth
Read the book sequentially from Part I to Part VII. This approach:
- Builds foundational knowledge progressively
- Follows Sarah's growth as she gains experience
- Introduces concepts in a logical order
- Creates connections between related topics
Time commitment: 40-60 hours (spread over 2-3 months)
Approach:
- Read one chapter at a time
- Try the code examples in a safe environment
- Answer the reflection questions
- Wait a day or two before the next chapter (let concepts settle)
- Revisit chapters when you encounter similar situations at work
The Reference Approach
Best for: Experienced juniors or those facing specific challenges
Use the detailed table of contents to jump to relevant chapters.
When to use:
- "Our Terraform state is corrupted" → Chapter 6
- "I need to set up monitoring" → Chapter 17
- "How do I handle secrets properly?" → Chapter 24
- "Planning my first on-call rotation" → Chapter 34
Approach:
- Use the SUMMARY.md to find relevant chapters
- Read the "Sarah's Challenge" section to see if it matches your situation
- Skim the "Understanding the Problem" for context
- Focus on "The Solution" and "Lessons Learned"
- Read related chapters mentioned in the text
The Hands-On Lab Approach
Best for: Kinesthetic learners who learn by doing
Set up a lab environment and work through examples as you read.
Setup required:
- Local Kubernetes cluster (minikube, kind, or k3s)
- AWS free tier account (or equivalent)
- Terraform installed locally
- Docker Desktop or equivalent
Approach:
- Read the scenario
- Pause before the solution
- Try to solve it yourself
- Compare your approach with Sarah's solution
- Experiment with variations
The Team Learning Approach
Best for: Teams wanting to level up together
Use this book as a structured learning program for your team.
Format:
- Weekly discussion: One chapter per week
- Meeting length: 60-90 minutes
- Rotation: Different team member presents each week
Structure:
- Everyone reads the chapter beforehand (30-40 min)
- Presenter summarizes key points (10 min)
- Group discusses how concepts apply to your environment (20 min)
- Share personal experiences with similar challenges (15 min)
- Identify one thing to implement or improve (10 min)
- Optional: Hands-on exercise together (30 min)
The Certification Prep Approach
Best for: Preparing for DevOps certifications (CKA, AWS DevOps, etc.)
Use this book alongside official study guides for practical context.
Approach:
- Study official certification material for theoretical knowledge
- Read relevant chapters for real-world application
- Use code examples for hands-on practice
- Focus on "Common Misconceptions" sections
How to Approach Each Chapter
Before Reading
- Skim the title and introduction – What challenge will Sarah face?
- Check prerequisites – Do you need to review earlier chapters?
- Prepare your lab (if hands-on) – Have the environment ready
During Reading
1. Read Sarah's Challenge first – Put yourself in her shoes
   - What would YOU do?
   - What information would you need?
   - What are you uncertain about?
2. Study the diagrams carefully – Visualize the architecture and flow
3. Don't skip the "Senior's Perspective" – This is where the wisdom is
   - Notice what questions are asked first
   - Observe the decision-making framework
   - Identify what considerations matter
4. Try the code examples – Don't just read them
   - Type them out (builds muscle memory)
   - Modify them (test your understanding)
   - Break them intentionally (learn what fails)
5. Pause at "Lessons Learned" – Reflect before moving on
   - Do you agree with the takeaways?
   - Can you think of exceptions?
   - How does this apply to your context?
After Reading
- Answer the reflection questions – Write or discuss responses
- Bookmark for later – Note chapters to revisit
- Apply one concept – Pick one thing to try at work
- Share with your team – Teaching reinforces learning
Special Features and How to Use Them
"What You'll Learn" Sections
Quick lists at the start of each chapter summarizing what you'll be able to do by the end. Use these: Skim them before reading to focus your attention, and revisit them after reading to check your understanding against the outcomes.
Tip Boxes
Quick, actionable advice that you can apply immediately. Use these: Bookmark or copy to your notes for reference.
Warning Boxes
Common mistakes and anti-patterns to avoid. Use these: Check your existing systems for these issues.
Diagrams
Visual representations of architectures, flows, and concepts. Use these: Draw similar diagrams for your own systems.
Deep Dive Sections
Advanced topics for curious readers. Use these: Skip on first read; return when ready for more depth.
Sarah's Thoughts
Sarah's internal monologue showing her thinking process. Use these: Notice how her thinking evolves over time.
Reflection Questions
Questions to help you apply concepts to your situation. Use these: Journal responses or discuss with peers.
Companion Resources
Code Examples Repository
All code examples, configurations, and scripts are available in the GitHub repository:
https://github.com/BahaTanvir/devops-guide-book
Repository structure:
examples/
├── chapter-01/           # Working examples for each chapter
├── chapter-02/
└── ...
terraform-modules/        # Reusable Terraform modules
kubernetes-manifests/     # Example K8s YAML files
scripts/                  # Helper scripts
labs/                     # Hands-on lab exercises
Community Forum
Join discussions with other readers:
- Ask questions
- Share your own scenarios
- Get help with exercises
- Connect with mentors
Video Walkthroughs (Coming Soon)
Selected chapters will have video companions demonstrating:
- Complex CLI operations
- Debugging processes
- Architecture diagrams explained
Creating Your Learning Environment
Recommended Setup
For the best hands-on experience:
# Local Kubernetes cluster
brew install kind # or minikube, k3s
kind create cluster --name devops-learning
# Essential tools
brew install kubectl terraform helm
brew install awscli # if using AWS
brew install --cask docker # Docker Desktop (or use an existing Docker install)
# Monitoring tools
brew install k9s # Kubernetes CLI UI
brew install kubectx # Context switching
Safe Practice Environment
Option 1: Local Only
- Use kind or minikube for Kubernetes
- LocalStack for AWS emulation
- No risk of cloud costs
Option 2: Cloud Free Tier
- AWS/GCP/Azure free tier account
- Set up billing alerts ($10 threshold)
- Use small instance types
- Remember to tear down resources
Option 3: Company Sandbox
- Ask your employer for a dev/sandbox account
- Isolated from production
- Real cloud environment
Lab Etiquette
- Tag all resources with your name and purpose
- Monitor costs – set up alerts
- Clean up after each session
- Never use production credentials
- Document your experiments
Pace Yourself
Recommended Schedule
Intensive Track (3 months):
- 2-3 chapters per week
- 2-3 hours per chapter
- Active hands-on practice
Balanced Track (6 months):
- 1-2 chapters per week
- 1-2 hours per chapter
- Selective hands-on practice
Relaxed Track (12 months):
- 1 chapter per week
- 30-60 minutes per chapter
- Read and reflect, less hands-on
There's no "right" pace – choose what fits your schedule and learning style.
Avoiding Burnout
- Don't rush through chapters
- Take breaks between sections
- Celebrate small wins
- It's okay to not understand everything immediately
- Return to challenging chapters later
Measuring Progress
Self-Assessment
After completing each part, ask yourself:
Confidence Level:
- Can I explain this concept to someone else?
- Could I implement this in a real environment?
- Do I understand when to apply this approach?
Practical Application:
- Have I tried at least one example?
- Can I modify the example for my use case?
- Do I know where to find more information?
Critical Thinking:
- Do I understand the trade-offs?
- Can I identify when NOT to use this approach?
- What questions do I still have?
Portfolio Building
As you progress:
- Create a personal documentation wiki
- Build a GitHub repository with your examples
- Write blog posts about what you've learned
- Present learnings at team meetings
When You Get Stuck
- Re-read the chapter – Often makes more sense the second time
- Check the GitHub issues – Someone may have asked the same question
- Try a simpler version – Break down the problem
- Ask in the community forum – Others are learning too
- Move on and return later – Sometimes you need more context
Updating Your Knowledge
DevOps tools and practices evolve rapidly:
- Core concepts remain relevant (monitoring, IaC, CI/CD principles)
- Specific tools may change (but patterns transfer)
- Check the GitHub repo for updates and errata
- Community contributions keep examples current
A Note on Certification
This book alone won't pass a certification exam, but it will:
- Provide real-world context for exam concepts
- Help you understand WHY things work, not just HOW
- Give you confidence to apply knowledge practically
- Prepare you for interview questions
Combine this book with official study guides for best results.
Ready to Start?
You now have everything you need to begin your journey with Sarah. Remember:
- Be patient with yourself – Learning takes time
- Stay curious – Ask "why" often
- Practice deliberately – Hands-on experience is key
- Share your knowledge – Teaching others deepens understanding
- Enjoy the journey – DevOps is challenging but rewarding
Let's get started with Chapter 1, where Sarah faces her first production incident.
Chapter 1: The Incident That Changed Everything
"The best teacher is experience, and the most memorable lessons come from production outages."
Sarah's Challenge
It was a Thursday afternoon, three months into her role at TechFlow, when Sarah experienced her first production incident. She had just finished lunch and was reviewing a pull request when her phone buzzed. Then again. And again.
The #incidents Slack channel was exploding with messages:
@channel CRITICAL: Checkout service is down
Multiple customer reports - cannot complete purchases
Revenue impact - immediate attention needed
Sarah's heart raced. She had deployed a new version of the checkout service just 20 minutes ago. The deployment had completed successfully – all green checkmarks in the CI/CD pipeline. She had even checked the pods, and they were running. What could have gone wrong?
"Sarah, did you just deploy checkout?" James, the senior DevOps engineer, appeared at her desk.
"Yes, about twenty minutes ago. Version 2.3.0. The deployment succeeded, and all pods are running," Sarah replied, her voice tight with anxiety.
"Let me take a look with you," James said calmly, pulling up a chair. "Show me what you deployed."
Sarah pulled up her terminal, fingers slightly trembling as she typed:
kubectl get pods -n production -l app=checkout-service
The output showed:
NAME                                READY   STATUS    RESTARTS   AGE
checkout-service-7d8f4c5b9d-8xk2p   1/1     Running   0          19m
checkout-service-7d8f4c5b9d-j7h9m   1/1     Running   0          19m
checkout-service-7d8f4c5b9d-m2p4w   1/1     Running   0          19m
"See? All three pods are running," Sarah said, confused.
"Running doesn't mean working," James said gently. "Let's check the logs."
kubectl logs checkout-service-7d8f4c5b9d-8xk2p -n production
The terminal filled with error messages:
Error: DATABASE_URL environment variable not set
Fatal: Cannot connect to database
Application startup failed
[1] 156 segmentation fault ./checkout-service
Sarah's stomach dropped. "Oh no. I forgot to add the new database environment variable."
The new version of the checkout service required a DATABASE_URL environment variable that she had tested locally but never added to the Kubernetes deployment configuration. The pods started successfully because the container launched, but the application inside crashed immediately. Since there were no proper health checks configured, Kubernetes kept the pods in "Running" state even though they weren't serving any traffic.
"This is a perfect learning moment," James said. "Let's fix this and talk about what happened. First priority: restore service. Can you roll back?"
Sarah's mind went blank. "How do I roll back?"
Understanding the Problem
Sarah's first incident revealed several common issues that junior DevOps engineers face:
1. The "Running" vs "Ready" Misconception
In Kubernetes, a pod can be in "Running" state without actually being able to serve traffic. Here's what happened:
- Container Started: The checkout service container launched successfully
- Process Started: The main application process started
- Application Crashed: The application immediately crashed due to missing configuration
- Kubernetes Unaware: Without proper health checks, Kubernetes had no way to know the application wasn't working
This is one of the most common sources of confusion for newcomers to Kubernetes. The pod status reflects the container runtime state, not the application health.
2. Missing Health Checks
Sarah's deployment had no health checks configured. Kubernetes supports three types of probes:
- Liveness Probe: Is the application alive? If not, restart the container
- Readiness Probe: Is the application ready to serve traffic? If not, remove from service endpoints
- Startup Probe: Has the application finished starting up? (For slow-starting applications)
Without these probes, Kubernetes assumes a running container is a healthy container – a dangerous assumption.
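Concretely, all three probes attach to a container spec along these lines. This is an illustrative sketch – the endpoints, ports, and timings are examples, and must be tuned to your application's actual startup behavior:

```yaml
# Illustrative only -- endpoints, ports, and timings are examples.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
startupProbe:           # for slow-starting applications
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30  # allows up to 30 * 10s = 300s to finish starting
```

A startup probe is especially useful for slow-booting services: liveness checks are held off until it succeeds, so the container isn't restarted mid-initialization.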
3. Configuration Drift Between Environments
The classic "works on my machine" problem manifested here:
- Local Development: Sarah set DATABASE_URL in her .env file
- Staging: The variable was configured in the staging deployment (she had tested there)
- Production: She forgot to add it to the production deployment manifest
This environment configuration drift is a frequent source of production issues.
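Drift like this can also be caught mechanically before a deploy. As a rough sketch (a hypothetical helper, not part of TechFlow's tooling), you could compare the environment variable names declared in two manifests:

```python
# Hypothetical pre-deploy check (not part of TechFlow's tooling): compare the
# env var names declared in two manifests and flag anything missing.
import re

def env_names(manifest: str) -> set:
    """Extract environment variable names from a manifest's env: entries."""
    return set(re.findall(r"-\s*name:\s*([A-Z][A-Z0-9_]*)", manifest))

# Illustrative manifest fragments -- in reality you would read the real files.
staging = """
        env:
        - name: PORT
        - name: DATABASE_URL
"""
production = """
        env:
        - name: PORT
"""

missing = env_names(staging) - env_names(production)
if missing:
    print(f"Declared in staging but missing in production: {sorted(missing)}")
```

In practice, rendering each environment's manifests and diffing the output (or using kubectl diff against the live cluster) achieves the same goal more robustly.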
4. Lack of Deployment Validation
The deployment succeeded from Kubernetes' perspective because:
- The deployment resource was valid YAML
- The pods were scheduled successfully
- The containers started
But there was no validation that the application was actually working correctly.
5. No Rollback Plan
When the incident occurred, Sarah didn't know how to roll back quickly, which extended the outage unnecessarily. Having a rollback plan is as important as the deployment itself.
The Senior's Perspective
James walked Sarah through his mental model for handling deployment incidents:
Incident Response Framework
"When an incident happens right after a deployment," James explained, "I follow a specific mental checklist:"
1. Restore Service First (Incident Response)
- Can we rollback immediately?
- What's the blast radius? (How many users affected?)
- Is there a quick mitigation without rollback?
2. Gather Information (Diagnostic Phase)
- What changed? (Recent deployments, config changes, traffic patterns)
- What are the symptoms? (Errors in logs, failed health checks, metrics anomalies)
- What's the timeline? (When did it start? Any correlation with events?)
3. Understand the Root Cause
- Why did the deployment succeed but the application fail?
- Why didn't our testing catch this?
- What safeguards should have prevented this?
4. Prevent Recurrence
- What process changes are needed?
- What automation can help?
- What monitoring would have caught this sooner?
The Questions Senior Engineers Ask
James shared the questions he automatically asks during any deployment issue:
1. "What does 'success' mean?"
   - For Sarah, deployment success meant pods running
   - For James, success means users can complete their workflows
2. "What are we not seeing?"
   - The logs showed errors, but until someone looked at them, everything appeared fine
   - What metrics or alerts should have notified them immediately?
3. "How quickly can we roll back?"
   - Always know your rollback procedure before deploying
   - Practice rollbacks in staging
4. "What's different between environments?"
   - Configuration differences are the #1 cause of "works in staging but not production"
   - Environment parity is crucial
5. "What will I learn from this?"
   - Every incident is a learning opportunity
   - Post-mortems without blame lead to better systems
The Deployment Safety Mental Model
James explained his framework for deployment safety:
Safe Deployment = Validation + Gradual Rollout + Health Checks + Easy Rollback
- Validation: Automated checks that the deployment is actually working
- Gradual Rollout: Don't update all instances at once (we'll cover strategies later)
- Health Checks: Let Kubernetes know if the application is healthy
- Easy Rollback: One command to undo changes
"The goal," James said, "isn't to never have incidents. It's to detect them quickly, resolve them fast, and learn from each one."
The Solution
Immediate Fix: Rolling Back
James showed Sarah the quickest way to rollback a Kubernetes deployment:
# View deployment history
kubectl rollout history deployment/checkout-service -n production
REVISION  CHANGE-CAUSE
1         Initial deployment v2.2.0
2         Update to v2.3.0 (current)
# Rollback to previous version
kubectl rollout undo deployment/checkout-service -n production
# Watch the rollback progress
kubectl rollout status deployment/checkout-service -n production
Within 30 seconds, the previous version was restored, and checkout functionality was working again. Sarah immediately posted to the #incidents channel:
Service restored via rollback to v2.2.0
Issue: Missing DATABASE_URL env var in production deployment
Post-mortem to follow
Understanding What Happened
Let's look at what Sarah deployed vs. what she should have deployed.
Sarah's Deployment (Broken):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
      - name: checkout
        image: techflow/checkout-service:2.3.0
        ports:
        - containerPort: 8080
        env:
        - name: PORT
          value: "8080"
        # Missing: DATABASE_URL environment variable
What She Should Have Deployed:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Create 1 extra pod during rollout
      maxUnavailable: 0    # Ensure all replicas available during rollout
  template:
    metadata:
      labels:
        app: checkout-service
        version: v2.3.0    # Version label for tracking (kept out of the immutable selector)
    spec:
      containers:
      - name: checkout
        image: techflow/checkout-service:2.3.0
        ports:
        - containerPort: 8080
        env:
        - name: PORT
          value: "8080"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: checkout-secrets
              key: database-url
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        # Health checks - critical!
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
Key Improvements Explained
1. Environment Variable from Secret:
- name: DATABASE_URL
  valueFrom:
    secretKeyRef:
      name: checkout-secrets
      key: database-url
- Retrieves the database URL from a Kubernetes Secret
- Keeps sensitive data out of the deployment manifest
- Can be managed separately per environment
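For reference, the referenced Secret could be defined with a manifest like this. The connection string here is a placeholder – in practice, create Secrets with kubectl create secret or a secrets manager, and never commit real credentials to version control:

```yaml
# Example of the Secret the secretKeyRef points at.
# The value is a placeholder -- never commit real credentials.
apiVersion: v1
kind: Secret
metadata:
  name: checkout-secrets
  namespace: production
type: Opaque
stringData:
  database-url: "postgres://checkout_user:<password>@db-host:5432/checkout"
```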
2. Resource Limits:
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
- requests: Minimum resources guaranteed to the pod
- limits: Maximum resources the pod can use
- Prevents one pod from starving others
- Helps Kubernetes schedule pods appropriately
3. Liveness Probe:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
- Kubernetes checks the /health endpoint every 10 seconds
- If it fails 3 times in a row, Kubernetes restarts the container
- Catches situations where the application is frozen or deadlocked
4. Readiness Probe:
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
- Kubernetes checks the /ready endpoint every 5 seconds
- If it fails, the pod is removed from the Service endpoints (no traffic is sent to it)
- Only passes when the application is ready to serve requests
- This would have prevented Sarah's incident: pods without DATABASE_URL would never become Ready
5. Rolling Update Strategy:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
- maxSurge: 1 allows 1 extra pod during the rollout (so with 3 replicas, you temporarily have 4)
- maxUnavailable: 0 means all original pods must remain available during the rollout
- Together these ensure zero downtime during deployments
- New pods must pass readiness checks before old pods are terminated
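The surge and unavailability settings boil down to simple arithmetic. Here is a small sketch of that calculation, a simplified model for building intuition rather than the actual Deployment controller logic:

```python
# Simplified model of the pod-count bounds during a RollingUpdate.
# This is an illustration, not the real Kubernetes controller code.
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int):
    """Return (max_total_pods, min_ready_pods) during a rollout."""
    max_total = replicas + max_surge          # most pods that can exist at once
    min_ready = replicas - max_unavailable    # fewest pods that must stay ready
    return max_total, min_ready

# Sarah's settings: 3 replicas, maxSurge=1, maxUnavailable=0
print(rollout_bounds(3, 1, 0))  # (4, 3): up to 4 pods, never fewer than 3 ready
```

With the default maxUnavailable: 1 instead, rollout_bounds(3, 1, 1) gives (4, 2): capacity can briefly drop to 2 of 3 pods, which is exactly what maxUnavailable: 0 protects against.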
Deep Dive: Deployment Strategies
James explained different deployment strategies and when to use each. If you're new to Kubernetes, treat this section as a reference: focus on understanding that you have options and can roll out changes gradually, rather than memorizing every detail.
1. Recreate Strategy
strategy:
  type: Recreate
How it works:
- Terminate all old pods
- Then create new pods
Pros:
- Simple
- Guarantees no two versions running simultaneously
Cons:
- Downtime during transition
- Not acceptable for most production services
When to use:
- Development environments
- Services where downtime is acceptable
- Applications that can't run multiple versions simultaneously
2. Rolling Update (Default)
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 1
How it works:
- Gradually replace old pods with new ones
- Can configure how many to update at once
Pros:
- Zero downtime if configured correctly
- Rollout halts automatically if new pods fail health checks (old pods keep serving while you roll back)
- Works for most use cases
Cons:
- Both versions running during rollout
- Slower than recreate
When to use:
- Most production deployments
- When zero downtime is required
- When health checks are properly configured
3. Blue-Green Deployment
# Blue (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
      version: blue
  template:
    metadata:
      labels:
        app: checkout-service
        version: blue
    spec:
      containers:
      - name: checkout
        image: techflow/checkout-service:2.2.0
---
# Green (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
      version: green
  template:
    metadata:
      labels:
        app: checkout-service
        version: green
    spec:
      containers:
      - name: checkout
        image: techflow/checkout-service:2.3.0
---
# Service (switch between blue and green)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
spec:
  selector:
    app: checkout-service
    version: blue  # Change to 'green' to switch traffic
  ports:
  - port: 80
    targetPort: 8080
How it works:
- Run both versions in parallel
- Switch traffic by changing Service selector
- Keep old version running for quick rollback
Pros:
- Instant switchover
- Instant rollback
- Can test new version in production before switching traffic
Cons:
- Requires 2x resources during deployment
- More complex to manage
When to use:
- Critical services where instant rollback is essential
- When you have resources to run duplicate environments
- When you want to validate in production before switching traffic
4. Canary Deployment
# Stable deployment (90% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service-stable
spec:
  replicas: 9  # 90% of desired capacity
  selector:
    matchLabels:
      app: checkout-service
      track: stable
  template:
    metadata:
      labels:
        app: checkout-service
        track: stable
        version: v2.2.0
    spec:
      containers:
      - name: checkout
        image: techflow/checkout-service:2.2.0
---
# Canary deployment (10% of traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service-canary
spec:
  replicas: 1  # 10% of desired capacity
  selector:
    matchLabels:
      app: checkout-service
      track: canary
  template:
    metadata:
      labels:
        app: checkout-service
        track: canary
        version: v2.3.0
    spec:
      containers:
      - name: checkout
        image: techflow/checkout-service:2.3.0
---
# Service sends traffic to both
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
spec:
  selector:
    app: checkout-service  # Matches both stable and canary
  ports:
  - port: 80
    targetPort: 8080
How it works:
- Deploy new version to small subset of pods
- Monitor metrics and errors
- Gradually increase percentage if healthy
- Roll back immediately if issues are detected
Pros:
- Limits blast radius of bad deployments
- Real production validation with minimal risk
- Can catch issues before full rollout
Cons:
- More complex to orchestrate
- Requires good monitoring to detect issues
- Takes longer to fully roll out
When to use:
- High-traffic services where you can detect issues quickly
- When you want to validate with real production traffic
- Services where a small percentage of errors is acceptable during validation
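The "monitor and decide" step is usually automated. The sketch below shows the shape of such a gate; the 0.5% tolerance and the comparison rule are illustrative assumptions, not a standard:

```python
# Hypothetical canary gate: compare the canary's error rate against stable's.
# The 0.5% tolerance is an illustrative choice, not a standard value.
def canary_decision(stable_error_rate: float, canary_error_rate: float,
                    tolerance: float = 0.005) -> str:
    """Return 'rollback' if the canary is measurably worse, else 'promote'."""
    if canary_error_rate > stable_error_rate + tolerance:
        return "rollback"
    return "promote"

print(canary_decision(0.010, 0.011))  # promote: within tolerance
print(canary_decision(0.010, 0.030))  # rollback: canary is clearly worse
```

Purpose-built tools such as Argo Rollouts and Flagger run exactly this loop for you against your metrics system, so you rarely hand-roll it in production.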
Creating the Required Secret
Before deploying, Sarah needed to create the Secret containing the database URL:
# Create secret from literal value (for testing - not recommended for production)
kubectl create secret generic checkout-secrets \
--from-literal=database-url='postgresql://user:pass@db.example.com:5432/checkout' \
-n production
# Better: Create from file that's not in version control
echo 'postgresql://user:pass@db.example.com:5432/checkout' > /tmp/db-url
kubectl create secret generic checkout-secrets \
--from-file=database-url=/tmp/db-url \
-n production
rm /tmp/db-url
# Best: Use external secret management (covered in Chapter 24)
# Tools: Sealed Secrets, External Secrets Operator, Vault, etc.
Deploying the Fix
With the corrected deployment manifest and secret created, Sarah could now deploy safely:
# Apply the corrected deployment
kubectl apply -f checkout-deployment.yaml -n production
# Watch the rollout
kubectl rollout status deployment/checkout-service -n production
# Check pod status
kubectl get pods -n production -l app=checkout-service
# Verify health checks are passing
kubectl describe pod <pod-name> -n production | grep -A 10 "Conditions:"
# Check application logs
kubectl logs -f deployment/checkout-service -n production
# Test the endpoint
kubectl port-forward service/checkout-service 8080:80 -n production
curl http://localhost:8080/health
curl http://localhost:8080/ready
Monitoring the Deployment
James showed Sarah how to monitor deployments effectively:
# Watch deployment progress in real-time
kubectl get pods -n production -l app=checkout-service -w
# Check deployment events
kubectl describe deployment checkout-service -n production
# View recent events in the namespace
kubectl get events -n production --sort-by='.lastTimestamp' | head -20
# Check if new pods are ready
kubectl get deployment checkout-service -n production
# Output will show:
# NAME READY UP-TO-DATE AVAILABLE AGE
# checkout-service 3/3 3 3 5m
Understanding the output:
- READY: 3/3 means 3 of 3 replicas are ready (passing the readiness probe)
- UP-TO-DATE: 3 pods are running the latest version
- AVAILABLE: 3 pods are available to serve traffic
If the readiness probe fails, you'd see something like:
NAME READY UP-TO-DATE AVAILABLE AGE
checkout-service 0/3 3 0 5m
This indicates the pods are running but failing readiness checks: exactly what Sarah's incident would have shown with proper health checks.
Lessons Learned
After resolving the incident, Sarah and James had a post-mortem discussion. Here are the key lessons:
1. "Running" ≠ "Working"
The Lesson: Never trust pod status alone. Always verify the application is actually healthy.
How to Apply:
- Always configure liveness and readiness probes
- Test health check endpoints thoroughly
- Monitor application-level metrics, not just infrastructure metrics
Red Flags to Watch For:
- Pods showing "Running" but service is down
- Deployment shows "complete" but errors are occurring
- No health check endpoints defined in your application
2. Health Checks Are Not Optional
The Lesson: Health checks are the contract between your application and Kubernetes. Without them, Kubernetes is flying blind.
How to Apply:
# Minimum viable health checks
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
What Health Checks Should Test:
- Liveness: Is the application process alive? (Basic responsiveness)
- Readiness: Can the application serve traffic? (Database connected, dependencies available)
Implementation Tips:
# Example in Python/Flask
import os

@app.route('/health')
def health():
    # Simple liveness check
    return {'status': 'healthy'}, 200

@app.route('/ready')
def ready():
    # More thorough readiness check
    try:
        # Check database connection
        db.execute('SELECT 1')
        # Check required environment variables
        required_vars = ['DATABASE_URL', 'API_KEY']
        missing = [v for v in required_vars if not os.getenv(v)]
        if missing:
            return {'status': 'not ready', 'missing': missing}, 503
        return {'status': 'ready'}, 200
    except Exception as e:
        return {'status': 'not ready', 'error': str(e)}, 503
3. Configuration Management Is Critical
The Lesson: Configuration drift between environments is a primary cause of "works in staging but not production" issues.
How to Apply:
- Use the same configuration mechanism across all environments
- Store configuration in version control (except secrets)
- Use tools like Helm, Kustomize, or Terraform to manage environment-specific values
- Validate configuration before deploying
Pattern to Follow:
# Base configuration (shared)
base/
  deployment.yaml
  service.yaml
# Environment-specific overlays
overlays/
  staging/
    kustomization.yaml    # Staging-specific values
  production/
    kustomization.yaml    # Production-specific values
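To make the pattern concrete, a production overlay might look like the following sketch. The patch file name and image tag are illustrative; check the Kustomize documentation for the full field reference:

```yaml
# overlays/production/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
patches:
  - path: replica-count.yaml   # hypothetical production-only patch
images:
  - name: techflow/checkout-service
    newTag: 2.3.0              # pin the production image tag here
```

Render and apply with `kustomize build overlays/production | kubectl apply -f -`, or directly with `kubectl apply -k overlays/production`.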
4. Always Have a Rollback Plan
The Lesson: Before deploying, know exactly how you'll roll back if something goes wrong.
How to Apply:
# Document rollback commands in your runbook
# Quick rollback
kubectl rollout undo deployment/<name> -n <namespace>
# Rollback to specific revision
kubectl rollout undo deployment/<name> --to-revision=2 -n <namespace>
# Verify rollback
kubectl rollout status deployment/<name> -n <namespace>
Rollback Checklist:
- Test rollback in staging first
- Verify rollback doesn't require database migrations
- Ensure monitoring is in place to detect if rollback fixed the issue
- Have runbook with exact commands ready
- Know who has authority to execute rollback
5. Deploy With Progressive Validation
The Lesson: Don't deploy to all instances at once. Gradual rollouts catch issues before they affect everyone.
Deployment Best Practices:
- Start with canary (1-10% of traffic)
- Monitor metrics (errors, latency, resource usage)
- Gradually increase if metrics look good
- Roll back immediately if anomalies are detected
- Full rollout only after validation period
Metrics to Monitor During Deployment:
- Error rate (should not increase)
- Response time (p50, p95, p99)
- Request rate (should remain stable)
- Resource usage (CPU, memory)
- Custom business metrics (conversion rate, checkout completion)
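Latency percentiles in particular are worth understanding rather than treating as dashboard magic. This toy calculation shows what p50 and p95 mean (nearest-rank method over raw samples; real metrics systems compute this from histograms):

```python
# Toy percentile calculation over raw response times in milliseconds.
# Real monitoring systems derive percentiles from histograms, not raw samples.
def percentile(samples, p):
    s = sorted(samples)
    # Nearest-rank method: the value below which roughly p% of samples fall
    idx = max(0, round(p / 100 * len(s)) - 1)
    return s[idx]

latencies = [120, 95, 110, 300, 105, 98, 2500, 115, 102, 99]
print(percentile(latencies, 50))  # the typical request
print(percentile(latencies, 95))  # the slow tail, pulled up by the 2500 ms outlier
```

A deployment that leaves p50 flat but doubles p95 is hurting your slowest users first, which is why both belong on the monitoring checklist.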
6. Automate Validation
The Lesson: Humans forget steps. Automation doesn't.
What to Automate:
# In your CI/CD pipeline
steps:
  - name: Validate Deployment Manifest
    run: |
      # Check for required fields
      kubectl apply --dry-run=client -f deployment.yaml
  - name: Check for Required Secrets
    run: |
      # Verify secrets exist before deploying
      kubectl get secret checkout-secrets -n production
  - name: Run Smoke Tests
    run: |
      # After deployment, verify the service works
      ./scripts/smoke-test.sh
  - name: Monitor for Errors
    run: |
      # Watch for 5 minutes, roll back if the error rate spikes
      ./scripts/monitor-deployment.sh
7. Post-Mortems Without Blame
The Lesson: The goal of a post-mortem is to improve systems, not to assign blame.
Post-Mortem Template:
# Incident Post-Mortem: Checkout Service Outage
## Summary
- **Date:** 2024-01-18
- **Duration:** 20 minutes
- **Impact:** Checkout unavailable, ~$X revenue loss
- **Root Cause:** Missing environment variable in production deployment
## Timeline
- 14:05 - Deployment of v2.3.0 started
- 14:06 - Deployment marked "complete" by CI/CD
- 14:08 - First customer complaint received
- 14:10 - #incidents alert posted
- 14:12 - Issue identified (missing DATABASE_URL)
- 14:13 - Rollback initiated
- 14:14 - Service restored
## What Went Well
- Rollback was quick once issue identified
- Team communication was clear
- Customer support notified promptly
## What Went Wrong
- No health checks to catch the issue
- Configuration not validated before deployment
- Issue not caught in staging (why?)
## Action Items
- [ ] Add liveness and readiness probes (Sarah, by Friday)
- [ ] Implement pre-deployment validation script (James, next week)
- [ ] Sync production secrets to staging for accurate testing (Sarah + James)
- [ ] Update deployment runbook with rollback procedure
- [ ] Add automated smoke tests to CI/CD pipeline
## Lessons for the Team
- Health checks are mandatory for all services
- "Pods running" doesn't mean "service working"
- Always test rollback procedure
8. Deployment Readiness Checklist
Before Every Production Deployment:
## Pre-Deployment Checklist
### Code & Configuration
- [ ] Code reviewed and approved
- [ ] All tests passing (unit, integration, e2e)
- [ ] Configuration validated in staging
- [ ] Secrets verified to exist in production
- [ ] Database migrations tested (if applicable)
### Health & Monitoring
- [ ] Health check endpoints implemented and tested
- [ ] Metrics and logging configured
- [ ] Alerts configured for new version
- [ ] Dashboard updated for monitoring deployment
### Deployment Strategy
- [ ] Deployment strategy chosen (rolling/blue-green/canary)
- [ ] Rollback procedure documented and tested
- [ ] Resource limits appropriate for expected load
- [ ] Deployment during low-traffic window (if possible)
### Communication
- [ ] Team notified of deployment
- [ ] Customer support aware (if customer-facing change)
- [ ] Incident response team on standby
- [ ] Post-deployment validation plan ready
### Validation
- [ ] Smoke tests ready to run post-deployment
- [ ] Monitoring in place to detect issues
- [ ] Success criteria defined
- [ ] Rollback triggers identified
Reflection Questions
Take a moment to think about how these lessons apply to your own environment:
1. Health Checks in Your Services
- Do all your production services have liveness and readiness probes configured?
- What do your health check endpoints actually verify?
- Have you tested what happens when health checks fail?
2. Your Last Deployment
- What was your deployment strategy? (Recreate, rolling, blue-green, canary?)
- How did you verify the deployment was successful?
- How long would it take you to roll back right now?
3. Configuration Management
- How do you manage environment-specific configuration?
- How confident are you that staging matches production?
- Where are your secrets stored, and who has access?
4. Incident Response
- Does your team have a documented incident response process?
- Who is responsible for production deployments?
- How do you communicate during incidents?
5. Learning from Incidents
- When was your last production incident?
- Did you write a blameless post-mortem?
- What systemic improvements came from it?
6. Your Deployment Confidence
- On a scale of 1-10, how confident are you when deploying to production?
- What would increase that confidence?
- What keeps you up at night about your deployments?
What's Next?
Sarah learned crucial lessons from her first incident:
- The difference between "running" and "working"
- The importance of health checks
- How to rollback quickly
- The value of blameless post-mortems
But this incident also revealed gaps in TechFlow's infrastructure:
- Logs were hard to find during the incident (Chapter 2)
- Environment parity between staging and production was questionable (Chapter 3)
- Resource limits weren't configured, which could cause other issues (Chapter 4)
- Deployments took a long time and could be optimized (Chapter 5)
In the next chapter, we'll follow Sarah as she faces another common challenge: the mystery of the disappearing logs. When debugging a production issue, she'll discover that the logs she needs aren't where she expects them to be, and sometimes aren't being collected at all.
Code Examples
All the code examples from this chapter are available in the GitHub repository:
# Clone the repository
git clone https://github.com/BahaTanvir/devops-guide-book.git
cd devops-guide-book/examples/chapter-01
# Or if you already have the repo
cd examples/chapter-01
See the Chapter 1 Examples README for detailed instructions on running these examples in your own environment.
Try it yourself:
- Deploy the broken version and observe the issue
- Practice rolling back
- Deploy the fixed version with health checks
- Experiment with different deployment strategies
- Intentionally break health checks to see Kubernetes' response
Remember: The best way to learn is by doing, in a safe, non-production environment!
Chapter 2: The Mystery of the Disappearing Logs
"You can't debug what you can't see."
Sarah's Challenge
It was Monday morning, two weeks after the incident with the checkout service. Sarah had just settled into her desk with her coffee when a message popped up in the #platform-team channel:
@sarah Can you help debug an issue?
Users reporting intermittent 500 errors on the API
Started about 30 minutes ago
Sarah felt more confident this time. She had learned from the last incident. First step: check the logs.
She opened her terminal and typed the command she'd used dozens of times:
kubectl logs deployment/api-service -n production
The output scrolled past: successful requests, database queries, normal operations. Everything looked fine. But users were reporting errors. She tried filtering for errors:
kubectl logs deployment/api-service -n production | grep -i error
A few errors appeared, but they were old, from hours ago, not the recent 30 minutes. Sarah frowned. Where were the recent error logs?
She tried checking individual pods:
kubectl get pods -n production -l app=api-service
Three pods were running. She checked the first one:
kubectl logs api-service-7d8f4c5b9d-abc123 -n production
The logs stopped 15 minutes ago. The pod was still running, but no new logs appeared. She checked the second pod: same thing. The third pod showed recent logs, but only from the last 5 minutes.
"Where are the logs from the past 30 minutes?" Sarah muttered to herself.
James walked by and noticed her confusion. "Lost logs?"
"Yeah," Sarah said, frustration creeping into her voice. "Users are reporting errors, but I can't find the logs. Some pods have logs that just... stop. And I can't see anything from when the errors actually started."
"Ah, the disappearing logs mystery," James said with a knowing smile. "Let me show you what's happening and how we fix this."
Understanding the Problem
Sarah's situation revealed several fundamental issues with logging in Kubernetes and distributed systems:
1. Ephemeral Logs in Kubernetes
By default, kubectl logs only shows logs from the current container. Here's what Sarah didn't understand:
Container Logs Are Ephemeral:
- Logs are stored on the node's disk
- When a pod restarts, previous logs are gone
- When a node dies, all logs on that node are lost
- kubectl logs only shows stdout/stderr from the running container
Pod Lifecycle and Logs:
Pod Created → Logs Start → Pod Deleted → Logs Lost
Container Restart → Previous Logs Gone
Sarah's pods had likely restarted due to the errors, and she lost the critical logs from the incident.
2. The kubectl Logs Limitations
The kubectl logs command has several limitations:
Time Window:
kubectl logs pod-name # Only current container
kubectl logs pod-name --previous # Previous container (if it crashed)
kubectl logs pod-name --since=1h # Last hour only
kubectl logs pod-name --tail=100 # Last 100 lines
Multi-Pod Confusion: When you have multiple pods:
- kubectl logs deployment/name shows logs from just one (effectively random) pod
- No aggregation across pods
- No way to correlate logs from different pods
- Can't see logs from deleted pods
Storage Limits:
- Logs are rotated on the node
- Default: 10MB per container
- Older logs get deleted automatically
- No long-term retention
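These node-level limits are configurable on the kubelet. The fields below are from the KubeletConfiguration API; the values shown are illustrative, so verify against your distribution's kubelet settings:

```yaml
# Excerpt from a kubelet configuration file (node-level log rotation).
# Values are illustrative; check your cluster's actual kubelet config.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi   # rotate each container's log file at 10 MiB
containerLogMaxFiles: 5     # keep at most 5 rotated files per container
```

Raising these buys a little more local history, but it is no substitute for centralized collection, which is where this chapter is headed.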
3. The Missing Context Problem
Even when Sarah found logs, they lacked context:
2024-01-22 10:15:23 ERROR: Database connection failed
Questions this log doesn't answer:
- Which user experienced this error?
- What request triggered it?
- Which pod/container logged this?
- How many times did this happen?
- What was the request ID?
- What else was happening at the same time?
4. Distributed System Challenges
TechFlow's microservices architecture made debugging harder:
User Request → API Gateway → Auth Service → API Service → Database
                                  ↓
                            Cache Service
A single user request touches multiple services. Without correlation:
- Can't trace a request across services
- Can't see the full picture
- Can't identify which service actually failed
- Blame game begins ("It's not my service!")
5. The Three States of Logs
James explained that logs exist in three states:
State 1: In Memory (Application)
- Application generates logs
- Buffered in memory
- Problem: Lost if application crashes before flush
State 2: On Disk (Node)
- Written to node filesystem
- Available via kubectl logs
- Problem: Lost when the pod or node dies
State 3: Centralized (Log Aggregation)
- Shipped to external system
- Persistent and searchable
- Problem: TechFlow didn't have this!
Sarah was only looking at State 2 logs, which were ephemeral and incomplete.
The Senior's Perspective
James walked Sarah through his approach to logging in production systems.
The Logging Mental Model
"When I debug production issues," James explained, "I think about logging in layers:
Layer 1: Structured Logging
- Logs should be machine-readable
- Include context: request ID, user ID, service name
- Use consistent format across all services
Layer 2: Centralized Collection
- All logs go to one place
- Survive pod/node failures
- Searchable and indexed
Layer 3: Correlation
- Connect logs across services
- Track request flow end-to-end
- Identify patterns and anomalies
Layer 4: Retention and Cost
- Keep what's useful
- Archive what's required
- Delete what's expensive
Without Layer 2, you're debugging blind."
Questions Senior Engineers Ask About Logs
James shared his logging checklist:
1. "Where are the logs?"
- Application stdout/stderr (good start)
- But also: error logs, access logs, audit logs
- Centralized system? (should be yes)
2. "How long are logs kept?"
- Real-time logs: hours
- Historical logs: days/weeks/months
- Compliance logs: years
- Cost vs. value trade-off
3. "Can I correlate logs?"
- Request ID in every log?
- Trace ID across services?
- Timestamp synchronization?
4. "What am I logging?"
- Too much: expensive, noisy
- Too little: can't debug
- Just right: actionable information
5. "Who needs access?"
- Developers for debugging
- SRE for incidents
- Security for audits
- Compliance for regulations
The Logging Stack Decision Framework
James explained TechFlow's options:
Option 1: ELK Stack (Elasticsearch, Logstash, Kibana)
- Pros: Powerful search, flexible, self-hosted
- Cons: Operationally complex, resource-heavy, expensive at scale
- Best for: Teams with ops resources, on-prem requirements
Option 2: EFK Stack (Elasticsearch, Fluentd, Kibana)
- Pros: Similar to ELK, Fluentd is lighter and more flexible
- Cons: Still complex to operate
- Best for: Kubernetes-native environments
Option 3: Loki + Grafana
- Pros: Cost-effective, integrates with metrics, simpler than ELK
- Cons: Less powerful search than Elasticsearch
- Best for: Most Kubernetes environments, budget-conscious teams
Option 4: Cloud Providers (CloudWatch, Cloud Logging, etc.)
- Pros: Managed, integrated, easy to set up
- Cons: Vendor lock-in, can get expensive, limited features
- Best for: Teams already on that cloud, wanting simplicity
Option 5: Third-Party SaaS (Datadog, Splunk, etc.)
- Pros: Feature-rich, no ops burden, great UI
- Cons: Expensive at scale, data leaves your network
- Best for: Teams prioritizing features over cost
"For TechFlow," James said, "we'll use Loki + Grafana. It's cost-effective, Kubernetes-native, and you already know Grafana from our metrics dashboards."
The Solution
James and Sarah set up a centralized logging system for TechFlow.
Architecture Overview
+---------------------------------------------------------------+
|                     Kubernetes Cluster                        |
|                                                               |
|   +----------+     +----------+     +----------+              |
|   |   Pod    |     |   Pod    |     |   Pod    |              |
|   | (stdout) |     | (stdout) |     | (stdout) |              |
|   +----+-----+     +----+-----+     +----+-----+              |
|        |                |                |                    |
|        +----------------+----------------+                    |
|                         |                                     |
|                  +------v------+                              |
|                  |  Promtail   |  (DaemonSet on each node)    |
|                  |(Log Shipper)|                              |
|                  +------+------+                              |
|                         |                                     |
+-------------------------+-------------------------------------+
                          |
                          v
                  +-------------+
                  |    Loki     |  (Log aggregation & storage)
                  +------+------+
                         |
                         v
                  +-------------+
                  |   Grafana   |  (Visualization & search)
                  +-------------+
Step 1: Improve Application Logging
First, James showed Sarah how to improve the application logs themselves.
Before (Bad Logging):
# api-service/app.py
@app.route('/api/users/<user_id>')
def get_user(user_id):
    try:
        user = db.get_user(user_id)
        return jsonify(user)
    except Exception as e:
        print(f"Error: {e}")
        return {"error": "Internal server error"}, 500
Problems:
- Generic error message
- No context
- No request ID
- No severity level
- Not structured
After (Good Logging):
# api-service/app.py
import json
import logging
import time
import traceback
import uuid
from datetime import datetime

from flask import g, jsonify, request

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)
logger = logging.getLogger(__name__)

def log_json(level, message, **kwargs):
    """Helper to log structured JSON"""
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'level': level,
        'message': message,
        'service': 'api-service',
        'request_id': g.get('request_id', 'unknown'),
        **kwargs
    }
    logger.log(getattr(logging, level), json.dumps(log_entry))

@app.before_request
def before_request():
    """Generate request ID for correlation"""
    g.request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
    g.start_time = time.time()
    log_json('INFO', 'Request started',
             method=request.method,
             path=request.path,
             user_agent=request.headers.get('User-Agent'))

@app.route('/api/users/<user_id>')
def get_user(user_id):
    try:
        log_json('INFO', 'Fetching user', user_id=user_id)
        user = db.get_user(user_id)
        log_json('INFO', 'User fetched successfully', user_id=user_id)
        return jsonify(user)
    except DatabaseConnectionError as e:
        log_json('ERROR', 'Database connection failed',
                 user_id=user_id,
                 error=str(e),
                 error_type='DatabaseConnectionError')
        return {"error": "Service temporarily unavailable"}, 503
    except UserNotFoundError:
        log_json('WARN', 'User not found', user_id=user_id)
        return {"error": "User not found"}, 404
    except Exception as e:
        log_json('ERROR', 'Unexpected error',
                 user_id=user_id,
                 error=str(e),
                 error_type=type(e).__name__,
                 traceback=traceback.format_exc())
        return {"error": "Internal server error"}, 500

@app.after_request
def after_request(response):
    """Log response"""
    duration_ms = (time.time() - getattr(g, 'start_time', time.time())) * 1000
    log_json('INFO', 'Request completed',
             status_code=response.status_code,
             response_time_ms=duration_ms)
    return response
Benefits:
- Structured JSON logs
- Request ID for correlation
- Different severity levels
- Rich context
- Traceable across services
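To see why this pays off in practice, here is a tiny sketch of what structured logs buy you: once every line is JSON with a request_id, filtering and correlating becomes a one-liner. The sample entries mirror the shape the log_json helper emits, with made-up values:

```python
import json

# Two sample log lines in the shape the log_json helper emits
# (values are made up for illustration).
lines = [
    '{"level": "INFO", "message": "Request started", "request_id": "abc-123"}',
    '{"level": "ERROR", "message": "Database connection failed", "request_id": "abc-123"}',
]

# Find every ERROR belonging to one request: trivial once logs are structured
errors = [
    entry for entry in map(json.loads, lines)
    if entry["level"] == "ERROR" and entry["request_id"] == "abc-123"
]
print(errors[0]["message"])  # Database connection failed
```

Try doing the same with free-text log lines and a pile of regexes, and the value of the structure becomes obvious.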
Step 2: Deploy Loki (Deep Dive)
James created the Loki deployment configuration. This section shows a complete example that you can use as a reference, not a drop-in production manifest. Loki's recommended configuration (especially around log paths and retention) evolves over time, so for production you should always consult the official Loki documentation for your version, storage backend, and retention requirements.
loki-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      chunk_idle_period: 5m
      chunk_retain_period: 30s
    schema_config:
      configs:
        - from: 2024-01-01
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h
    storage_config:
      boltdb_shipper:
        active_index_directory: /loki/boltdb-shipper-active
        cache_location: /loki/boltdb-shipper-cache
        shared_store: filesystem
      filesystem:
        directory: /loki/chunks
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h  # 7 days
      ingestion_rate_mb: 10
      ingestion_burst_size_mb: 20
    chunk_store_config:
      max_look_back_period: 720h  # 30 days
    table_manager:
      retention_deletes_enabled: true
      retention_period: 720h  # 30 days
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: logging
spec:
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki:2.9.0
        args:
        - -config.file=/etc/loki/loki.yaml  # Point Loki at the mounted config
        ports:
        - containerPort: 3100
          name: http
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: storage
          mountPath: /loki
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
      volumes:
      - name: config
        configMap:
          name: loki-config
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: logging
spec:
  type: ClusterIP
  ports:
  - port: 3100
    targetPort: 3100
    name: http
  selector:
    app: loki
Step 3: Deploy Promtail (Log Shipper)
Promtail runs on every node and ships logs to Loki. The example below focuses on the overall structure; consult the Loki/Promtail documentation for the exact __path__ relabeling needed for your container runtime and log file locations:
promtail-daemonset.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      # Scrape all pod logs
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Add namespace label
          - source_labels: [__meta_kubernetes_pod_namespace]
            target_label: namespace
          # Add pod name label
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          # Add container name label
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container
          # Add app label
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          # Drop logs from the logging namespace (avoid recursion)
          - source_labels: [__meta_kubernetes_pod_namespace]
            regex: logging
            action: drop
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
      - name: promtail
        image: grafana/promtail:2.9.0
        args:
        - -config.file=/etc/promtail/promtail.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: promtail
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: promtail
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: promtail
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: promtail
subjects:
  - kind: ServiceAccount
    name: promtail
    namespace: logging
Step 4: Configure Grafana
Add Loki as a data source in Grafana:
grafana-datasource.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasources
namespace: logging
data:
datasources.yaml: |
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
isDefault: true
editable: true
Step 5: Deploy Everything
# Create logging namespace
kubectl create namespace logging
# Deploy Loki
kubectl apply -f loki-config.yaml
# Deploy Promtail
kubectl apply -f promtail-daemonset.yaml
# Wait for Loki to be ready
kubectl wait --for=condition=ready pod -l app=loki -n logging --timeout=300s
# Verify Promtail is running on all nodes
kubectl get pods -n logging -l app=promtail -o wide
Step 6: Searching Logs in Grafana
Now Sarah could search logs effectively:
Query Examples:
- Find all errors in the last hour:
{namespace="production"} |= "ERROR" | json
- Track a specific request:
{namespace="production"} | json | request_id="abc-123-def"
- Find database connection errors:
{app="api-service"} |= "DatabaseConnectionError" | json
- See error rate over time:
sum(rate({namespace="production"} |= "ERROR"[5m])) by (app)
- Find slow requests (> 1 second):
{namespace="production"} | json | response_time_ms > 1000
Step 7: Log Retention and Cost Management
James explained the cost considerations:
Retention Policy:
# In loki-config.yaml
table_manager:
retention_deletes_enabled: true
retention_period: 720h # 30 days for production
Different retention for different namespaces:
# Hot logs (7 days, fast access): Production errors and warnings
# Warm logs (30 days, slower access): Production info logs
# Cold logs (90 days, archive): Audit logs
# Deleted (>90 days): Debug logs
Cost Optimization Tips:
- Don't log everything - Be selective
- Use appropriate log levels - Debug only in dev
- Sample high-volume logs - Log 1% of successful requests
- Compress old logs - Move to cheaper storage
- Delete what you don't need - Debug logs after 7 days
Lessons Learned
Sarah documented the key lessons from setting up centralized logging:
1. Ephemeral Logs Are Not Enough
The Lesson:
kubectl logs is useful for quick checks, but not for debugging production issues.
How to Apply:
- Always use centralized logging in production
- Keep logs beyond pod lifecycle
- Make logs searchable and correlatable
Red Flags:
- No centralized logging system
- Relying on kubectl logs for debugging
- Logs disappear when pods restart
2. Structure Your Logs
The Lesson: Unstructured logs are hard to search and analyze. JSON-structured logs enable powerful queries.
Good Structured Log:
{
"timestamp": "2024-01-22T10:15:23Z",
"level": "ERROR",
"message": "Database connection failed",
"service": "api-service",
"request_id": "req-123-abc",
"user_id": "user-456",
"error_type": "DatabaseConnectionError",
"retry_attempt": 2
}
Benefits:
- Easy to parse programmatically
- Can filter by any field
- Aggregate and analyze
- Create metrics from logs
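A log in the JSON shape shown above can be produced with nothing but the Python standard library. Here's a minimal sketch; the `JsonFormatter` class and the extra field names it copies are illustrative, not part of the chapter's codebase:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Extra fields we promote into the JSON document if present
    EXTRA_FIELDS = ("service", "request_id", "user_id", "error_type", "retry_attempt")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra={...}` land as attributes on the record
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Database connection failed",
             extra={"request_id": "req-123-abc",
                    "error_type": "DatabaseConnectionError"})
```

Because each line is valid JSON, Loki's `| json` pipeline stage can extract every field for filtering.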
3. Correlation Is Key
The Lesson: In microservices, a single request touches multiple services. Correlation IDs tie logs together.
Implementation:
# Generate request ID at entry point (API Gateway)
request_id = str(uuid.uuid4())
# Pass in headers to downstream services
headers = {'X-Request-ID': request_id}
# Log with request ID in every service
logger.info("Processing request", extra={'request_id': request_id})
Benefits:
- Trace full request flow
- Identify bottlenecks
- Debug distributed issues
- Create dependency maps
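The header-passing steps above can be sketched end to end using Python's `contextvars`, which keeps the ID available anywhere in the request's call stack without threading it through every function. The function names (`ensure_request_id`, `outgoing_headers`) are hypothetical:

```python
import contextvars
import uuid

# Context variable holding the current request's correlation ID
request_id_var = contextvars.ContextVar("request_id", default=None)

def ensure_request_id(headers):
    """Reuse an incoming X-Request-ID, or mint one at the entry point."""
    rid = headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id_var.set(rid)
    return rid

def outgoing_headers():
    """Headers to attach when calling a downstream service."""
    return {"X-Request-ID": request_id_var.get()}
```

A web framework middleware would call `ensure_request_id(request.headers)` once per request; every downstream HTTP call then merges in `outgoing_headers()`.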
4. Log Levels Matter
The Lesson: Use appropriate log levels to control noise and cost.
Log Level Guidelines:
- DEBUG: Detailed information for diagnosing problems (dev only)
- INFO: General informational messages (key operations)
- WARN: Warning messages (potential issues)
- ERROR: Error messages (failures that don't crash the app)
- FATAL: Critical failures (application crash)
In Production:
# Production: INFO and above
logging.basicConfig(level=logging.INFO)
# Development: DEBUG and above
logging.basicConfig(level=logging.DEBUG)
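In Kubernetes, the log level usually arrives as an environment variable rather than being hardcoded per environment. A small sketch, assuming a `LOG_LEVEL` variable like the one TechFlow later sets via ConfigMap:

```python
import logging
import os

def configure_logging():
    """Read LOG_LEVEL from the environment and apply it; default to INFO."""
    level_name = os.getenv("LOG_LEVEL", "INFO").upper()
    # Fall back to INFO on typos rather than crashing on a bad value
    level = getattr(logging, level_name, logging.INFO)
    logging.basicConfig(level=level)
    return level
```

This way the same image runs with `DEBUG` in staging and `INFO` in production, with no code change.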
5. Balance Cost and Value
The Lesson: Logs are expensive. Log what's useful, not everything.
Cost Factors:
- Storage: Volume of logs Γ retention period
- Ingestion: Cost per GB ingested
- Search: Query costs
- Network: Data transfer costs
Optimization Strategies:
import random

# Sample successful requests (log only 1%)
if response.status_code == 200:
    if random.random() < 0.01:  # 1% sampling
        log_request(request, response)
else:
    # Always log errors
    log_request(request, response)
6. Retention Policies Are Essential
The Lesson: Different logs have different value over time. Implement tiered retention.
Retention Strategy:
Hot Tier (1-7 days): All logs, fast search
Warm Tier (8-30 days): Errors and warnings only
Cold Tier (31-90 days): Audit logs, compressed
Archive (91-365 days): Compliance requirements only
Deleted (>365 days): Unless legally required
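One way to make a tiered policy like this concrete is to express it as a function you can test. A hypothetical sketch; the tier boundaries mirror the list above, and `retention_tier` is illustrative only:

```python
def retention_tier(age_days, level="INFO", audit=False):
    """Map a log's age (and kind) to a storage tier per the policy above."""
    if age_days <= 7:
        return "hot"                       # all logs, fast search
    if age_days <= 30:
        # warm tier keeps only errors and warnings
        return "warm" if level in ("ERROR", "WARN") else "deleted"
    if age_days <= 90:
        return "cold" if audit else "deleted"      # compressed audit logs
    if age_days <= 365:
        return "archive" if audit else "deleted"   # compliance only
    return "deleted"
```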
7. Security and Compliance
The Lesson: Logs contain sensitive data. Handle them carefully.
Best Practices:
# DON'T log sensitive data
logger.info(f"User logged in: {username} with password {password}") # BAD!
# DO sanitize logs
logger.info("User logged in", extra={
'user_id': user.id,
'ip_address': request.ip,
# Password never logged
})
# Redact sensitive fields
def sanitize_log(data):
sensitive_fields = ['password', 'ssn', 'credit_card']
return {k: '***REDACTED***' if k in sensitive_fields else v
for k, v in data.items()}
Compliance Considerations:
- GDPR: Personal data retention and deletion
- HIPAA: Healthcare data security
- PCI DSS: Credit card data protection
- SOX: Financial record retention
8. Alerting on Logs
The Lesson: Logs aren't just for debugging; they can also trigger alerts.
Alert Examples:
# Alert on high error rate
sum(rate({namespace="production"} |= "ERROR"[5m])) by (app) > 10
# Alert on specific errors
count_over_time({app="api-service"} |= "DatabaseConnectionError"[5m]) > 5
# Alert on no logs (service might be down)
sum(count_over_time({app="api-service"}[5m])) == 0
Reflection Questions
Consider how logging applies to your environment:
- Your Current Logging:
  - How do you access logs in your production environment?
  - Do logs survive pod/container restarts?
  - How long are logs retained?
- Log Structure:
  - Are your logs structured (JSON) or unstructured (plain text)?
  - Do you use consistent log levels across services?
  - Can you easily search and filter logs?
- Correlation:
  - Do you use request IDs or trace IDs?
  - Can you follow a request across multiple services?
  - How do you debug distributed system issues?
- Cost and Retention:
  - What's your monthly logging cost?
  - Do you have a retention policy?
  - Are you logging too much or too little?
- Security:
  - Do you log sensitive data?
  - Who has access to production logs?
  - Do logs meet compliance requirements?
- Observability:
  - Do you create alerts from logs?
  - Can you create metrics from log patterns?
  - How quickly can you find the root cause of issues?
What's Next?
Sarah now had centralized logging in place. She could:
- Search logs across all pods and services
- Correlate requests with trace IDs
- Debug issues even after pods restart
- Create alerts based on log patterns
But she quickly discovered another challenge: the logs looked perfect in her local environment and staging, but production behaved differently. Environment-specific configurations were causing issues again.
In Chapter 3, "It Works on My Machine," Sarah will learn about environment parity and configuration management: ensuring that what works locally actually works in production.
Code Examples
All the code examples from this chapter are available in the GitHub repository:
# Clone the repository
git clone https://github.com/BahaTanvir/devops-guide-book.git
cd devops-guide-book/examples/chapter-02
# Or if you already have the repo
cd examples/chapter-02
See the Chapter 2 Examples README for detailed instructions on:
- Deploying Loki and Promtail
- Configuring structured logging in your applications
- Creating useful log queries
- Setting up log-based alerts
Try it yourself:
- Deploy the logging stack in your cluster
- Update your application to use structured logging
- Practice writing LogQL queries
- Set up alerts based on log patterns
- Experiment with retention policies
Remember: Good logging is the foundation of observability!
Chapter 3: "It Works on My Machine"
"Environment parity isn't optionalβit's fundamental."
Sarah's Challenge
Three weeks had passed since Sarah set up the centralized logging system. The team was now able to debug issues much faster with Loki and structured logs. Sarah felt more confident, until Friday afternoon.
Marcus, the engineering manager, stopped by Sarah's desk. "Hey Sarah, we need to deploy the new notification service to production. It's been tested in staging and looks good. Can you handle the deployment?"
"Sure!" Sarah said confidently. She had deployed several services now and felt comfortable with the process.
She pulled up the deployment manifest and reviewed it:
apiVersion: apps/v1
kind: Deployment
metadata:
name: notification-service
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: notification-service
template:
metadata:
labels:
app: notification-service
spec:
containers:
- name: notification
image: techflow/notification-service:v1.2.0
ports:
- containerPort: 8080
env:
- name: PORT
value: "8080"
- name: REDIS_URL
value: "redis://redis:6379"
Everything looked standard. The same configuration had worked perfectly in staging. Sarah deployed to production:
kubectl apply -f notification-service.yaml -n production
The deployment completed successfully. Pods were running. Health checks passed. Sarah marked the task as done in Jira and went home for the weekend feeling accomplished.
Monday morning, she arrived to an urgent message:
@sarah Notification service is broken in production
- Emails not being sent
- Push notifications failing
- No errors in logs
- Staging still works fine!
Sarah's heart sank. How could this be? It worked perfectly in staging! She quickly checked the production logs:
kubectl logs deployment/notification-service -n production | grep -i error
No errors. The service was running, responding to health checks, but simply not sending notifications. She checked staging:
kubectl logs deployment/notification-service -n staging | grep -i notification
{"level":"INFO","message":"Email sent successfully","recipient":"user@example.com"}
{"level":"INFO","message":"Push notification delivered","device_id":"abc123"}
Staging was working perfectly. Production was running but doing nothing.
James walked over. "The classic 'works on my machine' problem. Or in this case, 'works in staging.' Let's figure out what's different."
Understanding the Problem
Sarah's situation is one of the most common and frustrating issues in software deployment: environment drift. The code is identical, the deployment manifests look the same, but the behavior is completely different.
1. The Environment Parity Problem
Environment parity means keeping development, staging, and production environments as similar as possible. When environments drift, you get unpredictable behavior.
Three Types of Parity:
Dev/Prod Parity (The Twelve-Factor App):
- Time: Reduce time between writing code and deploying
- Personnel: Developers who write code should deploy it
- Tools: Keep development and production tools as similar as possible
Common Drift Scenarios:
| Local | Staging | Production |
|-------|---------|------------|
| SQLite | PostgreSQL | PostgreSQL (different version) |
| ENV vars | ConfigMap | Secrets |
| Mock APIs | Real APIs | Real APIs (different endpoints) |
| Single node | 3 nodes | 10 nodes |
2. Configuration Drift
Configuration is the #1 source of environment differences. Sarah's notification service had configuration that differed between staging and production in ways she didn't realize:
Staging Configuration (working):
env:
- name: REDIS_URL
value: "redis://redis:6379"
- name: SMTP_HOST
value: "mailhog:1025" # Test mail server
- name: SMTP_USER
value: "test"
- name: SMTP_PASS
value: "test"
- name: PUSH_API_KEY
value: "test-key-12345"
Production Configuration (Sarah's deployment - broken):
env:
- name: REDIS_URL
value: "redis://redis:6379"
# Missing: SMTP_HOST, SMTP_USER, SMTP_PASS
# Missing: PUSH_API_KEY
The service didn't crash because of its default behavior: when configuration was missing, it silently skipped the work and logged nothing. This is poor application design, but a common reality.
3. The Configuration Management Problem
TechFlow was managing configuration in multiple ways:
Method 1: Hardcoded in Deployment (Bad)
env:
- name: PORT
value: "8080" # Hardcoded
Method 2: Direct values (Better, still not great)
env:
- name: REDIS_URL
value: "redis://redis:6379" # Different per environment
Method 3: ConfigMaps (Better)
env:
- name: REDIS_URL
valueFrom:
configMapKeyRef:
name: notification-config
key: redis-url
Method 4: Secrets (Best for sensitive data)
env:
- name: SMTP_PASS
valueFrom:
secretKeyRef:
name: notification-secrets
key: smtp-password
The Problem: Different approaches in different environments made it hard to track what was configured where.
4. The Secrets Problem
Secrets are particularly tricky:
- Can't be checked into Git (security risk)
- Different in every environment
- Easy to forget during deployment
- Hard to verify without exposing values
Sarah's staging environment had secrets configured months ago by another engineer. Production was missing them, and she had no way to know.
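A quick pre-deployment check can catch exactly this gap: compare the keys a service requires against what `kubectl get secret <name> -o json` actually reports. A sketch under assumptions; the required key names here match the fix applied later in this chapter:

```python
import json

# Keys the notification service needs (assumption based on this chapter's manifests)
REQUIRED_SECRET_KEYS = {"smtp-host", "smtp-port", "smtp-user",
                        "smtp-password", "push-api-key"}

def missing_secret_keys(kubectl_json, required=REQUIRED_SECRET_KEYS):
    """Given `kubectl get secret <name> -o json` output, report absent keys."""
    secret = json.loads(kubectl_json)
    present = set(secret.get("data", {}))   # key names only, never values
    return sorted(required - present)
```

Running this against production would have flagged the five missing SMTP and push-API keys before the weekend, without ever exposing a secret value.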
5. Dependencies and Service Discovery
Services depend on other services. These dependencies can differ between environments:
Notification Service depends on:
- Redis (cache)
- SMTP Server (email)
- Push Notification API (mobile notifications)
- User Service (to get user preferences)
Staging:
- Redis:
redis.staging.svc.cluster.local:6379 - SMTP:
mailhog.staging.svc.cluster.local:1025(test server) - Push API: Test API with mock responses
- User Service: Staging version with test data
Production:
- Redis:
redis.production.svc.cluster.local:6379 - SMTP:
smtp.sendgrid.net:587(real email service) - Push API: Production API requiring real credentials
- User Service: Production version with real user data
If any of these URLs or credentials are wrong, the service fails silently.
6. The Twelve-Factor App Methodology
The Twelve-Factor App is a methodology for building modern applications. Factor III is particularly relevant:
III. Config - Store config in the environment
An app's config is everything that is likely to vary between deploys (staging, production, developer environments, etc).
Strict separation of config from code:
- Config varies across deploys
- Code does not
- Config includes: database URLs, credentials, service endpoints
- Config should never be checked into version control
The Senior's Perspective
James explained his approach to environment configuration.
Configuration Mental Model
"Think of configuration in layers," James said, drawing on the whiteboard:
Layer 1: Application Defaults (in code)
  ↓ (overridden by)
Layer 2: Environment Variables
  ↓ (overridden by)
Layer 3: ConfigMaps/Files
  ↓ (overridden by)
Layer 4: Secrets
  ↓ (overridden by)
Layer 5: Command-line flags (if needed)
"Each layer should override the previous. And critically: never, ever hardcode environment-specific values in your application code or deployment manifests."
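James's layering can be sketched as a small resolution function. This collapses the ConfigMap and Secret layers into a single "file" layer for brevity; `resolve` is an illustrative helper, not a real library API:

```python
import os

def resolve(key, defaults=None, file_config=None, cli_flags=None):
    """Resolve one setting; later layers win: defaults -> env -> file -> flags."""
    value = (defaults or {}).get(key)          # Layer 1: application defaults
    env_val = os.getenv(key.upper())           # Layer 2: environment variables
    if env_val is not None:
        value = env_val
    if file_config and key in file_config:     # Layer 3/4: ConfigMaps/Secrets
        value = file_config[key]
    if cli_flags and key in cli_flags:         # Layer 5: command-line flags
        value = cli_flags[key]
    return value
```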
Questions Senior Engineers Ask About Configuration
- "What varies between environments?"
  - Database URLs
  - API endpoints
  - API keys and secrets
  - Feature flags
  - Resource limits
  - Replica counts
  - Log levels
- "How do I verify all config is present?"
  - Use admission webhooks
  - Application startup validation
  - Pre-deployment checks
  - Config validation tools
- "How do I prevent config drift?"
  - Use GitOps (config in Git)
  - Infrastructure as Code (Terraform, Helm)
  - Configuration templates
  - Environment promotion pipeline
- "How do I manage secrets safely?"
  - External secret managers (Vault, AWS Secrets Manager)
  - Encrypted secrets in Git (Sealed Secrets, SOPS)
  - Rotation policies
  - Least-privilege access
- "How do I test configuration?"
  - Dry-run deployments
  - Integration tests per environment
  - Smoke tests post-deployment
  - Configuration validation tools
Configuration Management Approaches
James explained TechFlow's options:
Option 1: Environment-Specific Manifests
deployments/
├── notification-service-dev.yaml
├── notification-service-staging.yaml
└── notification-service-production.yaml
- Pros: Simple, explicit
- Cons: Duplication, drift risk, maintenance burden
Option 2: Kustomize (Overlays)
notification-service/
├── base/
│   ├── deployment.yaml
│   └── kustomization.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml
    └── production/
        └── kustomization.yaml
- Pros: DRY, built into kubectl, simple
- Cons: Limited templating, learning curve
Option 3: Helm (Charts)
notification-service/
├── Chart.yaml
├── values.yaml
├── values-staging.yaml
├── values-production.yaml
└── templates/
    ├── deployment.yaml
    └── service.yaml
- Pros: Powerful templating, package management
- Cons: Complex, can be overused, "Helm hell"
Option 4: External Configuration (Recommended for TechFlow)
Combine:
- Helm for templating
- External Secrets Operator for secrets
- GitOps (ArgoCD/Flux) for deployment
"For TechFlow," James said, "we'll use Kustomize. It's simple, built into kubectl, and solves 80% of our needs without the complexity of Helm."
The Solution
James and Sarah implemented a proper configuration management system.
Step 1: Audit Current Configuration
First, they documented what actually varied between environments:
# Configuration Audit
## Notification Service Configuration
### Varies by Environment:
- SMTP credentials (username, password, host, port)
- Push notification API key
- Redis URL
- User service endpoint
- Log level
- Replica count
### Same Across Environments:
- Port (8080)
- Health check paths
- Base image
- Resource requests (tuned per environment later)
### Missing in Production:
- SMTP_HOST ❌
- SMTP_PORT ❌
- SMTP_USER ❌
- SMTP_PASS ❌
- PUSH_API_KEY ❌
Step 2: Create Base Configuration with Kustomize
We'll start with a minimal but realistic base that is shared across environments, then layer environment-specific differences on top.
Tip: If you're new to Kustomize, don't worry about memorizing every detail. Focus on the idea that you define a base once and then apply small patches per environment.
Directory Structure (Conceptual):
notification-service/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   └── kustomization.yaml
└── overlays/
    ├── staging/
    │   ├── kustomization.yaml
    │   ├── configmap-patch.yaml
    │   └── secrets.yaml
    └── production/
        ├── kustomization.yaml
        ├── configmap-patch.yaml
        └── resources-patch.yaml
base/deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: notification-service
spec:
replicas: 2 # Will be overridden per environment
selector:
matchLabels:
app: notification-service
template:
metadata:
labels:
app: notification-service
spec:
containers:
- name: notification
image: techflow/notification-service:v1.2.0
ports:
- containerPort: 8080
name: http
env:
# Non-sensitive config from ConfigMap
- name: PORT
valueFrom:
configMapKeyRef:
name: notification-config
key: port
- name: REDIS_URL
valueFrom:
configMapKeyRef:
name: notification-config
key: redis-url
- name: USER_SERVICE_URL
valueFrom:
configMapKeyRef:
name: notification-config
key: user-service-url
- name: LOG_LEVEL
valueFrom:
configMapKeyRef:
name: notification-config
key: log-level
# Sensitive config from Secrets
- name: SMTP_HOST
valueFrom:
secretKeyRef:
name: notification-secrets
key: smtp-host
- name: SMTP_PORT
valueFrom:
secretKeyRef:
name: notification-secrets
key: smtp-port
- name: SMTP_USER
valueFrom:
secretKeyRef:
name: notification-secrets
key: smtp-user
- name: SMTP_PASS
valueFrom:
secretKeyRef:
name: notification-secrets
key: smtp-password
- name: PUSH_API_KEY
valueFrom:
secretKeyRef:
name: notification-secrets
key: push-api-key
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
base/configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: notification-config
data:
port: "8080"
# These will be overridden by environment-specific values
redis-url: "OVERRIDE"
user-service-url: "OVERRIDE"
log-level: "INFO"
base/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
- configmap.yaml
commonLabels:
app: notification-service
Step 3: Create Staging Overlay
overlays/staging/configmap-patch.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: notification-config
# Kustomize will merge this with the base ConfigMap
# based on name+namespace
data:
port: "8080"
redis-url: "redis://redis.staging.svc.cluster.local:6379"
user-service-url: "http://user-service.staging.svc.cluster.local"
log-level: "DEBUG" # More verbose in staging
overlays/staging/secrets.yaml:
apiVersion: v1
kind: Secret
metadata:
name: notification-secrets
# In real systems you would not commit real secret values; this is for illustration.
type: Opaque
stringData:
smtp-host: "mailhog.staging.svc.cluster.local"
smtp-port: "1025"
smtp-user: "test"
smtp-password: "test"
push-api-key: "test-key-staging-12345"
overlays/staging/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging
resources:
- ../../base
- secrets.yaml
# Patch the base ConfigMap with staging-specific values
patchesStrategicMerge:
- configmap-patch.yaml
# Environment-specific Secret manifest
# (in real systems you wouldn't commit real secret values)
# Override replica count for staging
replicas:
- name: notification-service
count: 2
# Pin the image tag for this environment
images:
- name: techflow/notification-service
newTag: v1.2.0
Step 4: Create Production Overlay
overlays/production/configmap-patch.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: notification-config
data:
port: "8080"
redis-url: "redis://redis.production.svc.cluster.local:6379"
user-service-url: "http://user-service.production.svc.cluster.local"
log-level: "INFO" # Less verbose in production
overlays/production/secrets.yaml:
apiVersion: v1
kind: Secret
metadata:
name: notification-secrets
# Do not commit real production secrets to Git. Use this only in a demo environment,
# and prefer tools like External Secrets Operator, Sealed Secrets, or Vault in practice.
type: Opaque
stringData:
smtp-host: "smtp.sendgrid.net"
smtp-port: "587"
smtp-user: "apikey"
smtp-password: "SG.REAL_API_KEY_HERE" # Placeholder for real credentials
push-api-key: "prod-push-api-key-real-12345"
overlays/production/resources-patch.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: notification-service
spec:
template:
spec:
containers:
- name: notification
resources:
requests:
memory: "512Mi" # More resources in production
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
overlays/production/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
- ../../base
- secrets.yaml
patchesStrategicMerge:
- configmap-patch.yaml
- resources-patch.yaml
# Environment-specific Secret manifest
replicas:
- name: notification-service
count: 5 # More replicas in production
images:
- name: techflow/notification-service
newTag: v1.2.0
Step 5: Deploy with Kustomize
Deep Dive: Validating Kustomize Output
Before applying to a real cluster, always inspect the rendered manifests. This catches mistakes in patches and generators early.
To Staging:
# Preview what will be deployed
kubectl kustomize overlays/staging
# Apply to staging
kubectl apply -k overlays/staging
# Verify
kubectl get pods -n staging -l app=notification-service
kubectl logs -n staging -l app=notification-service | grep -i "Configuration loaded"
To Production:
# Preview
kubectl kustomize overlays/production
# Apply
kubectl apply -k overlays/production
# Verify
kubectl get pods -n production -l app=notification-service
kubectl logs -n production -l app=notification-service | tail -20
Step 6: Improve Application Configuration Validation
James also showed Sarah how to improve the application itself to fail fast when configuration is missing:
Before (Silent Failure):
# notification_service.py
smtp_host = os.getenv('SMTP_HOST', '') # Defaults to empty
smtp_user = os.getenv('SMTP_USER', '')
def send_email(to, subject, body):
if not smtp_host:
logger.warning("SMTP not configured, skipping email")
return # Silent failure
After (Fail Fast):
# notification_service.py
def validate_config():
"""Validate required configuration on startup"""
required_vars = {
'SMTP_HOST': os.getenv('SMTP_HOST'),
'SMTP_PORT': os.getenv('SMTP_PORT'),
'SMTP_USER': os.getenv('SMTP_USER'),
'SMTP_PASS': os.getenv('SMTP_PASS'),
'PUSH_API_KEY': os.getenv('PUSH_API_KEY'),
'REDIS_URL': os.getenv('REDIS_URL'),
}
missing = [k for k, v in required_vars.items() if not v]
if missing:
logger.error(f"Missing required configuration: {missing}")
sys.exit(1) # Fail fast!
logger.info("Configuration validated successfully")
logger.info(f"SMTP Host: {required_vars['SMTP_HOST']}") # Log (not password!)
logger.info(f"Redis URL: {required_vars['REDIS_URL']}")
# Call during application startup
if __name__ == '__main__':
validate_config()
app.run()
Now if configuration is missing, the container exits on startup, the pod never becomes Ready, and the failure is visible immediately. Much better than silent failure!
Step 7: Create Configuration Checklist
Sarah created a deployment checklist to prevent future issues:
# Deployment Checklist
## Pre-Deployment
- [ ] All required ConfigMaps exist in target environment
- [ ] All required Secrets exist in target environment
- [ ] ConfigMap/Secret values are correct for environment
- [ ] Application validates configuration on startup
- [ ] Dry-run deployment succeeds: `kubectl apply --dry-run=server -k overlays/<env>`
- [ ] Resource limits appropriate for environment
## Deployment
- [ ] Use Kustomize overlays: `kubectl apply -k overlays/<env>`
- [ ] Watch deployment: `kubectl rollout status deployment/<name> -n <namespace>`
- [ ] Check pod logs for configuration validation
- [ ] Verify all pods are Ready
## Post-Deployment
- [ ] Run smoke tests
- [ ] Check application logs for errors
- [ ] Verify integration with dependencies (Redis, SMTP, etc.)
- [ ] Monitor metrics for anomalies
- [ ] Test critical user flows
## Rollback Plan
- [ ] Previous version number: ___________
- [ ] Rollback command: `kubectl rollout undo deployment/<name> -n <namespace>`
- [ ] Verification steps: ___________
Lessons Learned
Sarah documented the key lessons about environment configuration:
1. "Works on My Machine" Is Always Configuration
The Lesson: When code works in one environment but not another, it's almost always configuration, not code.
Common Culprits:
- Missing environment variables
- Wrong service URLs
- Missing credentials
- Different dependency versions
- Resource constraints
- Network policies
How to Debug:
# Compare configurations
kubectl get configmap <name> -n staging -o yaml > staging-config.yaml
kubectl get configmap <name> -n production -o yaml > production-config.yaml
diff staging-config.yaml production-config.yaml
# Compare secrets (names only, not values)
kubectl get secrets -n staging
kubectl get secrets -n production
# Check environment variables in pod
kubectl exec -it <pod> -n <namespace> -- env | sort
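The `env | sort` comparison above can be automated with a few lines of Python once you've captured both dumps as text. `env_diff` is a hypothetical helper that reports variable names present in only one environment:

```python
def env_diff(env_a, env_b):
    """Compare two `env` dumps (lines of KEY=value); return keys unique to each."""
    def keys(dump):
        # Keep only the variable names; values may legitimately differ
        return {line.split("=", 1)[0] for line in dump.splitlines() if "=" in line}
    a, b = keys(env_a), keys(env_b)
    return sorted(a - b), sorted(b - a)
```

Run against Sarah's staging and production pods, this would have surfaced the five missing SMTP/push variables in seconds.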
2. Fail Fast on Missing Configuration
The Lesson: Applications should validate configuration on startup and fail immediately if something is wrong.
Implementation:
def validate_config():
required = ['DATABASE_URL', 'API_KEY', 'REDIS_URL']
missing = [var for var in required if not os.getenv(var)]
if missing:
print(f"ERROR: Missing required config: {missing}")
sys.exit(1)
# Run before starting the application
validate_config()
app.run()
Benefits:
- Pods won't become Ready if config is wrong
- Clear error messages
- Fast feedback
- Prevents silent failures
3. Use Configuration Management Tools
The Lesson: Don't manually manage environment-specific configuration. Use tools.
Tool Options:
Kustomize (Recommended for most):
# Simple, built into kubectl
kubectl apply -k overlays/production
Helm:
# Powerful templating
helm install myapp ./chart -f values-production.yaml
Terraform + Kubernetes Provider:
# Infrastructure as Code
resource "kubernetes_config_map" "app_config" {
# ...
}
4. Separate Config from Code
The Lesson: Configuration should never be hardcoded in application code or deployment manifests.
Bad (Hardcoded):
env:
- name: DATABASE_URL
value: "postgresql://prod-db:5432/myapp" # Hardcoded!
Good (ConfigMap):
env:
- name: DATABASE_URL
valueFrom:
configMapKeyRef:
name: app-config
key: database-url
Better (Secret reference):
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: app-secrets
key: database-url
5. Secrets Are Special
The Lesson: Secrets require special handling. Never commit them to Git.
Secret Management Options:
Option 1: Manual Creation (Development only)
kubectl create secret generic app-secrets \
--from-literal=api-key=abc123 \
-n production
Option 2: Sealed Secrets (Encrypted in Git)
# Encrypt secret
kubeseal -f secret.yaml -w sealed-secret.yaml
# Commit sealed-secret.yaml to Git
# It decrypts automatically in cluster
Option 3: External Secrets Operator (Recommended)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
spec:
secretStoreRef:
name: aws-secrets-manager
target:
name: app-secrets
data:
- secretKey: api-key
remoteRef:
key: prod/app/api-key
Option 4: HashiCorp Vault
# Inject secrets at runtime
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "myapp"
vault.hashicorp.com/agent-inject-secret-config: "secret/data/myapp"
6. Environment Parity Reduces Risk
The Lesson: The more similar staging is to production, the fewer surprises you'll have.
Parity Checklist:
- Same Kubernetes version
- Same resource limits (scaled down is OK)
- Same configuration structure (ConfigMaps, Secrets)
- Same dependency versions (Redis, PostgreSQL, etc.)
- Same networking setup
- Same monitoring and logging
Acceptable Differences:
- Replica counts (fewer in staging)
- Resource amounts (less in staging)
- Data volume (smaller in staging)
- External service endpoints (test vs production)
7. Configuration as Code
The Lesson: Treat configuration like code. Version-control it, review changes, and test it.
Best Practices:
- ✅ Store configuration in Git
- ✅ Require PR reviews for changes
- ✅ Test configuration changes in staging first
- ✅ Automate deployment with CI/CD
- ✅ Use GitOps for deployment
- ✅ Tag/version configuration changes
Git Structure:
infrastructure/
├── applications/
│   ├── notification-service/
│   │   ├── base/
│   │   └── overlays/
│   │       ├── staging/
│   │       └── production/
│   └── user-service/
└── README.md
8. Document Environment Differences
The Lesson: Create a "source of truth" document listing all environment differences.
Example Documentation:
# Environment Configuration Matrix
| Component | Development | Staging | Production |
|-----------|------------|---------|------------|
| Database | SQLite | PostgreSQL 14 | PostgreSQL 14 |
| Redis | Local | redis:6379 | redis-cluster:6379 |
| Replicas | 1 | 2 | 5 |
| CPU Limit | 100m | 500m | 1000m |
| Memory Limit | 128Mi | 512Mi | 1Gi |
| Log Level | DEBUG | DEBUG | INFO |
| SMTP | Mailhog | Mailhog | SendGrid |
## Secrets Required
### Staging
- smtp-password (test value)
- push-api-key (test key)
### Production
- smtp-password (SendGrid API key)
- push-api-key (OneSignal production key)
- database-password (RDS password)
Reflection Questions
Think about configuration management in your environment:
- Your Configuration Practice:
  - How do you manage configuration across environments?
  - Are configurations in version control?
  - How similar are your staging and production environments?
- "Works on My Machine" Incidents:
  - When was the last time something worked in one environment but not another?
  - What was the root cause?
  - How could it have been prevented?
- Secrets Management:
  - Where do you store secrets?
  - Are secrets in Git? (They shouldn't be!)
  - How do you rotate secrets?
- Environment Differences:
  - What varies between your environments?
  - Is this documented?
  - Are the differences intentional or accidental?
- Configuration Validation:
  - Do your applications validate configuration on startup?
  - What happens when configuration is missing?
  - How quickly can you detect configuration issues?
- Tools and Processes:
  - Do you use Kustomize, Helm, or another tool?
  - How do you deploy to different environments?
  - Is deployment automated or manual?
What's Next?
Sarah now had proper configuration management in place. She could:
- Deploy the same application to any environment
- Know exactly what varies between environments
- Quickly identify configuration issues
- Avoid "works on my machine" problems
But she was about to face a new challenge: the notification service was running perfectly in production, but during a traffic spike, it started crashing. The logs showed OOMKilled errors. Sarah needed to learn about resource management in Kubernetes.
In Chapter 4, "The Resource Crunch," Sarah will learn about CPU and memory limits, how to rightsize applications, and how to prevent resource-related outages.
Code Examples
All code examples from this chapter are available in the examples/chapter-03/ directory of the GitHub repository.
To access the examples:
# Clone the repository
git clone https://github.com/BahaTanvir/devops-guide-book.git
cd devops-guide-book/examples/chapter-03
# See available files
ls -la
# Try deploying with Kustomize
kubectl apply -k overlays/staging --dry-run=client
# Deploy to local cluster
kubectl apply -k overlays/staging
What's included:
- Complete Kustomize base and overlays
- Configuration validation script
- Environment comparison tool
- Deployment checklist template
- Example applications with config validation
- Testing scripts
Online access: View examples on GitHub
Remember: Proper configuration management prevents 90% of deployment issues! 🔧
Chapter 4: The Resource Crunch
"Resource limits are guardrails, not restrictions."
Sarah's Challenge
Two weeks after fixing the configuration issues, Sarah was feeling confident. The notification service was running smoothly in production, sending emails and push notifications without issues. Everything seemed perfect.
Until Tuesday at 2 PM.
Her phone buzzed with alerts:
🚨 CRITICAL: notification-service pods restarting
🚨 CRITICAL: notification-service - OOMKilled
🚨 WARNING: notification-service - CrashLoopBackOff
Sarah's stomach dropped. OOMKilled? She'd heard about this; it meant "Out Of Memory Killed." The pods were using too much memory and Kubernetes was killing them.
She quickly checked the pod status:
kubectl get pods -n production -l app=notification-service
NAME READY STATUS RESTARTS AGE
notification-service-7d8f4c5b9d-8xk2p 0/1 OOMKilled 5 10m
notification-service-7d8f4c5b9d-j7h9m 0/1 OOMKilled 4 10m
notification-service-7d8f4c5b9d-m2p4w 1/1 Running 3 10m
Two pods were repeatedly being killed, and even the running one had restarted 3 times. She checked the events:
kubectl get events -n production --sort-by='.lastTimestamp' | grep notification-service
10m Warning OOMKilling pod/notification-service-7d8f4c5b9d-8xk2p Memory cgroup out of memory
10m Warning BackOff pod/notification-service-7d8f4c5b9d-8xk2p Back-off restarting failed container
9m Warning OOMKilling pod/notification-service-7d8f4c5b9d-j7h9m Memory cgroup out of memory
The pods were being killed because they exceeded their memory limit. But Sarah had set memory limits based on what seemed reasonable. What went wrong?
She looked at the deployment configuration:
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
These values had worked fine for weeks. Why were they suddenly insufficient?
James walked over and noticed her concerned expression. "OOMKilled issues?"
"Yeah," Sarah said. "The notification service keeps getting killed for using too much memory. But I set limits!"
"Setting limits is good," James said, "but the wrong limits can be worse than no limits. Let's figure out what's actually happening with your pods."
Understanding the Problem
Sarah's resource management issues revealed several fundamental concepts about how Kubernetes manages resources and why pods get killed.
1. Requests vs Limits
Kubernetes has two resource specifications that many engineers confuse:
Requests (Minimum Guarantee):
- "I need at least this much to run"
- Used by the scheduler to decide which node to place the pod on
- Pod won't be scheduled if node doesn't have available resources
- Pod can use more than requested
Limits (Maximum Allowed):
- "Don't let me use more than this"
- Enforced by the container runtime
- If exceeded:
- CPU: Throttled (slowed down)
- Memory: Killed (OOMKilled)
resources:
requests: # "I need..."
memory: "256Mi"
cpu: "250m"
limits: # "Don't let me exceed..."
memory: "512Mi"
cpu: "500m"
Visual Representation:
Memory Usage Timeline:

0Mi ─────────────────────────────────────────> Time
        │               │                │
   Pod starts    Request (256Mi)   Limit (512Mi)
                 Guaranteed        Kill if exceeded!

   0-256Mi   → Safe (guaranteed)
   256-512Mi → Can use if available on the node
   >512Mi    → OOMKilled
2. The OOMKilled Problem
When a pod exceeds its memory limit, the kernel's OOM (Out Of Memory) killer immediately terminates it. There's no graceful degradation; it's instant death.
What Happens:
- Application uses more memory than limit
- Kernel detects memory limit exceeded
- OOM killer terminates the process
- Container exits with code 137 (128 + 9 SIGKILL)
- Kubernetes sees container died
- Kubelet restarts the container
- If it happens repeatedly → CrashLoopBackOff
Why It Happens:
- Memory leak in application
- Sudden spike in traffic
- Large data processing
- Caching gone wrong
- Limits set too low
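The exit-code arithmetic above (137 = 128 + 9) generalizes: a container killed by signal N exits with 128 + N. A small sketch to decode such codes (the function name is illustrative):

```python
import signal

# Decode the "128 + signal" convention used for containers killed by a signal.
def describe_exit_code(code: int) -> str:
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (128 + {sig.value})"
    return f"exited normally with status {code}"

print(describe_exit_code(137))  # killed by SIGKILL (128 + 9) — what the OOM killer sends
print(describe_exit_code(143))  # killed by SIGTERM (128 + 15) — a graceful shutdown request
```

Seeing 137 in `kubectl describe pod` is therefore a strong hint to check for OOMKilled events, while 143 usually just means the pod was asked to shut down.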
3. CPU Throttling
Unlike memory (which kills), CPU limits throttle:
CPU Limit Exceeded:
- Process doesn't get killed
- Gets throttled (slowed down)
- Can lead to:
- Slow response times
- Health check failures (timeouts)
- Request queuing
- Cascading failures
Example:
CPU Limit: 1 core (1000m)
App tries to use: 1.5 cores
Result: App runs at 66% speed (1.0/1.5)
Everything takes 50% longer
Requests start timing out
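The slowdown in the example is just the ratio of demand to limit. A tiny sketch of that arithmetic:

```python
# CFS throttling arithmetic: demand beyond the CPU limit doesn't kill the
# process, it stretches wall-clock time by demand/limit.
def throttle_factor(cpu_limit: float, cpu_demand: float) -> float:
    """How much longer CPU-bound work takes when demand exceeds the limit."""
    return max(cpu_demand / cpu_limit, 1.0)

factor = throttle_factor(cpu_limit=1.0, cpu_demand=1.5)
print(f"Work takes {factor:.1f}x as long")  # 1.5x, i.e. 50% longer
```

That 1.5x stretch is exactly why a 5-second health-check timeout can start failing under load even though the application is "healthy."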
4. Resource Units in Kubernetes
Memory Units:
Ki = Kibibyte (1024 bytes)
Mi = Mebibyte (1024 Ki = 1,048,576 bytes)
Gi = Gibibyte (1024 Mi)
128974848 bytes = 123Mi
1Gi = 1024Mi = 1,048,576Ki
CPU Units:
1 CPU = 1000m (millicores)
500m = 0.5 CPU
100m = 0.1 CPU = 10% of one CPU core
1m = 0.001 CPU (minimum)
Example:
250m = 1/4 of a CPU core
2000m = 2 CPUs = 2 full CPU cores
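If the unit conversions feel error-prone, a small helper can normalize Kubernetes quantity strings into plain numbers. This is a hypothetical sketch (`parse_cpu` and `parse_memory` are illustrative names, not a Kubernetes API), covering only the units discussed above:

```python
# Parse Kubernetes CPU quantities: "500m" -> 0.5 cores, "2" -> 2.0 cores.
def parse_cpu(quantity: str) -> float:
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

# Parse Kubernetes memory quantities into bytes: "512Mi" -> 536870912.
def parse_memory(quantity: str) -> int:
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain bytes

print(parse_cpu("250m"))      # 0.25
print(parse_memory("512Mi"))  # 536870912
```

The real API also accepts decimal suffixes (K, M, G) and exponent notation; for comparing requests and limits in your own tooling, the binary suffixes above are usually all you need.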
5. Quality of Service (QoS) Classes
Kubernetes assigns QoS classes based on resource settings:
Guaranteed (Highest Priority):
- Requests = Limits for all containers
- Least likely to be evicted
- Example:
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"  # Same as request
    cpu: "500m"      # Same as request
Burstable (Medium Priority):
- Requests < Limits, or only requests set
- Can use extra resources if available
- Sarah's configuration (most common)
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # Higher than request
    cpu: "500m"
BestEffort (Lowest Priority):
- No requests or limits set
- First to be evicted under pressure
- Not recommended for production
Eviction Priority:
BestEffort → Burstable → Guaranteed
(Killed first)           (Killed last)
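The QoS rules above can be expressed as a short function. This is a simplified sketch for a single-container pod (the real kubelet logic also handles multi-container pods and request defaulting):

```python
# Determine the QoS class Kubernetes would assign from a container's
# resource spec (single-container simplification).
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    # Guaranteed: cpu and memory limits set, and requests (if set) equal limits.
    if (limits.get("cpu") and limits.get("memory")
            and requests.get("cpu", limits["cpu"]) == limits["cpu"]
            and requests.get("memory", limits["memory"]) == limits["memory"]):
        return "Guaranteed"
    return "Burstable"

# Sarah's configuration: requests < limits -> Burstable
print(qos_class({"cpu": "250m", "memory": "256Mi"},
                {"cpu": "500m", "memory": "512Mi"}))  # Burstable
print(qos_class({"cpu": "500m", "memory": "512Mi"},
                {"cpu": "500m", "memory": "512Mi"}))  # Guaranteed
print(qos_class({}, {}))                              # BestEffort
```

You can verify the class Kubernetes actually assigned with `kubectl get pod <name> -o jsonpath='{.status.qosClass}'`.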
6. Node Resource Pressure
When a node runs out of resources, Kubernetes evicts pods:
Memory Pressure:
- Node is running out of memory
- Kubernetes evicts BestEffort pods first
- Then Burstable pods exceeding requests
- Finally Guaranteed pods (only in extreme cases)
Disk Pressure:
- Node running out of disk space
- Pods evicted based on QoS class
- Ephemeral storage limits can trigger this
7. Why Sarah's Pods Were OOMKilled
After investigation, James and Sarah discovered several issues:
Issue 1: Memory Leak. The notification service had a memory leak: it cached notification templates in memory but never cleared old ones.
Issue 2: Traffic Spike. Marketing sent a campaign to all users simultaneously, creating 10x the normal notification volume.
Issue 3: Limits Too Low. The 256Mi request was reasonable for normal load, but the 512Mi limit was too low for peak traffic combined with the memory leak.
Issue 4: No Horizontal Scaling. Only 3 pods handled all traffic; no autoscaling was configured.
The Senior's Perspective
James explained his approach to resource management.
The Resource Management Mental Model
"Think of Kubernetes resource management like a hotel," James explained:
Requests = Room Reservation
- You book a room (guarantee you'll have space)
- Hotel can't overbook beyond capacity
- You might not use the whole room, but it's yours
Limits = Fire Code Capacity
- Maximum occupancy for safety
- Exceeding it triggers immediate action
- Based on safety, not comfort
No Resources Set = Walk-In Guest (No Reservation)
- Hope for space, but no guarantee
- First to be turned away if the hotel is full
Questions Senior Engineers Ask About Resources
1. "What does this application actually use?"
   - Measure it, don't guess
   - Monitor in staging under load
   - Profile memory and CPU usage
   - Understand growth patterns
2. "What happens under peak load?"
   - Normal load vs. spike load
   - Daily/weekly patterns
   - Campaign/event driven spikes
   - Gradual growth over time
3. "What's the cost of being wrong?"
   - Too low → OOMKilled, poor performance
   - Too high → wasted money, limited scale
   - Balance reliability vs. cost
4. "Should this scale horizontally or vertically?"
   - Horizontal: more pods (better for stateless)
   - Vertical: bigger pods (better for stateful)
   - Most web services: horizontal
5. "What's the blast radius of resource issues?"
   - One pod dying → service degraded
   - All pods dying → service down
   - Node resource exhaustion → multiple services impacted
Rightsizing Strategy
James shared his approach:
Phase 1: Measure
# Monitor actual usage in staging
kubectl top pods -n staging
# Use metrics server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods
# Use Prometheus queries
rate(container_cpu_usage_seconds_total[5m])
container_memory_working_set_bytes
Phase 2: Set Conservative Limits
Requests: P50 usage (typical)
Limits: P95 usage (peaks) + 20% buffer
Phase 3: Monitor and Adjust
Watch for:
- OOMKilled events
- CPU throttling
- Resource waste
- Performance issues
Phase 4: Enable Autoscaling
# Scale based on actual usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
Common Resource Patterns
Pattern 1: CPU-Intensive (Data Processing)
resources:
requests:
memory: "512Mi"
cpu: "1000m" # High CPU
limits:
memory: "1Gi"
cpu: "2000m" # Allow bursting
Pattern 2: Memory-Intensive (Caching)
resources:
requests:
memory: "2Gi" # High memory
cpu: "250m"
limits:
memory: "4Gi" # Generous buffer
cpu: "500m"
Pattern 3: Balanced Web Service
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
Pattern 4: Guaranteed QoS (Critical)
resources:
requests:
memory: "1Gi"
cpu: "1000m"
limits:
memory: "1Gi" # Same as request
cpu: "1000m" # Same as request
The Solution
James and Sarah implemented proper resource management.
Step 1: Measure Actual Usage
First, they measured what the notification service actually used:
# Install metrics-server if not already present
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Check current usage
kubectl top pods -n production -l app=notification-service
Output:
NAME CPU(cores) MEMORY(bytes)
notification-service-7d8f4c5b9d-8xk2p 245m 487Mi
notification-service-7d8f4c5b9d-j7h9m 198m 456Mi
notification-service-7d8f4c5b9d-m2p4w 267m 512Mi ← At limit!
They saw pods consistently using 450-512Mi of memory, right at the limit!
Step 2: Analyze Memory Usage Over Time
Using Prometheus, they queried historical memory usage:
# Memory usage over last 24 hours
container_memory_working_set_bytes{
pod=~"notification-service-.*",
namespace="production"
}
# Results:
# P50 (median): 380Mi
# P95 (95th percentile): 520Mi ← Exceeds current limit!
# P99: 580Mi
# Max: 620Mi
Discovery: The 512Mi limit was too low for peak usage!
Step 3: Check for Memory Leaks
They added memory profiling to identify the leak:
# notification_service.py
import tracemalloc
import logging
# Start memory profiling
tracemalloc.start()
# Periodic memory snapshot
@app.route('/debug/memory')
def memory_snapshot():
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
memory_info = []
for stat in top_stats[:10]:
memory_info.append({
'file': stat.traceback.format()[0],
'size_mb': stat.size / 1024 / 1024
})
return jsonify({
'current_mb': tracemalloc.get_traced_memory()[0] / 1024 / 1024,
'peak_mb': tracemalloc.get_traced_memory()[1] / 1024 / 1024,
'top_allocations': memory_info
})
Discovery: Template cache was growing indefinitely!
Step 4: Fix the Memory Leak
# Before (Memory Leak):
template_cache = {} # Grows forever!
def load_template(template_name):
if template_name not in template_cache:
template_cache[template_name] = load_from_disk(template_name)
return template_cache[template_name]
# After (Fixed with LRU Cache):
from functools import lru_cache
@lru_cache(maxsize=100) # Cache only 100 templates
def load_template(template_name):
return load_from_disk(template_name)
# Or use cachetools with TTL:
from cachetools import TTLCache
template_cache = TTLCache(maxsize=100, ttl=3600) # 1 hour TTL
Step 5: Set Appropriate Resource Limits
Based on measurements and the memory leak fix:
apiVersion: apps/v1
kind: Deployment
metadata:
name: notification-service
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: notification-service
template:
metadata:
labels:
app: notification-service
spec:
containers:
- name: notification
image: techflow/notification-service:v1.3.0
ports:
- containerPort: 8080
resources:
requests:
memory: "384Mi" # P50 + buffer
cpu: "250m" # Typical usage
limits:
memory: "768Mi" # P95 + 50% buffer
cpu: "1000m" # Allow bursting to 1 core
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5 # Account for CPU throttling
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
Key Changes:
- Memory request: 256Mi β 384Mi (based on P50)
- Memory limit: 512Mi β 768Mi (based on P95 + buffer)
- CPU limit: 500m β 1000m (allow bursting)
- Increased timeouts (account for CPU throttling)
Step 6: Configure Horizontal Pod Autoscaler (Deep Dive)
To handle traffic spikes, they added autoscaling. This example shows a fairly advanced HPA configuration; on a first read, focus on the idea that Kubernetes can scale based on resource usage. You can come back to the exact YAML when you're ready to implement it.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: notification-service-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: notification-service
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # Scale when avg CPU > 70%
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80 # Scale when avg memory > 80%
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50 # Remove max 50% of pods at once
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100 # Double pods if needed
periodSeconds: 15
- type: Pods
value: 4 # Or add 4 pods
periodSeconds: 15
selectPolicy: Max # Use whichever scales faster
HPA Configuration Explained:
Scale Up (Aggressive):
- No stabilization window (immediate)
- Can double pods (100%) or add 4 pods
- Uses whichever is faster
- Checks every 15 seconds
Scale Down (Conservative):
- 5-minute stabilization window
- Max 50% reduction at once
- Prevents flapping
Triggers:
- CPU > 70% average
- Memory > 80% average
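Under the hood, the HPA uses the formula documented by Kubernetes: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A sketch of that calculation (the function name is illustrative):

```python
import math

# Core HPA scaling formula from the Kubernetes docs:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(desired, max_replicas))

# 3 pods averaging 140% CPU against a 70% target -> HPA wants 6 pods
print(desired_replicas(3, current_utilization=140, target_utilization=70,
                       min_replicas=3, max_replicas=10))  # 6
```

The `behavior` section in the YAML above then rate-limits how quickly the controller is allowed to move toward that desired count.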
Step 7: Set Resource Quotas and Limit Ranges
To prevent runaway resource usage across the namespace:
# Namespace ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "50" # Max 50 CPUs requested
requests.memory: "100Gi" # Max 100Gi memory requested
limits.cpu: "100" # Max 100 CPUs limit
limits.memory: "200Gi" # Max 200Gi memory limit
pods: "100" # Max 100 pods
---
# LimitRange (defaults and constraints)
apiVersion: v1
kind: LimitRange
metadata:
name: production-limits
namespace: production
spec:
limits:
- max: # Maximum per pod
memory: "4Gi"
cpu: "4"
min: # Minimum per pod
memory: "64Mi"
cpu: "50m"
default: # Default limit if not specified
memory: "512Mi"
cpu: "500m"
defaultRequest: # Default request if not specified
memory: "256Mi"
cpu: "250m"
type: Container
Step 8: Monitor Resource Usage
Set up alerts for resource issues:
# Prometheus AlertManager rules
groups:
- name: resources
rules:
- alert: PodOOMKilled
expr: |
increase(kube_pod_container_status_restarts_total[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} restarted in the last 5 minutes"
description: "One or more containers restarted recently. Check if the cause was OOMKilled using logs or events."
- alert: PodCPUThrottling
expr: |
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} is being CPU throttled (example threshold)"
description: "CPU throttling may impact performance. Tune this threshold based on your workload and SLOs."
- alert: HighMemoryUsage
expr: |
container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} using >90% of memory limit"
description: "May be OOMKilled soon"
- alert: PodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} is crash looping"
description: "Pod restarting frequently"
Step 9: Create Resource Management Dashboard
They created a Grafana dashboard to visualize:
Panel 1: Memory Usage vs Limit
Panel 2: CPU Usage vs Limit
Panel 3: Pod Restarts (OOMKilled)
Panel 4: HPA Scaling Events
Panel 5: Resource Waste (Requested but Unused)
Panel 6: Throttling Events
Key Queries:
# Memory usage percentage
container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100
# CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m])
# Resource waste
(container_spec_memory_limit_bytes - container_memory_working_set_bytes) / 1024 / 1024 / 1024
Lessons Learned
1. Always Set Resource Limits
The Lesson: Pods without limits can consume all node resources, impacting other pods.
Why It Matters:
- One pod can bring down entire node
- Noisy neighbor problem
- Makes capacity planning impossible
Implementation:
# ❌ Bad (No limits)
containers:
- name: app
image: myapp:latest
# No resources defined!
# ✅ Good (With limits)
containers:
- name: app
image: myapp:latest
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
2. Requests Are for Scheduling, Limits Are for Safety
The Lesson:
- Requests: Tell scheduler where pod can fit
- Limits: Prevent runaway resource usage
Common Mistake:
# ❌ Setting requests = limits unnecessarily
resources:
requests:
memory: "2Gi"
cpu: "2000m"
limits:
memory: "2Gi" # Same as request
cpu: "2000m" # Same as request
# This reserves resources that might not be used!
Better:
# ✅ Allow bursting to limits
resources:
requests:
memory: "512Mi" # What I typically need
cpu: "500m"
limits:
memory: "1Gi" # Can burst to this
cpu: "1000m"
3. Measure, Don't Guess
The Lesson: Don't guess resource requirements; measure them.
How to Measure:
# Current usage
kubectl top pods
# Historical data (Prometheus)
container_memory_working_set_bytes
rate(container_cpu_usage_seconds_total[5m])
# Load testing
# Run load tests and monitor resource usage
Rightsizing Formula:
Requests = P50 (median) usage + 10% buffer
Limits = P95 (95th percentile) usage + 20-50% buffer
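The formula above can be applied directly to exported usage samples. A minimal sketch, assuming memory samples in Mi pulled from Prometheus (the nearest-rank percentile here is a simplification of what Prometheus's `quantile` functions compute; function names are illustrative):

```python
import math

# Nearest-rank percentile over a list of samples.
def percentile(samples: list[float], p: float) -> float:
    s = sorted(samples)
    idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[idx]

# Apply the rightsizing formula: request from P50, limit from P95 + buffer.
def rightsize_memory(samples: list[float]) -> dict[str, int]:
    p50, p95 = percentile(samples, 50), percentile(samples, 95)
    return {
        "request_mi": round(p50 * 1.10),  # P50 + 10% buffer
        "limit_mi":   round(p95 * 1.50),  # P95 + 50% buffer (upper end of range)
    }

# 24h of samples hovering around 380Mi with spikes toward 620Mi
samples = [380.0] * 90 + [520.0] * 8 + [620.0] * 2
print(rightsize_memory(samples))  # {'request_mi': 418, 'limit_mi': 780}
```

Fed with Sarah's observed numbers (P50 380Mi, P95 520Mi), this lands close to the 384Mi/768Mi values she and James chose.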
4. Memory Kills, CPU Throttles
The Lesson: Understand the difference:
- Memory over limit: Pod killed (OOMKilled)
- CPU over limit: Pod throttled (slowed down)
Implications:
- Memory limits must be generous (killing is severe)
- CPU limits can be tighter (throttling is recoverable)
- CPU throttling can cause timeout errors
Example:
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi" # Generous (2x request)
cpu: "1000m" # Very generous (10x request - allow bursting)
5. Use Horizontal Pod Autoscaling
The Lesson: Static pod counts can't handle variable load. Use HPA.
Benefits:
- Automatic scaling based on metrics
- Handle traffic spikes
- Save money during low traffic
- Prevent overload
Configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: myapp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
6. Quality of Service Matters
The Lesson: QoS class determines eviction priority during resource pressure.
Classes:
Guaranteed (requests = limits)   ← Less likely to be evicted
        ↓
Burstable (requests < limits)
        ↓
BestEffort (no resources set)    ← First to be evicted
When to Use:
- Guaranteed: Critical services (databases, core services)
- Burstable: Most applications (web services, APIs)
- BestEffort: Batch jobs, dev/test only
7. Monitor and Alert on Resource Issues
The Lesson: Don't wait for OOMKilled; alert before it happens.
Key Alerts:
- Memory usage > 90% of limit
- CPU throttling occurring
- OOMKilled events
- Crash loop backoff
- HPA scaling events
Grafana Queries:
# Approaching memory limit
(container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
# CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
# OOMKilled (kube-state-metrics exposes the last termination reason as a gauge)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
8. Resource Quotas Prevent Disasters
The Lesson: Set namespace-level quotas to prevent one app from consuming all resources.
Why:
- Limit blast radius
- Enforce resource governance
- Prevent accidental overprovisioning
- Fair sharing of cluster resources
Implementation:
apiVersion: v1
kind: ResourceQuota
metadata:
name: prod-quota
namespace: production
spec:
hard:
requests.cpu: "50"
requests.memory: "100Gi"
limits.cpu: "100"
limits.memory: "200Gi"
Reflection Questions
1. Your Current Resource Settings:
   - Do all your pods have resource requests and limits?
   - How did you determine these values?
   - When was the last time you reviewed them?
2. Monitoring:
   - Can you see actual resource usage vs limits?
   - Do you have alerts for OOMKilled events?
   - Do you track CPU throttling?
3. Scaling:
   - Do you use Horizontal Pod Autoscaling?
   - How do your applications handle traffic spikes?
   - What happens when you hit resource limits?
4. Past Issues:
   - Have you experienced OOMKilled pods?
   - What was the root cause?
   - How did you determine the right limits?
5. Cost vs Performance:
   - Are you over-provisioning resources?
   - Where could you reduce without risk?
   - Where should you increase for reliability?
What's Next?
Sarah now understood resource management. She could:
- Set appropriate requests and limits
- Use HPA for automatic scaling
- Monitor and alert on resource issues
- Prevent OOMKilled and throttling
But there was one more challenge in Part I: the CI/CD pipeline. Deployments still took 2+ hours, builds frequently failed, and the pipeline consumed excessive resources. Sarah needed to learn about pipeline optimization.
In Chapter 5, "The Slow Release Nightmare," Sarah will learn how to optimize CI/CD pipelines for speed and reliability.
Code Examples
All code examples from this chapter are available in the examples/chapter-04/ directory of the GitHub repository.
To access the examples:
# Clone the repository
git clone https://github.com/BahaTanvir/devops-guide-book.git
cd devops-guide-book/examples/chapter-04
# See available files
ls -la
# Try the examples
kubectl apply -f resource-examples/
What's included:
- Resource limit examples (various patterns)
- HPA configurations
- Resource quota examples
- Monitoring queries
- Load testing scripts
- Memory profiling tools
Online access: View examples on GitHub
Remember: Proper resource management prevents outages and saves money! 💰
Chapter 5: The Slow Release Nightmare
"Fast feedback loops are the foundation of velocity."
Sarah's Challenge
A month had passed since Sarah fixed the resource management issues. The notification service was running smoothly with proper limits and HPA configured. Sarah felt like she was finally getting the hang of DevOps.
But there was one problem that had been bothering her since day one: deployments took forever.
Every time the development team wanted to release a new feature, the process was painful:
- Developer commits code
- Wait 15 minutes for tests to run
- Wait 45 minutes for Docker image build
- Wait 20 minutes for image push
- Wait 10 minutes for deployment
- Total: 90 minutes from commit to deployed
And that was when everything worked. Often, the build would fail halfway through, requiring another 90-minute cycle.
It was Thursday afternoon when Marcus called a team meeting.
"We need to talk about our release velocity," Marcus began. "The product team is frustrated. It takes 2+ hours to deploy a simple bug fix, and we can only do 2-3 deployments per day maximum. Our competitors are deploying 10+ times per day."
Sarah knew he was right. Just yesterday, a critical bug fix sat in the queue for 3 hours because the pipeline was backed up with other builds.
"What's slowing us down?" asked one of the developers.
Marcus pulled up the CI/CD dashboard. "Our GitHub Actions pipeline is the bottleneck. Let me show you..."
# Current pipeline (simplified)
name: Build and Deploy
on:
push:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install dependencies
run: |
npm install # Downloads 500MB of dependencies every time
pip install -r requirements.txt
- name: Run tests
run: npm test # 15 minutes
build:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Build Docker image
run: |
docker build -t myapp:${{ github.sha }} . # 45 minutes!
- name: Push to registry
run: |
docker push myapp:${{ github.sha }} # 20 minutes
deploy:
needs: build
runs-on: ubuntu-latest
steps:
- name: Deploy to Kubernetes
run: kubectl set image deployment/myapp myapp=myapp:${{ github.sha }}
- name: Wait for rollout
run: kubectl rollout status deployment/myapp # 10 minutes
"See the problem?" Marcus asked. "Everything runs sequentially. Tests wait for nothing. Builds wait for tests. Deploys wait for builds. And we're not caching anything!"
Sarah looked at the pipeline. She could see several obvious issues:
- No caching (downloading dependencies every time)
- Sequential execution (not parallel where possible)
- Huge Docker images (taking forever to build and push)
- Inefficient Dockerfile (rebuilding everything on tiny changes)
"Sarah," Marcus said, "you've learned a lot about Kubernetes. Now let's optimize our CI/CD pipeline. We need to get this down to under 15 minutes."
Sarah gulped. 90 minutes to 15 minutes? That seemed impossible. But she was ready to try.
Understanding the Problem
Sarah's CI/CD pipeline suffered from multiple inefficiencies that are common in many organizations.
1. Sequential vs Parallel Execution
Current (Sequential):
Test (15 min) → Build (45 min) → Deploy (10 min) = 70 minutes
Potential (Parallel):
Test (15 min)  ─┐
                ├─→ Deploy (10 min) = 25 minutes
Build (15 min) ─┘
Many jobs can run in parallel:
- Linting and testing
- Building different services
- Pushing multiple images
- Running different test suites
2. No Caching Strategy
Every pipeline run started from scratch:
Without Caching:
- Download 500MB of npm dependencies
- Download 200MB of Python packages
- Rebuild all Docker layers
- Total wasted: 10-15 minutes per build
With Caching:
- Restore cached dependencies (30 seconds)
- Reuse unchanged Docker layers
- Only rebuild what changed
- Time saved: 10-15 minutes
3. Inefficient Docker Builds
Bad Dockerfile (Sarah's current):
FROM node:18
WORKDIR /app
# ❌ Copy everything first
COPY . .
# ❌ Install dependencies after copying code
RUN npm install
# ❌ Every code change invalidates all layers below
RUN npm run build
CMD ["npm", "start"]
Problem: Any code change invalidates the COPY . . layer, forcing npm install to run again.
Better Dockerfile:
FROM node:18
WORKDIR /app
# ✅ Copy dependency files first
COPY package*.json ./
# ✅ Install dependencies (cached if package.json unchanged)
RUN npm install
# ✅ Copy code last (doesn't invalidate dependency layer)
COPY . .
RUN npm run build
CMD ["npm", "start"]
4. Large Docker Images
Sarah's current image: 1.2 GB
Why so large:
- Included dev dependencies
- Used full node:18 image (not slim/alpine)
- Included build tools
- Contained test files
- Had unnecessary system packages
Impact:
- 20 minutes to push
- 15 minutes to pull on nodes
- Wasted disk space
- Slower deployments
5. No Build Matrix / Parallelization
Tests could run in parallel:
Unit tests (5 min)        ─┐
Integration tests (8 min) ─┼─→ Report (1 min)
E2E tests (12 min)        ─┘
Parallel: 13 minutes
Sequential: 26 minutes
6. Rebuilding Unchanged Services
In a monorepo with multiple services, Sarah's pipeline rebuilt everything even if only one service changed:
Commit to service-A → Rebuild service-A, service-B, service-C
(Waste: rebuilding B and C)
7. No Artifact Caching
Pipeline built the same Docker image multiple times:
- Build for testing
- Build for staging
- Build for production
Should build once, deploy everywhere.
8. Inefficient Test Strategy
Current:
- All tests run on every commit
- Slow tests block fast tests
- No test result caching
- Flaky tests cause full reruns
Better:
- Fast tests first (fail fast)
- Parallel test execution
- Cache test results
- Retry only failed tests
The Senior's Perspective
James shared his CI/CD optimization framework with Sarah.
The CI/CD Performance Mental Model
"Think of your pipeline as an assembly line," James explained. "You want to:
- Identify the Critical Path - What's the longest sequential chain?
- Parallelize Everything Possible - Run independent jobs simultaneously
- Cache Aggressively - Never rebuild what hasn't changed
- Fail Fast - Run quick checks first
- Optimize the Bottleneck - Focus on the slowest step"
Questions Senior Engineers Ask About CI/CD
1. "What's the critical path?"
   - Identify the longest chain of dependent steps
   - That's your minimum possible time
   - Everything else can potentially parallelize
2. "What can we cache?"
   - Dependencies (npm, pip, maven)
   - Docker layers
   - Build artifacts
   - Test results
3. "What can run in parallel?"
   - Different test suites
   - Multiple services
   - Lint/format/security scans
   - Different deployment stages
4. "Where's the bottleneck?"
   - Usually: Docker build, image push, or slow tests
   - Use metrics to identify
   - Optimize the slowest step first
5. "Are we rebuilding unnecessarily?"
   - Changed-path detection
   - Monorepo service isolation
   - Smart rebuilds only
The CI/CD Optimization Checklist
James shared his checklist:
## Build Speed
- [ ] Dependencies cached
- [ ] Docker layer caching enabled
- [ ] Build only changed services
- [ ] Use smaller base images
- [ ] Multi-stage builds
## Test Speed
- [ ] Fast tests run first
- [ ] Tests run in parallel
- [ ] Test results cached
- [ ] Flaky tests identified and fixed
- [ ] Only affected tests run
## Image Optimization
- [ ] Multi-stage Dockerfile
- [ ] Alpine/slim base images
- [ ] .dockerignore configured
- [ ] Only production dependencies
- [ ] Image < 200MB if possible
## Pipeline Structure
- [ ] Jobs run in parallel where possible
- [ ] Artifacts shared between jobs
- [ ] Matrix builds for multiple variants
- [ ] Early exit on failures
- [ ] Retries for flaky steps
## Deployment
- [ ] Rolling deployments
- [ ] Health checks before cutover
- [ ] Automatic rollback on failure
- [ ] Deployment notifications
Common Pipeline Anti-Patterns
James showed Sarah what to avoid:
Anti-Pattern 1: Sequential Everything
# ❌ Bad
jobs:
lint:
steps: [lint]
test:
needs: lint # Unnecessary dependency
steps: [test]
build:
needs: test # Could run parallel with test
steps: [build]
Anti-Pattern 2: No Caching
# ❌ Bad - reinstalls every time
- run: npm install
- run: pip install -r requirements.txt
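A hedged counterpart using the official setup actions' built-in dependency caching (action and runtime versions here are illustrative; adjust to your stack):

```yaml
# ✅ Better - let the setup actions restore a dependency cache
- uses: actions/setup-node@v3
  with:
    node-version: '18'
    cache: 'npm'        # caches ~/.npm, keyed on package-lock.json
- run: npm ci
- uses: actions/setup-python@v4
  with:
    python-version: '3.11'
    cache: 'pip'        # caches pip downloads, keyed on requirements.txt
- run: pip install -r requirements.txt
```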
Anti-Pattern 3: Building Multiple Times
# ❌ Bad - builds 3 times
- build for test
- build for staging
- build for production
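The fix is to build and push one immutably tagged image, then promote that exact tag through each environment. A sketch with a placeholder registry path and deploy script:

```yaml
# ✅ Better - build once, tag by commit SHA, deploy the same artifact everywhere
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: |
          docker build -t ghcr.io/acme/myapp:${{ github.sha }} .
          docker push ghcr.io/acme/myapp:${{ github.sha }}
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh staging ghcr.io/acme/myapp:${{ github.sha }}
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      # Same image that passed staging - no rebuild, no drift
      - run: ./deploy.sh production ghcr.io/acme/myapp:${{ github.sha }}
```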
Anti-Pattern 4: Waiting for Approval in Pipeline
# ❌ Bad - blocks pipeline
- name: Deploy to staging
- name: Manual approval # Blocks runner
- name: Deploy to production
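On GitHub Actions, the usual fix is an environment protection rule: GitHub holds the job pending approval without occupying a runner. A minimal sketch (the approval itself is configured in the repository's environment settings, and `deploy.sh` is a placeholder):

```yaml
# ✅ Better - GitHub pauses the job; no runner sits idle waiting
deploy-production:
  needs: deploy-staging
  runs-on: ubuntu-latest
  environment: production   # protection rule requires a reviewer's approval
  steps:
    - run: ./deploy.sh production
```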
The Solution
Sarah and James optimized the pipeline step by step.
Step 1: Optimize the Dockerfile
Before (1.2GB, 45-minute build):
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
CMD ["npm", "start"]
After (180MB, 8-minute build):
# Multi-stage build
# Stage 1: Dependencies
FROM node:18-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
# Stage 2: Build
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Stage 3: Runtime
FROM node:18-alpine AS runtime
WORKDIR /app
# Copy only necessary files
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY package*.json ./
# Run as non-root user
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nodejs -u 1001
USER nodejs
EXPOSE 8080
CMD ["node", "dist/index.js"]
Improvements:
- Multi-stage build (only final stage in image)
- Alpine base (smaller)
- Production dependencies only
- Separate layers for dependencies and code
- Non-root user for security
- Size: 1.2GB → 180MB (85% reduction)
- Build: 45 min → 8 min (with caching)
Step 2: Add .dockerignore
# .dockerignore
node_modules
npm-debug.log
dist
.git
.gitignore
README.md
.env
.env.*
*.md
.vscode
.idea
coverage
.test
*.test.js
Dockerfile
.dockerignore
Impact: Faster COPY operations, smaller build context
Step 3: Optimize GitHub Actions Pipeline
Before (90 minutes):
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- run: npm install
- run: npm test
build:
needs: test
steps:
- uses: actions/checkout@v3
- run: docker build -t myapp .
- run: docker push myapp
deploy:
needs: build
steps:
- run: kubectl set image deployment/myapp myapp=myapp:$TAG
After (β12 minutes, with optimizations):
Deep Dive: Full GitHub Actions Workflow
Treat this as a reference implementation. Even if you use GitLab CI, Jenkins, or another system, the structure — parallel jobs, caching, staged deploys — still applies.
name: Optimized CI/CD
on:
push:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
# Job 1: Fast checks (parallel)
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node
uses: actions/setup-node@v3
with:
node-version: '18'
cache: 'npm' # ✅ Cache npm dependencies
- name: Install dependencies
run: npm ci
- name: Lint
run: npm run lint
# Job 2: Tests (parallel with lint)
test:
runs-on: ubuntu-latest
strategy:
matrix:
test-group: [unit, integration, e2e] # ✅ Parallel test execution
steps:
- uses: actions/checkout@v3
- name: Setup Node
uses: actions/setup-node@v3
with:
node-version: '18'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run ${{ matrix.test-group }} tests
run: npm run test:${{ matrix.test-group }}
- name: Upload coverage
if: matrix.test-group == 'unit'
uses: codecov/codecov-action@v3
# Job 3: Build Docker image (parallel with lint/test)
build:
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- uses: actions/checkout@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Log in to Container Registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=ref,event=branch
type=sha,prefix={{branch}}-
- name: Build and push Docker image
uses: docker/build-push-action@v4
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max
# ✅ Docker layer caching
# Job 4: Deploy (only after all checks pass)
deploy-staging:
needs: [lint, test, build]
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v3
- name: Setup kubectl
uses: azure/setup-kubectl@v3
- name: Configure kubeconfig
run: |
echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV" # a plain export would not survive past this step
- name: Deploy to staging
run: |
kubectl set image deployment/myapp \
myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
-n staging
kubectl rollout status deployment/myapp -n staging --timeout=5m
- name: Run smoke tests
run: ./scripts/smoke-test.sh staging
- name: Notify team
if: failure()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "Staging deployment failed for ${{ github.sha }}"
}
# Job 5: Production deployment (manual approval)
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production # ✅ Requires approval
steps:
- uses: actions/checkout@v3
- name: Setup kubectl
uses: azure/setup-kubectl@v3
- name: Configure kubeconfig
run: |
echo "${{ secrets.KUBE_CONFIG_PROD }}" | base64 -d > kubeconfig
echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV" # a plain export would not survive past this step
- name: Deploy to production
run: |
kubectl set image deployment/myapp \
myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
-n production
kubectl rollout status deployment/myapp -n production --timeout=10m
- name: Run smoke tests
run: ./scripts/smoke-test.sh production
- name: Notify team
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "✅ Production deployment successful: ${{ github.sha }}"
}
Key Optimizations:
- Parallel Execution:
- Lint, tests, and build run simultaneously
- Test matrix runs 3 test suites in parallel
- Caching:
- npm dependencies cached
- Docker layers cached in registry
- Restored on subsequent builds
- Docker Buildx:
- BuildKit for faster builds
- Layer caching to registry
- Multi-platform support
- Smart Dependencies:
- Deploy only after all checks pass
- Staging before production
- Manual approval for production
Results:
- Lint: 2 minutes
- Tests (parallel): 5 minutes
- Build: 8 minutes
- Deploy: 2 minutes
- Total: ~12 minutes (down from 90!)
Step 4: Monorepo Optimization
For repos with multiple services, add path filtering:
on:
push:
branches: [main]
paths:
- 'services/api/**'
- '.github/workflows/api.yml'
jobs:
build-api:
# Only runs if API code changed
steps:
- name: Build API
working-directory: services/api
run: docker build -t api .
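`paths:` filtering works at the workflow-trigger level, so it typically means one workflow file per service. To make per-service decisions inside a single workflow, a community action such as `dorny/paths-filter` can expose "what changed" as job outputs. A hedged sketch:

```yaml
# Sketch: skip a service's build job entirely when its files didn't change
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      api: ${{ steps.filter.outputs.api }}
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v2
        id: filter
        with:
          filters: |
            api:
              - 'services/api/**'
  build-api:
    needs: changes
    if: needs.changes.outputs.api == 'true'   # only when API files changed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: docker build -t api services/api
```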
Step 5: Caching Strategy
Dependencies:
- name: Cache node modules
uses: actions/cache@v3
with:
path: ~/.npm
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
restore-keys: |
${{ runner.os }}-node-
Docker:
- name: Build with cache
uses: docker/build-push-action@v4
with:
cache-from: type=gha
cache-to: type=gha,mode=max
Step 6: Build Matrix for Multiple Variants
strategy:
matrix:
platform: [linux/amd64, linux/arm64]
node-version: [16, 18, 20]
steps:
- name: Build for ${{ matrix.platform }}
run: docker buildx build --platform ${{ matrix.platform }} .
Step 7: Smoke Tests
#!/bin/bash
# scripts/smoke-test.sh
ENVIRONMENT=$1
URL="https://api-${ENVIRONMENT}.example.com"
echo "Running smoke tests against $URL"
# Health check
if ! curl -f "$URL/health"; then
echo "❌ Health check failed"
exit 1
fi
# Key endpoint test
if ! curl -f "$URL/api/users/1"; then
echo "❌ API test failed"
exit 1
fi
echo "✅ Smoke tests passed"
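Right after a rollout, the first request can hit a pod that is still warming up. A small retry wrapper makes the smoke test tolerant of cold starts without hiding real outages (a sketch; the health-check call at the bottom is a hypothetical example):

```shell
#!/bin/bash
# retry <attempts> <delay-seconds> <command...>
# Re-runs a command until it succeeds or attempts are exhausted.
# Returns non-zero on exhaustion so the pipeline still fails on a real outage.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    if (( i < attempts )); then sleep "$delay"; fi
  done
  return 1
}

# Example (hypothetical URL): tolerate brief cold-start failures
# retry 5 10 curl -fsS --max-time 10 "$URL/health"
```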
Lessons Learned
1. Parallelize Everything Possible
The Lesson: Independent jobs should run in parallel, not sequentially.
Implementation:
jobs:
lint: # No dependencies
test: # No dependencies
build: # No dependencies
deploy:
needs: [lint, test, build] # Waits for all
Impact: 70 minutes → 15 minutes
2. Cache Aggressively
The Lesson: Never rebuild what hasn't changed.
What to Cache:
- Dependencies (npm, pip, gems)
- Docker layers
- Build artifacts
- Test results
GitHub Actions Caching:
- uses: actions/cache@v3
with:
path: ~/.npm
key: ${{ hashFiles('package-lock.json') }}
3. Optimize Dockerfiles
The Lesson: Layer order matters. Put changing layers last.
Pattern:
# 1. Base image (changes rarely)
FROM node:18-alpine
# 2. Dependencies (changes occasionally)
COPY package*.json ./
RUN npm ci
# 3. Code (changes frequently)
COPY . .
RUN npm run build
4. Use Multi-Stage Builds
The Lesson: Keep only what you need in the final image.
Benefits:
- Smaller images (faster push/pull)
- No build tools in production
- Better security
- Clear separation of concerns
5. Fail Fast
The Lesson: Run quick checks first to catch errors early.
Order:
1. Lint (30 seconds) - catches syntax errors
2. Unit tests (2 min) - catches logic errors
3. Integration tests (5 min) - catches integration issues
4. Build (8 min) - only if tests pass
5. Deploy (2 min) - only if build succeeds
6. Smart Path Filtering
The Lesson: Don't rebuild services that haven't changed.
Monorepo Strategy:
on:
push:
paths:
- 'services/api/**' # Only API changes trigger API build
7. Use Build Matrices
The Lesson: Run multiple variants in parallel.
Examples:
- Multiple Node versions
- Multiple platforms (amd64, arm64)
- Multiple test suites
- Multiple environments
8. Monitor Pipeline Performance
The Lesson: Track metrics to identify slowdowns.
Key Metrics:
- Total pipeline duration
- Per-job duration
- Cache hit rate
- Failure rate
- Time to deploy
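As a starting point for these metrics, per-run duration can be computed from the timestamps in GitHub's workflow-runs API response (`GET /repos/{owner}/{repo}/actions/runs`); a minimal helper, assuming you fetch the JSON separately:

```python
from datetime import datetime

def run_duration_minutes(run: dict) -> float:
    """Minutes between when a workflow run started and was last updated.

    Expects a single workflow-run object from GitHub's
    GET /repos/{owner}/{repo}/actions/runs response, which includes
    "run_started_at" and "updated_at" ISO-8601 timestamps.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    started = datetime.strptime(run["run_started_at"], fmt)
    finished = datetime.strptime(run["updated_at"], fmt)
    return (finished - started).total_seconds() / 60
```

Plotting this over time makes pipeline regressions visible long before they become a 90-minute problem.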
Reflection Questions
- Your CI/CD Pipeline:
- How long does your pipeline take?
- What's the slowest step?
- What percentage could run in parallel?
- Caching:
- What are you caching?
- What could you cache but aren't?
- What's your cache hit rate?
- Docker Images:
- How large are your images?
- Do you use multi-stage builds?
- Are you using alpine/slim variants?
- Tests:
- Do fast tests run before slow tests?
- Are tests running in parallel?
- Do flaky tests slow down your pipeline?
- Deployment Frequency:
- How many times do you deploy per day?
- What prevents more frequent deployments?
- How long from commit to production?
What's Next?
Sarah had optimized the CI/CD pipeline from 90 minutes to 12 minutes — a 7.5x improvement! The team could now:
- Deploy bug fixes in minutes, not hours
- Deploy 10+ times per day
- Get faster feedback on code changes
- Experiment more freely
Part I Complete!
Sarah had learned the fundamentals of DevOps:
- Chapter 1: Deployments and rollbacks
- Chapter 2: Centralized logging
- Chapter 3: Configuration management
- Chapter 4: Resource management
- Chapter 5: CI/CD optimization
With this foundation, Sarah was ready to dive deeper into Infrastructure as Code in Part II.
Code Examples
All code examples from this chapter are available in the examples/chapter-05/ directory of the GitHub repository.
To access the examples:
git clone https://github.com/BahaTanvir/devops-guide-book.git
cd devops-guide-book/examples/chapter-05
What's included:
- Optimized Dockerfiles (before/after)
- Complete GitHub Actions workflows
- GitLab CI examples
- Caching configurations
- Smoke test scripts
- Pipeline monitoring queries
Online access: View examples on GitHub
Remember: Fast pipelines enable fast iteration!
Chapter 6: The Terraform State Disaster
"Terraform state is your source of truth — protect it like your production database."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 7: The Drift Detective
"Infrastructure drift is technical debt compounding with interest."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 8: The Module Maze
"Don't Repeat Yourself — but know when abstraction helps and when it hurts."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 9: The Multi-Environment Challenge
"Dev, staging, and production should be similar enough to trust, different enough to be cost-effective."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 10: The Cloud Cost Catastrophe
"The cloud isn't cheaper — it's variable. That requires vigilance."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 11: The Kubernetes Crash Course
"Kubernetes is powerful, but power without understanding is dangerous."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 12: The Networking Puzzle
"In Kubernetes, everything is networked, and networking is everything."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 13: The Storage Surprise
"Stateful applications in Kubernetes require special care — and planning."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 14: The Configuration Complexity
"Configuration is code — treat it with the same rigor."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 15: The Health Check Headache
"A pod running doesn't mean it's ready to serve traffic."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 16: The Scaling Saga
"Manual scaling is reactive. Autoscaling is proactive. Choose wisely."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 17: The Monitoring Metamorphosis
"You can't improve what you don't measure, and you can't fix what you can't see."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 18: The Alert Fatigue
"Too many alerts is as bad as no alerts — both lead to ignored warnings."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 19: The Tracing Trail
"In microservices, a single request becomes a journey. Tracing maps that journey."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 20: The SLO Awakening
"Reliability is not binary — it's a negotiation between users and engineers."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 21: The Capacity Crisis
"Plan for growth, or growth will force your hand at the worst possible time."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 22: The Debugging Deep Dive
"Systematic debugging beats random changes every time."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 23: The Security Scare
"Security vulnerabilities don't wait for convenient times to be discovered."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 24: The Secrets Leak
"A secret in Git is no longer a secret — it's a liability."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 25: The Access Control Adventure
"Least privilege isn't paranoia — it's operational hygiene."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 26: The Compliance Challenge
"Compliance isn't a checkbox — it's a continuous practice."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 27: The Network Security Narrative
"Trust nothing, verify everything — even inside your network."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 28: The Pipeline Principles
"Your CI/CD pipeline is your deployment safety net — make it strong."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 29: The Testing Tower
"Fast tests give you confidence. Comprehensive tests give you sleep."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 30: The GitOps Journey
"Git as the single source of truth — simple in concept, powerful in practice."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 31: The Rollback Recovery
"The ability to rollback quickly is as important as the ability to deploy quickly."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 32: The Deployment Diversity
"Different applications need different deployment strategies — one size doesn't fit all."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 33: The Communication Conundrum
"Technical excellence without communication is invisible excellence."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 34: The On-Call Odyssey
"On-call is where theory meets reality at 3 AM."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 35: The Automation Advocate
"Automate the toil, but don't automate away understanding."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Chapter 36: The Career Compass
"Your career is a marathon, not a sprint — pace yourself and enjoy the journey."
Sarah's Challenge
Content coming soon...
Understanding the Problem
Content coming soon...
The Senior's Perspective
Content coming soon...
The Solution
Content coming soon...
Lessons Learned
Content coming soon...
Reflection Questions
Content coming soon...
Appendix A: Essential Tools Cheatsheet
Quick reference for the most commonly used DevOps commands and tools.
kubectl Commands
Content coming soon...
Docker Commands
Content coming soon...
Terraform Commands
Content coming soon...
Git Commands
Content coming soon...
AWS CLI Commands
Content coming soon...
Troubleshooting Commands
Content coming soon...
Appendix B: Configuration Examples
Production-ready configuration templates and examples.
Kubernetes Manifests
Content coming soon...
Terraform Modules
Content coming soon...
CI/CD Pipeline Templates
Content coming soon...
Monitoring Configurations
Content coming soon...
Helm Charts
Content coming soon...
Appendix C: Troubleshooting Flowcharts
Step-by-step guides for diagnosing common issues.
Pod Not Starting
Content coming soon...
Service Unreachable
Content coming soon...
High Latency Investigation
Content coming soon...
Out of Memory Issues
Content coming soon...
DNS Resolution Problems
Content coming soon...
Certificate Errors
Content coming soon...
Appendix D: Glossary
Definitions of DevOps terminology, acronyms, and concepts.
A
Content coming soon...
B
Content coming soon...
C
Content coming soon...
D
Content coming soon...
E
Content coming soon...
... (continuing through the alphabet)
Appendix E: Resources and Further Reading
Curated list of books, courses, communities, and resources for continued learning.
Books
Content coming soon...
Online Courses
Content coming soon...
Communities and Forums
Content coming soon...
Blogs and Newsletters
Content coming soon...
Conference Talks
Content coming soon...
Podcasts
Content coming soon...
Practice Platforms
Content coming soon...
Contributing to A Guide to DevOps Engineering
Thank you for your interest in contributing to this open-source book! This guide is a community effort to help junior DevOps engineers bridge the gap to senior-level expertise.
Our Mission
To create the most practical, scenario-based DevOps guide that helps junior engineers:
- Learn from real-world experiences
- Understand the "why" behind best practices
- Gain confidence in production environments
- Accelerate their professional growth
How You Can Contribute
1. Report Issues
Found a problem? Please open an issue for:
- Technical errors in code examples
- Broken links or missing resources
- Typos and grammar mistakes
- Outdated information (tool versions, deprecated practices)
- Unclear explanations that need improvement
2. Suggest Improvements
Have ideas? We'd love to hear about:
- Additional scenarios Sarah should encounter
- Missing topics that should be covered
- Better explanations for complex concepts
- Diagrams that would help visualize concepts
- Real-world examples from your experience
3. Submit Content
Ready to write? You can contribute:
- New chapters on relevant DevOps topics
- Case studies from your own experience
- Code examples and configurations
- Troubleshooting guides
- Diagrams and illustrations
4. Improve Existing Content
Help make existing chapters better:
- Enhance code examples
- Add more detailed explanations
- Create better diagrams
- Add tips and warnings from experience
- Update content for new tool versions
5. Translate
Help make this book accessible globally:
- Translate chapters to other languages
- Review existing translations
- Maintain localized versions
Contribution Guidelines
Writing Style
When contributing content, please follow these guidelines:
Voice and Tone
- Conversational but professional
- Empathetic to junior engineer struggles
- Practical over theoretical
- Encouraging without being condescending
Technical Content
- Accurate — test all code examples
- Production-ready — no toy examples
- Explained — don't just show, explain why
- Comprehensive — cover edge cases and gotchas
Scenario Structure
If writing a new chapter, follow this structure:
- Sarah's Challenge — The problem/scenario
- Understanding the Problem — Concepts and context
- The Senior's Perspective — Expert insights
- The Solution — Step-by-step implementation
- Lessons Learned — Key takeaways
- Reflection Questions — Help readers apply concepts
Code Standards
All code examples must:
- ✅ Work — be tested and functional
- ✅ Follow best practices — industry standards
- ✅ Include comments — explain non-obvious parts
- ✅ Be secure — no hardcoded secrets or vulnerabilities
- ✅ Be formatted — use consistent style
Example:
# Good: Well-commented, explains the why
apiVersion: v1
kind: Service
metadata:
name: frontend
labels:
app: frontend
spec:
# Using ClusterIP since this service is internal-only
# and accessed via Ingress controller
type: ClusterIP
ports:
- port: 80
targetPort: 8080
protocol: TCP
name: http
selector:
app: frontend
Markdown Standards
- Use proper heading hierarchy (# → ## → ###)
- Include code fences with language specification
- Use bold for emphasis, italic for terms
- Add alt text to all images
- Keep line length reasonable (~100 characters)
Diagram Guidelines
If adding diagrams:
- Use consistent styling and colors
- Include source files (draw.io, mermaid, etc.)
- Export as SVG when possible (scales better)
- Add descriptive captions
- Consider accessibility (color blind friendly)
Submission Process
For Small Changes (typos, small fixes)
- Fork the repository
- Create a branch: `git checkout -b fix/typo-chapter-15`
- Make your changes
- Commit: `git commit -m "Fix typo in chapter 15"`
- Push: `git push origin fix/typo-chapter-15`
- Open a Pull Request
For Larger Contributions (new content, major changes)
- Open an issue first to discuss your idea
- Get feedback from maintainers
- Fork and create a branch
- Write your content
- Test all code examples
- Submit a Pull Request with detailed description
Pull Request Checklist
Before submitting, ensure:
- Content follows the writing guidelines
- Code examples are tested and work
- No sensitive information (API keys, passwords, etc.)
- Markdown is properly formatted
- Links are working
- Diagrams have source files included
- You've added yourself to contributors list (if first contribution)
Review Process
What to Expect
- Initial review within 1 week
- Feedback from maintainers and community
- Iterations to refine the content
- Approval from at least 2 maintainers
- Merge and inclusion in next release
Review Criteria
Contributions are evaluated on:
- Accuracy β Is the technical content correct?
- Relevance β Does it fit the book's scope?
- Quality β Is it well-written and clear?
- Completeness β Are examples and explanations sufficient?
- Consistency β Does it match the book's style?
Content Guidelines by Type
Adding a New Chapter
Required elements:
- Fits within existing book structure
- Includes a realistic scenario for Sarah
- Has working code examples
- Follows chapter template structure
- Adds 15-25 pages of content
- Includes reflection questions
Adding Code Examples
Requirements:
- Tested in a real environment
- Includes necessary context/setup
- Has inline comments explaining key points
- Shows best practices
- Includes error handling where appropriate
Adding Diagrams
Guidelines:
- Use consistent color scheme (navy/blue theme)
- Include architecture context
- Label all components clearly
- Show data flow with arrows
- Include legend if needed
Updating Existing Content
When updating:
- Preserve the original scenario/narrative
- Improve clarity without changing meaning
- Update tool versions in comments
- Add deprecation warnings if needed
- Link to additional resources
Development Setup
Prerequisites
# Install mdBook
cargo install mdbook
# Or using package manager
brew install mdbook # macOS
Local Development
# Clone the repository
git clone https://github.com/BahaTanvir/devops-guide-book.git
cd devops-guide-book
# Serve the book locally
mdbook serve
# Open in browser: http://localhost:3000
# Build the book
mdbook build
# Test all code examples
./scripts/test-examples.sh
Testing Your Changes
Before submitting:
# Run mdBook's built-in tests
mdbook test
# Verify all links
./scripts/check-links.sh
# Test code examples
./scripts/test-code.sh
Licensing
By contributing, you agree that:
- Your contributions will be licensed under the same license as the project
- You have the right to submit the contribution
- You're not including proprietary or confidential information
Recognition
All contributors are:
- Added to the contributors list
- Credited in commit history
- Acknowledged in release notes
- Appreciated by the community!
Getting Help
Need help with your contribution?
- GitHub Issues — Ask questions
- Discussions — Chat with the community
- Email — Reach out to maintainers (coming soon)
- Discord — Join our community (coming soon)
Priority Areas
We especially need help with:
- Real-world scenarios β Share your experiences
- Diagrams β Visual learners need more graphics
- Code examples β More working examples
- Troubleshooting sections β Common issues and solutions
- Translations β Make it accessible globally
Good First Issues
New to contributing? Look for issues labeled:
- good-first-issue – Great for beginners
- help-wanted – We need assistance
- documentation – Improve docs
- typo – Quick fixes
Resources for Contributors
Questions?
Don't hesitate to ask! Open an issue with the question label.
Thank you for helping junior DevOps engineers learn and grow!
Every contribution, no matter how small, makes a difference in someone's career journey.
License
Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
This book is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
You are free to:
- Share – copy and redistribute the material in any medium or format
- Adapt – remix, transform, and build upon the material for any purpose, even commercially
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- ShareAlike – If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
- No additional restrictions – You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
Full Legal Code
The complete license text is available at: https://creativecommons.org/licenses/by-sa/4.0/legalcode
Code Examples and Configurations
All code examples, configuration files, and scripts in this book are released under the MIT License to allow maximum flexibility for practical use:
MIT License
Copyright (c) 2024 DevOps Community Contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Why This License?
For the Book (CC BY-SA 4.0):
We chose Creative Commons Attribution-ShareAlike because:
- ✅ Keeps it open – Anyone can read for free
- ✅ Allows derivatives – You can adapt it for your context
- ✅ Ensures attribution – Original authors get credit
- ✅ Maintains openness – Derivatives must also be open
- ✅ Permits commercial use – The book can be printed or sold
For Code (MIT License):
We chose MIT for code because:
- ✅ Maximum flexibility – Use in any project
- ✅ No copyleft requirement – Can be used in proprietary software
- ✅ Simple and clear – Easy to understand and comply with
- ✅ Industry standard – Widely accepted and trusted
- ✅ Commercial friendly – No barriers to business use
Using This Book
If You Want to:
Read online for free ✅
- Just visit the website and read!
Print for personal use ✅
- Feel free! PDF versions available for download
Share with your team ✅
- Send links, share PDFs, recommend to colleagues
Translate to another language ✅
- Please do! Just maintain attribution and the same license
Create a training course based on this ✅
- Absolutely! Just attribute the source and share alike
Remix/adapt chapters for your blog ✅
- Go ahead! Attribute and use the same license for your adaptations
Use code examples in your production systems ✅
- That's exactly what they're for! The MIT license applies
Sell printed copies ✅
- Yes, but derivatives must also be CC BY-SA 4.0
Create a proprietary derivative work ❌
- No, derivatives must be shared under the same license
Attribution Guidelines
When using or adapting this work, please provide attribution like:
For the book:
"A Guide to DevOps Engineering: Bridging the Gap" by DevOps Community Contributors,
licensed under CC BY-SA 4.0. Available at https://github.com/BahaTanvir/devops-guide-book
For code examples:
# Adapted from "A Guide to DevOps Engineering" (MIT License)
# https://github.com/BahaTanvir/devops-guide-book
Contributor Rights
By contributing to this book, you agree to:
- License your contributions under the same terms (CC BY-SA 4.0 for content, MIT for code)
- Confirm you have the right to submit the contribution
- Allow your contribution to be used as part of the collective work
You retain copyright on your contributions, but grant others the rights specified in the licenses above.
Questions About Licensing?
If you have questions about how you can use this book:
- Check the CC BY-SA 4.0 FAQ
- Open an issue on GitHub
- Contact the maintainers
Acknowledgments
This book is made possible by:
- Contributors who share their knowledge
- Readers who provide feedback
- The open-source community that builds the tools we describe
- Organizations that support learning and knowledge sharing
Thank you for being part of this community!
The choice of open licensing reflects our belief that knowledge should be accessible to all, and that learning resources should be freely available to those who need them most.