Kubernetes in Production: 10 Lessons Learned

Running Kubernetes in development is straightforward. Running it in production - with real traffic, real SLAs, and real on-call rotations - is a different game entirely. After managing production EKS and AKS clusters serving millions of requests, here are the ten lessons that cost me the most sleep before I learned them.

1. Always Set Resource Requests and Limits

This is the single most impactful thing you can do for cluster stability. Without resource requests, the scheduler cannot account for what a pod actually needs, so it places pods blindly and overcommits nodes. Without limits, a single runaway process can starve every other pod on the node. I set requests based on the p95 usage from monitoring data and limits at 2x the request as a starting point, then tune from there.
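As a rough sketch of that starting point, the snippet below derives requests from the p95 of usage samples and sets limits at 2x. The sample data, the function names, and the nearest-rank percentile method are assumptions for illustration; real numbers would come from your monitoring system.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math, 1-based
    return ordered[rank - 1]


def starting_resources(cpu_millicores, memory_mib, limit_factor=2):
    """Build a container `resources` stanza: request = p95, limit = factor * request."""
    cpu_req = p95(cpu_millicores)
    mem_req = p95(memory_mib)
    return {
        "requests": {"cpu": f"{cpu_req}m", "memory": f"{mem_req}Mi"},
        "limits": {
            "cpu": f"{cpu_req * limit_factor}m",
            "memory": f"{mem_req * limit_factor}Mi",
        },
    }


# Hypothetical samples: 19 steady readings plus one spike each.
cpu_samples = [100 + 5 * i for i in range(19)] + [900]
mem_samples = [200 + 2 * i for i in range(19)] + [512]

resources = starting_resources(cpu_samples, mem_samples)
print(resources)
```

Using p95 rather than the maximum keeps a one-off spike from inflating the request, while the 2x limit still leaves headroom for it.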

2. Use Pod Disruption Budgets

PodDisruptionBudgets (PDBs) cap how many pods Kubernetes may evict at once during voluntary disruptions such as node drains, upgrades, and maintenance. Without them, a rolling node update can evict every replica of a service at the same time. They do not protect against involuntary failures like a node crash. I set minAvailable: 50% as a baseline for all production deployments.
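A minimal sketch of such a budget, built as a Python dict that can be dumped to JSON for kubectl. The names "web-pdb" and "web" are hypothetical, and the helper is a simplified model that assumes all replicas are healthy. It relies on the documented behavior that a percentage minAvailable is rounded up to a whole pod.

```python
import json


def pdb_manifest(name, app_label, min_available="50%"):
    """PodDisruptionBudget selecting pods labeled app=<app_label>."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def allowed_disruptions(replicas, min_available_pct=50):
    """Evictions permitted at once, assuming every replica is healthy."""
    required = (replicas * min_available_pct + 99) // 100  # round up
    return replicas - required


print(json.dumps(pdb_manifest("web-pdb", "web"), indent=2))
for n in (1, 2, 4, 7):
    print(n, "replicas ->", allowed_disruptions(n), "evictions allowed")
```

Note the single-replica case: 50% of one pod rounds up to one, so no eviction is allowed and a node drain will block until the budget or the replica count changes.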

3. Implement Proper Health Checks

Liveness probes tell Kubernetes when to restart a container. Readiness probes tell it when a pod is ready to receive traffic. Getting these wrong is one of the most common causes of production incidents. A misconfigured liveness probe that's too aggressive will cause restart loops. A missing readiness probe will send traffic to pods that aren't ready, causing 5xx errors.

My pattern: readiness probes check the actual application health endpoint, liveness probes check a simpler "is the process alive" endpoint, and startup probes give slow-starting applications time to initialize before liveness kicks in.
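The three-probe pattern might look like the container spec below, written as a Python dict. The endpoint paths, port, image, and timing values are all assumptions chosen for illustration, not recommendations. The helper estimates how long a probe tolerates failure before Kubernetes acts, ignoring initial delays and timeouts.

```python
def http_probe(path, port, period_seconds, failure_threshold):
    """HTTP GET probe definition using standard Kubernetes probe fields."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def failure_budget_seconds(probe):
    """Approximate seconds of consecutive failures tolerated by a probe."""
    return probe["periodSeconds"] * probe["failureThreshold"]


# Hypothetical container: /healthz/live is cheap, /healthz/ready checks real health.
container = {
    "name": "web",
    "image": "example.com/web:1.0",
    "startupProbe": http_probe("/healthz/live", 8080, 5, 24),
    "livenessProbe": http_probe("/healthz/live", 8080, 10, 3),
    "readinessProbe": http_probe("/healthz/ready", 8080, 5, 2),
}

print("startup budget:", failure_budget_seconds(container["startupProbe"]), "s")
print("liveness budget:", failure_budget_seconds(container["livenessProbe"]), "s")
```

With these values the application gets about two minutes to start, and liveness checks only begin once the startup probe has succeeded.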

4. Don't Skip Network Policies

By default, every pod in a Kubernetes cluster can communicate with every other pod. In production, this is a security risk. I implement deny-all-by-default network policies and explicitly allow only the traffic flows that should exist. This follows the principle of least privilege at the network layer. One caveat: NetworkPolicy objects only take effect when the cluster's network plugin enforces them, so verify that before relying on them.
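A deny-all baseline plus one explicit allow rule can be sketched as follows. The namespace "payments", the app labels, and the port are hypothetical. The manifests are plain dicts that serialize to JSON, which kubectl accepts alongside YAML.

```python
import json


def default_deny(namespace):
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace, name, to_app, from_app, port):
    """Allow TCP traffic to app=<to_app> pods only from app=<from_app> pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policies = [
    default_deny("payments"),
    allow_ingress("payments", "allow-frontend-to-api", "api", "frontend", 8080),
]
print(json.dumps(policies, indent=2))
```

Keep in mind that denying all egress also blocks DNS lookups, so a real rollout usually needs an additional egress rule for the cluster DNS service.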

5. Namespace Isolation Is Not Optional

I use namespaces to isolate teams, environments, and workloads. Combined with ResourceQuotas and LimitRanges, namespaces prevent any single team from monopolizing cluster resources. Each namespace gets a resource quota proportional to its SLA requirements.
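To make the quota setup concrete, here is a sketch that builds a ResourceQuota and a LimitRange for one namespace. The namespace name, the sizes, and the choice to cap limits at twice the requested totals are assumptions for illustration.

```python
import json


def resource_quota(namespace, cpu_cores, memory_gib):
    """Cap total requests in a namespace; total limits are capped at 2x (an assumption)."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "team-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu_cores),
                "requests.memory": f"{memory_gib}Gi",
                "limits.cpu": str(cpu_cores * 2),
                "limits.memory": f"{memory_gib * 2}Gi",
            }
        },
    }


def limit_range(namespace):
    """Defaults applied to containers that omit their own requests or limits."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                    "default": {"cpu": "200m", "memory": "256Mi"},
                }
            ]
        },
    }


quota = resource_quota("team-a", cpu_cores=10, memory_gib=20)
defaults = limit_range("team-a")
print(json.dumps([quota, defaults], indent=2))
```

The two objects work together: once a quota covers requests and limits, pods without them are rejected, and the LimitRange fills in defaults so such pods are still admitted.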

6. Invest in Observability Early

You cannot operate what you cannot observe. Before deploying production workloads, I set up the observability trifecta: metrics (Prometheus + Grafana), logs (Fluent Bit to CloudWatch or Loki), and traces (OpenTelemetry to Jaeger or X-Ray). The cost of setting this up before production is a fraction of the cost of debugging blind during an incident.

7. Automate Everything with GitOps

Manual kubectl apply commands in production are a recipe for drift and outages. I use Argo CD with a GitOps workflow where the Git repository is the single source of truth. Every change goes through a reviewed PR, and Argo CD then syncs the desired state to the cluster automatically. This gives you an audit trail and rollback capability, and it eliminates "it worked on my machine" deployments.
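For orientation, an Argo CD Application that tracks a Git path and syncs automatically could be sketched like this. The repository URL, path, and names are placeholders, and the sync options shown (prune and selfHeal) are one possible policy rather than a required one.

```python
import json


def argocd_application(name, repo_url, path, dest_namespace, revision="main"):
    """Argo CD Application that keeps a namespace in sync with a Git path."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                "path": path,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",  # the local cluster
                "namespace": dest_namespace,
            },
            "syncPolicy": {
                # prune: delete resources removed from Git
                # selfHeal: revert manual changes made in the cluster
                "automated": {"prune": True, "selfHeal": True}
            },
        },
    }


app = argocd_application(
    name="web",
    repo_url="https://example.com/org/deploy-repo.git",  # placeholder URL
    path="apps/web/production",
    dest_namespace="web",
)
print(json.dumps(app, indent=2))
```

With selfHeal enabled, a manual kubectl edit in the cluster is reverted to whatever Git declares, which is what makes the repository the source of truth in practice.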

8. Plan for Node Failures

Nodes will fail. EBS volumes will detach. AWS availability zones will have issues. Design for it. I run workloads across multiple AZs with topology spread constraints, use pod anti-affinity rules to spread replicas, and configure the Cluster Autoscaler with multiple instance types and purchase options to handle capacity changes gracefully.
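A zone-level topology spread constraint, and a simplified view of the skew it limits, can be sketched as below. The app label and zone names are hypothetical, and the skew helper is a toy model that counts matching pods per zone rather than reproducing the scheduler's full logic.

```python
from collections import Counter


def zone_spread_constraint(app_label, max_skew=1):
    """Entry for a pod spec's topologySpreadConstraints list, spreading across zones."""
    return {
        "maxSkew": max_skew,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


def skew(pod_zones, all_zones):
    """Difference between the fullest and emptiest zone for a given placement."""
    counts = Counter(pod_zones)
    per_zone = [counts[zone] for zone in all_zones]
    return max(per_zone) - min(per_zone)


zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
balanced = ["us-east-1a", "us-east-1a", "us-east-1b", "us-east-1c"]
lopsided = ["us-east-1a", "us-east-1a", "us-east-1a"]

print(zone_spread_constraint("web"))
print("balanced skew:", skew(balanced, zones))
print("lopsided skew:", skew(lopsided, zones))
```

With maxSkew set to 1, the lopsided placement would not be allowed to form: the scheduler would refuse to put a second pod in one zone while another zone still had none.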

9. Secrets Management Matters

Kubernetes Secrets are base64-encoded, and base64 is an encoding, not encryption. By default they are also stored unencrypted in etcd unless encryption at rest is configured. For production, I use AWS Secrets Manager or HashiCorp Vault with the External Secrets Operator to inject secrets at runtime. This keeps sensitive data out of Git, provides rotation capabilities, and gives you a centralized audit trail of secret access.
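To see why base64 offers no protection, this short sketch encodes a made-up password the way a Secret manifest stores it and then recovers it with no key at all. The password value is purely illustrative.

```python
import base64

plaintext = "s3cr3t-password"  # made-up value for illustration

# What ends up in the `data` field of a Secret manifest.
stored = base64.b64encode(plaintext.encode("utf-8")).decode("ascii")

# Anyone who can read the manifest can reverse it; no key is involved.
recovered = base64.b64decode(stored).decode("utf-8")

print("stored:   ", stored)
print("recovered:", recovered)
```

This is why a Secret manifest committed to Git should be treated as a leaked credential, regardless of how unreadable the encoded value looks.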

10. Practice Failure Regularly

The worst time to discover that your failover doesn't work is during an actual outage. I run regular chaos engineering experiments - killing pods, draining nodes, simulating AZ failures - during business hours with the team watching. Tools like Litmus Chaos and AWS Fault Injection Simulator make this repeatable and safe. Every experiment that reveals a weakness is a production incident prevented.

Final Thought

Kubernetes is powerful, but it's not magic. It amplifies both good practices and bad ones. Invest in the fundamentals - resource management, observability, security, and automation - before chasing the latest service mesh or operator pattern. The boring stuff is what keeps your cluster running at 3 AM.