Docker solved the “it works on my machine” problem. You package an application and all its dependencies into a container image, and that image runs identically on any machine with a container runtime. This is genuinely useful, and for a single service on a single server, Docker is enough.

But production systems are not a single service on a single server. They are dozens of services, hundreds of containers, spread across many machines, restarting when they crash, scaling up when traffic spikes, scaling down when it subsides, updating continuously without downtime. Docker tells you how to package and run a container. It does not tell you which machine to run it on, what to do when it crashes, how to route traffic to healthy instances, or how to roll out a new version to 50 containers without taking them all offline at once.

These are orchestration problems. Kubernetes solves them.


Where Kubernetes Came From

Google had been running containerized workloads at massive scale since 2003, on an internal system called Borg. Borg scheduled containers across Google’s fleet of machines, handled failures, managed resource allocation, and did all of this for services serving billions of users. By the time containers became popular outside Google, the engineering team had a decade of hard-won lessons about what container orchestration needs to do.

In 2014, Google open-sourced a reimplementation of Borg’s core ideas under the name Kubernetes (from the Greek for “helmsman” or “governor” - the person who steers the ship). Google handed the project to the newly formed Cloud Native Computing Foundation when version 1.0 shipped in 2015, and it became the foundation’s first hosted project in 2016. Today it is the de facto standard for running containerized workloads in production, supported by every major cloud provider (GKE on GCP, EKS on AWS, AKS on Azure).

The design reflects Google’s decade of operational experience. Some of its concepts feel over-engineered for small deployments because they were shaped by problems that only appear at Google scale. Kubernetes makes the solutions to those problems available to everyone.


The Mental Model: Desired State

The central idea in Kubernetes is desired state. You tell Kubernetes what you want - “I want three copies of this container running, each with 2 vCPUs and 4 GB of RAM, accessible on port 8080” - and Kubernetes makes it happen and keeps it that way. If one copy crashes, Kubernetes starts a new one. If a machine fails, Kubernetes reschedules the containers that were running on it onto healthy machines. You do not issue commands (“start this container”). You declare intent (“the desired state is three running copies”) and Kubernetes continuously reconciles reality with that intent.

This reconciliation loop is the heart of Kubernetes. Most of the system’s behavior comes from controllers: each one watches the current state of some resource, compares it to the desired state, and takes action to close the gap.
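
As a rough illustration of declaring intent, the “three copies, 2 vCPUs, 4 GB, port 8080” example above could be written as a Deployment manifest along these lines. This is a minimal sketch: the name web, the app label, and the image path are placeholders invented for the example.

```yaml
# Sketch only: "web" and the image path are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                  # desired state: three running copies
  selector:
    matchLabels:
      app: web                 # which pods this Deployment owns
  template:
    metadata:
      labels:
        app: web               # must match the selector above
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"         # 2 vCPUs per copy
              memory: "4Gi"    # roughly 4 GB per copy
```

Submitting this with kubectl apply -f deployment.yaml records the desired state. If one of the three pods is then deleted by hand, the controllers notice the gap and create a replacement without any further command.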


The Physical Architecture

A Kubernetes cluster has two kinds of machines: the control plane and worker nodes.

The control plane (historically called the master) runs the components that make decisions for the cluster:

  • API server: the single entry point for all commands. Every interaction with Kubernetes - from you, from other components, from automation - goes through the API server. It is stateless and horizontally scalable.
  • etcd: a distributed key-value store that holds all cluster state. The desired configuration, the current list of running pods, service definitions - everything is in etcd. It is the source of truth.
  • Scheduler: watches for newly created pods that have not been assigned to a node, and assigns them to one. The assignment considers resource requests, node capacity, affinity rules, and taints.
  • Controller manager: runs the reconciliation loops. The ReplicaSet controller ensures the right number of pod copies exist. The Node controller detects when nodes fail. The Endpoints controller keeps service routing tables up to date.
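
The scheduler’s inputs are all fields on the pod itself. The sketch below shows a pod that carries a resource request, a node selector (the simplest form of node selection, with affinity rules being the more expressive variant), and a toleration for a taint. The label disktype=ssd, the taint dedicated=batch, and the image are assumptions made up for illustration.

```yaml
# Sketch only: labels, taint key/value, and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  nodeSelector:
    disktype: ssd              # only consider nodes carrying this label
  tolerations:
    - key: "dedicated"         # allows placement on nodes tainted dedicated=batch:NoSchedule
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"
  containers:
    - name: worker
      image: registry.example.com/worker:1.0.0   # placeholder image
      resources:
        requests:
          cpu: "500m"          # node must have this much unreserved CPU
          memory: "512Mi"      # and this much unreserved memory
```

A pod like this stays in the Pending state until some node satisfies every constraint at once.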

Worker nodes are the machines that actually run your application containers. Each node runs:

  • kubelet: an agent that receives pod assignments from the control plane and ensures those pods are running and healthy.
  • kube-proxy: maintains network rules on the node so pods can communicate with services.
  • A container runtime (containerd or CRI-O) that actually starts and stops containers.


The Core Objects

Pod: the smallest deployable unit in Kubernetes. A pod wraps one or more containers that share a network namespace and storage. Containers in the same pod communicate via localhost. In practice, most pods contain a single container; multi-container pods are used for sidecars (logging agents, service mesh proxies) that need to be co-located with the main container.

Pods are ephemeral. When a pod dies, it is gone, and any data written to its container filesystem is lost (unless it was written to a mounted persistent volume). Never depend on a specific pod being alive.
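
To make the sidecar pattern concrete, here is a minimal sketch of a two-container pod in which a logging agent reads files written by the main container. Both image paths and the /var/log/app path are placeholders. The emptyDir volume also illustrates ephemerality: it is deleted together with the pod.

```yaml
# Sketch only: image names and mount path are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-log-agent
spec:
  volumes:
    - name: logs
      emptyDir: {}             # scratch space that lives only as long as the pod
  containers:
    - name: web                # main application container
      image: registry.example.com/web:1.0.0
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-agent          # sidecar: ships the files the main container writes
      image: registry.example.com/log-agent:1.0.0
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true
```

Because both containers share the pod’s network namespace, the agent could also reach the application at localhost:8080.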

Deployment: manages a set of identical pod replicas. You specify the container image, the number of replicas, and resource requirements. The Deployment controller ensures exactly that many pods are running. When you update the image (a new release), the Deployment performs a rolling update by default: it starts new pods with the new image before terminating old ones, so there is always some healthy capacity serving traffic.
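
The pace of a rolling update is tunable. The following sketch, with placeholder names and image, asks Kubernetes to add at most two extra pods at a time and never to drop below the full replica count during the rollout.

```yaml
# Sketch only: "web" and the image path are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  strategy:
    type: RollingUpdate        # the default strategy, spelled out here
    rollingUpdate:
      maxSurge: 2              # at most 2 pods above the desired count
      maxUnavailable: 0        # never fewer than 10 available pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # change this tag to trigger a rollout
```

Changing the image tag and re-applying the manifest starts the rollout, and kubectl rollout status deployment/web reports its progress. Setting maxUnavailable to 0 trades rollout speed for capacity, since every old pod must wait for a new one to become ready first.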

Service: a stable network endpoint in front of a set of pods. Pods come and go; their IP addresses change. A Service gets a stable DNS name and IP, and load-balances traffic across all ready pods that match its label selector. When you deploy a new version and pods are replaced, the Service’s list of backends updates automatically: old pods drop out as they terminate and new pods join as they become ready. External traffic commonly enters through a Service of type LoadBalancer, which provisions a cloud load balancer in front of the cluster.
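
A minimal sketch of such a Service follows. It assumes the pods carry a label app: web and listen on port 8080; both are placeholders chosen for the example.

```yaml
# Sketch only: the name and the app label are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer           # asks the cloud provider for an external load balancer
  selector:
    app: web                   # traffic goes to ready pods with this label
  ports:
    - port: 80                 # port exposed by the Service
      targetPort: 8080         # port the containers listen on
```

Inside the cluster, other pods in the same namespace can reach these backends through the DNS name web, whichever individual pods happen to be running at the time.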

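ConfigMap and Secret: configuration kept separate from the container image, so the same image can run unchanged in every environment. A ConfigMap holds non-sensitive configuration (database hostnames, feature flags). A Secret holds sensitive data (database passwords, API keys). Secret values are base64-encoded, which is not encryption, so restrict who can read them with access controls and enable encryption at rest where your cluster supports it. Pods receive both as environment variables or mounted files.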
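
One way this can look in practice is sketched below: a ConfigMap, a Secret, and a pod that consumes both as environment variables. All names, keys, and values are invented for illustration, and a real password would not be committed to a file like this.

```yaml
# Sketch only: every name, key, and value here is hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config
data:
  DB_HOST: "db.internal.example.com"
  FEATURE_NEW_CHECKOUT: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: web-secrets
type: Opaque
stringData:                    # plain text on input; stored base64-encoded
  DB_PASSWORD: "change-me"
---
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0.0   # placeholder image
      envFrom:
        - configMapRef:
            name: web-config   # every key becomes an environment variable
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: web-secrets
              key: DB_PASSWORD # only this one key is exposed to the container
```

Pointing the same image at a different database is then a matter of changing the ConfigMap, not rebuilding the image.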

Namespace: a logical partition within a cluster. Different teams or environments (staging, production) can run in different namespaces with their own resource quotas and access controls. This is Kubernetes’s answer to multi-tenancy within a single cluster.
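
As an illustration, a staging namespace with a resource quota might be declared as follows. The namespace name and all of the quota numbers are arbitrary assumptions, not recommendations.

```yaml
# Sketch only: the namespace name and quota values are hypothetical.
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"         # sum of CPU requests across the namespace
    requests.memory: 40Gi      # sum of memory requests
    limits.cpu: "40"           # sum of CPU limits
    limits.memory: 80Gi        # sum of memory limits
    pods: "100"                # maximum number of pods
```

Once the quota exists, a pod that would push the namespace past any of these totals is rejected when it is created.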


Resources: Requests and Limits

Every container in Kubernetes should declare what resources it needs:

Requests are what the container is guaranteed. The scheduler only places a pod on a node that has enough unreserved capacity to satisfy the requests. If a pod requests 1 vCPU, the scheduler finds a node with at least 1 vCPU of unallocated capacity.

Limits are the maximum the container is allowed to use. A container that tries to exceed its CPU limit is throttled. A container that exceeds its memory limit is killed (an OOM kill) and then restarted according to the pod’s restart policy.

Setting requests too low means the scheduler can overcommit nodes, leading to noisy neighbor effects (see the multi-tenancy post). Setting them too high means pods cannot be scheduled because no node appears to have capacity - a phantom resource shortage.

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"      # 500 millicores = 0.5 vCPU
  limits:
    memory: "1Gi"
    cpu: "1000m"

Getting these numbers right requires profiling your actual workload. The Vertical Pod Autoscaler can recommend values based on observed usage.


Autoscaling

Kubernetes provides scaling at two levels.

Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on observed metrics - CPU utilization, memory, or custom metrics from your monitoring system. For example, with a target of 70% average CPU utilization (measured against each pod’s CPU request), HPA adds replicas when the average rises above the target and removes them when it falls below. This handles traffic spikes without manual intervention.
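
A 70% CPU target of this kind could be expressed roughly as follows. The sketch assumes a Deployment named web exists and that a metrics source such as metrics-server is installed in the cluster. The replica bounds are arbitrary.

```yaml
# Sketch only: the target name and the replica bounds are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # the workload whose replica count is adjusted
  minReplicas: 3               # never scale below this
  maxReplicas: 10              # never scale above this
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of each pod's CPU request
```

The controller computes the new replica count as ceil(currentReplicas × currentUtilization / targetUtilization). With 4 replicas averaging 90% against a 70% target, that is ceil(4 × 90 / 70) = ceil(5.14) = 6 replicas.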

Cluster Autoscaler scales the number of nodes in the cluster. If pods cannot be scheduled because all nodes are full, Cluster Autoscaler provisions a new node from the cloud provider. If nodes have been underutilized for a period, it drains pods from them and terminates the nodes to reduce cost.

The two autoscalers work together: HPA adds pods when load increases, and Cluster Autoscaler adds nodes when there is nowhere to put those pods.


Health Checks

Kubernetes monitors container health through probes.

Liveness probe: is this container alive? If the liveness probe fails, Kubernetes kills the container and restarts it. Use this for detecting deadlocks or corrupted state that the container cannot recover from on its own.

Readiness probe: is this container ready to receive traffic? If the readiness probe fails, Kubernetes removes the pod from the Service’s load-balancing pool but does not kill it. Use this during startup (the application is still initializing) or when a pod is temporarily overloaded.

Without readiness probes, Kubernetes sends traffic to pods the moment they start - before your application has finished loading its configuration, warming its caches, or establishing database connections. This causes errors during deployments that readiness probes prevent.
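
Both probes are declared per container. The sketch below assumes the application exposes two HTTP endpoints, /healthz and /ready, on port 8080. Those paths, the image, and the timing values are illustrative assumptions and depend on the application.

```yaml
# Sketch only: endpoint paths, image, and timings are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      livenessProbe:           # failure leads to a container restart
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10    # give the process time to boot
        periodSeconds: 10
        failureThreshold: 3        # restart after 3 consecutive failures
      readinessProbe:          # failure removes the pod from Service backends
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 2
```

Keeping the two endpoints separate lets the application report “alive but not ready” while it warms caches or reconnects to a database, without being restarted for it.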


GKE: Kubernetes as a Cloud Service

Running Kubernetes yourself is complex. The control plane is stateful, must be highly available, and requires careful upgrades. Google Kubernetes Engine (GKE) manages the control plane for you. GCP handles etcd, upgrades the API server, and provides SLAs for control plane availability. You manage worker nodes and workloads.

GKE adds cloud-native integrations: Autopilot mode removes node management entirely (you pay per pod, not per node), Cloud Load Balancing integrates with Kubernetes Services, Cloud Armor provides DDoS protection, and Workload Identity replaces static service account keys with short-lived credentials.


Summary

Concept             What it does
------------------  -----------------------------------------------------------
Pod                 Smallest deployable unit; one or more co-located containers
Deployment          Manages a set of pod replicas; handles rolling updates
Service             Stable network endpoint; load-balances across healthy pods
Namespace           Logical cluster partition for multi-tenancy
Resource requests   Guaranteed resources; used for scheduling decisions
Resource limits     Hard cap; triggers throttling (CPU) or OOM kill (memory)
HPA                 Scales replica count based on metrics
Cluster Autoscaler  Scales node count based on pod scheduling pressure
Liveness probe      Restarts unhealthy containers
Readiness probe     Removes unready pods from traffic rotation

Kubernetes is complex because the problems it solves are complex. Running hundreds of containers across dozens of machines, continuously, without downtime, with automatic failure recovery, is not a trivial problem. Kubernetes makes it a declarative one: you describe what you want, and it figures out how to get there and how to keep it that way.

