
13 May 2026 · 15 min read · Senthil Kumar

# Kubernetes at Scale: Orchestrating Thousands of Containers

Kubernetes is complex: networking, storage, scheduling, secrets management, RBAC, ingress, operators. The learning curve is steep.

But once you understand it, Kubernetes enables something remarkable: define "I want 100 instances of this service running 24/7 with zero downtime during updates" and Kubernetes makes it happen automatically.

## Core Kubernetes Concepts

**Cluster:** Multiple machines (nodes) running Kubernetes

**Pod:** Smallest deployable unit; one or more containers

Usually 1 container per pod

Containers in same pod share network, storage

Ephemeral; destroyed and recreated frequently

**Deployment:** Declarative specification of desired state

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 100              # Run 100 instances
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:1.0.5
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```

Kubernetes watches this spec. If 5 pods crash, Kubernetes restarts them. If you change `replicas: 100` → `replicas: 150`, Kubernetes adds 50 pods. You declare desired state; Kubernetes achieves it.

**Service:** Exposes pods to network; load balances traffic

Pod IP changes constantly (pods recreated)

Service provides stable IP and DNS name

Routes traffic to healthy pods
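As a minimal sketch, a Service for the Deployment above might look like this (the port numbers are illustrative assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp          # matches the Deployment's pod labels
  ports:
    - port: 80          # stable port clients connect to
      targetPort: 8000  # container port (assumed)
```

Cluster DNS then gives pods a stable name such as `myapp.default.svc.cluster.local`, regardless of how often the underlying pods are replaced.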

**ConfigMap & Secret:** Configuration and secrets management

ConfigMap: Non-sensitive config (environment variables)

Secret: Sensitive data (database passwords, API keys)
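The two look almost identical in YAML; a minimal sketch (names and values are illustrative) — the key difference is that Secret values are stored base64-encoded and can be encrypted at rest:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  LOG_LEVEL: "info"
---
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secrets
type: Opaque
stringData:                 # stringData accepts plain text; Kubernetes encodes it
  DB_PASSWORD: "change-me"
```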

**Namespace:** Logical isolation (dev, staging, production)

**Ingress:** Routes external traffic to services; handles TLS
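A sketch of an Ingress routing a hostname to the Service above, with TLS terminated at the edge (the hostname and certificate Secret are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  tls:
    - hosts: [myapp.example.com]
      secretName: myapp-tls        # TLS certificate stored in a Secret (assumed to exist)
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```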

## Kubernetes Deployment Strategies

### Rolling Update (Default)

Gradually replace old pods with new ones.

```
Initial: 5 pods of version 1.0
  ↓ Start a pod of version 1.1
  ↓ Kill a pod of version 1.0
  ↓ Repeat until all pods are version 1.1

Result: Zero downtime; users always have service
```

**Configuration:**

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Max extra pods above the desired replica count
      maxUnavailable: 0    # Max pods that may be unavailable during the update
```

### Blue-Green Deployment

Run two full environments; switch traffic between them.

```
Blue  (production, version 1.0): 100 pods
Green (staging,    version 1.1): 100 pods
  ↓ Test green (automated tests)
  ↓ If tests pass, switch traffic: ingress routes to green
  ↓ Old blue (1.0) runs unused; available for quick rollback
```

**Advantages:** Instant rollback; full environment testing

**Disadvantages:** Double resource cost temporarily
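One simple way to implement the switch is a Service whose selector is repointed from the blue pods to the green ones (the `version` label is a hypothetical convention; both Deployments must carry it):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue   # change to "green" to cut traffic over; change back to roll back
  ports:
    - port: 80
      targetPort: 8000
```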

### Canary Deployment

Route small % of traffic to new version; monitor; ramp up.

```
Version 1.0: 95% traffic (95 pods)
Version 1.1:  5% traffic  (5 pods)
  ↓ Monitor: error rate, latency, exceptions
  ↓ If all good:
Version 1.0: 50% traffic (50 pods)
Version 1.1: 50% traffic (50 pods)
  ↓ If still good:
Version 1.1: 100% traffic (100 pods)
  ↓ If an error is detected at any stage, revert to 100% version 1.0
```

**Tools:** Istio and Flagger automate canary deployments.
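With Istio, the weighted split is expressed as a VirtualService; a sketch assuming a DestinationRule already defines `v1` and `v2` subsets for the `myapp` service:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp
            subset: v1   # stable version
          weight: 95
        - destination:
            host: myapp
            subset: v2   # canary version
          weight: 5
```

Ramping up the canary is then just a matter of shifting the weights (95/5 → 50/50 → 0/100), which tools like Flagger do automatically based on metrics.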

## Kubernetes Anti-Patterns & Pitfalls

### 1. No Resource Limits

Pod uses all CPU/memory; starves other pods. Cluster becomes unstable.

**Fix:** Always set requests and limits:

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
```

### 2. No Health Checks

Pod crashes; Kubernetes doesn't notice; users see errors.

**Fix:** Define liveness and readiness probes:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
```

### 3. Single Pod Per Node

Node fails; pod lost; no replica.

**Fix:** Always run multiple replicas, and use pod anti-affinity to spread them across nodes. A PodDisruptionBudget limits how many pods can be taken down at once during voluntary disruptions (node drains, upgrades).
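A minimal PodDisruptionBudget sketch (the name and threshold are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2        # voluntary disruptions may never drop myapp below 2 pods
  selector:
    matchLabels:
      app: myapp
```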

### 4. Forgetting StatefulSets

Databases and stateful services need persistence.

**Pets vs. cattle:** (StatefulSet was originally called PetSet, after this metaphor)

Cattle: Pods are disposable (typical microservices)

Pets: Pods have identity, persistent storage (databases)

Use **StatefulSet** for databases, caches, message queues.
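A sketch of a StatefulSet for a database (image, storage size, and the referenced Secret and headless Service are assumptions):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres        # headless Service providing stable DNS (assumed to exist)
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret   # assumed Secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:        # each pod gets its own PersistentVolumeClaim
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Unlike a Deployment, each pod keeps a stable name (`postgres-0`, `postgres-1`, …) and reattaches to its own volume after a restart.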

### 5. No Network Policies

By default, all pods can talk to all pods. Security risk.

**Fix:** Define NetworkPolicies:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}      # empty selector: applies to all pods in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed, so all incoming traffic is denied
```

## Kubernetes Cost Optimization

**Resource costs:**

Control plane: $100-200/month (managed by the cloud provider, often free)

Worker node (t3.medium): $30-60/month each

10 nodes = $300-600/month base

**Optimization strategies:**

1. **Resource requests:** Accurate requests allow better packing; fewer nodes needed
2. **Cluster autoscaling:** Add nodes when needed; remove when not
3. **Pod eviction:** Low-priority pods evicted during resource shortage
4. **Reserved instances:** Buy long-term compute at a discount
5. **Spot instances:** Cheap temporary instances; useful for batch jobs

**Real example:**

```
Initial: 20 nodes (constant), $1200/month

After optimization:
  Autoscaling: 5-20 nodes based on load
  Resource limits: better packing
  Spot instances: 50% of nodes

New cost: $400-800/month (33-67% reduction)
```

## Real-World Kubernetes Scenarios

### Scenario 1: The Midnight Traffic Spike

E-commerce site: Typical traffic 100 requests/sec. Black Friday traffic: 1000 requests/sec.

Without Kubernetes: Provision for peak (20 servers); waste capacity 99% of year.

With Kubernetes:

Typical: 5 pods

Black Friday: Autoscaler detects high CPU; scales to 50 pods

Peak passes: Scales back to 5 pods

Result: Right-size capacity; save money
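The autoscaling above is a HorizontalPodAutoscaler; a sketch targeting the `myapp` Deployment (the 70% CPU threshold is an illustrative choice):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70% of requests
```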

### Scenario 2: The Zero-Downtime Deployment

Company deploys 50 times/day.

Without Kubernetes: Each deploy = brief downtime; users affected.

With Kubernetes: Rolling update (or blue-green) means zero downtime. Users don't notice deploys.

### Scenario 3: The Resource Neighbor Problem

Two teams share Kubernetes cluster. Team A's batch job uses all CPU. Team B's critical service starves.

Without Kubernetes: Manual resource management; constant conflict.

With Kubernetes: ResourceQuotas limit Team A to 50% CPU. Team B guaranteed 50%. Isolation.
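A ResourceQuota sketch capping Team A's namespace (the namespace name and numbers are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a        # assumed namespace for Team A's workloads
spec:
  hard:
    requests.cpu: "50"     # Team A's pods may request at most 50 CPU cores total
    requests.memory: 100Gi
    limits.cpu: "100"
```

Pods that would push the namespace past these totals are rejected at admission, so Team A's batch jobs can never crowd out Team B.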

## Kubernetes Learning Path

1. **Learn core concepts:** Pods, Deployments, Services, ConfigMaps
2. **Deploy locally:** Minikube or Docker Desktop Kubernetes
3. **Deploy on cloud:** EKS (AWS), GKE (Google), AKS (Azure)
4. **Master advanced:** StatefulSets, Operators, network policies
5. **Optimize:** Resource requests, autoscaling, cost management

## Common Kubernetes Operations

**Deploy new version:**

```bash
kubectl set image deployment/myapp app=myapp:1.1
```

Kubernetes automatically rolls out new version.

**Scale:**

```bash
kubectl scale deployment myapp --replicas=200
```

Kubernetes schedules additional pods until 200 replicas are running.

**Rollback:**

```bash
kubectl rollout undo deployment/myapp
```

Reverts to the previous rollout revision.

**Monitor:**

```bash
kubectl logs deployment/myapp
kubectl top nodes
kubectl describe pod myapp-abc123
```

## The Bottom Line

Kubernetes is complex, but it solves hard problems: scaling, updates, reliability, resource management.

Start small (5-10 pods). Learn gradually. Master rolling updates. Then scale confidently.

Kubernetes lets you run 10,000 containers with the operational complexity of 10.

That's powerful.
