# Observability in Production: Logs, Metrics, Traces
Production issue: latency spiked at 3 AM. By morning, 1,000 customers had experienced errors. How do you debug it?
Without observability: check error logs (millions of lines), check the metrics dashboard (opaque), and spend 4 hours narrowing down the cause.
With observability: query traces for the slowest requests, see exactly which service is slow, and drill into the logs from that service. Root cause: a database query on an unindexed column. Fix identified in 10 minutes.
Observability is structured visibility into system behavior.
## Three Pillars of Observability
### 1. Logs
Discrete events; human-readable; high volume.
**Good log:**
```json
{
  "timestamp": "2026-05-13T03:45:22Z",
  "level": "ERROR",
  "message": "Database query timeout",
  "service": "order-service",
  "request_id": "req-abc-123",
  "user_id": 5234,
  "order_id": 98765,
  "query": "SELECT * FROM orders WHERE user_id = ?",
  "duration_ms": 5000,
  "trace_id": "trace-xyz-789"
}
```
Structured logging (JSON) enables filtering, searching, correlating.
**Bad log:**
```
Error occurred at 03:45. Something failed.
```
**Logging best practices:**
- Structured (JSON), not unstructured text (a minimal sketch follows this list)
- Include context (user_id, request_id, service)
- Level appropriately (ERROR vs. WARN vs. INFO)
- Don't log sensitive data (passwords, credit cards)
- Include trace_id for correlation across services
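As a concrete illustration of the first practice, here is a minimal sketch of structured JSON logging using only the Python standard library. The field names mirror the example log above; a production setup would more likely use a library such as structlog, so treat this as illustrative rather than prescriptive.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with contextual fields."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "order-service",
        }
        # Merge extra context (request_id, user_id, trace_id, ...) passed via `extra=`
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Database query timeout",
    extra={"context": {"request_id": "req-abc-123", "user_id": 5234, "duration_ms": 5000}},
)
```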
### 2. Metrics
Aggregated measurements; numerical; queryable.
**Example metrics:**
- Request latency (p50, p95, p99)
- Error rate (errors per second)
- CPU, memory, disk usage
- Database query duration
- Cache hit rate
**Instrumentation:**
```python
# Counter: increment on events
counter_requests_total.labels(service="order", endpoint="/orders").inc()

# Histogram: measure a distribution
latency_histogram.labels(service="order").observe(0.123)  # 123 ms

# Gauge: record a point-in-time value
memory_gauge.labels(service="order").set(512)  # 512 MB
```
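The counter, histogram, and gauge above are assumed to already exist. A minimal sketch of defining them with the prometheus_client library (metric and label names are illustrative, not from the original):

```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counter: monotonically increasing count of handled requests
counter_requests_total = Counter(
    "requests_total", "Total requests handled", ["service", "endpoint"]
)

# Histogram: latency distribution in seconds (buckets power p50/p95/p99 queries)
latency_histogram = Histogram(
    "request_latency_seconds", "Request latency in seconds", ["service"]
)

# Gauge: current memory usage in MB (can go up and down)
memory_gauge = Gauge("memory_usage_mb", "Memory usage in MB", ["service"])

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)
```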
**Querying:**
```
# What was the p99 latency in the last hour?
histogram_quantile(0.99, rate(latency_histogram_bucket[1h]))

# How many requests errored in the last 5 minutes?
rate(counter_errors_total[5m])
```
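The same PromQL can also be run programmatically against the Prometheus HTTP API. A rough sketch, assuming a server reachable at http://prometheus:9090 (the address and metric name are assumptions):

```python
import requests

PROMETHEUS = "http://prometheus:9090"  # assumed server address

# Instant query: p99 latency over the last hour
resp = requests.get(
    f"{PROMETHEUS}/api/v1/query",
    params={"query": "histogram_quantile(0.99, rate(latency_histogram_bucket[1h]))"},
    timeout=5,
)
for series in resp.json()["data"]["result"]:
    labels, (_, value) = series["metric"], series["value"]
    print(labels, f"p99={float(value):.3f}s")
```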
### 3. Traces
Request journey through system; shows dependencies and bottlenecks.
**Example trace:**
```
Request arrives at API Gateway (0ms)
  ↓ Authenticate user (10ms)
  ↓ Call order-service (30ms)
  ↓ Call inventory-service (25ms)
  ↓ Call payment-service (40ms)   ← SLOW
  ↓ Return to order-service (5ms)
  ↓ Return to API Gateway (5ms)
Total latency: 115ms
```
The trace shows payment-service is the bottleneck; investigate that service's database.
**Trace instrumentation:**
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", order_id)

    # Downstream calls are traced automatically
    inventory_result = call_inventory_service(order_id)
    payment_result = call_payment_service(order_id, amount)
    # If payment_service is slow, the trace shows it
```
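For spans like the one above to be exported anywhere, a tracer provider has to be configured once at startup. A minimal sketch using the OpenTelemetry SDK with a console exporter; a real deployment would swap in a Jaeger or OTLP exporter instead:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a global tracer provider that batches finished spans and exports them
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")
```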
## Observability Tools
- **Metrics:** Prometheus, Datadog, New Relic, CloudWatch
- **Logs:** ELK (Elasticsearch, Logstash, Kibana), Splunk, Datadog, CloudWatch
- **Traces:** Jaeger, Datadog, Honeycomb, Lightstep
**Common stack:** Prometheus (metrics) + ELK (logs) + Jaeger (traces)
## Real-World Observability Scenarios
### Scenario 1: Silent Failure
Service A calls Service B. Service B crashes; Service A retries silently.
Without observability: the request completes and seems OK; users are unaffected (briefly).
With observability:
- Trace shows a retry (normally there is no retry in the trace)
- Error rate metric spikes
- Dashboard alert: "Service B error rate > 1%"
- Investigation: Service B crashed at 3:25 AM
Result: the team is woken and the issue is resolved before most users notice.
### Scenario 2: The Memory Leak
A service slowly leaks memory; it takes 2 weeks to run out, then crashes.
Without observability: the service crashes, a customer reports it, someone restarts it manually, and the underlying issue stays hidden.
With observability:
- Memory gauge shows a gradual increase (weeks of data)
- Trend analysis: "Memory increasing 50MB/day"
- Alert: "Memory approaching limit"
- Developer investigates the code and finds the leak
Result: the issue is fixed before the crash.
### Scenario 3: Multi-Service Latency
User reports slowness. Which service is slow?
Without observability: check logs (millions of lines); impossible to correlate.
With observability:
- Query traces and filter for slow requests (>1000ms)
- See the pattern: 95% of slow requests spend their time in payment-service
- Drill into payment-service traces: all slow requests go to one old database host
- Query logs from that host: "Disk I/O wait: 95%"
- The disk is failing and needs replacement
Result: root cause identified in 15 minutes (vs. 4 hours of guessing).
## Observability Best Practices
### 1. Instrument from the Start
Don't wait until there's an issue; instrument while you code.
```python
# Add metrics, logs, and traces to every business operation
def process_order(order_id):
    with tracer.start_as_current_span("process_order"):
        logger.info("processing", order_id=order_id)
        metrics.counter_orders.inc()

        result = do_work(order_id)

        logger.info("completed", order_id=order_id, success=True)
        return result
```
### 2. Structured Logging
Always use structured (JSON) logs with context.
### 3. Correlation IDs
Trace a request across services using a correlation ID.
```
API Gateway generates: request_id = "req-abc-123"
  ↓ Pass to order-service (header)
  ↓ order-service passes to inventory-service (header)
  ↓ All logs include request_id
  ↓ Query: "All logs with request_id = req-abc-123"
  ↓ See full request journey
```
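A rough sketch of this pattern in a Flask service: reuse the caller's ID if one arrives, otherwise generate one, log it, and forward it on downstream calls. The X-Request-ID header name and the inventory-service URL are assumptions for illustration:

```python
import uuid
import requests
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_request_id():
    # Reuse the incoming correlation ID, or generate one at the edge
    g.request_id = request.headers.get("X-Request-ID", f"req-{uuid.uuid4()}")

@app.route("/orders/<int:order_id>")
def get_order(order_id):
    app.logger.info("fetching order %s request_id=%s", order_id, g.request_id)
    # Forward the same ID so inventory-service logs can be correlated with ours
    inventory = requests.get(
        f"http://inventory-service/items/{order_id}",  # hypothetical downstream service
        headers={"X-Request-ID": g.request_id},
        timeout=2,
    )
    return {"order_id": order_id, "inventory": inventory.json(), "request_id": g.request_id}
```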
### 4. Cardinality Awareness
High-cardinality metrics (e.g., user_id as a label) explode cost.
**Bad:**
```python
metric.labels(user_id=123456).inc()  # millions of unique user_ids
# Cost: $1000s
```
**Good:**
```python
metric.labels(service="order").inc()  # fixed labels (service, endpoint)
# Cost: manageable
```
### 5. Alerting
Alert on symptoms, not causes.
**Bad alerts:**
- "CPU > 80%": maybe a problem, maybe not
- "Disk > 80%": maybe a problem, maybe not
**Good alerts:**
- "Error rate > 1%": clear action (investigate errors)
- "Request latency p99 > 500ms": clear action (optimize slow requests)
- "Database connection pool > 90%": clear action (scale the database)
## Observability Costs
**Typical SaaS observability cost:**
- Metrics ingestion: $100-500/month (proportional to cardinality and retention)
- Logs ingestion: $500-2000/month (proportional to volume and retention)
- Trace sampling: $100-300/month (sampling 10-50% of requests)
**Total: $700-2800/month** for a mid-size company
**Optimization:**
- Sample logs (e.g., keep 10% in production)
- Sample traces (1-10% depending on volume; see the sketch below)
- Set retention windows (30-90 days, not forever)
- Use high-cardinality tools carefully
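As one example of trace sampling (referenced in the list above), the OpenTelemetry SDK can keep a fixed fraction of traces. A minimal sketch; the 10% ratio is an arbitrary illustrative choice within the range quoted above:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Keep roughly 10% of traces; the decision is derived from the trace ID,
# so all spans within one trace are kept or dropped together
provider = TracerProvider(sampler=TraceIdRatioBased(0.10))
trace.set_tracer_provider(provider)
```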
## The Bottom Line
You can't improve what you can't measure. Observability measures your system's behavior.
Instrument code. Collect metrics, logs, traces. Correlate with request IDs. Alert on symptoms. Investigate fast.
Without observability, you're flying blind. With it, you see everything.
Senthil Kumar
Founder & CEO, Sentos Technologies
Passionate about AI-powered IT solutions and helping mid-market enterprises advance.