# Observability in Production: Logs, Metrics, Traces
Production issue: latency spiked at 3 AM. By morning, 1,000 customers had experienced errors. How do you debug it?
Without observability: check error logs (millions of lines), check the metrics dashboard (opaque), and spend 4 hours narrowing down the cause.
With observability: query traces for the slowest requests, see exactly which service is slow, and drill into the logs from that service. Root cause: a database query on an unindexed column. Fix identified in 10 minutes.
Observability is structured visibility into system behavior.
## Three Pillars of Observability
### 1. Logs
Discrete events; human-readable; high volume.
**Good log:**
```json
{
  "timestamp": "2026-05-13T03:45:22Z",
  "level": "ERROR",
  "message": "Database query timeout",
  "service": "order-service",
  "request_id": "req-abc-123",
  "user_id": 5234,
  "order_id": 98765,
  "query": "SELECT * FROM orders WHERE user_id = ?",
  "duration_ms": 5000,
  "trace_id": "trace-xyz-789"
}
```
Structured logging (JSON) enables filtering, searching, correlating.
**Bad log:**
```
Error occurred at 03:45. Something failed.
```
**Logging best practices:**
- Structured (JSON), not unstructured text (a minimal sketch follows this list)
- Include context (user_id, request_id, service)
- Level appropriately (ERROR vs. WARN vs. INFO)
- Don't log sensitive data (passwords, credit cards)
- Include trace_id for correlation across services
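As a concrete illustration of the first practice, here is a minimal sketch of structured JSON logging using only the Python standard library. The field names mirror the example log above; a production setup would more likely use a library such as structlog, so treat this as illustrative rather than prescriptive.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with contextual fields."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "order-service",
        }
        # Merge extra context (request_id, user_id, trace_id, ...) passed via `extra=`
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Database query timeout",
    extra={"context": {"request_id": "req-abc-123", "user_id": 5234, "duration_ms": 5000}},
)
```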
### 2. Metrics
Aggregated measurements; numerical; queryable.
**Example metrics:**
- Request latency (p50, p95, p99)
- Error rate (errors per second)
- CPU, memory, disk usage
- Database query duration
- Cache hit rate
**Instrumentation:**
```python
# Counter: increment on events
counter_requests_total.labels(service="order", endpoint="/orders").inc()

# Histogram: measure a distribution
latency_histogram.labels(service="order").observe(0.123)  # 123 ms

# Gauge: record a point-in-time value
memory_gauge.labels(service="order").set(512)  # 512 MB
```
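The counter, histogram, and gauge above are assumed to already exist. A minimal sketch of defining them with the prometheus_client library (metric and label names are illustrative, not from the original):

```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counter: monotonically increasing count of handled requests
counter_requests_total = Counter(
    "requests_total", "Total requests handled", ["service", "endpoint"]
)

# Histogram: latency distribution in seconds (buckets power p50/p95/p99 queries)
latency_histogram = Histogram(
    "request_latency_seconds", "Request latency in seconds", ["service"]
)

# Gauge: current memory usage in MB (can go up and down)
memory_gauge = Gauge("memory_usage_mb", "Memory usage in MB", ["service"])

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)
```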
**Querying:**
```
# What was the p99 latency in the last hour?
histogram_quantile(0.99, rate(latency_histogram_bucket[1h]))

# How many requests errored in the last 5 minutes?
rate(counter_errors_total[5m])
```
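The same PromQL can also be run programmatically against the Prometheus HTTP API. A rough sketch, assuming a server reachable at http://prometheus:9090 (the address and metric name are assumptions):

```python
import requests

PROMETHEUS = "http://prometheus:9090"  # assumed server address

# Instant query: p99 latency over the last hour
resp = requests.get(
    f"{PROMETHEUS}/api/v1/query",
    params={"query": "histogram_quantile(0.99, rate(latency_histogram_bucket[1h]))"},
    timeout=5,
)
for series in resp.json()["data"]["result"]:
    labels, (_, value) = series["metric"], series["value"]
    print(labels, f"p99={float(value):.3f}s")
```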
### 3. Traces
Request journey through system; shows dependencies and bottlenecks.
**Example trace:**
```
Request arrives at API Gateway (0ms)
  ↓ Authenticate user (10ms)
  ↓ Call order-service (30ms)
  ↓ Call inventory-service (25ms)
  ↓ Call payment-service (40ms)   ← SLOW
  ↓ Return to order-service (5ms)
  ↓ Return to API Gateway (5ms)
Total latency: 115ms
```
The trace shows payment-service is the bottleneck; investigate that service's database.
**Trace instrumentation:**
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", order_id)

    # Downstream calls are traced automatically
    inventory_result = call_inventory_service(order_id)
    payment_result = call_payment_service(order_id, amount)
    # If payment_service is slow, the trace shows it
```
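For spans like the one above to be exported anywhere, a tracer provider has to be configured once at startup. A minimal sketch using the OpenTelemetry SDK with a console exporter; a real deployment would swap in a Jaeger or OTLP exporter instead:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a global tracer provider that batches finished spans and exports them
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")
```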
## Observability Tools
- **Metrics:** Prometheus, Datadog, New Relic, CloudWatch
- **Logs:** ELK (Elasticsearch, Logstash, Kibana), Splunk, Datadog, CloudWatch
- **Traces:** Jaeger, Datadog, Honeycomb, Lightstep
**Common stack:** Prometheus (metrics) + ELK (logs) + Jaeger (traces)
## Real-World Observability Scenarios
### Scenario 1: Silent Failure
Service A calls Service B. Service B crashes; Service A retries silently.
Without observability: the request completes and seems OK; users are unaffected (briefly).
With observability:
- Trace shows a retry (normally there is no retry in the trace)
- Error rate metric spikes
- Dashboard alert: "Service B error rate > 1%"
- Investigation: Service B crashed at 3:25 AM
Result: the team is woken and the issue is resolved before most users notice.
### Scenario 2: The Memory Leak
A service slowly leaks memory; it takes 2 weeks to run out, then crashes.
Without observability: the service crashes, a customer reports it, someone restarts it manually, and the underlying issue stays hidden.
With observability:
- Memory gauge shows a gradual increase (weeks of data)
- Trend analysis: "Memory increasing 50MB/day"
- Alert: "Memory approaching limit"
- Developer investigates the code and finds the leak
Result: the issue is fixed before the crash.
### Scenario 3: Multi-Service Latency
User reports slowness. Which service is slow?
Without observability: check logs (millions of lines); impossible to correlate.
With observability:
- Query traces and filter for slow requests (>1000ms)
- See the pattern: 95% of slow requests spend their time in payment-service
- Drill into payment-service traces: all slow requests go to one old database host
- Query logs from that host: "Disk I/O wait: 95%"
- The disk is failing and needs replacement
Result: root cause identified in 15 minutes (vs. 4 hours of guessing).
## Observability Best Practices
### 1. Instrument from the Start
Don't wait until there's an issue; instrument while you code.
```python
# Add metrics, logs, and traces to every business operation
def process_order(order_id):
    with tracer.start_as_current_span("process_order"):
        logger.info("processing", order_id=order_id)
        metrics.counter_orders.inc()

        result = do_work(order_id)

        logger.info("completed", order_id=order_id, success=True)
        return result
```
### 2. Structured Logging
Always use structured (JSON) logs with context.
### 3. Correlation IDs
Trace a request across services using a correlation ID.
```
API Gateway generates: request_id = "req-abc-123"
  ↓ Pass to order-service (header)
  ↓ order-service passes to inventory-service (header)
  ↓ All logs include request_id
  ↓ Query: "All logs with request_id = req-abc-123"
  ↓ See full request journey
```
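A rough sketch of this pattern in a Flask service: reuse the caller's ID if one arrives, otherwise generate one, log it, and forward it on downstream calls. The X-Request-ID header name and the inventory-service URL are assumptions for illustration:

```python
import uuid
import requests
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_request_id():
    # Reuse the incoming correlation ID, or generate one at the edge
    g.request_id = request.headers.get("X-Request-ID", f"req-{uuid.uuid4()}")

@app.route("/orders/<int:order_id>")
def get_order(order_id):
    app.logger.info("fetching order %s request_id=%s", order_id, g.request_id)
    # Forward the same ID so inventory-service logs can be correlated with ours
    inventory = requests.get(
        f"http://inventory-service/items/{order_id}",  # hypothetical downstream service
        headers={"X-Request-ID": g.request_id},
        timeout=2,
    )
    return {"order_id": order_id, "inventory": inventory.json(), "request_id": g.request_id}
```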
### 4. Cardinality Awareness
High-cardinality metrics (e.g., user_id as a label) explode cost.
**Bad:**
```python
metric.labels(user_id=123456).inc()  # millions of unique user_ids
# Cost: $1000s
```
**Good:**
```python
metric.labels(service="order").inc()  # fixed labels (service, endpoint)
# Cost: manageable
```
### 5. Alerting
Alert on symptoms, not causes.
**Bad alerts:**
- "CPU > 80%": maybe a problem, maybe not
- "Disk > 80%": maybe a problem, maybe not
**Good alerts:**
- "Error rate > 1%": clear action (investigate errors)
- "Request latency p99 > 500ms": clear action (optimize slow requests)
- "Database connection pool > 90%": clear action (scale the database)
## Observability Costs
**Typical SaaS observability cost:**
- Metrics ingestion: $100-500/month (proportional to cardinality and retention)
- Logs ingestion: $500-2000/month (proportional to volume and retention)
- Trace sampling: $100-300/month (sampling 10-50% of requests)
**Total: $700-2800/month** for a mid-size company
**Optimization:**
- Sample logs (e.g., keep 10% in production)
- Sample traces (1-10% depending on volume; see the sketch below)
- Set retention windows (30-90 days, not forever)
- Use high-cardinality tools carefully
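As one example of trace sampling (referenced in the list above), the OpenTelemetry SDK can keep a fixed fraction of traces. A minimal sketch; the 10% ratio is an arbitrary illustrative choice within the range quoted above:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Keep roughly 10% of traces; the decision is derived from the trace ID,
# so all spans within one trace are kept or dropped together
provider = TracerProvider(sampler=TraceIdRatioBased(0.10))
trace.set_tracer_provider(provider)
```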
## The Bottom Line
You can't improve what you can't measure. Observability measures your system's behavior.
Instrument code. Collect metrics, logs, traces. Correlate with request IDs. Alert on symptoms. Investigate fast.
Without observability, you're flying blind. With it, you see everything.
Senthil Kumar
Founder & CEO, Sentos Technologies
Passionate about AI-powered IT solutions and helping mid-market enterprises advance.