
13 May 2026 · 15 min read · Senthil Kumar

# Real-Time Machine Learning: Building Sub-100ms Prediction Serving Infrastructure

A transaction arrives. Do you approve it or flag it as fraud? You have 50 milliseconds. Slower response = transaction timeout = customer frustration.

A customer lands on your site. What recommendations do you show? You have 100 milliseconds before page load timeout. Slower = users bounce.

Real-time ML isn't a research problem. It's an infrastructure problem.

Most companies train models that work offline (batch processing). But many use cases demand real-time: fraud detection, recommendations, anomaly detection, dynamic pricing, personalization. Latency budgets are tight (10-500ms).

This requires rethinking everything: architecture, caching, data fetching, feature computation, and fallback strategies.

## The Real-Time ML Architecture

```
Request arrives
    ↓
Feature lookup (from cache or compute)
    ↓
Model inference (GPU, TPU, or CPU)
    ↓
Post-processing (thresholding, formatting)
    ↓
Response returned
    ↓
Prediction logged (for monitoring, retraining)
```

**Latency budget (100ms total):**

Feature lookup: 10ms (read from cache)

Model inference: 50ms (GPU inference)

Post-processing: 5ms

Network latency: 25ms

Buffer: 10ms

Every component must be optimized.
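To make the budget concrete, here is a minimal sketch of a request handler that times each stage against the 100ms target. The callables `lookup_features`, `run_inference`, and `log_prediction` are placeholders, not part of any particular framework:

```python
import time

LATENCY_BUDGET_MS = 100

def handle_request(request, lookup_features, run_inference, log_prediction):
    """Serve one prediction while tracking how much of the 100ms budget each stage uses."""
    timings = {}
    start = time.perf_counter()

    features = lookup_features(request)          # target: ~10ms (cache read)
    timings["features_ms"] = (time.perf_counter() - start) * 1000

    score = run_inference(features)              # target: ~50ms (GPU inference)
    timings["inference_ms"] = (time.perf_counter() - start) * 1000 - timings["features_ms"]

    response = {"approved": score > 0.5}         # post-processing: thresholding (~5ms)
    timings["total_ms"] = (time.perf_counter() - start) * 1000

    log_prediction(request, response, timings)   # done asynchronously in practice; feeds monitoring/retraining
    if timings["total_ms"] > LATENCY_BUDGET_MS:
        pass  # emit a budget-exceeded metric here
    return response
```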

## Component 1: Feature Serving

Features are computed properties used by the model (e.g., "customer spent $1000 in last 30 days").

**Challenge:** Computing features in real-time is slow.

**Example:**

``` Request: "Is this transaction fraudulent?" Needed feature: "customer's average transaction in last 30 days"

Naive approach: Query database: "SELECT AVG(amount) FROM transactions WHERE customer_id = X AND created_at > NOW() - INTERVAL '30 days'"

Problem: Database query takes 200ms. Timeout. ```

**Solution: Feature cache**

```
Pre-compute features offline:
  Customer 1: avg_30d_spend = $500, transaction_count = 120, ...
  Customer 2: avg_30d_spend = $250, transaction_count = 50, ...
  ...
Store in Redis or an in-memory cache.

At request time:
  Cache hit  → return instantly
  Cache miss → compute the feature fresh
```
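A minimal sketch of the cache-first lookup, assuming redis-py and a local Redis instance; the key layout and the `compute_fresh` fallback are illustrative, not a specific feature-store API:

```python
import json
import redis  # assumes redis-py and a running Redis instance

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_features(customer_id: str, compute_fresh) -> dict:
    """Return pre-computed features from the cache, falling back to on-demand computation."""
    cached = r.get(f"features:{customer_id}")
    if cached is not None:
        return json.loads(cached)          # cache hit: sub-millisecond
    features = compute_fresh(customer_id)  # cache miss: slower path (e.g. a streaming aggregate)
    r.set(f"features:{customer_id}", json.dumps(features), ex=3600)  # refresh with a 1-hour TTL
    return features
```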

**Feature serving patterns:**

1. **Pre-computed cache:** Compute features nightly; store in cache
   - Latency: <1ms
   - Freshness: Daily (may miss recent activity)
   - Cost: Computation upfront

2. **Streaming aggregation:** Features computed in real-time from the event stream
   - Latency: <10ms
   - Freshness: Real-time
   - Cost: Infrastructure for streaming (Kafka, Flink)

3. **Hybrid:** Recent features streamed; older features cached
   - Latency: <20ms
   - Freshness: Recent activity real-time; older features cached
   - Cost: Moderate (streaming + cache)

**Tools:** Feast (open-source feature store), Tecton, Vertex AI Feature Store

## Component 2: Model Inference

The model stays loaded in memory on dedicated servers, so each request pays only for inference, not for model loading.

**Architecture:**

```
Load Balancer
    ↓
Model Server 1 (GPU, 100 requests/sec)
Model Server 2 (GPU, 100 requests/sec)
Model Server 3 (GPU, 100 requests/sec)
    ↓
Model Service (handles ~300 requests/sec at 100ms latency)
```

**Optimization:**

**Batch requests:** Group incoming requests so the accelerator processes many samples at once; per-sample inference cost drops (see the sketch below)

- Single request: 100ms
- Batch of 32: 120ms (includes batch assembly overhead); 3.7x throughput
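One way to implement this is a small micro-batcher that holds concurrent requests for a few milliseconds, then runs a single batched inference call. This is an illustrative in-process sketch, not a replacement for the dynamic batching built into dedicated model servers; `predict_batch` is a placeholder for the actual batched model call:

```python
import queue
import threading

class MicroBatcher:
    """Collects concurrent requests for a few milliseconds, then runs one batched inference call."""

    def __init__(self, predict_batch, max_batch=32, max_wait_ms=5):
        self.predict_batch = predict_batch      # e.g. model(features_batch) on the GPU
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, features):
        """Called by request handlers; blocks until the batched result is ready."""
        done = threading.Event()
        slot = {"features": features, "done": done, "result": None}
        self.requests.put(slot)
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]                     # block until the first request arrives
            try:
                while len(batch) < self.max_batch:
                    batch.append(self.requests.get(timeout=self.max_wait))
            except queue.Empty:
                pass                                          # waited long enough; run with what we have
            results = self.predict_batch([s["features"] for s in batch])
            for slot, result in zip(batch, results):
                slot["result"] = result
                slot["done"].set()
```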

**GPU selection:** Choose inference-oriented GPUs rather than training GPUs

- Training GPUs: V100, A100 (large memory; cost-inefficient for single-sample inference)
- Inference GPUs: T4, A10 (smaller, cheaper, well suited to low-latency serving)

**Model optimization:** See "Deep Learning in Production" above

- Quantization: ~4x faster
- Pruning: ~30% faster
- Distillation: smaller, faster model

**Framework:** TensorRT (NVIDIA) can provide a 5-10x inference speedup over unoptimized PyTorch
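As one concrete example of quantization, here is a minimal sketch using PyTorch's dynamic INT8 quantization on a small placeholder model; TensorRT or ONNX Runtime would be the usual route for GPU-side optimization:

```python
import torch

# Placeholder model standing in for a trained network with Linear layers (e.g. a small ranking MLP).
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
model.eval()

# Dynamic INT8 quantization of the Linear layers: smaller weights, faster CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    prediction = quantized(torch.randn(1, 128))
```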

## Component 3: Caching

Cache prediction results to avoid repeated inference.

**Multi-level caching:**

**L1: Request cache (seconds):**

Identical feature input → identical prediction (for a deterministic model)

Cache key: hash(features)

TTL: 10 seconds

Hit rate: 60–80% for many use cases
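A minimal in-process sketch of this L1 cache, keyed by a hash of the serialized features with a 10-second TTL; a production setup would typically use a shared store such as Redis rather than a local dict:

```python
import hashlib
import json
import time

_request_cache: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 10

def cached_predict(features: dict, predict):
    """L1 request cache: identical feature inputs reuse a recent prediction."""
    key = hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest()
    hit = _request_cache.get(key)
    if hit is not None and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                       # cache hit: skip inference entirely
    result = predict(features)              # cache miss: run the model
    _request_cache[key] = (time.time(), result)
    return result
```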

**L2: Model cache (hours):**

User-level cache

"Recommendations for user X" cached, refreshed every hour

Hit rate: 90%+

**L3: CDN cache (global):**

Cache predictions at edge locations

Serve from nearest data center

Reduces latency for geographically distributed users

## Component 4: Feature Pipeline

Compute features in real-time if not in cache.

**Constraints:**

Must complete in <20ms (part of 100ms budget)

Must be deterministic (same input → same output)

Must handle failures (missing data, timeouts)

**Approaches:**

1. **SQL query:** Query the database for features
   - Simple, but slow (200ms+)
   - Use only for non-critical features

2. **Precomputed cache:** Features precomputed and stored in Redis
   - Fast (<1ms), but may be stale
   - Refresh on a schedule or on demand

3. **Streaming aggregation:** Real-time aggregates from an event stream
   - Fast (<10ms) and always fresh
   - Requires streaming infrastructure (Kafka, Kinesis, Flink)
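A minimal sketch of the streaming-aggregation idea: a 30-day rolling average spend maintained from transaction events held in memory. Real deployments would keep this state in a stream processor (Flink, Kafka Streams) rather than in the serving process; the window size and field names are illustrative:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 24 * 3600  # 30-day rolling window

_events = defaultdict(deque)  # customer_id -> deque of (timestamp, amount)

def on_transaction(customer_id: str, amount: float) -> None:
    """Called for every event consumed from the stream (Kafka, Kinesis, ...)."""
    _events[customer_id].append((time.time(), amount))

def avg_spend(customer_id: str) -> float:
    """Read-time aggregate: average transaction amount inside the window."""
    window = _events[customer_id]
    cutoff = time.time() - WINDOW_SECONDS
    while window and window[0][0] < cutoff:   # evict events that fell out of the window
        window.popleft()
    return sum(a for _, a in window) / len(window) if window else 0.0
```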

## Real-Time ML Scenarios

### Scenario 1: Fraud Detection

Online payment processor.

**Latency requirement:** 50ms (transaction timeout)

**Pipeline:**

Request: "Approve payment of $500 from customer X?"

Feature lookup (cache, 5ms): Recent purchase history, device info

On cache miss: fall back to streaming aggregation (2ms)

Model inference (20ms): Fraud classifier

Response (50ms): "Approve" or "Review"

**Architecture:**

Feature cache: Redis with 1-hour TTL

Features streamed: Kafka topic for latest purchases

Model: Quantized XGBoost (ONNX)

Fallback: Approve by default if model unavailable

**Result:** 99.9% of requests complete in under 50ms; fraud detection accuracy is 95%
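A sketch of the fallback behavior from this scenario: the model call gets a hard deadline, and anything slower than the budget (or any model error) falls back to approving the transaction. The 30ms deadline and the 0.9 review threshold are illustrative numbers, not from the scenario above:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=8)
MODEL_TIMEOUT_S = 0.030  # leaves room in the 50ms budget for features and network

def score_transaction(features: dict, model_predict) -> str:
    """Return 'review' only when the model answers in time and flags fraud; otherwise approve."""
    future = _executor.submit(model_predict, features)
    try:
        fraud_probability = future.result(timeout=MODEL_TIMEOUT_S)
    except TimeoutError:
        return "approve"   # fallback: approve by default if the model is too slow
    except Exception:
        return "approve"   # fallback: approve by default if the model errors out
    return "review" if fraud_probability > 0.9 else "approve"
```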

### Scenario 2: Real-Time Recommendations

E-commerce site.

**Latency requirement:** 100ms (page load)

**Pipeline:**

Request: "Recommend products for user Y?"

Feature lookup (cache, 10ms): Browsing history, purchase history

Model inference (40ms): Recommendation model

Rank results (20ms): Sort by relevance, diversity, inventory

Response (100ms): Top 10 products

**Optimization:**

Pre-cache popular recommendations (top 1000 users)

Cache hits: 70%; instant response

Cache misses: 100ms response (acceptable)

Model: Distilled from large recommender; 50ms → 40ms inference

**Result:** P95 latency 80ms; recommendation CTR increased 15%

### Scenario 3: Anomaly Detection

Manufacturing plant monitoring sensors.

**Latency requirement:** <1 second (alert on anomaly)

**Pipeline:**

Streaming data: Sensor readings (temperature, pressure, vibration)

Feature engineering (100ms): Rolling averages, rates of change

Model inference (50ms): Anomaly detection model

If anomaly detected: Alert operations

If normal: Log and continue monitoring
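A minimal sketch of the feature-engineering step for one sensor: a rolling mean and a rate of change over a fixed window of recent readings. The window size and feature names are illustrative:

```python
from collections import deque

class SensorFeatures:
    """Rolling average and rate of change over the last N readings of one sensor."""

    def __init__(self, window: int = 60):
        self.readings = deque(maxlen=window)

    def update(self, value: float) -> dict:
        previous = self.readings[-1] if self.readings else value
        self.readings.append(value)
        return {
            "latest": value,
            "rolling_mean": sum(self.readings) / len(self.readings),
            "rate_of_change": value - previous,  # per-reading delta; divide by the sampling interval for a per-second rate
        }
```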

**Optimization:**

Models run on edge devices (factory floor)

Local anomaly detection (no network latency)

Upstream aggregation (send results to cloud for monitoring dashboard)

**Result:** <200ms detection; fewer false alarms than threshold-based rules

## Monitoring Real-Time ML

Production predictions must be monitored:

**Latency:** P95, P99 latency tracked; alert if >threshold

**Accuracy:** Compare predictions against actual outcomes (labels arrive with a delay); track accuracy over time

**Error rate:** Model failures, timeouts; alert immediately

**Feature freshness:** Is cache stale? Are features missing?

**Drift:** Prediction distribution changing? Retrain needed?

**Dashboards:**

Real-time latency histogram

Accuracy trend (daily)

Error rate spike detection

Feature freshness status
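A minimal sketch of the latency side of this monitoring: summarize a window of request latencies into P50/P95/P99 and flag when the tail exceeds the budget. The 100ms threshold is the article's overall budget; the rest is illustrative:

```python
import statistics

def latency_report(latencies_ms: list[float], p99_threshold_ms: float = 100.0) -> dict:
    """Summarize a window of request latencies and flag budget violations."""
    latencies = sorted(latencies_ms)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    return {
        "p50": statistics.median(latencies),
        "p95": p95,
        "p99": p99,
        "alert": p99 > p99_threshold_ms,   # page the on-call if the tail exceeds the budget
    }
```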

## Cost Optimization

**GPU inference (1M predictions/day):**

- Naive: every prediction hits the GPU → ~300 requests/sec at peak → 3 T4 GPUs → ~$1K/day
- Cached (60% hit rate): only 40% of requests reach the GPU → ~1.2 GPUs → ~$400/day
- Edge inference: ~$0/day in serving cost (hardware amortized upfront)
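The sizing arithmetic above can be expressed as a small helper. The request rate, per-GPU throughput, and daily GPU cost below are taken from the example figures in this section and plugged in as assumptions:

```python
def gpu_cost_per_day(peak_rps=300, rps_per_gpu=100, cache_hit_rate=0.6, cost_per_gpu_day=333):
    """Rough GPU sizing: only cache misses reach the model servers."""
    miss_rps = peak_rps * (1 - cache_hit_rate)
    gpus = miss_rps / rps_per_gpu               # fractional, as with autoscaling; round up for a fixed fleet
    return {"gpus": gpus, "cost_per_day": gpus * cost_per_gpu_day}

print(gpu_cost_per_day(cache_hit_rate=0.0))  # ~3 GPUs, ~$1K/day (the naive setup)
print(gpu_cost_per_day(cache_hit_rate=0.6))  # ~1.2 GPUs, ~$400/day (with 60% cache hits)
```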

**Trade-offs:**

Raw GPU inference: Highest accuracy; highest cost

Cached: Balance cost and accuracy; stale predictions

Edge: Lowest cost; limited model complexity

## The Bottom Line

Real-time ML isn't just about fast models. It's about architecture:

Feature serving (cache + streaming)

Model optimization (quantization, distillation)

Caching (multi-level)

Monitoring (latency, accuracy, drift)

Fallback (if model unavailable)

Build this, and you can run production ML at sub-100ms latency with 99.9% availability.

Skip any piece, and you'll discover it in production when transactions fail and customers complain.

Senthil Kumar

Founder & CEO

Founder & CEO of Sentos Technologies. Passionate about AI-powered IT solutions and helping mid-market enterprises advance beyond.
