# Real-Time Machine Learning: Building Sub-100ms Prediction Serving Infrastructure
A transaction arrives. Do you approve it or flag it as fraud? You have 50 milliseconds. Slower response = transaction timeout = customer frustration.
A customer lands on your site. What recommendations do you show? You have 100 milliseconds before the page-load deadline passes. Slower = users bounce.
Real-time ML isn't a research problem. It's an infrastructure problem.
Most companies train models that run offline in batch. But many use cases demand real-time inference: fraud detection, recommendations, anomaly detection, dynamic pricing, personalization. Latency budgets are tight (10-500ms).
This requires rethinking everything: architecture, caching, data fetching, feature computation, and fallback strategies.
## The Real-Time ML Architecture
```
Request arrives
  ↓
Feature lookup (from cache or compute)
  ↓
Model inference (GPU, TPU, or CPU)
  ↓
Post-processing (thresholding, formatting)
  ↓
Response returned
  ↓
Prediction logged (for monitoring, retraining)
```
**Latency budget (100ms total):**
Feature lookup: 10ms (read from cache)
Model inference: 50ms (GPU inference)
Post-processing: 5ms
Network latency: 25ms
Buffer: 10ms
Every component must be optimized.
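Concretely, the whole request path fits in one small handler. Here's a minimal Python sketch; `feature_store`, `model`, and `logger` are stand-ins for your own components (not any specific library's API), and the threshold is illustrative:

```python
import time

def handle_request(request, feature_store, model, logger, budget_ms=100):
    """Serve one prediction inside a fixed latency budget."""
    start = time.perf_counter()

    # 1. Feature lookup (~10ms target): cache first, compute on miss.
    features = feature_store.get(request.entity_id)

    # 2. Model inference (~50ms target).
    score = model.predict(features)

    # 3. Post-processing (~5ms target): apply the decision threshold.
    decision = "flag" if score > 0.9 else "approve"

    # 4. Log the prediction for monitoring and future retraining.
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.log(request.entity_id, score, elapsed_ms)
    if elapsed_ms > budget_ms:
        logger.warn(f"budget exceeded: {elapsed_ms:.1f}ms > {budget_ms}ms")
    return decision
```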
## Component 1: Feature Serving
Features are computed properties used by the model (e.g., "customer spent $1000 in last 30 days").
**Challenge:** Computing features in real-time is slow.
**Example:**
```
Request: "Is this transaction fraudulent?"
Needed feature: "customer's average transaction in last 30 days"

Naive approach: query the database:
  SELECT AVG(amount) FROM transactions
  WHERE customer_id = X
    AND created_at > NOW() - INTERVAL '30 days'

Problem: Database query takes 200ms. Timeout.
```
**Solution: Feature cache**
```
Pre-compute features offline:
  Customer 1: avg_30d_spend = $500, transaction_count = 120, ...
  Customer 2: avg_30d_spend = $250, transaction_count = 50, ...
  ...
Store in Redis or an in-memory cache.

At request time:
  Cache hit → instant
  Cache miss → compute a fresh feature (if needed)
```
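A minimal version of that cache-first read, using the redis-py client; `compute_features_from_db` is a hypothetical slow-path helper standing in for your source of record:

```python
import json
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_features(customer_id: str) -> dict:
    """Fast path: precomputed features from Redis. Slow path: recompute."""
    cached = r.get(f"features:{customer_id}")  # <1ms on a hit
    if cached is not None:
        return json.loads(cached)
    # Cache miss: fall back to the source of record, then backfill the cache.
    features = compute_features_from_db(customer_id)  # hypothetical helper
    r.set(f"features:{customer_id}", json.dumps(features), ex=3600)  # 1h TTL
    return features
```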
**Feature serving patterns:**
1. **Pre-computed cache:** Compute features nightly; store in cache
   - Latency: <1ms
   - Freshness: daily (may miss recent activity)
   - Cost: computation upfront
2. **Streaming aggregation:** Features computed in real-time from an event stream
   - Latency: <10ms
   - Freshness: real-time
   - Cost: streaming infrastructure (Kafka, Flink)
3. **Hybrid:** Recent features streamed; older features cached
   - Latency: <20ms
   - Freshness: recent activity real-time; older cached
   - Cost: moderate (streaming + cache)
**Tools:** Feast (open-source feature store), Tecton, Vertex AI Feature Store
## Component 2: Model Inference
The model stays loaded in memory, so each prediction is a fast forward pass rather than a cold start.
**Architecture:**
```
                Load Balancer
           ↓          ↓          ↓
  Model Server 1  Model Server 2  Model Server 3
  (GPU, 100 req/s) (GPU, 100 req/s) (GPU, 100 req/s)

Model service handles ~300 requests/sec at 100ms latency
```
**Optimization:**
**Batch requests:** Group incoming requests so each GPU forward pass is amortized across many samples (see the sketch below)
- Single request: 100ms (10 predictions/sec)
- Batch of 32: 120ms including batch-assembly overhead, i.e. roughly 27x the throughput (32 predictions per 120ms vs. 10/sec)
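A sketch of server-side micro-batching: hold requests for a few milliseconds, then run one batched forward pass and fan the results back out. `model.predict_batch` is an assumed batched API, and the knobs are illustrative:

```python
import queue
import threading
import time
from concurrent.futures import Future

MAX_BATCH = 32
MAX_WAIT_S = 0.005  # hold requests at most 5ms while a batch assembles
pending = queue.Queue()

def submit(features) -> Future:
    """Called by request handlers; resolved by the batching thread."""
    fut = Future()
    pending.put((features, fut))
    return fut

def batching_loop(model):
    """Collect requests briefly, run one batched pass, fan out results."""
    while True:
        batch = [pending.get()]  # block until the first request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(pending.get(timeout=remaining))
            except queue.Empty:
                break
        feats, futs = zip(*batch)
        preds = model.predict_batch(list(feats))  # assumed batched API
        for fut, pred in zip(futs, preds):
            fut.set_result(pred)

def start(model):
    threading.Thread(target=batching_loop, args=(model,), daemon=True).start()
```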
**GPU selection:** Choose inference-oriented GPUs, not training GPUs
- Training: V100, A100 (large memory; poorly utilized by single-sample inference)
- Inference: T4, A10 (smaller, cheaper, fast per-sample)
**Model optimization:** See "Deep Learning in Production" above
- Quantization: ~4x faster (see the sketch below)
- Pruning: ~30% faster
- Distillation: a smaller, faster student model
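As one concrete example, post-training dynamic quantization in PyTorch is only a few lines; the ~4x figure above is a rule of thumb, and real gains vary by model and hardware. The toy model here is just for illustration:

```python
import torch

# A stand-in for your trained float32 model.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)

# Dynamic quantization: weights stored as int8, activations quantized on
# the fly. Best suited to Linear/LSTM-heavy models served on CPU.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,
)
# Weights shrink ~4x; actual speedup depends on the model and hardware.
```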
**Framework:** TensorRT (NVIDIA) can deliver 5-10x inference speedups over vanilla PyTorch for many models
## Component 3: Caching
Cache prediction results to avoid repeated inference.
**Multi-level caching:**
**L1: Request cache (seconds):**
Same feature input → same prediction (assuming a deterministic model)
Cache key: hash(features)
TTL: 10 seconds
Hit rate: 60–80% for many use cases
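A minimal in-process version of this L1 cache; for a cache shared across model servers you'd put the same keys in Redis instead, and `model.predict` is a stand-in for your inference call:

```python
import hashlib
import json
import time

_l1_cache: dict = {}  # in-process; use Redis to share across servers
TTL_S = 10

def cache_key(features: dict) -> str:
    """Deterministic key: hash the canonical JSON encoding of the features."""
    return hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()
    ).hexdigest()

def predict_cached(features: dict, model):
    key = cache_key(features)
    entry = _l1_cache.get(key)
    if entry is not None and time.monotonic() - entry[1] < TTL_S:
        return entry[0]                      # hit: skip inference entirely
    pred = model.predict(features)           # miss: run the model
    _l1_cache[key] = (pred, time.monotonic())
    return pred
```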
**L2: Model cache (hours):**
User-level cache
"Recommendations for user X" cached, refreshed every hour
Hit rate: 90%+
**L3: CDN cache (global):**
Cache predictions at edge locations
Serve from nearest data center
Reduces latency for geographically distributed users
## Component 4: Feature Pipeline
Compute features in real-time if not in cache.
**Constraints:**
Must complete in <20ms (part of 100ms budget)
Must be deterministic (same input → same output)
Must handle failures (missing data, timeouts)
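A sketch of the timeout-and-default pattern those constraints imply, using a thread pool; `compute_fn` and `default` stand in for whatever your feature computation and its safe fallback value are:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=8)
FEATURE_TIMEOUT_S = 0.020  # the feature pipeline's 20ms slice of the budget

def fetch_feature(compute_fn, default):
    """Compute one feature under a hard timeout, degrading to a default."""
    future = executor.submit(compute_fn)
    try:
        return future.result(timeout=FEATURE_TIMEOUT_S)
    except TimeoutError:
        return default   # too slow: don't blow the overall latency budget
    except Exception:
        return default   # missing data, connection errors, etc.
```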
**Approaches:**
1. **SQL query:** Query the database for features
   - Simple; slow (200ms+)
   - Use only for non-critical features
2. **Precomputed cache:** Features precomputed and stored in Redis
   - Fast (<1ms); may be stale
   - Refresh on a schedule or on demand
3. **Streaming aggregation:** Real-time aggregates from an event stream (see the sketch below)
   - Fast (<10ms); always fresh
   - Requires streaming infrastructure (Kafka, Kinesis, Flink)
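To make option 3 concrete, here is a toy in-memory rolling aggregate updated once per event; a real deployment would keep this state in a stream processor (e.g., Flink) rather than in process:

```python
import time
from collections import defaultdict, deque

WINDOW_S = 30 * 24 * 3600  # 30-day window

_events = defaultdict(deque)   # customer_id -> deque of (timestamp, amount)
_sums = defaultdict(float)
_counts = defaultdict(int)

def on_transaction(customer_id, amount, ts=None):
    """Consume one event (e.g., inside a Kafka consumer loop)."""
    ts = ts if ts is not None else time.time()
    _events[customer_id].append((ts, amount))
    _sums[customer_id] += amount
    _counts[customer_id] += 1

def avg_30d(customer_id) -> float:
    """Always-fresh 30-day average; expires old events lazily on read."""
    _expire(customer_id, time.time())
    n = _counts[customer_id]
    return _sums[customer_id] / n if n else 0.0

def _expire(customer_id, now):
    q = _events[customer_id]
    while q and q[0][0] < now - WINDOW_S:
        _, amount = q.popleft()
        _sums[customer_id] -= amount
        _counts[customer_id] -= 1
```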
## Real-Time ML Scenarios
### Scenario 1: Fraud Detection
Online payment processor.
**Latency requirement:** 50ms (transaction timeout)
**Pipeline:**
Request: "Approve payment of $500 from customer X?"
Feature lookup (cache, 5ms): Recent purchase history, device info
On cache miss (+2ms): Read from the streaming aggregate
Model inference (20ms): Fraud classifier
Response (≤50ms total): "Approve" or "Review"
**Architecture:**
Feature cache: Redis with 1-hour TTL
Features streamed: Kafka topic for latest purchases
Model: Quantized XGBoost (ONNX)
Fallback: Approve by default if model unavailable
**Result:** 99.9% of requests complete in <50ms; 95% fraud-detection accuracy
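The fail-open fallback above can be a thin wrapper around the model call. A sketch; the 0.9 review threshold is illustrative, and `model.predict` stands in for your classifier:

```python
def score_transaction(features, model) -> str:
    """Fail open: if the model errors out or times out, approve rather
    than block legitimate payments (the trade-off this processor chose)."""
    try:
        risk = model.predict(features)  # timeouts surface as exceptions
    except Exception:
        return "approve"                # model unavailable: default decision
    return "review" if risk > 0.9 else "approve"
```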
### Scenario 2: Real-Time Recommendations
E-commerce site.
**Latency requirement:** 100ms (page load)
**Pipeline:**
Request: "Recommend products for user Y?"
Feature lookup (cache, 10ms): Browsing history, purchase history
Model inference (40ms): Recommendation model
Rank results (20ms): Sort by relevance, diversity, inventory
Response (≤100ms total): Top 10 products
**Optimization:**
Pre-cache popular recommendations (top 1000 users)
Cache hits: 70%; instant response
Cache misses: 100ms response (acceptable)
Model: Distilled from large recommender; 50ms → 40ms inference
**Result:** P95 latency 80ms; recommendation CTR increased 15%
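The ranking step (relevance plus diversity) is often a greedy re-rank over the model's candidate list. An MMR-style sketch; `similarity` is an assumed item-similarity function returning values in [0, 1], and the weight is illustrative:

```python
def rerank(candidates, similarity, k=10, diversity_weight=0.3):
    """Greedy re-rank: trade model relevance against similarity to items
    already picked. `candidates` is [(item, relevance)] sorted by the model."""
    picked, pool = [], list(candidates)
    while pool and len(picked) < k:
        def mmr_score(pair):
            item, rel = pair
            # Penalize items too similar to anything already selected.
            penalty = max((similarity(item, p) for p, _ in picked), default=0.0)
            return rel - diversity_weight * penalty
        best = max(pool, key=mmr_score)
        picked.append(best)
        pool.remove(best)
    return [item for item, _ in picked]
```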
### Scenario 3: Anomaly Detection
Manufacturing plant monitoring sensors.
**Latency requirement:** <1 second (alert on anomaly)
**Pipeline:**
Streaming data: Sensor readings (temperature, pressure, vibration)
Feature engineering (100ms): Rolling averages, rates of change
Model inference (50ms): Anomaly detection model
If anomaly detected: Alert operations
If normal: Log and continue monitoring
**Optimization:**
Models run on edge devices (factory floor)
Local anomaly detection (no network latency)
Upstream aggregation (send results to cloud for monitoring dashboard)
**Result:** <200ms detection; fewer false alarms than threshold-based rules
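On-device detection can be as small as a rolling z-score check over recent readings. A sketch; the window, threshold, and baseline size are illustrative and would be tuned per sensor:

```python
import math
from collections import deque

class ZScoreDetector:
    """Rolling z-score check over the last `window` readings; small
    enough to run on an edge device next to the sensor."""
    def __init__(self, window=300, threshold=4.0, min_baseline=30):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_baseline = min_baseline

    def update(self, reading: float) -> bool:
        """Return True if `reading` deviates sharply from the recent window."""
        anomalous = False
        if len(self.values) >= self.min_baseline:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            anomalous = std > 0 and abs(reading - mean) / std > self.threshold
        self.values.append(reading)
        return anomalous
```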
## Monitoring Real-Time ML
Production predictions must be monitored:
**Latency:** P95, P99 latency tracked; alert if >threshold
**Accuracy:** Compare predictions against actuals as delayed labels arrive
**Error rate:** Model failures, timeouts; alert immediately
**Feature freshness:** Is cache stale? Are features missing?
**Drift:** Prediction distribution changing? Retrain needed?
**Dashboards:**
Real-time latency histogram
Accuracy trend (daily)
Error rate spike detection
Feature freshness status
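A toy version of the latency piece of that dashboard: keep a window of recent latencies and page when P99 crosses the budget. `alert` stands in for your paging hook:

```python
from collections import deque

class LatencyMonitor:
    """Track recent latencies; alert when P99 exceeds the budget."""
    def __init__(self, window=10_000, p99_budget_ms=100.0):
        self.samples = deque(maxlen=window)
        self.p99_budget_ms = p99_budget_ms

    def record(self, latency_ms: float):
        self.samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

    def check(self, alert):
        """Call periodically (e.g., once a second) from a background task."""
        if len(self.samples) >= 100:
            p99 = self.percentile(99)
            if p99 > self.p99_budget_ms:
                alert(f"P99 {p99:.0f}ms exceeds budget {self.p99_budget_ms:.0f}ms")
```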
## Cost Optimization
**GPU inference (1M predictions/day):**
Naive: Every prediction hits the GPU; provisioning for ~300 requests/sec at peak = 3 T4 GPUs ≈ $1K/day
Cached (60% hit rate): Only 40% of traffic reaches the GPU = 1.2 GPUs ≈ $400/day
Edge inference: Near-zero marginal cost (hardware amortized upfront)
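The caching math above, made explicit with the same illustrative figures:

```python
# Illustrative figures from above: 3 T4 GPUs ≈ $1K/day when every
# prediction hits the GPU, and a 60% cache hit rate.
gpus_needed = 3
cost_per_gpu_day = 1000 / 3
cache_hit_rate = 0.60

gpu_fraction = 1 - cache_hit_rate              # 40% of traffic reaches the GPU
effective_gpus = gpus_needed * gpu_fraction    # 3 * 0.4 = 1.2 GPUs
daily_cost = effective_gpus * cost_per_gpu_day
print(f"{effective_gpus:.1f} GPUs, ${daily_cost:.0f}/day")  # 1.2 GPUs, $400/day
```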
**Trade-offs:**
Raw GPU inference: Highest accuracy; highest cost
Cached: Balance cost and accuracy; stale predictions
Edge: Lowest cost; limited model complexity
## The Bottom Line
Real-time ML isn't just about fast models. It's about architecture:
Feature serving (cache + streaming)
Model optimization (quantization, distillation)
Caching (multi-level)
Monitoring (latency, accuracy, drift)
Fallback (if model unavailable)
Build this, and you can run production ML at sub-100ms latency with 99.9% availability.
Skip any piece, and you'll discover it in production when transactions fail and customers complain.
Senthil Kumar
Founder & CEO, Sentos Technologies