# Modern Data Warehouse Architecture: Building Analytics Infrastructure
A company has data scattered across systems: CRM records in Salesforce, transactions in PostgreSQL, logs in S3, web events in GA4. Answering "How much revenue did each customer generate?" means querying four systems and manually joining the results.
A data warehouse centralizes that data. Extract from all sources, transform into a consistent schema, load into the warehouse. Now one query answers the question in seconds.
## Data Warehouse Basics
**Purpose:** Single source of truth for analytics and business intelligence
**Architecture:**
```
Source Systems (CRM, ERP, databases, APIs, logs)
        ↓
ETL Pipeline (Extract, Transform, Load)
        ↓
Data Warehouse (Central repository)
        ↓
Analytics Tools (Dashboards, SQL queries, reports)
```
**Key characteristics:**
- Optimized for read queries (not writes)
- Denormalized schema (the opposite of OLTP databases)
- Time-series data (historical tracking)
- Large volume (billions of rows)
- High concurrency (many analysts querying simultaneously)
## Warehouse Schema Patterns
### Dimensional Model (Star Schema)
Most common pattern for analytics.
```
Fact table (transactions):
  transaction_id, customer_id, product_id, date_id, amount, quantity

Dimension tables:
  customers: customer_id, name, email, region
  products:  product_id, name, category, price
  dates:     date_id, year, month, day, quarter
```

Query: total revenue by region last quarter

```sql
SELECT c.region, SUM(f.amount) AS revenue
FROM facts f
JOIN customers c ON f.customer_id = c.customer_id
JOIN dates d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY c.region;
```
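Expressed as DDL, the same star schema might look like the following sketch; the types and constraints are illustrative assumptions, not a prescription:

```sql
-- Illustrative star-schema DDL; names and types are assumptions.
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    name        VARCHAR(255),
    email       VARCHAR(255),
    region      VARCHAR(64)
);

CREATE TABLE facts (
    transaction_id BIGINT PRIMARY KEY,
    customer_id    BIGINT REFERENCES customers (customer_id),
    product_id     BIGINT,          -- would reference products
    date_id        INT,             -- would reference dates
    amount         DECIMAL(12, 2),
    quantity       INT
);
```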
**Advantages:**
- Easy to understand
- Fast queries (denormalized)
- Scalable
### Kimball vs. Inmon
**Kimball (bottom-up):**
- Build dimensional models per business process (sales, marketing, HR)
- Conformed dimensions (a shared customer dimension across all models; sketched below)
- Flexible; supports many use cases
- Popular; most companies use this
**Inmon (top-down):**
Normalize everything into 3NF
Single "golden source" of truth
Complex queries; slower analytics
Enterprise; traditional
**Most companies:** Kimball (easier to implement)
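To make conformed dimensions concrete, here is a minimal sketch: two fact tables from different business processes share one customer dimension, so "customer" means the same thing everywhere. Table and column names are illustrative assumptions:

```sql
-- One conformed customer dimension, shared across business processes.
CREATE TABLE dim_customer (
    customer_id BIGINT PRIMARY KEY,
    name        VARCHAR(255),
    region      VARCHAR(64)
);

CREATE TABLE fact_sales (
    sale_id     BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES dim_customer (customer_id),
    amount      DECIMAL(12, 2)
);

CREATE TABLE fact_support_tickets (
    ticket_id   BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES dim_customer (customer_id),
    resolved    BOOLEAN
);

-- Both processes roll up through the same dimension,
-- so "region" is defined identically in both reports:
SELECT d.region, SUM(s.amount) AS revenue
FROM fact_sales s JOIN dim_customer d ON s.customer_id = d.customer_id
GROUP BY d.region;

SELECT d.region, COUNT(*) AS open_tickets
FROM fact_support_tickets t JOIN dim_customer d ON t.customer_id = d.customer_id
WHERE NOT t.resolved
GROUP BY d.region;
```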
## Data Warehouse Tools
**Cloud-native (popular):**
- Snowflake: Easy to use; expensive per query
- BigQuery: Google's serverless warehouse; great for large queries
- Redshift: AWS; tight integration with the AWS ecosystem
- Databricks: Unified analytics; combines warehouse + lakehouse
**Open-source:**
- Apache Iceberg: Open table format (not a warehouse itself); supports time travel
- DuckDB: In-process analytical database; fast; single-file storage
**Traditional:**
- Oracle, SQL Server, Teradata (legacy)
**Evaluation:**
| | Snowflake | BigQuery | Redshift |
| --- | --- | --- | --- |
| Cost (per TB scanned) | $3-4 | $6-7 | $6-10 |
| Speed (analytical query) | 5-30 seconds | 2-10 seconds | 10-60 seconds |
| Ease of use | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
## ETL Pipeline Design
Extract data from sources → Transform to warehouse schema → Load into warehouse
**Frequency:**
- Real-time (streaming): Data available in the warehouse within seconds
- Batch (daily): Data available next morning
- Hybrid: Streaming for critical data; batch for bulk
**Pipeline tools:**
- dbt (data build tool): SQL-based transformations; popular
- Apache Airflow: Workflow orchestration
- Talend, Informatica: Enterprise ETL tools
- Python/Pandas: Custom ETL scripts
**Example dbt model:**
```sql
-- models/customers_summary.sql
SELECT
    c.customer_id,
    c.name,
    COUNT(DISTINCT t.transaction_id) AS transaction_count,
    SUM(t.amount) AS total_spent,
    MAX(t.date) AS last_purchase_date
FROM {{ ref('customers') }} c
LEFT JOIN {{ ref('transactions') }} t
    ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.name
```
dbt handles:
- Source data lineage
- Testing (not null, uniqueness, foreign keys; see the sketch below)
- Documentation
- Version control
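Schema tests like not-null and uniqueness are declared in YAML, but dbt also supports singular tests: plain SQL files under `tests/` that pass only when they return zero rows. A minimal sketch (the file name and the invariant are assumptions):

```sql
-- tests/assert_total_spent_non_negative.sql
-- dbt treats every returned row as a test failure.
SELECT customer_id, total_spent
FROM {{ ref('customers_summary') }}
WHERE total_spent < 0
```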
## Data Warehouse Optimization
### Partitioning
Split large tables by date or region for faster queries.
```sql
-- Partitioned by date (MySQL-style syntax; dialects vary)
CREATE TABLE transactions_partitioned (
    transaction_id   BIGINT,
    customer_id      BIGINT,
    amount           DECIMAL(12, 2),
    transaction_date DATE
)
PARTITION BY RANGE (YEAR(transaction_date)) (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026),
    PARTITION p2026 VALUES LESS THAN (2027)
);

-- Query benefits: only scans the p2026 partition
SELECT * FROM transactions_partitioned
WHERE transaction_date >= '2026-01-01';
```
### Clustering
Group similar rows together for faster queries.
```sql
-- Snowflake-style clustering; dialects vary
CREATE TABLE customers_clustered (
    customer_id BIGINT,
    name        VARCHAR,
    region      VARCHAR,
    country     VARCHAR
)
CLUSTER BY (region, country);

-- Query benefits: region/country filters are fast
SELECT * FROM customers_clustered
WHERE region = 'US' AND country = 'USA';
```
### Columnar Storage
Store by column (not row) for analytical workloads.
**Row storage:**
```
[id=1, name='Alice', amount=100]
[id=2, name='Bob',   amount=200]
```
Scanning for "amount > 150" requires reading all columns.
**Columnar storage:**
```
id:     [1, 2]
name:   ['Alice', 'Bob']
amount: [100, 200]
```
Scanning for "amount > 150" reads only the amount column (roughly one third of the I/O).
All modern warehouses (Snowflake, BigQuery, Redshift) use columnar storage.
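The practical consequence: select only the columns a query needs. A quick sketch against a hypothetical transactions table:

```sql
-- A columnar engine reads only the referenced column:
SELECT SUM(amount) FROM transactions WHERE amount > 150;

-- SELECT * forces reads of every column; avoid it in analytical queries:
SELECT * FROM transactions WHERE amount > 150;
```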
## Real-World Data Warehouse Scenarios
### Scenario 1: E-commerce Analytics
Company wants: "Revenue by product category by region by day"
Without warehouse: Scrape 3 systems manually; merge in Excel; stale data.
With warehouse:
- ETL loads order data daily
- Schema: fact_orders, dim_products, dim_regions, dim_dates
- Query: Multi-second response (sketched below)
- Dashboard: Auto-refresh daily
Result: Real-time business visibility
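The headline question then becomes a single query. A sketch against the schema above; the join-key column names are assumptions:

```sql
SELECT d.day, p.category, r.region, SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_regions  r ON f.region_id  = r.region_id
JOIN dim_dates    d ON f.date_id    = d.date_id
GROUP BY d.day, p.category, r.region
ORDER BY d.day, revenue DESC;
```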
### Scenario 2: Customer Segmentation
Marketing wants: "How many customers spent $1000+ in last quarter?"
Without warehouse: Database query (takes 10 minutes, locks production tables)
With warehouse:
- Pre-computed table: customers_q1_2026 (sketched below)
- Aggregates: total_spent, purchase_count, days_since_last_purchase
- Query: Instant response
- Segment updated daily; marketing can run campaigns immediately
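A sketch of how that pre-computed table might be built; the source table, column names, and Snowflake-style DATEDIFF are assumptions:

```sql
-- Rebuilt daily by the ETL pipeline.
CREATE TABLE customers_q1_2026 AS
SELECT
    customer_id,
    SUM(amount) AS total_spent,
    COUNT(*)    AS purchase_count,
    -- Date arithmetic varies by warehouse dialect:
    DATEDIFF('day', MAX(transaction_date), CURRENT_DATE) AS days_since_last_purchase
FROM fact_transactions
WHERE transaction_date >= '2026-01-01'
  AND transaction_date <  '2026-04-01'
GROUP BY customer_id;

-- Marketing's question is now a cheap filter:
SELECT COUNT(*) FROM customers_q1_2026 WHERE total_spent >= 1000;
```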
### Scenario 3: Compliance Reporting
Audit requires: "All transactions from account X, month Y, with customer details"
Without warehouse: Manual query; hours to compile.
With warehouse:
- Fact table: Immutable (never deleted)
- Historical dimension tables: Track changes, e.g. a customer moving from region A to B (query sketched below)
- Query: Instant compliance report
- Audit trail: Full history available
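Tracking dimension history this way is commonly called a Type 2 slowly changing dimension: each change appends a new row with validity dates instead of overwriting the old one. A sketch of the compliance query; the table names and `valid_from`/`valid_to` columns are assumptions:

```sql
-- Each customer version is a separate row in dim_customers, valid over
-- [valid_from, valid_to). Join on the version current at transaction time.
SELECT
    f.transaction_id,
    f.amount,
    f.transaction_date,
    c.name,
    c.region  -- region as of the transaction date, not today
FROM fact_transactions f
JOIN dim_customers c
  ON  f.customer_id      = c.customer_id
  AND f.transaction_date >= c.valid_from
  AND f.transaction_date <  c.valid_to
WHERE f.account_id = 'X'                       -- "account X" from the audit request
  AND f.transaction_date >= DATE '2026-02-01'  -- "month Y", illustrative
  AND f.transaction_date <  DATE '2026-03-01';
```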
## Data Warehouse Costs
**Snowflake (typical SaaS company):**
```
Data storage: 10TB = $200/month
Compute (1 warehouse, 8 credits/day): $400/month
Total: $600/month
```
**BigQuery:**
```
Data storage: 10TB = $200/month
Query cost (500TB scanned/month at ~$7/TB): $3,500/month
Total: $3,700/month
```
**Optimization:**
- Partition tables (reduce scans)
- Cluster tables (reduce scans)
- Archive old data (reduce storage)
- Use caching (avoid re-scanning)
## Common Data Warehouse Mistakes
1. **No partitioning**: Queries scan the entire table (slow, expensive)
2. **Over-normalized schema**: Too many joins; slow queries
3. **Poor data quality**: Garbage in, garbage out
4. **No documentation**: Analysts don't know what the data means
5. **No governance**: Unauthorized access to sensitive data
6. **Manual ETL**: Unscalable; error-prone
7. **No version control**: Can't roll back bad transforms
## The Bottom Line
Data warehouses democratize analytics. Instead of "ask IT for a report (2 weeks)," analysts query data themselves (minutes).
Build the warehouse once. Analysts use it forever. ROI pays back in months.
Start with one cloud warehouse (Snowflake or BigQuery), dbt for transforms, and a BI tool for dashboards. Done.