
13 May 2026 · 15 min read · Senthil Kumar

# Modern Data Warehouse Architecture: Building Analytics Infrastructure

A company has data scattered across systems: CRM records in Salesforce, transactions in PostgreSQL, logs in S3, web events in GA4. Answering "How much revenue did each customer generate?" requires querying four separate systems and manually joining the results.

A data warehouse centralizes that data. Extract from all sources, transform into a consistent schema, load into the warehouse. Now one query answers the question in seconds.
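As a sketch of what that single query can look like once everything lands in one schema (table and column names here are hypothetical):

```sql
-- Hypothetical warehouse query: revenue per customer across all sources
SELECT
    c.customer_id,
    c.name,
    SUM(o.amount) AS total_revenue
FROM dim_customers c
JOIN fact_orders o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_revenue DESC;
```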

## Data Warehouse Basics

**Purpose:** Single source of truth for analytics and business intelligence

**Architecture:**

```
Source Systems (CRM, ERP, databases, APIs, logs)
        ↓
ETL Pipeline (Extract, Transform, Load)
        ↓
Data Warehouse (Central repository)
        ↓
Analytics Tools (Dashboards, SQL queries, reports)
```

**Key characteristics:**

- Optimized for read queries (not writes)
- Denormalized schema (the opposite of OLTP databases)
- Time-series data (historical tracking)
- Large volume (billions of rows)
- High concurrency (many analysts querying simultaneously)

## Warehouse Schema Patterns

### Dimensional Model (Star Schema)

Most common pattern for analytics.

```
Fact table (transactions):
  transaction_id, customer_id, product_id, date_id, amount, quantity

Dimension tables:
  customers: customer_id, name, email, region
  products:  product_id, name, category, price
  dates:     date_id, year, month, day, quarter

-- Query: total revenue by region last quarter
SELECT customers.region, SUM(facts.amount) AS revenue
FROM facts
JOIN customers ON facts.customer_id = customers.customer_id
JOIN dates     ON facts.date_id = dates.date_id
WHERE dates.quarter = 'Q1' AND dates.year = 2026
GROUP BY customers.region
```

**Advantages:**

- Easy to understand
- Fast queries (denormalized)
- Scalable
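For concreteness, here is a minimal DDL sketch of the star schema above. The column types are assumptions (every warehouse dialect differs slightly), and the products dimension is omitted for brevity:

```sql
-- Star-schema DDL sketch; types are assumed, not prescribed
CREATE TABLE dim_customers (
    customer_id BIGINT,
    name        VARCHAR,
    email       VARCHAR,
    region      VARCHAR
);

CREATE TABLE dim_dates (
    date_id INT,
    year    INT,
    month   INT,
    day     INT,
    quarter VARCHAR
);

CREATE TABLE fact_transactions (
    transaction_id BIGINT,
    customer_id    BIGINT,        -- joins to dim_customers
    product_id     BIGINT,        -- joins to dim_products (omitted here)
    date_id        INT,           -- joins to dim_dates
    amount         DECIMAL(12,2),
    quantity       INT
);
```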

### Kimball vs. Inmon

**Kimball (bottom-up):**

- Build dimensional models per business process (sales, marketing, HR)
- Conformed dimensions: one shared customer dimension across all models (see the sketch after this comparison)
- Flexible; supports many use cases
- Popular; most companies use this

**Inmon (top-down):**

- Normalize everything into 3NF
- Single "golden source" of truth
- Complex queries; slower analytics
- Enterprise, traditional approach

**Most companies:** Kimball (easier to implement)
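What a conformed dimension looks like in practice: a sketch (table names are assumptions) where two business processes share one customer dimension, so "customer" means the same thing in every report:

```sql
-- One conformed customer dimension shared by every dimensional model
CREATE TABLE dim_customer (
    customer_id BIGINT,
    name        VARCHAR,
    region      VARCHAR
);

-- Sales and marketing each get their own fact table,
-- but both key into the same dim_customer
CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id BIGINT,        -- joins to dim_customer
    amount      DECIMAL(12,2)
);

CREATE TABLE fact_campaign_touches (
    touch_id    BIGINT,
    customer_id BIGINT,        -- joins to the same dim_customer
    channel     VARCHAR
);
```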

## Data Warehouse Tools

**Cloud-native (popular):**

- Snowflake: easy to use; expensive per query
- BigQuery: Google Cloud; serverless; great for large queries
- Redshift: AWS; tight integration with the AWS ecosystem
- Databricks: unified analytics; combines warehouse and lakehouse

**Open-source:**

- Apache Iceberg: open table format with time-travel support
- DuckDB: in-process; fast; single-file databases

**Traditional:**

Oracle, SQL Server, Teradata (legacy)

**Evaluation:**

|                          | Snowflake | BigQuery | Redshift |
| ------------------------ | --------- | -------- | -------- |
| Cost (per TB scanned)    | $3-4      | $6-7     | $6-10    |
| Speed (analytical query) | 5-30 seconds | 2-10 seconds | 10-60 seconds |
| Ease of use              | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |

## ETL Pipeline Design

Extract data from sources → Transform to warehouse schema → Load into warehouse

**Frequency:**

- Real-time (streaming): data available in the warehouse within seconds
- Batch (daily): data available the next morning
- Hybrid: streaming for critical data, batch for bulk (an incremental-load sketch follows this list)
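A common batch pattern is an incremental upsert from a staging table into the warehouse. A minimal sketch using standard MERGE (table and column names are assumptions):

```sql
-- Hypothetical nightly load: upsert new or changed rows
-- from staging_transactions into the fact table
MERGE INTO fact_transactions t
USING staging_transactions s
    ON t.transaction_id = s.transaction_id
WHEN MATCHED THEN UPDATE SET
    t.amount   = s.amount,
    t.quantity = s.quantity
WHEN NOT MATCHED THEN INSERT
    (transaction_id, customer_id, product_id, date_id, amount, quantity)
    VALUES (s.transaction_id, s.customer_id, s.product_id, s.date_id, s.amount, s.quantity);
```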

**Pipeline tools:**

- dbt (data build tool): SQL-based transformations; popular
- Apache Airflow: workflow orchestration
- Talend, Informatica: enterprise ETL tools
- Python/pandas: custom ETL scripts

**Example dbt model:**

```sql
-- models/customers_summary.sql
SELECT
    c.customer_id,
    c.name,
    COUNT(DISTINCT t.transaction_id) AS transaction_count,
    SUM(t.amount)                    AS total_spent,
    MAX(t.date)                      AS last_purchase_date
FROM {{ ref('customers') }} c
LEFT JOIN {{ ref('transactions') }} t
    ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.name
```

dbt handles:

- Source data lineage
- Testing (not null, uniqueness, foreign keys); see the example test below
- Documentation
- Version control
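To illustrate the testing point: a dbt "singular" test is just a SQL file under tests/ that returns the rows violating a rule, and dbt fails the test if any rows come back. A minimal sketch (file name and rule are assumptions):

```sql
-- tests/assert_no_negative_amounts.sql
-- dbt treats any returned row as a test failure
SELECT *
FROM {{ ref('transactions') }}
WHERE amount < 0
```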

## Data Warehouse Optimization

### Partitioning

Split large tables by date or region for faster queries.

```sql
-- Partitioned by year of transaction_date
-- (MySQL-style range partitioning; column types assumed)
CREATE TABLE transactions_partitioned (
    transaction_id   BIGINT,
    customer_id      BIGINT,
    amount           DECIMAL(12,2),
    transaction_date DATE
)
PARTITION BY RANGE (YEAR(transaction_date)) (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026),
    PARTITION p2026 VALUES LESS THAN (2027)
);

-- Query benefit: only the p2026 partition is scanned
SELECT * FROM transactions_partitioned
WHERE transaction_date >= '2026-01-01';
```

### Clustering

Group similar rows together for faster queries.

```sql
-- Clustered table (Snowflake-style CLUSTER BY; column types assumed)
CREATE TABLE customers_clustered (
    customer_id BIGINT,
    name        VARCHAR,
    region      VARCHAR,
    country     VARCHAR
)
CLUSTER BY (region, country);

-- Query benefit: filters on region/country scan far less data
SELECT * FROM customers_clustered
WHERE region = 'US' AND country = 'USA';
```

### Columnar Storage

Store by column (not row) for analytical workloads.

**Row storage:**

```
[id=1, name='Alice', amount=100]
[id=2, name='Bob', amount=200]
```

Scanning for "amount > 150" requires reading all columns.

**Columnar storage:**

```
id:     [1, 2]
name:   ['Alice', 'Bob']
amount: [100, 200]
```

Scanning for "amount > 150" reads only amount column (1/3 I/O).

All modern warehouses (Snowflake, BigQuery, Redshift) use columnar storage.
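One practical consequence: in a columnar warehouse, the bytes scanned (and often the bill) depend on which columns a query touches. An aggregate like the sketch below reads a single column rather than the whole table:

```sql
-- Touches only the amount column; in scan-priced warehouses
-- (e.g., BigQuery on-demand) that also reduces query cost
SELECT SUM(amount) AS total
FROM transactions
WHERE amount > 150;
```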

## Real-World Data Warehouse Scenarios

### Scenario 1: E-commerce Analytics

Company wants: "Revenue by product category by region by day"

Without warehouse: Pull from three systems manually; merge in Excel; the data is stale before the report ships.

With warehouse:

- ETL loads order data daily
- Schema: fact_orders, dim_products, dim_regions, dim_dates
- Query: multi-second response (see the sketch below)
- Dashboard: auto-refreshes daily
- Result: up-to-date business visibility
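The question itself becomes one query. A sketch, assuming the dimensional schema listed above (join keys are assumptions):

```sql
-- Revenue by product category by region by day
SELECT
    d.year, d.month, d.day,
    p.category,
    r.region,
    SUM(o.amount) AS revenue
FROM fact_orders o
JOIN dim_products p ON o.product_id = p.product_id
JOIN dim_regions  r ON o.region_id  = r.region_id
JOIN dim_dates    d ON o.date_id    = d.date_id
GROUP BY d.year, d.month, d.day, p.category, r.region;
```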

### Scenario 2: Customer Segmentation

Marketing wants: "How many customers spent $1000+ in the last quarter?"

Without warehouse: A query against the production database takes 10 minutes and locks production tables.

With warehouse:

- Pre-computed table: customers_q1_2026 (see the sketch below)
- Aggregates: total_spent, purchase_count, days_since_last_purchase
- Query: instant response
- Segment updated daily; marketing can run campaigns immediately
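A sketch of how such a pre-computed table might be built, assuming a transactions table with customer_id, amount, and date columns (the date-difference function is Snowflake-style):

```sql
-- Rebuild the Q1 2026 segment table daily
CREATE TABLE customers_q1_2026 AS
SELECT
    customer_id,
    SUM(amount) AS total_spent,
    COUNT(*)    AS purchase_count,
    DATEDIFF('day', MAX(date), CURRENT_DATE) AS days_since_last_purchase
FROM transactions
WHERE date >= '2026-01-01' AND date < '2026-04-01'
GROUP BY customer_id;

-- Marketing's question is then a trivial filter
SELECT COUNT(*) AS big_spenders
FROM customers_q1_2026
WHERE total_spent >= 1000;
```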

### Scenario 3: Compliance Reporting

Audit requires: "All transactions from account X, month Y, with customer details"

Without warehouse: Manual queries across systems; hours to compile.

With warehouse:

- Fact table: immutable (rows are never deleted)
- Historical dimension tables: track changes over time, e.g., a customer moving from region A to region B (see the sketch below)
- Query: instant compliance report
- Audit trail: full history available
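Tracking dimension history like this is commonly done with a Type 2 slowly changing dimension: one row per version of each customer, with validity dates. A minimal sketch (columns are assumptions):

```sql
-- Hypothetical Type 2 customer dimension
CREATE TABLE dim_customers_history (
    customer_id BIGINT,
    region      VARCHAR,
    valid_from  DATE,
    valid_to    DATE,      -- NULL for the current version
    is_current  BOOLEAN
);

-- "Which region was customer X in on a given date?"
SELECT region
FROM dim_customers_history
WHERE customer_id = 42
  AND valid_from <= DATE '2026-02-15'
  AND (valid_to IS NULL OR valid_to > DATE '2026-02-15');
```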

## Data Warehouse Costs

**Snowflake (typical SaaS company):**

```
Data storage: 10TB                     $200/month
Compute (1 warehouse, 8 credits/day):  $400/month
Total:                                 $600/month
```

**BigQuery:**

```
Data storage: 10TB                 $200/month
Query cost (500TB scanned/month):  $3,500/month
Total:                             $3,700/month
```

**Optimization:**

- Partition tables (reduce scans)
- Cluster tables (reduce scans)
- Archive old data (reduce storage)
- Use caching (avoid re-scanning)

## Common Data Warehouse Mistakes

1. **No partitioning**: queries scan the entire table (slow, expensive)
2. **Normalized schema**: too many joins; slow queries
3. **Poor data quality**: garbage in, garbage out
4. **No documentation**: analysts don't know what the data means
5. **No governance**: unauthorized access to sensitive data
6. **Manual ETL**: unscalable and error-prone
7. **No version control**: can't roll back bad transforms

## The Bottom Line

Data warehouses democratize analytics. Instead of asking IT for a report and waiting two weeks, analysts query the data themselves in minutes.

Build the warehouse once; analysts use it for years. The investment pays for itself within months.

Start with one cloud warehouse (Snowflake or BigQuery), dbt for transforms, and a BI tool for dashboards. Done.
