The Modern Data Stack: dbt, Snowflake, Airflow
The modern data stack has transformed how companies handle data. Instead of monolithic ETL tools, we now have composable, specialized components. Here’s what you need to know.
The Stack
┌─────────────────────────────────────────────────────┐
│ BI Layer (Looker, Metabase, Tableau)                │
├─────────────────────────────────────────────────────┤
│ Transformation (dbt)                                │
├─────────────────────────────────────────────────────┤
│ Data Warehouse (Snowflake, BigQuery)                │
├─────────────────────────────────────────────────────┤
│ Orchestration (Airflow)                             │
├─────────────────────────────────────────────────────┤
│ Ingestion (Fivetran, Airbyte, Stitch)               │
├─────────────────────────────────────────────────────┤
│ Data Sources (Databases, APIs, SaaS tools, Events)  │
└─────────────────────────────────────────────────────┘
Component 1: Cloud Data Warehouses
Snowflake
The poster child of modern data warehousing:
-- Separation of storage and compute
USE WAREHOUSE analytics_wh;

SELECT
    date_trunc('month', created_at) AS month,
    count(*) AS signups
FROM users
GROUP BY 1
ORDER BY 1;
Why Snowflake won:
- Separate compute and storage
- Near-infinite scaling
- Zero maintenance
- Pay for what you use
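Because compute is decoupled from storage, you can resize a warehouse or spin up an isolated one without moving any data. A sketch in Snowflake DDL (warehouse names reuse the example above; sizes are illustrative):

```sql
-- Resize compute without touching stored data
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- A second, isolated warehouse for BI queries, so dashboards
-- never contend with transformation jobs
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 60     -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE;
```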
BigQuery
Google’s serverless alternative:
- No cluster management
- Automatic scaling
- Slot-based pricing
Why Not Traditional Databases?
| Traditional RDBMS | Cloud Warehouse |
|---|---|
| Provision upfront | Scale on demand |
| Maintenance burden | Managed service |
| Limited scale | Petabyte scale |
| Fixed cost | Pay per use |
Component 2: dbt (Data Build Tool)
dbt transforms data using SQL:
-- models/staging/stg_orders.sql
SELECT
    id AS order_id,
    user_id,
    status,
    amount,
    created_at,
    updated_at
FROM {{ source('raw', 'orders') }}
WHERE status != 'deleted'

-- models/marts/monthly_revenue.sql
SELECT
    date_trunc('month', created_at) AS month,
    SUM(amount) AS revenue,
    COUNT(DISTINCT user_id) AS customers
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1
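At compile time dbt replaces {{ ref('stg_orders') }} with the fully qualified relation name. Approximately what lands in target/compiled/ (the database and schema names here are assumptions about your project config):

```sql
-- target/compiled/.../monthly_revenue.sql (approximate)
SELECT
    date_trunc('month', created_at) AS month,
    SUM(amount) AS revenue,
    COUNT(DISTINCT user_id) AS customers
FROM analytics.staging.stg_orders
WHERE status = 'completed'
GROUP BY 1
```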
Why dbt Matters
Version Control: SQL in git.
git checkout -b feature/add-ltv-model
# Edit models
git commit -m "Add customer LTV model"
git push
# Create PR, review, merge
Testing:
# schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_users')
              field: user_id
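Each of these tests compiles to a query that returns failing rows; a zero-row result means the test passes. A minimal Python sketch of the same two checks (function names are mine, not dbt's):

```python
from collections import Counter

def failing_unique(rows, column):
    """Rows whose value in `column` appears more than once (dbt's `unique`)."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] > 1]

def failing_not_null(rows, column):
    """Rows where `column` is NULL (dbt's `not_null`)."""
    return [row for row in rows if row[column] is None]

orders = [
    {"order_id": 1, "user_id": 10},
    {"order_id": 1, "user_id": 11},    # duplicate order_id -> fails `unique`
    {"order_id": 2, "user_id": None},  # NULL user_id -> fails `not_null`
]

assert len(failing_unique(orders, "order_id")) == 2
assert len(failing_not_null(orders, "user_id")) == 1
```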
Documentation:
-- models/marts/customer_ltv.sql
{{ config(materialized='table') }}
Model and column descriptions live in schema.yml, not in config(). Run dbt docs generate for browsable documentation.
Lineage:
dbt automatically tracks dependencies between models. Visualize how data flows through your warehouse.
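A toy sketch of how that lineage can be derived: parse the {{ ref('...') }} calls out of each model's SQL and you have the dependency graph (the regex and model contents here are illustrative, not dbt's internals):

```python
import re

# Model name -> raw SQL, mirroring the examples above
models = {
    "stg_orders": "SELECT ... FROM {{ source('raw', 'orders') }}",
    "stg_users": "SELECT ... FROM {{ source('raw', 'users') }}",
    "monthly_revenue": "SELECT ... FROM {{ ref('stg_orders') }}",
    "customer_ltv": ("SELECT ... FROM {{ ref('stg_orders') }} "
                     "JOIN {{ ref('stg_users') }} USING (user_id)"),
}

REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

# name -> set of upstream models it depends on
lineage = {name: set(REF.findall(sql)) for name, sql in models.items()}

assert lineage["customer_ltv"] == {"stg_orders", "stg_users"}
assert lineage["stg_orders"] == set()  # staging depends only on sources
```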
Component 3: Airflow
Apache Airflow orchestrates pipelines:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'retries': 1,
}

with DAG(
    'daily_data_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2021, 1, 1),
    catchup=False,  # don't backfill a run for every day since start_date
) as dag:
    extract = BashOperator(
        task_id='extract_from_api',
        bash_command='python /scripts/extract.py',
    )

    load = SnowflakeOperator(
        task_id='load_to_warehouse',
        snowflake_conn_id='snowflake_default',
        sql='COPY INTO raw.events FROM @stage/events/',
    )

    transform = BashOperator(
        task_id='run_dbt',
        bash_command='dbt run --select staging marts',
    )

    extract >> load >> transform
Airflow Concepts
- DAG: Directed Acyclic Graph of tasks
- Operators: Types of tasks (Bash, Python, SQL)
- Sensors: Wait for conditions
- XComs: Pass data between tasks
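The >> syntax in the DAG above is just operator overloading on the task objects. A stripped-down sketch, with no Airflow dependency (the Task class is mine):

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a and returns b,
        # which is what makes `a >> b >> c` chain left to right
        self.downstream.append(other)
        return other

extract, load, transform = Task("extract"), Task("load"), Task("transform")
extract >> load >> transform

assert [t.task_id for t in extract.downstream] == ["load"]
assert [t.task_id for t in load.downstream] == ["transform"]
```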
Component 4: Ingestion
Getting data into the warehouse:
Fivetran / Airbyte / Stitch
Source (Postgres, Salesforce, Stripe)
↓
Connector
↓
Data Warehouse
# Airbyte configuration example
source:
  name: postgres_production
  type: postgres
  config:
    host: db.example.com
    database: production

destination:
  name: snowflake
  type: snowflake
  config:
    warehouse: LOADING_WH
    database: RAW

sync:
  schedule: "0 */6 * * *"  # Every 6 hours
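The */6 in that cron expression's hour field expands to concrete hours. A tiny sketch of the expansion for single fields (the helper name is mine; real cron parsing handles ranges and lists too):

```python
def expand_cron_field(field, max_value):
    """Expand one cron field like '*/6', '0', or '*' into concrete values."""
    if field == "*":
        return list(range(max_value))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, max_value, step))
    return [int(field)]

# "0 */6 * * *" -> minute 0 of hours 0, 6, 12, 18
assert expand_cron_field("*/6", 24) == [0, 6, 12, 18]
assert expand_cron_field("0", 60) == [0]
```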
Event Data
For application events:
# Application code
import segment.analytics as analytics

analytics.write_key = 'YOUR_WRITE_KEY'  # from your Segment source settings

analytics.track(user_id, 'Order Completed', {
    'order_id': order.id,
    'amount': order.total,
    'items': len(order.items),
})
Events flow to warehouse via Segment, Rudderstack, or similar.
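Under the hood, a track call is just a JSON payload posted to a collection API. A hand-rolled sketch of the payload (field names approximate Segment's track spec; this is not the SDK):

```python
import json
from datetime import datetime, timezone

def build_track_payload(user_id, event, properties):
    """Assemble a Segment-style `track` payload (illustrative)."""
    return {
        "type": "track",
        "userId": user_id,
        "event": event,
        "properties": properties,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

payload = build_track_payload(
    "u_42", "Order Completed",
    {"order_id": "o_99", "amount": 59.90, "items": 3},
)
assert json.loads(json.dumps(payload))["event"] == "Order Completed"
```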
Putting It Together
Example Pipeline
Day 0: Setup
- Configure Fivetran for Postgres, Stripe, Salesforce
- Point all to Snowflake RAW schema
- Set up Airflow on managed service
Day 1-2: Staging Models
- dbt models to clean raw data
- Standard naming, types, deduplication
Day 3-5: Marts
- Business logic in dbt models
- Revenue, churn, cohort analysis
Day 6-7: BI
- Connect Looker/Metabase to marts
- Build dashboards
Daily Workflow
6 AM: Fivetran syncs overnight data
7 AM: Airflow triggers dbt run
8 AM: Fresh data in dashboards
9 AM: Team reviews metrics
Why This Stack Wins
Separation of Concerns
Each tool does one thing well:
- Ingestion: Fivetran
- Storage: Snowflake
- Transform: dbt
- Orchestrate: Airflow
- Visualize: Looker
Analytics Engineering
dbt created a new role: analytics engineer. Software engineering practices applied to analytics:
Analytics Engineering Principles
1. Version control all transformations
2. Test data quality
3. Document everything
4. Code review before deploy
5. Modular, reusable models
Cost Efficiency
Pay for what you use. Scale down when idle:
-- Snowflake: Auto-suspend warehouse after 300 seconds idle
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 300;
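Rough credit math, assuming an X-Small warehouse bills about 1 credit per hour with per-second billing and a 60-second minimum per resume (check Snowflake's current pricing; the function is mine):

```python
def credits_used(credits_per_hour, seconds_running):
    """Per-second billing with a 60-second minimum, as Snowflake warehouses do."""
    billable = max(seconds_running, 60)
    return credits_per_hour * billable / 3600

# X-Small (~1 credit/hour) actually running 15 minutes a day
daily = credits_used(1, 15 * 60)
assert round(daily, 2) == 0.25
```

With auto-suspend, you pay for those 15 minutes rather than 24 hours.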
Getting Started
Minimal Stack
PostgreSQL → dbt (local) → Your brain
Learn dbt first. It works with any SQL database.
Growing Stack
Fivetran → Snowflake → dbt → Metabase
Add components as complexity grows.
Final Thoughts
The modern data stack democratized data infrastructure. What once required a team of data engineers now takes one person a week to set up.
Start with your questions. Build the pipeline to answer them. Iterate.
Data without transformation is just noise.