The Modern Data Stack: dbt, Snowflake, Airflow
The modern data stack has transformed how companies handle data. Instead of monolithic ETL tools, we now have composable, specialized components. Here’s what you need to know.
The Stack
┌─────────────────────────────────────────────────────┐
│ BI Layer (Looker, Metabase, Tableau)                │
├─────────────────────────────────────────────────────┤
│ Transformation (dbt)                                │
├─────────────────────────────────────────────────────┤
│ Data Warehouse (Snowflake, BigQuery)                │
├─────────────────────────────────────────────────────┤
│ Orchestration (Airflow)                             │
├─────────────────────────────────────────────────────┤
│ Ingestion (Fivetran, Airbyte, Stitch)               │
├─────────────────────────────────────────────────────┤
│ Data Sources (Databases, APIs, SaaS tools, Events)  │
└─────────────────────────────────────────────────────┘
Component 1: Cloud Data Warehouses
Snowflake
The poster child of modern data warehousing:
-- Separation of storage and compute
USE WAREHOUSE analytics_wh;

SELECT
    date_trunc('month', created_at) AS month,
    count(*) AS signups
FROM users
GROUP BY 1
ORDER BY 1;
Why Snowflake won:
- Separate compute and storage
- Near-infinite scaling
- Zero maintenance
- Pay for what you use
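Because compute is decoupled from storage, you can resize a warehouse or spin up an isolated one without moving any data. A sketch in Snowflake DDL (warehouse names reuse the example above; sizes are illustrative):

```sql
-- Resize compute without touching stored data
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- A second, isolated warehouse for BI queries, so dashboards
-- never contend with transformation jobs
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 60     -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE;
```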
BigQuery
Google’s serverless alternative:
- No cluster management
- Automatic scaling
- Slot-based pricing
Why Not Traditional Databases?
| Traditional RDBMS | Cloud Warehouse |
|---|---|
| Provision upfront | Scale on demand |
| Maintenance burden | Managed service |
| Limited scale | Petabyte scale |
| Fixed cost | Pay per use |
Component 2: dbt (Data Build Tool)
dbt transforms data using SQL:
-- models/staging/stg_orders.sql
SELECT
    id AS order_id,
    user_id,
    status,
    amount,
    created_at,
    updated_at
FROM {{ source('raw', 'orders') }}
WHERE status != 'deleted'

-- models/marts/monthly_revenue.sql
SELECT
    date_trunc('month', created_at) AS month,
    SUM(amount) AS revenue,
    COUNT(DISTINCT user_id) AS customers
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1
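At compile time dbt replaces {{ ref('stg_orders') }} with the fully qualified relation name. Approximately what lands in target/compiled/ (the database and schema names here are assumptions about your project config):

```sql
-- target/compiled/.../monthly_revenue.sql (approximate)
SELECT
    date_trunc('month', created_at) AS month,
    SUM(amount) AS revenue,
    COUNT(DISTINCT user_id) AS customers
FROM analytics.staging.stg_orders
WHERE status = 'completed'
GROUP BY 1
```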
Why dbt Matters
Version Control: SQL in git.
git checkout -b feature/add-ltv-model
# Edit models
git commit -m "Add customer LTV model"
git push
# Create PR, review, merge
Testing:
# schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_users')
              field: user_id
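Each of these tests compiles to a query that returns failing rows; a zero-row result means the test passes. A minimal Python sketch of the same two checks (function names are mine, not dbt's):

```python
from collections import Counter

def failing_unique(rows, column):
    """Rows whose value in `column` appears more than once (dbt's `unique`)."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] > 1]

def failing_not_null(rows, column):
    """Rows where `column` is NULL (dbt's `not_null`)."""
    return [row for row in rows if row[column] is None]

orders = [
    {"order_id": 1, "user_id": 10},
    {"order_id": 1, "user_id": 11},    # duplicate order_id -> fails `unique`
    {"order_id": 2, "user_id": None},  # NULL user_id -> fails `not_null`
]

assert len(failing_unique(orders, "order_id")) == 2
assert len(failing_not_null(orders, "user_id")) == 1
```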
Documentation:
-- models/marts/customer_ltv.sql
{{ config(materialized='table') }}
Model and column descriptions live in schema.yml, not in config(). Run dbt docs generate for browsable documentation.
Lineage:
dbt automatically tracks dependencies between models. Visualize how data flows through your warehouse.
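A toy sketch of how that lineage can be derived: parse the {{ ref('...') }} calls out of each model's SQL and you have the dependency graph (the regex and model contents here are illustrative, not dbt's internals):

```python
import re

# Model name -> raw SQL, mirroring the examples above
models = {
    "stg_orders": "SELECT ... FROM {{ source('raw', 'orders') }}",
    "stg_users": "SELECT ... FROM {{ source('raw', 'users') }}",
    "monthly_revenue": "SELECT ... FROM {{ ref('stg_orders') }}",
    "customer_ltv": ("SELECT ... FROM {{ ref('stg_orders') }} "
                     "JOIN {{ ref('stg_users') }} USING (user_id)"),
}

REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

# name -> set of upstream models it depends on
lineage = {name: set(REF.findall(sql)) for name, sql in models.items()}

assert lineage["customer_ltv"] == {"stg_orders", "stg_users"}
assert lineage["stg_orders"] == set()  # staging depends only on sources
```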
Component 3: Airflow
Apache Airflow orchestrates pipelines:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'retries': 1,
}

with DAG(
    'daily_data_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2021, 1, 1),
    catchup=False,  # don't backfill a run for every day since start_date
) as dag:
    extract = BashOperator(
        task_id='extract_from_api',
        bash_command='python /scripts/extract.py',
    )

    load = SnowflakeOperator(
        task_id='load_to_warehouse',
        snowflake_conn_id='snowflake_default',
        sql='COPY INTO raw.events FROM @stage/events/',
    )

    transform = BashOperator(
        task_id='run_dbt',
        bash_command='dbt run --select staging marts',
    )

    extract >> load >> transform
Airflow Concepts
- DAG: Directed Acyclic Graph of tasks
- Operators: Types of tasks (Bash, Python, SQL)
- Sensors: Wait for conditions
- XComs: Pass data between tasks
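The >> syntax in the DAG above is just operator overloading on the task objects. A stripped-down sketch, with no Airflow dependency (the Task class is mine):

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a and returns b,
        # which is what makes `a >> b >> c` chain left to right
        self.downstream.append(other)
        return other

extract, load, transform = Task("extract"), Task("load"), Task("transform")
extract >> load >> transform

assert [t.task_id for t in extract.downstream] == ["load"]
assert [t.task_id for t in load.downstream] == ["transform"]
```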
Component 4: Ingestion
Getting data into the warehouse:
Fivetran / Airbyte / Stitch
Source (Postgres, Salesforce, Stripe)
↓
Connector
↓
Data Warehouse
# Airbyte configuration example
source:
  name: postgres_production
  type: postgres
  config:
    host: db.example.com
    database: production

destination:
  name: snowflake
  type: snowflake
  config:
    warehouse: LOADING_WH
    database: RAW

sync:
  schedule: "0 */6 * * *"  # Every 6 hours
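The */6 in that cron expression's hour field expands to concrete hours. A tiny sketch of the expansion for single fields (the helper name is mine; real cron parsing handles ranges and lists too):

```python
def expand_cron_field(field, max_value):
    """Expand one cron field like '*/6', '0', or '*' into concrete values."""
    if field == "*":
        return list(range(max_value))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, max_value, step))
    return [int(field)]

# "0 */6 * * *" -> minute 0 of hours 0, 6, 12, 18
assert expand_cron_field("*/6", 24) == [0, 6, 12, 18]
assert expand_cron_field("0", 60) == [0]
```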
Event Data
For application events:
# Application code
import segment.analytics as analytics

analytics.write_key = 'YOUR_WRITE_KEY'  # from your Segment source settings

analytics.track(user_id, 'Order Completed', {
    'order_id': order.id,
    'amount': order.total,
    'items': len(order.items),
})
Events flow to warehouse via Segment, Rudderstack, or similar.
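Under the hood, a track call is just a JSON payload posted to a collection API. A hand-rolled sketch of the payload (field names approximate Segment's track spec; this is not the SDK):

```python
import json
from datetime import datetime, timezone

def build_track_payload(user_id, event, properties):
    """Assemble a Segment-style `track` payload (illustrative)."""
    return {
        "type": "track",
        "userId": user_id,
        "event": event,
        "properties": properties,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

payload = build_track_payload(
    "u_42", "Order Completed",
    {"order_id": "o_99", "amount": 59.90, "items": 3},
)
assert json.loads(json.dumps(payload))["event"] == "Order Completed"
```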
Putting It Together
Example Pipeline
Day 0: Setup
- Configure Fivetran for Postgres, Stripe, Salesforce
- Point all to Snowflake RAW schema
- Set up Airflow on managed service
Day 1-2: Staging Models
- dbt models to clean raw data
- Standard naming, types, deduplication
Day 3-5: Marts
- Business logic in dbt models
- Revenue, churn, cohort analysis
Day 6-7: BI
- Connect Looker/Metabase to marts
- Build dashboards
Daily Workflow
6 AM: Fivetran syncs overnight data
7 AM: Airflow triggers dbt run
8 AM: Fresh data in dashboards
9 AM: Team reviews metrics
Why This Stack Wins
Separation of Concerns
Each tool does one thing well:
- Ingestion: Fivetran
- Storage: Snowflake
- Transform: dbt
- Orchestrate: Airflow
- Visualize: Looker
Analytics Engineering
dbt created a new role: analytics engineer. Software engineering practices applied to analytics:
Analytics Engineering Principles
1. Version control all transformations
2. Test data quality
3. Document everything
4. Code review before deploy
5. Modular, reusable models
Cost Efficiency
Pay for what you use. Scale down when idle:
-- Snowflake: Auto-suspend warehouse after 300 seconds idle
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 300;
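Rough credit math, assuming an X-Small warehouse bills about 1 credit per hour with per-second billing and a 60-second minimum per resume (check Snowflake's current pricing; the function is mine):

```python
def credits_used(credits_per_hour, seconds_running):
    """Per-second billing with a 60-second minimum, as Snowflake warehouses do."""
    billable = max(seconds_running, 60)
    return credits_per_hour * billable / 3600

# X-Small (~1 credit/hour) actually running 15 minutes a day
daily = credits_used(1, 15 * 60)
assert round(daily, 2) == 0.25
```

With auto-suspend, you pay for those 15 minutes rather than 24 hours.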
Getting Started
Minimal Stack
PostgreSQL → dbt (local) → Your brain
Learn dbt first. It works with any SQL database.
Growing Stack
Fivetran → Snowflake → dbt → Metabase
Add components as complexity grows.
Final Thoughts
The modern data stack democratized data infrastructure. What once required a team of data engineers now takes one person a week to set up.
Start with your questions. Build the pipeline to answer them. Iterate.
Data without transformation is just noise.