Data Engineering Best Practices
Comprehensive guide to building robust, scalable data pipelines
Table of Contents
1. Data Pipeline Design Patterns
2. ETL vs ELT Strategies
3. Data Quality Frameworks
4. Schema Design Principles
5. Orchestration Best Practices
1. Data Pipeline Design Patterns
Effective data pipeline design is crucial for building scalable, maintainable data infrastructure. Here are the core patterns every data engineer should know:
Batch Processing Pattern
Best for processing large volumes of data at scheduled intervals.
When to use: Daily aggregations, nightly ETL jobs, historical data processing, monthly reports
-- Example: Daily batch aggregation
WITH daily_metrics AS (
    SELECT
        DATE(timestamp) AS date,
        user_id,
        COUNT(*) AS event_count,
        SUM(revenue) AS total_revenue
    FROM events
    WHERE DATE(timestamp) = CURRENT_DATE - 1
    GROUP BY 1, 2
)
SELECT * FROM daily_metrics;
Stream Processing Pattern
Real-time processing of data as it arrives.
When to use: Real-time analytics, fraud detection, live dashboards, event-driven architectures
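The essence of stream processing is maintaining incremental state as events arrive, rather than scanning a complete dataset. A minimal sketch in plain Python (the function and event data are hypothetical; production systems such as Flink or Spark Structured Streaming add state management, watermarks, and fault tolerance):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def tumbling_window_counts(events, window_seconds=60):
    """Count events per fixed-size (tumbling) time window.

    Simplified: assumes window_seconds evenly divides one minute,
    so truncating seconds/microseconds aligns each event to its window.
    """
    counts = defaultdict(int)
    for ts, _user_id in events:
        # Truncate the timestamp to the start of its window
        window_start = ts - timedelta(
            seconds=ts.second % window_seconds,
            microseconds=ts.microsecond,
        )
        counts[window_start] += 1
    return dict(counts)

events = [
    (datetime(2024, 1, 1, 12, 0, 5), "u1"),
    (datetime(2024, 1, 1, 12, 0, 30), "u2"),
    (datetime(2024, 1, 1, 12, 1, 10), "u1"),
]
print(tumbling_window_counts(events))
# Two events land in the 12:00 window, one in the 12:01 window
```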
Lambda Architecture
Combines batch and stream processing for both real-time and historical analysis.
- Batch Layer: Processes complete historical dataset
- Speed Layer: Handles real-time data with low latency
- Serving Layer: Merges results from both layers
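The serving layer's merge step can be sketched as combining a precomputed batch view with deltas from the speed layer (names and data here are illustrative, assuming an additive metric such as an event count):

```python
def serve_metric(batch_view, speed_view):
    """Serving layer: merge precomputed batch results with real-time deltas.

    batch_view: {key: value} computed by the batch layer up to its last run.
    speed_view: {key: value} accumulated by the speed layer since that run.
    """
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        # Speed-layer values are increments on top of the batch view
        merged[key] = merged.get(key, 0) + delta
    return merged

batch = {"2024-01-01": 1000, "2024-01-02": 950}
speed = {"2024-01-02": 40, "2024-01-03": 12}  # events not yet batch-processed
print(serve_metric(batch, speed))
# {'2024-01-01': 1000, '2024-01-02': 990, '2024-01-03': 12}
```

Once the next batch run absorbs the recent events, the speed-layer state for that period is discarded.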
Medallion Architecture (Bronze/Silver/Gold)
Progressive data refinement through layers:
- Bronze (Raw): Ingested data as-is, minimal transformation
- Silver (Cleaned): Validated, deduplicated, standardized
- Gold (Business): Aggregated, business-ready metrics
-- Bronze Layer: Raw ingestion
CREATE TABLE bronze.events AS
SELECT * FROM source_system;

-- Silver Layer: Cleaned data
CREATE TABLE silver.events AS
SELECT DISTINCT
    *,
    CAST(timestamp AS TIMESTAMP) AS event_time,
    LOWER(TRIM(user_email)) AS normalized_email
FROM bronze.events
-- Filter on the source expression: the event_time alias
-- is not visible in WHERE in standard SQL
WHERE CAST(timestamp AS TIMESTAMP) >= '2026-01-01';

-- Gold Layer: Business metrics
CREATE TABLE gold.daily_user_metrics AS
SELECT
    DATE(event_time) AS date,
    COUNT(DISTINCT user_id) AS dau,
    SUM(revenue) AS daily_revenue
FROM silver.events
GROUP BY 1;
2. ETL vs ELT Strategies
ETL (Extract, Transform, Load)
Transform data before loading into the warehouse.
Pros:
- Reduces warehouse storage costs
- Data arrives clean and ready
- Better for limited compute resources
Cons:
- Less flexibility for ad-hoc analysis
- Transformation logic scattered outside the warehouse
- Harder to reprocess historical data
ELT (Extract, Load, Transform)
Load raw data first, transform within the warehouse.
Pros:
- Raw data always available for reprocessing
- Leverages powerful warehouse compute
- Transformations are version-controlled (e.g., dbt)
- Faster time to insights
Cons:
- Higher storage costs
- Requires a robust warehouse
- Higher compute costs
Modern Recommendation: ELT with dbt
-- models/staging/stg_users.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'users') }}
),

cleaned AS (
    SELECT
        user_id,
        LOWER(TRIM(email)) AS email,
        created_at,
        updated_at
    FROM source
    WHERE user_id IS NOT NULL
)

SELECT * FROM cleaned
3. Data Quality Frameworks
Data quality is paramount. Implement these checks at every stage:
The Six Dimensions of Data Quality
- Completeness: No missing critical fields
- Validity: Data conforms to business rules
- Accuracy: Data correctly represents reality
- Consistency: Same data across systems matches
- Timeliness: Data is up-to-date
- Uniqueness: No unexpected duplicates
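Several of these dimensions can be checked with very little code before reaching for a full framework. A minimal sketch (the `quality_report` helper, field names, and email rule are illustrative, not from any library):

```python
import re

def quality_report(rows):
    """Check a few of the six dimensions against a list of row dicts."""
    email_re = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    ids = [r.get("user_id") for r in rows]
    return {
        # Completeness: no missing critical fields
        "complete": all(r.get("user_id") is not None for r in rows),
        # Uniqueness: no unexpected duplicates
        "unique": len(ids) == len(set(ids)),
        # Validity: values conform to business rules
        "valid_emails": all(email_re.match(r.get("email", "")) for r in rows),
    }

rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": "not-an-email"},
]
print(quality_report(rows))
# {'complete': True, 'unique': True, 'valid_emails': False}
```

Accuracy and timeliness usually need an external reference (a source of truth, or ingestion timestamps), which is where dedicated tooling earns its keep.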
Implementing Data Quality Tests
# Example dbt tests in schema.yml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: order_date
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: day
              field: order_date
              interval: 7
Great Expectations Framework
# Python example with Great Expectations (legacy pandas-style API)
import great_expectations as ge

df = ge.read_csv('data.csv')

# Expect column values to not be null
df.expect_column_values_to_not_be_null('user_id')

# Expect column values to be unique
df.expect_column_values_to_be_unique('email')

# Expect column values to match a regex
df.expect_column_values_to_match_regex(
    'email',
    r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
)

# Expect column values to fall within a range
df.expect_column_values_to_be_between('age', 0, 120)
4. Schema Design Principles
Star Schema (Recommended for Analytics)
Central fact table surrounded by dimension tables.
Benefits: Simple queries, fast aggregations, easy for BI tools
-- Fact Table
CREATE TABLE fct_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT REFERENCES dim_date (date_key),
    customer_key INT REFERENCES dim_customer (customer_key),
    product_key INT REFERENCES dim_product (product_key),
    quantity INT,
    revenue DECIMAL(10,2),
    cost DECIMAL(10,2)
);

-- Dimension Tables
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR,
    customer_name VARCHAR,
    city VARCHAR,
    country VARCHAR
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR,
    product_name VARCHAR,
    category VARCHAR,
    subcategory VARCHAR
);
Slowly Changing Dimensions (SCD)
- Type 1: Overwrite old values (no history)
- Type 2: Add new row for each change (full history)
- Type 3: Add columns for previous/current (limited history)
-- SCD Type 2 Example
CREATE TABLE dim_customer_scd2 (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR,
    customer_name VARCHAR,
    email VARCHAR,
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN
);
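The Type 2 update itself is typically done with a SQL MERGE, but the logic is easy to see in plain Python: close out the current row, then append a new version. A sketch against the table shape above (the `apply_scd2_change` helper and sample rows are hypothetical):

```python
from datetime import date

def apply_scd2_change(dim_rows, change, today):
    """Close the current row for a changed customer and append a new version."""
    next_key = max((r["customer_key"] for r in dim_rows), default=0) + 1
    for row in dim_rows:
        if row["customer_id"] == change["customer_id"] and row["is_current"]:
            row["valid_to"] = today        # close out the old version
            row["is_current"] = False
    dim_rows.append({
        "customer_key": next_key,
        "customer_id": change["customer_id"],
        "customer_name": change["customer_name"],
        "email": change["email"],
        "valid_from": today,
        "valid_to": None,                  # open-ended: the current row
        "is_current": True,
    })
    return dim_rows

dim = [{
    "customer_key": 1, "customer_id": "C1", "customer_name": "Ada",
    "email": "ada@old.example", "valid_from": date(2023, 1, 1),
    "valid_to": None, "is_current": True,
}]
apply_scd2_change(
    dim,
    {"customer_id": "C1", "customer_name": "Ada", "email": "ada@new.example"},
    today=date(2024, 6, 1),
)
# dim now holds two rows: the closed-out old version and the current one
```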
Naming Conventions
- Staging: stg_[source]_[entity] (e.g., stg_salesforce_accounts)
- Intermediate: int_[description] (e.g., int_customer_orders)
- Facts: fct_[business_process] (e.g., fct_orders)
- Dimensions: dim_[entity] (e.g., dim_customers)
5. Orchestration Best Practices
Airflow Best Practices
- Idempotency: Tasks should produce same result when run multiple times
- Incremental Loading: Process only new/changed data
- Backfilling: Design for easy historical data reprocessing
- Monitoring: Set up alerts for failures and SLA breaches
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-alerts@company.com'],
    'sla': timedelta(hours=2),
}

dag = DAG(
    'daily_etl',
    default_args=default_args,
    schedule_interval='0 2 * * *',   # run daily at 02:00
    start_date=datetime(2024, 1, 1), # required for scheduling
    catchup=False,                   # don't backfill automatically
    max_active_runs=1,               # prevent parallel runs
)
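Idempotency and incremental loading go hand in hand: if each run owns exactly one partition of data (typically keyed on the run's logical date), a retry can simply replace that partition. A minimal in-memory sketch of the delete-then-insert pattern (the `load_partition` function is illustrative, not an Airflow API):

```python
from datetime import date

def load_partition(target, new_rows, partition_date):
    """Idempotent incremental load: replace one logical date's partition.

    Re-running for the same date first drops that day's rows, so the task
    produces the same result no matter how many times it runs.
    """
    kept = [r for r in target if r["date"] != partition_date]
    return kept + [dict(r, date=partition_date) for r in new_rows]

table = [{"date": date(2024, 1, 1), "revenue": 100}]
# First run for Jan 2
table = load_partition(table, [{"revenue": 55}], date(2024, 1, 2))
# A retry of the same run: no duplicates appear
table = load_partition(table, [{"revenue": 55}], date(2024, 1, 2))
print(len(table))  # 2
```

In a warehouse this is usually `DELETE WHERE date = ...` followed by `INSERT`, or an atomic partition overwrite.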
Task Dependencies
# Clear dependency chains
extract >> transform >> load
# Fan-out / Fan-in pattern
extract >> [transform_a, transform_b, transform_c] >> merge >> load
# Conditional execution
extract >> branch_task
branch_task >> [path_a, path_b]
Monitoring & Alerting
- Track task duration trends
- Alert on data freshness
- Monitor data quality metrics
- Set SLAs for critical pipelines
- Use circuit breakers to contain cascading failures
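A circuit breaker stops a pipeline from repeatedly hammering a downstream system that is already failing. A minimal sketch of the idea (illustrative class, not from any orchestration library; real breakers also add a timed "half-open" state that periodically probes for recovery):

```python
class CircuitBreaker:
    """Open the circuit after N consecutive failures; further calls are refused."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            # Fail fast instead of hitting the unhealthy downstream again
            raise RuntimeError("circuit open: downstream marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the failure count
        return result

breaker = CircuitBreaker(max_failures=2)
# breaker.call(load_to_warehouse, batch) would wrap the real task body
```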