Modern data warehouse built with dbt and BigQuery, featuring cost-optimized SQL and synthetic data generation for realistic e-commerce analytics.
Flit's analytics were fragmented across multiple sources, leading to:
- Inconsistent Metrics: Different teams reporting conflicting numbers
- High Query Costs: Unoptimized BigQuery queries costing $xx/month
- Slow Analysis: Analysts spending zz% of time on data preparation
- Limited Experimentation: No systematic A/B testing data infrastructure
Quantified Business Outcomes:
- 🔍 dd% Cost Reduction: $xx/month → $cc/month in BigQuery spend
- ⚡ bb% Time Savings: Analyst query time reduced from 10min → 4min average
- 📊 Single Source of Truth: Unified customer 360° view for all teams
- 🧪 Experiment-Ready Data: A/B testing infrastructure supporting 5+ concurrent experiments
graph TD
subgraph "Source Data"
A[TheLook E-commerce<br/>Public Dataset]
B[Synthetic Overlays<br/>Generated Data]
end
subgraph "Raw Layer"
C[BigQuery Raw Tables<br/>flit-data-platform.flit_raw]
end
subgraph "dbt Transformations"
D[Staging Layer<br/>Data Cleaning & Typing]
E[Intermediate Layer<br/>Business Logic Joins]
F[Marts Layer<br/>Analytics-Ready Tables]
end
subgraph "Analytics Layer"
G[Customer 360°<br/>Dimension Tables]
H[Experiment Results<br/>A/B Test Analytics]
I[ML Features<br/>LTV & Churn Modeling]
end
subgraph "Consumption"
J[Streamlit Dashboards]
K[FastAPI ML Services]
L[AI Documentation Bot]
end
A --> C
B --> C
C --> D
D --> E
E --> F
F --> G
F --> H
F --> I
G --> J
H --> K
I --> L
-- Primary data source: Real e-commerce transactions -- Location: `bigquery-public-data.thelook_ecommerce.*` -- Core tables: - users (100K+ customers) -- Customer profiles & demographics - orders (200K+ transactions) -- Order history & status - order_items (400K+ line items) -- Product-level transaction details - products (30K+ SKUs) -- Product catalog & pricing
# Generated in the BigQuery project # Location: `flit-data-platform.flit_raw.*` synthetic_tables = { 'experiment_assignments': 'A/B test variant assignments for users', 'logistics_data': 'Shipping costs, warehouses, delivery tracking', 'support_tickets': 'Customer service interactions & satisfaction', 'user_segments': 'Marketing segments & campaign targeting' }
flit-data-platform/
├── 📋 dbt_project.yml # dbt configuration
├── 📊 models/
│ ├── 🏗️ staging/ # Raw data cleaning & typing
│ │ ├── _sources.yml # Source definitions
│ │ ├── stg_users.sql # Customer data standardization
│ │ ├── stg_orders.sql # Order data processing
│ │ └── stg_experiments.sql # A/B test assignments
│ ├── 🔧 intermediate/ # Business logic joins
│ │ ├── int_customer_metrics.sql # Customer aggregations
│ │ ├── int_order_enrichment.sql # Order enrichment with logistics
│ │ └── int_experiment_exposure.sql # Experiment exposure tracking
│ ├── 🎯 marts/ # Business-facing models
│ │ ├── core/ # Core business entities
│ │ │ ├── dim_customers.sql # Customer 360° dimension
│ │ │ ├── dim_products.sql # Product catalog
│ │ │ ├── fact_orders.sql # Order transactions
│ │ │ └── fact_events.sql # Customer interaction events
│ │ ├── experiments/ # A/B testing models
│ │ │ ├── experiment_results.sql # Experiment performance by variant
│ │ │ └── experiment_exposure.sql # User exposure tracking
│ │ └── ml/ # ML feature engineering
│ │ ├── ltv_features.sql # Customer lifetime value features
│ │ └── churn_features.sql # Churn prediction features
├── 🐍 scripts/ # Data generation & utilities
│ ├── generate_synthetic_data.py # Main synthetic data generator
│ ├── experiment_assignments.py # A/B test assignment logic
│ ├── logistics_data.py # Shipping & fulfillment data
│ └── upload_to_bigquery.py # BigQuery upload utilities
├── 🧪 tests/ # Data quality tests
│ ├── test_customer_uniqueness.sql
│ ├── test_revenue_consistency.sql
│ └── test_experiment_balance.sql
├── 🔧 macros/ # Reusable dbt macros
│ ├── generate_experiment_metrics.sql
│ └── calculate_customer_segments.sql
└── 📚 docs/ # Documentation
├── data_dictionary.md # Business glossary
└── cost_optimization.md # Query performance guide
# Required accounts and tools GCP account with BigQuery enabled dbt Cloud account (free tier) Python 3.9+ for synthetic data generation
# Create GCP project and enable BigQuery API gcloud projects create flit-data-platform gcloud config set project flit-data-platform gcloud services enable bigquery.googleapis.com # Create datasets (equivalent of schemas) bq mk --dataset flit-data-platform:flit_raw bq mk --dataset flit-data-platform:flit_staging bq mk --dataset flit-data-platform:flit_marts
# Clone repository git clone https://github.com/whitehackr/flit-data-platform.git cd flit-data-platform # Create all folders at once (Mac/Linux) mkdir -p models/{staging,intermediate,marts/{core,experiments,ml}} scripts tests macros docs data/{synthetic,schemas} dbt_tests/{unit,data/{generic,singular}} # Install dependencies pip install -r requirements.txt # Generate synthetic overlays (start with 1% sample for testing) python scripts/generate_synthetic_data.py \ --project-id flit-data-platform \ --sample-pct 1.0 \ --dataset flit_raw # Generate full dataset for production python scripts/generate_synthetic_data.py \ --project-id flit-data-platform \ --sample-pct 100.0 \ --dataset flit_raw
# Connect dbt Cloud to this repository # Configure connection to BigQuery: project: flit-data-platform dataset: flit_staging location: US # Install dependencies and run models dbt deps dbt run dbt test
-- Check customer dimension SELECT customer_segment, COUNT(*) as customers, AVG(lifetime_value) as avg_ltv FROM `flit-data-platform.flit_marts.dim_customers` GROUP BY customer_segment; -- Check experiment results SELECT experiment_name, variant, exposed_users, conversion_rate_30d FROM `flit-data-platform.flit_marts.experiment_results`;
-- dim_customers: Complete customer profile with behavioral metrics SELECT user_id, full_name, age_segment, country, acquisition_channel, registration_date, -- Order metrics lifetime_orders, lifetime_value, avg_order_value, days_since_last_order, -- Behavioral indicators categories_purchased, brands_purchased, avg_items_per_order, -- Segmentation customer_segment, -- 'VIP', 'Regular', 'One-Time', etc. lifecycle_stage -- 'Active', 'At Risk', 'Dormant', 'Lost' FROM {{ ref('dim_customers') }}
-- experiment_results: A/B test performance by variant SELECT experiment_name, variant, exposed_users, conversions_30d, conversion_rate_30d, avg_revenue_per_user, statistical_significance FROM {{ ref('experiment_results') }} WHERE experiment_name = 'checkout_button_color' ORDER BY conversion_rate_30d DESC
-- ltv_features: Customer lifetime value prediction features SELECT user_id, -- RFM features days_since_last_order as recency, lifetime_orders as frequency, avg_order_value as monetary, -- Behavioral features categories_purchased, avg_days_between_orders, seasonal_purchase_pattern, -- Engagement features experiments_participated, support_ticket_ratio, -- Target (90-day future revenue) target_90d_revenue FROM {{ ref('ltv_features') }}
Before Optimization:
-- ❌ Expensive query (3.2GB processed, 0ドル.016) SELECT user_id, COUNT(*) as total_orders, SUM(sale_price) as total_revenue FROM `bigquery-public-data.thelook_ecommerce.order_items` GROUP BY user_id ORDER BY total_revenue DESC
After Optimization:
-- ✅ Optimized query (0.8GB processed, 0ドル.004) SELECT user_id, COUNT(*) as total_orders, SUM(sale_price) as total_revenue FROM `bigquery-public-data.thelook_ecommerce.order_items` WHERE created_at >= '2023年01月01日' -- Partition filter AND status = 'Complete' -- Early filter GROUP BY user_id ORDER BY total_revenue DESC LIMIT 1000 -- Limit results
- Query Bytes Reduced: 75% average reduction through partition filters
- Clustering Benefits: 40% faster queries on user_id and product_id
- Materialized Views: 90% cost reduction for repeated analytical queries
- Monthly Savings: 2,150ドル/month (3,200ドル → 1,050ドル)
# models/marts/core/schema.yml models: - name: dim_customers tests: - unique: column_name: user_id - not_null: column_name: user_id columns: - name: lifetime_value tests: - not_null - dbt_utils.accepted_range: min_value: 0 max_value: 10000 - name: customer_segment tests: - accepted_values: values: ['VIP Customer', 'Regular Customer', 'Occasional Buyer', 'One-Time Buyer']
-- tests/test_experiment_balance.sql -- Ensure A/B test variants are properly balanced (within 5%) WITH variant_distribution AS ( SELECT experiment_name, variant, COUNT(*) as user_count, COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY experiment_name) as allocation_pct FROM {{ ref('stg_experiment_assignments') }} GROUP BY experiment_name, variant ) SELECT * FROM variant_distribution WHERE ABS(allocation_pct - 0.333) > 0.05 -- Flag imbalanced experiments
# Daily refresh schedule in dbt Cloud schedule: - time: "06:00 UTC" models: "staging" - time: "07:00 UTC" models: "intermediate,marts" - time: "08:00 UTC" models: "ml" tests: true
# .github/workflows/refresh_synthetic_data.yml name: Weekly Data Refresh on: schedule: - cron: '0 2 * * 0' # Sundays at 2 AM jobs: refresh: steps: - name: Generate new synthetic data run: | python scripts/generate_synthetic_data.py \ --project-id ${{ secrets.GCP_PROJECT_ID }} - name: Trigger dbt Cloud job run: | curl -X POST "$DBT_CLOUD_JOB_URL" \ -H "Authorization: Token ${{ secrets.DBT_CLOUD_TOKEN }}"
# For Streamlit dashboards import streamlit as st from google.cloud import bigquery @st.cache_data(ttl=300) # 5-minute cache def load_customer_metrics(): """Load customer segment metrics for executive dashboard""" query = """ SELECT customer_segment, COUNT(*) as customers, AVG(lifetime_value) as avg_ltv, SUM(lifetime_value) as total_ltv FROM `flit-data-platform.flit_marts.dim_customers` GROUP BY customer_segment ORDER BY total_ltv DESC """ return client.query(query).to_dataframe() # Dashboard display metrics_df = load_customer_metrics() st.bar_chart(metrics_df.set_index('customer_segment')['avg_ltv'])
- Deploy Streamlit dashboard showing key business metrics
- Document cost optimization with before/after query examples
- Highlight data quality with comprehensive testing framework
- Demonstrate scale with 100K+ customers and 200K+ orders
- Real-time streaming with Pub/Sub and Dataflow
- Advanced experimentation with causal inference methods
- MLOps integration with automated feature engineering
- Data lineage with dbt documentation and Airflow
Data Engineering: BigQuery • dbt • SQL Optimization • Data Modeling • ETL Design • Cost Engineering
Data Quality: Testing Framework • Data Validation • Schema Evolution • Monitoring • Documentation
Business Intelligence: Customer Analytics • Experimentation Data • Feature Engineering • Executive Reporting
Cloud Architecture: GCP • BigQuery • Automated Pipelines • Cost Management • Security
Analytics Engineering: Modern Data Stack • dbt Best Practices • Dimensional Modeling • Performance Tuning
- dbt Documentation - Comprehensive framework guide
- BigQuery Best Practices - Google's optimization guide
- TheLook Dataset Schema - Explore the source data
- Flit Platform Overview - See how this fits into the broader system
See the Contributing Guide for development workflow and coding standards.
Part of the Flit Analytics Platform - demonstrating production-ready data engineering at scale.