A production-style EHR data pipeline built with dbt + DuckDB + Streamlit that transforms synthetic FHIR R4-style clinical data through a medallion architecture into a population health analytics dashboard.
All patient data is fully synthetic — generated with a fixed random seed for reproducibility. No real PHI is used or stored.
Population Overview
Risk tier distribution, age bands, insurance mix, and key utilization KPIs across 800 synthetic patients. Condition Prevalence
Condition Prevalence
Top conditions by population prevalence, resolution rates, and ICD-10 chapter breakdown.
Encounter Utilization
Monthly encounter volume trends, encounter class mix, and average length of stay. Medication Adherence
Medication Adherence
Active prescription rates by drug class, days on medication, and adherence by insurance and age band.
seeds/ (CSV)
raw_patients · raw_encounters · raw_conditions · raw_medications
│
▼
┌─────────────────────────────────────────────────┐
│ BRONZE (DuckDB views) │
│ Pass-through views — no transformations │
│ bronze_patients · bronze_encounters │
│ bronze_conditions · bronze_medications │
└──────────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ SILVER (DuckDB tables) │
│ • Mixed EHR date formats parsed (ISO/US/ │
│ compact) via reusable parse_date macro │
│ • ICD-10 chapter classification │
│ • Drug class derivation │
│ • Encounter LOS · condition duration computed │
│ • Age bands · risk-relevant flags added │
│ silver_patients · silver_encounters │
│ silver_conditions · silver_medications │
└──────────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ GOLD (DuckDB tables — analytics-ready) │
│ gold_patient_summary — one row/patient, │
│ risk tier, condition burden, utilization │
│ gold_condition_prevalence — population-level │
│ prevalence rates & resolution stats │
│ gold_encounter_metrics — volume/LOS by class, │
│ insurance, age band, year/month │
│ gold_medication_adherence — active Rx rates │
│ & days-on-med by drug class & demographics │
└──────────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ STREAMLIT DASHBOARD (dashboard/app.py) │
│ 5 tabs: Overview · Conditions · Encounters │
│ Medications · Patient Explorer │
│ Sidebar filters: insurance · age · gender · │
│ risk tier │
└─────────────────────────────────────────────────┘
| Layer | Tool | Purpose |
|---|---|---|
| Transform | dbt-core + dbt-fusion | Medallion models, tests, macros |
| Warehouse | DuckDB | Local analytical warehouse (warehouse.duckdb) |
| Warehouse (prod) | Snowflake | Production profile in profiles.yml |
| Visualization | Streamlit + Plotly | Interactive population health dashboard |
| Data generation | Python + pandas + numpy | Reproducible synthetic FHIR R4-style data |
| Table | Rows | Key fields |
|---|---|---|
raw_patients |
800 | demographics, insurance type, DOB (mixed formats) |
raw_encounters |
~5,600 | encounter class, start/end dates (mixed formats), provider |
raw_conditions |
~3,200 | ICD-10 code, onset/resolved dates, clinical status |
raw_medications |
~2,400 | medication name, RxNorm code, ICD-10 reason, start/stop dates |
Intentional data quality issues baked in to mirror real EHR exports:
- Dates arrive in three formats:
YYYY-MM-DD,MM/DD/YYYY,YYYYMMDD - The
parse_datemacro inmacros/parse_date.sqlhandles all three viaTRY_STRPTIME
| Model | Grain | Notable fields |
|---|---|---|
gold_patient_summary |
1 row / patient | risk_tier, active_conditions, er_visits, total_los_days |
gold_condition_prevalence |
1 row / ICD-10 code | prevalence_pct, resolution_rate_pct, avg_duration_days |
gold_encounter_metrics |
year + month + class + insurance + age band | avg_los_days, encounters_per_patient |
gold_medication_adherence |
medication + drug class + insurance + age band | active_rate_pct, avg_days_on_medication |
clinical-data-pipeline/
├── macros/
│ └── parse_date.sql # handles 3 EHR date formats
├── models/
│ ├── bronze/ # pass-through views + tests
│ ├── silver/ # cleaned & enriched tables
│ └── gold/ # analytics-ready aggregations
├── seeds/ # synthetic CSV data (800 patients)
├── scripts/
│ └── generate_data.py # reproducible data generator
├── dashboard/
│ └── app.py # Streamlit app
├── dbt_project.yml
├── profiles.yml # DuckDB (dev) + Snowflake (prod)
└── requirements.txt
pip install -r requirements.txt
python scripts/generate_data.py
# Load seed CSVs into DuckDB dbt seed --profiles-dir . # Build all 12 models (bronze → silver → gold) dbt run --profiles-dir . # Run data quality tests dbt test --profiles-dir .
streamlit run dashboard/app.py
dbt run output:
✓ bronze_patients (view)
✓ bronze_encounters (view)
✓ bronze_conditions (view)
✓ bronze_medications (view)
✓ silver_patients (table)
✓ silver_encounters (table)
✓ silver_conditions (table)
✓ silver_medications (table)
✓ gold_patient_summary (table)
✓ gold_condition_prevalence (table)
✓ gold_encounter_metrics (table)
✓ gold_medication_adherence (table)
12 models | 12 passed | 0 errors
Why DuckDB?
In-process OLAP engine — zero infrastructure, runs entirely from a single file. The profiles.yml includes a Snowflake profile for production, making the swap a one-line config change.
Why the mixed date formats?
Real EHR system exports (Epic, Cerner, Athena) routinely produce inconsistent date formatting across fields and extract versions. The parse_date macro using TRY_STRPTIME is a pattern directly applicable to production EHR pipelines.
Risk tier logic
The risk_tier field in gold_patient_summary uses active condition count and ER visit frequency as a simple proxy for clinical risk. In production this would be replaced with a validated risk model (e.g., HCC scores, ACG), but the structure — a single column on the patient grain table consumed by the dashboard — is the same pattern.