Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

JeffWilliams2/ehr-data-pipeline

Repository files navigation

EHR Clinical Data Pipeline

A production-style EHR data pipeline built with dbt + DuckDB + Streamlit that transforms synthetic FHIR R4-style clinical data through a medallion architecture into a population health analytics dashboard.

Live Dashboard →

All patient data is fully synthetic — generated with a fixed random seed for reproducibility. No real PHI is used or stored.


Dashboard Preview

Population Overview
Population Overview
Risk tier distribution, age bands, insurance mix, and key utilization KPIs across 800 synthetic patients. Condition Prevalence
Condition Prevalence
Top conditions by population prevalence, resolution rates, and ICD-10 chapter breakdown.
Encounter Utilization
Encounter Utilization
Monthly encounter volume trends, encounter class mix, and average length of stay. Medication Adherence
Medication Adherence
Active prescription rates by drug class, days on medication, and adherence by insurance and age band.

Architecture

seeds/ (CSV)
 raw_patients · raw_encounters · raw_conditions · raw_medications
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ BRONZE (DuckDB views) │
 │ Pass-through views — no transformations │
 │ bronze_patients · bronze_encounters │
 │ bronze_conditions · bronze_medications │
 └──────────────────────┬──────────────────────────┘
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ SILVER (DuckDB tables) │
 │ • Mixed EHR date formats parsed (ISO/US/ │
 │ compact) via reusable parse_date macro │
 │ • ICD-10 chapter classification │
 │ • Drug class derivation │
 │ • Encounter LOS · condition duration computed │
 │ • Age bands · risk-relevant flags added │
 │ silver_patients · silver_encounters │
 │ silver_conditions · silver_medications │
 └──────────────────────┬──────────────────────────┘
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ GOLD (DuckDB tables — analytics-ready) │
 │ gold_patient_summary — one row/patient, │
 │ risk tier, condition burden, utilization │
 │ gold_condition_prevalence — population-level │
 │ prevalence rates & resolution stats │
 │ gold_encounter_metrics — volume/LOS by class, │
 │ insurance, age band, year/month │
 │ gold_medication_adherence — active Rx rates │
 │ & days-on-med by drug class & demographics │
 └──────────────────────┬──────────────────────────┘
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ STREAMLIT DASHBOARD (dashboard/app.py) │
 │ 5 tabs: Overview · Conditions · Encounters │
 │ Medications · Patient Explorer │
 │ Sidebar filters: insurance · age · gender · │
 │ risk tier │
 └─────────────────────────────────────────────────┘

Tech Stack

Layer Tool Purpose
Transform dbt-core + dbt-fusion Medallion models, tests, macros
Warehouse DuckDB Local analytical warehouse (warehouse.duckdb)
Warehouse (prod) Snowflake Production profile in profiles.yml
Visualization Streamlit + Plotly Interactive population health dashboard
Data generation Python + pandas + numpy Reproducible synthetic FHIR R4-style data

Data Model

Seeds (raw inputs)

Table Rows Key fields
raw_patients 800 demographics, insurance type, DOB (mixed formats)
raw_encounters ~5,600 encounter class, start/end dates (mixed formats), provider
raw_conditions ~3,200 ICD-10 code, onset/resolved dates, clinical status
raw_medications ~2,400 medication name, RxNorm code, ICD-10 reason, start/stop dates

Intentional data quality issues baked in to mirror real EHR exports:

  • Dates arrive in three formats: YYYY-MM-DD, MM/DD/YYYY, YYYYMMDD
  • The parse_date macro in macros/parse_date.sql handles all three via TRY_STRPTIME

Gold layer outputs

Model Grain Notable fields
gold_patient_summary 1 row / patient risk_tier, active_conditions, er_visits, total_los_days
gold_condition_prevalence 1 row / ICD-10 code prevalence_pct, resolution_rate_pct, avg_duration_days
gold_encounter_metrics year + month + class + insurance + age band avg_los_days, encounters_per_patient
gold_medication_adherence medication + drug class + insurance + age band active_rate_pct, avg_days_on_medication

Project Structure

clinical-data-pipeline/
├── macros/
│ └── parse_date.sql # handles 3 EHR date formats
├── models/
│ ├── bronze/ # pass-through views + tests
│ ├── silver/ # cleaned & enriched tables
│ └── gold/ # analytics-ready aggregations
├── seeds/ # synthetic CSV data (800 patients)
├── scripts/
│ └── generate_data.py # reproducible data generator
├── dashboard/
│ └── app.py # Streamlit app
├── dbt_project.yml
├── profiles.yml # DuckDB (dev) + Snowflake (prod)
└── requirements.txt

Quickstart

1. Install dependencies

pip install -r requirements.txt

2. (Optional) Regenerate synthetic data

python scripts/generate_data.py

3. Run the dbt pipeline

# Load seed CSVs into DuckDB
dbt seed --profiles-dir .
# Build all 12 models (bronze → silver → gold)
dbt run --profiles-dir .
# Run data quality tests
dbt test --profiles-dir .

4. Launch the dashboard

streamlit run dashboard/app.py

Open http://localhost:8501


dbt Models at a Glance

dbt run output:
 ✓ bronze_patients (view)
 ✓ bronze_encounters (view)
 ✓ bronze_conditions (view)
 ✓ bronze_medications (view)
 ✓ silver_patients (table)
 ✓ silver_encounters (table)
 ✓ silver_conditions (table)
 ✓ silver_medications (table)
 ✓ gold_patient_summary (table)
 ✓ gold_condition_prevalence (table)
 ✓ gold_encounter_metrics (table)
 ✓ gold_medication_adherence (table)
12 models | 12 passed | 0 errors

Key Design Decisions

Why DuckDB? In-process OLAP engine — zero infrastructure, runs entirely from a single file. The profiles.yml includes a Snowflake profile for production, making the swap a one-line config change.

Why the mixed date formats? Real EHR system exports (Epic, Cerner, Athena) routinely produce inconsistent date formatting across fields and extract versions. The parse_date macro using TRY_STRPTIME is a pattern directly applicable to production EHR pipelines.

Risk tier logic The risk_tier field in gold_patient_summary uses active condition count and ER visit frequency as a simple proxy for clinical risk. In production this would be replaced with a validated risk model (e.g., HCC scores, ACG), but the structure — a single column on the patient grain table consumed by the dashboard — is the same pattern.

About

EHR clinical data pipeline: dbt + DuckDB + Streamlit · bronze/silver/gold medallion architecture over synthetic FHIR R4-style data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

Contributors

Languages

AltStyle によって変換されたページ (->オリジナル) /