Name	Name	Last commit message	Last commit date
Latest commit History 2 Commits
.streamlit	.streamlit
dashboard	dashboard
macros	macros
models	models
scripts	scripts
seeds	seeds
.gitignore	.gitignore
README.md	README.md
condition-prev.png	condition-prev.png
dbt_project.yml	dbt_project.yml
encounter-util.png	encounter-util.png
medication-adherence.png	medication-adherence.png
overview.png	overview.png
profiles.yml	profiles.yml
requirements.txt	requirements.txt
runtime.txt	runtime.txt
warehouse.duckdb	warehouse.duckdb

EHR Clinical Data Pipeline

A production-style EHR data pipeline built with dbt + DuckDB + Streamlit that transforms synthetic FHIR R4-style clinical data through a medallion architecture into a population health analytics dashboard.

Live Dashboard →

All patient data is fully synthetic — generated with a fixed random seed for reproducibility. No real PHI is used or stored.

Dashboard Preview

Population Overview
Population Overview
_{Risk tier distribution, age bands, insurance mix, and key utilization KPIs across 800 synthetic patients.} Condition Prevalence
Condition Prevalence
_{Top conditions by population prevalence, resolution rates, and ICD-10 chapter breakdown.}

Encounter Utilization
Encounter Utilization
_{Monthly encounter volume trends, encounter class mix, and average length of stay.} Medication Adherence
Medication Adherence
_{Active prescription rates by drug class, days on medication, and adherence by insurance and age band.}

Architecture

seeds/ (CSV)
 raw_patients · raw_encounters · raw_conditions · raw_medications
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ BRONZE (DuckDB views) │
 │ Pass-through views — no transformations │
 │ bronze_patients · bronze_encounters │
 │ bronze_conditions · bronze_medications │
 └──────────────────────┬──────────────────────────┘
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ SILVER (DuckDB tables) │
 │ • Mixed EHR date formats parsed (ISO/US/ │
 │ compact) via reusable parse_date macro │
 │ • ICD-10 chapter classification │
 │ • Drug class derivation │
 │ • Encounter LOS · condition duration computed │
 │ • Age bands · risk-relevant flags added │
 │ silver_patients · silver_encounters │
 │ silver_conditions · silver_medications │
 └──────────────────────┬──────────────────────────┘
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ GOLD (DuckDB tables — analytics-ready) │
 │ gold_patient_summary — one row/patient, │
 │ risk tier, condition burden, utilization │
 │ gold_condition_prevalence — population-level │
 │ prevalence rates & resolution stats │
 │ gold_encounter_metrics — volume/LOS by class, │
 │ insurance, age band, year/month │
 │ gold_medication_adherence — active Rx rates │
 │ & days-on-med by drug class & demographics │
 └──────────────────────┬──────────────────────────┘
 │
 ▼
 ┌─────────────────────────────────────────────────┐
 │ STREAMLIT DASHBOARD (dashboard/app.py) │
 │ 5 tabs: Overview · Conditions · Encounters │
 │ Medications · Patient Explorer │
 │ Sidebar filters: insurance · age · gender · │
 │ risk tier │
 └─────────────────────────────────────────────────┘

Tech Stack

Layer	Tool	Purpose
Transform	dbt-core + dbt-fusion	Medallion models, tests, macros
Warehouse	DuckDB	Local analytical warehouse (`warehouse.duckdb`)
Warehouse (prod)	Snowflake	Production profile in `profiles.yml`
Visualization	Streamlit + Plotly	Interactive population health dashboard
Data generation	Python + pandas + numpy	Reproducible synthetic FHIR R4-style data

Data Model

Seeds (raw inputs)

Table	Rows	Key fields
`raw_patients`	800	demographics, insurance type, DOB (mixed formats)
`raw_encounters`	~5,600	encounter class, start/end dates (mixed formats), provider
`raw_conditions`	~3,200	ICD-10 code, onset/resolved dates, clinical status
`raw_medications`	~2,400	medication name, RxNorm code, ICD-10 reason, start/stop dates

Intentional data quality issues baked in to mirror real EHR exports:

Dates arrive in three formats: YYYY-MM-DD, MM/DD/YYYY, YYYYMMDD
The parse_date macro in macros/parse_date.sql handles all three via TRY_STRPTIME

Gold layer outputs

Model	Grain	Notable fields
`gold_patient_summary`	1 row / patient	`risk_tier`, `active_conditions`, `er_visits`, `total_los_days`
`gold_condition_prevalence`	1 row / ICD-10 code	`prevalence_pct`, `resolution_rate_pct`, `avg_duration_days`
`gold_encounter_metrics`	year + month + class + insurance + age band	`avg_los_days`, `encounters_per_patient`
`gold_medication_adherence`	medication + drug class + insurance + age band	`active_rate_pct`, `avg_days_on_medication`

Project Structure

clinical-data-pipeline/
├── macros/
│ └── parse_date.sql # handles 3 EHR date formats
├── models/
│ ├── bronze/ # pass-through views + tests
│ ├── silver/ # cleaned & enriched tables
│ └── gold/ # analytics-ready aggregations
├── seeds/ # synthetic CSV data (800 patients)
├── scripts/
│ └── generate_data.py # reproducible data generator
├── dashboard/
│ └── app.py # Streamlit app
├── dbt_project.yml
├── profiles.yml # DuckDB (dev) + Snowflake (prod)
└── requirements.txt

Quickstart

1. Install dependencies

pip install -r requirements.txt

2. (Optional) Regenerate synthetic data

python scripts/generate_data.py

3. Run the dbt pipeline

# Load seed CSVs into DuckDB
dbt seed --profiles-dir .
# Build all 12 models (bronze → silver → gold)
dbt run --profiles-dir .
# Run data quality tests
dbt test --profiles-dir .

4. Launch the dashboard

streamlit run dashboard/app.py

Open http://localhost:8501

dbt Models at a Glance

dbt run output:
 ✓ bronze_patients (view)
 ✓ bronze_encounters (view)
 ✓ bronze_conditions (view)
 ✓ bronze_medications (view)
 ✓ silver_patients (table)
 ✓ silver_encounters (table)
 ✓ silver_conditions (table)
 ✓ silver_medications (table)
 ✓ gold_patient_summary (table)
 ✓ gold_condition_prevalence (table)
 ✓ gold_encounter_metrics (table)
 ✓ gold_medication_adherence (table)
12 models | 12 passed | 0 errors

Key Design Decisions

Why DuckDB? In-process OLAP engine — zero infrastructure, runs entirely from a single file. The profiles.yml includes a Snowflake profile for production, making the swap a one-line config change.

Why the mixed date formats? Real EHR system exports (Epic, Cerner, Athena) routinely produce inconsistent date formatting across fields and extract versions. The parse_date macro using TRY_STRPTIME is a pattern directly applicable to production EHR pipelines.

Risk tier logic The risk_tier field in gold_patient_summary uses active condition count and ER visit frequency as a simple proxy for clinical risk. In production this would be replaced with a validated risk model (e.g., HCC scores, ACG), but the structure — a single column on the patient grain table consumed by the dashboard — is the same pattern.

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JeffWilliams2/ehr-data-pipeline

Folders and files

Latest commit

History

Repository files navigation

EHR Clinical Data Pipeline

Dashboard Preview

Architecture

Tech Stack

Data Model

Seeds (raw inputs)

Gold layer outputs

Project Structure

Quickstart

1. Install dependencies

2. (Optional) Regenerate synthetic data

3. Run the dbt pipeline

4. Launch the dashboard

dbt Models at a Glance

Key Design Decisions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EHR Clinical Data Pipeline

Dashboard Preview

Architecture

Tech Stack

Data Model

Seeds (raw inputs)

Gold layer outputs

Project Structure

Quickstart

1. Install dependencies

2. (Optional) Regenerate synthetic data

3. Run the dbt pipeline

4. Launch the dashboard

dbt Models at a Glance

Key Design Decisions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages