Preprocessing

Manu Murugesan edited this page Mar 13, 2026 · 1 revision

The preprocessing module is the foundation of medicaid-utils. It loads raw Parquet claim files into Dask DataFrames and applies validated cleaning and variable construction routines.

Supported File Types

MAX Format (Pre-2016)

Type	Description	Class
`ip`	Inpatient claims	`max_ip.MAXIP`
`ot`	Outpatient claims	`max_ot.MAXOT`
`ps`	Person Summary	`max_ps.MAXPS`
`cc`	Chronic Conditions	`max_cc.MAXCC`

TAF Format (2016+)

Type	Description	Class
`ip`	Inpatient claims	`taf_ip.TAFIP`
`ot`	Outpatient claims	`taf_ot.TAFOT`
`lt`	Long-Term Care	`taf_lt.TAFLT`
`rx`	Pharmacy claims	`taf_rx.TAFRX`
`ps`	Person Summary (Demographics)	`taf_ps.TAFPS`
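
The two tables above amount to a lookup from claim year and file type to a class. The following is a minimal sketch of that lookup, assuming the 2016 cutoff given in the section headings. `resolve_class_path` is a hypothetical helper and not part of medicaid-utils. It returns the documented names as strings, so it runs without the package installed.

```python
# Hypothetical helper: maps (year, file type) to the class names listed in the
# tables above. It only returns strings and does not import medicaid-utils.
MAX_CLASSES = {
    "ip": "max_ip.MAXIP",
    "ot": "max_ot.MAXOT",
    "ps": "max_ps.MAXPS",
    "cc": "max_cc.MAXCC",
}
TAF_CLASSES = {
    "ip": "taf_ip.TAFIP",
    "ot": "taf_ot.TAFOT",
    "lt": "taf_lt.TAFLT",
    "rx": "taf_rx.TAFRX",
    "ps": "taf_ps.TAFPS",
}
TAF_START_YEAR = 2016  # per the section headings: MAX is pre-2016, TAF is 2016+


def resolve_class_path(year: int, file_type: str) -> str:
    """Return the 'module.Class' name documented for this year and file type."""
    is_taf = year >= TAF_START_YEAR
    table = TAF_CLASSES if is_taf else MAX_CLASSES
    if file_type not in table:
        fmt = "TAF" if is_taf else "MAX"
        raise ValueError(f"{fmt} has no documented file type {file_type!r}")
    return table[file_type]


print(resolve_class_path(2012, "ip"))
print(resolve_class_path(2018, "rx"))
```

A file type that exists in only one format, such as `rx` for a 2012 request, raises a `ValueError` here rather than falling back to another class.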

What Cleaning Does

Each file type has tailored cleaning routines that run automatically when `clean=True` (the default):

Date standardization — converts date columns to consistent datetime types
Diagnosis code cleaning — strips whitespace, normalizes formatting, handles ICD-9/10 differences
Procedure code cleaning — validates procedure code systems (CPT, HCPCS, ICD)
Demographic derivation — derives age and gender flags and validates dates of birth
Duplicate flagging — identifies exact duplicate claims for exclusion
Encounter/capitation classification — flags FFS, encounter, and capitation claims
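
To make three of the steps above concrete, here is a plain-Python sketch of diagnosis code cleaning, date standardization, and duplicate flagging. It is an illustration under stated assumptions and not the library's implementation, which operates on Dask DataFrames. The column names, the `YYYYMMDD` input format, and the choice to drop dots from diagnosis codes are all assumptions made for the example.

```python
from datetime import date, datetime
from typing import Optional


def clean_diagnosis_code(raw: Optional[str]) -> str:
    """Strip whitespace and dots, then upper-case (assumed normalization)."""
    if raw is None:
        return ""
    return raw.strip().replace(".", "").upper()


def parse_claim_date(raw: Optional[str]) -> Optional[date]:
    """Parse an assumed YYYYMMDD string; return None if it is not a valid date."""
    if not raw:
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y%m%d").date()
    except ValueError:
        return None


def flag_exact_duplicates(claims: list[dict]) -> list[bool]:
    """Flag each claim whose fields all match a claim seen earlier in the list."""
    seen = set()
    flags = []
    for claim in claims:
        key = tuple(sorted(claim.items()))  # order-independent, hashable key
        flags.append(key in seen)
        seen.add(key)
    return flags


# Illustrative records; the column names are placeholders.
raw_claims = [
    {"MSIS_ID": "A1", "DIAG_CD_1": " 250.00 ", "SRVC_BGN_DT": "20120105"},
    {"MSIS_ID": "A1", "DIAG_CD_1": "25000", "SRVC_BGN_DT": "20120105"},
    {"MSIS_ID": "B2", "DIAG_CD_1": "e11.9", "SRVC_BGN_DT": "20120230"},
]
cleaned = [
    {
        "MSIS_ID": c["MSIS_ID"],
        "DIAG_CD_1": clean_diagnosis_code(c["DIAG_CD_1"]),
        "SRVC_BGN_DT": parse_claim_date(c["SRVC_BGN_DT"]),
    }
    for c in raw_claims
]
duplicate_flags = flag_exact_duplicates(cleaned)
print(duplicate_flags)
```

The first two records differ only in formatting, so they match as exact duplicates only after cleaning. This is one reason to flag duplicates after the code and date steps rather than before.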

What Preprocessing Adds

Additional derived variables are computed when `preprocess=True` (the default):

Payment calculation — standardized payment amount from available payment fields
ED use flags — emergency department utilization indicators (CPT, UB-92, revenue center, place of service)
IP overlap detection — flags outpatient claims that overlap with inpatient stays
Length of stay — computed from admission and discharge dates
Eligibility patterns — monthly enrollment strings and gap detection
Rural classification — RUCA or RUCC codes via ZIP code crosswalk
Dual eligibility — Medicare-Medicaid dual enrollment flags
Basis of eligibility — categorization by eligibility group (aged, blind/disabled, child, adult)
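
The date-based and enrollment-based variables above can be sketched in plain Python. This is a hedged illustration and not the library's code. It assumes that length of stay is the day difference between discharge and admission, that overlap is checked inclusively on both ends, and that a monthly enrollment string uses `1` for enrolled and `0` for not enrolled. All function names are hypothetical.

```python
from datetime import date


def length_of_stay(admit: date, discharge: date) -> int:
    """Days from admission to discharge; a same-day stay counts as 0 (assumed)."""
    return (discharge - admit).days


def overlaps_stay(service_date: date, admit: date, discharge: date) -> bool:
    """True when a service date falls within an inpatient stay, inclusive."""
    return admit <= service_date <= discharge


def enrollment_gaps(pattern: str) -> list[tuple[int, int]]:
    """Return 1-based (first_month, last_month) pairs for each run of '0'."""
    gaps = []
    start = None
    for month, flag in enumerate(pattern, start=1):
        if flag == "0" and start is None:
            start = month  # a gap opens
        elif flag != "0" and start is not None:
            gaps.append((start, month - 1))  # the gap closed last month
            start = None
    if start is not None:
        gaps.append((start, len(pattern)))  # gap runs to the end of the string
    return gaps


admit, discharge = date(2012, 3, 1), date(2012, 3, 5)
print(length_of_stay(admit, discharge))
print(overlaps_stay(date(2012, 3, 3), admit, discharge))
print(enrollment_gaps("111100011110"))
```

Other conventions are common, for example counting a same-day stay as one day, so check the definition against the analysis you are replicating.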

Controlling the Pipeline

```python
from medicaid_utils.preprocessing import max_ip

# Full pipeline (default)
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms")

# Raw data only
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", clean=False, preprocess=False)

# Clean but skip variable construction
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", preprocess=False)

# Cache intermediate results to disk
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", tmp_folder="/tmp/cache")
```
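
When the same options are reused across several states or years, it can help to name the three documented modes once. The sketch below builds keyword arguments from only the parameters shown above (`year`, `state`, `data_root`, `clean`, `preprocess`, `tmp_folder`). `loader_kwargs` and its `stage` labels are hypothetical and not part of medicaid-utils.

```python
def loader_kwargs(year, state, data_root, stage="full", tmp_folder=None):
    """Build keyword arguments for a claim class such as max_ip.MAXIP.

    stage is one of:
      "raw"   - no cleaning and no preprocessing
      "clean" - cleaning only
      "full"  - cleaning and preprocessing (the documented default)
    """
    stages = {
        "raw": {"clean": False, "preprocess": False},
        "clean": {"clean": True, "preprocess": False},
        "full": {"clean": True, "preprocess": True},
    }
    if stage not in stages:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(stages)}")
    kwargs = {"year": year, "state": state, "data_root": data_root, **stages[stage]}
    if tmp_folder is not None:
        kwargs["tmp_folder"] = tmp_folder  # only passed when caching is wanted
    return kwargs


for state in ("WY", "VT"):
    print(loader_kwargs(2012, state, "/data/cms", stage="clean"))
```

With the package installed, the result would be unpacked into the constructor, for example `max_ip.MAXIP(**loader_kwargs(2012, "WY", "/data/cms", stage="clean"))`.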

Exporting Processed Data

```python
# Export to Parquet (recommended)
ip.export("/path/to/output", output_format="parquet", repartition=True)

# Export to CSV
ip.export("/path/to/output", output_format="csv")
```
