Preprocessing

Manu Murugesan edited this page Mar 13, 2026 · 1 revision

The preprocessing module is the foundation of medicaid-utils. It loads raw Parquet claim files into Dask DataFrames and applies validated cleaning and variable construction routines.

Supported File Types

MAX Format (Pre-2016)

Type  Description         Class
ip    Inpatient claims    max_ip.MAXIP
ot    Outpatient claims   max_ot.MAXOT
ps    Person Summary      max_ps.MAXPS
cc    Chronic Conditions  max_cc.MAXCC

TAF Format (2016+)

Type  Description                    Class
ip    Inpatient claims               taf_ip.TAFIP
ot    Outpatient claims              taf_ot.TAFOT
lt    Long-Term Care                 taf_lt.TAFLT
rx    Pharmacy claims                taf_rx.TAFRX
ps    Person Summary (Demographics)  taf_ps.TAFPS
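
The two tables can be read as a lookup from file type and claim year to a loader class. The sketch below encodes that lookup in plain Python. The helper name `class_path` and the constant `TAF_START_YEAR` are hypothetical and not part of medicaid-utils; the 2016 cutoff simply mirrors the section headings above, and the function returns the class path as a string rather than importing anything.

```python
# Lookup tables copied from the "Supported File Types" tables above.
MAX_CLASSES = {
    "ip": "max_ip.MAXIP",
    "ot": "max_ot.MAXOT",
    "ps": "max_ps.MAXPS",
    "cc": "max_cc.MAXCC",
}
TAF_CLASSES = {
    "ip": "taf_ip.TAFIP",
    "ot": "taf_ot.TAFOT",
    "lt": "taf_lt.TAFLT",
    "rx": "taf_rx.TAFRX",
    "ps": "taf_ps.TAFPS",
}
TAF_START_YEAR = 2016  # assumption: taken from the "TAF Format (2016+)" heading


def class_path(file_type, year):
    """Return the documented class path for a file type and claim year.

    Hypothetical helper for illustration only; it is not a medicaid-utils API.
    """
    is_taf = year >= TAF_START_YEAR
    table = TAF_CLASSES if is_taf else MAX_CLASSES
    if file_type not in table:
        fmt = "TAF" if is_taf else "MAX"
        raise ValueError(f"{file_type!r} is not listed for {fmt} files")
    return table[file_type]


max_inpatient = class_path("ip", 2012)
taf_pharmacy = class_path("rx", 2018)
```

Note that some types exist in only one format: `cc` is listed for MAX only, while `lt` and `rx` are listed for TAF only, so the helper raises `ValueError` for those combinations.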

What Cleaning Does

Each file type has tailored cleaning routines that run automatically when clean=True (the default):

  • Date standardization — converts date columns to consistent datetime types
  • Diagnosis code cleaning — strips whitespace, normalizes formatting, handles ICD-9/10 differences
  • Procedure code cleaning — validates procedure code systems (CPT, HCPCS, ICD)
  • Demographic derivation — computes age, gender flags, and date-of-birth validation
  • Duplicate flagging — identifies exact duplicate claims for exclusion
  • Encounter/capitation classification — flags FFS, encounter, and capitation claims
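
To make three of these steps concrete, here is a minimal sketch of date standardization, diagnosis code cleaning, and duplicate flagging in plain Python. It is not the medicaid-utils implementation, which operates on Dask DataFrames. The function names, the column names (`claim_id`, `admit_date`, `diag_1`), the accepted date formats, and the choice to strip dots from diagnosis codes are all assumptions made for illustration.

```python
from datetime import datetime


def parse_date(value):
    """Parse 'YYYYMMDD' or 'YYYY-MM-DD' into a date; return None if invalid."""
    if not value:
        return None
    for fmt in ("%Y%m%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_diag_code(code):
    """Strip whitespace and dots and upper-case; empty codes become None."""
    if code is None:
        return None
    cleaned = code.strip().replace(".", "").upper()
    return cleaned or None


def flag_duplicates(claims):
    """Mark every exact repeat of an earlier claim with is_duplicate=True."""
    seen = set()
    flagged = []
    for claim in claims:
        key = tuple(sorted(claim.items()))  # all fields must match exactly
        flagged.append({**claim, "is_duplicate": key in seen})
        seen.add(key)
    return flagged


# Hypothetical raw rows: one exact duplicate and one impossible date.
raw = [
    {"claim_id": "A1", "admit_date": "20120105", "diag_1": " 250.00 "},
    {"claim_id": "A1", "admit_date": "20120105", "diag_1": " 250.00 "},
    {"claim_id": "B2", "admit_date": "2012-02-30", "diag_1": "e11.9"},
]
cleaned = flag_duplicates(
    [
        {
            "claim_id": row["claim_id"],
            "admit_date": parse_date(row["admit_date"]),
            "diag_1": clean_diag_code(row["diag_1"]),
        }
        for row in raw
    ]
)
```

In this sketch duplicates are flagged rather than dropped, which matches the "flagging" wording above and leaves the exclusion decision to the analyst.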

What Preprocessing Adds

Additional derived variables are computed when preprocess=True (the default):

  • Payment calculation — standardized payment amount from available payment fields
  • ED use flags — emergency department utilization indicators (CPT, UB-92, revenue center, place of service)
  • IP overlap detection — flags outpatient claims that overlap with inpatient stays
  • Length of stay — computed from admission and discharge dates
  • Eligibility patterns — monthly enrollment strings and gap detection
  • Rural classification — RUCA or RUCC codes via ZIP code crosswalk
  • Dual eligibility — Medicare-Medicaid dual enrollment flags
  • Basis of eligibility — categorization by eligibility group (aged, blind/disabled, child, adult)
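
A few of these variables reduce to simple date and string arithmetic. The sketch below shows one possible reading of length of stay, IP overlap detection, and enrollment gap counting in plain Python. It is not the library's code: the function names are hypothetical, length of stay is assumed to be discharge minus admission in days (so a same-day stay counts as 0), and the monthly enrollment string is assumed to use "1" for enrolled months and "0" for unenrolled months.

```python
from datetime import date


def length_of_stay(admit, discharge):
    """Days from admission to discharge; None if a date is missing or out of order."""
    if admit is None or discharge is None or discharge < admit:
        return None
    return (discharge - admit).days


def overlaps_inpatient(service_date, stays):
    """True if the service date falls within any (admit, discharge) stay, inclusive."""
    return any(admit <= service_date <= discharge for admit, discharge in stays)


def count_enrollment_gaps(pattern):
    """Count runs of unenrolled months that sit between enrolled months."""
    enrolled_span = pattern.strip("0")  # ignore months before first / after last enrollment
    return sum(1 for run in enrolled_span.split("1") if run)


# Hypothetical data: one inpatient stay and a 12-month enrollment string.
stays = [(date(2012, 3, 1), date(2012, 3, 4))]
los = length_of_stay(*stays[0])
ot_during_stay = overlaps_inpatient(date(2012, 3, 2), stays)
ot_after_stay = overlaps_inpatient(date(2012, 3, 9), stays)
gaps = count_enrollment_gaps("111001110111")
```

Conventions differ between studies (for example, some count a same-day stay as one day), so check the computed columns against your own definitions before relying on them.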

Controlling the Pipeline

from medicaid_utils.preprocessing import max_ip

# Full pipeline (default): clean=True, preprocess=True
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms")

# Raw data only
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", clean=False, preprocess=False)

# Clean but skip variable construction
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", preprocess=False)

# Cache intermediate results to disk
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", tmp_folder="/tmp/cache")

Exporting Processed Data

# Export to Parquet (recommended)
ip.export("/path/to/output", output_format="parquet", repartition=True)
# Export to CSV
ip.export("/path/to/output", output_format="csv")

See Also

  • MAX vs TAF — Key differences between file formats
  • Glossary — Column naming conventions
