Data Layout

Manu Murugesan edited this page Mar 13, 2026 · 1 revision

medicaid-utils expects Medicaid claim files stored as Parquet datasets, organized by year and state, and sorted by beneficiary ID (BENE_MSIS or MSIS_ID).

MAX Files (Pre-2016, ICD-9 Era)

data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        max/
          ip/parquet/    # Inpatient claims
          ot/parquet/    # Outpatient claims
          ps/parquet/    # Person Summary
          cc/parquet/    # Chronic Conditions

Example: data_root/medicaid/2012/WY/max/ip/parquet/
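
The MAX layout above can be expressed as a small path builder. This is only a sketch: `max_dataset_path` and `MAX_CLAIM_TYPES` are hypothetical names introduced for illustration and are not part of medicaid-utils.

```python
from pathlib import PurePosixPath

# Claim types taken from the MAX layout above
MAX_CLAIM_TYPES = ("ip", "ot", "ps", "cc")


def max_dataset_path(data_root, year, state, claim_type):
    """Build the Parquet dataset path for one MAX claim type (hypothetical helper)."""
    if claim_type not in MAX_CLAIM_TYPES:
        raise ValueError(f"unknown MAX claim type: {claim_type!r}")
    # {STATE} is uppercase in the layout, so normalize the input
    return PurePosixPath(
        data_root, "medicaid", str(year), state.upper(), "max", claim_type, "parquet"
    )


print(max_dataset_path("data_root", 2012, "wy", "ip"))
# data_root/medicaid/2012/WY/max/ip/parquet
```

`PurePosixPath` is used so the printed path matches the forward-slash form shown in this page on any operating system; swap in `pathlib.Path` when touching the real file system.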

TAF Files (2016+, ICD-10 Era)

TAF claims are split into multiple subtypes per claim type:

data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        taf/
          ip/                    # Inpatient
            iph/parquet/         # Header (base)
            ipl/parquet/         # Line
            ipoccr/parquet/      # Occurrence codes
            ipdx/parquet/        # Diagnosis codes
            ipndc/parquet/       # NDC codes
          ot/                    # Outpatient (oth, otl, otoccr, otdx, otndc)
          lt/                    # Long-Term Care (lth, ltl, ltoccr, ltdx, ltndc)
          rx/                    # Pharmacy (rxh, rxl, rxndc)
          de/                    # Demographics/Eligibility (Person Summary)
            debse/parquet/       # Base demographics
            dedts/parquet/       # Dates
            demc/parquet/        # Managed care
            dedsb/parquet/       # Disability
            demfp/parquet/       # Money Follows the Person
            dewvr/parquet/       # Waiver
            dehsp/parquet/       # Health Home / State Plan Option
            dedxndc/parquet/     # Diagnosis & NDC codes
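
As an illustration of the TAF layout, the sketch below maps each subtype of a claim type to its dataset path. `TAF_SUBTYPES` and `taf_dataset_paths` are hypothetical names, not part of medicaid-utils, and the code assumes that the `ot`, `lt`, and `rx` subtypes follow the same `{subtype}/parquet/` pattern shown for `ip` and `de`.

```python
from pathlib import PurePosixPath

# Subtype folders per TAF claim type, copied from the layout above
TAF_SUBTYPES = {
    "ip": ("iph", "ipl", "ipoccr", "ipdx", "ipndc"),
    "ot": ("oth", "otl", "otoccr", "otdx", "otndc"),
    "lt": ("lth", "ltl", "ltoccr", "ltdx", "ltndc"),
    "rx": ("rxh", "rxl", "rxndc"),
    "de": ("debse", "dedts", "demc", "dedsb", "demfp", "dewvr", "dehsp", "dedxndc"),
}


def taf_dataset_paths(data_root, year, state, claim_type):
    """Map each subtype of a TAF claim type to its Parquet path (hypothetical helper)."""
    base = PurePosixPath(data_root, "medicaid", str(year), state.upper(), "taf", claim_type)
    # Assumption: every subtype lives under {subtype}/parquet/
    return {subtype: base / subtype / "parquet" for subtype in TAF_SUBTYPES[claim_type]}


for subtype, path in taf_dataset_paths("data_root", 2019, "IL", "rx").items():
    print(subtype, path)
# rxh data_root/medicaid/2019/IL/taf/rx/rxh/parquet
# rxl data_root/medicaid/2019/IL/taf/rx/rxl/parquet
# rxndc data_root/medicaid/2019/IL/taf/rx/rxndc/parquet
```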

Notes

  • Each Parquet dataset can be a single file or a directory of partitioned Parquet files.
  • Files must be pre-sorted by beneficiary ID to enable efficient partition-level operations.
  • The package uses pyarrow as the default Parquet engine.
  • {YEAR} is a four-digit year (e.g., 2012, 2019).
  • {STATE} is a two-letter uppercase state abbreviation (e.g., WY, AL, IL).
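
The naming and sorting conventions in these notes can be checked before loading data. The following is a minimal sketch using only the standard library; `is_valid_partition` and `is_sorted` are hypothetical helpers, and in practice the IDs passed to `is_sorted` would come from the beneficiary ID column of a Parquet dataset. The check covers the format of the folder names only, not whether the state code is a real state.

```python
import re

YEAR_RE = re.compile(r"\d{4}")      # four-digit year, e.g. 2012
STATE_RE = re.compile(r"[A-Z]{2}")  # two uppercase letters, e.g. WY


def is_valid_partition(year, state):
    """Check {YEAR} and {STATE} folder names against the conventions above."""
    return bool(YEAR_RE.fullmatch(year)) and bool(STATE_RE.fullmatch(state))


def is_sorted(ids):
    """Return True if the beneficiary IDs are in non-decreasing order."""
    ids = list(ids)
    return all(a <= b for a, b in zip(ids, ids[1:]))


print(is_valid_partition("2012", "WY"))     # True
print(is_valid_partition("12", "wy"))       # False
print(is_sorted(["A001", "A001", "B042"]))  # True
print(is_sorted(["B042", "A001"]))          # False
```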

Converting Raw CMS Data

If your CMS data is in SAS (.sas7bdat), CSV, or other formats, you need to convert it to Parquet first. Example using pandas:

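from pathlib import Path
import pandas as pd

# Read the SAS file; the encoding is an assumption (adjust to your files) and
# decodes string columns to text instead of leaving them as bytes
df = pd.read_sas("max_ip_2012_wy.sas7bdat", encoding="latin-1")
# Sort by beneficiary ID
df = df.sort_values("MSIS_ID")
# Create the target directory if needed, then write to Parquet
out_dir = Path("data_root/medicaid/2012/WY/max/ip/parquet")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / "data.parquet", index=False)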

For large files, use Dask:

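import dask.dataframe as dd
# Read the CSV lazily; keep the beneficiary ID as text so any leading zeros are preserved
df = dd.read_csv("max_ip_2012_wy.csv", dtype={"MSIS_ID": "str"})
# set_index sorts rows by MSIS_ID across partitions; reset_index restores it as a column
df = df.set_index("MSIS_ID").reset_index()
# Write a directory of partitioned Parquet files without the synthetic index
df.to_parquet("data_root/medicaid/2012/WY/max/ip/parquet/", write_index=False)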

Next Steps

  • Quick Start — Load and process your first claims
  • MAX vs TAF — Understand the differences between file formats
