Cohort Extraction
The cohort extraction module is the primary tool for building patient-level analytic files. It identifies patients matching diagnosis/procedure criteria, applies inclusion/exclusion filters, and exports the resulting claim files.
```python
from medicaid_utils.filters.patients.cohort_extraction import extract_cohort

extract_cohort(
    state="AL",
    lst_year=[2016, 2017, 2018],
    dct_diag_proc_codes=dct_codes,
    dct_filters=dct_filters,
    lst_types_to_export=["ip", "ot", "ps"],
    dct_data_paths=dct_paths,
    cms_format="TAF",
)
```
Use ICD-9 and/or ICD-10 codes with inclusion and exclusion logic. Codes are matched by prefix: "250" matches "2500", "25000", "25002", etc.
```python
dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],   # ICD-9 prefix
                10: ["E11"],  # ICD-10 prefix
            },
            "excl": {
                9: ["25001", "25003", "25011", "25013"],  # Odd 5th digits = Type 1
                10: ["E10"],  # Exclude Type 1
            },
        },
    },
    "proc_codes": {},
}
```
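The inclusion and exclusion semantics can be illustrated with a small standalone sketch. The helper `matches_condition` is hypothetical and is not part of `medicaid_utils`. It assumes a code qualifies when it starts with at least one inclusion prefix and with no exclusion prefix for the same ICD version; the library's internals may differ.

```python
def matches_condition(code, icd_version, spec):
    """Return True if `code` starts with an "incl" prefix and no "excl" prefix.

    Hypothetical helper for illustration; not the medicaid_utils implementation.
    """
    code = code.strip().upper()
    incl = tuple(spec.get("incl", {}).get(icd_version, []))
    excl = tuple(spec.get("excl", {}).get(icd_version, []))
    # str.startswith accepts a tuple and returns True if any prefix matches.
    if not incl or not code.startswith(incl):
        return False
    return not (excl and code.startswith(excl))


diabetes_t2 = {
    "incl": {9: ["250"], 10: ["E11"]},
    "excl": {9: ["25001", "25003", "25011", "25013"], 10: ["E10"]},
}

for code, version in [("25000", 9), ("25001", 9), ("E119", 10), ("E109", 10)]:
    print(code, version, matches_condition(code, version, diabetes_t2))
```

Running a handful of known codes through a check like this before a full extraction is a cheap way to confirm that the prefixes capture what you intend.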
Procedure codes are keyed by procedure coding system:
```python
dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            7: [  # ICD-10-PCS (system code 7)
                "HZ81ZZZ",
                "HZ84ZZZ",
                "HZ85ZZZ",
                "HZ86ZZZ",
            ],
        },
    },
}
```
Common procedure system codes:
- `1`: CPT/HCPCS
- `6`: ICD-9-CM procedure
- `7`: ICD-10-PCS
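As a quick sanity check on a `"proc_codes"` dictionary, a sketch like the following labels each entry with its coding system. `PROC_SYSTEM_LABELS` and `describe_proc_codes` are hypothetical names introduced here, and only the three system codes listed above are included; other values may exist in the data.

```python
# System codes listed above; this mapping is illustrative, not exhaustive.
PROC_SYSTEM_LABELS = {1: "CPT/HCPCS", 6: "ICD-9-CM procedure", 7: "ICD-10-PCS"}


def describe_proc_codes(proc_codes):
    """Summarise a "proc_codes" dict as (name, system label, number of codes)."""
    rows = []
    for name, by_system in proc_codes.items():
        for system, codes in by_system.items():
            label = PROC_SYSTEM_LABELS.get(system, f"unknown system {system!r}")
            rows.append((name, label, len(codes)))
    return rows


proc_codes = {"methadone": {7: ["HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ"]}}
summary = describe_proc_codes(proc_codes)
print(summary)
```

The examples on this page use integer system keys, so a key typed as the string `"7"` would show up here as an unknown system.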
Filters control which claims and patients are included:
```python
dct_filters = {
    "cohort": {
        "ip": {
            "missing_dob": 0,  # Exclude missing DOB
            "range_numeric_age_prncpl_proc": (18, 64),  # Age 18-64
        },
        "ot": {
            "missing_dob": 0,
            "range_numeric_age_srvc_bgn": (18, 64),
        },
    },
    "export": {},
}
```
| Type | Example | Description |
|---|---|---|
| Column value | `"missing_dob": 0` | Keep rows where column equals value |
| Numeric range | `"range_numeric_age_srvc_bgn": (18, 64)` | Keep rows where column is within range (inclusive) |
| Date range | `"range_date_srvc_bgn_date": ("20160101", "20181231")` | Keep rows where date is within range |
| Exclusion | `"excl_female": 1` | Exclude patients with positive exclusion flag |
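To make the first three filter types concrete, here is a minimal pure-Python sketch of their semantics as the table describes them. It is not the library's implementation: `keep_row` is a hypothetical helper, rows are plain dictionaries keyed by the column part of each filter name, and date bounds are assumed to be inclusive like numeric bounds.

```python
from datetime import datetime


def keep_row(row, filters):
    """Return True if `row` (a dict) passes every filter in `filters`.

    Illustrative only: mirrors the table's descriptions, not the library's code.
    """
    for key, value in filters.items():
        if key.startswith("range_numeric_"):
            col = key[len("range_numeric_"):]
            low, high = value
            if row[col] is None or not (low <= row[col] <= high):
                return False
        elif key.startswith("range_date_"):
            col = key[len("range_date_"):]
            start, end = (datetime.strptime(v, "%Y%m%d") for v in value)
            # Assumption: date bounds are inclusive, like numeric ranges.
            if row[col] is None or not (start <= row[col] <= end):
                return False
        elif row[key] != value:
            # Column-value filter: keep rows where the column equals the value.
            return False
    return True


filters = {
    "missing_dob": 0,
    "range_numeric_age_srvc_bgn": (18, 64),
    "range_date_srvc_bgn_date": ("20160101", "20181231"),
}
claims = [
    {"missing_dob": 0, "age_srvc_bgn": 45, "srvc_bgn_date": datetime(2017, 3, 1)},
    {"missing_dob": 0, "age_srvc_bgn": 70, "srvc_bgn_date": datetime(2017, 3, 1)},
    {"missing_dob": 1, "age_srvc_bgn": 30, "srvc_bgn_date": datetime(2017, 3, 1)},
    {"missing_dob": 0, "age_srvc_bgn": 64, "srvc_bgn_date": datetime(2019, 1, 1)},
]
kept = [c for c in claims if keep_row(c, filters)]
print(len(kept))
```

Only the first claim passes all three filters; the others fail on age, missing date of birth, and service date respectively.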
After extraction, the export folder contains:
- `cohort_{STATE}.csv`: patient-level file with condition flags, inclusion indicator, and date of birth
- `cohort_{STATE}_{YEAR}.csv`: year-specific patient file
- `cohort_exclusions_{TYPE}_{STATE}_{YEAR}.parquet`: filter statistics
- Exported claim files in the requested format (CSV or Parquet)
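A short sketch can turn the naming patterns above into concrete paths, for example to check that a run produced what you expected. `expected_cohort_files` is a hypothetical helper. It assumes `{TYPE}` is the lowercase claim type (such as `ip`) and that one exclusions file is written per claim type and year; neither is confirmed here. Exported claim files are left out because their names are not specified above.

```python
from pathlib import Path


def expected_cohort_files(export_folder, state, years, claim_types):
    """List the paths implied by the documented naming patterns (hypothetical)."""
    folder = Path(export_folder)
    paths = [folder / f"cohort_{state}.csv"]
    for year in years:
        paths.append(folder / f"cohort_{state}_{year}.csv")
        for claim_type in claim_types:
            # Assumption: one exclusions file per claim type and year.
            paths.append(
                folder / f"cohort_exclusions_{claim_type}_{state}_{year}.parquet"
            )
    return paths


paths = expected_cohort_files("/output/cohort/AL", "AL", [2016, 2017], ["ip", "ot"])
missing = [p for p in paths if not p.exists()]
print(f"{len(missing)} of {len(paths)} expected files not found")
```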
The cohort_{STATE}_{YEAR}.csv file contains these columns (indexed by BENE_MSIS):
| Column Pattern | Description |
|---|---|
| `include` | 1 if patient is included in the final cohort, 0 if excluded |
| `YEAR` | Claim year |
| `STATE_CD` | State code |
| `birth_date` | Date of birth (merged from PS file) |
| `{type}_diag_{condition}` | 1 if condition found in that claim type (e.g., `ip_diag_diabetes_t2`) |
| `{type}_diag_{condition}_date` | Date of first occurrence of the condition |
| `{type}_proc_{procedure}` | 1 if procedure found in that claim type (e.g., `ip_proc_methadone`) |
| `{type}_proc_{procedure}_date` | Date of first occurrence of the procedure |
| `{type}_diag_condn` | 1 if ANY diagnosis condition matched in that claim type |
| `{type}_diag_condn_date` | Date of first ANY diagnosis condition |
| `{type}_proc_condn` | 1 if ANY procedure condition matched in that claim type |
| `{type}_proc_condn_date` | Date of first ANY procedure condition |
Here `{type}` is the claim type (`ip`, `ot`, `ot_line`), `{condition}` is a key under `"diag_codes"`, and `{procedure}` is a key under `"proc_codes"` in the dictionary passed as `dct_diag_proc_codes`.
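The column patterns can be expanded programmatically, which is handy when selecting columns after loading the cohort file. The sketch below uses a hypothetical helper, `cohort_flag_columns`, and simply applies the documented patterns. Whether every generated column is present in a given run (for example the `_proc_` columns when no procedure codes are supplied) is an assumption to verify against your own output.

```python
def cohort_flag_columns(claim_types, conditions, procedures):
    """Expand the documented column patterns into concrete names (hypothetical)."""
    columns = []
    for claim_type in claim_types:
        for condition in conditions:
            columns.append(f"{claim_type}_diag_{condition}")
            columns.append(f"{claim_type}_diag_{condition}_date")
        for procedure in procedures:
            columns.append(f"{claim_type}_proc_{procedure}")
            columns.append(f"{claim_type}_proc_{procedure}_date")
        # Aggregate "any condition" flags and their first-occurrence dates.
        for kind in ("diag", "proc"):
            columns.append(f"{claim_type}_{kind}_condn")
            columns.append(f"{claim_type}_{kind}_condn_date")
    return columns


cols = cohort_flag_columns(["ip"], ["diabetes_t2"], ["methadone"])
print(cols)
```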
During preprocessing, claims are flagged with exclusion columns (prefixed excl_). Filter keys in dct_filters omit the excl_ prefix:
- IP claims: `excl_missing_dob`, `excl_missing_admsn_date`, `excl_encounter_claim`, `excl_capitation_claim`, `excl_ffs_claim`, `excl_delivery`, `excl_female`, `excl_duplicated`
- OT claims: `excl_missing_dob`, `excl_missing_srvc_bgn_date`, `excl_encounter_claim`, `excl_capitation_claim`, `excl_ffs_claim`, `excl_female`, `excl_duplicated`
- PS claims: `excl_duplicated_bene_id`
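Because a mistyped filter key is easy to miss, a small check against the flag lists above can help. The sketch below is illustrative: `EXCLUSION_FLAGS` copies the lists from this page and `unrecognised_filter_keys` is a hypothetical helper, not a `medicaid_utils` function. A key it reports is not necessarily wrong, since a filter may also target an ordinary column.

```python
EXCLUSION_FLAGS = {
    "ip": [
        "excl_missing_dob", "excl_missing_admsn_date", "excl_encounter_claim",
        "excl_capitation_claim", "excl_ffs_claim", "excl_delivery",
        "excl_female", "excl_duplicated",
    ],
    "ot": [
        "excl_missing_dob", "excl_missing_srvc_bgn_date", "excl_encounter_claim",
        "excl_capitation_claim", "excl_ffs_claim", "excl_female",
        "excl_duplicated",
    ],
    "ps": ["excl_duplicated_bene_id"],
}


def unrecognised_filter_keys(claim_type, filters):
    """Return non-range filter keys with no listed excl_ flag for the claim type."""
    known = {flag[len("excl_"):] for flag in EXCLUSION_FLAGS[claim_type]}
    unknown = []
    for key in filters:
        if key.startswith("range_"):
            continue  # Range filters target data columns, not exclusion flags.
        # This helper accepts keys written with or without the excl_ prefix.
        name = key[len("excl_"):] if key.startswith("excl_") else key
        if name not in known:
            unknown.append(key)
    return unknown


ip_filters = {"missing_dob": 0, "range_numeric_age_prncpl_proc": (18, 64)}
print(unrecognised_filter_keys("ip", ip_filters))
print(unrecognised_filter_keys("ot", {"delivery": 0, "missing_dob": 0}))
```

The second call reports `delivery` because that flag is listed only for IP claims.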
```python
for state in ["AL", "IL", "CA", "NY", "TX"]:
    extract_cohort(
        state=state,
        lst_year=[2016, 2017, 2018],
        dct_diag_proc_codes=dct_codes,
        dct_filters=dct_filters,
        lst_types_to_export=["ip", "ot", "ps"],
        dct_data_paths={
            "source_root": "/data/cms/",
            "export_folder": f"/output/cohort/{state}/",
        },
        cms_format="TAF",
    )
```
- Common Recipes — More code examples
- Risk Adjustment Algorithms — Apply after cohort extraction