Cohort Extraction

Manu Murugesan edited this page Mar 14, 2026 · 2 revisions

The cohort extraction module is the primary tool for building patient-level analytic files. It identifies patients matching diagnosis/procedure criteria, applies inclusion/exclusion filters, and exports the resulting claim files.

Basic Usage

from medicaid_utils.filters.patients.cohort_extraction import extract_cohort

extract_cohort(
    state="AL",
    lst_year=[2016, 2017, 2018],
    dct_diag_proc_codes=dct_codes,
    dct_filters=dct_filters,
    lst_types_to_export=["ip", "ot", "ps"],
    dct_data_paths=dct_paths,
    cms_format="TAF",
)

Defining Diagnosis Codes

Use ICD-9 and/or ICD-10 codes with inclusion and exclusion logic. Codes are matched by prefix: "250" matches "2500", "25000", "25002", etc.

dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],   # ICD-9 prefix
                10: ["E11"],  # ICD-10 prefix
            },
            "excl": {
                9: ["25001", "25003", "25011", "25013"],  # Odd 5th digits = Type 1
                10: ["E10"],  # Exclude Type 1
            },
        },
    },
    "proc_codes": {},
}
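
The inclusion/exclusion prefix logic described above can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the helper name `matches_condition` is hypothetical, and it assumes that exclusion codes are prefix-matched in the same way as inclusion codes and that codes are stored without the decimal point.

```python
# Hypothetical helper illustrating prefix matching with incl/excl lists.
# Assumption: exclusions are prefix-matched like inclusions.
DIABETES_T2 = {
    "incl": {9: ["250"], 10: ["E11"]},
    "excl": {9: ["25001", "25003", "25011", "25013"], 10: ["E10"]},
}


def matches_condition(code, icd_version, rule):
    """Return True if `code` starts with an inclusion prefix and no exclusion prefix."""
    incl = tuple(rule.get("incl", {}).get(icd_version, []))
    excl = tuple(rule.get("excl", {}).get(icd_version, []))
    # str.startswith accepts a tuple of prefixes; an empty tuple never matches.
    return code.startswith(incl) and not code.startswith(excl)


print(matches_condition("25000", 9, DIABETES_T2))  # matches "250", not excluded
print(matches_condition("25001", 9, DIABETES_T2))  # excluded Type 1 code
```

Passing the ICD version explicitly keeps an ICD-9 prefix such as "250" from being compared against ICD-10 codes.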

Defining Procedure Codes

Procedure codes are keyed by procedure coding system:

dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            7: [  # ICD-10-PCS (system code 7)
                "HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ",
            ],
        },
    },
}

Common procedure system codes:

  • 1 — CPT/HCPCS
  • 6 — ICD-9-CM procedure
  • 7 — ICD-10-PCS
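
To avoid bare numbers in the dictionary, the system codes listed above can be given names. This is only a readability sketch: the constant names are hypothetical and are not part of medicaid_utils, and the values are copied from the list above.

```python
# Hypothetical named constants for the procedure system codes listed above.
PROC_SYS_CPT_HCPCS = 1
PROC_SYS_ICD9_PROC = 6
PROC_SYS_ICD10_PCS = 7

dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            PROC_SYS_ICD10_PCS: ["HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ"],
        },
    },
}
```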

Defining Filters

Filters control which claims and patients are included:

dct_filters = {
    "cohort": {
        "ip": {
            "missing_dob": 0,  # Exclude missing DOB
            "range_numeric_age_prncpl_proc": (18, 64),  # Age 18-64
        },
        "ot": {
            "missing_dob": 0,
            "range_numeric_age_srvc_bgn": (18, 64),
        },
    },
    "export": {},
}

Filter Types

| Type | Example | Description |
|---|---|---|
| Column value | "missing_dob": 0 | Keep rows where column equals value |
| Numeric range | "range_numeric_age_srvc_bgn": (18, 64) | Keep rows where column is within range (inclusive) |
| Date range | "range_date_srvc_bgn_date": ("20160101", "20181231") | Keep rows where date is within range |
| Exclusion | "excl_female": 1 | Exclude patients with positive exclusion flag |
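
As a rough mental model of the first three filter types, the sketch below applies them to a list of plain dictionaries. It is not how the module filters claims internally: `apply_filters` is a hypothetical helper, the rows are made-up toy data, and it assumes that the column name is whatever follows the `range_numeric_` or `range_date_` prefix, that plain keys refer to the `excl_`-prefixed flag column, and that date ranges are inclusive.

```python
from datetime import date, datetime


def apply_filters(rows, filters):
    """Keep rows (dicts) that satisfy every filter in `filters` (toy model)."""

    def keep(row):
        for key, value in filters.items():
            if key.startswith("range_numeric_"):
                low, high = value
                if not (low <= row[key[len("range_numeric_"):]] <= high):
                    return False
            elif key.startswith("range_date_"):
                start, end = (datetime.strptime(v, "%Y%m%d").date() for v in value)
                if not (start <= row[key[len("range_date_"):]] <= end):
                    return False
            elif row["excl_" + key] != value:
                # Column-value filter: the key omits the excl_ prefix.
                return False
        return True

    return [row for row in rows if keep(row)]


# Made-up rows, for illustration only.
TOY_ROWS = [
    {"bene": "A", "excl_missing_dob": 0, "age_srvc_bgn": 30, "srvc_bgn_date": date(2017, 5, 1)},
    {"bene": "B", "excl_missing_dob": 1, "age_srvc_bgn": 40, "srvc_bgn_date": date(2017, 5, 1)},
    {"bene": "C", "excl_missing_dob": 0, "age_srvc_bgn": 17, "srvc_bgn_date": date(2016, 3, 1)},
    {"bene": "D", "excl_missing_dob": 0, "age_srvc_bgn": 64, "srvc_bgn_date": date(2019, 1, 1)},
]
kept = apply_filters(
    TOY_ROWS,
    {
        "missing_dob": 0,
        "range_numeric_age_srvc_bgn": (18, 64),
        "range_date_srvc_bgn_date": ("20160101", "20181231"),
    },
)
```

Here only row A survives: B has a missing date of birth, C is under 18, and D falls outside the date range.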

Output Files

After extraction, the export folder contains:

  • cohort_{STATE}.csv — patient-level file with condition flags, inclusion indicator, and date of birth
  • cohort_{STATE}_{YEAR}.csv — year-specific patient file
  • cohort_exclusions_{TYPE}_{STATE}_{YEAR}.parquet — filter statistics
  • Exported claim files in the requested format (CSV or Parquet)
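
A quick way to check an export is to build the file names you expect from the patterns above. The sketch below is an assumption-laden illustration: `expected_outputs` is a hypothetical helper, it assumes one exclusion-statistics file per claim type and year, and it substitutes the state and type values exactly as passed, so adjust the casing if your output differs.

```python
from pathlib import Path


def expected_outputs(export_folder, state, years, claim_types):
    """List the cohort and exclusion-statistics paths implied by the naming patterns."""
    folder = Path(export_folder)
    files = [folder / f"cohort_{state}.csv"]
    for year in years:
        files.append(folder / f"cohort_{state}_{year}.csv")
        for claim_type in claim_types:
            # Assumption: one statistics file per claim type and year.
            files.append(folder / f"cohort_exclusions_{claim_type}_{state}_{year}.parquet")
    return files


paths = expected_outputs("/output/cohort/AL/", "AL", [2016, 2017], ["ip", "ot"])
missing = [p for p in paths if not p.exists()]  # files not found on disk
```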

Cohort File Columns

The cohort_{STATE}_{YEAR}.csv file contains these columns (indexed by BENE_MSIS):

| Column Pattern | Description |
|---|---|
| include | 1 if patient is included in the final cohort, 0 if excluded |
| YEAR | Claim year |
| STATE_CD | State code |
| birth_date | Date of birth (merged from PS file) |
| {type}_diag_{condition} | 1 if condition found in that claim type (e.g., ip_diag_diabetes_t2) |
| {type}_diag_{condition}_date | Date of first occurrence of the condition |
| {type}_proc_{procedure} | 1 if procedure found in that claim type (e.g., ip_proc_methadone) |
| {type}_proc_{procedure}_date | Date of first occurrence of the procedure |
| {type}_diag_condn | 1 if ANY diagnosis condition matched in that claim type |
| {type}_diag_condn_date | Date of first ANY diagnosis condition |
| {type}_proc_condn | 1 if ANY procedure condition matched in that claim type |
| {type}_proc_condn_date | Date of first ANY procedure condition |

Where {type} is the claim type (ip, ot, ot_line), {condition} comes from the keys under "diag_codes" in dct_diag_proc_codes, and {procedure} comes from the keys under "proc_codes".
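
The column patterns can be expanded mechanically, which helps when selecting columns from the cohort file. The helper below is hypothetical and only formats names from the patterns in the table; it assumes the `_condn` summary columns are present for every claim type, which may not hold if no codes were defined.

```python
def cohort_flag_columns(claim_types, conditions, procedures):
    """Expand the documented column patterns into concrete column names."""
    columns = []
    for claim_type in claim_types:
        for condition in conditions:
            columns += [f"{claim_type}_diag_{condition}", f"{claim_type}_diag_{condition}_date"]
        for procedure in procedures:
            columns += [f"{claim_type}_proc_{procedure}", f"{claim_type}_proc_{procedure}_date"]
        # Summary flags across all conditions / procedures for this claim type.
        columns += [
            f"{claim_type}_diag_condn",
            f"{claim_type}_diag_condn_date",
            f"{claim_type}_proc_condn",
            f"{claim_type}_proc_condn_date",
        ]
    return columns


cols = cohort_flag_columns(["ip", "ot"], ["diabetes_t2"], ["methadone"])
```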

Exclusion Columns on Claim Files

During preprocessing, claims are flagged with exclusion columns (prefixed excl_). Filter keys in dct_filters omit the excl_ prefix, so "missing_dob" refers to the excl_missing_dob column. The available columns are:

IP claims: excl_missing_dob, excl_missing_admsn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_delivery, excl_female, excl_duplicated

OT claims: excl_missing_dob, excl_missing_srvc_bgn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_female, excl_duplicated

PS file: excl_duplicated_bene_id
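
Because a mistyped filter key is easy to miss, it can help to check keys against the column lists above before running an extraction. The sketch below is a hypothetical pre-flight check, not a feature of the module: it copies the lists from this page, skips `range_` keys, and tolerates keys written with or without the `excl_` prefix.

```python
# Exclusion columns copied from the lists above.
EXCL_COLUMNS = {
    "ip": ["excl_missing_dob", "excl_missing_admsn_date", "excl_encounter_claim",
           "excl_capitation_claim", "excl_ffs_claim", "excl_delivery",
           "excl_female", "excl_duplicated"],
    "ot": ["excl_missing_dob", "excl_missing_srvc_bgn_date", "excl_encounter_claim",
           "excl_capitation_claim", "excl_ffs_claim", "excl_female",
           "excl_duplicated"],
    "ps": ["excl_duplicated_bene_id"],
}


def unknown_filter_keys(claim_type, filters):
    """Return non-range filter keys with no matching excl_ column for the claim type."""
    known = {col[len("excl_"):] for col in EXCL_COLUMNS[claim_type]}
    unknown = []
    for key in filters:
        if key.startswith("range_"):
            continue  # range filters refer to data columns, not exclusion flags
        name = key[len("excl_"):] if key.startswith("excl_") else key
        if name not in known:
            unknown.append(key)
    return sorted(unknown)
```

For example, `unknown_filter_keys("ot", {"delivery": 0})` reports "delivery", since excl_delivery is listed only for IP claims.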

Multiple States

for state in ["AL", "IL", "CA", "NY", "TX"]:
    extract_cohort(
        state=state,
        lst_year=[2016, 2017, 2018],
        dct_diag_proc_codes=dct_codes,
        dct_filters=dct_filters,
        lst_types_to_export=["ip", "ot", "ps"],
        dct_data_paths={
            "source_root": "/data/cms/",
            "export_folder": f"/output/cohort/{state}/",
        },
        cms_format="TAF",
    )
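
With several states in one loop, a failure in one state stops the rest. One possible pattern, sketched below, is to wrap each call and collect the errors. `run_for_states` is a hypothetical wrapper and the broad `except Exception` is a deliberate simplification; the per-state callable would contain the `extract_cohort(...)` call shown above.

```python
def run_for_states(states, run_one):
    """Call `run_one(state)` for each state, collecting exceptions instead of stopping."""
    failures = {}
    for state in states:
        try:
            run_one(state)
        except Exception as exc:  # broad on purpose: record the error and move on
            failures[state] = exc
    return failures


# Usage sketch (assumes extract_cohort and the dictionaries are defined as above):
# failures = run_for_states(
#     ["AL", "IL", "CA", "NY", "TX"],
#     lambda state: extract_cohort(state=state, ...),  # fill in the remaining arguments
# )
```

The returned dictionary maps each failed state to its exception, so the failed states can be rerun on their own.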
