Software Credentialed Access

Code for generating the HAIM multimodal dataset of MIMIC-IV clinical data and x-rays

Luis R. Soenksen, Yu Ma, Cynthia Zeng, Leonard David Jean Boussioux, Kimberly Villalobos Carballo, Liangyuan Na, Holly Wiberg, Michael Li, Ignacio Fuentes, Dimitris Bertsimas

Published: Aug. 23, 2022. Version: 1.0.1


When using this resource, please cite:
Soenksen, L. R., Ma, Y., Zeng, C., Boussioux, L. D. J., Villalobos Carballo, K., Na, L., Wiberg, H., Li, M., Fuentes, I., & Bertsimas, D. (2022). Code for generating the HAIM multimodal dataset of MIMIC-IV clinical data and x-rays (version 1.0.1). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/3f8d-qe93


Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.


Abstract

This resource provides code for generating a multimodal combination of the MIMIC-IV v1.0.0 and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 databases, filtered to include only patients with at least one chest X-ray, with the goal of validating multimodal predictive analytics in healthcare operations. The multimodal dataset generated by this code contains 34,540 individual patient files in the form of "pickle" Python object structures, covering a total of 7,279 hospitalization stays involving 6,485 unique patients. Code to extract feature embeddings, as well as the list of pre-processed features, is also included in this repository.


Background

As described in Soenksen et al 2022 [3], the MIMIC datasets can be used to test multimodal machine learning systems. To generate a multimodal dataset, our project uses the Medical Information Mart for Intensive Care (MIMIC)-IV v1.0 [1] resource, which contains de-identified records of 383,220 individual patients admitted to the intensive care unit (ICU) or emergency department (ED) of the Beth Israel Deaconess Medical Center (BIDMC), in combination with the MIMIC Chest X-ray (MIMIC-CXR-JPG) database v2.0.0 [2], which contains 377,110 radiology images with free-text reports representing 227,835 medical imaging events that can be matched to corresponding patients in MIMIC-IV v1.0.

We combined MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2] into a unified multimodal dataset, identified as HAIM-MIMIC-MM in Soenksen et al 2022 [3], based on matched patient, admission, and imaging-study identifiers (i.e., subject_id, stay_id, and study_id from the MIMIC-IV and MIMIC-CXR-JPG databases). We used this multimodal dataset to systematically evaluate the improvements in predictive value that multimodality brings to canonical artificial intelligence models for healthcare. The file format produced by the present code includes structured patient information, time-series data, medical images, and unstructured text notes for each patient.

Building this combination of MIMIC-IV and MIMIC-CXR-JPG into independent patient files for use with the Holistic Artificial Intelligence in Medicine (HAIM) framework presented in Soenksen et al 2022 [3] requires credentialed access to MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2]. A GitHub repository describing the use of this multimodal combination database as a canonical example to train multimodal artificial intelligence models for clinical use and healthcare operations can be found online [4].


Software Description

The multimodal clinical database used in Soenksen et al 2022 [3] contains N=34,537 samples, spanning 7,279 unique hospitalizations and 6,485 patients. It covers four distinct data modalities (i.e., tabular data, time-series information, text notes, and X-ray images).

Every patient file in this multimodal dataset includes information extracted from the following fields in MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2]: admissions, demographics, transfers, core, diagnoses_icd, drgcodes, emar, emar_detail, hcpcsevents, labevents, microbiologyevents, poe, poe_detail, prescriptions, procedures_icd, services, procedureevents, outputevents, inputevents, icustays, datetimeevents, chartevents, cxr, imcxr, noteevents, dsnotes, ecgnotes, echonotes, radnotes. We have created sample Jupyter notebooks and Python files to showcase how this structure is generated from MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2].

Our chosen structure of individual patient files in pickle format provides several advantages for training artificial intelligence and machine learning models on this multimodal dataset. Python handles pickle files natively and quickly, which allows for fast loading, and the individualized patient files make it easy to feed patient samples matching selected criteria into training algorithms for standard open-source machine learning libraries written in Python.

The provided code processes and saves all individual patient files locally as "pickle" Python object structures for ease of processing in subsequent sampling and modeling tasks. The final file structure is organized in folders of 1,000 files each; file names follow the pattern haim-ID.pkl, and the mapping between haim-ID and the MIMIC patient IDs is recorded in the file "haim_mimiciv_key_ids.csv".

The definition of the data structure for all patient files in relation to the individualized data in each pickle file is as follows:

Patient class structure

class Patient_ICU(object):
    def __init__(self, admissions, demographics, transfers, core,
                 diagnoses_icd, drgcodes, emar, emar_detail, hcpcsevents,
                 labevents, microbiologyevents, poe, poe_detail,
                 prescriptions, procedures_icd, services, procedureevents,
                 outputevents, inputevents, icustays, datetimeevents,
                 chartevents, cxr, imcxr, noteevents, dsnotes, ecgnotes,
                 echonotes, radnotes):
        ## CORE
        self.admissions = admissions
        self.demographics = demographics
        self.transfers = transfers
        self.core = core
        ## HOSP
        self.diagnoses_icd = diagnoses_icd
        self.drgcodes = drgcodes
        self.emar = emar
        self.emar_detail = emar_detail
        self.hcpcsevents = hcpcsevents
        self.labevents = labevents
        self.microbiologyevents = microbiologyevents
        self.poe = poe
        self.poe_detail = poe_detail
        self.prescriptions = prescriptions
        self.procedures_icd = procedures_icd
        self.services = services
        ## ICU
        self.procedureevents = procedureevents
        self.outputevents = outputevents
        self.inputevents = inputevents
        self.icustays = icustays
        self.datetimeevents = datetimeevents
        self.chartevents = chartevents
        ## CXR
        self.cxr = cxr
        self.imcxr = imcxr
        ## NOTES
        self.noteevents = noteevents
        self.dsnotes = dsnotes
        self.ecgnotes = ecgnotes
        self.echonotes = echonotes
        self.radnotes = radnotes
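As a minimal illustration of how such per-patient pickle files behave, the sketch below serializes and reloads a stand-in object carrying only two of the attributes above. The stand-in class, its field values, and the use of an in-memory buffer instead of a haim-ID.pkl file on disk are illustrative assumptions, not the project's actual generation or loading code:

```python
import io
import pickle

# Hypothetical stand-in for Patient_ICU with only two of its attributes.
class PatientStub:
    def __init__(self, admissions, cxr):
        self.admissions = admissions
        self.cxr = cxr

# Serialize one patient object, as the generation code does per haim-ID.pkl...
buf = io.BytesIO()
pickle.dump(PatientStub(admissions=["HADM-1"], cxr=["study-1"]), buf)

# ...then load it back, as a downstream training pipeline would.
buf.seek(0)
patient = pickle.load(buf)
print(patient.admissions)  # ['HADM-1']
```

Note that unpickling requires the class definition to be importable in the loading process, which is why the generation and modeling code share the Patient_ICU definition.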

In the specific context of Soenksen et al 2022 [3], the per-patient "pickle" file format was selected to provide an interface to common machine learning and artificial intelligence modeling workflows, which rely heavily on Python for computational experiments.

In addition to the code that generates the multimodal patient files, we include the HAIM embeddings extracted from those files for convenience. We hope this format gives a wide audience more direct access to this merged dataset.


Technical Implementation

The multimodal combination of MIMIC-IV v1.0.0 [1] and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 [2] is processed by our code by first importing all MIMIC-IV tables together with the compressed JPG images from the MIMIC-CXR-JPG database, both of which need to be downloaded locally via credentialed access on PhysioNet. Both data sources have previously been independently de-identified by removing all protected health information (PHI), following the Safe Harbor requirements of the US Health Insurance Portability and Accountability Act of 1996 (HIPAA).

After access is granted on PhysioNet, our code unifies the registries of MIMIC-IV v1.0.0 [1] and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 [2] based on matched patient, admission, and imaging-study identifiers (i.e., subject_id, stay_id, study_id). We have created a HAIM GitHub repository [4] for collaborative development of the HAIM framework, testing, and reproduction of the results presented in Soenksen et al 2022 [3]. We welcome code contributions from all users, and we encourage discussion of the data via GitHub issues.
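The identifier-based unification can be sketched as a toy join on the shared keys. The miniature tables and their values below are hypothetical stand-ins; the real code operates on the full MIMIC-IV tables and the MIMIC-CXR-JPG metadata:

```python
import pandas as pd

# Hypothetical miniature registries sharing the subject_id key.
mimic_iv = pd.DataFrame({
    "subject_id": [10, 11],
    "stay_id": [100, 110],
    "admission_type": ["EW EMER.", "ELECTIVE"],
})
mimic_cxr = pd.DataFrame({
    "subject_id": [10, 12],
    "study_id": [5000, 5001],
})

# An inner join keeps only patients with at least one chest X-ray study,
# mirroring the filtering criterion of the HAIM-MIMIC-MM dataset.
haim_mm = mimic_iv.merge(mimic_cxr, on="subject_id", how="inner")
print(len(haim_mm))  # 1
```

In this toy example only subject 10 appears in both registries, so only that patient survives the merge.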


Installation and Requirements

As specified in Soenksen et al 2022 [3], the individual multimodal HAIM patient files were generated using a computer with 8 cores and 32 GB of available random access memory (RAM), and a minimum of 20 GB of RAM is usually required for this processing task. All required installations for this project are specified in the "env.yaml" file within the "env" folder. A sample of 5 folders with previously generated multimodal patient files (Folder00 to Folder04) is included in the "Sample_Multimodal_Patient_Files" repository to facilitate testing and validation of the merged dataset, in the form of individual patient files, along with the multimodal machine learning techniques presented in Soenksen et al 2022 [3].
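Assuming a conda installation is available and the repository contents (including the "env" folder) have been downloaded locally, the environment described in "env.yaml" can typically be recreated with something like:

```shell
# Setup sketch; the environment name to activate is whatever env.yaml defines.
conda env create -f env/env.yaml
# conda activate <env-name-from-yaml>
```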


Usage Notes

Three Jupyter notebook files demonstrate:

  1. The generation of the merged HAIM-MIMIC-MM dataset ("1_Generate_HAIM-MIMIC-MM.ipynb")
  2. The generation of embeddings based on the individual patient files from HAIM-MIMIC-MM ("Generate_Embeddings_from_Pickle_Files.ipynb"), and
  3. Sample utilization of such embeddings for the creation of a predictive task using machine learning ("3_Use_Embedding_for_Prediction.ipynb")
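As a rough sketch of the third step, the snippet below trains a classifier on a synthetic embedding matrix. The random features, the binary label, and the choice of logistic regression are illustrative assumptions; the notebook's actual HAIM embeddings, prediction targets, and models may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a HAIM embedding matrix: one row per patient
# sample, one column per fused multimodal embedding dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
# Synthetic binary outcome correlated with the first embedding dimension.
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(round(accuracy, 2))
```

Because the label is driven by one embedding dimension, the classifier beats chance on the held-out split; with real HAIM embeddings the same train/evaluate pattern applies.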

All other code needed to evaluate this multimodal database and reproduce the conclusions of its companion modeling work in Soenksen et al 2022 [3] is available in our GitHub repository [4].


Release Notes

Version 1.0.0: First release of the software and sample data.

Version 1.0.1: Updated release of the software and sample data.


Ethics

The authors declare no ethics concerns.


Acknowledgements

We thank the PhysioNet team from the MIT Laboratory for Computational Physiology for providing our researchers with credentialed access to the MIMIC-IV v1.0.0 [1] and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 [2] datasets and for their support in guiding multimodal data interrogation and consolidation. We especially thank Leo A. Celi and Sicheng Hao for their support in reviewing the HAIM data, as well as the Harvard T.H. Chan School of Public Health, Harvard Medical School, the Institute for Medical Engineering and Science at MIT, and the Beth Israel Deaconess Medical Center for their continued support of this work. We thank the MIT SuperCloud team for their support and help in setting up a workspace, as well as for offering technical advice throughout the project.


Conflicts of Interest

Authors declare no competing interests.


References

  1. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet. https://doi.org/10.13026/s6n6-xd98.
  2. Johnson, A., Lungren, M., Peng, Y., Lu, Z., Mark, R., Berkowitz, S., & Horng, S. (2019). MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet. https://doi.org/10.13026/8360-t248.
  3. Soenksen, L.R., Ma, Y., Zeng, C., Boussioux, L.D., Carballo, K.V., Na, L., Wiberg, H.M., Li, M.L., Fuentes, I. and Bertsimas, D., 2022. Integrated multimodal artificial intelligence framework for healthcare applications. arXiv preprint arXiv:2202.12998.
  4. Soenksen, L.R., Ma, Y., Zeng, C., Boussioux, L.D., Carballo, K.V., Na, L., Wiberg, H.M., Li, M.L., Fuentes, I., & Bertsimas, D. (2022). Holistic Artificial Intelligence in Medicine (HAIM). GitHub repository. https://github.com/lrsoenksen/HAIM

Parent Projects

This resource was derived from MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2]; please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
