MedGen Overview
What is MedGen?
MedGen is a portal to information about conditions, phenotypes, and findings in humans related to medical genetics. Diseases represented in MedGen include Mendelian disorders, multifactorial disorders, chronic disease susceptibilities, somatic phenotypes, and pharmacogenetic responses. MedGen also includes infectious disease terms to support submitters to the NIH Genetic Testing Registry (GTR) who register infectious disease tests, and clinicians looking for tests and terms related to infectious agents in human samples. Terms from GTR and ClinVar submissions, the Unified Medical Language System (UMLS), Online Mendelian Inheritance in Man (OMIM), the Human Phenotype Ontology (HPO), the Mondo Disease Ontology (Mondo), rare disease terms from the Orphanet Rare Disease Ontology (ORDO), and other sources are integrated into unique concepts. Each concept is assigned a distinct identifier, called the concept unique identifier (CUI), and a preferred name.
The core content of a concept record includes names, identifiers used by other databases with links to the external resources (e.g., ontologies such as HPO, ORDO, and Mondo), modes of inheritance, clinical features, and map locations of the loci affecting the phenotype. The integrated information about each concept is presented along with concept-related information from various sources, including resources for clinicians (e.g., OMIM), resources for consumers (e.g., MedlinePlus), genetic tests registered in GTR, medical literature (e.g., GeneReviews, PubMed), and molecular resources (e.g., ClinVar, RefSeqGene).
Links to GTR, GeneReviews, and Professional Guidelines are based on automated internal processes and curation. Data from all other sources are retrieved periodically by automated processes, then reviewed and curated when necessary. MedGen data is also informed by feedback from the community. You can contact the MedGen team with any questions or suggestions.
MedGen’s Data Sources
MedGen includes data from multiple authoritative sources, community submissions to GTR and ClinVar, and terms curated by MedGen subject matter experts.
Source | Frequency of update | Data from source | Percent of MedGen terms mapped by source* |
UMLS | 2x/year | CUI, descriptions, name and ID | 96% |
HPO | Monthly | Name and ID, P:D | 8% |
Mondo | Monthly | Name, ID; Orphanet and GARD | 10% |
Orphanet (ORDO) | Monthly** | Name, ID, MoI | 4% |
OMIM | Daily | Name, ID, G:D, description | 5% |
GeneReviews | Weekly | Name, G:D, description | 0.4% |
PharmGKB | Monthly | Drug name, ID, Rx:G | 0.5% |
MedGen (internal) | Daily | Name, CN CUI, Rx:D, D:D | 2% |
NCBI Gene | Daily | Gene symbol, chromosome location | N/A |
UMLS- Unified Medical Language System Metathesaurus; HPO- Human Phenotype Ontology; Mondo- Mondo Disease Ontology; ORDO- Orphanet Rare Disease Ontology; OMIM- Online Mendelian Inheritance in Man; CUI- Concept Unique Identifier; CN CUI- NCBI-generated CUI (begins with "CN"); ID- external source identification code; GARD- Genetic and Rare Diseases Information Center; MoI- Mode of Inheritance; P:D- Phenotype to disease relationship data; G:D- Gene to disease relationship data; Rx:G- Drug to gene relationship data; Rx:D- Drug to disease (drug-response) relationship data; D:D- Disease to disease relationship data.
* MedGen terms can have one or more external identifiers/sources mapped, so the column will not total 100%. Percentages are averaged from data analyzed in 2024.
**Processed with Mondo data release; also manually curated against 2x/year data release from Orphanet.
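The footnote on percentages can be illustrated with a minimal sketch. The term names and source assignments below are invented for illustration and are not MedGen data; the point is only that when a term can carry mappings from several sources, per-source percentages overlap and their sum can exceed 100%.

```python
from collections import Counter

# Hypothetical terms, each with the set of sources mapped to it.
# These are illustrative placeholders, not real MedGen records.
TERM_SOURCES = {
    "term_a": {"UMLS", "OMIM"},
    "term_b": {"UMLS", "HPO"},
    "term_c": {"UMLS"},
    "term_d": {"MedGen"},
}


def percent_mapped_by_source(term_sources):
    """Return {source: percent of all terms that carry a mapping from it}."""
    counts = Counter(
        source for sources in term_sources.values() for source in sources
    )
    total = len(term_sources)
    return {source: 100 * n / total for source, n in counts.items()}


percentages = percent_mapped_by_source(TERM_SOURCES)
```

In this toy data set, three of four terms map to UMLS, and two of those also map to a second source, so the per-source percentages add up to more than 100%.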
From these sources, MedGen processes the preferred or primary name used to describe a record as well as the identifier (ID) used by the source to facilitate automated mapping of these sources to one another. For data sources processed exclusively from UMLS, mapping in MedGen aligns with UMLS CUI mapping of the same term. [Records](/docs/navigating) in MedGen display additional data from these sources based on the harmonized mapping of all appropriate terms and sourcing to one MedGen entry.
From the data in UMLS, MedGen includes terms for human diseases, phenotypes, genes, and pharmacologic substances (drugs). UMLS is the only source for MedGen mapping of names and source identifier codes from SNOMED CT (US Edition, primary and synonym names), MeSH (Medical Subject Headings), and NCI. Whenever possible, CUIs from UMLS are used for MedGen terms. Disease descriptions or ‘definitions’ are also retrieved from UMLS for display in MedGen; the specific sources and prioritization rank for these descriptions are provided in detail in Definition Sources.
For more details regarding the processes and types of information incorporated into MedGen, please see MedGen Data Processing and Curation.
Data from MedGen (internal)
Rarely, MedGen is unable to align a submitted or curated disease record with UMLS or other sources. Each record in MedGen must have a Concept Unique Identifier (CUI), which is primarily drawn from the UMLS data set. If no CUI can be found in UMLS to match a record in MedGen, then an NCBI-generated CUI is provided instead; these begin with "CN" to make clear that they are not UMLS-provided CUIs (which all begin with "C"). These CN-type CUIs may also be created from submissions to the NIH Genetic Testing Registry or ClinVar that do not match a record in UMLS.
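As a minimal sketch of the prefix convention described above, the helper below labels a CUI by its origin. The function name, the assumption that the prefix is followed only by digits, and the sample identifiers are all hypothetical; the source states only that NCBI-generated CUIs begin with "CN" and UMLS CUIs begin with "C".

```python
import re

# Assumption: the prefix is followed by digits only. The text specifies
# the prefixes ("CN" vs. "C") but not the exact length or format.
NCBI_CUI = re.compile(r"^CN\d+$")
UMLS_CUI = re.compile(r"^C\d+$")


def cui_origin(cui: str) -> str:
    """Classify a CUI as 'NCBI', 'UMLS', or 'unknown' from its prefix."""
    # Test the more specific "CN" prefix first, since it also starts with "C".
    if NCBI_CUI.match(cui):
        return "NCBI"
    if UMLS_CUI.match(cui):
        return "UMLS"
    return "unknown"
```

For example, the made-up identifier `CN000001` would be labeled `NCBI` and `C0000001` would be labeled `UMLS`. Checking the "CN" pattern first matters because every "CN" identifier also begins with "C".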
MedGen also seeks to provide standardized terminology for the field of pharmacogenomics, the study of how an individual’s genetic code affects their response to medications. The PharmGKB resource provides a comprehensive list of medications and potential gene-drug interactions; MedGen creates disease records to describe abnormal responses to these drugs, which can be driven by genetic or environmental factors. These terms are created and maintained by MedGen rather than by an external source, though whenever possible links are made to expert clinical recommendations, FDA-approved drug labels, and PharmGKB pages for those medications.
MedGen Data Processing and Curation Overview
MedGen source data are mapped to common terms or records based on the names used by these sources, reported mappings between sources, and expert curation.
Figure 1: Overview of automated record updates and alignment across data sources and manual curation
Figure 1 legend: A: Automated Alignment process: When new versions of source data become available, automated pipelines download each source and process it into MedGen individually, making local copies. The relevant data is then subset and existing terms are updated as needed; any new terms are added. Depending on the data sources involved, mapping is done between identifiers, between concept preferred names, or both. Some sources provide identifier mappings between themselves and another data source, while others do not. For example, Mondo and UMLS do not have identifier mappings between them. Reports for manual resolution are generated when there are conflicts.
B: Manual Curation process: The manual curation process follows a standard decision tree to evaluate the source term and potential match(es) in MedGen. If the source data is unclear, curators contact the source to clarify the term scope and meaning. If MedGen terms are the source of the conflict, terms are curated in MedGen to align with the source by either splitting or merging the errant term or creating a novel MedGen record to support the source term. X-ref: Cross-reference from source to other sources/identifiers.
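The alignment logic in the legend can be sketched roughly as follows. This is an illustration under stated assumptions, not MedGen's actual pipeline: the `SourceTerm` class, the `align` function, the two lookup dictionaries, and every identifier in the usage example are hypothetical. A term is matched by cross-referenced identifier, by preferred name, or both; disagreement is reported for manual curation.

```python
from dataclasses import dataclass, field


@dataclass
class SourceTerm:
    """A term from an incoming source release (hypothetical structure)."""
    source: str
    source_id: str
    name: str
    xrefs: set = field(default_factory=set)  # IDs this source maps the term to


def align(term, id_index, name_index):
    """Return (cui, None) on a clean match, or (None, reason) otherwise.

    id_index maps external identifiers to CUIs; name_index maps
    case-folded preferred names to CUIs. Both are assumed inputs.
    """
    candidates = {id_index[x] for x in term.xrefs if x in id_index}
    by_name = name_index.get(term.name.casefold())
    if by_name is not None:
        candidates.add(by_name)
    if len(candidates) == 1:
        return candidates.pop(), None
    if not candidates:
        return None, "no match: candidate for a new record"
    return None, "conflict: report for manual curation"


# Usage with invented identifiers and names.
id_index = {"SRC_A:1": "C0000001", "SRC_A:2": "C0000002"}
name_index = {"example disease": "C0000001", "other disease": "C0000002"}

clean = align(SourceTerm("SRC_B", "B:10", "Example Disease", {"SRC_A:1"}),
              id_index, name_index)
conflict = align(SourceTerm("SRC_B", "B:11", "Other disease", {"SRC_A:1"}),
                 id_index, name_index)
novel = align(SourceTerm("SRC_B", "B:12", "Novel finding"),
              id_index, name_index)
```

Here the first term matches the same record by identifier and by name, the second matches different records by identifier and by name and is flagged as a conflict, and the third matches nothing and would be a candidate for a new record.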
The data import, mapping, and curation processes used to create the MedGen database are described more fully in MedGen Data Processing and Curation.
Last Update: March 2025