MedGen Data Processing and Curation


Building the database

MedGen is built by integrating data from multiple resources. The primary foundation is a subset of relevant medical genetics concepts from UMLS that is provided without restriction. This subset is updated according to the UMLS biannual release cycle (May and November). Other sources are added and mapped to this primary subset, as summarized in the data sources table below (Table 1). When concepts from these sources or submitters are not covered by UMLS, MedGen fills the gaps by creating new concepts.

The data model of MedGen is patterned after UMLS: sets of terms thought to be equivalent are grouped under a concept unique identifier (CUI). Equivalency depends on the term, its definition, and the type of concept it represents. For example, the same term may be both the name of a disorder and a description of a clinical feature. Because those are different types of terms, they may be assigned different CUI values. Terms from sources not in scope for UMLS may be integrated into concepts already created by UMLS, or new concepts may be created. Concepts from UMLS begin with the letter C followed by numerals. New concepts from MedGen's processing start with CN.
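
The prefix convention above can be checked mechanically. The following is a minimal sketch, assuming only what the text states: a MedGen-generated CUI starts with CN, and a UMLS CUI is the letter C followed by numerals. The function name and the example identifiers are hypothetical, and the number of digits is not specified here, so any non-empty run of digits is accepted.

```python
import re

# Assumed patterns, derived only from the prose above. The exact digit
# count is not stated in this document, so it is left unconstrained.
_MEDGEN_CUI = re.compile(r"CN\d+")
_UMLS_CUI = re.compile(r"C\d+")


def cui_origin(cui: str) -> str:
    """Return 'medgen', 'umls', or 'unknown' for a concept identifier."""
    if _MEDGEN_CUI.fullmatch(cui):
        return "medgen"
    if _UMLS_CUI.fullmatch(cui):
        return "umls"
    return "unknown"


# Hypothetical identifiers, used for illustration only.
print(cui_origin("CN000001"))
print(cui_origin("C0000001"))
```

Because a CN identifier may later be replaced by a UMLS one, the result describes where an identifier was minted, not whether it is still current.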

Note: A CN CUI generated by MedGen may be retired if UMLS generates a concept that corresponds to the one initiated by MedGen. The history of those changes is reported in MedGen_CUI_history.txt.
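
Downstream users who store CUIs may want to follow such retirements to the current identifier. The sketch below assumes the history has already been loaded into a dictionary that maps each retired CUI to its replacement; the column layout of MedGen_CUI_history.txt is not described on this page, so loading the file is deliberately left out. The function name and the identifiers are hypothetical.

```python
def resolve_current_cui(cui: str, history: dict[str, str]) -> str:
    """Follow retired -> replacement links until a CUI with no successor.

    `history` is an assumed in-memory mapping of retired CUI to its
    replacement; it is not the actual file format.
    """
    seen = {cui}
    while cui in history:
        cui = history[cui]
        if cui in seen:
            # Guard against a malformed mapping that loops back on itself.
            raise ValueError(f"cycle in CUI history at {cui}")
        seen.add(cui)
    return cui


# Hypothetical example: a MedGen-generated CUI replaced by a UMLS CUI.
example_history = {"CN000001": "C9999991"}
print(resolve_current_cui("CN000001", example_history))
```

An identifier that does not appear in the mapping is returned unchanged, so the function is safe to apply to every stored CUI.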

The automated integration of terms in MedGen can be reviewed by NCBI medical genetics curators. If NCBI staff members or external users question a mapping or identify a gap, curators review the data sources and the data flow to identify a solution. If you see information on a MedGen record that does not appear to be accurate, please contact the MedGen team.

Selected sources with frequency of update

Data are integrated into MedGen at the frequency of data release from the individual sources to provide the most current information to users. For example, MedGen is updated daily for OMIM data, but not all data sources update that often. Updates may include adding terms, adding connections to related concepts in other databases, or term updates.

Table 1: MedGen data sources and frequency of updates

| Source | Frequency | Comments |
|---|---|---|
| ClinVar | Daily | A subset of terms provided by submitters to ClinVar. Terms are reviewed by NCBI staff before being released to MedGen. Thus, there may be conditions reported to ClinVar that are not represented in MedGen. |
| GTR | Daily | Terms provided by those registering tests in the NIH Genetic Testing Registry (GTR). |
| GeneReviews | Daily | Terms may be reviewed by GTR staff. Definitions are added based on the MIM number relationship. |
| OMIM | Daily | Terms from OMIM are processed both from UMLS (which releases information twice a year) and from daily updates directly from OMIM. The direct updates from OMIM are also used as a foundation for reporting gene-disease relationships. The CUI assigned to a record defined by a MIM number may change with updates from UMLS, i.e., when a MedGen-generated CUI is replaced with one from UMLS. |
| Human Phenotype Ontology (HPO) | Monthly | A primary source for clinical features of Mendelian disorders. MedGen uses the mapping of preferred terms from HPO to CUI provided by UMLS. Until one is available, MedGen assigns a CUI starting with CN. Thus, the CUI used in MedGen for HPO-specific data may change with updates from UMLS. |
| Mondo | Monthly | Terms used by Mondo, their identifiers, their definitions, and the mappings between Mondo IDs and IDs from GARD, OMIM Phenotypic Series, and Orphanet. |
| OMIM phenotypic series | Monthly | Concepts represented by OMIM's phenotypic series are integrated into MedGen as part of releases from Mondo. |
| ORDO | Monthly/Annually | Terms and definitions from the Orphanet Rare Disease Ontology (ORDO) are processed into MedGen automatically each month, based on mappings established by Mondo. An additional data flow updates Orphanet concepts outside of Mondo's scope at least once per year. Concepts include disease records and the modes of inheritance characteristic of any disorder. |
| PharmGKB | Monthly | Drug names and identifiers from PharmGKB are processed, and drug response terms are automatically computed based on the drug names. When provided, descriptions of drug responses are also shown and attributed to PharmGKB based on the drug identifier. |
| UMLS | Twice a year | Representation of terms from UMLS is restricted to a subset of vocabulary sources and categories of terms (semantic types). Vocabulary sources included in the UMLS data flow include GARD, MeSH, NCI, OMIM, and SNOMED CT. The categories in UMLS used by MedGen are processed as properties and include Congenital Abnormality, Finding, Molecular Function, Pathologic Function, Disease or Syndrome, Mental or Behavioral Dysfunction, Pharmacologic Substance, Sign or Symptom, Anatomical Abnormality, and Neoplastic Process. |
| Medical Genetics Summaries | When published to the NCBI Bookshelf (unscheduled) | Definitions are submitted based on CUI. |
| Elements of Morphology | Unscheduled | Images from Elements of Morphology: Human Malformation Terminology (EOM) mapped to HPO terms are displayed on clinical feature records, as available. Data are updated as needed based on releases from EOM. |
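
The restriction on UMLS terms described in Table 1 (a subset of vocabulary sources and a subset of semantic types) can be pictured as a two-part filter. This is a minimal sketch under two assumptions: that a term must satisfy both restrictions at once, and that sources and semantic types are compared by the display names used on this page rather than by the abbreviations UMLS uses in its release files. The `Term` class, the function name, and the example terms are hypothetical.

```python
from dataclasses import dataclass

# Display names as listed in Table 1; not UMLS source abbreviations.
INCLUDED_SOURCES = {"GARD", "MeSH", "NCI", "OMIM", "SNOMED CT"}
INCLUDED_SEMANTIC_TYPES = {
    "Congenital Abnormality",
    "Finding",
    "Molecular Function",
    "Pathologic Function",
    "Disease or Syndrome",
    "Mental or Behavioral Dysfunction",
    "Pharmacologic Substance",
    "Sign or Symptom",
    "Anatomical Abnormality",
    "Neoplastic Process",
}


@dataclass(frozen=True)
class Term:
    """Hypothetical, simplified view of one UMLS term."""
    name: str
    source: str
    semantic_type: str


def in_medgen_subset(term: Term) -> bool:
    """Assumed rule: keep a term only if source AND semantic type are listed."""
    return (
        term.source in INCLUDED_SOURCES
        and term.semantic_type in INCLUDED_SEMANTIC_TYPES
    )


# Illustrative terms only; not taken from any release.
terms = [
    Term("example disorder", "OMIM", "Disease or Syndrome"),
    Term("example procedure", "MeSH", "Therapeutic or Preventive Procedure"),
]
print([t.name for t in terms if in_medgen_subset(t)])
```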


Sources of definitions

For a complete listing of the sources of definitions used by MedGen, please refer to the Sources of definitions page.

Sources of relationships between disorders and their clinical features

MedGen processes disease-to-clinical-feature relationships from two sources, OMIM and HPO. Both are represented as 'has_manifestation' in the relationships file (MGREL):

  • OMIM, based on data from UMLS
  • Human Phenotype Ontology (HPO), based on data from HPO and UMLS
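
As an illustration of how the 'has_manifestation' label could be used, the sketch below pulls matching pairs out of relationship rows. It assumes pipe-delimited rows with a header line that names the columns, and it assumes the columns are called CUI1, CUI2, and RELA as in UMLS-style RRF files; neither detail is stated on this page, so check the actual MGREL file before relying on it. The sample rows are invented and simplified, and the function does not assume which of the two CUIs is the disorder.

```python
def has_manifestation_pairs(lines):
    """Return (CUI1, CUI2) tuples for rows whose RELA is 'has_manifestation'.

    Assumes the first line is a header such as '#CUI1|...|RELA|...'.
    """
    rows = iter(lines)
    header_line = next(rows, None)
    if header_line is None:
        return []
    header = header_line.lstrip("#").rstrip("\r\n").split("|")
    col = {name: i for i, name in enumerate(header)}
    needed = max(col["CUI1"], col["CUI2"], col["RELA"])

    pairs = []
    for line in rows:
        fields = line.rstrip("\r\n").split("|")
        if len(fields) <= needed:
            continue  # skip blank or truncated rows
        if fields[col["RELA"]] == "has_manifestation":
            pairs.append((fields[col["CUI1"]], fields[col["CUI2"]]))
    return pairs


# Invented, simplified rows for illustration only.
sample = [
    "#CUI1|REL|CUI2|RELA|SAB|\n",
    "C0000001|RO|C0000002|has_manifestation|OMIM|\n",
    "C0000001|RB|C0000003|isa|MONDO|\n",
]
print(has_manifestation_pairs(sample))
```

Looking columns up by header name rather than by fixed position keeps the sketch usable if the real file has more columns than the sample.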

Sources of relationships between disorders (hierarchies)

On a MedGen record, there may be a Term Hierarchy section that shows hierarchical relationships to other diseases. These can come from MedGen’s curated hierarchies (GTR table), MeSH, or Orphanet (note that this links to the Orphanet site). Curated hierarchies are managed by curators as needed to clarify relationships between disease records. They are not exhaustive of the known relationships between terms in MedGen. Users are encouraged to consult the additional relationship data in our FTP report MGREL.RRF.gz. Further disease relationship information can be found in Mondo or Orphanet (ORDO), and phenotype term hierarchies from HPO are available via the EMBL-EBI Ontology Lookup Service (OLS).

Sources of relationships between disorders and genes

MedGen displays related and associated genes for records based on reported gene-disease and gene-drug relationships from these sources:

  • OMIM
  • GeneReviews
  • PharmGKB*
  • Clinical Pharmacogenetics Implementation Consortium (CPIC)*
  • Medical Genetics Summaries*
  • Expert curation (MedGen, GTR, or ClinVar staff)^

*Note: these gene-disease or gene-drug relationships are limited to pharmacogenetic concepts.

^Expert curation to add gene-to-disease relationship information directly is limited. Curators will first contact one of the sources listed above to determine whether that resource can review and update its relationships based on newer literature or evidence.

Gene information such as the official gene symbol, cytogenetic location, and chromosomal location is pulled from NCBI’s Gene database. The Gene ID (used in NCBI’s Entrez search interface) is provided, along with a hyperlink to the human gene entry.

Automated processing of source updates

Figure 1 shows the automated (A) and manual curation (B) data processing for external data sources in MedGen.

Figure 1: MedGen data processing and curation overview

Monthly and Daily releases:

MedGen synchronizes with its external sources regularly, with each source having its own release timing and synchronization frequency (see Table 1). Generally, there are a few possible scenarios for records brought into the MedGen database, based on mapping of the source identifier (e.g., Mondo ID or MIM number) and of the preferred/primary name that the source provides for the record.

When new versions of source data become available, automated pipelines download and process the data into MedGen individually, making local copies (Figure 1A). The relevant data is then subset and existing terms are updated as needed (Table 2, scenarios B and C); any new terms are added (Table 2, scenario A). Depending on the data sources involved, mapping is done between the identifiers or concept names or both. Some sources provide identifier mappings between them and another data source while others do not. For example, Mondo and UMLS do not have identifier mappings between them. Reports for manual resolution are generated when there are conflicts (Table 2, scenario E). For the vast majority of data processed in each update, there is no change from the external source compared to the data already in MedGen (Table 2, scenario D).

Table 2: Potential automated data processing outcomes

| Scenario | Name Match? | ID Match? | Outcome |
|---|---|---|---|
| A | No | No | Store as a new record |
| B | Yes | No | Store ID on record matched by name |
| C | No | Yes | Store name on record matched by ID |
| D | Yes | Yes | No change is made; data is already properly mapped |
| E | Yes (record 1) | Yes (record 2) | Conflict in mapping is flagged for curatorial review |
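
The five outcomes in Table 2 amount to a small decision function over two lookups: which existing record, if any, matches the incoming name, and which matches the incoming identifier. The following is a minimal sketch of that logic, not MedGen's actual pipeline code. The function name is hypothetical, and the record keys passed in stand for whatever key identifies an existing MedGen record.

```python
from typing import Optional


def classify_update(name_match: Optional[str], id_match: Optional[str]) -> str:
    """Return the Table 2 scenario letter for one incoming source record.

    name_match: key of the existing record matched by name, or None.
    id_match:   key of the existing record matched by source ID, or None.
    """
    if name_match is None and id_match is None:
        return "A"  # no match at all: store as a new record
    if id_match is None:
        return "B"  # name matched only: store the ID on that record
    if name_match is None:
        return "C"  # ID matched only: store the name on that record
    if name_match == id_match:
        return "D"  # both point at the same record: no change
    return "E"      # name and ID point at different records: flag for review


# Hypothetical record keys, for illustration only.
print(classify_update(None, None))
print(classify_update("rec1", "rec2"))
```

Scenarios D and E both have a name match and an ID match; they differ only in whether the two matches land on the same record, which is why the function compares the two keys rather than two yes/no flags.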

The UMLS update process follows a slightly different logic, as there may be multiple "preferred" names from sources in UMLS and the mapped ID from UMLS is the CUI. Thus, the names and identifiers from sources like OMIM, HPO, and Orphanet that are processed by both MedGen and UMLS are used to match UMLS CUI to the appropriate record in MedGen. Names from other sources (such as GTR or ClinVar submissions or Mondo) are compared to UMLS record names to facilitate CUI assignment for those records. Similar to Scenario E above, a conflict in UMLS CUI mapping is reviewed and resolved by a curator (see below).

MedGen Curation

Conflicts in data mapping from the automated pipelines are reviewed by expert curators with extensive knowledge of medical genetics. Curators follow a standard decision tree to resolve data conflicts (Figure 1B). If the source data is unclear, curators contact the source to clarify the scope and meaning of the term. If MedGen terms are the source of the conflict, the terms are curated in MedGen to align with the source, either by splitting or merging the errant term or by creating a novel MedGen record to support the source term.

X-ref: Cross-reference from source to other sources/identifiers.

MedGen curators contribute to the open community discussion forums on GitHub for HPO and Mondo as a means of resolving potentially conflicting or unclear data from these sources. Communication with the OMIM, Orphanet, and NCI teams takes place via the publicly available forms or email addresses on their respective websites. MedGen collaborates directly with NIH resources such as GARD, UMLS, and GeneReviews when questions arise regarding terms from these resources. MedGen curators can also review and update the names (primary and synonyms), definitions, disease hierarchies, professional practice guidelines, suggested reading, linked review articles, and associated genes as needed to supplement automated processing. This level of review is limited and is used only when needed to fill gaps, resolve conflicts in automated processes, or clarify potentially confusing terminology. We welcome your feedback and suggestions. If you see information on a MedGen record that does not appear to be accurate, please contact the MedGen team.

Last Update: March 2025

Last updated: 2025-03-11T15:39:28Z