Overview of Different UMLS Data Files
by Nadeem Nazeer
In this post we give an overview of the data files included in the UMLS Metathesaurus and summarize information from two ncbi.nlm.nih.gov articles related to these files.
The data in a Metathesaurus entry may be represented in more than 20 different relations, or files.
These files fall broadly into the following groups:
- Files defining concepts, concept names, and their sources. The Metathesaurus is organized by concept. One of its primary purposes is to connect different names for the same concept from many different vocabularies, so the file defining concepts is at the core of the Metathesaurus.
- Concepts, Concept Names, and their sources = MRCONSO.RRF
- Files defining attributes of concepts. Attributes include every discrete piece of information about a concept, an atom, or a relationship that is not (1) part of the basic Metathesaurus concept structure or (2) distributed in one of the relationship files.
- Attributes = MRSAT.RRF, MRDEF.RRF, MRSTY.RRF, MRHIST.RRF
- Files defining relationships. The Metathesaurus includes many relationships between different concepts (in addition to the synonymous relationships in the Metathesaurus concept structure).
- Relationships = MRREL.RRF, MRCOC.RRF, MRCXT.RRF, MRHIER.RRF, MRMAP.RRF, MRSMAP.RRF
- Files containing data about the Metathesaurus. These files provide metadata, i.e., data about the Metathesaurus itself. The metadata files describe (1) characteristics of the current version of the Metathesaurus; (2) changes between the current version and the previous version; and (3) the history of concept identifiers (CUIs) from 1991 to the present.
- Data about the Metathesaurus = MRFILES.RRF, MRCOLS.RRF, MRDOC.RRF, MRRANK.RRF, MRSAB.RRF, AMBIGLUI.RRF, AMBIGSUI.RRF, CHANGE/MERGEDCUI.RRF, CHANGE/MERGEDLUI.RRF, CHANGE/DELETEDCUI.RRF, CHANGE/DELETEDLUI.RRF, CHANGE/DELETEDSUI.RRF, MRCUI.RRF
- Files defining Indexes. To assist system developers in building applications that retrieve all strings or concept names which include specific words or groups of words, three indexes to the concept names are provided: a Word Index, a Normalized Word Index (for English words only), and a Normalized String Index (for English strings only).
- Indexes = MRXW_BAQ.RRF, MRXW_DAN.RRF, MRXW_DUT.RRF, MRXW_ENG.RRF, MRXW_FIN.RRF, MRXW_FRE.RRF, MRXW_GER.RRF, MRXW_HEB.RRF, MRXW_HUN.RRF, MRXW_ITA.RRF, MRXW_NOR.RRF, MRXW_POR.RRF, MRXW_RUS.RRF, MRXW_SPA.RRF, MRXW_SWE.RRF, MRXNW_ENG.RRF, MRXNS_ENG.RRF
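To make the role of MRCONSO.RRF concrete, here is a minimal Python sketch that reads pipe-delimited rows and groups the English names of each concept by source. The column order in `MRCONSO_COLUMNS` is an assumption and should be checked against MRCOLS.RRF and MRFILES.RRF in your release. The helper names `read_rrf` and `names_by_concept` are hypothetical, and the sample rows (identifiers, source abbreviations, codes) are made up for illustration, not taken from real UMLS data.

```python
from collections import defaultdict

# Assumed column order for MRCONSO.RRF; verify against MRCOLS.RRF / MRFILES.RRF.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def read_rrf(lines, columns):
    """Yield one dict per row of a pipe-delimited RRF file."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Rows end with a trailing "|", which leaves one empty field at the end.
        if fields and fields[-1] == "":
            fields = fields[:-1]
        yield dict(zip(columns, fields))


def names_by_concept(rows, language="ENG"):
    """Map each CUI to the set of (source, name) pairs in one language."""
    names = defaultdict(set)
    for row in rows:
        if row["LAT"] == language:
            names[row["CUI"]].add((row["SAB"], row["STR"]))
    return dict(names)


# Made-up rows in the assumed MRCONSO layout (not real UMLS content).
SAMPLE = (
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC_A|PT|X1|Heart attack|0|N||\n"
    "C0000001|ENG|S|L0000002|PF|S0000002|Y|A0000002||||SRC_B|PT|Y9|Myocardial infarction|0|N||\n"
    "C0000001|SPA|P|L0000003|PF|S0000003|Y|A0000003||||SRC_C|PT|Z3|Infarto de miocardio|0|N||\n"
)

names = names_by_concept(read_rrf(SAMPLE.splitlines(), MRCONSO_COLUMNS))
for cui, pairs in names.items():
    print(cui, sorted(pairs))
```

With a real release you would pass an open file handle instead of `SAMPLE.splitlines()`. The full file is large, so streaming row by row as above is usually preferable to loading it all at once.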
When working with these data, keep in mind that the Metathesaurus is not fully normalized. Data are duplicated among different files and within certain files. In particular, relationships between different Metathesaurus concepts appear twice (e.g., from entry A to entry B and from entry B to entry A). Developers will need to make their own decisions about the extent to which this redundancy should be retained, reduced, or increased for their specific applications.
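As one way of reducing that redundancy, the sketch below keeps a single row from each mirrored pair of relationships. It is only an illustration under stated assumptions: relationships are simplified to `(cui1, rel, cui2)` tuples, the `INVERSE_REL` table is a small assumed subset of inverse relationship labels (check MRDOC.RRF for the values in your release), and `drop_mirrored` is a hypothetical helper name. Real MRREL.RRF rows carry more columns, which you may want to include in the comparison.

```python
# Assumed inverse pairs for relationship labels; verify against MRDOC.RRF.
INVERSE_REL = {
    "PAR": "CHD", "CHD": "PAR",
    "RB": "RN", "RN": "RB",
    "SY": "SY", "RO": "RO", "SIB": "SIB",
}


def drop_mirrored(relationships):
    """Keep the first row of each (A -> B, B -> A) pair.

    relationships: iterable of (cui1, rel, cui2) tuples.
    Rows whose label has no known inverse are always kept.
    """
    seen = set()
    kept = []
    for cui1, rel, cui2 in relationships:
        row = (cui1, rel, cui2)
        mirror = (cui2, INVERSE_REL.get(rel), cui1)
        if row in seen or mirror in seen:
            continue  # exact repeat, or the reverse direction was already kept
        seen.add(row)
        kept.append(row)
    return kept


# Made-up identifiers for illustration only.
rows = [
    ("C1", "PAR", "C2"),
    ("C2", "CHD", "C1"),  # mirror of the first row
    ("C1", "RO", "C3"),
    ("C3", "RO", "C1"),   # mirror of the third row
    ("C4", "XX", "C5"),   # unknown label, kept as-is
]
print(drop_mirrored(rows))
```

Whether dropping the mirrored rows is a good idea depends on the application: lookups from either end of a relationship are simpler when both directions are stored.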
One more thing: all files except MRRANK.RRF are sorted by row.