Filtering Out Noise from Healthcare Terms Using UMLS

by Nadeem Nazeer

While analyzing healthcare-related data, we may come across terms like “human”, “male”, and “female”, which are just noise within the context of our domain: they do not allow us to deduce the information needed for processes such as record linkage or document categorization. It is therefore important to filter them out. Doing so reduces the number of terms we need to process, which enables a much cleaner classification in later stages of data processing and mining.

To do this, we can use a UMLS file known as MRSTY, which contains semantic information for concepts. There is exactly one row in this file for each Semantic Type assigned to each concept. All Metathesaurus concepts have at least one entry in this file, and many have more than one.

Sample Record

C0001175|T047|B2.|Disease or Syndrome|AT17683839|3840|
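To make the record layout concrete, here is a minimal parsing sketch. It assumes the standard MRSTY.RRF column order (CUI, TUI, STN, STY, ATUI, CVF) and pipe-delimited fields with a trailing pipe; the class and function names are my own and purely illustrative.

```python
from typing import NamedTuple


class MrstyRow(NamedTuple):
    cui: str   # concept unique identifier
    tui: str   # semantic type identifier
    stn: str   # semantic type tree number
    sty: str   # semantic type name
    atui: str  # attribute identifier
    cvf: str   # content view flag


def parse_mrsty_line(line: str) -> MrstyRow:
    """Split one pipe-delimited MRSTY record into named fields."""
    fields = line.rstrip("\n").split("|")
    # The trailing pipe produces an extra empty element, so keep only six.
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    return MrstyRow(*fields[:6])


row = parse_mrsty_line("C0001175|T047|B2.|Disease or Syndrome|AT17683839|3840|")
print(row.cui, row.sty)
```

Only the CUI and the semantic type name are needed for the filtering described in this post; the other fields are kept so the row stays faithful to the file.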

Now let’s say we have a term like “Male” that appears across a group of documents. Such a term obviously tells us nothing informative about any particular doc (supposing we are not classifying based on gender).

If we look up this term’s semantic types, we find that ‘Male’ is categorized under the following:

  • Organism Attribute
  • Qualitative Concept
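A lookup like the one above could be sketched as follows. This is only an illustration: the rows use a made-up CUI (`C0000000`) and placeholder values for the other identifier fields, and the `term_to_cui` mapping is hypothetical (in practice the term-to-concept mapping would come from another UMLS file such as MRCONSO). Only the two semantic type names are taken from the result shown above.

```python
from collections import defaultdict


def build_semantic_index(lines):
    """Group MRSTY rows by CUI, collecting each concept's semantic type names."""
    index = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # skip malformed rows
        cui, sty = fields[0], fields[3]
        index[cui].add(sty)
    return index


def semantic_types_for(term, term_to_cui, index):
    """Return the sorted semantic type names for a term, or [] if unknown."""
    cui = term_to_cui.get(term.lower())
    if cui is None:
        return []
    return sorted(index.get(cui, set()))


# Placeholder rows: every identifier here is made up for illustration.
sample_rows = [
    "C0000000|T000|X|Organism Attribute|AT00000000||",
    "C0000000|T000|X|Qualitative Concept|AT00000001||",
]
term_to_cui = {"male": "C0000000"}  # hypothetical mapping

index = build_semantic_index(sample_rows)
print(semantic_types_for("Male", term_to_cui, index))
```

Because a concept can carry more than one semantic type, the index stores a set per CUI rather than a single value.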

Now, if we know that we are concerned with disease-related data, we may discard terms whose semantic type is anything other than ‘Disease or Syndrome’. Likewise, we may choose other types as needed (shown in the image below). We can also compute term frequencies from our docs: this tells us which terms have high counts, and knowing a frequent term’s semantic type may help us decide whether to select or deselect that type.
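The filtering and counting step might look like the sketch below. It assumes the documents are already tokenized into terms and that a term-to-semantic-types mapping has been built from UMLS beforehand; the `term_types` dictionary, the sample `docs`, and the function names are illustrative assumptions, not real output.

```python
from collections import Counter

# Illustrative mapping; in practice this would be derived from UMLS data.
term_types = {
    "male": {"Organism Attribute", "Qualitative Concept"},
    "asthma": {"Disease or Syndrome"},
    "diabetes mellitus": {"Disease or Syndrome"},
}

ALLOWED_TYPES = {"Disease or Syndrome"}


def keep_term(term, term_types, allowed):
    """Keep a term if at least one of its semantic types is allowed."""
    return bool(term_types.get(term.lower(), set()) & allowed)


def filtered_counts(docs, term_types, allowed):
    """Count how often each retained term occurs across all documents."""
    counts = Counter()
    for terms in docs:
        counts.update(t.lower() for t in terms if keep_term(t, term_types, allowed))
    return counts


# Hypothetical, already-tokenized documents.
docs = [["Male", "Asthma"], ["male", "Diabetes Mellitus", "asthma"]]
counts = filtered_counts(docs, term_types, ALLOWED_TYPES)
print(counts.most_common())
```

Terms with no known semantic type are dropped here; depending on the task, it may be safer to keep them and review them separately.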


Fig: MRSTY Tree


This way we may employ UMLS MRSTY data to filter out unwanted terms. Learn more about UMLS related stuff here.

For any queries, shoot me an email at nadeem@trialx.com or drop me a comment here. I will get back to you.
