Quick Guide to ETL of PubMed Data

by Applied Informatics

The PubMed dataset is one of the largest public healthcare data sets available. It comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites. Each year around 1 million new citations are added.

Doing ETL (Extraction, Load, transformation) process on PubMed data is quite different from the other open data sets such as clinical trials data from ClinicalTrials.gov, data on physician payments or NIH grants dataset that we have been working on at Applied.

In this blog I will describe the key steps in ETL of Pubmed and describe some of its unique challenges.


Challenge #1. Downloading XML files.

Before starting ETL you need to download the dataset from the PubMed servers using sftp process or by Entrez api process. Beware: this is a large dataset, so the mere size of it can be overwhelming. 

Challenge #2. Parsing large XMLs. 

As mentioned above PubMed requires a different way to manage the ETL process than say handling CT.gov data. It is because PubMed has large data files which require large amounts of space during parsing the xml files. Each file contains 30,000 records.

To take up this ETL task, firstly each file is zipped in a folder and then that folder is passed for extraction. During the extraction process, each file uncompresses it and starts process on it and then recompresses.

Secondly we optimized the XML parsing by using Python ElementTree module, which creates chunks of the file and then parses the chunk individually rather than creating a single file object.


In this step, certain rules are applied on the extracted data. The main aim of this step is to load the data to the target database in a cleaned and general format (depending on the organization’s requirement). Here we consider a  few things,

  1. Improving Data quality/Data enhancing/Data dedupe:
  • In Pubmed the records are deduped by selecting certain keys from the raw data. Like we have generated on [“NAME_FIRST”,”NAME_LAST”,”PUB_DATE”].
  • Regarding  iproving the data quality we are adding few things to data. Like:
  • Adding the MESH TOP LEVEL of the disease. This helps us during record linking process.
  • Adding the impact factor to journal.
  •  Adding full journal title to journals eg : “Monatsschr Unfallheilkd Versicherungsmed”: “Monatsschrift fur Unfallheilkunde und Versicherungsmedizin”

     2. Data scrubbing/Data Cleaning:

Here in this step the raw data is cleaned and is formatted in generalized format.

  • Date format into ISO format (yyyy-mm-dd).
  • Gender format into Male/Female format.
  • Splitting a column into multiple columns and vice versa.

For example the Name fields are added to get the Full name. Also the Affiliation key is parsed to Affiliation Name, Affiliation City, Affiliation State, Affiliation Country. 


In this step the extracted data and processed data is loaded into the mysql database. The LOV (list of values) data eg: mesh top level is stored as semi-colon separated in db. Here we index the database with record_id key to make data load more efficient. This is because during insertion of data we check whether this record is present or not, if present, it is not entered into db otherwise record is added to database.

During the whole process the main aim is to store data in an ordered and cleaned format so that it will be easy for further processing.

Leave a Reply

Your email address will not be published. Required fields are marked *

Data Science & PopHealth

Methods, tools, systems for healthcare data analysis

Contact us now

Popular Posts