Data Mining: Challenges in Data Cleaning

by Faizaan Yousuf

Data cleaning, or data scrubbing, is one of the major activities in the ETL process. It deals with detecting and removing errors and inconsistencies from data in order to improve data quality, and it plays a major role in data analysis and decision making. The problem becomes more complex when data comes from heterogeneous sources.

A data cleaning approach should satisfy several requirements. First of all, it should detect and remove all major errors and inconsistencies, both in individual data sources and when integrating multiple sources. The approach should be supported by tools that limit manual inspection and programming effort, and it should be extensible to easily cover additional sources. Furthermore, data cleaning should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata. Mapping functions for data cleaning and other data transformations should be specified in a declarative way and be reusable for other data sources as well as for query processing. For example, consider three data sources: CTGOV, PubMed, and NIHR.

"CTGOV" : {
"AFF_ADDRESS":["address_cleaning"],
"AFF_CITY": ["address_cleaning"],
"AFF_STATEABV": ["address_cleaning"],
"AFF_COUNTRY": ["address_cleaning"],
"START_DATE": ["format_date_mysql"]
},
"PUBMED": {
"DATE_CREATED": ["format_date_mysql"]
},
"NIHR": {
"PROGRAM_OFFICER_NAME": ["name_validate"],
"ORG_STATE": ["address_cleaning"],
"ORG_ZIPCODE":["phone_zip_cleaning"],
"AWARD_NOTICE_DATE":["format_date_mysql"],
"PROJECT_START":["format_date_mysql"],
"BUDGET_START":["format_date_mysql"]
}

In the above code snippet the same cleaning functions ("format_date_mysql", "address_cleaning") are mapped to fields from different sources, so each function is written once and reused.
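
To make this concrete, here is a minimal Python sketch of how such a mapping could drive the cleaning step. The function bodies and the inlined subset of the mapping are simplified placeholders for illustration, not the actual implementations used in our pipeline.

from datetime import datetime

# Placeholder cleaning functions; the real ones handle many more cases.
def address_cleaning(value):
    # collapse extra whitespace and normalize capitalization
    return " ".join(value.strip().title().split())

def format_date_mysql(value):
    # convert an assumed MM/DD/YYYY input to MySQL's YYYY-MM-DD
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")

CLEANERS = {
    "address_cleaning": address_cleaning,
    "format_date_mysql": format_date_mysql,
}

# A subset of the mapping shown above, inlined for the example.
MAPPING = {
    "CTGOV": {
        "AFF_CITY": ["address_cleaning"],
        "START_DATE": ["format_date_mysql"],
    },
    "PUBMED": {
        "DATE_CREATED": ["format_date_mysql"],
    },
}

def clean_record(source, record):
    # Apply every configured cleaning function to the matching fields.
    for field, steps in MAPPING.get(source, {}).items():
        if field in record:
            for step in steps:
                record[field] = CLEANERS[step](record[field])
    return record

print(clean_record("CTGOV", {"AFF_CITY": "  new   york ", "START_DATE": "08/13/2015"}))
# {'AFF_CITY': 'New York', 'START_DATE': '2015-08-13'}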

Data Cleaning Problems

As mentioned above, the data cleaning process gets more complex when data comes from heterogeneous sources. Here, data quality problems have to be solved by both data cleaning and data transformation.

[Table: example records from the different sources, with name (NF, NM, NL), address, and date fields in differing formats]

*NF: name first, NM: name middle, NL: name last.

In the above scenario, data cleaning can be done by standardizing the name, address, and date formats. In many other cases the name field arrives split into name first, name middle, and name last, or the date field is given as YY-MM-DD.
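
As a rough illustration (assumed helper names, not the project's actual code), a date normalizer can try a few of the formats seen in the sources, and a naive name splitter can break a full name into its NF/NM/NL parts:

from datetime import datetime

def format_date_mysql(value):
    # Normalize assorted source date formats to MySQL's YYYY-MM-DD.
    # The candidate formats below are illustrative, not exhaustive.
    for fmt in ("%y-%m-%d", "%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError("unrecognized date format: %r" % value)

def split_name(full_name):
    # Naive split into first / middle / last; real name data needs more care.
    parts = full_name.split()
    return parts[0], " ".join(parts[1:-1]), parts[-1]

print(format_date_mysql("15-08-13"))          # YY-MM-DD -> 2015-08-13
print(format_date_mysql("13 Aug 2015"))       #          -> 2015-08-13
print(split_name("John Fitzgerald Kennedy"))  # ('John', 'Fitzgerald', 'Kennedy')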

Data Cleaning Approaches

Data cleaning involves the following steps:

  • Data auditing/data analysis: In this step, sample data is taken from all the sources to detect errors and inconsistencies. This can also be done using statistical and database methods.
  • Definition of transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity, and the dirtiness of the data, a large number of data transformation and cleaning steps may have to be executed.
  • Verification: In this step, the correctness and effectiveness of the cleaned data are verified. This is done by taking data samples from each source, and the step can be run multiple times until the data is cleaned in the right way (a simple check along these lines is sketched after this list).
  • Transformation: In this step, the data is transformed by running the ETL process.
  • Backflow of cleaned data: After errors are removed, the cleaned data should also replace the dirty data in the original sources, in order to give legacy applications access to the improved data and to avoid redoing the cleaning work for future data extractions.
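
For example, the verification step can be approximated by spot-checking a random sample of cleaned records against simple rules. The field names and rules below are assumptions for illustration only, not the full rule set:

import random
import re

MYSQL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def verify_sample(records, sample_size=100):
    # Spot-check a random sample of cleaned records and collect violations.
    sample = random.sample(records, min(sample_size, len(records)))
    problems = []
    for rec in sample:
        if "START_DATE" in rec and not MYSQL_DATE.match(rec["START_DATE"]):
            problems.append(("START_DATE not in YYYY-MM-DD form", rec))
        if "AFF_COUNTRY" in rec and not rec["AFF_COUNTRY"].strip():
            problems.append(("AFF_COUNTRY is empty", rec))
    return problems

# The cleaning definitions are revised and rerun until verify_sample()
# comes back empty for every source.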

As all the data cleaning steps play a vital role in decision making, poor-quality data costs businesses vast amounts of money every year. Defective data leads to poor business decisions and inferior customer relationship management. Data is a core business asset that needs to be managed if an organization is to generate returns.
