How to Build a Config-Driven Architecture

by Shahid Ashraf

We have all used settings files to configure various aspects of an application, such as database passwords and file locations. We can extend this notion so that the configuration files drive which application code or modules get executed. This removes code duplication and makes execution more flexible. It also lets you encode business rules that must be applied consistently across your application. In this blog post, I will show how at Applied we build pipelines, applications, and APIs that leverage a config-driven architecture.

Step 1: Define the Configuration Files

Following is a config file we use in our ETL process, at the Extract step, to normalize the data. For each data source, it maps the source's fields (the values) onto a set of canonical key fields (the keys). On execution, the code loads this file and, based on the source, maps each incoming field to the key field defined as the value of the corresponding key.

{
  "DATA_FIELD_MAP_CORE": {
    "CTGOV": {
      "RECORD_ID": "RECORD_ID",
      "SOURCE": "",
      "SOURCE_ID": "NCT_IDENTIFIER",
      "TYPE": "",
      "FULL_NAME": "FULL_NAME",
      "NAME_FIRST": "NAME_FIRST",
      "NAME_MIDDLE": "NAME_MIDDLE",
      "NAME_LAST": "NAME_LAST",
      "PREFIX": "",
      "SUFFIX": "",
      "GENDER": "",
      "EMAILS": "EMAIL",
      "PHONE": "PHONE",
      "FAX": "",
      "ADDRESS": "AFF_ADDRESS",
      "AFF_TAG": "AFFILIATION_CANON",
      "INSERT_DATE": "INSERT_DATE"
    },
    "NPI": {
      "RECORD_ID": "RECORD_ID",
      "SOURCE": "",
      "SOURCE_ID": "SOURCE_ID",
      "TYPE": "TYPE",
      "FULL_NAME": "STD_FULL_NAME",
      "NAME_FIRST": "NAME_FIRST",
      "NAME_MIDDLE": "NAME_MIDDLE",
      "NAME_LAST": "NAME_LAST",
      "PREFIX": "PREFIX",
      "SUFFIX": "SUFFIX",
      "GENDER": "GENDER",
      "EMAILS": "",
      "PHONE": "PHONE",
      "FAX": "FAX",
      "ADDRESS": "ADDRESS",
      "AFF_TAG": "AFFILIATION_CANON",
      "INSERT_DATE": "INSERT_DATE"
    },
    "PUBMED": {
      "RECORD_ID": "RECORD_ID",
      "SOURCE": "",
      "SOURCE_ID": "PMID",
      "TYPE": "TYPE",
      "FULL_NAME": "STD_FULL_NAME",
      "NAME_FIRST": "NAME_FIRST",
      "NAME_MIDDLE": "",
      "NAME_LAST": "NAME_LAST",
      "PREFIX": "",
      "SUFFIX": "",
      "GENDER": "",
      "EMAILS": "AFF_EMAILS",
      "PHONE": "",
      "FAX": "",
      "ADDRESS": "AFF_ADDRESS",
      "AFF_TAG": "AFFILIATION_CANON",
      "INSERT_DATE": "INSERT_DATE"
    },
    ….
  }
}
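
To illustrate how this map is consumed, here is a minimal sketch of the normalization step. It is a simplified assumption of how our Extract code works, not the production implementation, and the file name data_field_map_core.json is hypothetical:

import json

def normalize_record(record, source, field_map):
    """ map a raw source record onto the canonical key fields """
    mapping = field_map["DATA_FIELD_MAP_CORE"][source.upper()]
    normalized = {}
    for key_field, source_field in mapping.items():
        # an empty string means the source has no equivalent field
        normalized[key_field] = record.get(source_field, "") if source_field else ""
    return normalized

with open("data_field_map_core.json") as f:
    field_map = json.load(f)

raw = {"NCT_IDENTIFIER": "NCT01234567", "FULL_NAME": "Jane Doe"}
print(normalize_record(raw, "ctgov", field_map))

For the CTGOV map above, this yields a record whose SOURCE_ID is "NCT01234567" and FULL_NAME is "Jane Doe", with the unmapped key fields left empty.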

We have defined another configuration file for the Transform step. This one is particularly interesting: since different types of data need to be transformed or cleaned in different ways, it maps each data field to the list of functions that should be called on it.
{
  "clean_module_map": {
    "CTGOV": {
      "NCT_IDENTIFIER": ["address_cleaning", "clean_string"],
      "AFF_ADDRESS": ["address_cleaning"],
      "AFF_CITY": ["address_cleaning"],
      "AFF_STATEABV": ["address_cleaning"],
      "AFF_COUNTRY": ["address_cleaning"],
      "START_DATE": ["format_date_mysql"],
      "FULL_NAME": ["general_cleaning"]
    },
    "PUBMED": {
      "DATE_CREATED": ["format_date_mysql"],
      "NAME_FIRST": ["general_cleaning"],
      "NAME_LAST": ["general_cleaning"]
    },
    "NPI": {
      "DEGREES": [],
      "STREET_ADDRESS1": ["address_cleaning"],
      "STREET_ADDRESS2": ["address_cleaning"],
      "CITY": ["address_cleaning"],
      "STATE": ["address_cleaning"],
      "COUNTRY": ["address_cleaning"]
    }
  }
}

As shown here, each key is a field in the data for a given source, and its value is a list of function names; each of those functions is executed with the field's value passed as an argument.

Step 2: Validate the Config Files
This step involves checking that the config files use the same main keys across all sources, and that every key they reference actually exists in the data records.
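
A minimal validation sketch along these lines might look as follows; the function name and the idea of passing one sample record per source are assumptions for illustration:

import json

def validate_field_map(path, sample_records):
    """ check that all sources share the same main keys and that the
    mapped source fields actually appear in sample data records """
    with open(path) as f:
        field_map = json.load(f)["DATA_FIELD_MAP_CORE"]

    # all sources must use the same main keys
    sources = list(field_map)
    expected = set(field_map[sources[0]])
    for source in sources[1:]:
        if set(field_map[source]) != expected:
            raise ValueError("%s does not use the standard main keys" % source)

    # every mapped source field must exist in the sample data records
    for source, record in sample_records.items():
        for source_field in field_map[source].values():
            if source_field and source_field not in record:
                raise ValueError("%s: %s not found in data records" % (source, source_field))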

Step 3: Build the Code
In the following sample code we invoke the cleaning functions based on the config described in Step 1.

import logging
import simplejson

from lib import GenCleaner

class clean_fields(GenCleaner):
    """ cleans the fields of a record based on the configured source """
    def __init__(self, **kwargs):
        self.source = kwargs.get("source", None)
        with open(kwargs["config_file"]) as json_data:
            self.rmap = simplejson.load(json_data)
        self.clean_map = self.rmap['clean_module_map'][self.source.upper()]
        logging.debug("CLEAN MAP %s", self.clean_map)

    def do_magic(self, jsonl):
        """ calls the configured cleaning functions on each field """
        for key, funcs in self.clean_map.items():
            if key in jsonl:
                for func in funcs:
                    try:
                        # look up the cleaning function by name and apply it
                        jsonl[key] = getattr(self, func)(jsonl[key])
                    except AttributeError:
                        logging.error("Method %s not implemented", func)
        return jsonl

    def run(self, records):
        """ runs the cleaning pipeline over an iterable of records """
        logging.debug("RUNNING CLEAN PIPELINE FOR %s", self.source)
        for jsonl in records:
            yield self.do_magic(jsonl)

if __name__ == '__main__':
    data_list = []
    c = clean_fields(source="ctgov", config_file="clean_module_map.json")
    for record in c.run(data_list):
        print(record)

This snippet first reads the config file “clean_module_map.json”, then iterates through the data list and, for each data field/key, calls the functions defined for it in the config file. The cleaning functions themselves are defined in the GenCleaner library class, which clean_fields extends.

Rules to Live by:

All of us are familiar with the following rules; in configuration-driven development we apply them a little differently.

1. Keep it simple
The configuration files must be easy to understand and to evolve. This is why we recommend JSON files over XML files.

2. Evolve as required
No predefined layout is going to fit every developer’s needs, so adapt your JSON layout as your needs change. Depending on the domain or the software architecture, the JSON attributes used in class or field definitions can vary a lot. Also, try to avoid deeply nested, multi-level JSON configuration.
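
For example, a shallow layout like the first snippet below is easier to read and validate than a deeply nested one; both snippets are illustrative, not taken from our configs:

{
  "CTGOV": { "FULL_NAME": ["general_cleaning"] }
}

{
  "sources": { "CTGOV": { "fields": { "FULL_NAME": { "cleaners": ["general_cleaning"] } } } }
}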

3. Validate early and often
Common sense dictates that errors caught earlier in the development process are the cheapest to resolve. Following this principle, it makes sense to validate your configuration as early and as extensively as possible. In particular, validate that the function names in the config files are correct, and use unit tests to check that the config files call the right functions and that there are no issues with paths.
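
For instance, a minimal unit test along these lines could assert that every function named in the config is actually implemented on GenCleaner; the test scaffolding here is an assumption, not from our codebase:

import json
import unittest

from lib import GenCleaner

class TestCleanConfig(unittest.TestCase):
    def test_configured_functions_exist(self):
        # every function named in the config must exist on GenCleaner
        with open("clean_module_map.json") as f:
            clean_map = json.load(f)["clean_module_map"]
        for source, fields in clean_map.items():
            for field, funcs in fields.items():
                for func in funcs:
                    self.assertTrue(
                        hasattr(GenCleaner, func),
                        "%s.%s references missing function %s" % (source, field, func),
                    )

if __name__ == '__main__':
    unittest.main()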

In this article I’ve proposed a simple and efficient way to achieve a functional and successful configuration-driven development process. Drop me a line in the comment section if you have any queries.
