Airflow, Meta Data Engineering, and a Data Platform for the World’s Largest Democracy

I originally wrote this post for the SocialCops engineering blog, and then published it on Hacker Noon.


In our last post on Apache Airflow, we mentioned how it has taken the data engineering ecosystem by storm. We also talked about how we’ve been using it to move data across our internal systems and explained the steps we took to create an internal workflow. The ETL workflow (e)xtracted PDFs from a website, (t)ransformed them into CSVs and (l)oaded the CSVs into a store. We also touched briefly on the breadth of ETL use cases you can solve for, using the Airflow platform.

In this post, we will talk about how one of Airflow’s principles, of being ‘Dynamic’, offers configuration-as-code as a powerful construct to automate workflow generation. We’ll also talk about how that helped us use Airflow to power DISHA, a national data platform where Indian MPs and MLAs monitor the progress of 42 national level schemes. In the end, we will discuss briefly some of our reflections from the project on today’s public data technology.

Why Airflow?

To recap from the previous post, Airflow is a workflow management platform created by Maxime Beauchemin at Airbnb. We have been using Airflow to set up batching data workflows in production for more than a year, during which we have found the following points, some of which are also its core principles, to be very useful.

An Airflow DAG

Furthermore, Airflow comes with a web interface that gives you all the context you need about your workflow’s execution, from each task’s state (running, success, failed, etc.) to logs that the task generated!

The problem with static code

Here at SocialCops, we’ve observed a recurring use case of extracting data from various systems using web services, as a component of our ETL workflows. One of the ways to go forward with this task is to write Python code, which can be used with the PythonOperator to integrate the data into a workflow. Let’s look at a very rudimentary DAG file that illustrates this.