GII Data Pipeline

By Jack Gregory in R, SQL, AWS, Innovation, WIPO

February 8, 2021

Objective

To transition the Global Innovation Index (GII) from a disjointed data collection and quality control process to an industry-standard, cloud-based pipeline that is collaborative, integrated, reproducible, standardized, secure and documented.

See related workshop here.

Description

The high-level workflow of the GII can be grouped into three phases:

  1. Data;
  2. Analysis; and,
  3. Output.

Figure 1 outlines these phases along with their constituent steps.

Figure 1: High-level GII workflow

This project focuses on the Data phase, where we collect, clean and audit all data necessary for the construction of the Index. Some of the Output steps are covered as separate projects, including the Profiles and Briefs.

WIPO and the GII team had historically utilized an ad hoc system for preparing the annual GII rankings and report. The Index relied on a disparate set of MS Excel workbooks, along with non-standardized data collection and cleaning processes scattered across Stata, R and Excel. This exposed the report to serious risks with respect to the accuracy and reproducibility of the associated analyses. For a product with the global reach and influence of the GII, this situation was untenable.

The natural solution involved transitioning to a data pipeline to collect, transform, and store the necessary data. Once the pipeline is functional, the data can be surfaced to collaborators for a variety of data projects, including those listed under the Analysis and Output phases in Figure 1.

The GII Data Pipeline is undergirded by its GitHub repository and by the GII Database (GIIDB). The database is built in MariaDB using R and hosted in the cloud on AWS Relational Database Service (RDS); it provides the “single source of truth” for all input and output data associated with the GII. The Pipeline relies on batch processing, whereby “batches” of data are piped into storage at set time intervals. For the GII, this process occurs once per report cycle (i.e., per year).
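For illustration, here is a minimal sketch of what a batch connection to the GIIDB might look like from R, using the DBI and RMariaDB packages. The endpoint, database name, and environment variables are hypothetical placeholders rather than the project’s actual configuration:

```r
# Sketch: connect to the GIIDB, a MariaDB instance hosted on AWS RDS.
# All connection details below are illustrative placeholders.
library(DBI)
library(RMariaDB)

giidb <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "giidb.xxxxxxxx.eu-west-1.rds.amazonaws.com",  # hypothetical endpoint
  dbname   = "giidb",
  username = Sys.getenv("GIIDB_USER"),      # credentials kept in environment variables
  password = Sys.getenv("GIIDB_PASSWORD")
)

# Example batch write: append one report cycle's cleaned data to an indicator table.
# dbWriteTable(giidb, "indicator_data", clean_data, append = TRUE)

dbDisconnect(giidb)
```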

The Pipeline consists of three main steps: collection, cleaning, and audit. Collection and cleaning are performed concurrently. Data for the GII is sourced from a wide variety of places (APIs, databases, flat files, etc.), but such data is rarely ready for immediate use. First, we collect the raw data in its native format (e.g., csv or xlsx) and store it in a data lake in the cloud. Next, we clean the data and prepare it for ingestion into the GIIDB. Collection and cleaning are performed through a series of R scripts and Rmarkdown documents.
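As a stylized example, the collect-then-clean pattern might look like the following, assuming an S3 bucket serves as the data lake and using the aws.s3, readr, and dplyr packages. The source URL, bucket, object key, and column names are all illustrative:

```r
# Sketch: collect a raw file in its native format, archive it in the data lake,
# then clean it for ingestion. URL, bucket, and columns are illustrative only.
library(aws.s3)
library(readr)
library(dplyr)

# 1. Collection: download the raw data and store it untouched in the lake.
raw_file <- "indicator_raw.csv"
download.file("https://example.org/data/indicator.csv", raw_file, mode = "wb")
put_object(file = raw_file, object = "raw/2021/indicator_raw.csv",
           bucket = "gii-data-lake")

# 2. Cleaning: standardize names, types, and keys ahead of the GIIDB.
clean_data <- read_csv(raw_file) %>%
  rename(iso3 = country_code, value = obs_value) %>%
  mutate(year = as.integer(year)) %>%
  filter(!is.na(value))
```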

The audit represents the quality assurance and control step in the Pipeline. It involves describing the data for a particular indicator, as well as identifying potential outliers and missing data points. The process is standardized across all indicators and is implemented through a series of internal R Shiny dashboards. At this point, the data is considered “model-ready” and provides the foundation for our composite indicator model as well as a range of data outputs.
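For instance, a simple audit check might flag outliers and missing values per indicator along these lines; the |z| > 3 rule and the data frame shape are illustrative assumptions, not the GII’s actual methodology:

```r
# Sketch: flag potential outliers and missing observations for one indicator.
# The |z| > 3 rule is an illustrative choice, not the GII's actual rule.
library(dplyr)

audit_indicator <- function(df) {
  df %>%
    mutate(
      z_score    = (value - mean(value, na.rm = TRUE)) / sd(value, na.rm = TRUE),
      is_outlier = !is.na(value) & abs(z_score) > 3,
      is_missing = is.na(value)
    )
}

# Summarize the audit, e.g., for display in an internal dashboard.
# audit_indicator(clean_data) %>%
#   summarise(n_outliers = sum(is_outlier), n_missing = sum(is_missing))
```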
