Renzo Frigato, Data Engineer
Pipeline jungles are a common source of technical debt in advanced Machine Learning systems, and at Lumiata, with our fast iterations on model evolution, we were not exempt from this curse.
In this blog post, I want to share what we did at Lumiata and why it is important in this relatively new field to try to follow good engineering design principles. Our example will focus on how the Dependency Inversion Principle could be applied to ML Pipelines to control technical debt. If you want to go right into the code, this is the Github repository we created to show a simplified version of our Airflow infrastructure.
The Lumiata pipeline that creates features and generates predictions is for the most part owned and managed independently by the Data Science team. This allows them to quickly experiment and iterate over parameter tuning, new features and new models.
This supports the principle that Lumiata’s researchers should have a high degree of freedom. However, when experimenting and iterating, many problems can occur: experiments might not be reproducible, code may be untested and break when someone other than its author runs it, and so on.
In particular, as the Lumiata Data Science team was growing fast, it was hard to unify contributions from multiple people and run the full prediction pipeline in an automated and reproducible way. There were notable efforts within the Data Science team to build a single configuration that would handle data and model management within the pipeline. But was that enough to get out of the pipeline jungle debt?
Very early, when the Data Science team was still relatively small, we chose Airflow as the tool to orchestrate our pipelines. With its Sensors, Operators, Tasks and Scheduler, it looked like the Swiss Army Knife we were looking for. We compared it to other pipeline orchestration tools and found Airflow superior:
- Operators, the Airflow work unit, are very extensible
- DAGs are easy to define and build
- the UI helps in visualizing and debugging jobs
- the open source community and user base are very rich and active
However, any tool, especially one that looks like a Swiss Army Knife, can get you into trouble if used incorrectly: you don’t want to use the bottle opener for cutting!
With Airflow, it’s very easy to add any type of computational job to your pipeline: a simple bash script, a Spark job on your local machine or on your cloud provider, a data transfer and many more. We realized that this could lead to a very wild pipeline, and even if Airflow helps in visualizing how the different pieces are connected, that is not enough to maintain strict engineering principles.
As stated in this Google paper¹: “Managing these pipelines, detecting errors and recovering from failures are all difficult and costly. Testing such pipelines often requires expensive end-to-end integration tests.” As described here², pipeline jungles can arise from a violation of one of the SOLID engineering principles, Dependency Inversion.
The Dependency Inversion Principle (DIP)
The definition of the DIP from its Wikipedia page can look quite obscure:
A. High-level modules should not depend on low-level modules. Both should depend on abstractions.
B. Abstractions should not depend on details. Details should depend on abstractions.
A simple real-life example could be one in which the high-level module is a car and the low-level module its engine. The DIP says that the car shouldn’t depend on a particular engine model. Instead, there should be an abstract definition, or agreement, of how an engine is plugged into the car. This way there is no need to change the design of a car to accommodate a particular engine, or vice versa to change the internal details of an engine to fit some car model.
Implementing the DIP allows the development of a new car or engine without having to think about the integration with the other. As long as the engine-car abstract requirements are respected, the two components can be developed independently.
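The car/engine agreement above can be sketched in Python as an abstract base class that both sides depend on. The class and method names here are illustrative, invented for this example:

```python
from abc import ABC, abstractmethod

class Engine(ABC):
    """The abstraction both the car and concrete engines depend on."""

    @abstractmethod
    def start(self) -> str:
        ...

class V8Engine(Engine):
    def start(self) -> str:
        return "V8 roaring"

class ElectricEngine(Engine):
    def start(self) -> str:
        return "silent hum"

class Car:
    """High-level module: depends only on the Engine abstraction."""

    def __init__(self, engine: Engine):
        self.engine = engine

    def drive(self) -> str:
        return f"Driving with {self.engine.start()}"

# Any engine respecting the interface can be plugged in,
# without changing a single line of Car:
print(Car(V8Engine()).drive())        # Driving with V8 roaring
print(Car(ElectricEngine()).drive())  # Driving with silent hum
```

Note that the dependency arrow is inverted: `V8Engine` and `ElectricEngine` point at the `Engine` abstraction, not the other way around, so new engines can be added without touching `Car`.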
Inverting the Dependency in ML Pipelines
In object-oriented programming, the DIP can be implemented in several ways. For example, the low-level module can be abstracted behind an interface that defines how the high-level module interacts with it; the low-level module must then respect this interface to be called successfully by the high-level module.
Our ML pipelines infrastructure, of which you can find a simplified version in this Github repository, follows a similar approach: pipelines, the high-level modules, would depend on abstractions of the steps, the low-level modules, that need to get executed.
In more detail:
- Each step is abstracted as a script reading one or more input datasets and writing to one or more output datasets. To generate a representation of this abstraction, each step has a “dry-run” mode that returns input and output storage paths.
- Each pipeline is a DAG (Directed Acyclic Graph) of low-level steps, in which edges correspond to the steps’ direct dependencies.
In this way the details of each step are independent of the full pipeline, as long as the step receives its inputs and produces its outputs. Abstracting each step allows us, as a bonus, to compute the DAG dependencies and generate the pipeline automatically, as each step depends directly only on the steps that produce its required inputs as shown in the Airflow diagram below.
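A minimal sketch of this idea (the names here are illustrative, not the actual Lumiata code): each step’s dry-run reports its input and output storage paths, and the pipeline edges are computed by matching outputs to inputs:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """A low-level module abstracted as: read inputs, write outputs."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    def dry_run(self) -> dict:
        # In "dry-run" mode a step only reports its storage paths,
        # without executing any computation.
        return {"inputs": self.inputs, "outputs": self.outputs}

def build_dag(steps):
    """Derive edges automatically: a step depends on whichever
    step produces one of its required inputs."""
    producer = {out: s.name for s in steps for out in s.outputs}
    return {
        s.name: sorted({producer[i] for i in s.dry_run()["inputs"] if i in producer})
        for s in steps
    }

# The toy "create"/"copy" pipeline expressed with this abstraction:
steps = [
    Step("create", outputs=["data/file.txt"]),
    Step("copy", inputs=["data/file.txt"], outputs=["data/file_copy.txt"]),
]
print(build_dag(steps))  # {'create': [], 'copy': ['create']}
```

The high-level pipeline never sees what a step does internally; it only sees the declared paths, so steps can be rewritten freely as long as their inputs and outputs stay the same.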
Three Easy Pieces
In the Github repository, we created three examples to demonstrate the DIP Airflow implementation.
The first one is just a toy example with two scripts: one creating a file (“create”) and another copying it (“copy”). The only dependency we have is from the “create” to the “copy” step.
The second example is a small machine learning pipeline for sentiment analysis. There are 4 tasks in its DAG:
- Load training and test data
- Get informative terms for sentiment analysis
- Train the embedding on sentiment analysis
- Evaluate the performance of the embedding
The third example extends the previous one, comparing the performance of multiple embeddings with different hyperparameters and informative term sets:
We have trained 4 different models, tuning the embedding dimension and the informative-term set (default has 50 terms; all_informative_terms is the full 30,716-term vocabulary of the documents).
A final task compares the 4 models on the test set and returns the model with the best accuracy.
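The comparison task boils down to picking the maximum test accuracy across models. A hedged sketch, where the model names and accuracy numbers are made up for illustration and are not the repository’s actual results:

```python
# Hypothetical evaluation results for the 4 trained embeddings
# (names and accuracies are illustrative only).
results = {
    "dim2_default_terms": 0.78,
    "dim4_default_terms": 0.80,
    "dim2_all_informative_terms": 0.85,
    "dim4_all_informative_terms": 0.87,
}

def best_model(accuracies: dict) -> tuple:
    """Return the (name, accuracy) pair with the highest test accuracy."""
    return max(accuracies.items(), key=lambda kv: kv[1])

name, acc = best_model(results)
print(f"Best model: {name} (accuracy={acc:.2f})")
```

Because the comparison step only reads the evaluation outputs of the training steps, it too fits the same input/output abstraction as every other step in the pipeline.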
The main takeaways are the following:
- Machine learning pipelines are hard to maintain. This is well known and documented in this paper¹.
- Design principles look very abstract, but are very powerful when applied concretely to software systems. Our simple example showed how the DIP applies to fitting engines to cars, but it has countless other applications.
- The DIP can also be applied to Machine Learning systems to avoid pipeline jungle issues. Check our Github repo for a concrete example, inspired by our Airflow infrastructure.
If this post caught your interest and you would like to challenge yourself with similar problems, we are hiring! Please check our open positions.
Find us on LinkedIn: www.linkedin.com/company/lumiata
1. D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary and Michael Young: “Machine Learning: The High Interest Credit Card of Technical Debt”, SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)
2. Matthew Kirk: “Probably Approximately Correct Software”, Thoughtful Machine Learning with Python, Chapter 1