Last week, we discussed the rise of Apache Airflow as the de facto orchestrator for data science and machine learning and its critical shortcomings: It ignores data, exposes (and exacerbates) infrastructure complexity, and deals poorly with quickly evolving pipelines.
We’ve been focused on solving these problems with Aqueduct; our goal is to make running machine learning in the cloud simple. However, many data teams have already spent time and resources setting up Airflow, so ripping and replacing this infrastructure can be extremely costly. In this context, we’ve been thinking about how to get the best of both worlds.
We’ve built an Aqueduct integration with Airflow that allows you to define, deploy, and monitor a workflow in Aqueduct and use your Airflow cluster as the execution engine. In this post, we’ll use a simple prediction pipeline as a running example to show you:
A typical ML workflow can have many steps: data retrieval, cleaning, featurization, inference, post-processing, publishing, etc. Each step may have different resource requirements, and more complicated pipelines will have parallel operators. Both Aqueduct and Airflow support any DAG structure, so to keep our diagrams & code simple, we’re going to abstract the intermediary stages into a single box. Our simplified pipeline has three stages: extracting data from the warehouse, running a model, and publishing the predictions:
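To make the three-stage structure concrete, here is a minimal sketch of the pipeline in plain Python. All of the names and data here are hypothetical stand-ins (a list of dicts in place of a real warehouse, a trivial threshold in place of a real model), not the Aqueduct or Airflow API; it only illustrates the linear extract → predict → publish DAG described above.

```python
# Hypothetical stand-ins for the three pipeline stages.

def extract(warehouse):
    # Stage 1: pull usable rows from the "warehouse"
    # (here, just filter out rows flagged as invalid).
    return [row for row in warehouse if row.get("valid", True)]

def predict(rows):
    # Stage 2: run the "model" over the extracted rows.
    # A trivial threshold stands in for real inference.
    return [{**row, "prediction": row["feature"] > 0.5} for row in rows]

def publish(predictions, sink):
    # Stage 3: write predictions out to a destination table.
    sink.extend(predictions)
    return sink

# Wiring the stages together mirrors the linear DAG above.
warehouse = [
    {"feature": 0.9},
    {"feature": 0.1, "valid": False},
    {"feature": 0.7},
]
sink = []
publish(predict(extract(warehouse)), sink)
```

In both Aqueduct and Airflow, each of these functions would become an operator (or task) in the DAG, and the orchestrator, rather than a direct function call, would handle passing data between them.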