The pipeline is the spine of a machine learning org. It is how an organization goes from “collection of one-off data science projects” to a production machine learning system.
Knowing how to build pipelines is, therefore, pretty important.
In this guide, we’re going to introduce two tools for building a machine learning pipeline that can (re)train and deploy models. We’re also going to take a slightly different approach to discussing pipeline building than you commonly see in guides.
A generalizable approach to building pipelines
If you’ve read articles on pipeline building before, you’ve no doubt seen a lot of diagrams with 80 different icons all labeled with jargon you half-recognize. This isn’t a critique of those guides—pipeline building, at a granular level, can get complex—but it highlights a difference in approach.
In this guide, I’m going to assume that you know the specifics of your machine learning system better than me. I will make no assumptions about the operations involved in your data ingestion, the kinds of predictions you are serving, or your broader infrastructure. The only two things I will assume are:
- You need to train models.
- You need to deploy them to production.
As such, this guide is going to introduce two tools that can be used in concert to accomplish both of those tasks regardless of your specific needs. In fact, if you need to perform an operation in your training/deployment pipeline that cannot be done while using these tools, I would be really interested in hearing about it.
Let’s start by defining our pipeline.
Defining a reproducible machine learning pipeline
A production pipeline needs to be reproducible and transparent. Given identical inputs, it should produce identical outputs, and it should be easy for an engineer to explain what happens at each stage of the workflow and debug issues. This is why software deployment pipelines are often built with tools like Apache Airflow, or use explicit configuration files (e.g., YAML manifests).
In an ideal world, our interface for defining a machine learning pipeline would also be flexible, highly usable for machine learning engineers and data scientists, version-controlled, and debuggable.
Enter Metaflow.
Metaflow is a data science framework developed at Netflix that enables developers to define machine learning pipelines as dataflows, with a series of steps (each representing a transformation of the data):
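(A minimal sketch; the flow name, step names, and stand-in logic are placeholders of my own, not anything Metaflow prescribes.)

```python
from metaflow import FlowSpec, step


class TrainingFlow(FlowSpec):

    @step
    def start(self):
        # Stand-in for your own data ingestion; replace with whatever
        # your system actually does here.
        self.raw_data = [(0.1, 0), (0.9, 1), (0.8, 1), (0.2, 0)]
        self.next(self.train)

    @step
    def train(self):
        # Stand-in for real training. Anything assigned to self becomes
        # a versioned Metaflow artifact.
        threshold = sum(x for x, _ in self.raw_data) / len(self.raw_data)
        self.model = {"threshold": threshold}
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)


if __name__ == "__main__":
    TrainingFlow()
```

Running `python training_flow.py run` executes the steps in order and records the run and its artifacts.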
There’s a lot to love about Metaflow (I’d recommend digging into the details via their documentation), but at a high level, it achieves a lot of things we want:
- Defines a pipeline as a series of clear, atomic actions while allowing for lots of flexibility and complexity (branching logic, parallelism, etc.)
- Uses a Python interface, which is a big deal given that Python is the language most ML practitioners are familiar with.
- Provides a powerful client for versioning and analytics out of the box.
That last point is especially important. Metaflow’s client API automatically versions and records all artifacts produced or consumed in a single execution of a workflow. By importing the client, you can look up the dataset processed by a given run, or the model it produced, with a few lines of Python:
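(This assumes the `TrainingFlow` sketched above; the artifact names are the placeholders from that sketch.)

```python
from metaflow import Flow

# The most recent successful execution of the flow
run = Flow("TrainingFlow").latest_successful_run

# Any artifact assigned to self inside the flow is available on run.data
model = run.data.model
dataset = run.data.raw_data
```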
Because of Metaflow’s simplicity, you can use any importable Python libraries you want in a flow, or integrate with any third party client. For example, if you want to write your training code using a more readable, manageable library—like PyTorch Lightning instead of PyTorch—you can do that.
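As a rough sketch, a Lightning-based training step is just ordinary Python inside a flow. Here, `LitClassifier` and `train_dataloader` are assumed to live in your own (hypothetical) project module:

```python
from metaflow import FlowSpec, step


class LightningTrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @step
    def train(self):
        # PyTorch Lightning is just another importable library inside a step.
        import pytorch_lightning as pl
        from my_project import LitClassifier, train_dataloader  # hypothetical module

        model = LitClassifier()
        trainer = pl.Trainer(max_epochs=5)
        trainer.fit(model, train_dataloader())

        # LightningModules are torch modules, so the weights can be stored
        # as a versioned Metaflow artifact.
        self.model_weights = model.state_dict()
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    LightningTrainingFlow()
```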
Deploying models at production scale
At some point, we will want to deploy our models to production. Actually serving a model at production scale, however, requires quite a bit of infrastructure work. To build deployment into our pipeline, we need something that will:
- Handle the deployment process, from packaging and containerizing our prediction serving API to deploying it to our cloud.
- Automate our cloud infrastructure. This means implementing things like load balancing, autoscaling, support for GPU/ASIC instances, etc.
- Provide a smooth interface for the whole process, one that can be used productively by data scientists, ML engineers, and platform engineers.
I’m obviously biased here, but this is exactly what Cortex is built to do.
With Cortex, we can automate all of our cloud infrastructure for inference by spinning up a Cortex cluster on AWS or GCP:
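(The configuration fields and exact CLI invocation below are illustrative; they vary by Cortex version and cloud provider, so check the docs for yours.)

```yaml
# cluster.yaml -- illustrative cluster configuration
cluster_name: ml-pipeline
region: us-west-2
instance_type: m5.large
min_instances: 1
max_instances: 5
```

```bash
# Point the CLI at the configuration above; the exact flag/argument
# form depends on your Cortex version.
$ cortex cluster up
```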
Once live, we can define our model serving API as a simple Python function and deploy it:
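(A sketch: the `predict` function does the serving work, wrapped in Cortex’s class-based predictor interface. The spec fields, environment name, and client calls reflect the Cortex Python client at the time of writing, so treat them as assumptions and check the docs for your version.)

```python
import cortex


class PythonPredictor:
    def __init__(self, config):
        # Load the model produced by the training flow; pulling it from
        # Metaflow's client is one option (see the example above).
        from metaflow import Flow

        self.model = Flow("TrainingFlow").latest_successful_run.data.model

    def predict(self, payload):
        # payload is the parsed JSON request body; the field name is illustrative
        return float(payload["value"] > self.model["threshold"])


cx = cortex.client("aws")  # environment name is illustrative

cx.create_api(
    api_spec={"name": "classifier", "kind": "RealtimeAPI"},
    predictor=PythonPredictor,
    requirements=["metaflow"],  # illustrative dependency list
)
```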
On deploy, Cortex packages and containerizes our predictor, deploys it as a web service on our cluster, and configures load balancing, autoscaling, log streaming, and more.
In the above example, we used Cortex's Python client to define our API spec and deploy our model, mostly to give a sense of how a Cortex deployment could be triggered within Metaflow. However, for engineers who prefer to handle deployments outside of Python, Cortex can also use YAML manifests to define API specs:
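(An illustrative spec; field names follow recent Cortex versions and may differ in yours.)

```yaml
# cortex.yaml -- illustrative API spec
- name: classifier
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
  compute:
    cpu: 1
    mem: 2G
  autoscaling:
    min_replicas: 1
    max_replicas: 10
```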
I’ve included a small sample of the available configuration fields here. For more information, see the docs.
Once configured, the API can also be deployed via the Cortex CLI:
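(The exact invocation may differ slightly across Cortex versions.)

```bash
# Deploy the APIs defined in cortex.yaml
$ cortex deploy cortex.yaml

# Check on the deployment
$ cortex get classifier
```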
Because Cortex allows us to log predictions and set any configuration variables we want, we can easily connect deployments to training flows in Metaflow and audit our entire pipeline.
The importance of configurability in machine learning pipelines
Cortex and Metaflow both emphasize a very open interface. Within a Metaflow step or a Cortex predictor, you can import any 3rd party library you’d like, query any other web service, or load any kind of data you need. Flows and deployments can also be triggered in any number of ways—as part of a CI/CD pipeline, by a button on a dashboard, via a CLI, or even from a notebook.
While configurability is in general a good thing, it is even more important in machine learning.
First, as we alluded to earlier, a machine learning pipeline is going to involve multiple people in a variety of roles, all of whom may be most comfortable with drastically different interfaces. A data scientist may prefer to work from notebooks, while a platform engineer may feel most comfortable at their terminal. You want to be able to configure your pipeline to work for everyone.
Second, the machine learning landscape changes rapidly. From the tooling you use for monitoring to the actual models you train, things will likely change at a fairly quick pace. It is imperative that you be able to update and modify your pipeline—in a safe, version-controlled way—without having to undergo major rearchitecting.
Metaflow and Cortex allow you to construct your pipeline in a way that satisfies both of these concerns.
If you enjoy either of these projects, consider leaving a star on the Cortex or Metaflow GitHub repositories.