Notebooks are the default development environment for data scientists, and as a result, notebooks are where most models begin their lives. But models eventually—hopefully—end up in production, and unfortunately, the path from notebook to production is not exactly smooth.
There are platforms that offer an “end-to-end” experience, but they require you to migrate your entire stack to their platform and leave you trapped within whatever constraints they impose.
On the other hand, there are platforms like Cortex which, while open source and fully configurable, require you to leave the notebook environment to trigger a deployment.
With Cortex 0.21, we took a big step in bridging this gap. After hearing from data scientists and machine learning engineers who wanted to deploy directly from a notebook (instead of the CLI), we’ve officially released the Cortex Python client, providing a Python interface for triggering deployments.
Below, I’ve written up a quick guide to deploying a model directly from a notebook with Cortex.
1. Defining an inference serving API for our model
If you’re already familiar with Cortex deployments, feel free to skip to the next section. Otherwise, read on.
The first thing we need to do to deploy to production with Cortex is to define an API for serving predictions from our model. In Cortex, we write prediction serving code in what we call Predictors. You can read more about Cortex Predictors in the docs, but suffice it to say that if you’ve written a Flask service for a model, this will be familiar (Cortex uses FastAPI under the hood).
A Cortex Predictor does a couple of things. First, it defines an __init__() method, which is called on deploy and can be used to initialize your models, encoders, etc. Second, it defines a predict() method, which contains the actual request-handling code. It also exposes some optional pre- and post-predict hooks, should you need to run async operations before or after inference.
For example, the Predictor below initializes and serves inference from a fine-tuned ALBERT model stored in S3.
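Here is a minimal sketch of such a Predictor. The bucket and key come from the API’s config, the label set and the albert-base-v2 tokenizer are placeholders, and the sketch assumes the TorchScript model was traced to take token IDs as its only input:

```python
# predictor.py

import boto3
import torch
from transformers import AlbertTokenizer


class PythonPredictor:
    def __init__(self, config):
        # Called once on deploy: download the TorchScript model from S3
        # (bucket/key are placeholders passed in via the API's config)
        s3 = boto3.client("s3")
        s3.download_file(config["bucket"], config["key"], "albert.pt")

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = torch.jit.load("albert.pt", map_location=self.device).eval()

        self.tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
        self.labels = ["negative", "positive"]  # placeholder label set

    def predict(self, payload):
        # Called per request: payload is the parsed JSON body
        tokens = self.tokenizer(payload["text"], return_tensors="pt")
        with torch.no_grad():
            # Assumes the traced model accepts input IDs only
            logits = self.model(tokens["input_ids"].to(self.device))[0]
        return self.labels[logits.argmax().item()]
```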
(Obviously, the S3 setup doesn’t really matter—however you store models works fine. Also, there’s no particular reason that I used TorchScript here. I just like it.)
Now, in addition to a Predictor, our API needs a configuration file, which we provide in the form of a YAML manifest (though this will also be achievable in Python very soon; see the conclusion for details).
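For the Predictor above, a minimal manifest looks something like this. The API name, S3 locations, and compute resources are placeholders, and the exact set of fields can vary a bit between Cortex versions:

```yaml
# cortex.yaml

- name: text-classifier
  predictor:
    type: python
    path: predictor.py
    config:
      bucket: my-model-bucket   # placeholder
      key: albert/model.pt      # placeholder
  compute:
    cpu: 1
    gpu: 1
    mem: 4G
```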
The above is pretty simple, but Cortex exposes a lot of other optional knobs for configuring things like concurrency limits, autoscaling behavior, monitoring, prediction tracking, and more.
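As one example, autoscaling bounds can be set per API by adding a block like the following to the entry above (the field names here are indicative and may differ slightly across Cortex versions):

```yaml
  autoscaling:
    min_replicas: 1
    max_replicas: 10
    target_replica_concurrency: 8
```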
On deploy, Cortex packages these files and their dependencies into a Docker image, deploys them to a Cortex cluster (running on your AWS account), and exposes your API as a web service behind a load balancer. Deployments are versioned, allowing you to reproduce and audit them later.
Now that everything is ready, we can deploy.
2. Deploying models to production from a notebook
In the past, the only way to deploy with Cortex was from the CLI.
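With the manifest saved as cortex.yaml in the project directory, that was a single command:

```bash
$ cortex deploy
```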
Now, however, deployments can be triggered directly from Python.
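Building on the files above, a notebook cell like the following triggers the same deployment. The environment name (“aws”) and the exact deploy() arguments are assumptions here; adjust them to match your cluster setup:

```python
import cortex

# Connect to the Cortex environment backing your cluster
# ("aws" is a placeholder environment name)
cx = cortex.client("aws")

# Trigger the deployment from the YAML manifest
api = cx.deploy("cortex.yaml")
```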
The deploy() method returns an object with which we can access our APIs directly. For example, building on the cell above, we can also generate and display predictions.
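For instance, assuming the object returned by deploy() exposes the API’s endpoint (shown here as api["endpoint"]) and that the API accepts a JSON body with a "text" field, as in the Predictor above:

```python
import requests

# Send a sample request to the deployed API and print the prediction
response = requests.post(
    api["endpoint"],  # assumes the returned object exposes the endpoint URL
    json={"text": "Deploying from a notebook is surprisingly painless."},
)
print(response.text)
```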
Of course, your notebook environment will need access to your AWS account and you’ll need to install the Cortex Python package: $ pip install cortex
And with that, you have a new deployment, directly from your notebook.
The deployment is a normal Cortex deployment, which means features like rolling updates come for free. To update your model, all you need to do is run your deploy code again, and Cortex will replace the model with zero downtime.
Different interfaces for different developers
One of the lessons we’ve learned working on Cortex for the last two years is that the ideal ergonomics for data scientists are oftentimes different from those of software engineers.
Of course, this isn’t a hard and fast rule, but it has been our experience that a large portion of data scientists’ primary exposure to programming has been in a notebook environment; they haven’t had much experience working from a CLI, and therefore have very little desire to start.
Because Cortex, as a deployment platform, is used by machine learning engineers, MLOps specialists, and data scientists, it needs interfaces that are comfortable for every kind of developer. With the Python client, we think we’ve taken a big step toward doing just that for data scientists, and this is just the start.
In our next release, we’re planning on expanding Cortex’s Python interface even more, allowing users to define their API configurations as Python dictionaries instead of YAML files.
To get an update when this feature is released, Watch and Star the Cortex repo on GitHub.