Why we don’t deploy machine learning models with Flask


When we first built Cortex, we used the same web framework as everyone else, Flask. Near the start of this year, however, we started running into some limitations, mostly stemming from Flask’s lack of native ASGI support.

We wound up transitioning to FastAPI, an ASGI framework built on top of Starlette, and Uvicorn, an ASGI server that boasts performance similar to Node. And nearly a year later, we’re beyond pleased with the results.

In this piece, I want to share our key motivations for transitioning from Flask to an ASGI framework. If you’re currently deploying models with Flask APIs, hopefully the following is helpful.

1. Complex model deployments require asynchronous operations

For non-production deployments—an internal dashboard or a personal project, for example—a simple Flask API is great, as you will probably not be doing much besides feeding inputs to your model and returning its outputs as JSON.

In a production machine learning system, however, a deployment will likely have many more responsibilities:

  • Interfacing with logging and analytics services
  • Integrating with feature stores and model registries
  • Performing optimization functions like model caching

We’ve found that asynchronous operations are the best way to handle these different responsibilities in a performant, manageable way.

For example, Cortex APIs expose pre- and post-processing hooks, which let you define asynchronous operations that run before and after inference is generated, without blocking the main thread:
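
The snippet below is a minimal sketch of that pattern using FastAPI directly; the `pre_process`, `predict`, and `post_process` functions are illustrative stand-ins, not Cortex's actual hook interface.

```python
import asyncio
import time

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


def predict(features: dict) -> dict:
    # Placeholder for the actual model call.
    return {"label": "positive", "score": 0.97}


async def pre_process(payload: dict) -> dict:
    # Pre-processing hook: e.g. fetch features from a feature store.
    await asyncio.sleep(0)  # stand-in for an async I/O call
    return payload


async def post_process(prediction: dict, latency: float) -> None:
    # Post-processing hook: record metrics or store results.
    await asyncio.sleep(0)  # stand-in for an async I/O call


@app.post("/predict")
async def handle(payload: dict, background_tasks: BackgroundTasks):
    start = time.time()
    features = await pre_process(payload)
    prediction = predict(features)
    # Scheduled to run after the response has been sent.
    background_tasks.add_task(post_process, prediction, time.time() - start)
    return prediction
```

With `BackgroundTasks`, the post-processing work is scheduled after the response goes out, so the client never waits on it.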

This is particularly useful for recording metrics or storing results, as neither of those operations needs to block the response.

Model caching is another example of an operation that benefits from async capabilities. With Cortex v0.21, we rolled out live reloading and multi-model caching, which allow Cortex to monitor models in an S3 bucket and update the API whenever a model changes. Multiple models can also be cached to disk and swapped in and out of memory as they’re called.

These features minimize downtime, increase the number of models a single API can serve, and generally increase the efficiency of the system—but they introduce some complexity and overhead. By running as many of their underlying operations as we can in the background, we make that tradeoff less severe.
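
To make the caching half of this concrete, here is a rough sketch of the idea: keep the N most recently used models in memory and lazily load the rest from disk on demand. The `ModelCache` class and `load_from_disk` callable are hypothetical, not Cortex’s implementation.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """Bounded in-memory cache over models that live on disk."""

    def __init__(self, load_from_disk: Callable[[str], Any], max_in_memory: int = 3):
        self._load = load_from_disk
        self._max = max_in_memory
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
        else:
            if len(self._models) >= self._max:
                self._models.popitem(last=False)  # evict least recently used
            self._models[name] = self._load(name)  # load from disk on a miss
        return self._models[name]
```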

2. Request-based autoscaling relies on async event loops

Scaling deployed models is a difficult infrastructure challenge. Models are often very large, resource intensive, and have low concurrency thresholds. The core task of scaling deployed models, then, is to spin up exactly as many replicas as are needed to serve incoming requests, while keeping costs as low as possible. There are many exacerbating factors here: the variability of models, the need for GPUs/ASICs, and so on.

We designed a custom autoscaler in Cortex to solve this problem, one that enables each API, rather than each instance, to autoscale independently. The autoscaler measures the concurrency threshold of a given model against the length of its request queue, and scales accordingly.

The crux of this request-based autoscaling, as we call it, is an asynchronous event loop that counts each request as it comes in, allowing Cortex to track the total length of each request queue.
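
As an illustration of the general idea (not Cortex’s actual autoscaler), the sketch below pairs a tiny ASGI middleware that counts in-flight requests with the replica calculation that compares that count against a target concurrency:

```python
import math


class InFlightCounter:
    """Minimal ASGI middleware that tracks how many requests are in flight."""

    def __init__(self, app):
        self.app = app
        self.in_flight = 0

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        self.in_flight += 1
        try:
            await self.app(scope, receive, send)
        finally:
            self.in_flight -= 1


def desired_replicas(in_flight: int, target_concurrency: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    # Request-based autoscaling in a nutshell: size the API so each replica
    # handles roughly `target_concurrency` concurrent requests.
    wanted = math.ceil(in_flight / max(target_concurrency, 1))
    return max(min_replicas, min(wanted, max_replicas))
```

Wrapping an app is a one-liner (`app = InFlightCounter(app)`); a real system would report the counter and queue lengths to an external autoscaler rather than computing replica counts in-process.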

It was actually Cortex’s autoscaling, combined with its support for Spot instances in deployments, that allowed AI Dungeon to reduce their infrastructure costs by over 90%:

How we scaled AI Dungeon 2 to support over 1,000,000 users


3. Every millisecond matters for realtime inference

Inference, particularly on larger models, is a slow, resource intensive process. While we do everything we can to accelerate the operations involved in actually generating predictions, we also look for any opportunity to increase performance within the broader request-response loop.

Uvicorn and Starlette, the ASGI server and framework that FastAPI builds on, currently benchmark as the best-performing Python frameworks available, well ahead of Flask.

And while the differences in raw server performance may seem minuscule compared to the gains we get by optimizing our actual inference operations, they still add up.

This is particularly true in realtime scenarios, in which predictions often need to be served in under 100 ms (Google’s Smart Compose, for example). Many Cortex users have similar performance requirements, and for them, almost any speedup is significant.

Flask is great for getting started, but not for deploying models at scale

The goal of this article isn’t to deride Flask. It’s an incredibly popular tool for a reason. It is minimal, flexible, and extensible—not to mention the enormous community that supports it. If you’re building an MVP, dabbling in production machine learning, or just want to get something out there quickly, Flask is a great option.

But at scale, production machine learning introduces a host of problems for which Flask is simply not an optimal solution. FastAPI’s native ASGI support makes it much easier to deploy models at scale. As a bonus, FastAPI has a Flask-like interface, allowing us to keep one of the things we really liked about working with Flask.

If you’re rolling your own ML infrastructure and are currently using Flask as your web framework, give an ASGI framework like FastAPI a try. Or, if you don’t want to build your own infrastructure but still want to get a feel for the performance benefits, try deploying a model with Cortex.


Like Cortex? Leave us a star on GitHub.
