Recently, we built and deployed a machine translation API that uses a collection of sequence-to-sequence models to translate text on demand between over 100 different languages.
Given an input that specifies the text and the language pair, something like this (the exact field names below are illustrative):
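```json
{
  "text": "Hello, how are you?",
  "source_language": "en",
  "destination_language": "es"
}
```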
It returns the translated text in a response along these lines:
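```json
{
  "translation": "Hola, ¿cómo estás?"
}
```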
Building this kind of system cost-effectively is a challenge. Each language-to-language translation requires a different model, and each model is over 300 MB. To hit realtime latency, the API also needs to run inference on GPUs (ASICs could potentially also be used, but we didn’t explore this).
A naive approach would be to deploy every model in its own API, or to group small numbers of models together in multi-model APIs, and run however many instances were needed to keep the entire system online. This approach would be insanely expensive.
For example, say we can fit 5 models safely into memory on a g4dn.xlarge (AWS’s cheapest and best GPU instance for inference). To deploy 1,000 models, we’d need 200 simultaneously deployed instances, which at $0.526 per hour each, works out to $105.20 per hour total, or $75,744 per month.
And, to be clear, that’s the minimum. If any particular model became popular, its request queue would grow and the system would scale up more instances to handle the workload, increasing our total spend.
With some experimentation, however, we were able to get the cost of running this system down to roughly $0.47 per hour. Below is a breakdown of everything we did to achieve this.
1. Deploying models on spot instances
Spot instances, if you are unfamiliar, are unused instances that AWS sells at a steep discount. By switching our deployments to spot, which simply involves setting “spot” to true in our Cortex cluster’s configuration, we were able to immediately cut the hourly per-instance cost of our g4dn.xlarge instances from $0.526 to $0.1578, a full 70% decrease.
Of course, if you’re rolling your own infrastructure, there are a few things you’ll have to consider when using spot instances.
First, spot instances can be recalled by AWS at any time. If you are using Kubernetes, this isn’t much of an issue, as failover should be handled gracefully, and because inference is a read-only operation, you don’t have to worry about losing state.
Second, availability can be an issue. There’s no guarantee that AWS will have enough spot capacity for your particular instance type at any given moment. This is typically more of an issue when you are scaling up to dozens or hundreds of instances, but it can affect even smaller deployments.
Both of these problems are solved in Cortex. For the first, Cortex is built on top of Kubernetes, and we’ve invested a good deal of work into making the self-healing and autoscaling processes seamless for spot instances. As for the second issue, Cortex exposes knobs for specifying a variety of behaviors around spot deployments, including:
- Backup instance types to switch to if your primary instance type is unavailable for spot (e.g. switching from g4dn.xlarge to g4dn.2xlarge).
- Whether or not on-demand instances are an acceptable backup, in the event that not enough spot instances are available.
- The exact proportion of on-demand to spot instances acceptable for deployment, if you do use on-demand instances as backups.
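In cluster.yaml, these knobs look something like the following sketch (the field names reflect one version of Cortex’s cluster configuration and may differ in yours, so treat them as illustrative and defer to the docs):

```yaml
# cluster.yaml (illustrative; exact field names vary across Cortex versions)
instance_type: g4dn.xlarge
min_instances: 1
max_instances: 200
spot: true
spot_config:
  # backup instance types to request when g4dn.xlarge spot capacity runs out
  instance_distribution: [g4dn.2xlarge]
  # whether to fall back to on-demand instances when spot capacity is unavailable
  on_demand_backup: true
  # proportion of on-demand to spot instances when on-demand backups are used
  on_demand_base_capacity: 0
  on_demand_percentage_above_base_capacity: 0
```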
If you’re rolling your own infrastructure, feel free to have a look around the Cortex codebase to see how we’ve implemented this. If you’re deploying with Cortex, simply edit your cluster.yaml like the above (more info in the docs).
2. Implementing multi-model caching
Another core bottleneck in the system is the need to treat individual models as inseparable from their API. If the only way to change which model is being served is to change the entire API, which triggers a redeploy, then we have to deploy enough APIs to keep every model available at all times.
But, if we could update models without affecting the API that performs inference on them, then things would change fundamentally.
We implemented two features related to this:
- Live reloading. Our API can now monitor a model in S3 and update its in-memory copy whenever the remote model changes.
- Multi-model caching. Because models can now be swapped in memory, our API can index an entire bucket of models, downloading and caching each one as needed.
Multi-model caching is particularly relevant here. Essentially, instead of having 200 different APIs, each configured to serve 5 specific models, we can now have 1 API configured to serve all 1,000 models.
There’s more information on multi-model caching in the docs, but at a high level, the API monitors a remote S3 bucket, indexing every model contained within. When a prediction is requested, the API checks whether the model is loaded in memory, then whether it is cached on disk, and finally, if the model is not stored locally at all, downloads it from S3.
When a new model is downloaded, it is loaded into memory. To make room, the least recently used model in memory is moved to disk, and the least recently used model on disk is removed entirely.
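To make the mechanics concrete, here is a rough sketch of that two-tier (memory plus disk) LRU lookup in Python. It is purely illustrative of the idea, not Cortex’s actual implementation; the bucket layout and the `load_model` callable are placeholders.

```python
import os
from collections import OrderedDict

import boto3  # used only for the S3 download step


class TwoTierModelCache:
    """Illustrative memory + disk LRU cache over a bucket of models in S3."""

    def __init__(self, bucket, prefix, cache_dir, mem_slots, disk_slots, load_model):
        self.bucket = bucket          # S3 bucket containing the models
        self.prefix = prefix          # key prefix the models live under
        self.cache_dir = cache_dir    # local directory used as the disk cache
        self.mem_slots = mem_slots    # max models held in memory
        self.disk_slots = disk_slots  # max models held on disk
        self.load_model = load_model  # placeholder: deserializes a model file
        self.s3 = boto3.client("s3")
        self.memory = OrderedDict()   # model name -> loaded model, in LRU order
        self.disk = OrderedDict()     # model name -> local path, in LRU order
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, name):
        # 1. Model is already in memory: mark it most recently used and return it.
        if name in self.memory:
            self.memory.move_to_end(name)
            return self.memory[name]

        # 2. Model is cached on disk: reuse the local copy.
        if name in self.disk:
            self.disk.move_to_end(name)
            path = self.disk[name]
        # 3. Model is not stored locally at all: download it from S3.
        else:
            path = os.path.join(self.cache_dir, name)
            self.s3.download_file(self.bucket, f"{self.prefix}/{name}", path)
            self.disk[name] = path
            if len(self.disk) > self.disk_slots:
                # Evict the least recently used on-disk model entirely.
                _, evicted_path = self.disk.popitem(last=False)
                os.remove(evicted_path)

        # Load the model into memory, evicting the least recently used one if needed
        # (its copy stays on disk, so it is effectively moved to the disk tier).
        model = self.load_model(path)
        self.memory[name] = model
        if len(self.memory) > self.mem_slots:
            self.memory.popitem(last=False)
        return model
```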
We can configure how Cortex handles multi-model caching in our API spec using a few fields. If you use Cortex’s CLI to deploy, the YAML manifest looks roughly like this (the field names below are a sketch; the docs have the exact schema):
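```yaml
# cortex.yaml (illustrative sketch; the exact schema varies across Cortex versions)
- name: translator
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
    models:
      dir: s3://translation-models/  # hypothetical bucket containing all 1,000 models
      cache_size: 5                  # max number of models kept in memory
      disk_cache_size: 20            # max number of models kept on disk
  compute:
    gpu: 1
```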
And if you prefer the Python client, the above configuration fields translate identically into the dictionary you use to define your API.
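For example, a sketch of the equivalent dictionary:

```python
# Illustrative sketch; the field names mirror the YAML above and, like it, may
# differ across Cortex versions. "s3://translation-models/" is a hypothetical bucket.
api_spec = {
    "name": "translator",
    "kind": "RealtimeAPI",
    "predictor": {
        "type": "python",
        "path": "predictor.py",
        "models": {
            "dir": "s3://translation-models/",
            "cache_size": 5,
            "disk_cache_size": 20,
        },
    },
    "compute": {"gpu": 1},
}
```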
Regardless of which interface you prefer, the steps are pretty straightforward. We give Cortex a model directory to index, set a limit on how many models can be kept in memory, and Cortex takes care of the rest.
3. Optimizing our APIs and routing requests
Our use of multi-model caching in the above example is an improvement, but it is still naive. We can do quite a bit better.
It might be obvious, but there is a tradeoff involved in multi-model caching, particularly around latency. Downloading and initializing a new model takes time, and if you are doing it for every request, you run the risk of spiking latency to an unusable degree.
However, there is a relatively simple optimization we can make here, one that has to do with challenging the assumption that all models will have similar usage. Looking at our data, we can see clearly that the bulk of translation requests fit into three buckets:
- X-to-English requests (majority)
- English-to-X requests (large minority)
- All other requests (small minority)
And even within those buckets, we can see that some language-to-language requests are more popular than others. We can use this information to optimize our system, benefiting from the efficiency of multi-model caching without compromising on latency.
We do this by creating four different APIs: one to handle all X-to-English requests, one to handle all English-to-X requests, one to handle all other requests, and one API to route requests accordingly.
The code for the three prediction APIs is largely identical. We can make some slight optimizations, since each API can bake in more assumptions, but the only thing that really changes is the set of models we tell each API to index.
The request routing API is simple: it checks which languages are requested and forwards the request to the corresponding prediction API. And because it isn’t actually performing inference, we can run it on very cost-effective hardware, making its impact on cost negligible.
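For illustration, the routing predictor can be as small as the sketch below. The class follows Cortex’s Python predictor convention (a PythonPredictor with __init__ and predict); the endpoint entries in the config and the payload field names are assumptions.

```python
# predictor.py for the routing API (illustrative sketch)
import requests


class PythonPredictor:
    def __init__(self, config):
        # Endpoints of the three prediction APIs, supplied through the API config
        # (hypothetical keys; set them to whatever your deployment exposes).
        self.endpoints = {
            "to_english": config["to_english_endpoint"],
            "from_english": config["from_english_endpoint"],
            "other": config["other_endpoint"],
        }

    def predict(self, payload):
        source = payload["source_language"]
        destination = payload["destination_language"]

        # Pick the prediction API that indexes the relevant models.
        if destination == "en":
            endpoint = self.endpoints["to_english"]
        elif source == "en":
            endpoint = self.endpoints["from_english"]
        else:
            endpoint = self.endpoints["other"]

        # Forward the request and return the translation unchanged.
        return requests.post(endpoint, json=payload).json()
```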
As a result of all this, our final system runs, in its most minimal state, three model-serving APIs (each on g4dn.xlarge spot instances) and one request routing API. Because the request routing API’s cost rounds to zero (particularly as we scale), we’ll focus on the cost of the three model-serving APIs, each of which costs $0.1578 per hour to run, for a total of roughly $0.47 per hour.
Lowering the cost of machine learning inference
Controlling inference costs is a fundamental challenge of machine learning, especially as models continue to grow in size and realtime inference continues to become standard.
The project we’ve outlined above, one which parallels Google Translate and other common tools, is prohibitively expensive for virtually all teams in its naive implementation. Without the above optimizations, it is simply impractical for most teams to run.
This project is not an outlier. In fact, it’s one of the simpler systems to optimize. As more complex production systems continue to be built, we need to keep pushing forward in designing infrastructure that lowers their effective cost if we want their adoption to be widespread.
If solving this kind of problem is exciting to you, consider checking out (and even contributing to) the work we’re doing on Cortex.