Why we’re writing machine learning infrastructure in Go, not Python

Caleb Kaiser

Founding Team @ Cortex Labs

At this point, it should be a surprise to no one that Python is the most popular language for machine learning projects. While languages like R, C++, and Julia have their proponents—and use cases—Python remains the most universally embraced language, being used in every major machine learning framework.

So, naturally, our codebase at Cortex is 87.5% Go.

Machine learning algorithms, where Python shines, are just one component of a production machine learning system. To actually run a production machine learning API at scale, you need infrastructure that implements features like:

Autoscaling, so that traffic fluctuations don’t break your API
API management, to handle simultaneous API deployments
Rolling updates, so that you can update models while still serving users

Cortex is built to automate all of this infrastructure, along with other concerns like logging and cost optimizations.

Go is ideal for building software with these considerations, for a few reasons:

1. Concurrency is crucial for machine learning infrastructure

A user can have many different models deployed as distinct APIs, all managed in the same Cortex cluster. In order for the Cortex operator to manage these different deployments, it needs to wrangle a few different APIs. To name a couple:

Kubernetes APIs, which Cortex calls to deploy models on the cluster.
Various AWS APIs—EC2, S3, CloudWatch, and others—which Cortex calls to manage deployments on AWS.

The user doesn’t interact with any of these APIs directly. Instead, Cortex programmatically calls these APIs to provision clusters, launch deployments, and monitor APIs.

Making all of these overlapping API calls in a performant, reliable way is a challenge. Handling them concurrently is the most efficient approach, but it also introduces complexity, because we now have to worry about things like race conditions.

Goroutines (functions that run concurrently as lightweight threads managed by the Go runtime) and the channels they use to communicate with each other provide an elegant, out-of-the-box solution to this problem. We could set things up in Python using asyncio or something similar, but the fact that Go is designed specifically for concurrency makes our lives easier.
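To make the pattern concrete, here is a minimal sketch of fanning out several slow calls with goroutines and collecting the results over a channel. It is not Cortex's actual code: the `statusCheck` type, the `runChecks` function, and the simulated delays are hypothetical stand-ins for real Kubernetes and AWS requests.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// statusCheck is a hypothetical stand-in for one external API call
// (for example, a Kubernetes or AWS request). It is not Cortex code.
type statusCheck struct {
	name  string
	delay time.Duration
}

// result carries the outcome of one check back over a channel.
type result struct {
	name   string
	status string
}

// runChecks starts one goroutine per check and collects every result
// from a channel, so the simulated calls overlap instead of running serially.
func runChecks(checks []statusCheck) map[string]string {
	results := make(chan result)
	var wg sync.WaitGroup

	for _, c := range checks {
		wg.Add(1)
		go func(c statusCheck) {
			defer wg.Done()
			time.Sleep(c.delay) // simulate network latency
			results <- result{name: c.name, status: "ok"}
		}(c)
	}

	// Close the channel once every goroutine has sent its result,
	// which ends the range loop below.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Only this goroutine writes to the map, which avoids a data race.
	statuses := make(map[string]string, len(checks))
	for r := range results {
		statuses[r.name] = r.status
	}
	return statuses
}

func main() {
	checks := []statusCheck{
		{name: "kubernetes", delay: 30 * time.Millisecond},
		{name: "ec2", delay: 20 * time.Millisecond},
		{name: "s3", delay: 10 * time.Millisecond},
	}
	statuses := runChecks(checks)

	// Sort the names so the printed order is stable.
	names := make([]string, 0, len(statuses))
	for name := range statuses {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %s\n", name, statuses[name])
	}
}
```

Because the three simulated calls overlap, the total wait is roughly the longest single delay rather than the sum of all three. Funneling every result through one channel also means the shared map is only ever written from a single goroutine.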

2. Building a cross-platform CLI is easier in Go

The Cortex CLI is a cross-platform tool that allows users to deploy models and manage APIs directly from the command line.

Originally, we wrote the CLI in Python, but trying to distribute it across platforms proved to be difficult. Because Go compiles down to a single binary—no dependency management required—it offered us a simple solution.
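As a rough illustration of why this is convenient, the sketch below is a tiny CLI that uses only the Go standard library, so it builds into a single executable with nothing to install alongside it. The `demo` program name, its `deploy` and `version` subcommands, and the `-env` flag are all hypothetical and are not the Cortex CLI's real interface.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
)

// run parses the arguments and returns the message to print.
// Keeping this separate from main makes the logic easy to test.
func run(args []string) (string, error) {
	if len(args) == 0 {
		return "usage: demo <deploy|version> [flags]", nil
	}
	switch args[0] {
	case "deploy":
		fs := flag.NewFlagSet("deploy", flag.ContinueOnError)
		env := fs.String("env", "dev", "target environment (hypothetical flag)")
		if err := fs.Parse(args[1:]); err != nil {
			return "", err
		}
		return fmt.Sprintf("deploying to %s", *env), nil
	case "version":
		// runtime.GOOS and runtime.GOARCH report the platform the binary was built for.
		return fmt.Sprintf("demo 0.0.1 (%s/%s)", runtime.GOOS, runtime.GOARCH), nil
	default:
		return "", fmt.Errorf("unknown command %q", args[0])
	}
}

func main() {
	out, err := run(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

The same source can be cross-compiled for another platform by setting the `GOOS` and `GOARCH` environment variables when running `go build`, for example `GOOS=windows GOARCH=amd64`.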

The performance benefits of a compiled Go binary versus an interpreted language are also significant. According to the Computer Language Benchmarks Game, Go is dramatically faster than Python.

It’s perhaps not coincidental that many other infrastructure CLI tools are written in Go, which brings us to our next point.

3. The Go ecosystem is great for infrastructure projects

One of the benefits of open source is that you can learn from the projects you admire. For example, Cortex exists within the Kubernetes ecosystem, and Kubernetes itself is written in Go. We were fortunate to have a number of great open source projects within that ecosystem to learn from, including:

kubectl: Kubernetes’ CLI
minikube: A tool for running Kubernetes locally
helm: A Kubernetes package manager
kops: A tool for managing production Kubernetes
eksctl: The official CLI for Amazon EKS

All of the above are written in Go, and it’s not just Kubernetes projects. CockroachDB and HashiCorp’s infrastructure projects, including Vault, Nomad, Terraform, Consul, and Packer, are all written in Go as well.

The popularity of Go in the infrastructure world has another effect: many engineers who are interested in working on infrastructure are already familiar with Go. As an open source project, we have to factor accessibility to contributors into our choice of stack, and Go checks this box. In fact, one of the first engineers to join Cortex Labs full time only discovered us because he was researching machine learning projects written in Go.

4. Go is just a pleasure to work with

The final note I’ll make on why we ultimately built Cortex in Go is that Go is just nice.

Relative to Python, Go is a bit more painful to get started with. Go’s unforgiving nature, however, is what makes it such a joy for large projects. We still heavily test our software, but static typing and compilation, two things that make Go a bit less comfortable for beginners, act as guardrails for us, helping us to write (relatively) bug-free code.
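A small, hedged example of what those guardrails look like in practice: the `apiSpec` struct and `validate` function below are hypothetical and are not Cortex's actual schema. The commented-out line shows the kind of mistake the compiler rejects before the program ever runs.

```go
package main

import (
	"errors"
	"fmt"
)

// apiSpec is a hypothetical deployment description, not Cortex's real schema.
type apiSpec struct {
	Name        string
	MinReplicas int
	MaxReplicas int
}

// validate covers the rules the type system cannot express on its own.
func validate(s apiSpec) error {
	if s.Name == "" {
		return errors.New("name is required")
	}
	if s.MinReplicas < 1 {
		return errors.New("min replicas must be at least 1")
	}
	if s.MaxReplicas < s.MinReplicas {
		return errors.New("max replicas must be >= min replicas")
	}
	return nil
}

func main() {
	spec := apiSpec{Name: "classifier", MinReplicas: 1, MaxReplicas: 5}

	// The next line would fail to compile, because a string cannot be
	// assigned to an int field:
	// spec.MaxReplicas = "5"

	if err := validate(spec); err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	fmt.Println("spec ok:", spec.Name)
}
```

Type mismatches are caught at build time, which leaves tests and runtime validation free to focus on rules such as the replica bounds.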

There may be other languages you could argue offer a particular advantage, but on balance, Go best satisfies our technical and aesthetic needs.

Python for machine learning, Go for infrastructure

We still love Python, and it has its place within Cortex, specifically around writing prediction APIs. Cortex’s Predictor Interface is a Python interface through which developers can write APIs to serve models from any Python framework, including TensorFlow, PyTorch, ONNX, scikit-learn, and many others.

However, the Python code involved in serving predictions (which is all built on top of FastAPI and Uvicorn) is still eventually packaged up into Docker containers, and ultimately orchestrated by code that is written in Go.

If you’re interested in becoming a machine learning engineer, knowing Python is more or less non-negotiable. If you’re interested in working on machine learning infrastructure, however, you should seriously consider using Go.
