How to serve ML predictions 100x faster

Aug 12, 2024

A very common way to deploy an ML model, and make its predictions accessible to other services, is with a REST API.

It works as follows:

  1. The client requests a prediction -> Give me the price of ETH/EUR in the next 5 minutes

  2. The ML model generates the prediction

  3. The prediction is sent back to the client -> predicted price = 2,300 EUR

This design works, but it can become terribly inefficient in many real-world scenarios.

 

Why?

Because more often than not, your ML model will re-compute the exact same prediction it already computed for a previous request.

So you will be doing the same (costly) work more than once 😵‍💫.

This becomes a serious bottleneck when the request volume grows and your model is large, like a Large Language Model.

So the question is:

Is there a way to avoid re-computing costly predictions? 🤔

And the answer is … YES!

 

Solution 🧠

Caching is a standard technique to speed up API response time.

The idea is very simple. You add a fast key-value store to your system, for example Redis, and use it to store past predictions, keyed by the request they answer.

When the first request hits the API, your cache is still empty, so you

  • generate a new prediction with your ML model

  • store it in the cache, as a key-value pair, and

  • return it to the client

Now, when a second request for the same prediction arrives, you can simply

  • load the prediction from the cache (which is super fast), and

  • return it to the client
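Here is a minimal sketch of those two paths in plain Python. It is not the code from the repository: a dict stands in for Redis, and `slow_model_predict`, `get_prediction`, `model_calls` and the key format are hypothetical names picked for illustration.

```python
# In-memory stand-in for Redis (assumption: a real deployment would
# talk to a Redis server instead of using a Python dict).
cache: dict[str, float] = {}

# Counts how many times the (expensive) model actually runs.
model_calls = {"count": 0}


def slow_model_predict(pair: str, horizon_minutes: int) -> float:
    """Hypothetical placeholder for a costly model call."""
    model_calls["count"] += 1
    return 2300.0  # dummy value, not a real prediction


def get_prediction(pair: str, horizon_minutes: int) -> float:
    # The key must capture everything that makes two requests "the same".
    key = f"prediction:{pair}:{horizon_minutes}"

    # Cache hit: return the stored prediction, skip the model entirely.
    if key in cache:
        return cache[key]

    # Cache miss: run the model, store the result, then return it.
    prediction = slow_model_predict(pair, horizon_minutes)
    cache[key] = prediction
    return prediction


if __name__ == "__main__":
    get_prediction("ETH/EUR", 5)  # miss: the model runs
    get_prediction("ETH/EUR", 5)  # hit: served from the cache
    print("model calls:", model_calls["count"])
```

Note that the cache only helps when requests repeat. A request with a different pair or horizon gets its own key and still pays the full model cost the first time.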

To ensure the predictions stored in your cache are still relevant, you can set an expiry time (a time-to-live, or TTL) on each entry. Once a cached prediction gets too old, it is dropped from the cache, and the next request triggers a newly generated prediction.

For example, if your underlying ML model is generating price predictions 5 minutes into the future, you can tolerate predictions that are up to, say, 1-2 minutes old.
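To make the expiry idea concrete, here is a small sketch of a cache whose entries go stale after a fixed number of seconds. The `TTLCache` class is a hypothetical in-memory stand-in, not the repository's code. With Redis you would normally not write this yourself: you set the expiry when you store the key (for instance `set(key, value, ex=60)` in the redis-py client) and Redis drops the entry for you.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (illustrative only)."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable, so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, float]] = {}  # key -> (value, expires_at)

    def set(self, key: str, value: float) -> None:
        # Remember the value together with the moment it stops being valid.
        self._store[key] = (value, self.clock() + self.ttl_seconds)

    def get(self, key: str) -> Optional[float]:
        entry = self._store.get(key)
        if entry is None:
            return None  # never cached
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # too old: drop it and report a miss
            return None
        return value


if __name__ == "__main__":
    # Predictions older than 60 seconds are treated as stale.
    predictions = TTLCache(ttl_seconds=60)
    predictions.set("prediction:ETH/EUR:5", 2300.0)
    print(predictions.get("prediction:ETH/EUR:5"))
```

The TTL is the knob that trades freshness for speed: a shorter TTL means fresher predictions but more model calls.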

 

Example with full source code 👩‍💻👨🏽‍💻 

In this repository that I created you will find a minimal Python implementation of a REST API with and without caching using FastAPI and Redis.

🔗 → Click here to see the code

Git clone it, and run

$ make install

to install all project dependencies inside an isolated virtual env.

You can spin up the FastAPI server without cache

$ make api-without-cache

or with cache

$ make api-with-cache

Then send a batch of requests and measure their response times

$ make requests

Time taken: 1029.59ms <-- new prediction
Time taken: 13.09ms <-- very fast
Time taken: 8.47ms <-- very fast
Time taken: 7.74ms <-- very fast
Time taken: 12.98ms <-- very fast

Time taken: 1020.92ms <-- new prediction
Time taken: 8.40ms <-- very fast
Time taken: 12.61ms <-- very fast
Time taken: 10.55ms <-- very fast
...
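A quick back-of-the-envelope check on these numbers, using only the first batch of 5 timings printed above. It assumes, as the annotations suggest, that the first request is a cache miss and the next four are cache hits.

```python
# Response times (ms) copied from the first batch of the output above.
times_ms = [1029.59, 13.09, 8.47, 7.74, 12.98]

miss_ms = times_ms[0]        # first request: the model runs
hit_times_ms = times_ms[1:]  # following requests: served from the cache

mean_hit_ms = sum(hit_times_ms) / len(hit_times_ms)
speedup_per_hit = miss_ms / mean_hit_ms
mean_overall_ms = sum(times_ms) / len(times_ms)

print(f"mean cache hit:  {mean_hit_ms:.2f} ms")
print(f"speedup per hit: {speedup_per_hit:.1f}x")
print(f"mean over batch: {mean_overall_ms:.2f} ms")
```

Each cache hit is close to 100x faster than a fresh prediction. The average over the whole batch improves less, because it still includes the one slow request, so the overall gain depends on how many requests hit the cache before an entry expires.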

 

Wanna learn to build real-time ML systems, together? 🏗️🙋🏾‍♂️

On September 16th, 150+ brave students and I will start building, step by step, a real-time ML system that predicts crypto prices.

After completing this 4-week program (+ A LOT of hard work on your end) you will know how to

  • Build modular and scalable real-time ML systems

  • For the business problem you care about

  • Using any real-time data source

  • Following MLOps best practices, for fast iteration and short time-to-market.

And of course, we will implement REST API caching ⚡

 

Wanna know more about

Building a Real-time ML System. Together?
↓↓↓

👉 Click HERE to learn more

Talk to you next week,

Peace and Love
