
Let's build an Incremental Learning system ⚙️

Aug 20, 2024

Let me show you step-by-step how to design an ML system that continuously re-trains its serving model (aka incremental learning).

This is, by the way, one of the key ingredients behind TikTok’s recommender system.

Let’s start!

 

The problem

ML models are pattern-finding machines that try to capture the relationship between

  • a set of inputs available at prediction time (aka features), and

  • a metric you want to predict (aka target)

For most real-world problems these patterns between the features and the target are not static, but change over time. This is commonly known as concept drift. So, if you don’t re-train your ML models, their accuracy degrades.
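To make this concrete, here is a minimal sketch in plain Python. It assumes a made-up stream with a single feature whose true relationship with the target flips halfway through. The names `make_stream` and `OnlineSlope`, and all the numbers, are illustrative assumptions, not part of any real system. One model is frozen after the first half of the stream, the other keeps learning.

```python
import random


def make_stream(n=2000, seed=42):
    """Yield (x, y) pairs; the true slope changes from 2.0 to -1.0 halfway."""
    rng = random.Random(seed)
    for i in range(n):
        slope = 2.0 if i < n // 2 else -1.0
        x = rng.uniform(-1.0, 1.0)
        yield x, slope * x + rng.gauss(0.0, 0.05)


class OnlineSlope:
    """One-parameter linear model y = w * x, updated one example at a time."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.lr = lr

    def predict_one(self, x):
        return self.w * x

    def learn_one(self, x, y):
        error = self.predict_one(x) - y
        self.w -= self.lr * error * x  # one gradient step on the squared error


def run_experiment():
    data = list(make_stream())
    half = len(data) // 2

    frozen = OnlineSlope()   # trained once, never updated again
    updated = OnlineSlope()  # keeps learning after the drift
    for x, y in data[:half]:
        frozen.learn_one(x, y)
        updated.learn_one(x, y)

    frozen_err = 0.0
    updated_err = 0.0
    for x, y in data[half:]:
        frozen_err += abs(frozen.predict_one(x) - y)
        updated_err += abs(updated.predict_one(x) - y)
        updated.learn_one(x, y)  # predict first, then learn from the new target
    n = len(data) - half
    return frozen_err / n, updated_err / n


frozen_mae, updated_mae = run_experiment()
print(f"frozen model MAE after drift:  {frozen_mae:.3f}")
print(f"updated model MAE after drift: {updated_mae:.3f}")
```

In this toy setup the frozen model keeps predicting with the old slope, so its error stays high, while the updated model recovers after a short adaptation period.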

Now, the speed at which patterns change, and your model degrades, depends on the particular phenomenon you are modelling.

 

For example 💁

If you are trying to predict rainfall, re-training your ML model daily is more than enough. Rainfall patterns obey the laws of physics, and these do not change too much from one day to the next.

 

On the other hand, if you are trying to predict short-term crypto prices, where patterns between

  • available market data (aka features), and

  • future asset prices (aka target)

are short-lived, you must re-train your ML model very frequently. Ideally, in real-time.

A similar situation happens when you want to build a real-time recommender system, like TikTok’s famous Monolith, where user preferences change in the blink of an eye, and your ML model needs to be refreshed as often as possible.

 So now the question is

How do you build an ML system that continuously re-trains the ML model that serves the predictions ❓

Here is how ↓

 

Solution

Let’s design an ML system to predict short-term crypto prices using real-time market data. Moreover, we want the system to continuously update the ML model using the latest features and targets (aka price changes).

What about infrastructure?

In terms of infrastructure we need 4 services:

  1. Feature store to consistently store and serve the features and targets the ML model needs for training and for generating fresh predictions.

  2. Model registry to store and serve your ML model artifacts, and bridge the gap between your training pipeline and your inference pipeline.

  3. Streaming data platform, like Apache Kafka or Redpanda, for fast and scalable data transfer between your pipelines.

  4. Compute platform where your pipelines run as dockerized microservices. A popular choice is Kubernetes.

 

Ok, but how does the system work?

 

Like any ML system, our system can be decomposed into 3 types of pipelines:

  1. Feature pipelines that generate the input features and targets the ML model needs at training time and at inference time. In our case we have 2 feature pipelines.

  2. Training pipeline implemented as a streaming application that

    • Trains an initial model using historical data from the feature store,

    • Incrementally updates this model using the latest features coming from the Kafka topic, and

    • Pushes each model update to the model registry

  3. Inference pipeline implemented as a streaming application that

    • Initially loads the latest model from the registry,

    • Listens to incoming features from the Kafka topic,

    • Generates and serves a prediction, and

    • Regularly updates the model from the registry.
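The interplay between the training and inference pipelines can be sketched in a few lines of plain Python. This is a toy, single-process sketch under loud assumptions: `InMemoryRegistry`, `TrainingPipeline`, `InferencePipeline` and `OnlineLinear` are hypothetical names, the Kafka topic is replaced by a simple loop, and the target is assumed to arrive together with its features (in a real system the target is only known later).

```python
import copy
import random


class OnlineLinear:
    """Toy incremental model y = w * x + b, updated one example at a time."""

    def __init__(self, lr=0.1):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def predict_one(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        error = self.predict_one(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error


class InMemoryRegistry:
    """Hypothetical stand-in for a model registry: keeps versioned model copies."""

    def __init__(self):
        self._models = []

    def push(self, model):
        self._models.append(copy.deepcopy(model))
        return len(self._models)  # version numbers start at 1

    def latest_version(self):
        return len(self._models)

    def load_latest(self):
        return len(self._models), copy.deepcopy(self._models[-1])


class TrainingPipeline:
    def __init__(self, model, registry, historical):
        self.model, self.registry = model, registry
        for x, y in historical:           # initial model from historical data
            self.model.learn_one(x, y)
        self.registry.push(self.model)

    def on_message(self, x, y):
        self.model.learn_one(x, y)        # incremental update
        self.registry.push(self.model)    # push each update to the registry


class InferencePipeline:
    def __init__(self, registry, refresh_every=10):
        self.registry, self.refresh_every = registry, refresh_every
        self.version, self.model = registry.load_latest()
        self.seen = 0

    def on_message(self, x):
        prediction = self.model.predict_one(x)
        self.seen += 1
        if self.seen % self.refresh_every == 0:  # regular model refresh
            self.version, self.model = self.registry.load_latest()
        return prediction


rng = random.Random(0)


def make_event():
    """Made-up event: one feature and a target that is roughly 3 * feature."""
    x = rng.uniform(-1.0, 1.0)
    return x, 3.0 * x + rng.gauss(0.0, 0.01)


registry = InMemoryRegistry()
trainer = TrainingPipeline(OnlineLinear(), registry, [make_event() for _ in range(200)])
server = InferencePipeline(registry, refresh_every=10)

predictions = []
for _ in range(100):                      # stand-in for consuming a topic
    x, y = make_event()
    predictions.append(server.on_message(x))
    trainer.on_message(x, y)

print("serving model version:", server.version)
print("latest registry version:", registry.latest_version())
```

Note that the serving model always lags slightly behind the registry. How often the inference pipeline refreshes (`refresh_every` here) is a trade-off between prediction freshness and the cost of reloading the model.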

 

Attention 📣

Not all ML models can be used for incremental learning. For example, XGBoost is a very powerful algorithm for tabular data, but it's not inherently designed for incremental learning.

On the other hand, linear models and neural networks are well suited for incremental learning.
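Why are linear models a good fit? Because a single gradient step only needs the current example, so you can update the weights and throw the example away. Here is a minimal sketch in plain Python, assuming made-up feature names (`spread`, `volume_change`) and a made-up noise-free target. The `learn_one` / `predict_one` method names mirror the naming River uses, but the class itself is an illustration, not River's implementation.

```python
import random
from collections import defaultdict


class OnlineLinearRegression:
    """Linear regression over dict features, trained with one SGD step per example."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.bias = 0.0
        self.weights = defaultdict(float)  # unseen features start at weight 0

    def predict_one(self, x):
        return self.bias + sum(self.weights[name] * value for name, value in x.items())

    def learn_one(self, x, y):
        # Gradient step on the squared error of this single example.
        error = self.predict_one(x) - y
        self.bias -= self.lr * error
        for name, value in x.items():
            self.weights[name] -= self.lr * error * value


rng = random.Random(7)
model = OnlineLinearRegression()

# Hypothetical stream: target = 2 * spread - 1 * volume_change
for _ in range(2000):
    x = {"spread": rng.uniform(-1.0, 1.0), "volume_change": rng.uniform(-1.0, 1.0)}
    y = 2.0 * x["spread"] - 1.0 * x["volume_change"]
    model.learn_one(x, y)  # the example is not stored anywhere after this call

print({name: round(w, 3) for name, w in model.weights.items()})
```

Each update costs time proportional to the number of features, and memory does not grow with the length of the stream.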

 

Example with source code 👨‍💻 👩‍💻 🧑🏻‍💻 🧑🏼‍💻 🧑🏽‍💻 🧑🏾‍💻 

In this GitHub repository that I created you will find an implementation of this system in Python.

→ Give it a star on ⭐ GitHub to support my work 

The 2 key libraries I used are

  • Quix Streams for stream processing. The training and inference pipelines are Quix Streams applications that continuously listen to and process incoming data from the Kafka topics.

  • River for doing online machine learning on streaming data. River has an easy-to-use API that closely resembles that of scikit-learn.

 

To run the code on your end you just need to git clone the repo and then

  • Install all project dependencies in an isolated Python environment with

    $ make install

  • Start the feature pipelines with

    $ make producers

  • Start the training pipeline with

    $ make training

  • And start the inference pipeline with

    $ make predict

 

Attention 📣

I simplified the original system by

  • Mocking the 2 feature pipelines with the producers.py script

  • Mocking the feature store with a local CSV file, and

  • Mocking the model registry with model_registry.py and the local file system.
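To give you an idea of what a file-system-backed model registry can look like, here is a minimal sketch. It is NOT the `model_registry.py` from the repository. The class name `LocalModelRegistry`, the `model_<version>.pkl` file naming and the use of `pickle` are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


class LocalModelRegistry:
    """Hypothetical file-based registry: one pickle file per model version."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        # File names look like model_<version>.pkl
        return sorted(
            int(path.stem.split("_")[1]) for path in self.directory.glob("model_*.pkl")
        )

    def push(self, model):
        versions = self._versions()
        version = versions[-1] + 1 if versions else 1
        # Write to a temporary file first, then rename, so a reader
        # never sees a half-written model file.
        tmp_path = self.directory / f".model_{version}.tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(model, f)
        tmp_path.replace(self.directory / f"model_{version}.pkl")
        return version

    def load_latest(self):
        versions = self._versions()
        if not versions:
            raise FileNotFoundError("no model has been pushed yet")
        version = versions[-1]
        with open(self.directory / f"model_{version}.pkl", "rb") as f:
            return version, pickle.load(f)


# Usage: a plain dict stands in for a trained model.
with tempfile.TemporaryDirectory() as tmp:
    registry = LocalModelRegistry(tmp)
    v1 = registry.push({"w": 0.1})
    v2 = registry.push({"w": 0.2})
    latest_version, latest_model = registry.load_latest()
    print(latest_version, latest_model)
```

The training pipeline would call `push` after each update, and the inference pipeline would call `load_latest` on a schedule. Only unpickle files you trust, because `pickle.load` can execute arbitrary code.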

 

Wanna learn to build real-time ML systems, with me? 

On September 16th, 162 students and I will start building a real-time ML system like this. Step by step. In Python.

It will be a tough hike, but the payoff at the end will be immense.

After completing this 4-week program (+ A LOT of hard work on your end) you will know how to

  • Build modular and scalable real-time ML systems

  • For the business problem you care about

  • Using any real-time data source

  • Following MLOps best practices, for fast iteration and short time-to-market.


Talk to you next Saturday,

Enjoy the weekend,

Pau

The Real World ML Newsletter

Every Saturday
