Modular Real-time Feature Pipelines in Python

Aug 07, 2023

A real-time feature pipeline is a program that constantly transforms

raw data, e.g. stock trades from an external web socket, into
features, e.g. OHLC (open-high-low-close) data for stock trading

and saves these features into a Feature Store.

Real-time feature pipelines are used for real-time ML problems like fraud detection, or cutting-edge recommender systems.

Once the features are in the store, you can fetch them to

train an ML model, from the offline feature store or
generate predictions with your deployed model, from the online feature store.

The problem

To ensure your deployed model performance matches the test metrics you get at training time, you need to generate features IN THE EXACT SAME WAY.

This is especially tricky for real-time feature pipelines, where

live raw data often comes from an external web socket, while
historical data comes from external offline storage, like a data warehouse.

The anti-pattern

You might be tempted to have 2 separate pipelines

the real-time pipeline for live data (e.g built with Bytewax), which runs 24/7 and generates the online features, and
a batch-scoring pipeline for historical data (e.g. Spark job), that you use to backfill your offline feature store on a daily frequency.

However, this solution has 2 main drawbacks:

Code duplication, which makes your ML system harder to debug and maintain.
Potential Offline-Online skew, which can cause your deployed model to underperform and fail silently.

Online-offline feature skew ⚠️

If your offline feature engineering code is different from your online feature engineering code (aka online-offline skew), your ML model will perform worse than expected.

This is one of the most common silent bugs in ML systems, called online-offline feature skew.

Is there a better solution?

Yes, there is.

The solution

We would like to re-use as much code as possible, and only re-write pre-processing and post-processing logic, depending on

the input source, either web socket or data warehouse, and
the output sink, either printing on the console (for debugging), the online feature store (for real-time inference), or the offline feature store (for ML Model training)

The simplest way to achieve this is to modularize your real-time pipeline, decoupling:

business logic transformations (aka aggregation of trades in tumbling windows), from
pre-processing and post-processing logic, to retrieve raw input and serve the final output.

Hands-on implementation in Python

In this repository, you will find a fully working implementation of a modular real-time feature pipeline using Python and Bytewax.

Enjoy it, and give it a star ⭐ on GitHub if you found it useful

Let’s go real time! ⚡

The only way to learn real-time ML is to get your hands dirty.

→ Go pip install bytewax
→ Support their open-source project and give them a star on GitHub ⭐ and
→ Start building 🛠️

Let’s keep on learning,

Peace and Love

Pau

Wanna learn more Real World ML?

Subscribe to my weekly newsletter

Every Saturday

For FREE

Join 26k+ ML engineers ↓