Modular Real-time Feature Pipelines in Python
Aug 07, 2023
A real-time feature pipeline is a program that constantly transforms
- raw data, e.g. stock trades from an external web socket, into
- features, e.g. OHLC (open-high-low-close) data for stock trading,
and saves these features into a Feature Store.
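To make the OHLC example concrete, here is a minimal sketch in plain Python of how trades could be turned into one OHLC row per time window. The function name `ohlc_by_window`, the `(timestamp, price)` trade format and the 60-second window are assumptions made for illustration, not something prescribed by a particular library.

```python
from typing import Dict, List, Tuple

Trade = Tuple[float, float]  # (timestamp in seconds, price) -- assumed format


def ohlc_by_window(trades: List[Trade], window_seconds: int = 60) -> Dict[int, dict]:
    """Aggregate trades into OHLC features, one entry per fixed-size window."""
    candles: Dict[int, dict] = {}
    # Process trades in time order so "open" is the first price in the
    # window and "close" is the last one.
    for ts, price in sorted(trades, key=lambda trade: trade[0]):
        # Every trade belongs to exactly one window, identified by its start.
        window_start = int(ts // window_seconds) * window_seconds
        candle = candles.get(window_start)
        if candle is None:
            candles[window_start] = {
                "open": price, "high": price, "low": price, "close": price,
            }
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price
    return candles


trades = [(0.0, 100.0), (12.5, 101.5), (59.9, 99.0), (60.0, 99.5), (75.0, 102.0)]
features = ohlc_by_window(trades, window_seconds=60)
```

With these five trades, the first three fall into the window starting at second 0 and the last two into the window starting at second 60, so `features` holds two OHLC rows.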
Real-time feature pipelines are used for real-time ML problems like fraud detection, or cutting-edge recommender systems.
Once the features are in the store, you can fetch them to
- train an ML model, from the offline feature store, or
- generate predictions with your deployed model, from the online feature store.
The problem
To ensure your deployed model's performance matches the test metrics you get at training time, you need to generate features IN THE EXACT SAME WAY for training and for inference.
This is especially tricky for real-time feature pipelines, where
- live raw data often comes from an external web socket, while
- historical data comes from external offline storage, like a data warehouse.
The anti-pattern
You might be tempted to have 2 separate pipelines:
- the real-time pipeline for live data (e.g. built with Bytewax), which runs 24/7 and generates the online features, and
- a batch pipeline for historical data (e.g. a Spark job), which you run daily to backfill your offline feature store.
However, this solution has 2 main drawbacks:
- Code duplication, which makes your ML system harder to debug and maintain.
- Potential online-offline skew, which can cause your deployed model to underperform and fail silently.
Online-offline feature skew ⚠️
If your offline feature engineering code is different from your online feature engineering code, your ML model will perform worse in production than its test metrics suggest.
This mismatch, called online-offline feature skew, is one of the most common silent bugs in ML systems.
Is there a better solution?
Yes, there is.
The solution
We would like to re-use as much code as possible, and only re-write pre-processing and post-processing logic, depending on
- the input source, either web socket or data warehouse, and
- the output sink, either printing on the console (for debugging), the online feature store (for real-time inference), or the offline feature store (for ML model training).
The simplest way to achieve this is to modularize your real-time pipeline, decoupling
- business logic transformations (e.g. aggregation of trades in tumbling windows), from
- pre-processing and post-processing logic, to retrieve raw input and serve the final output.
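The decoupling above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Bytewax API and not the code from the repository: the names `from_websocket`, `from_warehouse`, `to_ohlc` and `run_pipeline`, the JSON message format, and the in-memory lists standing in for the online and offline feature stores are all hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Trade = Dict[str, float]    # {"ts": seconds, "price": price} -- assumed schema
Feature = Dict[str, float]  # one OHLC row per window


def from_websocket(messages: Iterable[str]) -> Iterator[Trade]:
    """Pre-processing for live data: parse JSON messages (format is assumed)."""
    for message in messages:
        payload = json.loads(message)
        yield {"ts": float(payload["ts"]), "price": float(payload["price"])}


def from_warehouse(rows: Iterable[Tuple[float, float]]) -> Iterator[Trade]:
    """Pre-processing for historical data: (ts, price) rows, sorted by time."""
    for ts, price in sorted(rows):
        yield {"ts": float(ts), "price": float(price)}


def to_ohlc(trades: Iterable[Trade], window_seconds: int = 60) -> Iterator[Feature]:
    """Business logic, shared by every run. Assumes trades arrive in time order."""
    candle = None
    for trade in trades:
        start = int(trade["ts"] // window_seconds) * window_seconds
        price = trade["price"]
        if candle is not None and candle["window_start"] != start:
            yield candle  # the previous window is complete
            candle = None
        if candle is None:
            candle = {"window_start": start, "open": price,
                      "high": price, "low": price, "close": price}
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price
    if candle is not None:
        yield candle  # flush the last window when the input ends


def run_pipeline(source: Iterable[Trade], sink: Callable[[Feature], None]) -> None:
    """Wire any source to any sink through the same transformation."""
    for feature in to_ohlc(source):
        sink(feature)


live_messages = ['{"ts": 0, "price": 100.0}', '{"ts": 30, "price": 101.0}',
                 '{"ts": 61, "price": 99.0}']
warehouse_rows = [(61, 99.0), (0, 100.0), (30, 101.0)]

online_store: List[Feature] = []   # stand-in for the online feature store
offline_store: List[Feature] = []  # stand-in for the offline feature store
run_pipeline(from_websocket(live_messages), online_store.append)
run_pipeline(from_warehouse(warehouse_rows), offline_store.append)
```

Because both runs go through the same `to_ohlc` function, the two stores end up with identical features for identical trades. Passing `print` as the sink instead of `online_store.append` gives the console output for debugging.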
Hands-on implementation in Python
In this repository, you will find a fully working implementation of a modular real-time feature pipeline using Python and Bytewax.
Enjoy it, and give it a star ⭐ on GitHub if you find it useful.
Let’s go real time! ⚡
The only way to learn real-time ML is to get your hands dirty.
→ Go pip install bytewax
→ Support their open-source project and give them a star on GitHub ⭐ and
→ Start building 🛠️
Let’s keep on learning,
Peace and Love
Pau