
Real-time feature engineering

Dec 02, 2024

Say you want to build a real-time ML system to predict short-term crypto prices, like the one my students and I build in my course, Building a Real Time ML System. Together.

Before you even get to think about

  • whether an LSTM model is a good fit for this task,

  • whether XGBoost will work better with this kind of data, or

  • whether a WebSocket API is a better way to serve the model predictions than a traditional REST API…

you FIRST NEED TO THINK about the data engineering work necessary to feed your ML models with

  • predictive signals for the target metric you want to predict, and

  • do it fast, so that you get predictions on time.

This is precisely what a real-time feature pipeline does.

Let me show you with an example ↓

 

Example 💁

Let’s design a scalable real-time feature pipeline that can transform

  • A continuous stream of crypto market trades, from one or more crypto exchanges,

into

  • A continuous stream of technical indicators that can potentially help you predict crypto price changes and find arbitrage opportunities.

 

What are technical indicators?

Technical indicators are mathematical calculations based on trade price and volume data, used to forecast potential market movements and identify trading opportunities.

They help traders analyse market trends, momentum, and potential reversal points.

For example, a common crypto trading indicator is the Relative Strength Index (RSI), which measures whether an asset is

  • overbought (RSI > 70) → suggests a short-term price decrease, or

  • oversold (RSI < 30) → suggests a short-term price increase.

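To make this concrete, here is a minimal sketch of an RSI calculation in plain Python. Note it averages gains and losses with a simple moving average; production libraries like ta-lib use Wilder's exponential smoothing, so the values will differ slightly.

```python
def rsi(prices: list[float], period: int = 14) -> float:
    """Relative Strength Index over the last `period` price changes.

    Simplified sketch: uses a simple average of gains/losses instead of
    Wilder's smoothing, so it only approximates ta-lib's RSI.
    """
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")

    # Price changes between consecutive trades/candles
    deltas = [b - a for a, b in zip(prices, prices[1:])]
    window = deltas[-period:]

    avg_gain = sum(d for d in window if d > 0) / period
    avg_loss = sum(-d for d in window if d < 0) / period

    if avg_loss == 0:
        return 100.0  # only gains in the window -> maximally overbought
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A steadily rising price series gives an RSI of 100 (overbought), a steadily falling one gives 0 (oversold).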

 

Our goal 🎯

We want a modular design that can help us quickly iterate over our pipeline by playing with at least 2 levers:

  • The sources from which we ingest raw data, for example Kraken, Binance, or Coinbase.

  • The feature engineering logic to transform these trades into potentially predictive trading signals, using a library like ta-lib.

The simplest design I am aware of that ticks these 2 boxes consists of 2 microservices:

  • a trade ingestor that fetches trades from an external API and pushes them to a Kafka topic ✅

  • a feature engineering service that consumes messages from the Kafka topic, transforms them into technical indicators using stateless and stateful streaming calculations, and pushes them to your Feature Store ✅
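To give you a feel for what a stateful streaming calculation looks like, here is a sketch of a tumbling-window aggregation that turns raw trades into OHLC candles, a common intermediate step before computing indicators. The trade fields (`price`, `qty`, `timestamp`) are illustrative, not any exchange's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float
    volume: float


def aggregate_trades(trades: list[dict], window_sec: int = 60) -> dict[int, Candle]:
    """Stateful tumbling-window aggregation: raw trades -> OHLC candles.

    Sketch only: a real streaming service (e.g. with Quix Streams) would
    do this incrementally and emit each candle when its window closes.
    """
    candles: dict[int, Candle] = {}
    for t in trades:
        # Assign the trade to its window's start timestamp
        bucket = int(t["timestamp"] // window_sec) * window_sec
        c = candles.get(bucket)
        if c is None:
            candles[bucket] = Candle(t["price"], t["price"], t["price"], t["price"], t["qty"])
        else:
            c.high = max(c.high, t["price"])
            c.low = min(c.low, t["price"])
            c.close = t["price"]  # trades arrive in time order
            c.volume += t["qty"]
    return candles
```

The "state" here is the partially built candle per window; in a real streaming engine that state survives across incoming messages and restarts.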

The magic infrastructure component here is Kafka, which enables data transfer and decoupling between your ingestion and your transformation steps.
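You can illustrate that decoupling with a toy sketch where an in-memory queue stands in for a Kafka topic (in production you would use a real broker and a client library such as Quix Streams; all names below are illustrative):

```python
import json
import queue
import threading

# Stand-in for a Kafka topic: the ingestor and the feature service
# only share this channel, never call each other directly.
trades_topic: queue.Queue = queue.Queue()


def trade_ingestor() -> None:
    # Pretend these prices arrived from an exchange websocket.
    for price in [97000.0, 97010.5, 96995.2]:
        trades_topic.put(json.dumps({"symbol": "BTC/USD", "price": price}))
    trades_topic.put(None)  # sentinel: stream ended (real Kafka has no such concept)


def feature_service(out: list) -> None:
    # Consumes trades without knowing anything about who produced them.
    while (msg := trades_topic.get()) is not None:
        out.append(json.loads(msg)["price"])


prices: list[float] = []
t1 = threading.Thread(target=trade_ingestor)
t2 = threading.Thread(target=feature_service, args=(prices,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Swap the queue for a Kafka topic and the two functions for two containers, and you have the design above: either side can be rewritten, scaled, or restarted without touching the other.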

With this design, you can focus your data science work (and team) on the second part of the pipeline.

 

While your data engineering efforts (and team) focus on integrating your pipeline with more sources.

As you start experimenting, you will end up building a mesh of loosely coupled microservices that can help you cover

  • many exchanges,

  • many trading indicators, and

  • many trading frequencies.

Tools 🛠️

The fastest way I know to implement a production-ready system like this is:

  • Python to build your microservices, together with a real-time data processing library like Quix Streams or Bytewax

  • Apache Kafka or Redpanda as the message broker.

  • Docker to easily deploy your Python services to Kubernetes, and

  • a feature store to save and serve your final features, for example Hopsworks.
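For local development, the broker piece is easy to spin up with Docker Compose. Here is a minimal single-node Redpanda sketch (check Redpanda's own docs for current image tags and flags before relying on it):

```yaml
# docker-compose.yml -- minimal local Redpanda broker (sketch)
services:
  redpanda:
    image: docker.redpanda.com/redpandadata/redpanda:latest
    command:
      - redpanda
      - start
      - --smp=1                # single core is plenty for local dev
      - --overprovisioned      # relax resource checks on laptops
      - --kafka-addr=PLAINTEXT://0.0.0.0:9092
      - --advertise-kafka-addr=PLAINTEXT://localhost:9092
    ports:
      - "9092:9092"
```

Your two Python microservices can then point their Kafka client at `localhost:9092`.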

 

BOOM!
