Scalable Feature Engineering with Docker and Kafka
Dec 09, 2024
Last week, my students from Building a Real Time ML System and I built a real-time feature pipeline that transforms
- a continuous stream of crypto market trades (raw data)
- into a continuous stream of technical indicators (ML model features)
During our live coding session, Vincent, one of my students, asked:
"How do we horizontally scale the candles service?"
To which I answered GREAT QUESTION ❗
Let me explain ↓
The problem 🤔
Once you have a working real-time feature pipeline that transforms raw data into features for one cryptocurrency, for example BTC/USD,
it makes A LOT of sense to start experimenting with many cryptocurrencies.
This means your pipeline will start ingesting and processing 10x-100x more trades than before.
And this is where the problem comes in.
- Ingestion is easy. This is not where the bottleneck of our pipeline is.
- On the other hand, transforming these trades into candles (aka the candles service) using stateful window aggregations can quickly become a bottleneck.
And this is where horizontal scaling, leveraging Kafka topic partitions, consumer groups and Docker containers, comes into play.
The solution 🧠
Spinning up more instances (aka Docker containers) of your candles service alone won’t work…
… unless you make smart use of 3 key Kafka concepts:
- Topics and partitions → A topic is like a named channel where data flows through Kafka. It's divided into partitions: ordered, immutable sequences of messages that can be processed in parallel. Think of a topic as a highway and partitions as its parallel lanes.
- Message keys → Each message in Kafka can have a key. Messages with the same key are guaranteed to go to the same partition, ensuring ordered processing within that partition. In our case, it makes sense to use the cryptocurrency pair (e.g. BTC/USD, ETH/USD…) as the message key.
- Consumer groups → A set of services that cooperate to process messages from a topic, with each partition being exclusively processed by one consumer in the group. This enables parallel processing and horizontal scaling.
The first step to add scalability to your system is to add at least one more partition to your Kafka topic. You can do that from your broker console, or from the CLI. With Redpanda's rpk, for example:
$ rpk topic add-partitions trades --num 1
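You can verify the new partition count with rpk as well:
$ rpk topic describe trades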
Next, you need to include the trade's cryptocurrency pair (e.g. BTC/USD, ETH/USD, SOL/USD) as your message key, so all trades for the same cryptocurrency land in the same partition.
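Here is a minimal sketch of the producer side using the confluent-kafka Python client. The broker address, topic name and message format are illustrative assumptions, not the exact code from our pipeline:
```python
import json

from confluent_kafka import Producer

# Illustrative broker address; in our setup this would point at Redpanda
producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_trade(trade: dict) -> None:
    # Using the cryptocurrency pair as the message key guarantees that all
    # trades for the same pair land in the same partition, in order
    producer.produce(
        topic="trades",
        key=trade["pair"],  # e.g. "BTC/USD"
        value=json.dumps(trade),
    )

publish_trade({"pair": "BTC/USD", "price": 97250.5, "volume": 0.01})
producer.flush()
```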
Now, you can spin up two instances of your candles service, using the same consumer group name, and Kafka will automatically assign one partition to each.
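On the consumer side, the only thing the instances need to share is the consumer group name. A hedged sketch, again with confluent-kafka and illustrative names:
```python
import json

from confluent_kafka import Consumer

# Every instance of the candles service uses the SAME group.id, so Kafka
# splits the topic's partitions among the running instances automatically
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "candles",  # shared consumer group name
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["trades"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    trade = json.loads(msg.value())
    # ... update the stateful window aggregation for trade["pair"] here ...
```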
The workload will be balanced between the 2 candles service instances, each of them running in its own lightweight Docker container.
If you want to scale further, you can increase the number of topic partitions, and spin up even more instances of your transformation service.
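With Docker Compose, for example, this is a one-liner (assuming your transformation service is declared as a service called candles in your docker-compose.yml):
$ docker compose up --scale candles=4 -d
Keep in mind that consumers in a group beyond the number of partitions sit idle, so scale the partitions first.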
In a production environment, each of these Docker containers runs in its own Kubernetes pod, leveraging extra CPU and memory 🧠
The magic of Kubernetes and Kafka solved the problem elegantly ✨✨✨
Beautiful.