4 steps to build real-world ML products - Part 2
Aug 14, 2023
This is Part 2 of this mini-series on how to solve real-world business problems using Machine Learning.
Last week we covered Problem framing. Today, we will take the next step, and learn how to prepare the data.
Remember 🙋
The 4 steps to building a real-world ML product are
Problem framing (last week)
Data preparation (today) 📊
Model training (next week)
MLOps
Example
Imagine you work at a ride-sharing app company in NYC as an ML engineer. And you want to help the operations team allocate the fleet of drivers optimally each hour of the day. The end goal is to maximize revenue.
Last week you learned how to frame this business problem as an ML problem.
Problem framing 🖼️
We will build a predictive model for taxi demand. The model will predict how many rides will be requested
in each area of NYC
in the following 60 minutes
Before we can start building any ML model, we need to prepare the data.
Step 2. Data preparation
In real-world ML projects there is no Kaggle-like dataset with N columns for the features and 1 with the target. Instead, you have to create this dataset yourself, starting with raw data.
In this case, you have the list of taxi rides that have happened in NYC in the last 24 months, including
- the date and time of the ride, and
- the pickup location
This data is collected by the application backend and sent to data storage (aka data warehouse), where you can read it and use it for your ML service.
However, before doing so, you need to pre-process it by following these 3 steps:
- Data validation. Remove wrong or buggy ride events. For example, test events generated by your development team that accidentally ended up in the production data.
- Aggregation of events into time-series data. Your model will use historical one-hour data intervals.
- Transformation of time-series data into pairs (features, target).
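The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline: the layout of `raw_rides`, the `is_test` flag, the date-range check, and the function names `validate` and `aggregate_hourly` are all assumptions made for the example. It uses only the standard library so it runs anywhere; on real data you would more likely reach for a dataframe library.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical raw ride events: (pickup_datetime, pickup_location_id, is_test_event)
raw_rides = [
    (datetime(2023, 8, 1, 0, 5), 1, False),
    (datetime(2023, 8, 1, 0, 40), 1, False),
    (datetime(2023, 8, 1, 0, 50), 2, False),
    (datetime(2023, 8, 1, 2, 15), 1, False),
    (datetime(2023, 8, 1, 2, 20), 1, True),   # test event, must be dropped
    (datetime(2035, 1, 1, 0, 0), 2, False),   # timestamp outside the valid range
]

def validate(rides, start, end):
    """Keep only non-test rides whose pickup time falls in [start, end)."""
    return [
        (ts, loc)
        for ts, loc, is_test in rides
        if not is_test and start <= ts < end
    ]

def aggregate_hourly(rides, start, end):
    """Count rides per location and hour; hours without rides get a 0.

    `start` is assumed to be aligned to the top of an hour.
    """
    counts = Counter(
        (loc, ts.replace(minute=0, second=0, microsecond=0)) for ts, loc in rides
    )
    series = {}
    for loc in sorted({loc for _, loc in rides}):
        hour, values = start, []
        while hour < end:
            values.append(counts[(loc, hour)])
            hour += timedelta(hours=1)
        series[loc] = values
    return series

start = datetime(2023, 8, 1, 0)
end = datetime(2023, 8, 1, 3)
hourly_rides = aggregate_hourly(validate(raw_rides, start, end), start, end)
print(hourly_rides)  # {1: [2, 0, 1], 2: [1, 0, 0]}
```

Note the explicit zero for hours with no rides. Without it, the time series would have gaps, and the windows built in the next step would no longer cover consecutive hours.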
How to transform time-series data into Supervised ML data?
Most supervised ML models (e.g. XGBoost) do not work directly with time-series data. Instead, you need to transform the time-series data into pairs (features, target), where
- features are the model inputs (here, the ride counts from the previous hours)
- target is the model output (here, the ride count in the following hour)
To transform time-series data into (features, target) pairs you define
- a window length (e.g. last 12 hours) for the size of the input feature vector.
- a step size (e.g. 1 hour), to control the total number of samples.
and you slide this window over the time series: each window becomes a feature vector, and the value right after it becomes the target.
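Here is a minimal sketch of that transformation for a single area. The function name `make_features_and_targets` and the toy ride counts are made up for illustration, and a window of 3 hours with a step of 2 is used only to keep the output short.

```python
def make_features_and_targets(series, window_length, step_size):
    """Slice one time series into (features, target) pairs.

    features: `window_length` consecutive values
    target:   the value right after that window
    """
    features, targets = [], []
    # The last valid window must leave one value after it for the target.
    last_start = len(series) - window_length - 1
    for i in range(0, last_start + 1, step_size):
        features.append(series[i:i + window_length])
        targets.append(series[i + window_length])
    return features, targets

# Toy hourly ride counts for one area (made-up numbers)
hourly_counts = [5, 3, 2, 4, 8, 12, 15, 11]

X, y = make_features_and_targets(hourly_counts, window_length=3, step_size=2)
for x_row, y_val in zip(X, y):
    print(x_row, "->", y_val)
# [5, 3, 2] -> 4
# [2, 4, 8] -> 12
# [8, 12, 15] -> 11
```

A smaller step size produces more, heavily overlapping samples; a larger one produces fewer samples that share less history with each other.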
My advice 🧠
It is best to package all the data preprocessing steps into a single function that you can run from the command line, and that you can later reuse as your feature pipeline (more on this in 2 weeks).
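One way this could look, as a rough sketch using `argparse` from the standard library. The names `preprocess`, `parse_args`, and `main`, the two flags, and the placeholder input are all hypothetical; a real version would read the validated and aggregated rides from your data warehouse instead.

```python
import argparse

def preprocess(hourly_counts, window_length, step_size):
    """Turn an hourly time series into (features, targets) lists."""
    features, targets = [], []
    for i in range(0, len(hourly_counts) - window_length, step_size):
        features.append(hourly_counts[i:i + window_length])
        targets.append(hourly_counts[i + window_length])
    return features, targets

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Build training data from hourly ride counts."
    )
    parser.add_argument("--window-length", type=int, default=12,
                        help="number of past hours used as features")
    parser.add_argument("--step-size", type=int, default=1,
                        help="hours between consecutive samples")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Placeholder input: 24 fake hourly counts standing in for real data.
    hourly_counts = list(range(24))
    features, targets = preprocess(
        hourly_counts, args.window_length, args.step_size
    )
    print(f"Generated {len(features)} (features, target) pairs")
    return features, targets

if __name__ == "__main__":
    main()
```

Saved as, say, `preprocess.py`, this would run with `python preprocess.py --window-length 12 --step-size 1`. Keeping the logic in `preprocess` and the argument handling in `main` means the same function can be imported elsewhere later on.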
Next steps
So far we have
✅ defined the ML problem to solve and
✅ generated our training data.
Next week, we will move on to step 3, aka model training.