Fast And Easy Data Exploration For Machine Learning
Oct 23, 2023Sweetviz is the way
Tired of spending “too much time” doing data exploration before training your Machine Learning models?
Looking for a faster way to understand data issues and patterns, before you dive into the fun part of training your ML model?
Wanna learn how to train better ML models, by finding and fixing issues in your data?
You’ve come to the right place.
In this article, you will learn how to do data exploration at the speed of light, using the amazing open-source library Sweetviz.
Let’s go through a hands-on example and code you can find in this GitHub repository.
The problem
You need to generate your training data at the beginning of every real-world ML project.
Typically, you access an SQL-type database and write a long query that pulls data from several tables, aggregates it, and merges it into the final training set. The dataset contains a set of features and a target metric you want to predict.
If you get stuck at this stage, I recommend you read my previous article on best practices for dataset generation, using SQL and Python
๐ How to generate training data: faster and better
Once you have this data, you are very tempted to train your first ML model.
And this is a BIG mistake.
Instead, you should put a few minutes aside to run a data exploration.
But, why do I need data exploration?
Because, the most effective way to improve your results is NOT by trying more complex models, or by tunning hyper-parameters (real-world ML != Kaggle).
Instead, you should focus on increasing the quality of your data. And you do this with data exploration.
When you explore a dataset, you want to pay special attention to:
- Data bugs: Are there any weird things, that might show a bug in the data?
- Missing data: What is the percentage of missing observations for each feature?
- Data leakage: Are there features that look extremely predictive, and “too good to be true”?
The question is then
Is there a fast way to explore a dataset?
Yes, there is.
Its name is Sweetviz, an open-source library you will fall in love with.
Let’s go through a practical example and a Python script I developed for my ML projects. All the source code I present is publicly available in this GitHub repository. Feel free to use it in your next ML project.
Example
Let’s imagine you work as a data scientist in a Telecommunications company (aka Telco).
A common problem Telcos face is high customer churn. There is high competition in this sector, which means customers often find more attractive deals from competitors, so they switch.
To mitigate this, the marketing team comes to you with an idea:
“Can you develop a model to predict customer churn?”
With that model, they could take precautionary measures, like sending special offers to customers who are about to churn and keep them on board.
That sounds like a plan.
Step 1. Generate the training data
You go back to your laptop and do the first thing you need to do in every real-world ML project: you generate the training set. You can find the exact dataset I am using for this example here.
The dataset has one row per client, and each column has a few categorical and numerical features, plus the binary target Churn
you want to predict, that takes values:
Churn = "Yes"
meaning the customer churned.Churn = "No"
meaning the customer did not churn.
The features you pulled out from the DB are these:
RAW_FEATURES = { | |
'gender': 'string', | |
'SeniorCitizen': 'string', | |
'Partner': 'string', | |
'Dependents': 'string', | |
'tenure': 'numerical', | |
'PhoneService': 'string', | |
'MultipleLines': 'string', | |
'InternetService': 'string', | |
'PaymentMethod': 'string', | |
'MonthlyCharges': 'numeric', | |
'TotalCharges': 'numeric', | |
'date': 'datetime', | |
'activity': 'string,' | |
} |
On top of these raw features, you engineer a few others, like month
, dayOfMonth
, dayOfWeek
or hour
, to capture temporal patterns in churn rates.
All in all, this is the complete set of features (raw + engineered) you end up having at your disposal:
FEATURES_TO_MODEL = OrderedDict({ | |
# raw features | |
'gender': 'string', | |
'SeniorCitizen': 'string', | |
'Partner': 'string', | |
'Dependents': 'string', | |
'tenure': 'string', | |
'PhoneService': 'string', | |
'MultipleLines': 'string', | |
'InternetService': 'string', | |
'PaymentMethod': 'string', | |
'MonthlyCharges': 'numeric', | |
'TotalCharges': 'numeric', | |
'activity': 'string', | |
# engineered features | |
'dayOfWeek': 'numeric', | |
'month': 'numeric', | |
'dayOfMonth': 'numeric', | |
'hour': 'numeric', | |
}) |
Tempted as you are to jump into the modeling part, you (wisely) set some time apart to take a closer look at the dataset.
Step 2. Data exploration
Now that you have the data, you are ready to explore it.
For that, you use the eda.py script you can find in my GitHub repo. It uses Sweetviz, an open-source library that generates data exploration reports in seconds.
from argparse import ArgumentParser | |
from pathlib import Path | |
import os | |
from pdb import set_trace as stop | |
import sweetviz as sv | |
import pandas as pd | |
from features import add_features | |
ROOT_DIR = Path(__file__).parent.resolve().parent | |
DATA_DIR = ROOT_DIR / 'data' | |
VIZ_DIR = ROOT_DIR / 'viz' | |
def eda( | |
file_name: str, | |
target: str, | |
) -> None: | |
# load data into pandas | |
file_path = Path(DATA_DIR) / file_name | |
data = pd.read_csv(file_path) | |
# feature engineering | |
target_data = data[target].tolist() | |
data = add_features(data) | |
data[target] = target_data | |
# Sweetviz generates the HTML report | |
my_report = sv.analyze(data, target_feat=target) | |
output_html = VIZ_DIR / f'{file_name}.html' | |
my_report.show_html(output_html) | |
if __name__ == '__main__': | |
parser = ArgumentParser() | |
parser.add_argument('--file', dest='file', type=str) | |
parser.add_argument('--target', dest='target', type=str) | |
args = parser.parse_args() | |
eda(file_name=args.file, target=args.target) |
To explore a dataset you simply call this file from the command line, passing as parameters:
- the dataset you want to explore,
v1.csv
- the name of the target variable,
Churn
$ python eda.py --file v1.csv --target Churn
After a few seconds, the Sweetviz function analyze()
generates a nice-looking HTML report for you.
Problem #1. Data bugs
If you look at the temporal features dayOfWeek
, month
, dayOfMonth
and hour
you will see they have a very unbalanced distribution.
For example, dayOfWeek
is 1
(meaning Tuesday) for more than 90% of the observations.
This looks weird to you, so you go and ask Mark, the data engineer on the team.
“Hey, Mark! How is it possible that almost 90% of our churn events happen on Tuesday?”
To which he responds:
“That must be a bug in the churndate
field. I had to reprocess the table a couple of weeks ago, and I think I must have overwritten the actual churn date with the date I updated the record in the table.”
And this is exactly what is happening here. If you look at the other temporal features you will quickly realize that Mark overwrote 90% of the date
records on the 1st of February 2022.
You caught a data bug, that can be fixed and that will help you build a stronger model. Good job!
Problem #2. Missing data
Real-world datasets can be plagued with missing data. Sometimes, you cannot do much to remediate that. However, oftentimes, missing data can be addressed upstream, by your data engineer friend Mark.
From the Sweetviz report, you clearly see that tenure
has a strong negative correlation with Churn
. Great, this means tenure
is a predictive feature.
The only catch is that 20% of the samples do not have tenure
.
If you use the data as it is to train your model, you have to either:
- Impute these 20% missing values, using, for example, the sample median
- or simply drop this 20% of clients from our training data.
Either way, your model results will be worse than if you tried to fix this data quality issue upstream. So you go to Mark and ask:
“Do we have tenure
data from all our customers? I am asking because I have lots of missing data in my training set for the churn prediction project”
Mark looks at you, surprised, and says:
“We have tenure
data for all our clients. I bet there is a bug in the SQL query you wrote to generate the training data”
And it turns out that Mark is 100% right.
You fix the query and the percentage of missing tenure
values goes to 0. Super.
Problem #3. Data leakage
You explore the data to understand what features show a high correlation with Churn
. And sometimes, you happen to find features that look too good to be true.
For example, activity
is a categorical feature, with 2 possible values:
activity = "active"
meaning the customer used their phone 2 weeks prior to the churn rate.activity = "inactive"
otherwise.
If you look at the Sweetviz report you will see it has an extreme correlation with Churn
. In particular, all users that were active
did not churn… That seems too good to be true ๐คจ
And you happen to be right: activity
is a user-level feature that gets updated every day, so it reflects the state of the user at the time you generate the training data and not the time period before the churn event.
Hence, activity
is not a feature you can use to train your model, because it uses information from the future.
This is what we call a data leakage, aka a piece of data that you think you can use to train your model, but you should not, because you will not have it at inference time.
Data leakages produce ML models that seem to work impressively well when you train them but fail miserably when you deploy them.
Wrapping it up
Data exploration is the only way to detect 3 big blockers for any ML project:
- Data bugs
- Missing data
- Data leakage
Sweetviz is a very fast way to do so. And the script that I put together is something you can start using right away in your next ML project.
You can find all the code in this repo. Please give it a star on GitHub if you find it useful