Fast And Easy Data Exploration For Machine Learning

Oct 23, 2023

Sweetviz is the way

Tired of spending “too much time” doing data exploration before training your Machine Learning models?

Looking for a faster way to understand data issues and patterns, before you dive into the fun part of training your ML model?

Wanna learn how to train better ML models by finding and fixing issues in your data?

You’ve come to the right place.

In this article, you will learn how to do data exploration at the speed of light, using the amazing open-source library Sweetviz.

Let’s go through a hands-on example and code you can find in this GitHub repository.

The problem

You need to generate your training data at the beginning of every real-world ML project.

Typically, you access an SQL-type database and write a long query that pulls data from several tables, aggregates it, and merges it into the final training set. The dataset contains a set of features and a target metric you want to predict.
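
To make that concrete, here is a minimal sketch of such a query run from Python. The connection string and the table and column names (customers, subscriptions) are hypothetical, not the actual query behind this example:

  # a minimal sketch of pulling a training set from a SQL database;
  # the DSN, tables and columns below are hypothetical placeholders
  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine('postgresql://user:password@host:5432/telco')

  QUERY = """
  SELECT
      c.customer_id,
      c.gender,
      c.senior_citizen,
      s.tenure,
      s.monthly_charges,
      s.churn
  FROM customers c
  JOIN subscriptions s ON s.customer_id = c.customer_id
  """

  training_set = pd.read_sql(QUERY, engine)
  training_set.to_csv('data/v1.csv', index=False)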

If you get stuck at this stage, I recommend you read my previous article on best practices for dataset generation using SQL and Python:

๐Ÿ“ How to generate training data: faster and better


Once you have this data, you are very tempted to train your first ML model.

And this is a BIG mistake.

Instead, you should put a few minutes aside for data exploration.

But why do I need data exploration?

Because the most effective way to improve your results is NOT to try more complex models or to tune hyper-parameters (real-world ML != Kaggle).

Instead, you should focus on increasing the quality of your data. And you do this with data exploration.

When you explore a dataset, you want to pay special attention to:

  • Data bugs: Is there anything odd that might point to a bug in the data?
  • Missing data: What is the percentage of missing observations for each feature?
  • Data leakage: Are there features that look extremely predictive and “too good to be true”?
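
Sweetviz automates all three checks, but to make them concrete, here is roughly what they look like in raw pandas (the file and column names are borrowed from the Telco example you will see below):

  # a quick manual first pass over the three checks above
  import pandas as pd

  df = pd.read_csv('data/v1.csv')

  # missing data: % of missing observations per feature
  print(df.isna().mean().mul(100).round(1).sort_values(ascending=False))

  # data bugs: suspiciously unbalanced categorical distributions
  print(df['PaymentMethod'].value_counts(normalize=True))

  # data leakage: features almost perfectly aligned with the target
  print(pd.crosstab(df['activity'], df['Churn'], normalize='index'))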

The question, then, is:

Is there a fast way to explore a dataset?

Yes, there is.

Its name is Sweetviz, an open-source library you will fall in love with.

Let’s go through a practical example and a Python script I developed for my ML projects. All the source code I present is publicly available in this GitHub repository. Feel free to use it in your next ML project.

Example

Let’s imagine you work as a data scientist in a Telecommunications company (aka Telco).

A common problem Telcos face is high customer churn. There is high competition in this sector, which means customers often find more attractive deals from competitors, so they switch.

To mitigate this, the marketing team comes to you with an idea:

“Can you develop a model to predict customer churn?”

With that model, they could take precautionary measures, like sending special offers to customers who are about to churn, to keep them on board.

That sounds like a plan.

Step 1. Generate the training data

You go back to your laptop and do the first thing you need to do in every real-world ML project: you generate the training set. You can find the exact dataset I am using for this example here.

The dataset has one row per client, with a handful of categorical and numerical features, plus the binary target Churn you want to predict, which takes the values:

  • Churn = "Yes" meaning the customer churned.
  • Churn = "No" meaning the customer did not churn.

The features you pulled out from the DB are these:

  RAW_FEATURES = {
      'gender': 'string',
      'SeniorCitizen': 'string',
      'Partner': 'string',
      'Dependents': 'string',
      'tenure': 'numeric',
      'PhoneService': 'string',
      'MultipleLines': 'string',
      'InternetService': 'string',
      'PaymentMethod': 'string',
      'MonthlyCharges': 'numeric',
      'TotalCharges': 'numeric',
      'date': 'datetime',
      'activity': 'string',
  }

On top of these raw features, you engineer a few others, like month, dayOfMonth, dayOfWeek or hour, to capture temporal patterns in churn rates.
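
The eda.py script below imports this logic from a features.py module in the repo. Here is a minimal sketch of what such an add_features() might look like (the actual implementation in the repo may differ):

  import pandas as pd

  def add_features(data: pd.DataFrame) -> pd.DataFrame:
      """Derive temporal features from the raw `date` column."""
      data = data.copy()
      date = pd.to_datetime(data['date'])
      data['month'] = date.dt.month
      data['dayOfMonth'] = date.dt.day
      data['dayOfWeek'] = date.dt.dayofweek  # 0 = Monday, 1 = Tuesday, ...
      data['hour'] = date.dt.hour
      return data.drop(columns=['date'])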

All in all, this is the complete set of features (raw + engineered) you end up having at your disposal:

  from collections import OrderedDict

  FEATURES_TO_MODEL = OrderedDict({
      # raw features
      'gender': 'string',
      'SeniorCitizen': 'string',
      'Partner': 'string',
      'Dependents': 'string',
      'tenure': 'numeric',
      'PhoneService': 'string',
      'MultipleLines': 'string',
      'InternetService': 'string',
      'PaymentMethod': 'string',
      'MonthlyCharges': 'numeric',
      'TotalCharges': 'numeric',
      'activity': 'string',

      # engineered features
      'dayOfWeek': 'numeric',
      'month': 'numeric',
      'dayOfMonth': 'numeric',
      'hour': 'numeric',
  })

Tempted as you are to jump into the modeling part, you (wisely) set some time apart to take a closer look at the dataset.

Step 2. Data exploration

Now that you have the data, you are ready to explore it.

For that, you use the eda.py script you can find in my GitHub repo. It uses Sweetviz, an open-source library that generates data exploration reports in seconds.

  from argparse import ArgumentParser
  from pathlib import Path

  import sweetviz as sv
  import pandas as pd

  from features import add_features

  ROOT_DIR = Path(__file__).parent.resolve().parent
  DATA_DIR = ROOT_DIR / 'data'
  VIZ_DIR = ROOT_DIR / 'viz'


  def eda(
      file_name: str,
      target: str,
  ) -> None:

      # load data into pandas
      file_path = Path(DATA_DIR) / file_name
      data = pd.read_csv(file_path)

      # feature engineering: keep the target aside so add_features()
      # does not accidentally transform it
      target_data = data[target].tolist()
      data = add_features(data)
      data[target] = target_data

      # Sweetviz generates the HTML report
      my_report = sv.analyze(data, target_feat=target)
      output_html = VIZ_DIR / f'{file_name}.html'
      my_report.show_html(output_html)


  if __name__ == '__main__':

      parser = ArgumentParser()
      parser.add_argument('--file', dest='file', type=str)
      parser.add_argument('--target', dest='target', type=str)
      args = parser.parse_args()

      eda(file_name=args.file, target=args.target)

To explore a dataset, you simply call this script from the command line, passing as parameters:

  • the dataset you want to explore, v1.csv
  • the name of the target variable, Churn

$ python eda.py --file v1.csv --target Churn

After a few seconds, the Sweetviz function analyze() generates a nice-looking HTML report for you.

[Image: the Sweetviz report]

Problem #1. Data bugs

If you look at the temporal features dayOfWeek, month, dayOfMonth and hour, you will see that they have very unbalanced distributions.

[Image: temporal features derived from `date`]

For example, dayOfWeek is 1 (meaning Tuesday) for more than 90% of the observations.

[Image: the dayOfWeek feature]

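You can double-check this suspicion outside the report with one line of pandas:

  # sanity-check the distribution of the engineered feature
  print(data['dayOfWeek'].value_counts(normalize=True))
  # illustrative output: value 1 (Tuesday) accounts for over 90% of the rows
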
This looks weird to you, so you go and ask Mark, the data engineer on the team.

Hey, Mark! How is it possible that almost 90% of our churn events happen on Tuesday?

To which he responds:

That must be a bug in the `date` field. I had to reprocess the table a couple of weeks ago, and I think I must have overwritten the actual churn date with the date I updated the record in the table.

And this is exactly what is happening here. If you look at the other temporal features, you will quickly realize that Mark overwrote 90% of the `date` records with the 1st of February 2022.

You caught a data bug that can now be fixed, which will help you build a stronger model. Good job!

Problem #2. Missing data

Real-world datasets can be plagued with missing data. Sometimes there is not much you can do about it. Often, however, missing data can be addressed upstream, by your data engineer friend Mark.

From the Sweetviz report, you clearly see that tenure has a strong negative correlation with Churn. Great, this means tenure is a predictive feature.

The only catch is that tenure is missing for 20% of the samples.

[Image: the tenure feature]

If you use the data as it is to train your model, you have to either:

  • Impute the missing values, using, for example, the sample median,
  • or simply drop that 20% of clients from your training data.
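
In pandas, the two fallbacks would look roughly like this:

  # option 1: impute missing tenure with the sample median
  data['tenure'] = data['tenure'].fillna(data['tenure'].median())

  # option 2 (alternative): drop the clients with missing tenure
  data = data.dropna(subset=['tenure'])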

Either way, your model results will be worse than if you tried to fix this data quality issue upstream. So you go to Mark and ask:

Do we have tenure data for all our customers? I am asking because I have lots of missing values in my training set for the churn prediction project.

Mark looks at you, surprised, and says:

We have tenure data for all our clients. I bet there is a bug in the SQL query you wrote to generate the training data.

And it turns out that Mark is 100% right.

You fix the query and the percentage of missing tenure values goes to 0. Super.

Problem #3. Data leakage

You explore the data to understand what features show a high correlation with Churn. And sometimes, you happen to find features that look too good to be true.

For example, activity is a categorical feature, with 2 possible values:

  • activity = "active" meaning the customer used their phone in the 2 weeks prior to the churn event.
  • activity = "inactive" otherwise.

If you look at the Sweetviz report you will see it has an extreme correlation with Churn. In particular, all users who were active did not churn… That seems too good to be true 🤨

[Image: the activity feature looks too good to be true]

And you happen to be right: activity is a user-level feature that gets updated every day, so it reflects the state of the user at the time you generated the training data, not during the period before the churn event.

Hence, activity is not a feature you can use to train your model, because it uses information from the future.

This is what we call data leakage: a piece of data that you think you can use to train your model, but should not, because you will not have it at inference time.

Data leakage produces ML models that seem to work impressively well in training but fail miserably when deployed.
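
The remedy is to rebuild the feature point-in-time, using only events that happened before the label was observed. Here is a minimal sketch, assuming a hypothetical events log that is not part of the original repo:

  import pandas as pd

  def point_in_time_activity(
      events: pd.DataFrame,    # hypothetical log: one row per (customer_id, timestamp)
      customers: pd.Series,    # customer_id column of the training set
      snapshot: pd.Timestamp,  # the moment the churn label is observed
  ) -> pd.Series:
      """'active' if the customer has any event in the 2 weeks before `snapshot`."""
      window = events[
          events['timestamp'].between(snapshot - pd.Timedelta(weeks=2), snapshot)
      ]
      active = set(window['customer_id'])
      return customers.map(lambda cid: 'active' if cid in active else 'inactive')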

Wrapping it up

Data exploration is the only way to detect 3 big blockers for any ML project:

  • Data bugs
  • Missing data
  • Data leakage

Sweetviz is a very fast way to do so. And the script that I put together is something you can start using right away in your next ML project.

You can find all the code in this repo. Please give it a star on GitHub if you find it useful.
