How to structure your ML code
Feb 16, 2024
Because real-world ML projects do not fit in one Jupyter notebook
Jupyter notebooks are a great tool for fast iteration and experimentation during your ML development.
However, they are not enough once you go beyond this experimentation phase and want to build a real-world, end-to-end ML app.
The problem
ML apps, like any other piece of software, can only generate business value once they are deployed and used in a production environment.
And the thing is, deploying an all-in-one messy Jupyter notebook from your local machine to a production environment is neither easy nor recommended from an MLOps perspective.
Often a senior DevOps or MLOps colleague needs to rewrite your all-in-one messy notebook, which adds unnecessary friction and frustration for you and for the colleague helping you.
So the question is
Is there a better way to develop and package your ML code, so you ship faster and better?
Yes, there is.
Let me show you.
Solution
Let me show you 3 tips to structure your ML project code with the help of Python Poetry.
What is Python Poetry?
Python Poetry is an open-source tool that helps you declare, manage and install dependencies of Python projects, ensuring you have the right stack everywhere.
You can install it for free in your system with a one-liner.
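At the time of writing, the official installer is:

$ curl -sSL https://install.python-poetry.org | python3 -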
Tip 1 → Poetry new
Imagine you want to build an ML app that predicts earthquakes.
Go to the command line and type
$ poetry new earth-quake-predictor
With this command Poetry generates the following project structure.
earth-quake-predictor
├── README.md
├── earth_quake_predictor
│   └── __init__.py
├── pyproject.toml
└── tests
    └── __init__.py
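The generated pyproject.toml is where Poetry declares your project metadata and dependencies. It looks roughly like this (exact contents depend on your Poetry version):

[tool.poetry]
name = "earth-quake-predictor"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"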
You can now cd into this newly created folder
$ cd earth-quake-predictor
and create the virtual environment
$ poetry install
where all your project dependencies and code will be installed.
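From there, you can add the dependencies your project needs with poetry add. For example (pandas and scikit-learn are just example libraries):

$ poetry add pandas scikit-learn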
I recommend you build modular code for the different parts of your system, including:

- data processing and feature engineering
- model training
- model serving

like this:
earth-quake-predictor
├── README.md
├── earth_quake_predictor
│   ├── __init__.py
│   ├── data_processing.py
│   ├── plotting.py
│   ├── predict.py
│   └── train.py
├── pyproject.toml
└── tests
    └── __init__.py
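For example, the train.py module could expose a single training function (a minimal sketch, assuming pandas, scikit-learn and joblib as dependencies; the file paths, the model choice and the earthquake target column are placeholders):

# File -> earth_quake_predictor/train.py
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train(data_path: str, model_path: str) -> None:
    # load the training data
    df = pd.read_csv(data_path)

    # split features and target (placeholder column name)
    X = df.drop(columns=["earthquake"])
    y = df["earthquake"]

    # fit a simple baseline model
    model = RandomForestClassifier()
    model.fit(X, y)

    # persist the trained model to disk
    joblib.dump(model, model_path)

if __name__ == "__main__":
    train("data/train.csv", "model.joblib")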
Tip 2 → Doing notebooks the right way
If you are into notebooks and want to use them while developing your training script, I recommend you create a separate folder to store them:
earth-quake-predictor
├── README.md
├── earth_quake_predictor
│   ├── __init__.py
│   ├── data_processing.py
│   ├── plotting.py
│   ├── predict.py
│   └── train.py
├── notebooks
│   └── model_prototyping.ipynb
├── pyproject.toml
└── tests
    └── __init__.py
Now, instead of developing spaghetti code inside an all-in-one Jupyter notebook, I suggest you follow these 3 steps:

1. Write modular functions inside a regular .py file, for example a function that plots your data

# File -> earth_quake_predictor/plotting.py

def my_plotting_function():
    # your code goes here
    # ...

2. Add this cell at the top of your Jupyter notebook to force the kernel to auto-reload your imports, without you having to restart it

%load_ext autoreload
%autoreload 2

3. Import the function and call it from the notebook, without having to rewrite it.

from earth_quake_predictor.plotting import my_plotting_function

my_plotting_function()
Tip 3 → Dockerize your code 📦
To make sure your code will work in production as it works locally, you need to dockerize it.
For example, to dockerize your training script you need to add a Dockerfile
earth-quake-predictor
├── Dockerfile
├── README.md
├── earth_quake_predictor
│   ├── __init__.py
│   └── ...
├── notebooks
│   └── ...
├── pyproject.toml
└── tests
    └── __init__.py
The Dockerfile in this case can look as follows (a minimal sketch, assuming a Python 3.10 base image, a committed poetry.lock, and that the training entrypoint lives in earth_quake_predictor/train.py):
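FROM python:3.10-slim

# install Poetry inside the image
RUN pip install poetry

WORKDIR /app

# copy the dependency files first, to take advantage of Docker layer caching
COPY pyproject.toml poetry.lock ./
RUN poetry install --no-root --no-interaction

# copy the source code
COPY earth_quake_predictor ./earth_quake_predictor

# run the training script
CMD ["poetry", "run", "python", "-m", "earth_quake_predictor.train"]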
Where each instruction is a layer that builds on top of the previous one.
From this Dockerfile you can create a Docker image
$ docker build -t earth-quake-model-training .
and run your model training inside a Docker container
$ docker run earth-quake-model-training
BOOM!
That's it for today, guys.
Talk to you next week.
Enjoy the weekend.
Peace, Love and Laugh.
Pau