Which embedding model should you use?
Apr 19, 2024
Today you will learn how to find the right embedding model for your RAG application. Let’s get started!
Problem
Text embeddings are vector representations of raw text that you compute using an embedding model.
These vector representations are then used for downstream tasks, like:
- Classification → for example, to classify tweet sentiment as either positive or negative.
- Clustering → for example, to automatically group news into topics.
- Retrieval → for example, to find similar documents to a given query.
Retrieval (the “R” in RAG) is the task of finding the most relevant documents given an input query. This is one of the most popular use cases for embeddings these days, and the one we will focus on today.
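To make the retrieval idea concrete, here is a minimal sketch of embedding-based retrieval, assuming the sentence-transformers package is installed; the model name, documents and query are purely illustrative.

from sentence_transformers import SentenceTransformer, util

# Load an open-source embedding model (an example choice, not a recommendation yet)
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

documents = [
    "Qdrant is an open-source vector database written in Rust.",
    "NDCG is a ranking metric used in information retrieval.",
    "Python is a popular language for machine learning.",
]
query = "Which vector database is built in Rust?"

# Embed the documents and the query into the same vector space
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

# Rank documents by cosine similarity to the query and keep the best match
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
print(documents[int(scores.argmax())])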
There are many embedding models, both open and proprietary, so the question is:
What embedding model is best for your problem?
Let me show you how to find the right model for your RAG application ↓
Solution
First, go to the Massive Text Embedding Benchmark (MTEB) Leaderboard to find the best embedding models for the retrieval task in your language, for example English.
As of today (April 18th, 2024) the number 1 model on the leaderboard is Salesforce/SFR-Embedding-Mistral, with:
- Embedding quality: 59%, measured as the average Normalized Discounted Cumulative Gain (NDCG) over 15 different datasets (see the NDCG sketch after this list).
- Model size: 7.1 billion parameters
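As a side note, here is a minimal sketch of how NDCG@k can be computed for a single query; this is a simplified illustration of the metric, not the exact MTEB implementation.

import math

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """Normalized Discounted Cumulative Gain for one ranked list of results."""
    def dcg(rels: list[float]) -> float:
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: the only relevant document was retrieved at rank 2 instead of rank 1
print(ndcg_at_k([0.0, 1.0, 0.0], k=3))  # ~0.63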
At this point you might think that Salesforce/SFR-Embedding-Mistral is the model you need… and you are probably wrong.
Why?
Because embedding quality is not the only measure you should look at when you build a real-world RAG app. Model size matters too, because larger models are slower and more expensive to run.
For example:
The 7th model on the leaderboard is snowflake-arctic-embed-l, with:
- Embedding quality: 55.98% → about 5% worse than the leader (in relative terms).
- Model size: 331 million parameters → about 95% smaller than the leader.
So, if you are willing to trade 5% of quality for a 95% reduction in model size (and cost), you would pick snowflake-arctic-embed-l.
In general, to find the sweet spot between embedding quality and cost, you need to run a proper evaluation of your retrieval step, using your:
- Dataset → e.g. explodinggradients/ragas-wikiqa
- Vector DB → e.g. Qdrant
- Other important RAG hyper-parameters, like your chunk size and chunk overlap (see the chunking sketch after this list).
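For the chunking hyper-parameters, here is a minimal sketch of fixed-size chunking with overlap; splitting by characters is an assumption made for illustration, real pipelines often split by tokens or sentences.

def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    # Slide a window of `chunk_size` characters, advancing by (chunk_size - chunk_overlap)
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = chunk_text("a long document " * 200, chunk_size=200, chunk_overlap=20)
print(len(chunks), len(chunks[0]))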
Let’s go through an example with full source code.
Hands-on example
All the source code shown in the video is available in this GitHub repository.
Give it a star ⭐ on GitHub to support my work.
Step 1. Git clone the code
From the terminal
$ git clone https://github.com/Paulescu/text-embedding-evaluation.git
Step 2. Install Python dependencies
$ make install
Step 3. Set up external services
Create a .env file
$ cp .env.example .env
and paste your:
- OPENAI_API_KEY
- QDRANT_URL
- QDRANT_API_KEY
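A minimal sketch of how code can read these variables at runtime, assuming the python-dotenv package is installed:

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
QDRANT_URL = os.environ["QDRANT_URL"]
QDRANT_API_KEY = os.environ["QDRANT_API_KEY"]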
OpenAI GPT-3.5 Turbo
You will need an OpenAI API key because ragas, the RAG evaluation framework we use, makes calls to `GPT-3.5 Turbo` to evaluate the quality of the retrieved context.
Qdrant
We will use Qdrant as the Vector DB, so you also need to create a FREE account on Qdrant.cloud to get your QDRANT_URL and QDRANT_API_KEY.
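As a quick sanity check, here is a minimal sketch of connecting to Qdrant Cloud with those credentials, assuming the qdrant-client package is installed:

import os
from qdrant_client import QdrantClient

client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
)
# Lists the collections in your cluster (empty on a fresh account)
print(client.get_collections())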
Step 4. Select the models and dataset you want to evaluate
Update the list of models you want to evaluate and the dataset in the config.yml file:
models:
  # 109 million parameters
  - sentence-transformers/all-mpnet-base-v2
  # 334 million parameters
  # - 'Snowflake/snowflake-arctic-embed-l'
  # 7.11 billion parameters
  # - 'Salesforce/SFR-Embedding-Mistral'
datasets:
  - explodinggradients/ragas-wikiqa
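For reference, here is a minimal sketch of how a script might loop over this config, assuming PyYAML is installed; the repository's actual loading code may differ.

import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

# Evaluate every (model, dataset) combination listed in the config
for model_name in config["models"]:
    for dataset_name in config["datasets"]:
        print(f"Evaluating {model_name} on {dataset_name}")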
Step 5. Run the evaluation
From the command line
$ make run-evals
The Python script behind this command:
- Loads the model and the dataset (questions, contexts and answers) from Hugging Face.
- Embeds the contexts into the Vector DB, in our case Qdrant.
- For each question, retrieves the top K most relevant documents from the Vector DB.
- Compares the information overlap between the retrieved documents and the correct answers, using context precision and context recall (see the sketch below).
- Finally, logs the results, so you know what worked best.
{ "model_name": "sentence-transformers/all-mpnet-base-v2", "dataset_name": "explodinggradients/ragas-wikiqa", "top_k_to_retrieve": 2, "context_precision": 0.9999999999499998, "context_recall": 0.7666666666666666, "seconds_taken_to_embed": 4.0, "seconds_taken_to_retrieve": 0.0 }
By benchmarking different models you will find where the sweet spot between quality and cost lies for your particular dataset and RAG setup.
Bonus → Which Vector DB should I use?
If you want to build a demo RAG app, any Vector DB will do the job.
However, if you plan on building real-world ML products you need to be more careful with your choice.
My personal recommendation when it comes to Vector DBs is Qdrant.
Why?
Because:
- It is high-performance (thanks to Rust), so you get the fastest and most accurate results at the cheapest cloud costs.
- It is extremely easy to scale and upgrade, and
- It gives you the option to keep your data 100% private, thanks to the new Qdrant Hybrid Cloud.
And before you leave…
Generative AI is very cool, but the reality is that most real world business problems are solved using tabular data and predictive ML models.
If you are interested in learning how to build end-2-end ML systems using tabular data and MLOps best practices, join the Real-World ML Tutorial + Community and get lifetime access to
→ 3 hours of video lectures
→ Full source code implementation
→ Discord private community, to connect with me and 350+ students
🎁 Gift
Use this direct payment link in the next 5 days and get an exclusive 30% discount!