Which embedding model should you use?

Apr 19, 2024

Today you will learn how to find the right embedding model for your RAG application. Let’s get started!

 

Problem❗

Text embeddings are vector representations of raw text that you compute using an embedding model.

 
An embedding model maps your raw text into a vector

These vector representations are then used for downstream tasks, like

  • Classification → for example, to classify tweet sentiment as either positive or negative.

  • Clustering → for example, to automatically group news into topics.

  • Retrieval → for example, to find similar documents to a given query.

Retrieval (the “R” in RAG) is the task of finding the most relevant documents given an input query. This is one of the most popular use cases for embeddings these days, and the one we will focus on today.
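To make this concrete, here is a minimal retrieval sketch using the open-source sentence-transformers library. The model name and the example texts are just illustrative choices, not the exact setup we use later:

from sentence_transformers import SentenceTransformer, util

# Any embedding model works here; this one is just an illustrative choice
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

documents = [
    "Qdrant is an open-source vector database written in Rust.",
    "The Eiffel Tower is located in Paris.",
    "RAG combines retrieval with text generation.",
]
query = "Which vector database is built in Rust?"

# Map raw text into vectors
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the documents by cosine similarity to the query
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:.3f}  {doc}")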

 

There are many embedding models, both open and proprietary, so the question is:

What embedding model is best for your problem? πŸ€”

 

Let me show you how to find the right model for your RAG application ↓

 

Solution πŸ§ 

First, go to the Massive Text Embedding Benchmark (MTEB) Leaderboard to find the best embedding models for the retrieval task in your language, for example English.

As of today (April 18th, 2024), the number 1 model on the leaderboard is Salesforce/SFR-Embedding-Mistral, with

  • Embedding quality: 59%, measured with the average Normalized Discounted Cumulative Gain (NDCG) over 15 different datasets.

  • Model size: 7.1 billion parameters

 
Top 10 embedding models as of April 18th, 2024
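In case NDCG is new to you, here is a toy sketch of how it is computed for a single query (the relevance labels below are made up):

import math

def dcg(relevances):
    # Discounted Cumulative Gain: relevant documents ranked near the top
    # contribute more than the same documents ranked lower
    return sum(
        (2**rel - 1) / math.log2(rank + 2)  # rank is 0-based, hence +2
        for rank, rel in enumerate(relevances)
    )

# Made-up relevance labels of the retrieved documents, in ranked order
retrieved = [3, 2, 0, 1]
# The best possible ordering of those same documents
ideal = sorted(retrieved, reverse=True)

print(f"NDCG: {dcg(retrieved) / dcg(ideal):.3f}")  # 1.0 means a perfect ranking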

At this point you might think that Salesforce/SFR-Embedding-Mistral is the model you need… and you are probably wrong 😡‍πŸ’«

Why ❓
Because embedding quality is not the only measure you should look at when you build a real-world RAG app. Model size matters, because larger models are slower and more expensive to run.

For example πŸ’
The 7th model in the leaderboard is snowflake-arctic-embed-l with

  • Embedding quality of 55.98% → 5% worse than the leader.

  • Model size: 334 million parameters → 95% smaller than the leader

So, if you are willing to trade 5% of quality for a 95% cost reduction, you would pick snowflake-arctic-embed-l.

 

In general, to find the sweet spot βš–οΈ between embedding quality and cost, you need to run a proper evaluation of your retrieval step, using your

  • Dataset → e.g. explodinggradients/ragas-wikiqa

  • Vector DB → e.g. Qdrant

  • Other important RAG hyperparameters, like your chunk size and chunk overlap (see the sketch below).
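To clarify the last point: chunk size controls how long each piece of text you embed is, and chunk overlap controls how much consecutive chunks share. A naive character-based chunker looks like this (real pipelines typically use a token-aware splitter):

def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    # Slide a window of `chunk_size` characters, moving forward by
    # `chunk_size - chunk_overlap` characters at each step
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("A long Wikipedia article goes here... " * 50)
print(len(chunks), "chunks")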

Let’s go through an example with full source code.

 

Hands-on example πŸ‘©πŸ½‍πŸ’»πŸ‘¨πŸ»‍πŸ’»

All the source code shown below is available in this GitHub repository.
Give it a star ⭐ on GitHub to support my work πŸ™

 

Step 1. Git clone the code

From the terminal

$ git clone https://github.com/Paulescu/text-embedding-evaluation.git

 

Step 2. Install Python dependencies

$ make install

 

Step 3. Setup external services

Create a .env file

$ cp .env.example .env

and paste your

  • OPENAI_API_KEY

  • QDRANT_URL and

  • QDRANT_API_KEY

 

OpenAI GPT-3.5 Turbo

You will need an OpenAI API key, because ragas, the RAG evaluation framework we use, makes calls to `GPT-3.5 Turbo` to evaluate the quality of the retrieved context.

 

Qdrant

We will use Qdrant as the Vector DB, so you also need to create a FREE account on Qdrant.cloud to get your QDRANT_URL and QDRANT_API_KEY.
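If you are curious how these credentials end up being used, the pattern is roughly this (a sketch, not the exact code from the repo):

import os

from dotenv import load_dotenv      # pip install python-dotenv
from qdrant_client import QdrantClient

load_dotenv()  # reads the .env file into environment variables

# ragas reads OPENAI_API_KEY from the environment by default,
# so only the Qdrant connection needs to be created explicitly
client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
)
print(client.get_collections())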

 

Step 4. Select the models and dataset you want to evaluate

Update the list of models you want to evaluate and the dataset in the config.yml

models:
  # 109 million parameters
  - sentence-transformers/all-mpnet-base-v2

  # 334 million parameters
  # - 'Snowflake/snowflake-arctic-embed-l'

  # 7.11 billion parameters
  # - 'Salesforce/SFR-Embedding-Mistral'

datasets:
  - explodinggradients/ragas-wikiqa
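For reference, reading a config like this takes a couple of lines with PyYAML. A minimal sketch (the loader in the repo may look slightly different):

import yaml  # pip install pyyaml
from sentence_transformers import SentenceTransformer

with open("config.yml") as f:
    config = yaml.safe_load(f)

for model_name in config["models"]:
    model = SentenceTransformer(model_name)
    print(model_name, "→ embedding dimension:", model.get_sentence_embedding_dimension())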

 

Step 5. Run the evaluation

From the command line

$ make run-evals

The Python script behind this command

  1. Loads the embedding model and the dataset from HuggingFace, with questions, contexts and answers.

  2. Embeds the contexts into the Vector DB, in our case Qdrant.

  3. For each question, retrieves the top K relevant documents from the Vector DB.

  4. Compares the information overlap between the retrieved documents and the correct answers, using context precision and context recall.

  5. Finally, logs the results, so you know what worked best.

    {
        "model_name": "sentence-transformers/all-mpnet-base-v2",
        "dataset_name": "explodinggradients/ragas-wikiqa",
        "top_k_to_retrieve": 2,
        "context_precision": 0.9999999999499998,
        "context_recall": 0.7666666666666666,
        "seconds_taken_to_embed": 4.0,
        "seconds_taken_to_retrieve": 0.0
    }
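To give you an idea of what steps 4 and 5 look like, here is a condensed sketch of a ragas evaluation. The column names follow the ragas documentation at the time of writing and may vary slightly with your ragas version; the actual script in the repo is more complete:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# Hypothetical retrieval output: for each question, the contexts retrieved
# from Qdrant plus the reference answer from the dataset
eval_data = {
    "question": ["Who wrote Hamlet?"],
    "contexts": [["Hamlet is a tragedy written by William Shakespeare."]],
    "ground_truth": ["William Shakespeare"],
}

# Uses GPT-3.5 Turbo under the hood, hence the OPENAI_API_KEY requirement
results = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[context_precision, context_recall],
)
print(results)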

 

By benchmarking different models you will understand where the sweet spot between quality and cost sits for your particular dataset and RAG setup.

 

Bonus 🎁 → Which Vector DB should I use?

If you want to build a demo RAG app, any Vector DB will do the job.

However, if you plan on building real-world ML products you need to be more careful with your choice.

My personal recommendation when it comes to Vector DBs is Qdrant.

Why?

Because

  • It is high-performance (thanks to Rust πŸ¦€), so you get the fastest and most accurate results at the cheapest cloud costs πŸ€‘

  • It is extremely easy to scale and upgrade πŸŽ›οΈ, and

  • It gives you the option to keep your data 100% private πŸ’Ύ, thanks to the new Qdrant Hybrid Cloud

Click to learn more

 

 

And before you leave…

Generative AI is very cool, but the reality is that most real world business problems are solved using tabular data and predictive ML models.

If you are interested in learning how to build end-2-end ML systems using tabular data and MLOps best practices, join the Real-World ML Tutorial + Community and get lifetime access to

→ 3 hours of video lectures 🎬
→ Full source code implementation πŸ‘¨‍πŸ’»
→ Discord private community, to connect with me and 350+ students πŸ‘¨‍πŸ‘©‍πŸ‘¦

 

🎁 Gift

Use this direct payment link in the next 5 days and get an exclusive 30% discount!

 

The Real World ML Newsletter

Every Saturday

For FREE

Join 19k+ ML engineers ↓