
How to deploy a real-world LLM app

Dec 05, 2023

Large Language Models, like any other ML model, have zero business value until you deploy them and make their outputs accessible to end users or other downstream services.

And the thing is, deploying LLMs is especially challenging due to their size and hardware requirements.

Today, I want to share with you the recipe to move your LLM prototypes to production.

Let’s start!

 

Step-by-step deployment

 

These are the steps to move the LLM artifact that your fine-tuning pipeline generated into a production LLM-based REST API:

 

1️⃣ Push the LLM artifact to the model registry

 

The model registry is the place where you log your model artifact, so the REST API can load it once and re-use it across requests.
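As a quick illustration, here is a minimal sketch of logging and registering the artifact with CometML (the registry used in the example later on). The workspace, project, and model names are placeholders, and the exact SDK calls may differ slightly depending on your comet_ml version:

```python
from comet_ml import Experiment

# Placeholders: set COMET_API_KEY in your environment and use your own workspace/project.
experiment = Experiment(workspace="your-workspace", project_name="hands-on-llm")

# Log the fine-tuned artifact (e.g. a folder with weights + tokenizer files),
# then register it so the inference pipeline can pull it by name later.
experiment.log_model("llm-finetuned", "./model_artifact")
experiment.register_model("llm-finetuned")

experiment.end()
```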



2️⃣ Wrap your LLM into a chain

 

Your LLM alone is not enough to perform the full mapping from user input to system response. For example, in a Retrieval Augmented Generation (aka RAG) setup, your REST API needs to:

  • Fetch relevant context from your Vector DB.

  • Generate an enriched prompt that contains the original user request plus the context you retrieved, and then

  • Pass this as input to your LLM

Moreover, you often want to perform post-processing steps to guarantee your final output conforms to some predefined format, e.g. JSON.

Standard libraries to build such a chain are LangChain and LlamaIndex.
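To make the flow concrete, here is a minimal, library-agnostic sketch of such a chain. `vector_db` and `llm` are placeholders for your own clients (e.g. a Qdrant client and your fine-tuned model); in practice, LangChain or LlamaIndex would handle most of this wiring for you:

```python
import json

def rag_chain(user_query: str, vector_db, llm) -> dict:
    """Minimal RAG chain: retrieve -> build prompt -> generate -> post-process.
    vector_db and llm are placeholders for your own clients."""
    # 1. Fetch relevant context from the vector DB.
    context_docs = vector_db.search(user_query, top_k=3)
    context = "\n".join(doc["text"] for doc in context_docs)

    # 2. Build an enriched prompt: original request + retrieved context.
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n"
        "Respond in JSON with keys 'answer' and 'sources'."
    )

    # 3. Pass the enriched prompt to the LLM.
    raw_output = llm.generate(prompt)

    # 4. Post-process: enforce the predefined output format (JSON).
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        return {"answer": raw_output, "sources": []}
```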

3️⃣ Deploy your REST API to a GPU-enabled computing platform

 

To run your LLM at inference time, you NEED AT LEAST ONE GPU. These models are heavy, and they are prohibitively slow for real-world applications if you do not parallelize computations with the help of a GPU.

Once the app is up and running, it can:

  • Listen to incoming user requests

  • Process them through the LLM chain, and

  • Send back responses.
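For instance, a minimal FastAPI sketch of such an endpoint could look like this. `load_llm_from_registry`, `rag_chain`, and `vector_db` are hypothetical helpers standing in for your own model-loading code and the chain from step 2️⃣, not the course's exact deployment code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

# Hypothetical helpers: your own code for pulling the registered model and running the chain.
from inference_pipeline import load_llm_from_registry, rag_chain, vector_db

app = FastAPI()
llm = load_llm_from_registry("llm-finetuned")  # loaded once at startup, re-used across requests

class Query(BaseModel):
    question: str

@app.post("/generate")
def generate(query: Query) -> dict:
    # Listen to the incoming request, process it through the LLM chain, send back the response.
    return rag_chain(query.question, vector_db, llm)
```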

Tip ✨ 

I recommend you log every incoming request, the chain inputs and outputs, and any other relevant metadata, for example, latency and cost if you call an external API like OpenAI. This way you can track the performance of your app and optimize it.
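As a rough sketch, logging a single request/response pair with CometLLM could look like this. The metadata keys are just examples, and you should double-check the exact `log_prompt` parameters against the comet_llm docs:

```python
import json
import time

import comet_llm  # CometLLM as the prompt-logging tool

def log_llm_call(question: str, response: dict, start_time: float, cost_usd: float = 0.0) -> None:
    """Log one request/response pair plus latency and cost metadata."""
    comet_llm.log_prompt(
        prompt=question,
        output=json.dumps(response),
        duration=time.time() - start_time,
        metadata={"cost_usd": cost_usd},  # fill in the real cost if you call a paid API like OpenAI
    )
```

Call it right after the chain returns, so every request ends up with its latency and cost attached.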

 

Here is an example 🎬

In the latest lecture of the Hands-on LLM Course that Paul Iusztin, Alexandru Razvan, and I have open-sourced, you will find a full source-code implementation of an inference pipeline, using a 100% serverless stack:

  • CometML as the model registry and CometLLM as the prompt-logging tool,

  • Qdrant as the VectorDB, and

  • Beam as the compute engine where we deploy the REST API.

Click below to watch the lecture 🎬 ↓↓↓

Wanna support this FREE course? → Give it a star on GitHub

 

Wanna access more free hands-on lectures like this?

👉🏽 Subscribe to the Real-World ML YouTube channel for more hands-on FREE tutorials like this
