How to test LLM apps in the real world
Nov 13, 2023

Today I want to show you, step by step, how to test a real-world LLM app using an open-source Python library called Giskard.
Let’s get started!
The problem 🤔
Large Language Models, like any other ML model, are bound to make mistakes, no matter how good they are.
Typical mistakes include:
- hallucinations
- misinformation
- harmfulness, or
- disclosures of sensitive information
And the thing is, these mistakes are no big deal when you are building a demo.
However, these same mistakes are a deal breaker when you build a production-ready LLM app that real customers will interact with ❗
Moreover, it is not enough to test for all these things once. Real-world LLM apps, as opposed to demos, are iteratively improved over time. So you need to automatically run all these tests every time you push a new version of your model source code to your repo.
So the question is:
Is there an automatic way to test an LLM app for hallucinations, misinformation or harmfulness before releasing it to the public?
And the answer is … YES!
Say hello to Giskard 👋🐢
Giskard is an open-source testing library for LLMs and traditional ML models.
Giskard provides a scan functionality that is designed to automatically detect a variety of risks associated with your LLMs.
Let me show you how to combine:
- the LLM-testing capabilities of Giskard, with
- CI/CD best practices
to build an automatic testing workflow for your LLM app.
The solution 🧠 = GitHub action + Giskard
👉 You can find all the source code I show in this section in this GitHub repo
Create a custom GitHub action in your repo that runs every time you open a pull request to merge new code into the master branch.
```yaml
on:
  pull_request:
    paths:
      - 'hyper-parameters.yaml'
```
This action needs to build the model, test it, and export the test results somewhere everyone can access, including the PR discussion:
```yaml
steps:
  - name: Build model 🏗️
    run: make model
  - name: Test model 🐢
    run: make scan
  - name: PR comment 📁
    ...
```
Let’s go step-by-step.
Step 1. Build your model artifact 🏗️
You typically have a file/function in your repository that builds your model.
In my example, `src/build_model.py` builds a RAG-based application with LangChain that can answer questions about climate change, using:
- OpenAI gpt-3.5-turbo-instruct,
- the IPCC climate change report, and
- a bit of prompt engineering.
And this is an example input-output you will get from this model:
Question:
Is sea level rise avoidable? When will it stop?
Answer:
Sea level rise is not avoidable, as it will continue for millennia. However, the rate and amount of sea level rise depends on future emissions. It is important to reduce greenhouse gas emissions to slow down the rate of sea level rise. It is not possible to predict when sea level rise will stop, but it can be managed through long-term planning and implementation of adaptation actions.
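To make this more concrete, here is a minimal sketch of what a build script like `src/build_model.py` could look like. It is illustrative only: I am assuming a FAISS index over the IPCC report and LangChain's `RetrievalQA` chain, and the file paths and function names are placeholders rather than the exact source from the repo.

```python
# Illustrative sketch of a build script like src/build_model.py (not the exact
# source). Assumes the IPCC report is available locally and OPENAI_API_KEY is set.
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

def build_model() -> RetrievalQA:
    # 1. Load and chunk the IPCC climate change report
    docs = PyPDFLoader("data/ipcc_report.pdf").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ).split_documents(docs)

    # 2. Embed the chunks and index them in a vector store
    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    vector_store.save_local("artifacts/vector_store")  # persisted for later steps

    # 3. Build a retrieval-augmented QA chain on top of gpt-3.5-turbo-instruct
    llm = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0)
    return RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())

if __name__ == "__main__":
    chain = build_model()
    print(chain.run("Is sea level rise avoidable? When will it stop?"))
```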
This build step is the first key step inside our GitHub action, which is triggered every time you open a pull request, for example when you update any of the model's hyper-parameters, like the prompt template.
```yaml
on:
  pull_request:
    paths:
      - 'hyper-parameters.yaml'

...

steps:
  - name: Build model
    run: make model
```
The output from this step is a model artifact that you persist on disk, so the next step (testing) can pick it up.
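For a chain like the one sketched above, the cheapest thing to persist is usually the vector index; the chain itself can be rebuilt in seconds. Here is a hedged sketch of how the build and test steps could hand the artifact over. The paths are placeholders, and `load_your_model_artifact` is the same placeholder name used in the Giskard snippets below.

```python
# One possible hand-over between the build and test steps, assuming the
# FAISS-based sketch above. Paths and names are illustrative.
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

ARTIFACT_DIR = "artifacts/vector_store"  # written by the build step

def load_your_model_artifact() -> RetrievalQA:
    # Reload the persisted index and rebuild the QA chain around it
    vector_store = FAISS.load_local(ARTIFACT_DIR, OpenAIEmbeddings())
    llm = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0)
    return RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())
```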
Step 2. Scan your LLM with Giskard 🐢
In this step we load the model artifact we generated in the previous step and run a suite of LLM-specific tests (e.g. hallucinations, prompt injection…) using Giskard.
Detecting vulnerabilities of your LLM using Giskard is as easy as:
- Wrapping your model artifact as a Giskard `Model`:

```python
from giskard import Model

giskard_model = Model(load_your_model_artifact())
```

- Wrapping a base dataset with examples as a Giskard `Dataset`:

```python
from giskard import Dataset

giskard_dataset = Dataset(load_your_dataset_examples())
```

- Running a full (or partial) scan of your `Model` on the reference `Dataset`:

```python
from giskard import scan

# full scan
report = scan(giskard_model, giskard_dataset)

# only hallucinations
report = scan(giskard_model, giskard_dataset, only="hallucination")
```
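In practice, wrapping a LangChain-based LLM app takes a bit more metadata than these one-liners suggest. Here is a hedged sketch following the pattern I understand from Giskard's LLM-scan docs; the prediction function, the `question` column and the names and descriptions are my own assumptions, not the exact code from the repo.

```python
# Hedged sketch: wrapping the RAG chain for an LLM scan. Assumes Giskard's
# "text_generation" model type; names, descriptions and columns are illustrative.
import pandas as pd
from giskard import Dataset, Model, scan

chain = load_your_model_artifact()  # the QA chain from the build step

def predict(df: pd.DataFrame) -> list[str]:
    # Giskard calls this with a DataFrame; we answer one question per row
    return [chain.run(question) for question in df["question"]]

giskard_model = Model(
    model=predict,
    model_type="text_generation",
    name="Climate change QA",
    description="Answers questions about climate change using the IPCC report",
    feature_names=["question"],
)

giskard_dataset = Dataset(
    pd.DataFrame({"question": [
        "Is sea level rise avoidable? When will it stop?",
        "What are the main drivers of global warming?",
    ]}),
    target=None,
)

report = scan(giskard_model, giskard_dataset)
```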
Tip ✨
Under the hood, Giskard uses LLM-assisted detectors, which rely on another LLM (GPT-4, at the moment) to probe the model under analysis.
Running a full scan can be time consuming and expensive, due to the large number of API calls to OpenAI. This is why I recommend you focus on the most relevant tests for your app to start with, and then expand as you want to increase robustness.
The report generated by the Giskard scan looks like this:
This report can be stored as an `html` or `markdown` artifact that you can then
- embed in your Pull Request discussion (see next step), or
- publish to a service like Giskard Hub, which allows collaboration between members of your ML engineering team.
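How you export the report depends on your setup. As a rough sketch, the scan report object exposes exporters along these lines; I am assuming the `to_html`/`to_markdown` methods described in Giskard's docs, and the file names are arbitrary.

```python
from giskard import scan

# Hedged sketch: export the scan results so the workflow can attach them to the
# PR. Exporter names follow Giskard's scan-report API as far as I know.
report = scan(giskard_model, giskard_dataset)

report.to_html("scan_report.html")    # standalone page, handy as a build artifact
report.to_markdown("scan_report.md")  # easy to drop into the PR discussion
```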
Step 3. Share test results for PR approval
I recommend you start your LLM testing journey by having a human reviewer.
The idea is simple. You embed the scan results in your PR discussion (see the sketch after this list), so the PR reviewer, either yourself or a senior colleague, can take a look at them and decide whether:
✅ they are good enough, so the PR is approved and the code merged to master.
❌ not good enough, in which case the PR is not approved, and the developer either keeps working on the feature branch or moves on to another problem.
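There are off-the-shelf GitHub Actions that post PR comments, but to make the mechanics concrete, here is one hedged way to do it from inside the workflow with a small Python script. `GITHUB_REPOSITORY` and `GITHUB_EVENT_PATH` are standard variables that GitHub Actions injects into every job; `GITHUB_TOKEN` has to be passed to the step from `secrets.GITHUB_TOKEN`. The report file name is the placeholder from the previous step, and the script itself is illustrative, not taken from the repo.

```python
# Illustrative helper (not from the repo): post the Giskard scan report as a
# comment on the pull request via GitHub's REST API.
import json
import os

import requests

def post_scan_report(report_path: str = "scan_report.md") -> None:
    with open(report_path) as f:
        body = f.read()

    # The event payload of a pull_request-triggered run contains the PR number
    with open(os.environ["GITHUB_EVENT_PATH"]) as f:
        pr_number = json.load(f)["pull_request"]["number"]

    repo = os.environ["GITHUB_REPOSITORY"]  # e.g. "your-org/your-repo"
    response = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        # GITHUB_TOKEN must be exposed to this step via `env:` in the workflow
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"body": body},
    )
    response.raise_for_status()

if __name__ == "__main__":
    post_scan_report()
```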
Once the code gets merged to master, you can set up another GitHub action (aka continuous deployment) that automatically deploys the new model to your production environment.
In a nutshell 🥜
This is what the whole process of automatic LLM testing with Giskard looks like:
Now it is your turn 🫵
It is time to go beyond LLM demos and build real-world, TESTED LLM apps.
So I suggest you:
- Give a star ⭐ to Giskard on GitHub to support the open-source project,
- `pip install giskard`, and
- Start building your next LLM-powered app 🚀