LLMs in the real world
Mar 17, 2025
Despite all of the hype and buzzwords, the process for building high-quality LLM apps is straightforward.
Let me show you with an example ↓
Example
Say you want to build an LLM app to extract crypto market signals from financial news.
This is something we will build end-to-end in 3 weeks, in the 4th cohort of Building a Real-Time ML System. Together.
The idea is simple. You pass a piece of news to your LLM app, for example
"FED to increase interest rates"
and you want your model to output a sentiment score:
- 0 = neutral
- 1 = positive
- -1 = negative
and (optionally) a reasoning behind the score, to add a layer of interpretability to the results:
```json
{
  "sentiment_score": -1,
  "reasoning": "The news about FED increasing interest rates is typically bearish for crypto markets for several reasons:\n1. Higher interest rates make borrowing more expensive, reducing liquidity in the market\n2. Higher rates make traditional yield-bearing investments more attractive compared to crypto\n3. Risk assets like cryptocurrencies tend to perform poorly in high interest rate environments\n4. Historically, crypto prices have shown negative correlation with interest rate hikes"
}
```
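If you want to enforce this structure in code, a schema validator helps. Here is a minimal sketch using Pydantic (my choice for illustration; the field names simply mirror the JSON above):

```python
from pydantic import BaseModel, Field  # Pydantic v2

class SentimentSignal(BaseModel):
    sentiment_score: int = Field(..., ge=-1, le=1)  # only -1, 0 or 1
    reasoning: str

# Reject malformed LLM responses before they reach downstream consumers.
raw = '{"sentiment_score": -1, "reasoning": "Rate hikes are typically bearish for crypto."}'
signal = SentimentSignal.model_validate_json(raw)
```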
The question is
How do you build an LLM app that excels at this task?
These are the steps (which are universal for ANY LLM app you want to build!):
1. Generate a high-quality instruction dataset.
2. Build a baseline model.
3. Evaluate the baseline model.
4. If you are happy with the results, you are done. Otherwise, come up with a better model and go back to step 3.
Let’s go through each step.
Step 1 → Generate a high-quality instruction dataset
An instruction dataset is a list of (input, output) examples, where:
- input → what you pass to the LLM. In our case, a piece of news.
- output → what you expect the LLM to generate. In our case, a JSON object with a sentiment score and a text explanation.
Attention 📣
This step is THE MOST CRITICAL step in the entire process of creating your LLM app, and you will spend most of your time here. In my experience, over 80% of the total development time is spent here.
Remember, there is no such thing as spending too much time generating good training data.
In our case, we need to find a collection of crypto news (the inputs) and map them to sentiment scores (the outputs).
You can always use a strong LLM (like Claude) to generate the (input, output) examples entirely on its own.
However, I recommend you follow a hybrid approach.
My recommendation 💡
- Fetch the crypto news from the same API you will use once you deploy the LLM and use it in production.
- Use a strong LLM to map each piece of news to a score. I like Claude, but feel free to use whatever strong LLM you like.
- Ask a human expert (often yourself) to manually check the sentiment scores generated by this LLM, and to filter and fix them when they are incorrect. The more time you spend here, the better your final LLM app will be.
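To make the second bullet concrete, here is a minimal sketch of the LLM-labelling step using the Anthropic Python SDK (the model name, prompt wording, and helper names are my assumptions, not a fixed recipe):

```python
import json
from anthropic import Anthropic  # pip install anthropic

client = Anthropic()  # assumes ANTHROPIC_API_KEY is set in your environment

def label_headline(headline: str) -> dict:
    """Draft one (input, output) example by asking a strong LLM for a label."""
    prompt = (
        "You are a crypto market analyst. Respond ONLY with a JSON object of the "
        'form {"sentiment_score": -1 | 0 | 1, "reasoning": "<short explanation>"}.\n\n'
        f"News: {headline}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: use whatever strong model you prefer
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"input": headline, "output": json.loads(response.content[0].text)}

# Draft labels for headlines fetched from your news API, then review them by hand.
draft_dataset = [label_headline(h) for h in ["FED to increase interest rates"]]
```

Every draft example then goes through the human review pass from the third bullet before it enters the final dataset.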
Once you have this instruction dataset, you can move to the next step.
Step 2 → Build a simple solution (aka baseline)
In my experience, a strong general-purpose LLM (like Anthropic's Claude) with a simple prompt gets you up and running.
If you want to see an actual Python implementation, check my previous article.
Step 3 → Evaluate your solution
Pick whatever evaluation metrics make sense for the task you want the LLM to solve, and compute them by comparing:
- the model outputs, vs
- the ground-truth outputs from your instruction dataset.
For example, if your LLM produces structured output (like in our example), you can compute 2 evaluation metrics:
- Is the output correctly formatted as JSON? → Type Error rate.
- Do the sentiment scores of the model and the ground truth match? → Mean Absolute Error.
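Here is a minimal sketch of how you could compute both metrics, assuming `predictions` holds the raw model outputs as strings and `ground_truth` holds the labels from your instruction dataset (all names are hypothetical):

```python
import json

def evaluate(predictions: list[str], ground_truth: list[dict]) -> dict:
    """Compare raw model outputs against the instruction dataset labels."""
    type_errors, abs_errors = 0, []
    for raw, truth in zip(predictions, ground_truth):
        try:
            parsed = json.loads(raw)
            abs_errors.append(abs(parsed["sentiment_score"] - truth["sentiment_score"]))
        except (json.JSONDecodeError, KeyError, TypeError):
            type_errors += 1  # output was not valid JSON with the expected fields
    return {
        "type_error_rate": type_errors / len(predictions),
        "mean_absolute_error": sum(abs_errors) / len(abs_errors) if abs_errors else None,
    }
```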
If the LLM generates unstructured text, you can use another LLM (aka an LLM judge) to compare the model-generated output with the ground-truth text.
Once you have the aggregate evaluation metric of your model against your dataset, you ask yourself this question:
Is this metric good enough for my use case?
If YES, you are done developing the model. You can now move on to the deployment step.
How can I deploy an LLM?
A very popular tool to deploy open-source LLMs is vLLM.
In the next cohort of Building a Real-Time ML System. Together we will go through the development-to-deployment process of an open-source LLM with vLLM.
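vLLM exposes an OpenAI-compatible server (started, for example, with its `vllm serve <model>` command), so you can query a deployed model with the standard OpenAI client. A minimal sketch, assuming a local server on the default port and an illustrative model name:

```python
from openai import OpenAI  # pip install openai

# vLLM's server speaks the OpenAI API, so the standard client works against it.
# Assumes you launched something like: vllm serve Qwen/Qwen2.5-7B-Instruct
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "FED to increase interest rates"}],
)
print(response.choices[0].message.content)
```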
Otherwise, you move to the next step.
Step 4 → Error analysis
Error analysis is all about understanding why your model (either classic ML or LLM) did not generate the right output.
Can you figure out WHY your model did not work well in that particular case?
For example:
- Does the model lack enough context to generate the correct answer? → A bit of RAG will help.
- Would external tools help the model perform better on those failed examples? → Equip your LLM with the tool.
- Did the model fail on tasks that required many intermediate reasoning steps? → A chain of LLMs or a chain-of-thought style prompt can help.
Here you will need to manually inspect examples where your model generated wrong outputs, and this takes time. Take it easy and don’t stress. This manual work has a value that compounds over time, as your understanding of the data and the problem you are solving grows and grows and grows… you get my point.
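A tiny helper makes this inspection less painful by collecting only the failures. A sketch, assuming the (input, output) dataset format from step 1 (names are hypothetical):

```python
def failing_examples(predictions: list[int], dataset: list[dict]) -> list[dict]:
    """Collect the examples where the model disagreed with the ground truth,
    so you can read through them one by one and look for patterns."""
    return [
        {"input": ex["input"], "expected": ex["output"]["sentiment_score"], "predicted": pred}
        for pred, ex in zip(predictions, dataset)
        if pred != ex["output"]["sentiment_score"]
    ]
```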
Never done error analysis?
Here is a short lesson by Andrew Ng that will help you.
In terms of tools for LLM error analysis, I like open-source Opik.
With the insights you collect at this stage, you can move to the next step.
Step 5 → Model improvement
There are several ways to improve your baseline, starting with prompt engineering.
Prompt engineering is all about tuning the exact text you send to the LLM, to improve your model evaluation metric (the one you compute in step 3).
And the thing is, LLMs are not humans. Meaning, small changes to the input prompt (that have the exact same meaning for us humans) can produce significantly different results when you ask an LLM.
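That is why it pays to treat prompts as something you measure rather than guess: sweep a few variants and keep the one that scores best on the metric from step 3. A rough sketch (the templates and the `generate` callable are placeholders for your own):

```python
def score_prompt(template: str, dataset: list[dict], generate) -> float:
    """Mean absolute error of one prompt template over the dataset.
    `generate` is any callable that sends a prompt to your LLM and returns an int score."""
    errors = [
        abs(generate(template.format(headline=ex["input"])) - ex["output"]["sentiment_score"])
        for ex in dataset
    ]
    return sum(errors) / len(errors)

PROMPT_VARIANTS = [
    "Score this crypto headline as -1, 0 or 1: {headline}",
    "You are a crypto analyst. Rate this headline (-1 bearish, 0 neutral, 1 bullish): {headline}",
]

# Pick the variant with the lowest error on a held-out slice of the instruction dataset:
# best = min(PROMPT_VARIANTS, key=lambda t: score_prompt(t, dataset, generate))
```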
My advice
The best Python library I know to automatically optimize your prompts is AdalFlow. I also think that automatic prompt engineering is an important problem with (surprisingly) not enough good open-source tooling. So, if you know something better, please share it in the comments below.
Another very popular technique is supervised fine-tuning. The idea here is to:
- take a not-so-large base LLM (for example, up to 7 billion parameters), and
- use a parameter-efficient fine-tuning algorithm (like LoRA)

to minimize your evaluation metric.
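A minimal sketch of what the LoRA setup might look like with Hugging Face's peft library (the base model and hyperparameters are illustrative, not recommendations):

```python
from peft import LoraConfig, get_peft_model  # pip install peft transformers
from transformers import AutoModelForCausalLM

# Assumption: any small open-weights base model works here; this one is illustrative.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attach adapters to the attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights
# From here, train on your instruction dataset with your usual trainer (e.g., TRL's SFTTrainer).
```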
If none of these help, it probably means your model lacks enough context in the prompt. And that means you need to move to a more complex LLM app, like a basic RAG system.
But this is something we will cover another day.