Are you wondering what techniques you can employ to improve the performance of a large language model (LLM)? Or maybe you are wondering what tradeoffs to keep in mind along the way? Either way, you are in the right place! In this article, we tell you everything you need to know to get started with optimizing LLM performance.
We start by discussing some of the tradeoffs to consider when optimizing the performance of an LLM, so that you know which factors to weigh as you decide how to improve your model. After that, we discuss specific techniques that can be used to improve the performance of a large language model, along with a brief summary of the tradeoffs associated with each one.
Tradeoffs to consider when optimizing large language models
Before we discuss techniques you can use to improve the LLMs you are working with, we will first cover some of the tradeoffs to consider when optimizing these systems. Specifically, we will look at the factors you should keep an eye on when you decide to move from one model or approach to another.
- Predictive performance. The most obvious factor to keep in mind when optimizing a system that uses LLMs is predictive performance. By predictive performance, we broadly mean any measure of how well your model responds to a prompt. There are many ways to measure this, and different measurements suit different types of problems. In some cases, there is one specific expected answer, and you can use classic performance metrics (such as accuracy, precision, and recall) that capture how often the model produced the answer you were looking for. In other cases, many answers are reasonable, and you will have to get more creative about measuring performance. For example, you might look at the percentage of cases where a human rater preferred the answer generated by the model to an answer written by a human.
- Latency. The next factor to keep in mind when optimizing a system that uses LLMs is latency, or the amount of time it takes for the system to return an answer to the user. Your exact latency requirements will vary with your use case. If you are building a system that blocks a user from executing an important task, low latency will likely be critical. If your system provides supplementary guidance without blocking the user from proceeding, you may have more breathing room.
- Level of effort. Another important factor to keep in mind when optimizing a system that utilizes LLMs is the level of effort required to implement different strategies. Some optimization techniques require far more effort than others. If you need to deliver results quickly, default to lower-effort techniques. If you have time to research and experiment, more techniques will be available to you.
- Cost. The next concern to keep in mind is the monetary cost the system will incur. This is a particularly large concern if you are using models hosted by third-party vendors like OpenAI rather than models you host on your own infrastructure. When you use a third-party model, it is important to understand the cost structure and optimize for it. For example, if a model charges based on the number of tokens you feed into it, you may want to favor techniques that reduce the number of input tokens over techniques that increase it.
- Reliability. Another factor to keep in mind when designing systems that use LLMs is the reliability of your overall system. Some LLM optimization techniques can affect the resiliency of your systems. For example, if you are integrating LLM outputs with other systems, it may be important to ensure that the output is consistently formatted in the same way. There are techniques that improve the consistency of your output format, which in turn improves the reliability of your system.
- Context window. The context window is the number of tokens that can be used to guide model behavior. When you work with LLMs, there are usually strict limits on the number of tokens that can be included in any given prompt. Some models can draw on information from previous prompts, but there is still typically a limit on how much information a model can access. Some common performance-improvement techniques require adding many tokens to your prompts, which eats into your context window and leaves fewer tokens available for other purposes. Keep this token budget in mind when deciding which techniques to use.
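To make the context-window tradeoff concrete, here is a minimal sketch of fitting extra material into a fixed token budget. The whitespace-based token count is a naive stand-in for a real tokenizer (production models use subword tokenizers, so actual counts will differ), and the budget value in the usage example is an arbitrary assumption for illustration.

```python
def estimate_tokens(text: str) -> int:
    # Naive stand-in for a real tokenizer: real LLMs use subword
    # tokenization, so actual token counts will differ from this.
    return len(text.split())

def fit_context(prompt: str, documents: list[str], budget: int) -> list[str]:
    """Greedily keep extra documents until the token budget is spent.

    The prompt itself is always included; each candidate document is
    added only if it still fits within the remaining budget.
    """
    remaining = budget - estimate_tokens(prompt)
    kept = []
    for doc in documents:
        cost = estimate_tokens(doc)
        if cost <= remaining:
            kept.append(doc)
            remaining -= cost
    return kept

# With a budget of 5 tokens, only the first document fits alongside the prompt.
kept = fit_context("Summarize:", ["a b c", "d e f g"], budget=5)
```

The same budgeting idea applies whether the extra tokens come from worked examples, retrieved documents, or conversation history.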
Techniques for improving large language models
In this section, we will discuss some of the most popular techniques for improving the performance of large language models. These techniques can be used to improve many different facets of LLM performance, from predictive performance to latency and cost.
Note that we list these techniques roughly in order of the level of effort required to implement them, with the faster, easier techniques toward the top of the list. For each technique we discuss, we provide a short summary of the types of situations it is useful for. If you want to learn more about when to use a certain technique, follow the link provided for that technique.
- Basic prompt engineering. Basic prompt engineering involves modifying the text you use to prompt the model so that it carries more information about how the model should respond and what a good response looks like. This easy-to-implement tactic should be your first line of attack when you want to improve the predictive performance of an LLM. Few-shot learning, where you include a handful of worked examples in the prompt, is one example of this type of technique.
- Prompt chaining. Prompt chaining is often used to help an LLM break a big, ambiguous problem down into a series of small, concrete problems that are easier to solve. The model then solves these small problems one at a time rather than tackling the large, difficult problem all at once. This generally improves predictive performance, but it may also increase latency and cost because it requires multiple calls to the model.
- Function calling. Function calling can be used to extract structured data from unstructured text by having the model fill in a predefined schema. This is a good technique to reach for if you need to ensure that your output is consistently formatted in the same way to improve the reliability of your systems.
- Retrieval augmented generation. Retrieval augmented generation (RAG) makes it easy to take a specific prompt, identify relevant pieces of context related to it, and feed that information to the model. This technique can also reduce hallucinations by directing the model toward relevant content it should consider when creating a response. It is a good option when you need to provide an LLM with domain-specific context it would not have been exposed to during training. While it can improve predictive performance, it also has the potential to introduce more latency into your system.
- Fine tuning. Fine tuning is a more advanced technique that involves further training a model on data you have collected. It is useful when you need to teach a model how to perform a very specific task. It can also improve operational metrics like latency and cost, for example by letting you achieve the same quality with shorter prompts. That said, it takes more effort than the other techniques on this list.
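To illustrate the basic prompt engineering entry above, here is a minimal sketch of assembling a few-shot prompt. The `Input:`/`Output:` template, the sentiment-classification task, and the example reviews are all hypothetical; real prompt formats vary by model and use case.

```python
def build_few_shot_prompt(
    instruction: str, examples: list[tuple[str, str]], query: str
) -> str:
    """Assemble a few-shot prompt: an instruction, worked examples, then the new input."""
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # End with the new input and a trailing "Output:" for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("I loved this product!", "positive"),
        ("Broke after two days.", "negative"),
    ],
    "Shipping was fast and the quality is great.",
)
```

Note that every worked example you add consumes context-window tokens and adds to per-request cost, which is exactly the tradeoff discussed earlier.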
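The prompt chaining entry above can also be sketched in a few lines. The `call_llm` function here is a stub standing in for a real model API call (you would replace it with your provider's client), and the three-step decomposition is a hypothetical example; the point is the shape of the loop, where each answer is fed into the next sub-prompt.

```python
def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call. In a real system this would
    # be an HTTP request to a hosted LLM or a local inference call.
    return f"<response to: {prompt[:30]}>"

def chain(steps: list[str], question: str) -> str:
    """Run a sequence of sub-prompts, feeding each answer into the next step.

    Each step is one model call, so latency and cost grow with the number
    of steps, which is the tradeoff noted in the list above.
    """
    context = question
    for step in steps:
        context = call_llm(f"{step}\n\n{context}")
    return context

answer = chain(
    [
        "List the key facts needed to answer the question.",
        "Using the facts above, draft a short answer.",
        "Check the draft for errors and produce a final answer.",
    ],
    "Why is the sky blue?",
)
```

A three-step chain like this makes three model calls where a single prompt would make one, so it is worth measuring whether the quality gain justifies the extra latency and cost.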
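Finally, here is a toy sketch of the retrieval augmented generation entry above. Production RAG systems typically rank documents by embedding similarity over a vector store; the word-overlap scoring used here is just an easy-to-inspect stand-in, and the policy documents and prompt template are hypothetical.

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query and return the top k.

    Word overlap is a stand-in for the embedding similarity search that
    real RAG systems typically use.
    """
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    # Feed only the most relevant context forward to the model, which
    # saves context-window tokens and steers the model away from
    # hallucinating unsupported answers.
    context = "\n".join(retrieve(query, documents, k=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The warranty covers manufacturing defects for one year.",
]
prompt = build_rag_prompt("How many days do I have to return a purchase?", docs)
```

The retrieval step runs before every model call, which is where RAG's extra latency comes from: you pay for a search on each request in exchange for grounding the model in relevant context.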