Are you wondering when you should use retrieval augmented generation (RAG) to enhance your large language models (LLMs)? Or maybe you want to understand when RAG should be avoided? Well then, you are in the right place! In this article, we tell you everything you need to know about when to use RAG in conjunction with LLMs.
We start out by talking about what RAG is and how it works. After that, we describe the type of data that is required to make RAG work. Next, we discuss some of the main advantages and disadvantages of RAG, which provides useful context for the later discussion. Finally, we provide examples of situations where it is and is not a good idea to use RAG.
What is retrieval augmented generation?
What is RAG? RAG is a method that can be used to provide a model with additional context that it was not exposed to during training. The main idea here is that you can curate a library of documents that contain useful information that a model needs to know in order to respond to prompts appropriately. As user prompts come in, you can look up documents from that library that are relevant to a given prompt and inject text snippets from those documents into the prompt that you send over to the model.
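To make this concrete, here is a minimal sketch of the prompt-injection step in Python. It assumes the relevant snippets have already been retrieved (retrieval itself is covered in the next section), and the commented-out `call_llm` is a hypothetical stand-in for whatever model API you use.

```python
# Minimal sketch of injecting retrieved snippets into a prompt.
def build_augmented_prompt(user_prompt: str, snippets: list[str]) -> str:
    context = "\n\n".join(snippets)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_prompt}"
    )

# Example snippets that a retrieval step might have pulled from the library.
snippets = [
    "To reset a password, go to Settings > Account > Reset password.",
    "New accounts must verify their email address before logging in.",
]
prompt = build_augmented_prompt("How do I reset my password?", snippets)
# response = call_llm(prompt)  # hypothetical model call
print(prompt)
```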
What data is needed for retrieval augmented generation?
What type of data is required in order to use retrieval augmented generation (RAG)? In order to use RAG, you need to put together a collection of text-based data that contains all of the context the model would need to respond to reasonable user prompts. For example, if you are teaching a model how to assist users in onboarding onto a particular product, then you might assemble a collection of text-based help documents that your support team or sales team would usually use to get a user up and running on the product.
Once you assemble a corpus of text-based documents that you want your model to have access to, the next step is to create embeddings of these documents. Embeddings are simply compact numeric representations of the text-based documents. Converting your documents into this format is important because texts with similar meanings map to nearby vectors, so you can measure how similar two documents or pieces of text are simply by calculating the distance between their embedding vectors. This makes it easy to find bits of context that are similar to the prompt that the user typed in.
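As a sketch of that distance calculation, here is how you might compare embedding vectors using cosine similarity with numpy. The three-dimensional vectors below are made up for illustration; real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity is close to 1.0 for vectors pointing the same
    # way (similar texts) and near 0.0 for unrelated ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up, low-dimensional embeddings for illustration only.
doc_a = np.array([0.9, 0.1, 0.3])
doc_b = np.array([0.8, 0.2, 0.4])
doc_c = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(doc_a, doc_b))  # high: likely similar content
print(cosine_similarity(doc_a, doc_c))  # lower: likely different content
```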
Once you have converted all of your documents into numeric embeddings, you need to store these embeddings in a secure database that your application can query, commonly called a vector database. When a prompt comes in, your application converts it into an embedding and searches the database for similar bits of context. Once those bits of context have been found, they can be fed into the prompt to inform the model’s answer.
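Putting the pieces together, the sketch below shows the query-time lookup. A plain in-memory numpy array stands in for a real vector database, and the `embed` function is a deliberately crude placeholder (a character-count hash) used only so the example runs; in practice you would call a real embedding model.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Crude placeholder for a real embedding model: hash characters
    # into a fixed-size vector and normalize it. For illustration only.
    vec = np.zeros(64)
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "To reset a password, go to Settings > Account > Reset password.",
    "New accounts must verify their email address before logging in.",
    "Billing runs on the first day of each month.",
]

# Index the corpus: one embedding per document, kept in memory here.
# A real system would store these in a vector database.
doc_embeddings = np.stack([embed(d) for d in documents])

def retrieve(prompt: str, top_k: int = 2) -> list[str]:
    # Embed the prompt and rank documents by similarity. Because the
    # embeddings are unit-normalized, a dot product equals cosine similarity.
    scores = doc_embeddings @ embed(prompt)
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

print(retrieve("How do I reset my password?"))
```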
Advantages and disadvantages of retrieval augmented generation
In this section, we will discuss some of the main advantages and disadvantages of retrieval augmented generation. This will help to inform future conversations around when to use this method.
Advantages of retrieval augmented generation
What are some of the main advantages of RAG? In this section, we focus on the advantages that make RAG stand apart from other techniques that can be used to improve LLMs.
- Provides domain-specific context. One of the main advantages of RAG is that it can be used to provide domain-specific context to a model. That gives the model an avenue to answer questions on topics that were not covered in its training data. If you need to provide a model with different bits and pieces of context to respond to different prompts, then this is a great strategy to employ.
- Improves predictive performance. Another advantage of RAG is that it improves predictive performance and reduces the likelihood that the model will hallucinate. Feeding the model concrete information about the topic of interest grounds the model’s response and encourages it to craft an answer based on that information.
- The model itself is no more expensive. Since RAG queries use base foundation models rather than models that have been fine-tuned or directly altered by the user, using a model with RAG applied costs no more per call. There may be incremental cost associated with the additional tokens that are fed to the model, but the call to the model itself will not be up-charged because you are still using a base model. This is mostly a concern when you are using third-party vendors, such as OpenAI, that charge a premium for serving fine-tuned models.
- Requires less labor than fine-tuning LLMs. In general, it is faster to get an initial version of RAG working than it is to get an initial version of a fine-tuned model working. That being said, RAG does require some data encoding and storage decisions to be made, so it is not as straightforward as prompt engineering, which only modifies the text of the prompt.
- Does not require large computational resources. Another advantage of RAG is that it does not require large computational resources, because no model training is involved. This matters most if you are comparing this approach to an alternative that requires fine-tuning or training a model on a large number of records on your own infrastructure.
- May be used with state-of-the-art models. RAG is a technique that is applied before the prompt ever hits the model, which means it can be used with virtually any model. That being said, some modeling libraries offer natively supported implementations that are only available for certain models.
- Does not remove guardrails that are built into base foundation models. Another advantage of this approach is that since you are not fine-tuning any models, it does not remove the guardrails that are built into base models. Some vendors put guardrails on their models to ensure that they behave in a reasonable way, and these guardrails may be stripped away when a model is fine-tuned.
Disadvantages of retrieval augmented generation
What are some of the main disadvantages of retrieval augmented generation? In this section, we will describe some of the main disadvantages associated with retrieval augmented generation.
- May add additional latency. One of the main disadvantages of RAG is that it may add additional latency to your system. When you introduce RAG, you are introducing an additional step that needs to be taken before your model can produce a final result. This is always going to add some latency. In this particular case, you are also adding a step that involves a similarity search that has the potential to be slow.
- May introduce privacy concerns. You have to be careful when curating the dataset that will be used to provide your model with context. If your dataset contains sensitive information such as personally identifiable information (PII), then you may run into a situation where the model exposes that sensitive information to another user. This risk can be mitigated by scrubbing personal or sensitive data from your documents before indexing them.
- Need to curate data for retrieval. In general, it does take some time to curate the datasets that you want to use for RAG. There are also decisions that need to be made, such as which database you want to use and which embedding model you want to use to create your embeddings. That means it will take a little bit of time to get set up.
- Introduces additional tokens into your prompts. Since RAG injects additional information into your prompt, it increases the amount of data that you push to the model. If you are using a vendor that charges based on the number of tokens fed to a model, then this can increase your costs.
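As a rough, back-of-the-envelope illustration of that last point, the added cost scales linearly with the injected context. All of the numbers below, including the per-token price, are hypothetical and should be replaced with your vendor's actual rates:

```python
# Hypothetical numbers for illustration only; check your vendor's pricing.
extra_tokens_per_request = 1_500      # injected context snippets
price_per_1k_input_tokens = 0.0005    # dollars; hypothetical rate
requests_per_day = 10_000

daily_cost = (extra_tokens_per_request / 1_000) * price_per_1k_input_tokens * requests_per_day
print(f"Added input-token cost: ${daily_cost:.2f} per day")  # $7.50 per day
```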
When to use retrieval augmented generation
When is it a good idea to use RAG? In this section, we will discuss situations where it is a good idea to use RAG.
- When your model needs more context to effectively respond to prompts. In general, RAG is one of the first techniques you should reach for when your model needs additional context (that was not contained in its training data) in order to respond to prompts effectively. It is one of the most effective methods for providing the model with small chunks of context that are relevant to user prompts and grounding its responses.
- When you want to reduce hallucinations. RAG can also be a good option to look into when you are looking to reduce hallucinations in your model. The more specific, concrete information that is provided to your model as it formulates a response, the less likely the model is to hallucinate. By introducing RAG into your system, you can ensure that your model is provided with relevant context in each prompt.
When not to use retrieval augmented generation
When should you avoid using RAG? In this section, we will discuss situations where you should avoid using RAG.
- When your model needs a small bit of context that can be shared across all prompts. When your model just needs a small bit of context that can be shared across all prompts, then you may not need to implement a full RAG system. In these cases, you may be able to get away with simply including that context in every prompt you send to your model, as shown in the short sketch after this list.
- When your model needs a huge amount of context. Conversely, if your model needs a huge amount of context to respond to individual prompts effectively, then RAG may not be as useful. RAG is generally designed for use cases where there are bite-sized chunks of context that can be fed to the model to help it formulate a response. If the model needs an extensive amount of context, you may need to fine-tune a model or train a model from scratch.
- When you cannot afford any additional latency. Introducing RAG into your system generally adds some latency, because there is extra time required to look up the information the model needs and feed it to the model. If you are in a situation where you cannot afford to introduce any additional latency into your system, then RAG may not be the technique for you.
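As promised above, here is a minimal sketch of the shared-context alternative from the first point in this list: a fixed block of context is prepended to every prompt, with no retrieval step at all. The product details and the commented-out `call_llm` are hypothetical placeholders.

```python
# A fixed block of context shared across all prompts; no retrieval needed.
SHARED_CONTEXT = (
    "You are a support assistant for AcmeApp, a time-tracking tool. "
    "Support hours are 9am to 5pm ET, Monday through Friday."
)

def build_prompt(user_prompt: str) -> str:
    return f"{SHARED_CONTEXT}\n\nQuestion: {user_prompt}"

prompt = build_prompt("When can I reach support?")
# response = call_llm(prompt)  # hypothetical model call
print(prompt)
```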
Related articles
- How to improve LLM performance
- When to fine tune an LLM
- When to use prompt chaining for LLMs
- When to use basic prompt engineering for LLMs
- When to use few shot learning for LLMs
- When to use function calling for LLMs