AWS’ new theory on designing an automated RAG evaluation mechanism could not only ease the development of generative AI-based applications but also help enterprises reduce spending on compute infrastructure.
RAG, or retrieval-augmented generation, is one of several techniques used to address hallucinations, which are arbitrary or nonsensical responses generated by large language models (LLMs) when they grow in complexity.
RAG grounds the LLM by feeding the model facts from an external knowledge source or repository to improve the response to a particular query.
There are other ways to handle hallucinations, such as fine-tuning and prompt engineering, but Forrester’s principal analyst Charlie Dai pointed out that RAG has become a critical approach for enterprises to reduce hallucinations in LLMs and drive business outcomes from generative AI.
However, Dai pointed out that RAG pipelines require a range of building blocks and substantial engineering practices, and enterprises are increasingly seeking robust and automated evaluation approaches to accelerate their RAG initiatives, which is why the new AWS paper could interest enterprises.
The approach laid down by AWS researchers in the paper could help enterprises build more performant and cost-efficient solutions around RAG that do not rely on costly fine-tuning efforts, inefficient RAG workflows, and in-context learning overkill (i.e. maxing out big context windows), said Omdia Chief Analyst Bradley Shimmin.
What is AWS’ automated RAG evaluation mechanism?
The paper, titled “Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation,” which will be presented at the ICML 2024 conference in July, proposes an automated exam generation process, enhanced by item response theory (IRT), to evaluate the factual accuracy of RAG models on specific tasks.
Item response theory, also known as latent trait theory, is usually used in psychometrics to determine the relationship between unobservable characteristics and observable ones, such as output or responses, with the help of a family of mathematical models.
The evaluation of RAG, according to AWS researchers, is conducted by scoring it on an auto-generated synthetic exam composed of multiple-choice questions based on the corpus of documents associated with a particular task.
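As a rough illustration of what scoring a pipeline on a multiple-choice exam involves, the sketch below computes plain accuracy over a list of exam items. It is a minimal sketch under stated assumptions, not the researchers' implementation: the `ExamItem` structure, the `score_exam` function, and the `answer_fn` callback standing in for a RAG pipeline are all hypothetical names, and the sample questions are made up for the demonstration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ExamItem:
    """One multiple-choice question (hypothetical structure)."""
    question: str
    choices: Sequence[str]
    correct_index: int


def score_exam(
    answer_fn: Callable[[str, Sequence[str]], int],
    exam: Sequence[ExamItem],
) -> float:
    """Return the fraction of items the pipeline answers correctly.

    answer_fn stands in for a RAG pipeline: it receives the question
    and the candidate answers and returns the index it selects.
    """
    if not exam:
        raise ValueError("exam must contain at least one item")
    correct = sum(
        1 for item in exam
        if answer_fn(item.question, item.choices) == item.correct_index
    )
    return correct / len(exam)


# Made-up items and a trivial "pipeline" that always picks the first choice.
sample_exam = [
    ExamItem("Which service stores objects?", ["S3", "EC2", "IAM"], 0),
    ExamItem("Which service runs virtual machines?", ["S3", "EC2", "IAM"], 1),
]
always_first = lambda question, choices: 0
accuracy = score_exam(always_first, sample_exam)
```

In this toy run the placeholder pipeline gets one of the two questions right. A real evaluation would pass in a function that retrieves documents and queries the model before choosing an answer.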
“We leverage Item Response Theory to estimate the quality of an exam and its informativeness on task-specific accuracy. IRT also provides a natural way to iteratively improve the exam by eliminating the exam questions that are not sufficiently informative about a model’s ability,” the researchers said.
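To make the idea of an "informative" question more concrete, the sketch below uses the two-parameter logistic (2PL) model, a standard textbook IRT model. This is an assumption made for illustration and may differ from the exact model and pruning rule used in the paper. Under 2PL, each question has a discrimination `a` and a difficulty `b`, and its Fisher information at ability `theta` is `a² · P · (1 − P)`, where `P` is the probability of a correct answer. The function names, the pruning threshold, and the item parameters are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a model of ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def prune_items(
    items: Dict[str, Tuple[float, float]],
    abilities: Sequence[float],
    min_info: float,
) -> Dict[str, Tuple[float, float]]:
    """Keep items whose average information over the given abilities
    reaches min_info (an illustrative rule, not the paper's)."""
    kept = {}
    for name, (a, b) in items.items():
        avg = sum(item_information(t, a, b) for t in abilities) / len(abilities)
        if avg >= min_info:
            kept[name] = (a, b)
    return kept


# Made-up parameters: "sharp" separates models well, "flat" barely does.
items = {"sharp": (2.0, 0.0), "flat": (0.2, 0.0)}
kept = prune_items(items, abilities=[-1.0, 0.0, 1.0], min_info=0.1)
```

Here the low-discrimination question is dropped because its answers say little about which model is stronger, which is the intuition behind using IRT to refine an exam iteratively.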
The new process of evaluating RAG was tried out on four new open-ended question-answering tasks based on arXiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings, they explained, adding that the experiments revealed more general insights into factors impacting RAG performance, such as model size, retrieval mechanism, prompting, and fine-tuning.
Promising approach
The approach discussed in the AWS paper has several promising points, including addressing the challenge of specialized pipelines requiring specialized tests, according to data security firm Immuta’s AI expert Joe Regensburger.
“This is key since most pipelines will rely on commercial or open-source off-the-shelf LLMs. These models will not have been trained on domain-specific knowledge, so the conventional test sets will not be useful,” Regensburger explained.
However, Regensburger pointed out that although the approach is promising, its exam generation piece still needs to evolve, as the greatest challenge is not generating a question or the appropriate answer but generating sufficiently challenging distractor questions.
“Automated processes, in general, struggle to rival the level of human-generated questions, particularly in terms of distractor questions. As such, it’s the distractor generation process that could benefit from a more detailed discussion,” Regensburger said, comparing the automatically generated questions with human-generated questions set in the AP (advanced placement) exams.
Questions in the AP exams are set by experts in the field who repeatedly draft, review, and iterate on questions while preparing the examination, according to Regensburger.
Importantly, exam-based probes for LLMs already exist. “A portion of ChatGPT’s documentation measures the model’s performance against a battery of standardized tests,” Regensburger said, adding that the AWS paper extends OpenAI’s premise by suggesting that an exam could be generated against specialized, often private knowledge bases.
“In theory, this will assess how a RAG pipeline could generalize to new and specialized knowledge.”
At the same time, Omdia’s Shimmin pointed out that several vendors, including AWS, Microsoft, IBM, and Salesforce, already offer tools or frameworks focused on optimizing and enhancing RAG implementations, ranging from basic automation tools like LlamaIndex to advanced tools like Microsoft’s newly launched GraphRAG.
Optimized RAG vs very large language models
Choosing the right retrieval algorithm often leads to bigger performance gains than simply using a larger LLM, which can be costly, AWS researchers pointed out in the paper.
While recent advancements like “context caching” with Google Gemini Flash make it easy for enterprises to sidestep the need to build complex and finicky tokenization, chunking, and retrieval processes as part of the RAG pipeline, this approach can exact a high cost in inferencing compute resources to avoid latency, Omdia’s Shimmin said.
“Techniques like Item Response Theory from AWS promises to help with one of the more tricky aspects of RAG, measuring the effectiveness of the information retrieved before sending it to the model,” Shimmin said, adding that with such optimizations at the ready, enterprises can better optimize their inferencing overhead by sending the best information to a model rather than throwing everything at the model at once.
On the other hand, model size is only one factor influencing the performance of foundation models, Forrester’s Dai said.
“Enterprises should take a systematic approach for foundation model evaluation, spanning technical capabilities (model modality, model performance, model alignment, and model adaptation), business capabilities (open source support, cost-effectiveness, and local availability), and ecosystem capabilities (prompt engineering, RAG support, agent support, plugins and APIs, and ModelOps),” Dai explained.
Copyright © 2024 IDG Communications, Inc.