New research from Google introduces “sufficient context,” a novel perspective for understanding and improving retrieval-augmented generation (RAG) systems in large language models (LLMs).
This approach makes it possible to determine whether an LLM has enough information to answer a query accurately, a crucial factor for developers building real-world enterprise applications where reliability and factual accuracy are paramount.
The persistent challenges of RAG
RAG systems are a cornerstone of building more factual and verifiable AI applications. However, these systems can exhibit undesirable traits. They may confidently provide incorrect answers even when presented with retrieved evidence, get distracted by irrelevant information in the context, or fail to extract answers properly from long text snippets.
In the paper, the researchers state: “The ideal outcome is for the LLM to output the correct answer if the provided context contains enough information to answer the question when combined with the model’s parametric knowledge. Otherwise, the model should abstain from answering and/or ask for more information.”
Achieving this ideal scenario requires building models that can determine whether the provided context can help answer a question correctly and use it selectively. Previous attempts to address this have examined how LLMs behave with varying degrees of information. However, the Google paper argues that “while the goal seems to be understanding how LLMs behave when they do or do not have sufficient information to answer the query, prior work fails to address this head-on.”
Sufficient context
To address this, the researchers introduce the concept of “sufficient context.” At a high level, input instances are classified based on whether the provided context contains enough information to answer the query. This splits contexts into two cases:
Sufficient context: The context has all the information needed to provide a definitive answer.
Insufficient context: The context lacks the necessary information. This could be because the query requires specialized knowledge not present in the context, or because the information is incomplete, inconclusive, or inconsistent.
Importantly, this designation is determined by examining only the question and the associated context, without needing a ground-truth answer. This is essential for real-world applications, where ground-truth answers are not readily available during inference.
The researchers developed an LLM-based “autorater” to automate the labeling of instances as having sufficient or insufficient context. They found that Google’s Gemini 1.5 Pro model, given a single example (1-shot), performed best at classifying context sufficiency, achieving high F1 scores and accuracy.
The paper notes, “In real-world scenarios, we cannot expect candidate answers when evaluating model performance. Hence, it is desirable to use a method that works using only the query and context.”
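To make the idea concrete, below is a minimal sketch of what such an LLM-based autorater could look like. It is an illustration under stated assumptions, not the paper’s implementation: the prompt wording is invented, and call_llm is a hypothetical placeholder for whichever LLM client you use (the researchers used Gemini 1.5 Pro with a 1-shot prompt).

```python
# Hypothetical sketch of an LLM-based autorater that labels a query-context pair
# as "sufficient" or "insufficient" using only the query and the context,
# with no ground-truth answer required.

AUTORATER_PROMPT = """You are judging whether a context contains enough information to answer a question.
Reply with exactly one word: SUFFICIENT or INSUFFICIENT.

Example:
Question: When was the Eiffel Tower completed?
Context: The Eiffel Tower was completed in 1889 for the World's Fair in Paris.
Label: SUFFICIENT

Question: {question}
Context: {context}
Label:"""


def call_llm(prompt: str) -> str:
    """Placeholder for your LLM provider's API call; wire this to a real client."""
    raise NotImplementedError


def label_context_sufficiency(question: str, context: str) -> str:
    """Return 'sufficient' or 'insufficient' for a single query-context pair."""
    response = call_llm(AUTORATER_PROMPT.format(question=question, context=context))
    # Check the longer label first, since "INSUFFICIENT" contains "SUFFICIENT" as a substring.
    return "insufficient" if "INSUFFICIENT" in response.upper() else "sufficient"
```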
Key findings on LLM behavior with RAG
Analyzing various models and datasets through this lens of sufficient context revealed several important insights.
As expected, models generally achieve higher accuracy when the context is sufficient. However, even with sufficient context, models tend to hallucinate more often than they abstain. When the context is insufficient, the situation becomes more complex, with models showing both higher abstention rates and, for some models, increased hallucination.
Interestingly, while RAG generally improves overall performance, the additional context can also reduce a model’s ability to abstain from answering when it lacks sufficient information. “This phenomenon may arise from the model’s increased confidence in the presence of any contextual information, leading to a higher propensity for hallucination rather than abstention,” the researchers suggest.
A particularly curious observation was that models can sometimes provide the correct answer even when the provided context was deemed insufficient. The natural assumption is that the model already “knows” the answer from pre-training (parametric knowledge), but the researchers found other contributing factors. For example, the context may help disambiguate the query or bridge gaps in the model’s knowledge, even if it does not contain the complete answer. This ability of models to sometimes succeed even with limited external information has broader implications for RAG system design.
Cyrus Rashtchian, co-author of the study and a senior research scientist at Google, expands on this, emphasizing that the quality of the base LLM remains critical. “For a really good enterprise RAG system, the model should be evaluated on benchmarks both with and without retrieval,” he told VentureBeat. He suggested that retrieval should be viewed as “an augmentation of its knowledge,” rather than the sole source of truth. The base model “still needs to reason properly about the retrieved context, using contextual cues (informed by pre-training knowledge). For example, the model should know well enough whether the question can be answered from the context, rather than blindly copying from it.”
Reducing hallucinations in RAG systems
Given the finding that models in RAG settings tend to hallucinate rather than abstain compared to their behavior without RAG, the researchers investigated techniques to reduce hallucinations.
They developed a new “selective generation” framework. This method uses a smaller, separate “intervention model” to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage (the percentage of questions answered).
This framework can be combined with any LLM, including proprietary models such as Gemini and GPT. The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy on answered queries across different models and datasets. This method improved the fraction of correct answers among model responses by 2-10% for the Gemini, GPT, and Gemma models.
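As a rough illustration of the selective generation idea, the sketch below trains a small scoring model on per-example signals such as the main model’s self-reported confidence and the sufficient-context label, then answers only when the predicted probability of being correct clears a threshold. The feature set and the logistic-regression choice are assumptions for illustration, not the paper’s exact intervention model.

```python
# Illustrative sketch of selective generation: a small "intervention model"
# decides whether the main LLM should answer or abstain.
import numpy as np
from sklearn.linear_model import LogisticRegression


class SelectiveGenerator:
    def __init__(self, threshold: float = 0.5):
        # Higher threshold -> fewer questions answered, but higher accuracy on those answered.
        self.scorer = LogisticRegression()
        self.threshold = threshold

    def fit(self, features: np.ndarray, was_correct: np.ndarray) -> None:
        # features: per-example signals, e.g. [self_reported_confidence, sufficient_context_flag]
        # was_correct: whether the main LLM's answer matched the ground truth on a held-out set
        self.scorer.fit(features, was_correct)

    def should_answer(self, features: np.ndarray) -> np.ndarray:
        # Answer only when the predicted probability of a correct answer clears the threshold.
        p_correct = self.scorer.predict_proba(features)[:, 1]
        return p_correct >= self.threshold
```

Sweeping the threshold traces out the accuracy-coverage curve, letting a team choose how conservative the system should be.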
To put this 2-10% improvement into business perspective, Rashtchian offers a concrete example from customer support AI. “One could imagine a customer asking whether they can get a discount,” he said. “In some cases, the retrieved context is recent and specifically describes an ongoing promotion, so the model can answer with confidence. But in other cases, the context might describe a discount from a few months ago, or perhaps it has certain terms and conditions, in which case the model would be better off saying it is ‘not sure.’”
The team also explored fine-tuning models to encourage abstention. This involved training the models on examples where the original ground-truth answer was replaced with “I don’t know,” particularly for instances with insufficient context. The intuition was that explicit training on such examples could steer the model toward abstaining rather than hallucinating.
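A minimal sketch of how such a fine-tuning set might be assembled is shown below, assuming each example has already been labeled by the autorater; the field names and prompt format are hypothetical.

```python
# Hypothetical sketch: build fine-tuning rows where the target answer is replaced
# with "I don't know" when the retrieved context was labeled insufficient.

def build_abstention_finetune_set(examples: list[dict]) -> list[dict]:
    """Each example is assumed to look like:
    {"question": ..., "context": ..., "answer": ..., "context_label": "sufficient" | "insufficient"}
    """
    rows = []
    for ex in examples:
        target = ex["answer"] if ex["context_label"] == "sufficient" else "I don't know"
        rows.append({
            "prompt": f"Context: {ex['context']}\n\nQuestion: {ex['question']}",
            "completion": target,
        })
    return rows
```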
The results were mixed. Fine-tuned models often had a higher rate of correct answers but still hallucinated frequently, often more than they abstained. The paper concludes that while fine-tuning may help, “more work is needed to develop a reliable strategy that can balance these objectives.”
Applying sufficient context to real-world RAG systems
For enterprise teams looking to apply these insights to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Rashtchian outlines a practical approach. He suggests first collecting a dataset of query-context pairs that represent the kinds of examples the system will see in production. Next, use an LLM-based autorater to label each example as having sufficient or insufficient context.
“This will already give a good estimate of the fraction of contexts that are sufficient,” Rashtchian said. “If it is less than 80-90%, then there is likely plenty of room to improve the retrieval or knowledge-base side of things, and this is a good observable symptom.”
Rashtchian advises teams to “stratify model responses based on examples with sufficient versus insufficient context.” By examining metrics on these two separate datasets, teams can better understand the nuances of their system’s performance.
“For example, we saw that models were more likely to provide an incorrect response (with respect to the ground truth) when the context was insufficient. This is another observable symptom,” he noted, adding that “aggregating statistics over a whole dataset may gloss over a small but important set of queries that are handled poorly.”
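A sketch of this diagnostic workflow, assuming you already have per-example correctness flags from your evaluation and the autorater sketch from earlier, might look like this:

```python
# Hypothetical sketch: label a sample of query-context pairs with the autorater,
# then report the sufficient-context ratio and accuracy stratified by sufficiency.
from collections import defaultdict


def stratified_report(examples: list[dict]) -> dict:
    """examples: [{"question": ..., "context": ..., "is_correct": bool}, ...]"""
    buckets = defaultdict(list)
    for ex in examples:
        label = label_context_sufficiency(ex["question"], ex["context"])  # autorater sketch from earlier
        buckets[label].append(ex["is_correct"])

    total = sum(len(v) for v in buckets.values())
    report = {"sufficient_ratio": len(buckets["sufficient"]) / total}  # flag if below ~0.8-0.9
    for label, outcomes in buckets.items():
        report[f"accuracy_{label}"] = sum(outcomes) / len(outcomes)
    return report
```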
While an LLM-based autorater demonstrates high accuracy, enterprise teams may wonder about the additional computational cost. Rashtchian clarified that the overhead is manageable for diagnostic purposes.
“Running an LLM-based autorater on a small test set (say, 500-1,000 examples) can be done ‘offline,’ so there is no need to worry about how long it takes,” he said. For real-time applications, he concedes that “it would be better to use heuristics, or at least a smaller model.” The crucial takeaway, according to Rashtchian, is that “engineers should be looking at something beyond similarity scores from their retrieval component. Having an extra signal, from an LLM or a heuristic, can lead to new insights.”
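For the real-time case, a heuristic stand-in for the full autorater could be as simple as checking how much of the query’s content appears in the retrieved context. The sketch below is purely illustrative and not something proposed in the paper.

```python
# Illustrative heuristic proxy for context sufficiency in latency-sensitive settings:
# treat the context as likely sufficient when most of the query's content words appear in it.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to", "in", "for",
             "what", "when", "who", "where", "how", "why", "do", "does", "did"}


def heuristic_sufficiency(question: str, context: str, min_overlap: float = 0.6) -> bool:
    terms = [w for w in re.findall(r"\w+", question.lower()) if w not in STOPWORDS]
    if not terms:
        return False
    hits = sum(1 for w in terms if w in context.lower())
    return hits / len(terms) >= min_overlap
```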