RAG (Retrieval-Augmented Generation) Part 2

Click here to read part 1 of the post.

Topics

5. Challenges with RAG
6. What can we do: Improving Data Quality
7. What can we do: Agents

5. Challenges with RAG

A basic RAG (Retrieval-Augmented Generation) system tends to perform well when answering questions about specific paragraphs in a dataset: modern embedding search can match the semantics of the given text, so the system retrieves relevant passages and the LLM synthesizes a coherent answer from that context. The challenges become more pronounced with a broader range of questions and larger datasets.

For instance, simple questions over complex data can be problematic when the data mixes formats such as images, tables, charts, and figures, making accurate synthesis difficult and prone to hallucinations. Retrieval quality also becomes harder to maintain as the dataset grows from a few PDFs to hundreds, thousands, or even millions of diverse documents. And answering complex, multi-part questions requires a system that can iterate through the data step by step, which a single-shot pipeline struggles to do.

TL;DR

Failure Modes:
  • Simple Questions over Complex Data
  • Simple Questions over Multiple Documents
  • Complex Questions
Goal:
The top-priority goal should be figuring out how to get high response quality on the set of representative questions you want to ask.

6. What can we do: Improving Data Quality

The data processing layer in a RAG (Retrieval-Augmented Generation) system involves several key steps, starting with parsing. Parsing entails loading data from various file types, such as PowerPoint presentations or PDFs, and extracting relevant information. The quality of parsing is crucial; poor parsing can lead to the “garbage in, garbage out” problem, where poorly formatted text and tables confuse even the most advanced language models. The parsing step is followed by transforming the data, often through a process called chunking, which is particularly important when dealing with unstructured data in RAG systems.
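To make the parsing step concrete, here is a minimal sketch using the pypdf library (one parser among many; the file name is a placeholder):

```python
# Minimal parsing sketch using pypdf. "report.pdf" is a placeholder file name.
# Each page's text is extracted so downstream steps can chunk and index it.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
pages = [page.extract_text() or "" for page in reader.pages]
print(f"Parsed {len(pages)} pages")
```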

Chunking involves breaking down large datasets into manageable pieces. The goal is to preserve semantically similar content, ensuring that related information is kept together. Techniques for chunking can range from basic text splitting to more advanced methods like semantic chunking, where chunks are defined based on embedding similarity with the previous sentence. A commonly recommended baseline approach is page-level chunking, where each page of a document is treated as a chunk. This approach is effective and straightforward, especially when starting to build a RAG pipeline, as it leverages the natural division of content into digestible pieces. Including entire pages in prompts, rather than optimizing for smaller chunk sizes like 512 or 1024 tokens, often yields better results.
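As a sketch of the page-level baseline, assuming `pages` is the list of page texts produced by the parsing step above, each page becomes one chunk with its page number kept as metadata:

```python
# Page-level chunking sketch: one chunk per parsed page, with the page
# number kept as metadata so answers can cite their source.
chunks = [
    {"text": text, "metadata": {"page": i + 1}}
    for i, text in enumerate(pages)
    if text.strip()  # skip empty pages
]
```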

After chunking, the next step is indexing the data. This can involve more than just using pure vectors; for instance, metadata can be used alongside vectors, or a knowledge graph can be implemented. Data can also be stored in document stores, enabling different tiers of retrieval based on the data’s structure and content. General principles for indexing include embedding not just raw text but also related references like summaries or captions. This comprehensive approach ensures more thorough retrieval of information, decoupling indexing from synthesis.
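Here is a minimal sketch of that principle, where `embed` and `summarize` are hypothetical stand-ins for your embedding model and summarizer; both the raw-text vector and the summary vector point back to the same chunk:

```python
# Indexing sketch: embed both the raw chunk text and a short summary of it,
# with both vectors pointing at the same chunk (decoupling indexing from
# synthesis). `embed` and `summarize` are hypothetical helpers.
index = []   # (vector, chunk_id) pairs; a real system would use a vector DB
store = {}   # chunk_id -> full chunk, kept in a document store

for chunk_id, chunk in enumerate(chunks):
    store[chunk_id] = chunk
    index.append((embed(chunk["text"]), chunk_id))              # raw text
    index.append((embed(summarize(chunk["text"])), chunk_id))   # summary

# At query time, the nearest vector maps back to a chunk_id, and the full
# chunk from `store` is handed to the LLM for synthesis.
```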

The ultimate goal of these processes is to develop a new type of ETL stack. Unlike traditional data engineering, where significant human effort is required to transform raw data into a usable format, the RAG context allows for a more standardized and less labor-intensive process. An efficient data pipeline can handle large volumes of unstructured data with minimal manual intervention, provided the pipeline parameters are well-tuned to maintain high performance across the dataset.

Looking ahead, there’s an emerging discussion about the impact of multimodality in RAG systems. With the advent of models like GPT-4 and others that support multimodal input, including both text and images, there’s a growing need to consider how to handle chunks that include non-text elements. Currently, image-based chunking introduces significant cost and latency, but this could change as models and technologies improve.

TL;DR

Parsing: 

  • Bad parsers are a key cause of garbage in => garbage out
  • Badly formatted text/tables confuse even the best LLMs

Chunking:

refers to breaking down a large document or dataset into smaller, more manageable pieces. This process is particularly useful when dealing with lengthy texts or complex data that need to be processed and queried efficiently.

  • Try to preserve semantically similar content
  • Strong baseline: page-level chunking
  • Common chunking strategies include fixed chunking, recursive chunking, document-specific chunking, semantic chunking, and agentic chunking (see the sketch below).
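As a sketch of the semantic variant, assuming a hypothetical `embed` helper that returns a NumPy vector and a similarity threshold you would tune for your data:

```python
# Semantic-chunking sketch: start a new chunk whenever a sentence's embedding
# drifts too far from the previous sentence's. `embed` is a hypothetical
# embedding helper; the 0.7 threshold is an assumption to tune.
import numpy as np

def semantic_chunks(sentences, threshold=0.7):
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        sim = np.dot(prev_vec, vec) / (np.linalg.norm(prev_vec) * np.linalg.norm(vec))
        if sim < threshold:            # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```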

Indexing:

involves creating a data structure that allows for efficient retrieval of vectors based on similarity searches.

  • Raw text oftentimes confuses the embedding model.
  • Don’t just embed the raw text; embed references (e.g., captions, summaries, links to underlying data).
  • Having multiple embeddings point to the same chunk is good practice.

 

7. What can we do: Agents

Enhancing Retrieval-Augmented Generation (RAG) systems to answer complex questions is an area of growing interest, particularly with the introduction of “agents.” In this context, an agent refers to an autonomous system designed to interact with users, other systems, or the environment based on a set of rules. These systems enable large language models (LLMs) to perform specific tasks more efficiently, make decisions, or facilitate interactions using natural language processing. The question arises: how can we evolve the standard RAG pipeline to incorporate an agentic workflow for better query understanding?

A traditional RAG pipeline has several notable limitations. One of these is its single-shot usage, where the LLM processes a query only once without iterative refinement. Another limitation is the lack of query understanding or planning, meaning the system simply passes the question to a vector database for retrieval without analyzing the best approach. For example, if a question requires scanning entire texts or focusing on specific details within a document, the current system cannot make that distinction.

Additionally, traditional RAG systems lack the ability to use external tools or APIs dynamically; they are fixed to use a vector database for retrieval. There’s also no mechanism for self-correction or reflection, as these systems do not have a process for refining steps over time. Furthermore, the absence of memory or state means the system cannot retain past interactions, which is crucial for maintaining context in conversation settings.

These limitations make the traditional RAG pipeline less flexible and adaptive, reducing its effectiveness in handling complex or dynamic queries. In contrast, a fully agentic system would possess several advanced features. Such a system could support multi-turn interactions, handling ongoing conversations or tasks requiring multiple exchanges. It would also include a query or task planning layer, organizing and planning the workflow based on the query’s nature.

Moreover, agentic systems can interface with various external tools, treating the vector database as just one of many resources. For example, they could interact with data warehouses, CRM systems like Salesforce, or services such as calculators and code interpreters. A reflection layer could validate responses in real-time, allowing the system to make adjustments if the output is not as expected. Additionally, a memory layer would enable the system to remember previous interactions, providing contextually relevant responses and personalization.
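To make these layers concrete, here is a minimal sketch of an agentic loop with planning, tool use, reflection, and memory. Everything here is a hypothetical stand-in: `llm` is assumed to return plain strings, and `vector_search` and `run_sql` represent your retrieval and warehouse tools:

```python
# Agentic-loop sketch: the LLM plans which tool to call (the vector DB is
# just one option), a reflection step validates the answer, and a history
# list acts as simple memory. `llm`, `vector_search`, and `run_sql` are
# hypothetical stand-ins for your model and tool implementations.
TOOLS = {"vector_search": vector_search, "sql_warehouse": run_sql}

def agentic_answer(question, max_steps=3):
    history = []  # memory/state carried across steps
    answer = ""
    for _ in range(max_steps):
        # Query-planning layer: pick a tool based on the question and history.
        tool = llm(
            f"Question: {question!r}. Steps so far: {history}. "
            f"Reply with exactly one tool name from {list(TOOLS)}."
        ).strip()
        evidence = TOOLS[tool](question)
        answer = llm(f"Answer {question!r} using this evidence: {evidence}")
        # Reflection layer: validate before returning; otherwise iterate.
        verdict = llm(f"Is {answer!r} a complete answer to {question!r}? yes or no.")
        if verdict.strip().lower() == "yes":
            return answer
        history.append({"tool": tool, "answer": answer})
    return answer  # best effort after max_steps
```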

Agents are a broad and complex topic, with numerous resources available for those interested in building multi-agent systems. This subject could easily be the focus of several in-depth workshops. For our purposes, we will concentrate on applying agent systems to knowledge synthesis and question answering, exploring how these advanced capabilities can enhance interactions with data.

TL;DR

Traditional RAG has the following limitations:

  1. Single-Shot Usage: It typically uses the large language model only once per query, without iterative refinement.
  2. Lack of Query Understanding or Planning: There is no mechanism for understanding or planning the query; the question is simply passed directly to a vector database for retrieval. To elaborate: given a question, the system should be able to determine whether it makes sense to scan entire documents or to look for specific details within a document.
  3. No Tool Use: The system does not adapt its use of external tools or APIs. The LLM does not dynamically determine how to call an API; it is fixed to use a vector database directly for retrieval.
  4. No Self-Correction or Reflection: There is no process for self-correction or refining steps over time, lacking a layer for reflection or error correction.
  5. No Memory or State: The system does not store any state, meaning it cannot remember past interactions, which can be a limitation in conversation settings where context from previous exchanges is important.

A full agentic system can be quite complex, depending on how it’s defined. Such systems often have several key properties:

  1. Multi-Turn Interactions: They are capable of handling ongoing conversations or tasks that require multiple exchanges.
  2. Query or Task Planning Layer: There is usually a component that plans and organizes the tasks or queries.
  3. Tool Interface: These systems can interact with external tools, treating the vector database as just one of many tools, such as data warehouses, CRM systems like Salesforce, or other services like calculator functions and code interpreters.
  4. Reflection Layer: This involves validating responses as they are generated, making real-time decisions if the output is not as expected.
  5. Memory Layer: For personalization, the system can remember previous interactions to provide contextually relevant responses.

I will go into more depth about agents in part 3.
