E2E Workflow of Retrieval-Augmented Generation (RAG)

Mehmet Ozkaya
6 min read · Nov 20, 2024

--

We’re going to explore the end-to-end (E2E) workflow of Retrieval-Augmented Generation (RAG). In our previous discussions, we’ve delved into the architecture and individual components of RAG.

Get Udemy Course with limited discounted coupon — Generative AI Architectures with LLM, Prompt, RAG, Fine-Tuning and Vector DB

Now, we’ll take a comprehensive look at how the entire process unfolds — from the moment a user submits a query to the generation of a tailored, accurate response.

What is the RAG Workflow?

The RAG workflow is a systematic process that combines retrieval mechanisms with generative models to produce contextually accurate and relevant responses. Here’s an overview of the steps involved:

  1. User Query: The process begins when a user submits a question or request.
  2. Query Embeddings: The system converts the user’s query into a vector representation using an embeddings model.
  3. Retrieval: Relevant documents or data are retrieved by comparing the query vector with document vectors in the knowledge base.
  4. Ranking and Filtering: The retrieved documents are ranked based on relevance and filtered to ensure quality.
  5. Context Query Prompt: The system combines the retrieved data with the original query to create a context-rich prompt.
  6. Generation: A Large Language Model (LLM) generates a response using the context query prompt.
  7. Output: The generated response is delivered to the user.

Let’s walk through each of these stages in detail.

Step 1: User Query

The workflow starts when the user submits a question or request.

Examples of User Queries:

  • “How do I reset my Smart TV?”
  • “What are the benefits of using cloud computing?”
  • “How can I improve customer satisfaction in retail?”

At this point, the system receives the query and prepares it for processing: understanding the user’s intent and getting the text ready for embedding.

Step 2: Query Embeddings

The system converts the user’s query into a vector representation.

  • Embedding the Query: The query is passed through an Embeddings Model that transforms the textual input into a numerical vector capturing its semantic meaning.
  • Semantic Understanding: This vectorization allows the system to comprehend the concepts behind the query, not just the literal words.
  • Ready for Retrieval: The vectorized query is now set to be compared against other vectors in the system’s knowledge base.

By embedding the query, the system ensures it can perform a meaningful comparison with stored data, focusing on the intent and context rather than exact keyword matches.
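As a minimal sketch, query embedding might look like the snippet below, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both are illustrative choices, not requirements of RAG):

    # Illustrative sketch: turn a user query into an embedding vector.
    # Library and model choice ("all-MiniLM-L6-v2") are assumptions for demonstration.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    query = "How do I reset my Smart TV?"
    query_vector = model.encode(query)  # NumPy array capturing the query's semantics
    print(query_vector.shape)           # e.g. (384,) for this model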

Step 3: Retrieval

The system retrieves relevant documents by performing a vector search.

  • Vector Search: The system compares the query vector with vectors representing documents or data points in its knowledge base.
  • Semantic Similarity: It searches for documents that are semantically similar to the query, identifying content that relates closely to the user’s intent.
  • Retrieving Relevant Data: A set of documents or data most likely to contain the required information is retrieved.

Example:

If the user’s query is about resetting a Smart TV, the system might retrieve:

  • The Smart TV’s user manual.
  • Troubleshooting guides.
  • Relevant FAQ sections.

This step ensures the system isn’t just looking for exact word matches but is understanding and retrieving information relevant to the user’s actual needs.
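A minimal sketch of the retrieval step, continuing the snippet above with a brute-force cosine-similarity search over a tiny in-memory knowledge base (real systems typically delegate this to a vector database; the documents here are made up for illustration):

    # Illustrative sketch: compare the query vector against document vectors.
    import numpy as np

    documents = [
        "Smart TV user manual: how to perform a factory reset",
        "Troubleshooting guide for common Smart TV issues",
        "FAQ: billing and subscription questions",
    ]
    doc_vectors = model.encode(documents)  # reuses the embeddings model from Step 2

    def cosine_sim(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Each candidate document gets a semantic-similarity score against the query.
    retrieved = [(cosine_sim(query_vector, vec), doc)
                 for vec, doc in zip(doc_vectors, documents)]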

Step 4: Ranking and Filtering

To ensure accuracy and relevance, the system ranks and filters the retrieved documents.

  • Ranking for Relevance: Documents are ranked based on how closely they match the user’s query.
  • Filtering: Irrelevant or low-quality data is filtered out.
  • Quality Assurance: This step enhances the likelihood that the generated response will be both accurate and helpful.

By ranking and filtering, the system prioritizes the most useful information, ensuring the final response is of high quality.
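Continuing the sketch, ranking and filtering can be as simple as sorting by similarity score and dropping results below a threshold (the 0.3 cutoff is an arbitrary assumption that would need tuning per embeddings model):

    # Illustrative sketch: rank retrieved documents and filter out weak matches.
    SIMILARITY_THRESHOLD = 0.3  # assumed cutoff, tune for your embeddings model

    ranked = sorted(retrieved, key=lambda pair: pair[0], reverse=True)
    filtered = [(score, doc) for score, doc in ranked if score >= SIMILARITY_THRESHOLD]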

Step 5: Context Query Prompt

The system combines the retrieved data with the original query to create a context-rich prompt.

  • Enriching the Query: The original query is augmented with insights from the retrieved documents.
  • Enhanced Understanding: This enriched prompt provides the system with a deeper understanding of the user’s needs.
  • Preparation for Generation: The context-rich prompt sets the stage for the LLM to generate a precise and relevant response.

This step is crucial as it bridges the gap between the user’s request and the supporting information, allowing for a more nuanced and accurate response.
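A minimal sketch of prompt assembly, continuing from the filtered results above (the template wording is an assumption; any format that clearly separates the retrieved context from the question works):

    # Illustrative sketch: merge retrieved context with the original query.
    context = "\n\n".join(doc for _, doc in filtered)

    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )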

Step 6: Generation

The Large Language Model (LLM) generates a response using the context query prompt.

  • LLM Processing: The LLM processes the enriched prompt, drawing on both its pre-trained general knowledge and the specific information retrieved.
  • Combining Knowledge: It merges general understanding with context-specific data.
  • Producing the Response: Generates a coherent, context-aware answer tailored to the user’s query.

Example:

For the Smart TV reset query, the LLM might generate:

“To reset your Smart TV, press and hold the power button on the remote control for 10 seconds. If that doesn’t work, navigate to Settings > System > Reset and follow the on-screen instructions.”

The LLM’s ability to blend general knowledge with specific, retrieved information results in a response that’s both accurate and user-friendly.
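As a minimal sketch of the generation call, here the context query prompt is sent to an LLM via the OpenAI Python client (the client and the gpt-4o-mini model name are illustrative assumptions; any chat-capable model fits this step):

    # Illustrative sketch: generate the final answer from the context query prompt.
    # Assumes the openai>=1.0 client and an OPENAI_API_KEY environment variable.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name for illustration
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,      # low temperature keeps the answer close to the context
    )
    answer = response.choices[0].message.content
    print(answer)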

Step 7: Output

The generated response is delivered to the user.

  • Delivery: The user receives a response that directly addresses their query.
  • Accuracy and Relevance: The answer is both accurate and tailored to the user’s needs, thanks to the combination of retrieved data and the LLM’s capabilities.
  • User Satisfaction: This process enhances user experience by providing timely and precise information.

The final output reflects the effectiveness of the RAG workflow, demonstrating how each step contributes to delivering high-quality responses.

Practical Example: RAG in Action for Customer Support

Let’s see how RAG works in a real-world scenario.

AI-powered Ticket Detail Page implementing RAG

User Query:

“How do I update the firmware on my AirPro Max Blender?”

RAG Process:

Ingestion:

  • The system has previously ingested product manuals, FAQs, and support documents related to the AirPro Max Blender.

Retrieval:

  • Searches the knowledge base and retrieves the firmware update instructions from the product manual.

Generation:

  • Uses the retrieved information to generate a response:
"To update the firmware on your AirPro Max Blender, please follow these steps:
Download the latest firmware from the KitchenMaster website.
Connect your blender to your computer using the USB cable provided.
Open the firmware update application and select 'Update Firmware.'
Follow the on-screen instructions to complete the process."

This response is accurate, up-to-date, and directly addresses the user’s question by utilizing specific information from the product documentation.
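Putting the step-by-step sketches together, a hypothetical end-to-end helper for this support scenario could look like the following (it reuses model, client, cosine_sim, and SIMILARITY_THRESHOLD from the snippets above; the function name and the documents passed in are illustrative assumptions):

    # Illustrative end-to-end sketch of the RAG workflow for customer support.
    def answer_question(query: str, documents: list[str], top_k: int = 3) -> str:
        # Step 2: embed the query.
        query_vec = model.encode(query)
        # Step 3: retrieve candidates by semantic similarity.
        doc_vecs = model.encode(documents)
        retrieved = [(cosine_sim(query_vec, vec), doc)
                     for vec, doc in zip(doc_vecs, documents)]
        # Step 4: rank and filter.
        ranked = sorted(retrieved, key=lambda pair: pair[0], reverse=True)[:top_k]
        context_docs = [doc for score, doc in ranked if score >= SIMILARITY_THRESHOLD]
        # Step 5: build the context query prompt.
        prompt = (
            "Answer the question using only the context below.\n\n"
            "Context:\n" + "\n\n".join(context_docs) + "\n\n"
            f"Question: {query}\nAnswer:"
        )
        # Step 6: generate the answer with the LLM.
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content

    # Usage (ingested_docs would come from the product manuals and FAQs described above):
    # print(answer_question("How do I update the firmware on my AirPro Max Blender?", ingested_docs))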

Conclusion: The RAG Workflow

The Retrieval-Augmented Generation workflow is a powerful approach that ensures AI systems provide accurate, contextually relevant responses by seamlessly integrating retrieval and generation processes.

Summary of Steps:

  1. User Query: Initiates the process with a question or request.
  2. Query Embeddings: Transforms the query into a vector for semantic understanding.
  3. Retrieval: Finds relevant information from the knowledge base.
  4. Ranking and Filtering: Prioritizes and refines the retrieved data.
  5. Context Query Prompt: Creates an enriched prompt combining query and data.
  6. Generation: LLM generates a tailored response.
  7. Output: Delivers the final answer to the user.

Get Udemy Course with limited discounted coupon — Generative AI Architectures with LLM, Prompt, RAG, Fine-Tuning and Vector DB

EShop Support App with AI-Powered LLM Capabilities

You’ll get hands-on experience designing a complete EShop Customer Support application, including LLM capabilities like Summarization, Q&A, Classification, Sentiment Analysis, Embedding Semantic Search, and Code Generation, by integrating LLM architectures into enterprise applications.

Written by Mehmet Ozkaya

Software Architect | Udemy Instructor | AWS Community Builder | Cloud-Native and Serverless Event-driven Microservices https://github.com/mehmetozkaya
