6. Guided Project 2: Optimizing a RAG Application with LangCache

Retrieval-Augmented Generation (RAG) systems combine the power of LLMs with external knowledge bases to provide more accurate, up-to-date, and grounded responses. However, RAG workflows can be expensive and slow due to multiple LLM calls (for re-ranking, summarization, or final generation) and database lookups.

In this project, you’ll enhance a basic RAG workflow by integrating Redis LangCache at key stages to reduce LLM costs and latency.

Project Objective

Build a simplified RAG system that answers questions based on a small corpus of documents. LangCache will be used to:

  1. Cache the final generated answers from the LLM.
  2. (Optional, Advanced) Cache the results of the retrieval phase (i.e., relevant document chunks).

Simplified RAG Workflow

Our RAG system will follow these steps:

  1. User Query: A user asks a question.
  2. LangCache Check (Final Answer): First, check LangCache for a previously generated full answer to a similar query. If found, return it.
  3. Document Retrieval (if cache miss): If no cached answer, search a local “document store” (a simple array of text chunks for this project) to find relevant pieces of information.
  4. Context Construction: Combine the user query with the retrieved document chunks to create a comprehensive prompt for the LLM.
  5. LLM Call: Send the contextualized prompt to a mock LLM.
  6. Store in LangCache: Store the user query and the LLM’s generated answer in LangCache.
  7. Return Answer: Present the LLM’s answer to the user.
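
To see how these steps fit together before writing the full application, here is a minimal sketch of the loop in Python (the Node.js flow in Step 2 has the same shape). The helper names (search_cache, retrieve_chunks, build_prompt, call_llm, store_in_cache) are placeholders for the pieces implemented in Steps 1 and 2, not real APIs.

# Minimal sketch of the workflow above; the helper functions are placeholders
# for the code we write in Steps 1 and 2.
async def answer_query(query: str) -> str:
    cached = await search_cache(query)       # Step 2: semantic lookup in LangCache
    if cached:
        return cached                        # cache hit: skip retrieval and the LLM entirely

    chunks = retrieve_chunks(query)          # Step 3: find relevant document chunks
    prompt = build_prompt(query, chunks)     # Step 4: combine the query with retrieved context
    answer = await call_llm(prompt)          # Step 5: generate the answer
    await store_in_cache(query, answer)      # Step 6: cache it for future similar queries
    return answer                            # Step 7: return the answer to the user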

Prerequisites

  • Completed “Setting Up Your Development Environment” (Chapter 1).
  • Understanding of all previous chapters, especially “Advanced LangCache Features” (Chapter 4).
  • Familiarity with RAG concepts (even if simplified for this project).

Project Structure

Create a new directory, learn-redis-langcache/projects/rag-optimizer, containing:

  • rag-optimizer/index.js (for Node.js) or rag-optimizer/rag_app.py (for Python)
  • rag-optimizer/documents.js or rag-optimizer/documents.py (our “document store”)
  • rag-optimizer/mock_llm.js or rag-optimizer/mock_llm.py (the mock LLM)
  • The .env file in the root learn-redis-langcache directory.

Step-by-Step Instructions

Step 1: Prepare Document Store and Mock LLM

We’ll define our knowledge base as a simple array of strings and adapt the mock LLM from Project 1, giving it a longer delay and RAG-flavored responses.

Node.js (projects/rag-optimizer/documents.js)

// projects/rag-optimizer/documents.js
const documents = [
    "The capital of France is Paris. Paris is known for the Eiffel Tower and the Louvre Museum.",
    "The official currency of the European Union is the Euro. Many EU member states use the Euro.",
    "Python is a high-level, interpreted programming language known for its readability and versatility. It's widely used in web development, data science, and AI.",
    "Node.js is a JavaScript runtime built on Chrome's V8 JavaScript engine. It allows developers to run JavaScript on the server side, enabling full-stack JavaScript applications.",
    "Machine learning is a subset of artificial intelligence that involves training algorithms to learn from data and make predictions or decisions.",
    "The Amazon Rainforest is the largest rainforest in the world, covering vast areas of South America. It is vital for global climate regulation and biodiversity.",
    "Renewable energy sources include solar, wind, hydro, geothermal, and biomass. They are crucial for reducing carbon emissions."
];

function retrieveRelevantChunks(query, numChunks = 2) {
    // A very simple keyword-based retrieval for demonstration.
    // In a real RAG system, this would involve vector search,
    // lexical search, or a more sophisticated indexing mechanism.
    const relevant = [];
    const queryWords = query.toLowerCase().split(' ').filter(word => word.length > 2); // Ignore very short words
    for (const doc of documents) {
        if (queryWords.some(word => doc.toLowerCase().includes(word))) {
            relevant.push(doc);
            if (relevant.length >= numChunks) break;
        }
    }
    return relevant;
}

module.exports = { documents, retrieveRelevantChunks };

Node.js (projects/rag-optimizer/mock_llm.js) (adapted from Project 1's mock LLM, with a longer delay and RAG-flavored responses)

// projects/rag-optimizer/mock_llm.js (adapted from chatbot-project/mock_llm.js)
async function mockLlmResponse(prompt) {
    await new Promise(resolve => setTimeout(resolve, 2000)); // Simulate longer LLM delay for RAG

    if (prompt.includes("capital of France")) {
        return "The capital of France is Paris. It is a major European city and a global center for art, fashion, gastronomy, and culture.";
    } else if (prompt.includes("Python programming")) {
        return "Python is a versatile programming language widely used for web development, data analysis, artificial intelligence, and scientific computing due to its clear syntax and extensive libraries.";
    } else if (prompt.includes("Node.js")) {
        return "Node.js is a powerful JavaScript runtime environment that allows you to execute JavaScript code outside of a web browser, commonly used for building scalable network applications.";
    } else if (prompt.includes("rainforest") && prompt.includes("importance")) {
        return "The Amazon Rainforest is incredibly important for global biodiversity and plays a critical role in regulating the Earth's climate by absorbing vast amounts of carbon dioxide and producing oxygen.";
    } else {
        return `As an LLM, I generated this based on the context: "${prompt.substring(0, 100)}..."`;
    }
}

module.exports = { mockLlmResponse };

Python (projects/rag-optimizer/documents.py)

# projects/rag-optimizer/documents.py
documents = [
    "The capital of France is Paris. Paris is known for the Eiffel Tower and the Louvre Museum.",
    "The official currency of the European Union is the Euro. Many EU member states use the Euro.",
    "Python is a high-level, interpreted programming language known for its readability and versatility. It's widely used in web development, data science, and AI.",
    "Node.js is a JavaScript runtime built on Chrome's V8 JavaScript engine. It allows developers to run JavaScript on the server side, enabling full-stack JavaScript applications.",
    "Machine learning is a subset of artificial intelligence that involves training algorithms to learn from data and make predictions or decisions.",
    "The Amazon Rainforest is the largest rainforest in the world, covering vast areas of South America. It is vital for global climate regulation and biodiversity.",
    "Renewable energy sources include solar, wind, hydro, geothermal, and biomass. They are crucial for reducing carbon emissions."
]

def retrieve_relevant_chunks(query: str, num_chunks: int = 2) -> list[str]:
    """
    A very simple keyword-based retrieval for demonstration.
    In a real RAG system, this would involve vector search,
    lexical search, or a more sophisticated indexing mechanism.
    """
    relevant = []
    lower_query = query.lower()
    query_words = lower_query.split(' ')

    for doc in documents:
        # Check if any query word is in the document (simple matching)
        if any(word in doc.lower() for word in query_words if len(word) > 2): # Ignore very short words
            relevant.append(doc)
            if len(relevant) >= num_chunks:
                break
    return relevant

Python (projects/rag-optimizer/mock_llm.py) (adapted from Project 1's mock LLM, with a longer delay and RAG-flavored responses)

# projects/rag-optimizer/mock_llm.py (adapted from chatbot-project/mock_llm.py)
import asyncio

async def mock_llm_response(prompt: str) -> str:
    """Simulates an LLM API call with a longer delay and predefined responses for RAG context."""
    await asyncio.sleep(2) # Simulate longer network delay for RAG LLM

    # Simplified responses based on typical RAG outputs
    if "capital of France" in prompt:
        return "The capital of France is Paris. It is renowned for landmarks like the Eiffel Tower and the Louvre Museum, and is a major hub for art, fashion, and culture."
    elif "Python programming" in prompt:
        return "Python is a versatile and widely-used programming language, favored for its clear syntax and extensive libraries, making it popular in web development, data science, and AI applications."
    elif "Node.js" in prompt:
        return "Node.js is a JavaScript runtime built on Chrome's V8 engine, enabling server-side execution of JavaScript. It is commonly used for building scalable network applications and real-time services."
    elif "Amazon Rainforest" in prompt and ("importance" in prompt or "vital" in prompt):
        return "The Amazon Rainforest is globally significant as the largest tropical rainforest. It plays a crucial role in regulating climate, hosting immense biodiversity, and influencing global weather patterns."
    elif "renewable energy" in prompt:
        return "Renewable energy sources, such as solar, wind, and hydropower, are sustainable alternatives to fossil fuels. They are essential for reducing greenhouse gas emissions and combating climate change."
    else:
        # Generic RAG-like response if no specific match
        return f"Based on the provided context, I can tell you: {prompt[:100]}..."

Step 2: Implement the RAG Application Logic with LangCache

This is the core of our RAG system. It will integrate retrieval with the LLM call and, most importantly, with LangCache.

Node.js (projects/rag-optimizer/index.js)

// projects/rag-optimizer/index.js
require('dotenv').config({ path: '../../.env' });

const { LangCache } = require('@redis-ai/langcache');
const readline = require('readline');
const { mockLlmResponse } = require('./mock_llm');
const { retrieveRelevantChunks } = require('./documents');

// Retrieve LangCache credentials
const LANGCACHE_API_HOST = process.env.LANGCACHE_API_HOST;
const LANGCACHE_CACHE_ID = process.env.LANGCACHE_CACHE_ID;
const LANGCACHE_API_KEY = process.env.LANGCACHE_API_KEY;

// Initialize LangCache client
const langCache = new LangCache({
    serverURL: `https://${LANGCACHE_API_HOST}`,
    cacheId: LANGCACHE_CACHE_ID,
    apiKey: LANGCACHE_API_KEY,
});

console.log("RAG Optimizer Chatbot initialized. Type 'exit' to quit.");

const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout
});

async function runRagWorkflow() {
    rl.question('You: ', async (query) => {
        if (query.toLowerCase() === 'exit') {
            console.log('Goodbye!');
            rl.close();
            return;
        }

        let finalAnswer = '';
        let source = '';

        try {
            // --- Phase 1: Check LangCache for a complete answer ---
            console.log('\n--- Checking LangCache for final answer ---');
            const cachedResults = await langCache.search({
                prompt: query,
                similarityThreshold: 0.85 // Higher threshold for final answer
            });

            if (cachedResults && cachedResults.results.length > 0) {
                finalAnswer = cachedResults.results[0].response;
                source = 'LangCache (Final Answer)';
                console.log(`Cache Hit! (Score: ${cachedResults.results[0].score.toFixed(4)})`);
                console.log(`Bot (from ${source}): ${finalAnswer}`);
            } else {
                console.log('Cache Miss for final answer. Proceeding to Retrieval and LLM.');
                // --- Phase 2: Document Retrieval (if cache miss) ---
                console.log('--- Retrieving relevant documents ---');
                const relevantChunks = retrieveRelevantChunks(query);
                console.log(`Retrieved ${relevantChunks.length} relevant chunks.`);

                let context = relevantChunks.join('\n\n');
                if (context) {
                    context = `Context:\n${context}\n\n`;
                } else {
                    context = "No specific context found. Relying on general knowledge.\n\n";
                    console.warn("Warning: No relevant documents found for the query.");
                }

                const llmPrompt = `${context}Based on the context, answer the following question concisely: ${query}`;
                console.log(`\n--- Calling Mock LLM ---`);
                console.log(`LLM Prompt Preview: "${llmPrompt.substring(0, 100)}..."`);

                // --- Phase 3: LLM Call ---
                finalAnswer = await mockLlmResponse(llmPrompt);
                source = 'Mock LLM';
                console.log(`Bot (from ${source}): ${finalAnswer}`);

                // --- Phase 4: Store in LangCache ---
                console.log('\n--- Storing LLM response in LangCache ---');
                await langCache.set({ prompt: query, response: finalAnswer });
                console.log('Final answer stored in LangCache.');
            }
        } catch (error) {
            console.error('Error during RAG workflow:', error.message);
            // Fallback to a basic LLM call if anything goes wrong with cache or retrieval
            finalAnswer = await mockLlmResponse(query);
            source = 'Mock LLM (fallback)';
            console.log(`Bot (from ${source}): ${finalAnswer}`);
        }

        runRagWorkflow(); // Continue the chat
    });
}

runRagWorkflow();

Python (projects/rag-optimizer/rag_app.py)

# projects/rag-optimizer/rag_app.py
import os
import asyncio
import sys
from dotenv import load_dotenv
from langcache import LangCache
from mock_llm import mock_llm_response
from documents import retrieve_relevant_chunks

# Load environment variables from the parent .env file
load_dotenv(dotenv_path='../../.env')

# Retrieve LangCache credentials
LANGCACHE_API_HOST = os.getenv("LANGCACHE_API_HOST")
LANGCACHE_CACHE_ID = os.getenv("LANGCACHE_CACHE_ID")
LANGCACHE_API_KEY = os.getenv("LANGCACHE_API_KEY")

# Initialize LangCache client
lang_cache = LangCache(
    server_url=f"https://{LANGCACHE_API_HOST}",
    cache_id=LANGCACHE_CACHE_ID,
    api_key=LANGCACHE_API_KEY
)

print("RAG Optimizer Chatbot initialized. Type 'exit' to quit.")

async def run_rag_workflow():
    while True:
        try:
            query = await asyncio.to_thread(input, 'You: ')
        except EOFError:
            print('Goodbye!')
            break

        if query.lower() == 'exit':
            print('Goodbye!')
            break

        final_answer = ''
        source = ''

        try:
            # --- Phase 1: Check LangCache for a complete answer ---
            print('\n--- Checking LangCache for final answer ---')
            cached_results = await lang_cache.search(
                prompt=query,
                similarity_threshold=0.85 # Higher threshold for final answer
            )

            if cached_results:
                final_answer = cached_results[0].response
                source = 'LangCache (Final Answer)'
                print(f"Cache Hit! (Score: {cached_results[0].score:.4f})")
                print(f"Bot (from {source}): {final_answer}")
            else:
                print('Cache Miss for final answer. Proceeding to Retrieval and LLM.')
                # --- Phase 2: Document Retrieval (if cache miss) ---
                print('--- Retrieving relevant documents ---')
                relevant_chunks = retrieve_relevant_chunks(query)
                print(f"Retrieved {len(relevant_chunks)} relevant chunks.")

                context = '\n\n'.join(relevant_chunks)
                if context:
                    context = f"Context:\n{context}\n\n"
                else:
                    context = "No specific context found. Relying on general knowledge.\n\n"
                    print("Warning: No relevant documents found for the query.")

                llm_prompt = f"{context}Based on the context, answer the following question concisely: {query}"
                print(f"\n--- Calling Mock LLM ---")
                print(f"LLM Prompt Preview: \"{llm_prompt[:100]}...\"")

                # --- Phase 3: LLM Call ---
                final_answer = await mock_llm_response(llm_prompt)
                source = 'Mock LLM'
                print(f"Bot (from {source}): {final_answer}")

                # --- Phase 4: Store in LangCache ---
                print('\n--- Storing LLM response in LangCache ---')
                await lang_cache.set(prompt=query, response=final_answer)
                print('Final answer stored in LangCache.')
        except Exception as e:
            print(f"Error during RAG workflow: {e}")
            # Fallback to a basic LLM call if anything goes wrong with cache or retrieval
            final_answer = await mock_llm_response(query)
            source = 'Mock LLM (fallback)'
            print(f"Bot (from {source}): {final_answer}")

if __name__ == "__main__":
    asyncio.run(run_rag_workflow())

Step 3: Run and Test the RAG Optimizer Chatbot

Node.js:

  1. Navigate to learn-redis-langcache/projects/rag-optimizer.
  2. Run node index.js.

Python:

  1. Navigate to learn-redis-langcache/projects/rag-optimizer.
  2. Run python rag_app.py.

Testing Scenario:

  • Query 1 (Cache Miss, then LLM + Cache Store):
    • You: What is the capital of France and what is it known for?
    • Observe: Cache Miss..., Retrieving relevant documents..., Calling Mock LLM..., then Bot (from Mock LLM): ... and Final answer stored in LangCache. (This will take ~2 seconds due to the mock LLM delay).
  • Query 2 (Cache Hit):
    • You: Tell me about Paris, the French capital.
    • Observe: Checking LangCache for final answer..., Cache Hit! (Score: X.XXX), Bot (from LangCache (Final Answer)): ... (This response should be very fast).
  • Query 3 (New Query, Cache Miss, then LLM + Cache Store):
    • You: How does Python work and what is it used for?
    • Observe: Cache Miss..., Retrieving relevant documents..., Calling Mock LLM..., then Bot (from Mock LLM): ... and Final answer stored in LangCache.
  • Query 4 (Cache Hit):
    • You: Explain Python programming.
    • Observe: Checking LangCache for final answer..., Cache Hit! (Score: X.XXX), Bot (from LangCache (Final Answer)): ...

Notice how the retrieveRelevantChunks function in our mock setup finds documents containing keywords from your query, and the mock LLM uses that context (or the lack of it) to formulate a response. The key takeaway is the dramatic speed difference and cost savings when LangCache delivers a hit!
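
If you want to inspect the retrieval step on its own, a quick check like the following (Python shown; the Node.js equivalent is analogous with require) prints which chunks the keyword matcher selects for a query:

# Quick standalone check of the keyword retriever (run from projects/rag-optimizer).
from documents import retrieve_relevant_chunks

for chunk in retrieve_relevant_chunks("What is the capital of France?"):
    print(chunk[:80], "...")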

Step 4: Advanced Challenges and Enhancements

Challenge Yourself:

  1. Cache Retrieval Results (Advanced Caching Strategy):
    • Modify the workflow to also cache the retrieved document chunks based on the user’s initial query.
    • When a query comes in:
      • First, check LangCache for the final answer (as implemented).
      • If no final answer, then check LangCache for cached retrieval results (i.e., the relevantChunks themselves) using the user query as the prompt and a specific attribute like type: "retrieval".
      • If retrieval results are cached, use them directly.
      • If not cached, perform the retrieveRelevantChunks step, then cache these chunks (e.g., as a JSON string or an array of strings) in LangCache with type: "retrieval" before constructing the LLM prompt.
      • This creates a two-tier caching system (see the sketch after this list).
  2. Dynamic Thresholds:
    • Implement logic to dynamically adjust the similarity_threshold based on user feedback (e.g., implicit feedback like “was this helpful?”).
  3. Metrics and Monitoring:
    • Keep a simple counter for cache hits and misses within your application and print them at the end or periodically. This simulates real-world monitoring.
  4. Error Handling for LLM:
    • Extend the mock_llm_response to sometimes “fail” or return an “unavailable” message to practice robust error handling for LLM API limits or downtime.
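
For the first challenge, here is one possible sketch of the retrieval-tier cache in Python, added to rag_app.py. It assumes the attributes support covered in Chapter 4 for tagging and filtering entries; if your SDK version exposes a different parameter name, adjust accordingly. The chunks are serialized to a JSON string because each cache entry stores a single response string.

# One possible sketch for Challenge 1: a retrieval-tier cache on top of the
# final-answer cache. Assumes the `attributes` support from Chapter 4; adapt
# the parameter names to your SDK version if they differ.
import json

async def get_relevant_chunks_cached(query: str) -> list[str]:
    # Look for previously cached retrieval results for a semantically similar query.
    cached = await lang_cache.search(
        prompt=query,
        similarity_threshold=0.85,
        attributes={"type": "retrieval"},
    )
    if cached:
        return json.loads(cached[0].response)

    # Cache miss: run the keyword retriever, then cache the chunks as a JSON string.
    chunks = retrieve_relevant_chunks(query)
    await lang_cache.set(
        prompt=query,
        response=json.dumps(chunks),
        attributes={"type": "retrieval"},
    )
    return chunks

In the main workflow you would also tag final-answer entries with their own attribute (for example type: "final_answer") and include that attribute in the Phase 1 search, so the two tiers never return each other's entries.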

This project demonstrates the power of LangCache in optimizing complex AI workflows like RAG, where multiple interactions with costly resources can be significantly reduced by intelligent caching.