LlamaIndex Documentation Assistant Bot using RAG

Pavan Kunchala
6 min read · Jan 21, 2025


Navigating through extensive documentation to find precise information can often feel overwhelming. Enter the Documentation Assistant Bot, an AI-powered solution that simplifies this process using modern web crawling, vector storage, and RAG (Retrieval-Augmented Generation) capabilities. This guide provides a comprehensive breakdown of how the bot works, with clear explanations for each component of its architecture.

You can check out all the code here: Link

The bot comprises two key parts:
1. crawler.py: A web crawler that retrieves website data and converts it into a vector store for efficient search and retrieval.
2. main.py: The query-handling interface that uses the vector store to answer user questions accurately.

The Concept

The Documentation Assistant Bot is designed to:
- Crawl Websites: Extract and process content from websites, storing it in a structured format.
- Enable Intelligent Search: Use embeddings and vector databases to perform semantic searches on the extracted content.
- Deliver Contextual Answers: Leverage language models to retrieve and synthesize relevant information for user queries.

This end-to-end pipeline ensures that users get the right answers quickly, with detailed reasoning and references.

1. crawler.py: Building the Vector Store

The crawler.py script handles data collection and preprocessing, transforming raw website content into a vectorized format.

Importing Required Libraries

import os
import sys
import asyncio
import requests
from xml.etree import ElementTree
from typing import List, Dict, Any
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlparse
from dotenv import load_dotenv

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
  • Purpose: Import the libraries needed for web crawling, embedding generation, and vector storage.
  • Notable libraries:
      • crawl4ai: facilitates efficient web crawling.
      • langchain_huggingface: generates embeddings using Hugging Face models.
      • langchain.vectorstores: manages the Chroma vector store for fast and accurate searches.

Setting Up the Environment and Vector Store

# Load environment variables
load_dotenv()

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="dunzhang/stella_en_1.5B_v5")

# Initialize Chroma vector store
VECTOR_STORE_DIR = "./vectorstore"
os.makedirs(VECTOR_STORE_DIR, exist_ok=True)
chroma_db = Chroma(persist_directory=VECTOR_STORE_DIR, embedding_function=embedding_model)
  • Environment Variables: Load sensitive information like API keys securely using `.env` files.
  • Embedding Model: Use the Hugging Face model `stella_en_1.5B_v5` to convert textual data into embeddings.
  • Vector Store: Chroma is initialized to store and retrieve these embeddings.
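
As a quick sanity check, you can embed a short string and inspect the resulting vector before crawling anything. This is purely illustrative; the sample text is made up, and the exact dimensionality depends on the stella_en_1.5B_v5 model.

# Illustrative sanity check: confirm the embedding model loads and returns a vector.
sample_vector = embedding_model.embed_query("How do I build a retriever?")
print(f"Embedding dimension: {len(sample_vector)}")
print(f"First few values: {sample_vector[:5]}")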

Defining the ProcessedChunk Data Structure

@dataclass
class ProcessedChunk:
    url: str
    chunk_number: int
    title: str
    summary: str
    content: str
    metadata: Dict[str, Any]
    embedding: List[float]
  • Purpose: A ProcessedChunk represents a single chunk of processed data.
  • Key fields:
      • url: source of the data.
      • chunk_number: identifies the chunk within a document.
      • metadata: stores additional information like crawl time and source path.

Chunking Text

def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 200) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        last_period = chunk.rfind('. ')
        if last_period > chunk_size * 0.3:
            end = start + last_period + 1
        chunks.append(text[start:end].strip())
        start = end - overlap
    return chunks
  • Why Chunking? Breaking long documents into smaller chunks improves embedding performance and retrieval accuracy.
  • Overlap: Adds contextual continuity between adjacent chunks.
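
To see the overlap in action, here is a small, purely illustrative run of chunk_text with tiny sizes (real calls use the 3000/200 defaults above):

# Illustrative only: tiny chunk_size/overlap so the effect is easy to inspect.
sample = "LangChain is a framework. It connects LLMs to data. It supports retrievers. " * 3
pieces = chunk_text(sample, chunk_size=80, overlap=20)
for i, piece in enumerate(pieces):
    print(f"chunk {i}: {piece!r}")
# Adjacent chunks share roughly the last `overlap` characters of the previous chunk,
# which helps keep sentences near a boundary intact in at least one chunk.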

Generating Embeddings

async def get_embedding(text: str) -> List[float]:
    return embedding_model.embed_query(text)
  • Purpose: Converts a text string into a vector representation.
  • Use Case: These embeddings allow for semantic similarity searches.
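
To get an intuition for what these vectors buy you, here is a rough, entirely illustrative comparison: semantically related texts should score a higher cosine similarity than unrelated ones.

import numpy as np

def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Cosine similarity between two embedding vectors.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embedding_model.embed_query("How do I create a vector store retriever?")
related_vec = embedding_model.embed_query("Building a retriever from a vector database")
unrelated_vec = embedding_model.embed_query("Best pasta recipes for a weeknight dinner")

print(cosine_similarity(query_vec, related_vec))    # expected to be noticeably higher
print(cosine_similarity(query_vec, unrelated_vec))  # expected to be lower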

Processing and Storing Data

async def process_chunk(chunk: str, chunk_number: int, url: str) -> ProcessedChunk:
    embedding = await get_embedding(chunk)
    metadata = {
        "source": "langchain_docs",
        "chunk_size": len(chunk),
        "crawled_at": datetime.now(timezone.utc).isoformat(),
        "url_path": urlparse(url).path
    }
    return ProcessedChunk(
        url=url,
        chunk_number=chunk_number,
        title=f"Chunk {chunk_number} from {url}",
        summary=chunk[:1000],
        content=chunk,
        metadata=metadata,
        embedding=embedding
    )

async def insert_chunk_to_chroma(chunk: ProcessedChunk):
    try:
        chroma_db.add_texts(
            texts=[chunk.content],
            metadatas=[chunk.metadata],
            ids=[f"{chunk.url}_chunk_{chunk.chunk_number}"]
        )
        print(f"Inserted chunk {chunk.chunk_number} for {chunk.url}")
    except Exception as e:
        print(f"Error inserting chunk into Chroma: {e}")
  • Processing Chunks: Each chunk is embedded and enriched with metadata.
  • Storing Chunks: Inserts the processed chunks into the Chroma vector store for later retrieval.
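
The crawler below hands each page off to a process_and_store_document helper that isn't shown in these snippets. A minimal sketch of what it could look like, simply tying chunk_text, process_chunk, and insert_chunk_to_chroma together (the real implementation in the repo may differ):

async def process_and_store_document(url: str, markdown: str):
    # Sketch: split the crawled markdown, embed each piece, and store it in Chroma.
    chunks = chunk_text(markdown)
    for i, chunk in enumerate(chunks):
        processed = await process_chunk(chunk, i, url)
        await insert_chunk_to_chroma(processed)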

Crawling Websites

async def crawl_parallel(urls: List[str], max_concurrent: int = 150):
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    try:
        semaphore = asyncio.Semaphore(max_concurrent)

        async def process_url(url: str):
            async with semaphore:
                result = await crawler.arun(url=url, config=crawl_config, session_id="session1")
                if result.success:
                    print(f"Successfully crawled: {url}")
                    await process_and_store_document(url, result.markdown_v2.raw_markdown)
                else:
                    print(f"Failed to crawl {url}: {result.error_message}")

        await asyncio.gather(*[process_url(url) for url in urls])
    finally:
        await crawler.close()
  • Parallel Crawling: Processes multiple URLs concurrently for efficiency.
  • Error Handling: Logs failed URLs for debugging.

Main Function

async def main():
    urls = get_langchain_docs_urls()
    if not urls:
        print("No URLs found to crawl.")
        return
    print(f"Found {len(urls)} URLs to crawl.")
    await crawl_parallel(urls)

if __name__ == "__main__":
    asyncio.run(main())
  • Workflow: Fetch URLs, process their content, and store it in the vector database.
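
get_langchain_docs_urls is also not shown here. Given the requests and ElementTree imports at the top of crawler.py, it most likely pulls the URL list from a documentation sitemap; the sketch below assumes that, and the sitemap address is only an example.

def get_langchain_docs_urls() -> List[str]:
    # Sketch: collect page URLs from a sitemap. The URL below is an assumption;
    # point it at whichever documentation site you want to crawl.
    sitemap_url = "https://python.langchain.com/sitemap.xml"
    try:
        response = requests.get(sitemap_url, timeout=30)
        response.raise_for_status()
        root = ElementTree.fromstring(response.content)
        namespace = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        return [loc.text for loc in root.findall(".//ns:loc", namespace)]
    except Exception as e:
        print(f"Error fetching sitemap: {e}")
        return []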

In Part 1, we explored how the crawler.py script retrieves and processes website data to create a vector store. Now, in Part 2, we’ll focus on the main.py script, which builds the query engine to interact with this vector store and answer user questions using Retrieval-Augmented Generation (RAG).

2. main.py: Querying the Vector Store

The main.py script provides the user-facing interface to interact with the vector store. It uses a combination of retrieval and generation to answer user questions effectively.

Loading the Vector Store

from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

VECTOR_STORE_DIR = "./vectorstore"
embedding_model = HuggingFaceEmbeddings(model_name="dunzhang/stella_en_1.5B_v5")
vector_db = Chroma(persist_directory=VECTOR_STORE_DIR, embedding_function=embedding_model)
  • Purpose: Connect to the vector store populated by crawler.py.
  • Embedding Model: Ensures the same model is used for querying as was used for embedding in the vector store.
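
Before wiring up the full pipeline, it can be worth confirming that the persisted store actually returns documents. A small, illustrative check (the query text is made up):

# Illustrative check that the persisted store returns documents.
results = vector_db.similarity_search("How do retrievers work?", k=3)
for doc in results:
    print(doc.metadata.get("url_path"), "-", doc.page_content[:80])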

Building the RAG Pipeline

import chainlit as cl
from langchain_ollama import ChatOllama  # may also live in langchain_community.chat_models, depending on your LangChain version
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.prompts import ChatPromptTemplate

def build_qa_pipeline():
    llm = ChatOllama(model="qwen2.5:14b", base_url="http://localhost:11434", streaming=True)

    # MultiQueryRetriever for diverse document retrieval
    retriever = MultiQueryRetriever.from_llm(
        retriever=vector_db.as_retriever(search_kwargs={"k": 10}),
        llm=llm
    )

    # Format document content into a unified context
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    qa_pipeline = (
        retriever
        | (lambda docs: {"context": format_docs(docs), "question": cl.user_session.get("current_query")})
        | ChatPromptTemplate.from_messages([
            ("system", "You are an expert assistant answering documentation queries."),
            ("human", "{context}\n\nQuestion: {question}\nAnswer:")
        ])
        | llm
    )
    return qa_pipeline

qa_pipeline = build_qa_pipeline()
  • Retriever: Fetches relevant chunks from the vector store based on user queries.
  • Prompt Template: Formats the context and user question for the language model.
  • Pipeline: Combines retrieval and generation for seamless RAG functionality.
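
Because the pipeline's context lambda reads the question from the Chainlit session, the full chain only runs cleanly inside the chat app. If you want to test retrieval on its own, a standalone sketch like the following works (it assumes a local Ollama server with the qwen2.5:14b model pulled):

# Sketch: inspect what the MultiQueryRetriever returns, outside any Chainlit session.
test_llm = ChatOllama(model="qwen2.5:14b", base_url="http://localhost:11434")
test_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(search_kwargs={"k": 10}),
    llm=test_llm,
)
docs = test_retriever.invoke("How do I persist a Chroma vector store?")
print(f"Retrieved {len(docs)} candidate chunks")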

Creating the Chatbot Interface

Using Chainlit, we build an interactive chatbot that users can query in real-time.

Chat Start

@cl.on_chat_start
async def start_chat():
    cl.user_session.set("qa_pipeline", qa_pipeline)
    await cl.Message(
        content="📖 **Welcome to the Documentation Assistant!**\n\nAsk me anything about the documentation."
    ).send()
  • Initialization: Sets up the RAG pipeline for the session.
  • Welcome Message: Greets the user with instructions.

Handling Messages

@cl.on_message
async def handle_message(message: cl.Message):
    qa_pipeline = cl.user_session.get("qa_pipeline")
    query = message.content

    # Store the current query so the pipeline's context lambda can read it
    cl.user_session.set("current_query", query)

    if query.lower() in ["exit", "quit"]:
        await cl.Message("Session ended. Come back anytime!").send()
        return

    # Initialize a streaming message
    msg = cl.Message(content="")

    # Stream the response as it is generated; the retriever at the head of the
    # pipeline expects the raw query string as input
    async for chunk in qa_pipeline.astream(query):
        if hasattr(chunk, "content"):
            await msg.stream_token(chunk.content)

    await msg.send()
  • Query Handling: Retrieves the query from the user and passes it to the pipeline.
  • Streaming Response: Sends the generated answer back to the user in real-time.
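
To try the bot locally, start the app with `chainlit run main.py`; this assumes Chainlit is installed and an Ollama server is running locally with the qwen2.5:14b model available.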

How It Works Together

  1. User Query: The user asks a question via the chatbot interface.
  2. Retrieval: Relevant chunks are fetched from the vector store using semantic similarity.
  3. Generation: The language model synthesizes a coherent answer based on the retrieved chunks.
  4. Response: The answer is streamed back to the user for a seamless experience.

Advantages of This Architecture

  • Efficiency: Combines the speed of vector-based retrieval with the power of language models.
  • Transparency: Retrieved chunks provide context for the answers.
  • Scalability: Can be extended to multiple domains and larger datasets.

Have ideas or improvements? Let’s connect and build something amazing together! 🚀

I know there is a lot more to improve in this explanation, and it is mostly code for now; I will try to update it as quickly as possible. Send your suggestions to my email (possible improvements or topics I should write an article about), and if you want to talk about this or any LLM or computer vision topics, message me on my LinkedIn here.
