MDT Library - RAG Chatbot - Experimental Project
This project was developed and tested as part of our experimental lane and is made as a simple proof of concept to get familiar with and experiment with the various parts of a RAG setup.
The example use-case is a chatbot, allowing us to interact with our Confluence IT documentation in natural language. As a fun twist, it is named after our recently retired colleague Peter, who has been a great source of knowledge.
Key Findings & Experimentation Details
To learn more about our experiments and findings, please refer to our presentation given during the WUR Model & Data Day.
Collaboration Invitation
We believe that innovation thrives on collaboration. We invite LLM enthusiasts, researchers, and developers to join the LLM Focus Group. Your insights and shared experiments are highly welcome. Feel free to contact Cristina Huidiu for more information: cristina.huidiu@wur.nl.
We're currently planning multiple concrete challenges that still need to be tackled, and will be reaching out for volunteers in due time.
Disclaimer
Please be aware that the code in this repository is provided "as-is" without any support or guarantee. Use it, modify it, or improve it as you see fit, but understand that this is an experimental initiative.
Warnings ⚠️
- This project has been created mostly in my free time and is for educational purposes only. It is not intended for production use. The code is not optimized for performance or security, nor does it have proper error handling or any test coverage.
- This project allows you to use your own data. Note that the quality of the input data is crucial for the quality of the final output.
- An important todo that is still open: implement a way to structurally assess the quality and reliability of the whole RAG chain and its individual components.
- Please note that `allow_dangerous_deserialization` is set to true for loading a previously created FAISS index from disk (see the sketch below this list).
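For context, the snippet below is a minimal sketch of how a previously saved FAISS index is typically reloaded with LangChain, which is where this flag comes in. The paths and the embedding model are assumptions for illustration, not the project's exact code.

```python
# Minimal sketch, assuming LangChain's FAISS wrapper; paths and embedding model are placeholders.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local(
    "storage/faiss-index",                 # hypothetical cache location
    embeddings,
    allow_dangerous_deserialization=True,  # the index is stored as a pickle, so only load files you created yourself
)
```

In short: only enable this flag for index files you built and stored yourself.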
Used packages
- Huggingface Transformers to download & load the LLM model
- Default model: Gemma 2 9B (using 4-bit quantization; see the sketch below this list)
- Streamlit for the chatbot interface
- Langchain for:
- PDF, HTML, Markdown Document loaders
- Embeddings Encoder
- FAISS DB for similarity search
- SBert Sentence Transformers for cross encoder & reranking
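As an illustration of how these packages fit together, the sketch below loads the default Gemma 2 9B model with 4-bit quantization via Transformers. It is an assumption about the setup, not a copy of the project's loading code, and downloading Gemma requires the Hugging Face access token described under installation.

```python
# Illustrative sketch only: an assumed way to load the default model with 4-bit quantization,
# not the project's actual loading code. Requires the bitsandbytes package alongside transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-9b-it"  # the default model mentioned above

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place the quantized layers on the available GPU
)
```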
Main components
The project consists of two important main components:
- A pipeline/script to build the RAG index from a list of documents.
- An interface to interact with the RAG index using natural language.
Note: this project does not link to a Confluence instance, but uses a simple index and a few example documents to demonstrate the RAG concept. This allows you to easily test and keep control over the data.
Setting up a synchronization pipeline to Confluence can be done using the Confluence API but is not part of this project.
Architectural overview
Note: the diagrams below are a high-level overview and might skip some details for the sake of simplicity.
Indexing pipeline
RAG
Folder structure
Primary folders:
- `documents`: Should contain the documents to be indexed by the vector DB, including a required `index.yaml` file with metadata.
- `src`: Contains the source code.
- `config`: Config files for LLM models, including the system prompts to use.
- `storage`: Cache for the FAISS index and downloaded LLM models.
Other folders:
- `.streamlit`: Streamlit UI config.
- `docs`: Anything related to the README.md documentation.
Components used to build the RAG index
- `src/rag/simple_document_index.py`: Loads the document index as defined in the `documents/index.yaml` file, including metadata.
- `src/rag/pdf_processor.py`: Extracts text from PDF files, splits it into chunks of a predefined size, and appends metadata.
- `src/rag/encoder.py`: Used to encode the text chunks into embeddings.
- `src/rag/faiss_db.py`: Used to store the embeddings in a FAISS database for fast similarity search (see the sketch after this list).
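To make the flow concrete, here is a rough sketch of how such an indexing pipeline can be wired up with LangChain: load PDFs, chunk, embed, and store in FAISS. The loader, splitter settings, embedding model, and paths are assumptions for illustration; the actual logic lives in the scripts listed above.

```python
# Rough sketch of the indexing flow; chunk sizes, model names and paths are placeholders.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Load and chunk a document.
docs = PyPDFLoader("documents/example.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Encode the chunks and persist the FAISS index to the storage cache folder.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(chunks, embeddings)
db.save_local("storage/faiss-index")  # hypothetical cache location
```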
Components used for the Chat interface
- `src/app.py`: Main entrypoint: the Streamlit app that allows you to interact with the chatbot.
- `src/model/context_retriever.py`: Retrieves the most relevant context from the RAG index given a question (see the sketch after this list).
  - `src/rag/faiss_db.py`: Used to retrieve the most similar context from the FAISS database.
  - `src/rag/reranker.py`: Reranks the retrieved contexts using an SBert cross encoder.
  - `src/rag/context_extender.py`: Extends the context to the whole source document if enabled.
- `src/utils/response_generator.py`: Generates the response given a question and context.
  - `src/model/prompt_builder.py`: Generates the system prompt for the LLM model.
  - `src/model/llm_model.py`: Loads the LLM model and generates answers given a question and context. Returns output as a stream.
  - `src/utils/audio_generator.py`: (Optional, not used without configuration.) Automatically plays an audio fragment for given words. Was used as a fun extra.
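The retrieval half of this flow can be sketched roughly as below, using LangChain's FAISS wrapper and an SBert cross encoder from sentence-transformers. The model names, example question, and the k/threshold values are placeholders, not the project's defaults.

```python
# Simplified sketch of retrieval + reranking; the project's own modules may differ.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from sentence_transformers import CrossEncoder

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local("storage/faiss-index", embeddings, allow_dangerous_deserialization=True)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example reranker model

question = "How do I request VPN access?"  # example question
candidates = db.similarity_search(question, k=10)  # k1: initial similarity search
scores = cross_encoder.predict([(question, d.page_content) for d in candidates])

# k2 / threshold: keep only the best-scoring contexts above a cut-off.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
context = [doc for score, doc in ranked[:3] if score > 0.0]
```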
How to install
Prerequisites
- Python 3.8 or higher
- An Nvidia GPU with at least 16GB of VRAM
- CUDA drivers
- Pytorch with CUDA support (see Pytorch installation guide)
Installation
- Clone the repository
- Install the required packages with `pip install -r requirements.txt`. (Note: pip is used here. You need to create a venv yourself if desired.)
- Manually install the correct Pytorch version with GPU support (see the Pytorch installation guide). You'll only need the `torch` package and can skip `torchvision` and `torchaudio`.
- Create an `.env` file in the root of the project with the following content: `HF_ACCESS_TOKEN=<put your Hugging Face access token here>` (required for downloading models).
How to run
Step 1. Creating a RAG index
- Add the documents you want to index to the `documents` folder. Currently supported file extensions: .pdf, .html, .md
- Copy `documents/example_index.yaml` to `documents/index.yaml` and add your documents, including metadata (see the illustrative example below this list). This metadata will be used to show reference links below answers.
- Run the `build-rag-index.py` script to create the RAG index from your documents.
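For reference, an index entry could look roughly like the snippet below. The field names here are purely illustrative; check `documents/example_index.yaml` for the actual schema expected by the indexing script.

```yaml
# Purely illustrative; the real field names are defined by documents/example_index.yaml.
documents:
  - file: vpn-manual.pdf                               # hypothetical document in the documents folder
    title: "VPN access manual"                         # shown as a reference link below answers
    url: "https://example.org/confluence/vpn-manual"   # hypothetical source link
```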
Step 2. Running the chatbot
- You can start the chatbot using `streamlit run src/app.py`.
- This will start a Streamlit server at `http://localhost:8501`.
Explanation of settings in Chatbot sidebar
LLM
- max_new_tokens: The maximum number of tokens the LLM model can generate in one go.
RAG & Reranking
- k1 (rag): The number of similar contexts to retrieve from the FAISS database.
- k2 (rerank): The number of contexts to keep after reranking has been applied.
- threshold (rerank): The threshold for the SBert cross encoder to consider a context relevant. Results below this threshold will be discarded.
- Expand context: If enabled, the found context chunks will be expanded to the whole source documents. Generally produces better results, but can cause context-length issues.
Other
- Funny response chance: The chance (0.0-1.0) that a funny prompt will be used instead of the default prompt. (Disabled by default.)
Debug
- Show RAG context: Displays an overview of retrieved documents at various steps in the RAG pipeline.
- Disable LLM response: Disables the LLM response generation. Useful for debugging the RAG pipeline without running inference on the LLM model.
Using a different LLM model
LLM models can easily be swapped using a config. As different models react differently to our system prompts, you can also define custom system prompt templates for each model. See the `/config` folder for examples.
Supplied models
The following model configurations are supplied with the project:
| Model | Multi-language support | Parameters | Max. context length | General notes |
|---|---|---|---|---|
| gemma-2b-it.yaml | Very poor | 2B | 8192 | Very small but fast model, with fast and decent responses. |
| gemma-2-2b-it.yaml | Poor | 2B | 8192 | Smallest Gemma 2 version, a bit more powerful than the above model. |
| gemma-2-9b-it.yaml (default) | Decent/Good | 9B | 8192 | Gemma 2 9B version. More powerful, with decent multi-language support. Tested with EN/NL/DE. |
Add a `MODEL_CONFIG_FILE=name_of_the_model_config.yaml` entry to the `.env` file in the root of your project to switch to a different model.
How to add a new model
- Create a new model yaml config file in the `/config` folder. You can use `config/gemma-2-9b-it.yaml` as an example (a purely illustrative sketch follows below this list). Note that different models might require different system prompts. The system prompts can also be defined in the config file.
- Add `MODEL_CONFIG_FILE=name_of_your_config.yaml` to the `.env` file in the root of your project.
- (Re)start the chatbot to use the new model.
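As a purely illustrative sketch, a custom model config could look something like the snippet below. The keys shown here are assumptions; the real ones are defined by the project's config loader, so copy `config/gemma-2-9b-it.yaml` and adjust it rather than relying on this sketch.

```yaml
# Illustrative only: key names are hypothetical, check config/gemma-2-9b-it.yaml for the real schema.
model_id: "mistralai/Mistral-7B-Instruct-v0.3"   # hypothetical Hugging Face model name
system_prompt: |
  You are Peter, a helpful assistant answering questions about our IT documentation.
  Use only the provided context to answer.
```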
See Hugging Face for available models: https://huggingface.co/models?pipeline_tag=text-generation
License
The MIT License (MIT). Please see License File for more information.