Pavel Skvortsov

RAG Knowledge Base

A Telegram bot that turns uploaded PDFs into a searchable knowledge base. Ask questions in plain text - the bot retrieves relevant passages and generates grounded answers with source references.

Python · LangChain · ChromaDB · OpenAI · Docker

"Upload a PDF. Ask anything. Get answers with page references."

  • 1000 chars per chunk, 200 overlap
  • top 4 chunks retrieved per question
  • 1536-dim embedding vector size
  • local ChromaDB - no external vector DB
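The 1000-chars-per-chunk / 200-overlap numbers above describe a sliding window over the extracted text. A dependency-free sketch of the idea (the project itself presumably uses LangChain's text splitter; this helper is illustrative):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` chars, each sharing
    `overlap` chars with the previous window."""
    step = size - overlap
    # max(..., 1) ensures even a short text yields one chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means a sentence cut at one chunk boundary still appears intact at the start of the next chunk, so retrieval never loses it.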
How it works
PDF upload → PyMuPDF extract → chunk + overlap → text-embedding-3-small → ChromaDB → GPT-4o-mini → answer + sources
Project Structure
rag/ingest.py - PDF → PyMuPDF → chunks → embeddings → ChromaDB
rag/retriever.py - question → similarity search → GPT-4o-mini → answer + sources
bot/handlers.py - Telegram handlers: /start /list /clear, PDF upload, questions
data/chroma/ - ChromaDB persistent storage, mounted as a Docker volume
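The source references in answers only work because ingest.py attaches provenance to every chunk before embedding. A hedged sketch of that bookkeeping (the function name and record shape are illustrative - the project stores the same `source`/`page` fields via LangChain metadata):

```python
def to_records(pages: list[str], source: str,
               size: int = 1000, overlap: int = 200) -> list[dict]:
    """Chunk each page's text and tag every chunk with its
    originating file name and 1-based page number."""
    step = size - overlap
    records = []
    for page_no, text in enumerate(pages, start=1):
        for i in range(0, max(len(text) - overlap, 1), step):
            records.append({
                "text": text[i:i + size],
                "metadata": {"source": source, "page": page_no},
            })
    return records
```

Because the page number rides along with each chunk, the retriever can later cite "report.pdf (page 12)" without re-opening the PDF.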
Notable detail

The model is explicitly instructed to answer only from retrieved context - not from its training data. This grounds every response in your documents and sharply reduces hallucination: when a question falls outside the indexed material, the bot says the answer isn't in the knowledge base instead of guessing.

Code
# retriever.py - the full RAG pipeline in one function
import openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

TOP_K = 4  # chunks retrieved per question

vectorstore = Chroma(
    persist_directory="data/chroma",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

def ask(question: str) -> str:
    # 1. Embed the question in the same vector space as the chunks
    docs = vectorstore.similarity_search(question, k=TOP_K)

    if not docs:
        return "No relevant documents found."

    # 2. Build context from the top-4 retrieved chunks
    context = "\n\n".join(d.page_content for d in docs)

    # 3. GPT-4o-mini answers ONLY from context - not from training data
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Answer only using the context below. "
                "If the answer is not in the context, say so."},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )

    # 4. Deduplicate sources (sorted for stable output) and append to the answer
    sources = sorted({
        f"{d.metadata['source']} (page {d.metadata['page']})"
        for d in docs
    })
    answer = response.choices[0].message.content
    return answer + "\n\n📎 Sources:\n" + "\n".join(f" • {s}" for s in sources)
Screenshots
[demo screenshot]
Features
  • Upload any text-based PDF directly in Telegram - no web UI needed
  • Smart chunking with overlap - context preserved across chunk boundaries
  • Semantic search across all indexed documents simultaneously
  • GPT-4o-mini answers strictly from retrieved context, sharply reducing hallucination from training data
  • Every answer includes source filename and page number
  • Multi-document support - questions span across multiple PDFs at once
  • Duplicate detection - prevents re-indexing the same file twice
  • Guards for scanned PDFs, oversized files (500+ pages), and non-PDF uploads
  • /list and /clear commands for knowledge base management
  • ChromaDB persisted via Docker volume - data survives restarts
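The duplicate-detection feature could be implemented by fingerprinting each upload before indexing. A minimal sketch under that assumption (the helper and class names are illustrative, not the project's actual code):

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the raw PDF bytes."""
    return hashlib.sha256(data).hexdigest()

class KnowledgeBase:
    def __init__(self):
        self.seen: set[str] = set()  # fingerprints of already-indexed files

    def add(self, data: bytes) -> bool:
        """Index the file unless an identical one was indexed before."""
        fp = file_fingerprint(data)
        if fp in self.seen:
            return False  # duplicate - skip re-indexing
        self.seen.add(fp)
        # ... extract, chunk, embed, store ...
        return True
```

Hashing the raw bytes catches exact re-uploads cheaply; persisting the fingerprint set alongside ChromaDB would keep the guard working across restarts.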