Pavel Skvortsov

RAG Knowledge Base

A Telegram bot that turns uploaded PDFs into a searchable knowledge base. Ask questions in plain text - the bot retrieves relevant passages and generates grounded answers with source references.

Python · LangChain · ChromaDB · OpenAI · Docker

"Upload a PDF. Ask anything. Get answers with page references."

  • 1000 chars per chunk, 200 overlap
  • top 4 chunks retrieved per question
  • 1536-dim embedding vector size
  • local ChromaDB - no external vector DB
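The 1000-chars-per-chunk / 200-overlap numbers above describe a sliding window over the extracted text. A dependency-free sketch of the idea (the project itself presumably uses LangChain's text splitter; this helper is illustrative):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` chars, each sharing
    `overlap` chars with the previous window."""
    step = size - overlap
    # max(..., 1) ensures even a short text yields one chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means a sentence cut at one chunk boundary still appears intact at the start of the next chunk, so retrieval never loses it.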
How it works
PDF upload → PyMuPDF extract → chunk + overlap → text-embedding-3-small → ChromaDB → GPT-4o-mini → answer + sources
Project Structure
rag/ingest.py - PDF → PyMuPDF → chunks → embeddings → ChromaDB
rag/retriever.py - question → similarity search → GPT-4o-mini → answer + sources
bot/handlers.py - Telegram handlers: /start /list /clear, PDF upload, questions
data/chroma/ - ChromaDB persistent storage, mounted as a Docker volume
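The source references in answers only work because ingest.py attaches provenance to every chunk before embedding. A hedged sketch of that bookkeeping (the function name and record shape are illustrative - the project stores the same `source`/`page` fields via LangChain metadata):

```python
def to_records(pages: list[str], source: str,
               size: int = 1000, overlap: int = 200) -> list[dict]:
    """Chunk each page's text and tag every chunk with its
    originating file name and 1-based page number."""
    step = size - overlap
    records = []
    for page_no, text in enumerate(pages, start=1):
        for i in range(0, max(len(text) - overlap, 1), step):
            records.append({
                "text": text[i:i + size],
                "metadata": {"source": source, "page": page_no},
            })
    return records
```

Because the page number rides along with each chunk, the retriever can later cite "report.pdf (page 12)" without re-opening the PDF.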
Notable detail

The model is explicitly instructed to answer only from retrieved context - not from its training data. This grounds every response in your documents and sharply reduces hallucination: when a question falls outside the indexed material, the bot says the answer isn't in the knowledge base instead of guessing.

Code
# retriever.py - the full RAG pipeline in one function
import openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

TOP_K = 4  # chunks retrieved per question

vectorstore = Chroma(
    persist_directory="data/chroma",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

def ask(question: str) -> str:
    # 1. Embed the question in the same vector space as the chunks
    docs = vectorstore.similarity_search(question, k=TOP_K)

    if not docs:
        return "No relevant documents found."

    # 2. Build context from the top-4 retrieved chunks
    context = "\n\n".join(d.page_content for d in docs)

    # 3. GPT-4o-mini answers ONLY from context - not from training data
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Answer only using the context below. "
                "If the answer is not in the context, say so."},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )

    # 4. Deduplicate sources (sorted for stable output) and append to the answer
    sources = sorted({
        f"{d.metadata['source']} (page {d.metadata['page']})"
        for d in docs
    })
    answer = response.choices[0].message.content
    return answer + "\n\n📎 Sources:\n" + "\n".join(f" • {s}" for s in sources)
Screenshots
[demo screenshot]
Features
  • Upload any text-based PDF directly in Telegram - no web UI needed
  • Smart chunking with overlap - context preserved across chunk boundaries
  • Semantic search across all indexed documents simultaneously
  • GPT-4o-mini answers strictly from retrieved context, sharply reducing hallucination from training data
  • Every answer includes source filename and page number
  • Multi-document support - questions span across multiple PDFs at once
  • Duplicate detection - prevents re-indexing the same file twice
  • Guards for scanned PDFs, oversized files (500+ pages), and non-PDF uploads
  • /list and /clear commands for knowledge base management
  • ChromaDB persisted via Docker volume - data survives restarts
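The duplicate-detection feature could be implemented by fingerprinting each upload before indexing. A minimal sketch under that assumption (the helper and class names are illustrative, not the project's actual code):

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the raw PDF bytes."""
    return hashlib.sha256(data).hexdigest()

class KnowledgeBase:
    def __init__(self):
        self.seen: set[str] = set()  # fingerprints of already-indexed files

    def add(self, data: bytes) -> bool:
        """Index the file unless an identical one was indexed before."""
        fp = file_fingerprint(data)
        if fp in self.seen:
            return False  # duplicate - skip re-indexing
        self.seen.add(fp)
        # ... extract, chunk, embed, store ...
        return True
```

Hashing the raw bytes catches exact re-uploads cheaply; persisting the fingerprint set alongside ChromaDB would keep the guard working across restarts.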