05
RAG Knowledge Base
A Telegram bot that turns uploaded PDFs into a searchable knowledge base. Ask questions in plain text - the bot retrieves relevant passages and generates grounded answers with source references.
Python · LangChain · ChromaDB · OpenAI · Docker
"Upload a PDF. Ask anything. Get answers with page references."
1000 chars per chunk, 200 overlap
top 4 chunks retrieved per question
1536-dim embedding vector size
local ChromaDB - no external vector DB
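To make the chunk numbers concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is an illustration, not the project's splitter: the function name `chunk_text` is hypothetical, and the real ingest step may use a LangChain text splitter that also respects paragraph and sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters.

    Hypothetical helper for illustration; defaults mirror the stats above.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # with the defaults, a new window starts every 800 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, the last 200 characters of one chunk are repeated at the start of the next, so a sentence that straddles a boundary appears whole in at least one chunk.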
How it works
PDF upload → PyMuPDF extract → Chunk + overlap → text-embedding-3-small → ChromaDB → GPT-4o-mini → Answer + sources
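The retrieval step in the middle of this pipeline boils down to a nearest-neighbour lookup over embedding vectors. The sketch below shows the idea with toy 2-D vectors and cosine similarity. Both function names are hypothetical, and cosine similarity is an assumption made for illustration: the metric ChromaDB actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float],
          indexed: list[tuple[str, list[float]]],
          k: int = 4) -> list[str]:
    """Return the texts of the k chunks whose vectors are closest to the query."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,  # highest similarity first
    )
    return [text for text, _ in ranked[:k]]
```

In the real bot the vectors are 1536-dimensional and the search runs inside the vector store, but the ranking logic is the same in spirit.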
Project Structure
rag/ingest.py - PDF → PyMuPDF → chunks → embeddings → ChromaDB
rag/retriever.py - question → similarity search → GPT-4o-mini → answer + sources
bot/handlers.py - Telegram handlers: /start /list /clear, PDF upload, questions
data/chroma/ - ChromaDB persistent storage, mounted as Docker volume
Notable detail
The model is explicitly instructed to answer only from retrieved context, not from its training data, and to say so when the answer is not in the context. This grounds responses in your documents and sharply reduces hallucination on out-of-context questions, although a prompt instruction alone cannot rule it out entirely.
Code
# retriever.py - the full RAG pipeline in one function
import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment

TOP_K = 4  # chunks retrieved per question
# `vectorstore` is the persistent Chroma store, assumed to be initialised elsewhere in this module.


def ask(question: str) -> str:
    # 1. Embed the question in the same vector space as the chunks
    #    and fetch the TOP_K nearest ones
    docs = vectorstore.similarity_search(question, k=TOP_K)
    if not docs:
        return "No relevant documents found."

    # 2. Build context from the retrieved chunks
    context = "\n\n".join(d.page_content for d in docs)

    # 3. GPT-4o-mini is instructed to answer ONLY from context - not from training data
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Answer only using the context below. "
                "If the answer is not in the context, say so."},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )

    # 4. Deduplicate sources (keeping retrieval order) and append to answer
    sources = list(dict.fromkeys(
        f"{d.metadata['source']} (page {d.metadata['page']})"
        for d in docs
    ))
    answer = response.choices[0].message.content
    return answer + "\n\n📎 Sources:\n" + "\n".join(f" • {s}" for s in sources)

Features
- Upload any text-based PDF directly in Telegram - no web UI needed
- Chunking with 200-character overlap - context preserved across chunk boundaries
- Semantic search across all indexed documents simultaneously
- GPT-4o-mini is instructed to answer strictly from retrieved context, not from training data
- Every answer includes source filename and page number
- Multi-document support - a single question can draw on several PDFs at once
- Duplicate detection - prevents indexing the same file twice
- Guards for scanned PDFs, oversized files (500+ pages), and non-PDF uploads
- /list and /clear commands for knowledge base management
- ChromaDB persisted via Docker volume - data survives restarts
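One common way to implement the duplicate detection listed above is to fingerprint each upload by its content hash. The sketch below assumes that approach. The class name `IngestRegistry` and its in-memory dictionary are hypothetical, and the bot may instead compare filenames or store the fingerprint in ChromaDB metadata.

```python
import hashlib


class IngestRegistry:
    """Remembers which file contents have already been indexed (illustrative only)."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # SHA-256 digest -> first filename seen

    def register(self, filename: str, data: bytes) -> bool:
        """Return True if the file is new and should be indexed, False if it is a duplicate."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._seen:
            return False  # same bytes were indexed before, possibly under another name
        self._seen[digest] = filename
        return True
```

Hashing the bytes rather than comparing names means a renamed copy of the same PDF is still caught, while a revised file with the same name is indexed as new. A persistent version would need to store the digests alongside the vector data so they survive restarts.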