Stop searching your documents. Start asking them questions.
Arkive is an enterprise-ready RAG (Retrieval-Augmented Generation) knowledge base that lets you upload documents and ask natural language questions, getting accurate, cited answers grounded in your actual files.
Try it at arkive.tianakayemba.dev, no setup required!
Upload your documents and Arkive will:
- Answer questions in natural language: ask anything, get cited answers backed by your documents
- Extract text from PDFs, DOCX, and TXT files: including tables, grade breakdowns, schedules, and structured data
- Cite every answer: each response shows exactly which source and page the information came from
- Filter by document: scope your question to a specific file when you have multiple documents loaded
- Rate confidence: each answer gets a High, Medium, or Low confidence badge based on source relevance
- Preview documents: click any file in the library to view its full content with relevant passages highlighted
- Track your session: live stats for documents indexed, chunks stored, queries run, and average relevance
- Copy answers: one-click copy on any AI response for pasting into emails or reports
- Validate uploads: catches password-protected PDFs, empty files, oversized files, and corrupted documents with clear error messages
- Log every query: full observability via Langfuse, with latency, token usage, retrieved sources, and answers logged per query
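As a rough illustration of the confidence badge, the sketch below maps the best relevance score of the retrieved sources to a High, Medium, or Low label. The function name, the assumption that scores lie between 0 and 1, and the 0.75 / 0.5 thresholds are all hypothetical; the cut-offs Arkive actually uses are not stated here.

```python
from typing import List


def confidence_label(relevance_scores: List[float],
                     high: float = 0.75,
                     medium: float = 0.5) -> str:
    """Map the top relevance score to a badge (thresholds are assumptions)."""
    if not relevance_scores:
        # Nothing was retrieved, so there is nothing to ground the answer in.
        return "Low"
    top = max(relevance_scores)
    if top >= high:
        return "High"
    if top >= medium:
        return "Medium"
    return "Low"
```

Using the top score rather than the mean keeps a single strongly matching passage from being diluted by weaker ones; either choice would be defensible.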
| Layer | Technology |
|---|---|
| Backend | Python, FastAPI, Uvicorn |
| Vector Database | ChromaDB (local, persistent) |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| AI | Anthropic Claude API (claude-sonnet-4-6) |
| Document Parsing | pypdf, python-docx, chardet |
| Observability | Langfuse (query logging, latency, token tracking) |
| Frontend | React, Vite, Tailwind CSS v4 |
| HTTP Client | Axios |
| Deployment | Railway, custom domain |
git clone https://github.com/t-skayemba/arkive.git
cd arkive
cd backend
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Note: Python 3.12 is required. Python 3.13+ is not yet supported by all dependencies.
cp .env.example .env
Open .env and fill in your keys:
ANTHROPIC_API_KEY=your_anthropic_key_here
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_HOST=https://cloud.langfuse.com
Get an Anthropic key at console.anthropic.com. Get a free Langfuse account at cloud.langfuse.com.
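Before starting the backend, it can help to confirm that all four keys are actually set. The snippet below is a small standalone check, not part of Arkive itself; `missing_env_keys` is a hypothetical helper, and it only inspects the process environment (it does not parse the .env file).

```python
import os
from typing import List, Mapping, Optional

# The four variables listed in .env.example above.
REQUIRED_KEYS = (
    "ANTHROPIC_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def missing_env_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_env_keys()` with no arguments checks the current environment; an empty list means every key has a value.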
uvicorn main:app --reload --port 8000
cd ../frontend
npm install
npm run dev
Visit http://localhost:5173
- Upload a document: drag and drop or click the upload zone. Supports PDF, DOCX, and TXT up to 20 MB
- Ask a question: type any natural language question and press Enter
- Filter by document: use the dropdown above the input bar to search within a specific file
- Review the answer: a cited answer with source cards showing which passages were used
- Preview the document: click any file in the Library to view full text with cited passages highlighted
- Delete a document: hover over a file in the Library and click the trash icon
All settings are in backend/config.py:
| Setting | Default | Description |
|---|---|---|
| chunk_size | 600 | Characters per text chunk |
| chunk_overlap | 150 | Overlap between chunks for context continuity |
| top_k_results | 25 | Max chunks retrieved per query |
| embedding_model | all-MiniLM-L6-v2 | Local embedding model |
| claude_model | claude-sonnet-4-6 | Claude model used for generation |
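To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration under the defaults above, not the code in document_processor.py, which may split on sentence or paragraph boundaries instead; `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 150) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # 450 characters with the defaults
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each new chunk starts 450 characters after the previous one, so the last 150 characters of one chunk reappear at the start of the next.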
Arkive validates every upload before indexing:
| Check | Limit | Error shown |
|---|---|---|
| File type | PDF, DOCX, TXT only | Unsupported file type |
| File size | Max 20MB | File too large |
| Empty file | Must have content | File is empty |
| Password-protected PDF | Not supported | Password-protected PDF |
| Corrupted/image-only PDF | Must have extractable text | Could not read PDF |
| Corrupted DOCX | Must be valid Word document | Not a valid Word document |
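The checks in the table can be sketched with the standard library alone. The version below covers file type, size, emptiness, and a basic DOCX container check; the password-protected and unreadable-PDF checks are left out because they need a PDF parser. `validate_upload` is a hypothetical name, and reading "20MB" as 20 × 1024 × 1024 bytes is an assumption.

```python
import io
import zipfile
from pathlib import Path
from typing import Optional

MAX_BYTES = 20 * 1024 * 1024  # assumption: "20MB" means 20 MiB
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def validate_upload(filename: str, data: bytes) -> Optional[str]:
    """Return an error message from the table above, or None if the basic checks pass."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return "Unsupported file type"
    if len(data) > MAX_BYTES:
        return "File too large"
    if not data.strip():
        return "File is empty"
    # A DOCX file is a ZIP archive, so a non-ZIP payload cannot be a valid one.
    # This is only a partial check: a ZIP that is not a Word document still passes.
    if ext == ".docx" and not zipfile.is_zipfile(io.BytesIO(data)):
        return "Not a valid Word document"
    return None
```

Running the cheap checks (extension, size) before the content checks means an oversized or unsupported file is rejected without parsing it.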
Every query is logged to Langfuse with:
- The question asked and any document filter applied
- Number of chunks retrieved and top relevance score
- The full prompt sent to Claude
- Claude's response
- Input and output token counts
- End-to-end latency in milliseconds
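The fields listed above could be gathered into a single record per query, roughly as follows. This sketch deliberately does not call the Langfuse SDK; `QueryLog`, its field names, and the `timed` helper are hypothetical and only show the shape of the data and one way to measure latency in milliseconds.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class QueryLog:
    """One logged query; field names are illustrative, not Arkive's schema."""
    question: str
    document_filter: Optional[str]
    chunks_retrieved: int
    top_relevance: float
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0
```

Wrapping the whole retrieve-and-generate call in `timed` yields an end-to-end figure; timing retrieval and generation separately would need two calls.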
| What | Where |
|---|---|
| Uploaded files | Stored locally in backend/data/uploads/ |
| Vector embeddings | Stored locally in backend/data/chroma_db/ |
| Query processing | Sent to Anthropic's API to generate answers |
| Query logs | Sent to Langfuse for observability |
Files and vectors never leave your machine. Query text and relevant document excerpts are sent to Anthropic's API for answer generation, and each logged query (including the prompt and response) is sent to Langfuse. See Anthropic's Privacy Policy for details.
arkive/
├── backend/
│   ├── main.py                    # FastAPI app entry point
│   ├── config.py                  # All settings in one place
│   ├── requirements.txt           # Python dependencies
│   ├── .env                       # API keys (not committed)
│   ├── .env.example               # Template for required keys
│   ├── routers/
│   │   ├── documents.py           # Upload, list, delete, preview endpoints
│   │   └── query.py               # Question answering endpoint
│   ├── services/
│   │   ├── document_processor.py  # PDF/DOCX/TXT extraction and chunking
│   │   ├── embeddings.py          # sentence-transformers embedding service
│   │   └── rag_engine.py          # Vector search + Claude + Langfuse logging
│   ├── models/
│   │   └── schemas.py             # Pydantic data models
│   └── data/
│       ├── uploads/               # Uploaded document files
│       └── chroma_db/             # ChromaDB vector store
└── frontend/
    └── src/
        ├── App.jsx                # Root layout and state
        ├── utils/
        │   └── api.js             # Axios API client
        └── components/
            ├── DocumentUpload.jsx   # Drag and drop upload with validation
            ├── DocumentLibrary.jsx  # File list with delete and preview
            ├── DocumentPreview.jsx  # Full text modal with highlights
            ├── QueryInterface.jsx   # Chat UI with document filter dropdown
            └── SourceCard.jsx       # Citation card component
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Health check |
| POST | /documents/upload | Upload and index a document |
| GET | /documents/list | List all indexed documents |
| DELETE | /documents/{id} | Remove a document |
| GET | /documents/{id}/content | Get full document text for preview |
| POST | /query/ | Ask a question, optionally filter by document_id |
| GET | /query/health | Check how many chunks are available |
Interactive docs available at http://localhost:8000/docs when the backend is running.
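As a rough example of calling the query endpoint from a script, the sketch below builds a POST request to /query/ with the standard library. The `question` field name is an assumption (only `document_id` is named above), and `build_query_request` is a hypothetical helper; check the interactive docs for the actual request schema.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # local backend started with uvicorn


def build_query_request(question: str,
                        document_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /query/."""
    payload = {"question": question}  # assumed field name
    if document_id is not None:
        payload["document_id"] = document_id  # optional per-document filter
    return urllib.request.Request(
        f"{BASE_URL}/query/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the backend running, the request can be sent with `urllib.request.urlopen(build_query_request("What is the late policy?"))` and the JSON body read from the response.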
- Multi-document filtering (search across a selected subset of documents)
- Folder/collection grouping for large document libraries
- Re-index all documents button
- OCR support for scanned PDFs
- Fully local LLM option via Ollama for air-gapped deployments
Built by Tiana Kayemba
MIT