Releases: mithun50/TreeDex

v0.1.5 — Smart Hierarchy Extraction for Large Documents

22 Mar 12:35

@mithun50

v0.1.5

3a31bf8

What's New

This release improves hierarchy extraction accuracy on large (300+ page) documents. Previously, subsections could be flattened into top-level sections. TreeDex now combines several strategies to preserve the correct depth.

New Features

PDF ToC extraction: If the PDF has bookmarks or an outline, the tree is built directly from them, so no LLM calls are needed and the hierarchy follows the document's own outline.
Font-size heading detection: Analyzes font sizes across the document and injects [H1]/[H2]/[H3] markers so the LLM can tell which level each heading belongs to.
Capped continuation context: For multi-chunk documents, the LLM sees a compact summary (top-level outline + last 30 sections) instead of the full history, cutting continuation-context tokens by about 78% (1,336 instead of 5,946 in the 20k-context case).
Orphan repair: If the LLM outputs "2.3.1" without a "2.3" parent, synthetic parents are inserted automatically to keep the tree valid.
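
The orphan repair idea can be illustrated with a minimal, standalone sketch. This is not TreeDex's implementation: the function name `repair_orphans_sketch`, the use of plain section-number strings, and the boolean "synthetic" flag are all assumptions made for illustration.

```python
def repair_orphans_sketch(numbers):
    """Insert any missing ancestors before their first descendant.

    Hypothetical helper: takes section numbers such as "2.3.1" in document
    order and returns (number, is_synthetic) pairs.
    """
    seen = set()
    repaired = []
    for number in numbers:
        parts = number.split(".")
        # Walk every ancestor prefix, e.g. "2.3.1" -> "2", then "2.3".
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in seen:
                seen.add(ancestor)
                repaired.append((ancestor, True))  # synthetic parent
        # Skip numbers already present (including earlier synthetic ones).
        if number not in seen:
            seen.add(number)
            repaired.append((number, False))
    return repaired


result = repair_orphans_sketch(["1", "2", "2.3.1", "2.3.2"])
for number, synthetic in result:
    print(number, "(synthetic)" if synthetic else "")
```

Here "2.3" is the only inserted entry, because "2" already exists and "2.3.2" shares the parent created for "2.3.1".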

New Exports

Function	Python	Node.js
Extract PDF ToC	`extract_toc(path)`	`await extractToc(path)`
ToC → sections	`toc_to_sections(toc)`	`tocToSections(toc)`
Repair orphans	`repair_orphans(sections)`	`repairOrphans(sections)`
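
As a rough illustration of the "ToC → sections" step, the sketch below turns a flat outline into hierarchically numbered sections. It does not call TreeDex: the `(level, title, page)` entry shape, the function name `toc_to_numbered_sections`, and the output dictionary keys are assumptions, and the library's actual data structures may differ.

```python
def toc_to_numbered_sections(toc):
    """Convert (level, title, page) outline entries into numbered sections.

    Hypothetical helper for illustration only; levels are 1-based.
    """
    counters = []  # counters[i] is the running index at depth i + 1
    sections = []
    for level, title, page in toc:
        # Going deeper: open counters for the new depth.
        while len(counters) < level:
            counters.append(0)
        # Going shallower: discard counters below the current depth.
        del counters[level:]
        counters[level - 1] += 1
        number = ".".join(str(c) for c in counters)
        sections.append({"number": number, "title": title, "page": page})
    return sections


toc = [
    (1, "Introduction", 1),
    (2, "Background", 2),
    (2, "Scope", 4),
    (1, "Methods", 6),
    (2, "Data", 7),
]
sections = toc_to_numbered_sections(toc)
for s in sections:
    print(s["number"], s["title"], s["page"])
```

In this sketch an outline that skips a level (for example, level 1 followed directly by level 3) would produce a `0` placeholder in the number, which is the kind of gap a later repair pass would need to handle.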

How It Helps

Before — LLM sees flat text, guesses hierarchy:

1 Introduction 1.1 Background Large Language Models...

After — LLM sees font-size markers, knows exact depth:

[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models...
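
One way to produce markers like the ones above is sketched below. It is a simplified illustration, not TreeDex's heading analysis: it assumes each line arrives as a `(text, font_size)` pair, treats the most frequent font size as body text, and maps the largest remaining sizes to [H1], [H2], and [H3]. The function name `annotate_headings` is hypothetical.

```python
from collections import Counter


def annotate_headings(lines, max_levels=3):
    """Prefix lines set in larger-than-body fonts with [H1]..[H3] markers."""
    sizes = Counter(size for _, size in lines)
    # Assumption: the most common font size is the body text size.
    body_size = sizes.most_common(1)[0][0]
    # Larger sizes, biggest first, become heading levels 1..max_levels.
    heading_sizes = sorted((s for s in sizes if s > body_size), reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes[:max_levels])}
    annotated = []
    for text, size in lines:
        level = level_of.get(size)
        annotated.append(f"[H{level}] {text}" if level else text)
    return annotated


lines = [
    ("Document Title", 20.0),
    ("1 Introduction", 16.0),
    ("1.1 Background", 13.0),
    ("Large Language Models...", 10.0),
    ("More body text.", 10.0),
    ("Even more body text.", 10.0),
]
annotated = annotate_headings(lines)
print("\n".join(annotated))
```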

Impact on Large Documents (1M+ tokens)

LLM Context	Groups	LLM Calls	Continuation Tokens
20k (default)	56	56	1,336 (was 5,946)
128k (large)	8	8	minimal
PDF with ToC	0	0	N/A
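
The drop in continuation tokens comes from capping what the LLM sees between chunks. The sketch below shows the general shape of such a cap (top-level outline plus the last 30 sections) and checks the reduction implied by the table. The function name, the `(number, title)` section shape, and the output format are assumptions for illustration, not TreeDex's actual prompt format.

```python
def build_continuation_context(sections, tail=30):
    """Build a compact summary: top-level outline plus the most recent sections.

    Hypothetical helper; sections are (number, title) pairs in document order.
    """
    top_level = [s for s in sections if "." not in s[0]]
    recent = sections[-tail:]
    lines = ["Top-level outline:"]
    lines += [f"{number} {title}" for number, title in top_level]
    lines.append(f"Last {len(recent)} sections:")
    lines += [f"{number} {title}" for number, title in recent]
    return "\n".join(lines)


# Synthetic document: 10 chapters with 19 subsections each (200 sections).
sections = []
for chapter in range(1, 11):
    sections.append((str(chapter), f"Chapter {chapter}"))
    for sub in range(1, 20):
        sections.append((f"{chapter}.{sub}", f"Topic {chapter}.{sub}"))

context = build_continuation_context(sections)
print(len(sections), "sections in full history")
print(len(context.splitlines()), "lines in capped context")

# Reduction implied by the table's 20k row: 5,946 -> 1,336 tokens.
reduction = (5946 - 1336) / 5946
print(f"{reduction:.1%} fewer continuation tokens")
```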

Full Changelog

pdf_parser / pdf-parser — ToC extraction, heading analysis, annotated text builder
tree_builder / tree-builder — toc_to_sections(), repair_orphans()
prompts — Updated to reference [H1]/[H2]/[H3] heading markers
core — ToC shortcut path, heading detection, capped context, orphan repair
loaders — Pass-through for detect_headings option
Tests — New test coverage for all new functionality
README.md — Updated How It Works section and API reference
how-treedex-works.svg — Updated pipeline diagram

19 files changed, 1,012 insertions(+), 102 deletions(-)

v0.1.4

01 Mar 12:36

@mithun50

v0.1.4

71f8991

What's New

Web Demo — Chat UI + Caching + Vercel Deploy

Chat-style UI: Two-panel layout with sidebar + chat bubbles (user/AI), typing indicator, collapsible sources
Main upload zone: File upload front-and-center in the chat area
Per-file progress: Upload multiple files with real-time indexing status
Disk cache (.cache/): Re-uploading the same file is instant; cached docs auto-restore on restart
Auto-retry on 429: Rate-limited Groq calls automatically wait and retry (up to 8 attempts)
Conversational fallback: Greetings get natural responses instead of "no info found"
Vercel serverless: Full api/ functions with client-side IndexedDB state
Mobile responsive: Sidebar hamburger menu, keyboard-aware input bar
dotenv config: .env.example with GROQ_API_KEY, PORT, LLM_MODEL, LLM_BASE_URL
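
The auto-retry behavior can be sketched as follows. The release notes only state that rate-limited calls wait and retry up to 8 attempts; the exponential backoff schedule, the `RateLimitError` class, and the `call_with_retry` helper are assumptions for illustration, written in Python even though the web demo itself runs on Node.js.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the LLM provider."""


def call_with_retry(fn, max_attempts=8, base_delay=1.0, sleep=time.sleep):
    """Call fn(), waiting and retrying when it raises RateLimitError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Assumed schedule: 1s, 2s, 4s, ... (exponential backoff).
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a call that is rate-limited twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


delays = []  # record delays instead of actually sleeping
answer = call_with_retry(flaky, sleep=delays.append)
print(answer, calls["n"], delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.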

Install

npm install treedex@0.1.4
pip install treedex==0.1.4

v0.1.2

01 Mar 09:51

@mithun50

v0.1.2

f9582aa

New Features

Agentic RAG mode — query(question, agentic=True) retrieves relevant sections then generates a direct LLM answer. Available in both Python and Node.js.
Multi-document support — Web demo now supports uploading and querying across multiple documents simultaneously
Answer prompt — New answerPrompt / ANSWER_PROMPT template for answer generation

Bug Fixes

Fix page range assignment — Sections starting on the same page no longer get inverted ranges (e.g. "pages 10-9"), which caused empty context text
Fix PDF loading — Convert Buffer to Uint8Array for pdfjs-dist compatibility
Fix multer file upload — Preserve original file extension for format detection
Increase LLM timeout — Default 5 min (configurable via timeout option on OpenAICompatibleLLM)
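
To make the page range bug concrete: if a section's end page is taken as "next section's start page minus one", two sections that start on the same page yield an inverted range such as 10-9. The sketch below shows one way to avoid that by clamping the end page. It is an assumption about the shape of the fix, not the code TreeDex uses, and `assign_page_ranges` is a hypothetical name.

```python
def assign_page_ranges(start_pages, last_page):
    """Return (start, end) page ranges for sections given their start pages."""
    ranges = []
    for i, start in enumerate(start_pages):
        if i + 1 < len(start_pages):
            # Naive end: the page before the next section starts.
            end = start_pages[i + 1] - 1
        else:
            end = last_page
        # Clamp so a section never ends before it starts.
        ranges.append((start, max(start, end)))
    return ranges


# Sections 2 and 3 both start on page 10.
ranges = assign_page_ranges([8, 10, 10, 12], last_page=15)
print(ranges)
```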

Other

Shorter retrieval reasoning (one sentence instead of verbose paragraph)
Improved answer prompt with explicit instructions to extract facts from context
Web demo switched from NVIDIA Kimi K2.5 to Groq for fast inference
Updated Colab notebook with agentic mode examples
Updated README with agentic RAG documentation

Full Changelog: v0.1.1...v0.1.2

v0.1.1

01 Mar 09:12

@mithun50

v0.1.1

b186fa7

Bug Fixes

Fix PDF loading — Convert Buffer to Uint8Array for pdfjs-dist compatibility
Fix web demo file upload — Preserve file extension after multer upload so autoLoader can detect format
Increase LLM request timeout — Default timeout raised from 2 min to 5 min; now configurable via timeout option on OpenAICompatibleLLM

Other

Web demo now uses published treedex@0.1.1 from npm instead of local file link

Full Changelog: v0.1.0...v0.1.1

v0.1.0 — Initial Release

01 Mar 02:54

@mithun50

v0.1.0

ecb6cf8

TreeDex v0.1.0

Tree-based, vectorless document RAG framework.

Highlights

Tree-based indexing — preserves document hierarchy (chapters, sections, subsections)
18+ LLM backends — Gemini, OpenAI, Claude, Groq, Together AI, Fireworks, DeepSeek, Ollama, and any OpenAI-compatible endpoint
Zero vector dependencies — no embeddings, no vector DB, just JSON
Exact page attribution — every answer traces back to source pages
4 document formats — PDF, TXT, HTML, DOCX

Install

pip install treedex

Quick Start

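from treedex import TreeDex, GeminiLLM

# Create an LLM backend (replace YOUR_KEY with a real API key)
llm = GeminiLLM(api_key="YOUR_KEY")

# Build the tree index from a document
index = TreeDex.from_file("document.pdf", llm=llm)

# Query the index, then print the retrieved context and its source pages
result = index.query("What is the main argument?")
print(result.context)
print(result.pages_str)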

What's Included

treedex/ — Core library (pdf_parser, tree_builder, loaders, llm_backends, prompts, core)
examples/ — Quick start examples + sample index
tests/ — Full test suite
benchmarks/ — TreeDex vs ChromaDB vs Naive comparison (auto-run in CI)
assets/ — SVG charts auto-generated from real benchmarks
