Releases: mithun50/TreeDex

v0.1.5 — Smart Hierarchy Extraction for Large Documents

22 Mar 12:35
@mithun50 mithun50

What's New

This release fixes hierarchy extraction accuracy on large (300+ page) documents. Previously, subsections would get flattened to top-level sections — now TreeDex uses multiple strategies to maintain correct depth.

New Features

  • PDF ToC extraction — If the PDF has bookmarks/outline, the tree is built directly from them — zero LLM calls needed, perfect hierarchy every time
  • Font-size heading detection — Analyzes font sizes across the document and injects [H1]/[H2]/[H3] markers so the LLM knows exactly which level each heading belongs to
  • Capped continuation context — For multi-chunk documents, the LLM sees a compact summary (top-level outline + last 30 sections) instead of the full history — 78% fewer tokens wasted on context
  • Orphan repair — If the LLM outputs "2.3.1" without a "2.3" parent, synthetic parents are auto-inserted to maintain a valid tree
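The orphan repair step can be illustrated with a minimal sketch. This is not TreeDex's implementation: the function name repair_orphans_sketch, the dict keys ("number", "title", "synthetic"), and the placeholder titles are all assumptions made for illustration. It only shows the idea of inserting a missing "2.3" before "2.3.1".

```python
def repair_orphans_sketch(sections):
    """Insert placeholder parents for numbered sections whose parent is missing.

    Hypothetical illustration; the section shape used here is an assumption.
    """
    seen = {s["number"] for s in sections}
    repaired = []
    for sec in sections:
        parts = sec["number"].split(".")
        # Walk the ancestors from shallowest to deepest: "2", then "2.3" for "2.3.1".
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in seen:
                seen.add(ancestor)
                repaired.append(
                    {"number": ancestor, "title": f"Section {ancestor}", "synthetic": True}
                )
        repaired.append(sec)
    return repaired


sections = [
    {"number": "1", "title": "Introduction"},
    {"number": "2", "title": "Methods"},
    {"number": "2.3.1", "title": "Sampling"},
]
fixed = repair_orphans_sketch(sections)
print([s["number"] for s in fixed])  # ['1', '2', '2.3', '2.3.1']
```

The sketch inserts a synthetic parent immediately before its first orphaned child, so the output stays in document order as long as the input was.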

New Exports

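| Function        | Python                   | Node.js                  |
|-----------------|--------------------------|--------------------------|
| Extract PDF ToC | extract_toc(path)        | await extractToc(path)   |
| ToC → sections  | toc_to_sections(toc)     | tocToSections(toc)       |
| Repair orphans  | repair_orphans(sections) | repairOrphans(sections)  |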
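To make the "ToC → sections" step concrete, here is a rough sketch of how outline entries might be turned into numbered sections. The input shape (level, title, page) and the output keys are assumptions, not the documented format of toc_to_sections(), and the helper name toc_to_sections_sketch is hypothetical. It also assumes outline levels never jump by more than one.

```python
def toc_to_sections_sketch(toc):
    """Turn (level, title, page) outline entries into numbered sections.

    Hypothetical illustration; assumes levels start at 1 and never skip a depth.
    """
    counters = []  # counters[i] is the running index at depth i + 1
    sections = []
    for level, title, page in toc:
        # Going one level deeper starts a fresh counter for that depth.
        while len(counters) < level:
            counters.append(0)
        # Returning to a shallower level discards the deeper counters.
        del counters[level:]
        counters[level - 1] += 1
        number = ".".join(str(c) for c in counters)
        sections.append({"number": number, "title": title, "level": level, "page": page})
    return sections


toc = [
    (1, "Introduction", 1),
    (2, "Background", 2),
    (2, "Scope", 4),
    (1, "Methods", 7),
    (2, "Data", 8),
]
numbered = toc_to_sections_sketch(toc)
print([s["number"] for s in numbered])  # ['1', '1.1', '1.2', '2', '2.1']
```

Because the hierarchy comes straight from the outline levels, no LLM call is involved in this path.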

How It Helps

Before — LLM sees flat text, guesses hierarchy:

1 Introduction 1.1 Background Large Language Models...

After — LLM sees font-size markers, knows exact depth:

[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models...
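The marker injection shown above can be sketched as follows. This is a simplified illustration, not TreeDex's heading analysis: it assumes each line arrives as a (text, font_size) pair, treats the most frequent font size as body text, and maps the largest remaining sizes to [H1], [H2], [H3]. The function name and the sample "Document Title" line are hypothetical.

```python
from collections import Counter


def annotate_headings_sketch(lines, max_levels=3):
    """Prefix lines set in larger-than-body fonts with [H1]/[H2]/[H3] markers.

    Hypothetical illustration; `lines` is a list of (text, font_size) pairs.
    """
    sizes = Counter(size for _, size in lines)
    body_size = sizes.most_common(1)[0][0]  # most frequent size is treated as body text
    # Distinct sizes above body text, largest first, capped at max_levels.
    heading_sizes = sorted((s for s in sizes if s > body_size), reverse=True)[:max_levels]
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    out = []
    for text, size in lines:
        level = level_of.get(size)
        out.append(f"[H{level}] {text}" if level else text)
    return "\n".join(out)


lines = [
    ("Document Title", 20.0),
    ("1 Introduction", 16.0),
    ("1.1 Background", 13.0),
    ("Large Language Models...", 10.0),
    ("More body text...", 10.0),
]
annotated = annotate_headings_sketch(lines)
print(annotated)
```

A real document would need tolerance for near-equal font sizes and bold or italic body runs; the sketch ignores both.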

Impact on Large Documents (1M+ tokens)

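| LLM Context   | Groups | LLM Calls | Continuation Tokens |
|---------------|--------|-----------|---------------------|
| 20k (default) | 56     | 56        | 1,336 (was 5,946)   |
| 128k (large)  | 8      | 8         | minimal             |
| PDF with ToC  | 0      | 0         | N/A                 |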
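The drop in continuation tokens comes from the capped continuation context described in this release: a top-level outline plus the last 30 sections instead of the full history. A minimal sketch of building such a summary is below. The function name, the section shape, and the output wording are assumptions; only the "top-level outline + last 30 sections" rule comes from the release notes.

```python
def capped_context_sketch(sections, tail=30):
    """Summarize prior sections as a top-level outline plus the most recent entries.

    Hypothetical illustration; sections are dicts with "number" and "title".
    """
    # Top-level sections have no dot in their number ("3", not "3.2").
    top_level = [s for s in sections if "." not in s["number"]]
    recent = sections[-tail:] if tail > 0 else []
    lines = ["Top-level outline:"]
    lines += [f'{s["number"]} {s["title"]}' for s in top_level]
    lines.append(f"Last {len(recent)} sections:")
    lines += [f'{s["number"]} {s["title"]}' for s in recent]
    return "\n".join(lines)


# 10 chapters with 9 subsections each: 100 sections in total.
history = []
for c in range(1, 11):
    history.append({"number": str(c), "title": f"Chapter {c}"})
    for i in range(1, 10):
        history.append({"number": f"{c}.{i}", "title": f"Section {c}.{i}"})

context = capped_context_sketch(history)
print(len(history), "sections summarized in", len(context.splitlines()), "lines")
```

The context size now grows with the number of top-level sections rather than with the total number of sections seen so far.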

Full Changelog

  • pdf_parser / pdf-parser — ToC extraction, heading analysis, annotated text builder
  • tree_builder / tree-builder — toc_to_sections(), repair_orphans()
  • prompts — Updated to reference [H1]/[H2]/[H3] heading markers
  • core — ToC shortcut path, heading detection, capped context, orphan repair
  • loaders — Pass-through for detect_headings option
  • Tests — New test coverage for all new functionality
  • README.md — Updated How It Works section and API reference
  • how-treedex-works.svg — Updated pipeline diagram

19 files changed, 1,012 insertions(+), 102 deletions(-)

v0.1.4

01 Mar 12:36
@mithun50 mithun50

What's New

Web Demo — Chat UI + Caching + Vercel Deploy

  • Chat-style UI: Two-panel layout with sidebar + chat bubbles (user/AI), typing indicator, collapsible sources
  • Main upload zone: File upload front-and-center in the chat area
  • Per-file progress: Upload multiple files with real-time indexing status
  • Disk cache (.cache/): Re-uploading the same file is instant; cached docs auto-restore on restart
  • Auto-retry on 429: Rate-limited Groq calls automatically wait and retry (up to 8 attempts)
  • Conversational fallback: Greetings get natural responses instead of "no info found"
  • Vercel serverless: Full api/ functions with client-side IndexedDB state
  • Mobile responsive: Sidebar hamburger menu, keyboard-aware input bar
  • dotenv config: .env.example with GROQ_API_KEY, PORT, LLM_MODEL, LLM_BASE_URL
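The auto-retry behavior on rate limits can be sketched in a few lines. This is an illustration in Python rather than the web demo's own code: RateLimitError is a hypothetical stand-in for an HTTP 429 response, and the exponential backoff schedule is an assumption, since the notes only state that calls wait and retry up to 8 attempts.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the LLM provider."""


def call_with_retry(fn, max_attempts=8, base_delay=1.0, sleep=time.sleep):
    """Call fn(), waiting and retrying when it raises RateLimitError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff (1s, 2s, 4s, ...); the schedule is an assumption.
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a call that is rate-limited twice before succeeding.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"


result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the retry loop testable without real waiting.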

Install

npm install treedex@0.1.4
pip install treedex==0.1.4

v0.1.2

01 Mar 09:51
@mithun50 mithun50

New Features

  • Agentic RAG mode — query(question, agentic=True) retrieves relevant sections, then generates a direct LLM answer. Available in both Python and Node.js.
  • Multi-document support — Web demo now supports uploading and querying across multiple documents simultaneously
  • Answer prompt — New answerPrompt / ANSWER_PROMPT template for answer generation

Bug Fixes

  • Fix page range assignment — Sections starting on the same page no longer get inverted ranges (e.g. "pages 10-9"), which caused empty context text
  • Fix PDF loading — Convert Buffer to Uint8Array for pdfjs-dist compatibility
  • Fix multer file upload — Preserve original file extension for format detection
  • Increase LLM timeout — Default 5 min (configurable via timeout option on OpenAICompatibleLLM)
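The inverted-range bug is easy to reproduce: if a section's end page is computed as "next section's start page minus one", two sections starting on page 10 yield "pages 10-9" for the first. One plausible fix, sketched below, clamps the end page so it never precedes the start. The function name and input shape are hypothetical, and the actual fix in TreeDex may differ.

```python
def assign_page_ranges_sketch(starts, last_page):
    """Give each section a (start, end) page range from its start page.

    Hypothetical illustration; `starts` is the ordered list of section start pages.
    """
    ranges = []
    for i, start in enumerate(starts):
        # The last section runs to the end of the document.
        next_start = starts[i + 1] if i + 1 < len(starts) else last_page + 1
        # Clamp so a section sharing its start page with the next one
        # gets (10, 10) instead of the inverted (10, 9).
        end = max(start, next_start - 1)
        ranges.append((start, end))
    return ranges


page_ranges = assign_page_ranges_sketch([1, 10, 10, 14], last_page=20)
print(page_ranges)  # [(1, 9), (10, 10), (10, 13), (14, 20)]
```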

Other

  • Shorter retrieval reasoning (one sentence instead of verbose paragraph)
  • Improved answer prompt with explicit instructions to extract facts from context
  • Web demo switched from NVIDIA Kimi K2.5 to Groq for fast inference
  • Updated Colab notebook with agentic mode examples
  • Updated README with agentic RAG documentation

Full Changelog: v0.1.1...v0.1.2

v0.1.1

01 Mar 09:12
@mithun50 mithun50

Bug Fixes

  • Fix PDF loading — Convert Buffer to Uint8Array for pdfjs-dist compatibility
  • Fix web demo file upload — Preserve file extension after multer upload so autoLoader can detect format
  • Increase LLM request timeout — Default timeout raised from 2 min to 5 min; now configurable via timeout option on OpenAICompatibleLLM

Other

  • Web demo now uses published treedex@0.1.1 from npm instead of local file link

Full Changelog: v0.1.0...v0.1.1

v0.1.0 — Initial Release

01 Mar 02:54
@mithun50 mithun50

TreeDex v0.1.0

Tree-based, vectorless document RAG framework.

Highlights

  • Tree-based indexing — preserves document hierarchy (chapters, sections, subsections)
  • 18+ LLM backends — Gemini, OpenAI, Claude, Groq, Together AI, Fireworks, DeepSeek, Ollama, and any OpenAI-compatible endpoint
  • Zero vector dependencies — no embeddings, no vector DB, just JSON
  • Exact page attribution — every answer traces back to source pages
  • 4 document formats — PDF, TXT, HTML, DOCX

Install

pip install treedex

Quick Start

from treedex import TreeDex, GeminiLLM

llm = GeminiLLM(api_key="YOUR_KEY")
# Build the tree index from a document
index = TreeDex.from_file("document.pdf", llm=llm)
# Retrieve the sections relevant to a question
result = index.query("What is the main argument?")
print(result.context)
print(result.pages_str)

What's Included

  • treedex/ — Core library (pdf_parser, tree_builder, loaders, llm_backends, prompts, core)
  • examples/ — Quick start examples + sample index
  • tests/ — Full test suite
  • benchmarks/ — TreeDex vs ChromaDB vs Naive comparison (auto-run in CI)
  • assets/ — SVG charts auto-generated from real benchmarks