██╗ ██╗██████╗ ██╗ ███████╗███╗ ██╗███████╗
██║ ██╔╝██╔══██╗ ██║ ██╔════╝████╗ ██║██╔════╝
█████╔╝ ██████╔╝ ██║ █████╗ ██╔██╗ ██║███████╗
██╔═██╗ ██╔══██╗ ██║ ██╔══╝ ██║╚██╗██║╚════██║
██║ ██╗██████╔╝ ███████╗███████╗██║ ╚████║███████║
╚═╝ ╚═╝╚═════╝ ╚══════╝╚══════╝╚═╝ ╚═══╝╚══════╝
═══════════════════════════════════════════════════════════
Knowledge Base Lens · Code & Document Intelligence
═══════════════════════════════════════════════════════════
A progressive-disclosure knowledge base generator for large codebases and document collections. KBLens uses tree-sitter to extract AST skeletons from source code, and markitdown to convert documents from various formats (PDF, DOCX, PPTX, HTML, etc.) to Markdown. Both are packed into LLM-friendly batches and summarized into hierarchical Markdown — giving AI assistants structured context without reading every file.
When doing vibe coding — using AI assistants (Cursor, Copilot, OpenCode, etc.) to write and refactor code through natural language — the AI needs to understand your codebase's architecture. But large codebases (100K+ files) are too big for LLMs to consume directly. Without structured context, AI assistants either hallucinate or say "I don't know" when asked about internal systems.
The same problem applies to document collections — internal wikis, technical docs, design specs, API references. They contain critical knowledge but are scattered across formats and too large for LLMs to ingest as-is.
KBLens solves both by generating a three-layer knowledge base from your actual source code and documents:
L0 INDEX.md Project overview + package directory
L1 packages/engine.md Per-package component listing and architecture
L2 packages/engine/ Per-component: purpose, key types, public APIs, dependencies
This gives AI assistants a reliable, searchable reference — like an always-up-to-date architecture document generated from actual code and docs. Point your AI tool at the knowledge base, and it can answer questions like "how does the physics system work?" or "what's the configuration reference for deployment?" without reading every source file.
- Dual mode — Processes both code (via tree-sitter AST extraction) and documents (via markitdown conversion + section splitting) through the same pipeline
- AST-based code extraction — Uses tree-sitter to extract class/struct/enum/function signatures from C++, C#, Python, TypeScript, and JavaScript source files. No guessing, no hallucination.
- Document format support — PDF, DOCX, PPTX, XLSX, HTML, CSV, EPUB, and more via markitdown. Documents are converted to Markdown, split by heading level, and summarized.
- Hybrid output — LLM generates concise summaries, while raw content (AST signatures for code, original text for documents) is appended directly. Zero truncation, minimal LLM output tokens.
- Hierarchical summaries — Three levels of detail (project → package → component) with progressive disclosure. Ask about a package, get the overview. Ask about a class or document section, get the details.
- Incremental updates — Only regenerates components whose source files changed. Tracks changes via a per-component hash of file path, mtime, and size. A full run on 200+ components takes ~5 minutes; incremental runs take seconds.
- Local LLM support — Works with local models via llama.cpp, Ollama, or any OpenAI-compatible API. Includes thinking-mode detection for models like Qwen3.5 and DeepSeek-R1.
- Change detection — Five-way classification (unchanged / changed / new / deleted / failed) with automatic cleanup of orphaned files and cascade updates to affected packages.
- Multi-source projects — One config file can define multiple source directories with different types (code or document). Each source gets its own independent knowledge base.
- Concurrent generation — Processes 8 components in parallel with 8 concurrent LLM calls. Includes exponential backoff retry (3 attempts) for transient failures.
- Browser viewer — Built-in `kblens serve` command starts a local HTTP server to browse the knowledge base in your browser with syntax-highlighted code, Markdown rendering, and a tree navigation sidebar. Supports viewing multiple knowledge bases (code + docs) simultaneously.
- Resume from interruption — Progress is persisted after each component. Ctrl+C and re-run to continue where you left off.
- Live dashboard — Rich terminal UI showing real-time progress, active components, token usage, and error count.
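
The concurrency limit and retry behavior listed above could look roughly like the sketch below. It is not KBLens's actual implementation: `call_with_retry`, `run_all`, `base_delay`, and the jitter term are hypothetical names and choices made for illustration, and catching every `Exception` is a simplification that real code would narrow to transient errors.

```python
import asyncio
import random

MAX_ATTEMPTS = 3      # mirrors the "3 attempts" mentioned above
MAX_CONCURRENT = 8    # mirrors the concurrency figure mentioned above


async def call_with_retry(make_call, max_attempts=MAX_ATTEMPTS, base_delay=1.0):
    """Await make_call(), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))


async def run_all(jobs, limit=MAX_CONCURRENT):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await call_with_retry(job)

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Setting `limit=1` gives the serial behavior that suits a single local model server.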
- Python 3.11+
- C compiler — Required by tree-sitter for grammar compilation (GCC, Clang, or MSVC)
  - On Ubuntu/Debian: `sudo apt install build-essential`
  - On macOS: Xcode Command Line Tools (`xcode-select --install`)
  - On Windows: Visual Studio Build Tools or MinGW
# From PyPI (code knowledge base only)
pip install kblens

# With document format support (PDF, DOCX, PPTX, HTML, etc.)
pip install 'kblens[docs]'

# With full document format support (all markitdown backends)
pip install 'kblens[docs-all]'

# Upgrade to latest version
pip install --upgrade kblens

# Or install from GitHub directly
pip install git+https://github.com/disrei/KBLens.git

# Or clone and install in development mode
git clone https://github.com/disrei/KBLens.git
cd KBLens
pip install -e .            # code only
pip install -e ".[docs]"    # + document support

# Verify
kblens version
| Extra | Command | What It Adds |
|---|---|---|
| (none) | `pip install kblens` | Code KB (C++, C#, Python, TS/JS) + documents (.md, .txt only) |
| `docs` | `pip install 'kblens[docs]'` | + PDF, DOCX, PPTX, XLSX, HTML, CSV, EPUB via markitdown |
| `docs-all` | `pip install 'kblens[docs-all]'` | + all markitdown optional backends |
| `dev` | `pip install 'kblens[dev]'` | + pytest, ruff |
kblens init
This walks you through creating ~/.config/kblens/config.yaml with your source paths and LLM settings.
Or create it manually:
# ~/.config/kblens/config.yaml
version: 1
project: "my_project"
output_dir: "~/kblens_kb/my_project"

sources:
  # Code source — uses AST extraction
  - path: "/path/to/src"
    name: "source-code"
  # Document source — uses markitdown + section splitting
  - path: "/path/to/docs"
    name: "project-docs"
    type: "document"

llm:
  model: "gpt-4o-mini"
  # api_key: "your-api-key"  # see "API Key Security" below
  temperature: 0.2

summary_language: "en"
kblens generate --dry-run
This scans your source, extracts AST / document sections, and reports statistics without calling the LLM.
kblens generate
For a project with ~200 components, expect ~5 minutes and ~400K input tokens.
kblens add-kb
This command adds the current working directory into ~/.config/kblens/config.yaml and immediately generates that directory's knowledge base.
- If the current directory is not in the config yet, KBLens appends it under `sources` and runs generation for that source only.
- If the current directory already exists in the config, KBLens tells you and runs an incremental update for that source.
- If the global config file does not exist yet, KBLens creates one automatically with a default `output_dir` of `~/kblens_kb`.
The generated knowledge base is a directory of Markdown files. You can:
- Browse in browser — Run `kblens serve` to open a local viewer with syntax highlighting, tree navigation, and Markdown rendering
- Browse directly — Open `INDEX.md` and navigate through the hierarchy
- Search with grep — Find any class, function, or concept across all summaries
- Integrate with AI tools — Point your coding assistant's skill/tool at the knowledge base directory (see AI Assistant Integration below)
Each kblens generate run appends its output directory to the KBLENS_KB_PATH environment variable. Run multiple generations (e.g., one for code, one for docs), then a single kblens serve shows everything together.
# After running generate for each knowledge base, just:
kblens serve

# Or explicitly specify directories:
kblens serve --kb ~/kblens_kb/my_project

# Browse multiple knowledge bases (code + docs) together
kblens serve --kb ~/kblens_kb/code_output --kb ~/kblens_kb/doc_output

# Use a specific config file to locate the output directory
kblens serve --config kblens.yaml
The viewer starts a local HTTP server (default port 9753) with:
- Left sidebar — Collapsible tree showing all sources, packages, and components
- Right content — Markdown rendered with GitHub-dark styling and syntax-highlighted code blocks
- Multi-source — All KBs from `KBLENS_KB_PATH` and `--kb` flags are merged; all sources appear in the sidebar
KBLens can generate knowledge bases from document collections — technical docs, wikis, design specs, API references, etc.
| Format | Extensions | Requirement |
|---|---|---|
| Markdown | `.md` | Built-in (no extra deps) |
| Plain text | `.txt` | Built-in |
| PDF | `.pdf` | `pip install 'kblens[docs]'` |
| Word | `.docx`, `.doc` | `pip install 'kblens[docs]'` |
| PowerPoint | `.pptx` | `pip install 'kblens[docs]'` |
| Excel | `.xlsx`, `.xls` | `pip install 'kblens[docs]'` |
| HTML | `.html`, `.htm` | `pip install 'kblens[docs]'` |
| CSV | `.csv` | `pip install 'kblens[docs]'` |
| EPUB | `.epub` | `pip install 'kblens[docs]'` |
| Jupyter | `.ipynb` | `pip install 'kblens[docs]'` |
Format conversion is powered by markitdown (Microsoft).
The document pipeline replaces the AST extraction phase with:
- Convert — Non-Markdown files are converted to Markdown via markitdown
- Section Extract — Markdown is split by heading level (default: `##`) into sections
- Image Handling — Image references are preserved as `[Image: alt text](path)` for searchability
The rest of the pipeline (packing, LLM summarization, aggregation, writing) is shared with the code path.
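
To make the section-extract step concrete, here is a minimal sketch of splitting Markdown at a given heading level. It is an illustration only, not KBLens's code: `split_sections` and its return shape are hypothetical, and skipping headings inside fenced code blocks is an assumption about sensible behavior rather than documented behavior.

```python
import re


def split_sections(markdown: str, level: int = 2):
    """Split Markdown into (heading, body) pairs at headings of exactly `level`.

    Text before the first matching heading is returned with heading None.
    """
    heading = re.compile(r"^#{%d}\s+(.+?)\s*$" % level)
    sections = []
    title, body = None, []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside a fence are not headings
        match = None if in_fence else heading.match(line)
        if match:
            # Close the section collected so far (skip an empty preamble).
            if title is not None or any(l.strip() for l in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(1), []
        else:
            body.append(line)
    if title is not None or any(l.strip() for l in body):
        sections.append((title, "\n".join(body).strip()))
    return sections
```

With `level=2`, deeper headings such as `###` stay inside their parent section, which matches the idea of one section per `##` heading.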
sources:
  - path: "/path/to/docs"
    name: "project-docs"
    type: "document"             # Required: tells KBLens to use document pipeline
    section_level: 2             # Split on ## headings (default: 2)
    image_handling: "reference"  # Keep image refs (default: "reference", or "ignore")
Each leaf node in the document knowledge base has two sections:
# Component Name

## Topic Summary
What this documentation covers and its purpose.

## Key Concepts and Definitions
Important terms, entities, and definitions.

## Actionable Information
Steps, commands, configurations, reference data.

## Related Topics
Connections to other documents.

---

## Original Content

### From: filename.md#section-heading
(Complete original text preserved for precise retrieval)
The LLM summary enables navigation and topic matching, while the original content below the --- separator allows precise retrieval and direct quoting.
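
A consumer of these files can separate the two parts by cutting at the first `---` line. The helper below is a small sketch under that assumption; `split_leaf` is a hypothetical name and is not part of KBLens.

```python
def split_leaf(markdown: str):
    """Return (summary, original_content) split at the first '---' line.

    If no separator line is present, the whole text is treated as summary.
    """
    lines = markdown.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "---":
            summary = "\n".join(lines[:index]).strip()
            original = "\n".join(lines[index + 1:]).strip()
            return summary, original
    return markdown.strip(), ""
```

An assistant could match a question against the summary first and only load the original content when it needs an exact quote.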
Confluence is supported as a preprocessing workflow, not as a native source.type inside KBLens.
Use the bundled confluence_crawler.py script to:
- Authenticate to Confluence via REST API
- Recursively fetch a page and its child pages
- Convert Confluence HTML storage content to Markdown
- Save the result as a local
.mdtree - Point
kblens generateat that output directory as a normal document source
This separation is intentional:
- KBLens core focuses on local file inputs (`.md`, `.pdf`, `.docx`, etc.)
- Confluence crawling is a remote-fetch/auth problem, not just a document conversion problem
- markitdown can help with HTML -> Markdown after fetching, but it does not replace the Confluence API step
Example usage:
python confluence_crawler.py "https://confluence.example.com/display/SPACE/Page+Title" \
--depth 3 \
--output ./confluence_docs

Then use the crawled output as a regular document source:

sources:
  - path: "./confluence_docs"
    name: "confluence-docs"
    type: "document"
Notes:
- The crawler is currently a standalone utility script in the repo root
- It is not wired into the `kblens` CLI yet
- It is not included as a formal source type like `code` or `document`
KBLens works well with locally deployed LLMs for privacy-sensitive or cost-free usage.
# Example: llama.cpp with Qwen3.5-9B
llama-server -m model.gguf -c 65536 --n-gpu-layers 99 --flash-attn on \
-b 2048 -ub 512 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 -np 1

llm:
  model: "openai/your-model-name"
  api_base: "http://localhost:8080/v1"
  api_key: "not-needed"
  temperature: 0.2
  max_concurrent: 1              # Serial execution for local LLMs
  max_concurrent_components: 1

packing:
  token_budget: 20000            # Larger batches = fewer LLM calls
Models with built-in "thinking mode" (Qwen3.5, DeepSeek-R1, etc.) may output reasoning tokens instead of actual content by default. KBLens automatically detects this and shows a fix:
LLM returned empty content but has reasoning_content — the model is in
'thinking mode'. Disable thinking in your kblens config:
  llm:
    extra_body:
      chat_template_kwargs:
        enable_thinking: false
Add the suggested extra_body configuration to disable thinking mode:
llm:
  model: "openai/Qwen3.5-9B"
  api_base: "http://localhost:8080/v1"
  api_key: "not-needed"
  extra_body:
    chat_template_kwargs:
      enable_thinking: false
The extra_body field passes arbitrary parameters to the LLM API, making it compatible with any server-specific options.
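
The detection described above boils down to one check: the reply carries reasoning text but no content. The sketch below shows that check against an OpenAI-style chat completion represented as a plain dict. `detect_thinking_mode` is a hypothetical name, and the exact response shape KBLens inspects is an assumption.

```python
def detect_thinking_mode(response: dict) -> bool:
    """Return True when a reply has reasoning text but no actual content."""
    choices = response.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    return not content and bool(reasoning)
```

A reply with both fields empty is not flagged, since that points to a different failure than thinking mode.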
Never commit API keys to version control. Use one of these methods:
- Environment variable (recommended): `export KBLENS_LLM_KEY=sk-your-key-here`
- Local config override — Create a `.local.yaml` sibling next to your config file:

  # ~/.config/kblens/config.local.yaml (gitignored)
  llm:
    api_key: "sk-your-key-here"

- Config `api_key_env` reference — Point to any environment variable:

  llm:
    api_key_env: "MY_OPENAI_KEY"
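
One way to picture how these three methods combine is a small lookup function. This is a sketch, not KBLens's resolver: `resolve_api_key` is a hypothetical name, and while `KBLENS_LLM_KEY` overriding the config is documented, checking `api_key_env` before a literal `api_key` is an assumption.

```python
import os


def resolve_api_key(llm_config: dict, env=None):
    """Pick the API key from the environment or the merged `llm` config."""
    env = os.environ if env is None else env
    # 1. KBLENS_LLM_KEY overrides whatever the config says.
    key = env.get("KBLENS_LLM_KEY")
    if key:
        return key
    # 2. api_key_env names another environment variable (assumed to come next).
    env_name = llm_config.get("api_key_env")
    if env_name and env.get(env_name):
        return env[env_name]
    # 3. Fall back to a literal api_key, e.g. from a .local.yaml sibling.
    return llm_config.get("api_key")
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.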
KBLens uses a two-layer config system:
| Layer | Location | Purpose |
|---|---|---|
| Global | `~/.config/kblens/config.yaml` | Shared LLM settings, packing parameters |
| Project | `./kblens.yaml` in project root | Project-specific sources and output |
Project config overrides global config. Each layer can have a .local.yaml sibling for sensitive values (API keys).
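
The layering can be sketched as a recursive dictionary merge over already-parsed configs. This is illustrative only: `deep_merge` and `effective_config` are hypothetical names, and both the key-by-key merge of nested sections and the exact order of the `.local.yaml` layers are assumptions. The only documented rule is that project config overrides global config.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_config(global_cfg, global_local, project_cfg, project_local):
    """Apply layers from lowest to highest priority (assumed order)."""
    result = {}
    for layer in (global_cfg, global_local, project_cfg, project_local):
        result = deep_merge(result, layer or {})
    return result
```

With this approach a project file can change `llm.model` while still inheriting `llm.temperature` and an API key from the global layer.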
version: 1
project: "my_project"                    # Project name (displayed in CLI)
output_dir: "~/kblens_kb/my_project"     # Knowledge base output root

sources:                                 # Source directories to scan
  - path: "/path/to/src"                 # Absolute path
    name: "core"                         # Short name (used as subdirectory)
    # type: "code"                       # Default: "code" (AST extraction)
  - path: "/path/to/docs"
    name: "docs"
    type: "document"                     # Document pipeline (markitdown + sections)
    section_level: 2                     # Split on H2 headings (default: 2)
    image_handling: "reference"          # "reference" (keep) or "ignore" (remove)

include_extensions: "auto"               # "auto" or explicit list: [".h", ".cpp"]
exclude_patterns:                        # Glob patterns to skip
  - "*/test/*"
  - "*_test.*"

llm:
  model: "gpt-4o-mini"                   # Any litellm-compatible model
  api_base: "https://api.openai.com/v1"
  api_key: "sk-..."                      # Or use api_key_env / KBLENS_LLM_KEY
  temperature: 0.2
  max_concurrent: 8                      # Concurrent LLM calls
  max_concurrent_components: 8           # Concurrent component pipelines
  extra_body:                            # Extra params passed to LLM API
    chat_template_kwargs:                # Example: disable thinking mode
      enable_thinking: false

packing:
  token_budget: 8000                     # Target tokens per batch
  token_min: 1000                        # Minimum batch size
  token_max: 24000                       # Maximum batch size
  component_split_threshold: 200         # File count threshold for splitting

summary_language: "en"                   # Language for generated summaries
| Variable | Purpose |
|---|---|
| `KBLENS_LLM_KEY` | LLM API key (overrides config) |
| `KBLENS_KB_PATH` | Accumulated automatically by each `kblens generate` run; used by AI skills and `kblens serve` to locate all KBs. Supports multiple paths separated by `;` (Windows) or `:` (Unix). |
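
A tool that wants to read `KBLENS_KB_PATH` itself can split it on the platform's path separator, which Python exposes as `os.pathsep` (`;` on Windows, `:` on Unix). The helper below is a sketch: `parse_kb_paths` is a hypothetical name, and dropping empty and duplicate entries is an assumption about reasonable handling.

```python
import os


def parse_kb_paths(value, sep=os.pathsep):
    """Split a KBLENS_KB_PATH-style string into unique, non-empty paths."""
    seen, paths = set(), []
    for raw in (value or "").split(sep):
        path = raw.strip()
        if path and path not in seen:
            seen.add(path)      # keep first occurrence, preserve order
            paths.append(path)
    return paths


# Typical use: read the variable from the current environment.
kb_dirs = parse_kb_paths(os.environ.get("KBLENS_KB_PATH"))
```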
kblens generate # Generate all sources
kblens generate --source core # Generate only the "core" source
kblens generate --dry-run # Preview without LLM calls
kblens generate --config ./my.yaml # Use specific config file
kblens add-kb # Add current directory and generate/update it
kblens serve # Browse KB in browser (auto-detect from env)
kblens serve --kb ./output # Browse a specific KB directory
kblens serve --kb ./code --kb ./docs # Browse multiple KBs together
kblens serve --port 8080 # Use a custom port
kblens status # Show knowledge base status
kblens monitor # Monitor a running generation
kblens init # Interactive config setup
kblens version # Show version
For a project with a code source and a document source:
~/kblens_kb/my_project/
├── source-code/ # Source: code
│ ├── INDEX.md # L0: package directory with links
│ ├── _meta.json # Component status, hashes, token counts
│ └── source-code/
│ ├── engine.md # L1: engine package overview
│ ├── engine/
│ │ ├── SoundSystem.md # L2: component (summary + AST signatures)
│ │ └── Physics.md
│ └── gameplay.md
├── project-docs/ # Source: documents
│ ├── INDEX.md
│ ├── _meta.json
│ └── project-docs/
│ ├── api.md # L1: api package overview
│ ├── api/
│ │ └── api.md # L2: component (summary + original content)
│ └── guides.md
A code component file (L2) has this shape:

## Responsibility
What this component does.

## Key Types and Relationships
Classes, structs, enums and how they relate.

## Source Files
File paths grouped by role.

## Dependencies
Explicit #include paths.

---

## Complete API Signatures

```cpp
class MyClass {
    void MyMethod(int param);
};
```
A document component file (L2) has this shape:

## Topic Summary
What this documentation covers.

## Key Concepts and Definitions
Important terms and entities.

## Actionable Information
Steps, commands, configurations.

## Related Topics
Connections to other documents.

---

## Original Content

### From: filename.md#section-heading
(Complete original text)
KBLens runs a six-phase pipeline for each source:
- Scan — Walk the directory tree, discover components (package/subdir pairs), count files and lines
- Extract — For code: parse with tree-sitter, extract AST skeletons. For documents: convert formats via markitdown, split by heading level into sections.
- Pack — Group entries into token-budgeted batches, create aggregation groups for large components
- Leaf Summarize — Send each batch to the LLM for a focused summary; raw content (AST or original text) is preserved separately (Phase 4)
- Aggregate — Merge summaries upward: fragments → component overview → package overview → INDEX (Phase 5a-5d)
- Write — Persist Markdown files (summary + appended raw content) and update `_meta.json` incrementally
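
The Pack phase can be pictured as greedy bin filling against `token_budget`. The sketch below is a simplified illustration, not the actual packer: `pack_entries` is a hypothetical name, entries are assumed to arrive as `(name, token_count)` pairs, and `token_min`, `token_max`, and aggregation groups are not modeled.

```python
def pack_entries(entries, token_budget=8000):
    """Group (name, token_count) pairs into batches that stay within the budget.

    An entry larger than the budget ends up alone in its own batch.
    """
    batches, current, current_tokens = [], [], 0
    for name, tokens in entries:
        # Start a new batch when adding this entry would exceed the budget.
        if current and current_tokens + tokens > token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(name)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Raising `token_budget` produces fewer, larger batches, which is why a higher value is suggested for serial local models.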
KBLens is designed for daily use in active development. Just re-run kblens generate after code or document changes — it will figure out what needs updating.
On subsequent runs:
- Unchanged components are skipped entirely (hash match based on file path + mtime + size)
- Changed components are regenerated, and their package's L1 overview is updated
- New components are generated and added to the package overview
- Deleted components have their `.md` files and metadata cleaned up
- Failed components (from previous timeout/errors) are automatically retried
- Skipped components (< 50 AST tokens) are recorded in metadata to avoid re-scanning
- L0 INDEX is regenerated only if any package changed
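
The rules above can be sketched as a fingerprint per component plus a comparison against the previous run. This is an illustration under assumptions, not KBLens's code: `fingerprint` and `classify` are hypothetical names, SHA-256 is an arbitrary choice of hash, and the stored metadata is reduced to a plain name-to-hash mapping.

```python
import hashlib
import os


def fingerprint(paths):
    """Hash file path + mtime + size for every file of one component."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        stat = os.stat(path)
        digest.update(f"{path}|{stat.st_mtime_ns}|{stat.st_size}\n".encode())
    return digest.hexdigest()


def classify(previous, current, failed=()):
    """Compare component -> fingerprint maps from the last run and this run."""
    status = {}
    for name, value in current.items():
        if name not in previous:
            status[name] = "new"
        elif name in failed:
            status[name] = "failed"      # retried regardless of its hash
        elif previous[name] != value:
            status[name] = "changed"
        else:
            status[name] = "unchanged"
    for name in previous:
        if name not in current:
            status[name] = "deleted"     # its .md file and metadata get cleaned up
    return status
```

Because the fingerprint never reads file contents, scanning stays cheap even on large trees; the trade-off is that touching a file without editing it still counts as a change.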
- C++ (`.h`, `.hpp`, `.cpp`, `.cc`, `.cxx`) — classes, structs, enums, free functions, templates, supplementary `.cpp` extraction
- C# (`.cs`) — classes, structs, interfaces, records, enums, delegates, generics with constraints, attributes, XML doc comments
- Python (`.py`, `.pyi`) — classes with public methods, module-level functions, type-annotated constants, decorators, docstrings, `__all__`
- TypeScript (`.ts`, `.tsx`) — classes, interfaces, type aliases, enums, exported functions, arrow functions, access modifiers
- JavaScript (`.js`, `.jsx`, `.mjs`, `.cjs`) — classes, exported functions, constants
See Supported Formats above.
KBLens supports two layout styles:
- Deep layout (C++ engine style): `source/package/component/src/*.h` — three directory levels
- Flat layout (Python package style): `source/package/*.py` — package directory contains code files directly
Both are auto-detected during scanning. Document sources use the same layout detection.
Already supported:
- C++
- C#
- Python
- TypeScript / JavaScript
- Document knowledge base (PDF, DOCX, PPTX, HTML, etc.)

Planned languages:
- Java / Kotlin
- Rust
- Go
KBLens generates Markdown knowledge bases that can be queried by AI coding assistants. An OpenCode skill template is included in skills/kblens-kb/SKILL.md.
# Auto-install skill
kblens skill install

# Or manually
mkdir -p ~/.config/opencode/skills/kblens-kb
cp skills/kblens-kb/SKILL.md ~/.config/opencode/skills/kblens-kb/
The skill automatically reads KBLENS_KB_PATH (set after each kblens generate) to find the knowledge base.
The knowledge base is plain Markdown files. You can integrate it with any AI tool that supports file-based context:
- Add the knowledge base directory as a reference path
- Use grep/search to find relevant `.md` files
- The three-layer hierarchy (INDEX → package → component) provides natural progressive disclosure
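
For tools without a built-in search, a few lines of Python are enough to grep the knowledge base. This is a generic sketch that relies only on the output being plain Markdown files; `search_kb` and `max_hits` are hypothetical names and nothing here is part of the KBLens CLI.

```python
from pathlib import Path


def search_kb(kb_root, term, max_hits=20):
    """Case-insensitive search over every .md file under kb_root.

    Returns (relative_path, line_number, line_text) tuples.
    """
    needle = term.lower()
    hits = []
    for md_file in sorted(Path(kb_root).rglob("*.md")):
        text = md_file.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((str(md_file.relative_to(kb_root)), number, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

A hit in `INDEX.md` or a package file points at the overview level, while a hit in a component file lands directly on signatures or original text.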
- The knowledge base uses absolute paths in `_meta.json` for change tracking. If you move your source code directory, regenerate the knowledge base with `kblens generate`.
- Hybrid output mode: LLM only generates concise summaries (~400 tokens per batch). Raw content (AST signatures or document text) is appended directly, so it is never truncated or hallucinated.
- LLM model compatibility: KBLens uses litellm under the hood, so any model supported by litellm will work (OpenAI, Anthropic, local Ollama, llama.cpp, etc.).
- For local LLM users: set `max_concurrent: 1` and increase `token_budget` (e.g., 20000) to minimize the number of serial LLM calls.
MIT — see LICENSE.