GitHub - disrei/KBLens: code knowledge base


██╗ ██╗██████╗ ██╗ ███████╗███╗ ██╗███████╗
██║ ██╔╝██╔══██╗ ██║ ██╔════╝████╗ ██║██╔════╝
█████╔╝ ██████╔╝ ██║ █████╗ ██╔██╗ ██║███████╗
██╔═██╗ ██╔══██╗ ██║ ██╔══╝ ██║╚██╗██║╚════██║
██║ ██╗██████╔╝ ███████╗███████╗██║ ╚████║███████║
╚═╝ ╚═╝╚═════╝ ╚══════╝╚══════╝╚═╝ ╚═══╝╚══════╝
═══════════════════════════════════════════════════════════
 Knowledge Base Lens · Code & Document Intelligence
═══════════════════════════════════════════════════════════

English | 中文

A progressive-disclosure knowledge base generator for large codebases and document collections. KBLens uses tree-sitter to extract AST skeletons from source code, and markitdown to convert documents from various formats (PDF, DOCX, PPTX, HTML, etc.) to Markdown. Both are packed into LLM-friendly batches and summarized into hierarchical Markdown — giving AI assistants structured context without reading every file.

Why KBLens

When doing vibe coding — using AI assistants (Cursor, Copilot, OpenCode, etc.) to write and refactor code through natural language — the AI needs to understand your codebase's architecture. But large codebases (100K+ files) are too big for LLMs to consume directly. Without structured context, AI assistants either hallucinate or say "I don't know" when asked about internal systems.

The same problem applies to document collections — internal wikis, technical docs, design specs, API references. They contain critical knowledge but are scattered across formats and too large for LLMs to ingest as-is.

KBLens solves both by generating a three-layer knowledge base from your actual source code and documents:

L0 INDEX.md Project overview + package directory
L1 packages/engine.md Per-package component listing and architecture
L2 packages/engine/ Per-component: purpose, key types, public APIs, dependencies

This gives AI assistants a reliable, searchable reference — like an always-up-to-date architecture document generated from actual code and docs. Point your AI tool at the knowledge base, and it can answer questions like "how does the physics system work?" or "what's the configuration reference for deployment?" without reading every source file.

Key Features

Dual mode — Processes both code (via tree-sitter AST extraction) and documents (via markitdown conversion + section splitting) through the same pipeline
AST-based code extraction — Uses tree-sitter to extract class/struct/enum/function signatures from C++, C#, Python, TypeScript, and JavaScript source files. No guessing, no hallucination.
Document format support — PDF, DOCX, PPTX, XLSX, HTML, CSV, EPUB, and more via markitdown. Documents are converted to Markdown, split by heading level, and summarized.
Hybrid output — LLM generates concise summaries, while raw content (AST signatures for code, original text for documents) is appended directly. Zero truncation, minimal LLM output tokens.
Hierarchical summaries — Three levels of detail (project → package → component) with progressive disclosure. Ask about a package, get the overview. Ask about a class or document section, get the details.
Incremental updates — Only regenerates components whose source files changed. Changes are tracked with a hash of each file's path, mtime, and size. A full run on 200+ components takes ~5 minutes; incremental runs take seconds.
Local LLM support — Works with local models via llama.cpp, Ollama, or any OpenAI-compatible API. Includes thinking-mode detection for models like Qwen3.5 and DeepSeek-R1.
Change detection — Five-way classification (unchanged / changed / new / deleted / failed) with automatic cleanup of orphaned files and cascade updates to affected packages.
Multi-source projects — One config file can define multiple source directories with different types (code or document). Each source gets its own independent knowledge base.
Concurrent generation — By default, processes up to 8 components in parallel with up to 8 concurrent LLM calls (configurable via max_concurrent_components and max_concurrent). Includes exponential backoff retry (3 attempts) for transient failures.
Browser viewer — Built-in kblens serve command starts a local HTTP server to browse the knowledge base in your browser with syntax-highlighted code, Markdown rendering, and a tree navigation sidebar. Supports viewing multiple knowledge bases (code + docs) simultaneously.
Resume from interruption — Progress is persisted after each component. Ctrl+C and re-run to continue where you left off.
Live dashboard — Rich terminal UI showing real-time progress, active components, token usage, and error count.
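
The concurrency and retry behavior described under "Concurrent generation" can be pictured with a small asyncio sketch. This is not KBLens's actual code: `call_with_retry`, `run_all`, the `ConnectionError` trigger, and the delay values are hypothetical. Only the limits (8 concurrent calls, 3 attempts) come from the list above.

```python
import asyncio

MAX_CONCURRENT = 8  # default from the feature list above
MAX_ATTEMPTS = 3    # retry count from the feature list above


async def call_with_retry(call, sem, base_delay=0.01):
    """Run `call()` under a semaphore, retrying with exponential backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            async with sem:
                return await call()
        except ConnectionError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of attempts: surface the failure
            # Delay doubles after each failure: base, 2 * base, ...
            await asyncio.sleep(base_delay * (2 ** attempt))


async def run_all(calls):
    """Run every call concurrently, never more than MAX_CONCURRENT at once."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [call_with_retry(c, sem) for c in calls]
    return await asyncio.gather(*tasks, return_exceptions=True)
```

The semaphore is released before each backoff sleep, so a component that is waiting to retry does not block a slot that another component could use.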

Prerequisites

Python 3.11+
C compiler — Required by tree-sitter for grammar compilation (GCC, Clang, or MSVC)
- On Ubuntu/Debian: sudo apt install build-essential
- On macOS: Xcode Command Line Tools (xcode-select --install)
- On Windows: Visual Studio Build Tools or MinGW

Installation

# From PyPI (code knowledge base only)
pip install kblens
# With document format support (PDF, DOCX, PPTX, HTML, etc.)
pip install 'kblens[docs]'
# With full document format support (all markitdown backends)
pip install 'kblens[docs-all]'
# Upgrade to latest version
pip install --upgrade kblens
# Or install from GitHub directly
pip install git+https://github.com/disrei/KBLens.git
# Or clone and install in development mode
git clone https://github.com/disrei/KBLens.git
cd KBLens
pip install -e . # code only
pip install -e ".[docs]" # + document support
# Verify
kblens version

Install Extras

| Extra | Command | What It Adds |
| --- | --- | --- |
| (none) | `pip install kblens` | Code KB (C++, C#, Python, TS/JS) + documents (.md, .txt only) |
| `docs` | `pip install 'kblens[docs]'` | + PDF, DOCX, PPTX, XLSX, HTML, CSV, EPUB via markitdown |
| `docs-all` | `pip install 'kblens[docs-all]'` | + all markitdown optional backends |
| `dev` | `pip install 'kblens[dev]'` | + pytest, ruff |

Quick Start

1. Create a configuration

kblens init

This walks you through creating ~/.config/kblens/config.yaml with your source paths and LLM settings.

Or create it manually:

# ~/.config/kblens/config.yaml
version: 1
project: "my_project"
output_dir: "~/kblens_kb/my_project"
sources:
  # Code source — uses AST extraction
  - path: "/path/to/src"
    name: "source-code"
  # Document source — uses markitdown + section splitting
  - path: "/path/to/docs"
    name: "project-docs"
    type: "document"
llm:
  model: "gpt-4o-mini"
  # api_key: "your-api-key" # see "API Key Security" below
  temperature: 0.2
summary_language: "en"

2. Preview

kblens generate --dry-run

This scans your source, extracts AST / document sections, and reports statistics without calling the LLM.

3. Generate

kblens generate

For a project with ~200 components, expect ~5 minutes and ~400K input tokens.

3.1 Add Current Directory Quickly

kblens add-kb

This command adds the current working directory to ~/.config/kblens/config.yaml and immediately generates that directory's knowledge base.

If the current directory is not in the config yet, KBLens appends it under sources and runs generation for that source only.
If the current directory already exists in the config, KBLens tells you and runs an incremental update for that source.
If the global config file does not exist yet, KBLens creates one automatically with a default output_dir of ~/kblens_kb.

4. Use

The generated knowledge base is a directory of Markdown files. You can:

Browse in browser — Run kblens serve to open a local viewer with syntax highlighting, tree navigation, and Markdown rendering
Browse directly — Open INDEX.md and navigate through the hierarchy
Search with grep — Find any class, function, or concept across all summaries
Integrate with AI tools — Point your coding assistant's skill/tool at the knowledge base directory (see AI Assistant Integration below)
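
Because the knowledge base is plain Markdown, the "search with grep" option is also easy to script. The following is a minimal sketch, not part of KBLens; `search_kb` is a hypothetical helper name, and the only assumption about the output is that it is a directory tree of `.md` files.

```python
from pathlib import Path


def search_kb(kb_dir, term):
    """Return (relative_path, line_number, line) for each case-insensitive match."""
    root = Path(kb_dir).expanduser()
    needle = term.lower()
    hits = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                rel = md_file.relative_to(root).as_posix()
                hits.append((rel, lineno, line.strip()))
    return hits
```

For example, `search_kb("~/kblens_kb/my_project", "SoundSystem")` would list every summary line that mentions that name, together with the file it came from.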

Browser Viewer

Each kblens generate run appends its output directory to the KBLENS_KB_PATH environment variable. Run multiple generations (e.g., one for code, one for docs), then a single kblens serve shows everything together.

# After running generate for each knowledge base, just:
kblens serve
# Or explicitly specify directories:
kblens serve --kb ~/kblens_kb/my_project
# Browse multiple knowledge bases (code + docs) together
kblens serve --kb ~/kblens_kb/code_output --kb ~/kblens_kb/doc_output
# Use a specific config file to locate the output directory
kblens serve --config kblens.yaml

The viewer starts a local HTTP server (default port 9753) with:

Left sidebar — Collapsible tree showing all sources, packages, and components
Right content — Markdown rendered with GitHub-dark styling and syntax-highlighted code blocks
Multi-source — All KBs from KBLENS_KB_PATH and --kb flags are merged; all sources appear in the sidebar

Document Knowledge Base

KBLens can generate knowledge bases from document collections — technical docs, wikis, design specs, API references, etc.

Supported Formats

| Format | Extensions | Requirement |
| --- | --- | --- |
| Markdown | `.md` | Built-in (no extra deps) |
| Plain text | `.txt` | Built-in |
| PDF | `.pdf` | `pip install 'kblens[docs]'` |
| Word | `.docx`, `.doc` | `pip install 'kblens[docs]'` |
| PowerPoint | `.pptx` | `pip install 'kblens[docs]'` |
| Excel | `.xlsx`, `.xls` | `pip install 'kblens[docs]'` |
| HTML | `.html`, `.htm` | `pip install 'kblens[docs]'` |
| CSV | `.csv` | `pip install 'kblens[docs]'` |
| EPUB | `.epub` | `pip install 'kblens[docs]'` |
| Jupyter | `.ipynb` | `pip install 'kblens[docs]'` |

Format conversion is powered by markitdown (Microsoft).

How It Works (Documents)

The document pipeline replaces the AST extraction phase with:

Convert — Non-Markdown files are converted to Markdown via markitdown
Section Extract — Markdown is split by heading level (default: ##) into sections
Image Handling — Image references are preserved as [Image: alt text](path) for searchability

The rest of the pipeline (packing, LLM summarization, aggregation, writing) is shared with the code path.
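
The "Section Extract" step can be illustrated with a short sketch. It is an approximation under stated assumptions, not KBLens's implementation: `split_sections` is a hypothetical name, headings are matched only in ATX form (`## Title`), and lines inside fenced code blocks are never treated as headings.

```python
import re


def split_sections(markdown, level=2):
    """Split Markdown into (heading, body) pairs at headings of exactly `level`."""
    pattern = re.compile(rf"^{'#' * level} +(.*)$")
    sections = []
    heading, body = None, []  # text before the first heading gets heading None
    in_fence = False

    def flush():
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else pattern.match(line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

With `level=2`, deeper headings such as `###` stay inside the body of their parent section, which mirrors the `section_level: 2` default.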

Document Source Configuration

sources:
  - path: "/path/to/docs"
    name: "project-docs"
    type: "document" # Required: tells KBLens to use document pipeline
    section_level: 2 # Split on ## headings (default: 2)
    image_handling: "reference" # Keep image refs (default: "reference", or "ignore")

Document Output Format

Each leaf node in the document knowledge base has two parts, an LLM-generated summary and the original content, separated by a --- line:

# Component Name
## Topic Summary
What this documentation covers and its purpose.
## Key Concepts and Definitions
Important terms, entities, and definitions.
## Actionable Information
Steps, commands, configurations, reference data.
## Related Topics
Connections to other documents.
---
## Original Content
### From: filename.md#section-heading
(Complete original text preserved for precise retrieval)

The LLM summary enables navigation and topic matching, while the original content below the --- separator allows precise retrieval and direct quoting.
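
A tool that consumes leaf files might want the two parts separately. The sketch below is one way to do that, assuming the first line consisting solely of `---` is the separator; `split_leaf` is a hypothetical helper and not a KBLens API.

```python
def split_leaf(markdown):
    """Split a leaf file into (summary, original) at the first '---' line."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "---":
            summary = "\n".join(lines[:i]).strip()
            original = "\n".join(lines[i + 1:]).strip()
            return summary, original
    # No separator found: treat the whole file as summary.
    return markdown.strip(), ""
```

Splitting at the first separator means a `---` that appears later inside the preserved original text does not affect the result.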

Confluence Preprocessing Tool

Confluence is supported as a preprocessing workflow, not as a native source.type inside KBLens.

Use the bundled confluence_crawler.py script to:

Authenticate to Confluence via REST API
Recursively fetch a page and its child pages
Convert Confluence HTML storage content to Markdown
Save the result as a local .md tree
Point kblens generate at that output directory as a normal document source

This separation is intentional:

KBLens core focuses on local file inputs (.md, .pdf, .docx, etc.)
Confluence crawling is a remote-fetch/auth problem, not just a document conversion problem
markitdown can help with HTML -> Markdown after fetching, but it does not replace the Confluence API step

Example usage:

python confluence_crawler.py "https://confluence.example.com/display/SPACE/Page+Title" \
 --depth 3 \
 --output ./confluence_docs

Then use the crawled output as a regular document source:

sources:
  - path: "./confluence_docs"
    name: "confluence-docs"
    type: "document"

Notes:

The crawler is currently a standalone utility script in the repo root
It is not wired into the kblens CLI yet
It is not included as a formal source type like code or document

Using with Local LLMs

KBLens works well with locally deployed LLMs for privacy-sensitive or cost-free usage.

Recommended Setup

# Example: llama.cpp with Qwen3.5-9B
llama-server -m model.gguf -c 65536 --n-gpu-layers 99 --flash-attn on \
 -b 2048 -ub 512 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 -np 1

Configuration for Local LLMs

llm:
  model: "openai/your-model-name"
  api_base: "http://localhost:8080/v1"
  api_key: "not-needed"
  temperature: 0.2
  max_concurrent: 1 # Serial execution for local LLMs
  max_concurrent_components: 1
packing:
  token_budget: 20000 # Larger batches = fewer LLM calls

Thinking Model Support

Models with built-in "thinking mode" (Qwen3.5, DeepSeek-R1, etc.) may output reasoning tokens instead of actual content by default. KBLens automatically detects this and shows a fix:

LLM returned empty content but has reasoning_content — the model is in
'thinking mode'. Disable thinking in your kblens config:
  llm:
    extra_body:
      chat_template_kwargs:
        enable_thinking: false

Add the suggested extra_body configuration to disable thinking mode:

llm:
  model: "openai/Qwen3.5-9B"
  api_base: "http://localhost:8080/v1"
  api_key: "not-needed"
  extra_body:
    chat_template_kwargs:
      enable_thinking: false

The extra_body field passes arbitrary parameters to the LLM API, making it compatible with any server-specific options.

API Key Security

Never commit API keys to version control. Use one of these methods:

Environment variable (recommended):
```
export KBLENS_LLM_KEY=sk-your-key-here
```

Local config override — Create a .local.yaml sibling next to your config file:

# ~/.config/kblens/config.local.yaml (gitignored)
llm:
 api_key: "sk-your-key-here"

Config api_key_env reference — Point to any environment variable:
```
llm:
 api_key_env: "MY_OPENAI_KEY"
```
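
The three methods above imply a lookup order. The sketch below shows one plausible order and is an assumption, not documented behavior: `KBLENS_LLM_KEY` first (the Environment Variables table says it overrides config), then the variable named by `api_key_env`, then a literal `api_key`. `resolve_api_key` is a hypothetical helper name.

```python
import os


def resolve_api_key(llm_config, environ=None):
    """Pick an API key from the environment or the llm config section."""
    env = os.environ if environ is None else environ
    # 1. The dedicated KBLens variable overrides anything in the config.
    if env.get("KBLENS_LLM_KEY"):
        return env["KBLENS_LLM_KEY"]
    # 2. A config entry that names another environment variable (assumed order).
    env_name = llm_config.get("api_key_env")
    if env_name and env.get(env_name):
        return env[env_name]
    # 3. A literal key, ideally kept only in a gitignored .local.yaml file.
    return llm_config.get("api_key")
```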

Configuration

KBLens uses a two-layer config system:

| Layer | Location | Purpose |
| --- | --- | --- |
| Global | `~/.config/kblens/config.yaml` | Shared LLM settings, packing parameters |
| Project | `./kblens.yaml` in project root | Project-specific sources and output |

Project config overrides global config. Each layer can have a .local.yaml sibling for sensitive values (API keys).
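
Layered configuration of this kind is commonly implemented as a recursive dictionary merge. The sketch below illustrates the idea only: whether KBLens merges nested sections key by key or replaces them wholesale is not stated here, so the deep merge and the layer order (global, global `.local.yaml`, project, project `.local.yaml`) are assumptions, and `deep_merge` and `load_layers` are hypothetical names.

```python
def deep_merge(base, override):
    """Return a new dict where values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested sections
        else:
            merged[key] = value  # scalars and lists are replaced outright
    return merged


def load_layers(*layers):
    """Merge config dicts in order; later layers take precedence."""
    result = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


global_cfg = {"llm": {"model": "gpt-4o-mini", "temperature": 0.2}}
project_cfg = {"project": "my_project", "llm": {"temperature": 0.0}}
project_local = {"llm": {"api_key": "sk-your-key-here"}}
config = load_layers(global_cfg, project_cfg, project_local)
```

In this example `config["llm"]` keeps the global `model`, takes `temperature` from the project file, and picks up `api_key` from the local override.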

Config Reference

version: 1
project: "my_project" # Project name (displayed in CLI)
output_dir: "~/kblens_kb/my_project" # Knowledge base output root
sources: # Source directories to scan
  - path: "/path/to/src" # Absolute path
    name: "core" # Short name (used as subdirectory)
    # type: "code" # Default: "code" (AST extraction)
  - path: "/path/to/docs"
    name: "docs"
    type: "document" # Document pipeline (markitdown + sections)
    section_level: 2 # Split on H2 headings (default: 2)
    image_handling: "reference" # "reference" (keep) or "ignore" (remove)
include_extensions: "auto" # "auto" or explicit list: [".h", ".cpp"]
exclude_patterns: # Glob patterns to skip
  - "*/test/*"
  - "*_test.*"
llm:
  model: "gpt-4o-mini" # Any litellm-compatible model
  api_base: "https://api.openai.com/v1"
  api_key: "sk-..." # Or use api_key_env / KBLENS_LLM_KEY
  temperature: 0.2
  max_concurrent: 8 # Concurrent LLM calls
  max_concurrent_components: 8 # Concurrent component pipelines
  extra_body: # Extra params passed to LLM API
    chat_template_kwargs: # Example: disable thinking mode
      enable_thinking: false
packing:
  token_budget: 8000 # Target tokens per batch
  token_min: 1000 # Minimum batch size
  token_max: 24000 # Maximum batch size
  component_split_threshold: 200 # File count threshold for splitting
summary_language: "en" # Language for generated summaries

Environment Variables

| Variable | Purpose |
| --- | --- |
| `KBLENS_LLM_KEY` | LLM API key (overrides config) |
| `KBLENS_KB_PATH` | Accumulated automatically by each `kblens generate` run; used by AI skills and `kblens serve` to locate all KBs. Supports multiple paths separated by `;` (Windows) or `:` (Unix). |
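
A script that wants to read `KBLENS_KB_PATH` itself can rely on Python's `os.pathsep`, which is `;` on Windows and `:` on Unix, matching the separators described above. The helpers below are a hypothetical sketch (`kb_paths` and `append_kb_path` are not KBLens functions), and the de-duplication is an assumption about sensible behavior.

```python
import os


def kb_paths(value, sep=os.pathsep):
    """Split a KBLENS_KB_PATH-style string into unique, non-empty paths."""
    seen, paths = set(), []
    for part in value.split(sep):
        part = part.strip()
        if part and part not in seen:
            seen.add(part)
            paths.append(part)
    return paths


def append_kb_path(value, new_path, sep=os.pathsep):
    """Return the variable's new value after adding `new_path` once."""
    paths = kb_paths(value, sep)
    if new_path not in paths:
        paths.append(new_path)
    return sep.join(paths)
```

In practice the input would come from `os.environ.get("KBLENS_KB_PATH", "")`.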

CLI Reference

kblens generate # Generate all sources
kblens generate --source core # Generate only the "core" source
kblens generate --dry-run # Preview without LLM calls
kblens generate --config ./my.yaml # Use specific config file
kblens add-kb # Add current directory and generate/update it
kblens serve # Browse KB in browser (auto-detect from env)
kblens serve --kb ./output # Browse a specific KB directory
kblens serve --kb ./code --kb ./docs # Browse multiple KBs together
kblens serve --port 8080 # Use a custom port
kblens status # Show knowledge base status
kblens monitor # Monitor a running generation
kblens init # Interactive config setup
kblens version # Show version

Output Structure

For a project with a code source and a document source:

~/kblens_kb/my_project/
├── source-code/                  # Source: code
│   ├── INDEX.md                  # L0: package directory with links
│   ├── _meta.json                # Component status, hashes, token counts
│   └── source-code/
│       ├── engine.md             # L1: engine package overview
│       ├── engine/
│       │   ├── SoundSystem.md    # L2: component (summary + AST signatures)
│       │   └── Physics.md
│       └── gameplay.md
├── project-docs/                 # Source: documents
│   ├── INDEX.md
│   ├── _meta.json
│   └── project-docs/
│       ├── api.md                # L1: api package overview
│       ├── api/
│       │   └── api.md            # L2: component (summary + original content)
│       └── guides.md

Code Output Format

## Responsibility
What this component does.
## Key Types and Relationships
Classes, structs, enums and how they relate.
## Source Files
File paths grouped by role.
## Dependencies
Explicit #include paths.
---
## Complete API Signatures
```cpp
class MyClass { void MyMethod(int param); };
```

Document Output Format

## Topic Summary
What this documentation covers.
## Key Concepts and Definitions
Important terms and entities.
## Actionable Information
Steps, commands, configurations.
## Related Topics
Connections to other documents.
---
## Original Content
### From: filename.md#section-heading
(Complete original text)

How It Works

KBLens runs a six-phase pipeline for each source:

1. Scan — Walk the directory tree, discover components (package/subdir pairs), count files and lines
2. Extract — For code: parse with tree-sitter, extract AST skeletons. For documents: convert formats via markitdown, split by heading level into sections.
3. Pack — Group entries into token-budgeted batches, create aggregation groups for large components
4. Leaf Summarize — Send each batch to the LLM for a focused summary; raw content (AST or original text) is preserved separately (Phase 4)
5. Aggregate — Merge summaries upward: fragments → component overview → package overview → INDEX (Phase 5a-5d)
6. Write — Persist Markdown files (summary + appended raw content) and update _meta.json incrementally
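
The Pack phase groups entries so that each LLM call stays near a token budget. A greedy version of that idea is sketched below. It is a simplification and not KBLens's algorithm: `pack_batches` is a hypothetical name, token counts are supplied by the caller, and the `token_min` and `token_max` settings are not modeled.

```python
def pack_batches(entries, budget=8000):
    """Greedily group (name, token_count) entries into batches within `budget`.

    An entry larger than the budget still gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for name, tokens in entries:
        if current and used + tokens > budget:
            batches.append(current)  # close the batch before it overflows
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        batches.append(current)
    return batches


entries = [("a.h", 3000), ("b.h", 4000), ("c.h", 2000), ("d.h", 9000), ("e.h", 100)]
batches = pack_batches(entries)
```

Raising the budget (for example to the 20000 suggested for local LLMs) produces fewer, larger batches and therefore fewer LLM calls.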

Incremental Behavior

KBLens is designed for daily use in active development. Just re-run kblens generate after code or document changes — it will figure out what needs updating.

On subsequent runs:

Unchanged components are skipped entirely (hash match based on file path + mtime + size)
Changed components are regenerated, and their package's L1 overview is updated
New components are generated and added to the package overview
Deleted components have their .md files and metadata cleaned up
Failed components (from previous timeout/errors) are automatically retried
Skipped components (< 50 AST tokens) are recorded in metadata to avoid re-scanning
L0 INDEX is regenerated only if any package changed
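
The classification above can be sketched in a few lines. This is an illustration under assumptions, not the actual implementation: the function names are hypothetical, SHA-256 is an arbitrary choice of hash, and a stored value of `None` is used here to stand for a component that failed on the previous run. What comes from the text is the fingerprint input (file path, mtime, size) and the status names.

```python
import hashlib
import os


def component_hash(files):
    """Fingerprint a component from (path, mtime, size); contents are not read."""
    digest = hashlib.sha256()
    for path in sorted(files):
        st = os.stat(path)
        digest.update(f"{path}|{st.st_mtime_ns}|{st.st_size}\n".encode("utf-8"))
    return digest.hexdigest()


def classify(previous, current):
    """Map each component name to a status by comparing stored and new hashes."""
    status = {}
    for name, new_hash in current.items():
        if name not in previous:
            status[name] = "new"
        elif previous[name] is None:
            status[name] = "failed"  # retried on this run
        elif previous[name] == new_hash:
            status[name] = "unchanged"
        else:
            status[name] = "changed"
    for name in previous:
        if name not in current:
            status[name] = "deleted"
    return status
```

Hashing metadata instead of file contents keeps the scan fast on large trees, at the cost of treating a file as changed whenever its mtime moves.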

Language Support

Code Languages

C++ (.h, .hpp, .cpp, .cc, .cxx) — classes, structs, enums, free functions, templates, supplementary .cpp extraction
C# (.cs) — classes, structs, interfaces, records, enums, delegates, generics with constraints, attributes, XML doc comments
Python (.py, .pyi) — classes with public methods, module-level functions, type-annotated constants, decorators, docstrings, __all__
TypeScript (.ts, .tsx) — classes, interfaces, type aliases, enums, exported functions, arrow functions, access modifiers
JavaScript (.js, .jsx, .mjs, .cjs) — classes, exported functions, constants

Document Formats

See Supported Formats above.

Directory Layout

KBLens supports two layout styles:

Deep layout (C++ engine style): source/package/component/src/*.h — three directory levels
Flat layout (Python package style): source/package/*.py — package directory contains code files directly

Both are auto-detected during scanning. Document sources use the same layout detection.

Roadmap

Already supported:

C++
C#
Python
TypeScript / JavaScript
Document knowledge base (PDF, DOCX, PPTX, HTML, etc.)

Planned languages:

Java / Kotlin
Rust
Go

AI Assistant Integration

KBLens generates Markdown knowledge bases that can be queried by AI coding assistants. An OpenCode skill template is included in skills/kblens-kb/SKILL.md.

OpenCode Setup

# Auto-install skill
kblens skill install
# Or manually
mkdir -p ~/.config/opencode/skills/kblens-kb
cp skills/kblens-kb/SKILL.md ~/.config/opencode/skills/kblens-kb/

The skill automatically reads KBLENS_KB_PATH (set after each kblens generate) to find the knowledge base.

Other AI Tools

The knowledge base is plain Markdown files. You can integrate it with any AI tool that supports file-based context:

Add the knowledge base directory as a reference path
Use grep/search to find relevant .md files
The three-layer hierarchy (INDEX → package → component) provides natural progressive disclosure

Notes

The knowledge base uses absolute paths in _meta.json for change tracking. If you move your source code directory, regenerate the knowledge base with kblens generate.
Hybrid output mode: LLM only generates concise summaries (~400 tokens per batch). Raw content (AST signatures or document text) is appended directly, so it is never truncated or hallucinated.
LLM model compatibility: KBLens uses litellm under the hood, so any model supported by litellm will work (OpenAI, Anthropic, local Ollama, llama.cpp, etc.).
For local LLM users: set max_concurrent: 1 and increase token_budget (e.g., 20000) to minimize the number of serial LLM calls.

License

MIT — see LICENSE.
