Memory and Knowledge

ankurCES edited this page Jun 9, 2026 · 10 revisions

Memory & Knowledge

Memory & Knowledge is how blumi — a local-first, bring-your-own-key (BYOK) AI coding agent — remembers facts across sessions, survives crashes mid-task, learns which memories actually work, shares knowledge across machines, and understands your codebase. It needs three things most assistants lack: memory that survives a crash, memory that learns and travels across machines, and a real understanding of your code. blumi ships all of it local-first — a bundled embedding model means no API key for memory or code search, and nothing leaves your machines.

This page covers durable execution, semantic long-term memory (RAG), SEDM cross-peer diffusion, and the native code knowledge base — plus how to configure, drive, and test each.

TL;DR — Memory & Knowledge key facts

Local-first by default. Memory and code search run on a bundled on-device embedding model (bge-small-en-v1.5), so they need no API key and send nothing to the cloud.
Semantic long-term memory (RAG). Each turn, blumi recalls the most relevant past memories and injects them as background context; RAG = retrieval-augmented generation.
Durable execution. Every tool step is checkpointed to SQLite, so a crash or restart resumes mid-turn instead of replaying the whole turn.
SEDM governance. SEDM = self-evolving distributed memory: dedup on write, value-based fitness, consolidation/eviction, and probationary failure→fix learning keep memory useful, not just large.
Cross-peer diffusion. High-value, non-private memories spread across the grid so what one machine learns, the others pick up — your user (private) namespace never leaves the node.
Native code knowledge base. Index any repo and search it by meaning or keyword (hybrid vector + FTS5) from the CLI, agent tools, or the blugo Code tab.
Graceful offline fallback. If embeddings are off or the network is unreachable, memory and knowledge fall back to FTS5 keyword search — they keep working.

At a glance

Capability	What it does	Drive it with
Durable execution	Checkpoints each tool step; resumes mid-turn after a crash/restart	automatic
Semantic memory (RAG)	Recalls relevant memories each turn (vectors, FTS5 fallback)	automatic + `memory` tool
SEDM governance	Dedup on write, value-based fitness, consolidation/eviction, probationary learning	automatic
Cross-peer diffusion	Spreads worth-sharing memories across the grid	automatic (grid + `diffuse`)
Code knowledge base	Index a repo; hybrid vector/FTS5 code search	`blumi knowledge`, `code_search`/`code_retrieve`, blugo Code tab
GPU acceleration	Runs the bundled embedder on Apple CoreML / NVIDIA CUDA when present	automatic · `blumi accel doctor` · `/accel`
Self-healing	Classifies tool failures → budgeted recovery; learns failure→fix episodes	automatic · `heal` config · `/heal`

Everything is on by default. The first memory/knowledge operation downloads the embedding model (~130 MB, bge-small-en-v1.5) once into ~/.blumi/models; after that it runs fully offline. If a node can't reach the network, memory/knowledge transparently fall back to keyword (FTS5) search.

How does durable execution survive a crash?

Durable execution means a crashed or restarted run resumes mid-turn instead of replaying the whole turn. Every tool step within a turn is checkpointed to SQLite (checkpoints table in blumi.db). If the process crashes or the gateway is restarted mid-turn, the next run resumes from the last completed step — the backbone for long autonomous runs (blumi loop, grid-dispatched work). Resume is at-least-once: idempotent tools + the doom-loop guard keep it safe.

Nothing to configure. To see it: start a multi-step task, restart the gateway (launchctl kickstart -k gui/$(id -u)/com.blumi.serve) mid-run, reconnect — it picks up where it left off.

How does blumi's semantic long-term memory (RAG) work?

blumi's long-term memory uses retrieval-augmented generation (RAG): each turn it embeds your latest message, finds the most relevant past memories, and injects them as context. A vector store (memories + memory_vectors + memories_fts in blumi.db) lets blumi recall knowledge across sessions. The recalled memories are added as background context — appended as a trailing user message so the cached system prompt stays byte-identical (prompt caching keeps working). With embeddings off, it degrades to FTS5 keyword recall.

Namespaces

user — durable facts/preferences about you. Private — never diffuses off the node.
agent — the agent's own notes/conventions (the default for "remember this..."). Diffusable.
project:<hash> — project-scoped knowledge. Diffusable.

Using it from chat — the memory tool:

add — remember a fact (mirrored to MEMORY.md/USER.md and embedded, dedup-gated).
query — semantic search over everything remembered.
read / replace / remove — manage the file-backed entries.

Tip: phrasing decides the namespace. "Remember about me: I prefer dark mode" → user (private). "Remember this project convention: ..." → agent/project (shareable across the grid).

White-box memory (view · pin · edit · delete)

Memory isn't a black box: you can inspect and curate individual entries, not just the MEMORY.md/USER.md files. Each entry shows its namespace, kind, utility, hit count, origin, and timestamps.

Pin an entry to protect it — pinned entries are exempt from eviction + consolidation, so a critical decision never drifts or gets merged away.
Edit an entry's text — blumi re-embeds it (and keeps the FTS5 index in sync), so search stays correct.
Delete an entry outright (its vector + graph edges go too).

Drive it from the Web UI's Memory tab, or the API (behind the gateway password):

POST /api/memory/list {"namespace":"agent","status":"active","limit":200}
POST /api/memory/pin {"id":123,"pinned":true}
POST /api/memory/update {"id":123,"text":"corrected fact"}
POST /api/memory/delete {"id":123}

Editing or pinning a user-namespace entry stays local — it never changes what crosses the grid (diffusion still excludes user% and ignores pinned).

What is SEDM (self-evolving distributed memory)?

SEDM (self-evolving distributed memory) is the governance layer that keeps blumi's memory useful instead of just large. Memory that only grows becomes noise, so blumi applies the governance from the SEDM paper:

Write-admission (dedup). Before inserting, blumi checks the closest existing memory; if it's a near-duplicate (cosine ≥ dedup_threshold, default 0.92) it merges and bumps utility instead of piling up.
Utility scoring. Every time a memory is recalled or merged, its hits/utility rise — so what actually helps surfaces more.
Consolidation & eviction. A periodic sweep folds near-duplicate clusters into the highest-utility member and soft-evicts the weakest once a namespace exceeds max_per_namespace.
Cross-peer diffusion. The sweep pushes high-value, non-user memories to live grid peers (POST /api/grid/memory, authenticated by the shared grid secret). Each receiver re-admits the memory through its own dedup gate and tags it with the origin node, so memories never loop (A→B→A). What one machine learns, the others pick up. Your user namespace never leaves the node.

Learning from outcomes (v0.5.0)

Beyond dedup + utility, memory now learns from what actually works — outcome, not just retrieval:

Probationary failure→fix. A guided recovery is stored as a pending hypothesis — physically present but invisible to recall/mining/diffusion — and only promoted to a real fix once the same tool is observed to succeed on a later step (cross-step verify, on by default), with provenance. Unverified guesses never get recalled as "fixes"; un-promoted hypotheses are reaped by the sweep.
Value-based fitness. Each memory carries a learned value distinct from retrieval utility: rewarded when it was in context for a productive step, decayed on failures, and bumped when a peer independently holds it (cross-node consensus). Eviction drops the lowest-value memory first — so genuinely-useful memories survive, not merely frequently-retrieved ones. (With RPL-Judgement on, its "regret" Error-Delta feeds this same signal.)
Structure-aware recall. Recall re-ranks a wider candidate pool by memory-graph degree (hub-suppression — generic "matches-everything" notes stop dominating) and is seeded from the last couple of user turns, not just the latest line.
Conflict resolution. Reversible supersede + conflict detection settle contradictions: with memory.resolve_conflicts on, the sweep asks the LLM to classify same-topic pairs and supersedes the outdated side (conservative — ambiguous pairs are left untouched). Diffusion shares the highest-value memories first.
Retrospection (daily). A differential replay of session transcripts (memory.retrospect, on by default) consolidates each session's new messages — read from a watermark, never the whole history — into durable learnings once per memory.retrospect_hours (default 24). So memory gets steadily more solid without re-iterating the chat each turn. Learnings are stored in the agent namespace, so they diffuse across the grid like any other shared memory (the private user namespace never leaves the node).
Curation everywhere. The sweep below now runs for standalone blumi run / blumi tui too (not only the gateway), so memory is curated and the recall graph is built on every install.

Diffusion needs the grid enabled on the participating nodes and memory.diffuse = true (default). The sweep runs every memory.sweep_secs (default 60s), so a new shareable memory appears on peers within ~a minute.

See diffusion happen (deterministic check on each machine):

sqlite3 ~/.blumi/blumi.db \
 "SELECT namespace, origin, substr(text,1,60) FROM memories WHERE status='active';"

On the node that learned it, the row has origin = ''; on a peer that received it, origin is the sender's node name.

How does blumi search my codebase?

blumi's native code knowledge base indexes a repo into a sibling ~/.blumi/knowledge.db and searches it by meaning or keyword (hybrid: FTS5 first for exact/symbol precision, vector fill for semantic recall). Indexing is gitignore-aware and diff-aware — re-running only touches changed files (by SHA). Symbol extraction is lightweight (per-language, no external grammars), and asset/binary files (svg, images, lockfiles...) are skipped.

Structural code graph (opt-in)

By default the graph is Tier-0: edges are name co-occurrence (a symbol's body mentions another symbol's name). Build with --features code-graph and set knowledge.graph.mode = "structural" to parse with tree-sitter instead — declarations gain a scope-qualified name / parent / signature, and references resolve into typed edges (call / type / implements / contains, marked resolved when unambiguous). That makes code_graph impact <symbol> a precise change blast radius, and (when knowledge.graph.rpl_impact is on) feeds the RPL blast radius so editing a heavily-referenced file faces tighter review. Languages: Rust, Python, Go, JavaScript, TypeScript. The default build stays native-lite (regex symbols, no C grammars); code_graph still works over the lite edges, just less precisely.

CLI — `blumi knowledge`

blumi knowledge ingest ~/code/my-repo # index a repo/dir (incremental)
blumi knowledge search "where is auth handled"
blumi knowledge status # files · symbols · vectors
blumi knowledge list # indexed sources
blumi knowledge remove <source> # drop an indexed source
blumi knowledge impact <symbol> # transitive callers — the change blast radius
blumi knowledge callers <symbol> # who calls/uses it (callees = what it calls)
blumi knowledge implementers <trait> # types implementing a trait

The same code-graph queries are also one tap away in the TUI (/knowledge lists the most-connected "hot-spot" symbols), the gateway (POST /api/knowledge/graph), and blugo (each Code-tab search result has an Impact action showing its change blast radius).

From chat — agent tools

code_search — find relevant functions/types/code by meaning or keywords; returns file:line + snippet. Use it to locate where something is implemented before reading/editing.
code_retrieve — fetch a file's indexed symbols (with snippets), or one symbol by name.
code_graph — query the typed code graph around a symbol: callers (who calls/uses it), callees (what it calls), impact (transitive callers — the change blast radius of editing it), or implementers (types implementing a trait). Use it before an edit to see what depends on something. (code_neighbors / code_path cover raw graph traversal.)

From the phone — blugo Code tab

Control center → Code: index a repo (path on the gateway machine, with live progress), list/remove sources, and search code with file:line + snippet result cards.

The index is local to each node (the orchestrator's knowledge.db). Re-ingest is cheap, so keep it fresh as the code changes.

GPU acceleration

Does blumi use the GPU? Yes — blumi runs its bundled embedder on the GPU when one is detected and falls back to CPU otherwise, and no LLM ever runs in-process. All in-process numeric work is the bundled embedder (the ONNX bge-small-en-v1.5 model behind memory and code search). blumi accel doctor prints exactly what's detected and how to wire more.

Accelerator	How	Notes
Apple CoreML / Metal	On by default on Apple Silicon — no flag	baked into the ONNX Runtime blumi already downloads (≈zero extra binary size)
NVIDIA CUDA	Opt-in: `curl ... \| BLUMI_CUDA=1 sh`, or `cargo install --features gpu-cuda`	ships `libonnxruntime.so` next to the binary; installer verifies it loads, else falls back to a lean build
CPU	Automatic fallback everywhere	the lean default on Linux/Windows/CI (FTS5 keyword search)

blumi accel detect # the chosen accelerator on one line (apple-coreml | cuda | cpu)
blumi accel status # detection + the embedder's active execution provider + backends
blumi accel doctor # full report + copy-paste setup hints (local GPU server / CUDA)

The bigger GPU win is the LLM, not the embedder. Since blumi is BYOK and runs no model in-process, point a provider at a local GPU server — the ollama, local-mlx, and local-cuda presets ship in config. On Linux/NVIDIA this is the recommended path: a lean blumi + Ollama for the LLM and GPU embeddings, with no fragile native linking.
Grid-aware offload. Every node reports its accelerator in /api/grid/metrics (strongest = CUDA > Apple CoreML > CPU). A CPU node can set embeddings.backend = "grid" to offload embedding to the strongest GPU peer (secret-authed POST /api/grid/embed) with a local/FTS5 fallback — so a lean Linux box rides a Mac/CUDA peer for vectors. See Grid → GPU-aware embedding offload. The active accelerator shows in the TUI (/accel), /api/status, and the blugo Status/Grid panels.

The heavy embedder build is Apple-default and opt-in elsewhere, so a plain Linux/CI install stays lean and never does a multi-GB native link unless you ask for GPU (BLUMI_CUDA=1 / --features gpu-cuda). Mismatch (e.g. acceleration.mode = "cuda" on a CPU build) degrades to CPU with a loud warning — blumi accel doctor shows the compiled features.

Configuration

In ~/.blumi/settings.json (every field shown is the default — omit to keep it):

{
 "embeddings": {
 "enabled": true,
 "backend": "local",
 "model": "bge-small-en-v1.5",
 "dim": 384
 },
 "memory": {
 "enabled": true,
 "recall_k": 5,
 "dedup_threshold": 0.92,
 "max_per_namespace": 2000,
 "diffuse": true,
 "sweep_secs": 60
 },
 "knowledge": {
 "enabled": true,
 "max_file_kb": 256,
 "exclude": []
 },
 "acceleration": {
 "mode": "auto"
 },
 "heal": {
 "enabled": true,
 "recovery_budget": 2,
 "verify": true,
 "learn": true,
 "evolve": "auto",
 "redact_paths": true
 }
}

embeddings.backend — local (bundled ONNX model, fully offline), openai (a configured OpenAI-compatible /embeddings endpoint, e.g. Ollama/llama.cpp/cloud), or grid (offload to the strongest GPU peer via POST /api/grid/embed, with a local/FTS5 fallback — see GPU acceleration).
acceleration.mode — auto (detect), cpu, apple (CoreML/Metal), or cuda for the bundled embedder. See GPU acceleration.
heal — self-healing reflex recovery + failure→fix learning + evolution; learned fixes are stored as agent-namespace memories (so they diffuse). See Self-Management → Self-healing.
Lean / constrained install: set embeddings.enabled = false to skip the model entirely — memory and code search then run on FTS5 keyword matching, no download.

The bundled-embeddings binary is the default; a --no-default-features build of blumi-llm omits the ONNX runtime for a leaner artifact.

Troubleshooting

First memory/code op is slow, then fast. That's the one-time ~130 MB model download (~/.blumi/models). It runs in the background; recall falls back to keyword until it's ready, and a later sweep backfills any vectors. Confirm with du -sh ~/.blumi/models (~130 MB when complete).
Search returns irrelevant results. Make sure the repo is indexed (blumi knowledge status). Symbol/keyword queries are answered FTS-first, so exact names land precisely; very broad queries lean on vectors.
Diffusion didn't reach a peer. Both nodes must run a build with /api/grid/memory, share the grid secret, and have the grid up; the memory must be in a non-user namespace; allow one sweep (~sweep_secs). The user namespace never diffuses by design.

FAQ

Does blumi need an API key for memory or code search?

No. blumi's memory and code knowledge base run on a bundled, on-device embedding model (bge-small-en-v1.5), so they need no API key. You only bring your own key (BYOK) for the LLM that drives the agent and, optionally, for voice — memory and search work without one.

Is my memory sent to the cloud?

No — memory is local-first and stored in SQLite (~/.blumi/blumi.db). The bundled embedder runs on your machine, so nothing leaves it for memory or code search. With cross-peer diffusion on, high-value memories spread only to your own grid peers over an authenticated channel — never to a third party — and your private user namespace never leaves the originating node at all.

How is blumi's memory different from a vector database?

A plain vector database only stores and retrieves embeddings; blumi adds SEDM governance on top. blumi dedups on write (merging near-duplicates instead of piling up), scores each memory by retrieval utility and a learned value, consolidates and evicts the weakest entries, learns verified failure→fix episodes, and diffuses worthwhile memories across machines. Memory that only grows becomes noise — SEDM keeps it useful.

How does blumi survive a crash mid-task?

Through durable execution: every tool step is checkpointed to SQLite, so a crashed or restarted run resumes from the last completed step instead of replaying the whole turn. This is the backbone for long autonomous runs like blumi loop and grid-dispatched work.

How does blumi learn which memories actually work?

blumi learns from outcomes, not just retrieval (v0.5.0). A guided recovery is stored as a probationary pending hypothesis and only promoted to a real fix once the same tool later succeeds (cross-step verify). Each memory also carries a learned value — rewarded when it helped a productive step, decayed on failures, and bumped when a peer independently holds it — and eviction drops the lowest-value memory first, so genuinely useful memories survive.

Can blumi work fully offline?

Yes. After the one-time ~130 MB model download into ~/.blumi/models, memory and code search run fully offline. If a node can't reach the network — or if you set embeddings.enabled = false — memory and knowledge transparently fall back to FTS5 keyword search and keep working.

Memory and Knowledge

Memory & Knowledge

At a glance

How does durable execution survive a crash?

How does blumi's semantic long-term memory (RAG) work?

White-box memory (view · pin · edit · delete)

What is SEDM (self-evolving distributed memory)?

Learning from outcomes (v0.5.0)

How does blumi search my codebase?

Structural code graph (opt-in)

CLI — blumi knowledge

From chat — agent tools

From the phone — blugo Code tab

GPU acceleration

Configuration

Troubleshooting

FAQ

Does blumi need an API key for memory or code search?

Is my memory sent to the cloud?

How is blumi's memory different from a vector database?

How does blumi survive a crash mid-task?

How does blumi learn which memories actually work?

Can blumi work fully offline?

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

blumi wiki

Clone this wiki locally

CLI — `blumi knowledge`