Knowledge Graph Guide
How knowledge graphs work in SuperLocalMemory - TF-IDF entity extraction, Leiden clustering, and graph-enhanced search explained for developers.
A knowledge graph is a network of entities (concepts) and relationships that represents how your memories connect to each other. SuperLocalMemory automatically builds this graph from your saved memories to improve search quality and discover hidden relationships.
Example:
Memory 1: "We use FastAPI for REST APIs"
Memory 2: "JWT tokens expire after 24 hours"
Memory 3: "FastAPI requires authentication middleware"
Knowledge Graph discovers:
FastAPI ←→ REST APIs
FastAPI ←→ authentication
authentication ←→ JWT tokens
Even though Memory 1 and 2 don't mention each other,
the graph connects them via "authentication"!
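SuperLocalMemory follows the GraphRAG approach (Microsoft Research) and builds the graph with three core algorithms: TF-IDF entity extraction, Leiden clustering, and relationship discovery.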
What it does: Identifies important terms (entities) in your memories.
TF-IDF = Term Frequency - Inverse Document Frequency
Formula (simplified):
importance = (how often term appears in memory) × log(total memories / memories with this term)
Example:
Memory: "FastAPI is faster than Flask for high-throughput APIs"
Extracted entities:
- "FastAPI" (TF-IDF: 0.85) ✅ Important
- "Flask" (TF-IDF: 0.72) ✅ Important
- "high-throughput" (TF-IDF: 0.68) ✅ Important
- "APIs" (TF-IDF: 0.45) ⚠️ Common but relevant
- "is" (TF-IDF: 0.02) ❌ Stop word, filtered out
- "than" (TF-IDF: 0.01) ❌ Stop word, filtered out
Filtering rules:
- Minimum TF-IDF score: 0.1
- Stop words removed (the, and, or, is, etc.)
- Case insensitive ("React" = "react")
- Minimum term length: 3 characters
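The formula and filtering rules above can be sketched in a few lines of plain Python. This is an illustration only, not the tool's actual extractor: the `STOP_WORDS` set, the `tokenize` and `extract_entities` names, and the choice to normalize term frequency by memory length are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (assumption); the tool's real list is not shown here.
STOP_WORDS = {"the", "and", "or", "is", "than", "for", "after", "we", "use"}


def tokenize(text):
    """Lowercase the text, then drop stop words and terms shorter than 3 characters."""
    terms = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    return [t for t in terms if len(t) >= 3 and t not in STOP_WORDS]


def extract_entities(memories, min_score=0.1):
    """Return one {term: score} dict per memory, scored as tf * log(N / df)."""
    docs = [tokenize(m) for m in memories]
    n_docs = len(docs)
    # Document frequency: in how many memories each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        scores = {}
        for term, count in Counter(doc).items():
            tf = count / len(doc)                      # share of this memory's terms
            idf = math.log(n_docs / doc_freq[term])    # rarer terms score higher
            score = tf * idf
            if score >= min_score:                     # minimum TF-IDF score rule
                scores[term] = round(score, 3)
        results.append(scores)
    return results
```

A term that appears in every memory gets `log(1) = 0` and is dropped by the threshold, which is why very common words carry no weight even when they are not on the stop-word list.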
What it does: Groups related memories into topic clusters.
Leiden = Community detection algorithm that improves on the older Louvain algorithm by guaranteeing that every community is internally connected
How it works:
- Creates graph nodes from entities
- Creates edges between entities that co-occur
- Detects "communities" (groups of highly connected nodes)
- Optimizes for modularity (how well-defined clusters are)
Example clusters discovered:
Cluster 1: "Authentication & Security" (23 memories)
Top entities: JWT, OAuth, tokens, auth, security
Cluster 2: "Database & PostgreSQL" (18 memories)
Top entities: PostgreSQL, database, SQL, queries, indexes
Cluster 3: "React & Frontend" (15 memories)
Top entities: React, hooks, components, state, props
Modularity score:
- Excellent: >0.7 (clusters are well-defined)
- Good: 0.5-0.7 (clusters are meaningful)
- Poor: <0.3 (clusters are arbitrary)
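To make the modularity score concrete, here is a minimal sketch that computes it for a given partition of an undirected, unweighted graph using the standard definition. The `modularity` function and its argument names are hypothetical; the tool itself relies on the optional `leidenalg` dependency rather than code like this.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs.
    cluster_of: dict mapping each node to its cluster id.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the cluster
    degree_sum = defaultdict(int)  # summed degree of the cluster's nodes
    for u, v in edges:
        degree_sum[cluster_of[u]] += 1
        degree_sum[cluster_of[v]] += 1
        if cluster_of[u] == cluster_of[v]:
            internal[cluster_of[u]] += 1
    # Observed share of internal edges minus the share expected at random.
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

Putting every node in a single cluster always yields 0, so a positive score means the partition captures more internal structure than chance would.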
What it does: Finds connections between memories.
Three types of edges:
A. Similarity Edges
cosine_similarity = dot(vector_A, vector_B) / (norm(vector_A) * norm(vector_B))
- Score 0.8-1.0: Very similar content
- Score 0.5-0.8: Related content
- Score 0.3-0.5: Loosely related
- Score <0.3: Not connected
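The cosine formula and score bands above translate directly into code. The sketch below is illustrative: `similarity_label` is a hypothetical helper, and treating each band's lower bound as inclusive is an assumption, since the listed ranges share their endpoints.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def similarity_label(score):
    """Map a score to the bands listed above (lower bounds inclusive by assumption)."""
    if score >= 0.8:
        return "very similar"
    if score >= 0.5:
        return "related"
    if score >= 0.3:
        return "loosely related"
    return "not connected"
```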
B. Co-occurrence Edges
If two entities appear in the same memory → create edge
Weight = number of co-occurrences
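Counting co-occurrences is a small exercise with `itertools.combinations`. In this sketch the function name and the input shape (one list of entities per memory) are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_per_memory):
    """Count how many memories each unordered entity pair shares."""
    weights = Counter()
    for entities in entities_per_memory:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights
```

Each key is an edge and each value is its weight, so a pair that appears together in two memories gets weight 2.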
C. Temporal Edges
If two memories are created within 1 hour → they may be related
Useful for conversation threads
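A temporal edge pass can be sketched by sorting memories by creation time and pairing those that fall inside the window. The `temporal_edges` name, the `(memory_id, created_at)` tuple shape, and treating a gap of exactly one hour as "within" are assumptions for this example.

```python
from datetime import datetime, timedelta


def temporal_edges(memories, window=timedelta(hours=1)):
    """Pair up memories created within `window` of each other.

    memories: list of (memory_id, created_at) tuples, created_at a datetime.
    """
    ordered = sorted(memories, key=lambda item: item[1])
    edges = []
    for i, (id_a, time_a) in enumerate(ordered):
        for id_b, time_b in ordered[i + 1:]:
            if time_b - time_a > window:
                break  # the list is sorted, so later memories are even further away
            edges.append((id_a, id_b))
    return edges
```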
slm build-graph
Output:
Building Knowledge Graph...
Phase 1: Entity Extraction
Scanning 1,247 memories...
Extracted 892 unique entities
Created 892 graph nodes
✓ Complete (3.2s)
Phase 2: Relationship Discovery
Computing similarity scores...
Created 3,456 edges (relationships)
Avg edges per node: 3.9
✓ Complete (5.1s)
Phase 3: Optimization
Indexing graph structure...
Pruning weak edges (score < 0.3)...
Final edge count: 2,134
✓ Complete (1.2s)
✅ Knowledge graph built successfully!
Graph Statistics:
Nodes: 892
Edges: 2,134
Density: 0.54%
Largest Component: 856 nodes (96%)
slm build-graph --clustering
Requires optional dependencies:
pip3 install python-igraph leidenalg
Additional output:
Phase 4: Topic Clustering (Leiden)
Detecting communities...
Found 47 clusters
Largest cluster: 89 memories
Smallest cluster: 3 memories
Modularity score: 0.82 (excellent)
✓ Complete (2.3s)
Discovered Clusters:
Cluster 1 (89 memories): "Authentication & Security"
Top entities: JWT, OAuth, tokens, auth, security
Cluster 2 (76 memories): "Database & PostgreSQL"
Top entities: PostgreSQL, database, SQL, queries, indexes
slm build-graph --force
Deletes existing graph and rebuilds from scratch. Use when:
- Graph seems corrupted
- After major bulk import
- Want fresh start
Nodes: Total unique entities extracted
Good indicators:
- 100+ nodes for 1,000 memories
- 500+ nodes for 5,000 memories
Poor indicators:
- <10 nodes for 1,000 memories (not extracting entities properly)
Edges: Total relationships discovered
Edges/Nodes ratio:
- Good: >2 (well-connected)
- Poor: <1 (disconnected graph)
Example:
892 nodes, 2,134 edges
Ratio: 2,134 / 892 = 2.39 ✅ Good
Density: How connected the graph is
Formula:
density = (actual edges / possible edges) × 100
possible edges = nodes × (nodes - 1) / 2
Example:
892 nodes
Possible edges: 892 × 891 / 2 = 397,386
Actual edges: 2,134
Density: (2,134 / 397,386) × 100 = 0.54%
Typical values:
- 0.1% - 1%: Normal
- <0.05%: Very disconnected (isolated knowledge)
- >5%: Too connected (poor entity extraction)
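The density and edges-per-node figures can be checked with a couple of one-line functions. The function names below are hypothetical helpers written for this guide, not part of the `slm` tooling.

```python
def graph_density_percent(nodes, edges):
    """Density of an undirected graph as a percentage of all possible edges."""
    if nodes < 2:
        return 0.0
    possible = nodes * (nodes - 1) / 2
    return edges / possible * 100


def edges_per_node(nodes, edges):
    """Edges/Nodes ratio used as a connectivity indicator."""
    return edges / nodes if nodes else 0.0


# Worked example with the figures used in this guide
print(round(graph_density_percent(892, 2134), 2))  # 0.54
print(round(edges_per_node(892, 2134), 2))         # 2.39
```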
Largest Component: Size of the biggest connected subgraph
Good indicators:
- >80% of nodes (knowledge is interconnected)
Poor indicators:
- <50% of nodes (fragmented knowledge islands)
Example:
892 nodes total
856 nodes in largest component
Coverage: 856 / 892 = 96% ✅ Excellent
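The largest component can be found with a breadth-first search over the graph. This is a generic sketch of that technique; the `largest_component_size` name and the node/edge list inputs are assumptions, and the tool may compute the figure differently.

```python
from collections import defaultdict, deque


def largest_component_size(nodes, edges):
    """Size of the biggest connected component in an undirected graph."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    largest = 0
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search from every node not yet assigned to a component
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        largest = max(largest, size)
    return largest
```

Dividing the result by the total node count gives the coverage percentage shown in the example.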
- Bulk imports - Added 50+ memories at once
- Database restore - Restored from backup
- Major milestone - Sprint complete, project phase done
- Monthly - Keep graph optimized
- After 500 new memories - Maintain quality
- When search feels slow - Rebuild indexes
- Poor search results - Graph may be stale
- Missing relationships - Rebuild connections
- Corrupted graph errors - Force rebuild
Automation (cron):
# Every Sunday at 3 AM
0 3 * * 0 /usr/local/bin/slm build-graph --clustering >> /var/log/slm-build.log 2>&1
Basic keyword matching:
slm recall "authentication"
Results:
- "JWT tokens expire after 24 hours" ✅ Contains "auth" stem
- "User login endpoint uses POST" ❌ Missed (no "auth" keyword)
Graph traversal finds related memories:
slm recall "authentication"
Results (via graph):
- "JWT tokens expire after 24 hours" ✅ Direct match
- "User login endpoint uses POST" ✅ Graph: login → auth → JWT
- "OAuth 2.0 flow implementation" ✅ Graph: OAuth → tokens → auth
- "Session management strategy" ✅ Graph: sessions → auth → security
How it works:
- Find memories matching query (direct)
- Extract entities from those memories
- Traverse graph to find related entities
- Find memories containing related entities
- Rank by combined score (keyword + graph + semantic)
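The five steps above can be sketched as a small function. This is a simplified illustration under stated assumptions: memories are represented as sets of entities, traversal is limited to one hop, the semantic score is left out, and `graph_weight` is a made-up parameter. The function name and data shapes are hypothetical and do not reflect the real `GraphEngine` internals.

```python
def graph_enhanced_search(query, memories, entity_graph, graph_weight=0.5):
    """Rank memories by direct keyword match plus one-hop graph expansion.

    memories: {memory_id: set of entities}
    entity_graph: {entity: {neighbor_entity: edge_weight}}
    """
    query_terms = set(query.lower().split())

    # Step 1: memories that match the query directly
    direct = {mid for mid, ents in memories.items() if ents & query_terms}

    # Step 2: entities found in those memories
    seed_entities = set().union(*(memories[mid] for mid in direct))

    # Step 3: traverse the graph one hop to collect related entities
    related = {}
    for entity in seed_entities:
        for neighbor, weight in entity_graph.get(entity, {}).items():
            if neighbor not in seed_entities:
                related[neighbor] = max(related.get(neighbor, 0.0), weight)

    # Steps 4 and 5: score memories containing related entities, then rank
    scores = {mid: 1.0 for mid in direct}
    for mid, ents in memories.items():
        if mid in direct:
            continue
        graph_score = max((related[e] for e in ents if e in related), default=0.0)
        if graph_score > 0:
            scores[mid] = graph_weight * graph_score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Direct matches always rank above graph-only matches here because `graph_weight` is below 1; a real ranking would also blend in the semantic score.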
# Build with clustering
slm build-graph --clustering

# Search within specific cluster
slm recall "performance" --cluster "Database & PostgreSQL"
Benefits:
- Faster search (smaller search space)
- More relevant results (topically focused)
- Avoids false positives from other domains
# Python API
from memory_store_v2 import MemoryStoreV2
from graph_engine import GraphEngine

store = MemoryStoreV2()
graph = GraphEngine()

# Find memories related to ID 42
related = graph.get_related_memories(42, limit=5)
for mem_id, score in related:
    print(f"Memory {mem_id}: {score:.2f}")
# Export graph for visualization (coming soon)
slm build-graph --export graph.json

# Generate HTML visualization
slm graph-viz graph.json > graph.html
| Memory Count | Build Time | With Clustering |
|---|---|---|
| 100 | ~1s | ~1.5s |
| 1,000 | ~10s | ~15s |
| 5,000 | ~1min | ~1.5min |
| 10,000 | ~2min | ~3min |
| 50,000+ | ~15min | ~25min |
Factors affecting speed:
- Memory content length (longer = slower)
- Vocabulary size (more unique words = slower)
- Hardware (CPU, RAM)
Before graph:
- Average search time: 150ms
- Recall@10: 68% (finds 68% of relevant memories)
After graph:
- Average search time: 45ms (about 3.3× faster)
- Recall@10: 87% (finds 87% of relevant memories)
Improvement: 28% more relevant results in relative terms (Recall@10 from 68% to 87%) and 70% lower average search time (150ms to 45ms)
Cause: Not enough RAM for large graph
Solution:
# Build in chunks
slm build-graph --chunk-size 1000

# Or delete memories older than 180 days first (export them beforehand if you want an archive)
sqlite3 ~/.claude-memory/memory.db \
  "DELETE FROM memories WHERE created_at < date('now', '-180 days');"
Cause: Optional dependencies not installed
Solution:
pip3 install python-igraph leidenalg

# Verify
python3 -c "import igraph; import leidenalg"

# Try again
slm build-graph --clustering
Cause: Stale graph or poor similarity threshold
Solution:
# Force complete rebuild
slm build-graph --force

# Adjust similarity threshold (advanced)
slm build-graph --min-similarity 0.4  # Default: 0.3
Solutions:
# Show progress
slm build-graph --verbose

# Skip clustering (faster)
slm build-graph  # No --clustering flag

# Check disk space
df -h ~/.claude-memory/
# Import many memories
while read -r line; do
  slm remember "$line"
done < bulk_memories.txt

# Immediately rebuild graph
slm build-graph
# Install dependencies once
pip3 install python-igraph leidenalg

# Always build with clustering if >1000 memories
if [ "$(slm status | grep "Total memories" | awk '{print $3}')" -gt 1000 ]; then
  slm build-graph --clustering
else
  slm build-graph
fi
# Check graph statistics
slm status --verbose | grep -A 10 "Knowledge Graph"

# Good indicators:
# - Edges/Nodes ratio > 2
# - Density: 0.1% - 1%
# - Largest component: >80%
# - Modularity (if clustering): >0.5
# Add to crontab
# Weekly: Sunday 3 AM
0 3 * * 0 /usr/local/bin/slm build-graph --clustering

# After git push (post-push hook)
#!/bin/bash
slm remember "Pushed $(git log -1 --oneline)" --tags git
slm build-graph
Python code (simplified):
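from sklearn.feature_extraction.text import TfidfVectorizer

# `memories` is a list of memory strings loaded from the store
vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.8,
    stop_words='english',
    ngram_range=(1, 2)
)

# Fit on all memories
tfidf_matrix = vectorizer.fit_transform(memories)

# Get feature names (entities)
entities = vectorizer.get_feature_names_out()

# Score each entity by its highest TF-IDF value in any memory
# (the original snippet does not show how `scores` is derived; max is one reasonable choice)
scores = tfidf_matrix.max(axis=0).toarray().ravel()

# Filter by score threshold
important_entities = [e for e, score in zip(entities, scores) if score > 0.1]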
Resolution parameter:
- Default: 1.0
- Lower (0.5): Fewer, larger clusters
- Higher (2.0): More, smaller clusters
Quality metric (modularity):
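Q = sum over clusters c of [ (edges_within_c / total_edges) - (degree_sum_c / (2 × total_edges))^2 ]
Here degree_sum_c is the summed degree of the nodes in cluster c, and the squared term is the share of edges expected inside c if edges were placed at random.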
Remove weak edges to improve performance:
# Keep only edges with score > threshold
threshold = 0.3
pruned_edges = [(u, v, w) for u, v, w in edges if w > threshold]
# Result: 30-50% fewer edges, same search quality
Standard Leiden finds flat communities such as "Python", "JavaScript", and "DevOps". Hierarchical Leiden goes deeper by recursively sub-clustering large communities:
Python (42 members)
├── FastAPI (18 members)
│   ├── Authentication (7 members)
│   └── Database Models (6 members)
├── Data Science (14 members)
└── CLI Tools (10 members)
- Flat Leiden runs first (existing behavior)
- Clusters with ≥10 members are recursively sub-clustered
- Maximum depth: 3 levels (configurable via the `max_depth` parameter)
- Each sub-cluster gets its own name from TF-IDF entity extraction
- `parent_cluster_id` and `depth` columns track the hierarchy in the `graph_clusters` table
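The recursion described in the list above can be sketched independently of Leiden itself. In this illustration the community detector is passed in as a `split` callable so the example runs without `leidenalg`; the `build_hierarchy` name, the row layout, and the reading of "3 levels" as depths 0 through 2 are assumptions, not the actual `graph_engine.py` code.

```python
from itertools import count


def build_hierarchy(members, split, max_depth=3, min_size=10):
    """Recursively sub-cluster and return (cluster_id, parent_cluster_id, depth, members) rows.

    split: callable that takes a list of members and returns a list of groups
           (a stand-in for a community detection run such as Leiden).
    """
    rows = []
    ids = count(1)

    def recurse(group, parent_id, depth):
        cluster_id = next(ids)
        rows.append((cluster_id, parent_id, depth, group))
        # Stop at the depth limit or when the cluster is too small to split
        if depth + 1 >= max_depth or len(group) < min_size:
            return
        parts = split(group)
        if len(parts) < 2:
            return  # the splitter found no finer structure
        for part in parts:
            recurse(part, cluster_id, depth + 1)

    # The flat pass runs first; each flat cluster becomes a depth-0 row
    for top_level in split(members):
        recurse(top_level, None, 0)
    return rows
```

Each row carries the same `parent_cluster_id` and `depth` information that the `graph_clusters` table stores, with `None` as the parent of top-level clusters.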
# Run hierarchical sub-clustering on existing clusters
python3 ~/.claude-memory/graph_engine.py hierarchical

# Full build (includes hierarchical + summaries automatically)
python3 ~/.claude-memory/graph_engine.py build
-- New columns on graph_clusters (added automatically)
ALTER TABLE graph_clusters ADD COLUMN parent_cluster_id INTEGER;
ALTER TABLE graph_clusters ADD COLUMN depth INTEGER DEFAULT 0;
ALTER TABLE graph_clusters ADD COLUMN summary TEXT;
Every cluster gets a TF-IDF structured summary describing its contents:
Cluster "FastAPI & Authentication"
Summary: Key topics: fastapi, authentication, jwt, middleware, oauth |
Projects: myapp, api-gateway | Categories: backend |
18 memories | Sub-cluster of: Python
| Component | Source | Example |
|---|---|---|
| Key topics | Top 5 TF-IDF entities | fastapi, authentication, jwt |
| Projects | Distinct project_name values | myapp, api-gateway |
| Categories | Distinct category values | backend, security |
| Size | Member count | 18 memories |
| Hierarchy | Parent cluster name (if sub-cluster) | Sub-cluster of: Python |
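As a rough illustration of how the components in the table combine into one summary string, the sketch below assembles them in the order shown in the example. The `build_summary` function and its parameters are hypothetical; the real generator lives in `graph_engine.py` and may differ.

```python
def build_summary(top_entities, projects, categories, member_count, parent_name=None):
    """Join the summary components with ' | ' in the order used in the example above."""
    parts = [
        "Key topics: " + ", ".join(top_entities[:5]),            # top 5 TF-IDF entities
        "Projects: " + ", ".join(dict.fromkeys(projects)),       # distinct, order preserved
        "Categories: " + ", ".join(dict.fromkeys(categories)),   # distinct, order preserved
        f"{member_count} memories",
    ]
    if parent_name:
        parts.append(f"Sub-cluster of: {parent_name}")
    return " | ".join(parts)


print(build_summary(
    ["fastapi", "authentication", "jwt", "middleware", "oauth"],
    ["myapp", "api-gateway", "myapp"],
    ["backend"],
    18,
    parent_name="Python",
))
```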
# Generate summaries for all clusters
python3 ~/.claude-memory/graph_engine.py summaries

# Summaries are also generated automatically during build
python3 ~/.claude-memory/graph_engine.py build
Summaries appear in the web dashboard clusters view and are returned by the /api/clusters endpoint.
- Quick Start Tutorial - First-time setup
- Pattern Learning Explained - How pattern learning works
- CLI Cheatsheet - Command reference
- Python API - Programmatic access
- Why Local Matters - Privacy benefits
Created by Varun Pratap Bhardwaj, Solution Architect • SuperLocalMemory