Xctopus (Alpha) is an adaptive knowledge architecture designed to mitigate Catastrophic Forgetting through distributed epistemic memory. The system organizes information flows into hierarchical structures of Knowledge Nodes (KNs) that evolve dynamically according to the nature of the data.
The system is built from Transformers, Bayesian Nodes, and modular knowledge orchestration. It implements an Adaptive Knowledge Architecture by Layers, where Layer 1 acts as a "living organism" that automatically adjusts its granularity according to domain complexity.
Empirically validated on semantically opposite domains (conversational and scientific), Xctopus demonstrates automatic adaptation while maintaining semantic purity. Contributions are welcome as the system continues its active development.
- Project Objective
- Architecture Overview
- How it Works
- Capa Clustering (Layer 1)
- Empirical Validation (Layer 1)
- Performance Optimizations
- Project Structure
- Installation
- Quick Start
- Configuration
- Documentation
- Roadmap
- Contributions
- License
Our primary mission is the Mitigation of Catastrophic Forgetting in continual learning systems. Xctopus achieves this by evolving from a rigid model into an Adaptive Knowledge Architecture.
Organic Adaptation: Layer 1 acts as a "living organism" that adjusts its granularity based on domain complexity (e.g., automatically shifting from "Continents" in conversational data to "Archipelagos" in scientific data).
Epistemic Collaboration: Multiple Knowledge Nodes collaborate to process information, update Bayesian beliefs, and enable cumulative learning while preserving previously acquired knowledge.
Traceability: Ensuring that every piece of acquired knowledge is persistent, scalable, and semantically pure.
The system demonstrates automatic adaptation - detecting domain characteristics and adjusting clustering density without manual intervention, while maintaining semantic purity across diverse domains (validated on conversational and scientific domains with <1% variance difference).
Xctopus Adaptive Knowledge Architecture
Xctopus is built on an Adaptive Knowledge Architecture by Layers, where each layer acts as a specialized component that evolves based on domain characteristics. Currently, Layer 1 (Clustering & Fusion) is fully implemented, optimized, and empirically validated.
- Adaptive Granularity: Layer 1 acts as a "living organism" that automatically adjusts clustering density. It successfully transitions from "Continents" (broad topics) to "Archipelagos" (technical niches) without manual retuning.
- Hierarchical Nodes: Knowledge Nodes encapsulate self-contained computational units with statistical signatures (Centroid, Mass, Variance).
- Modular Orchestration: A lightweight layer that coordinates the FilterBayesian and KNRepository for real-time routing.
- Continuous Learning: Bayesian belief updating for adaptive knowledge acquisition, mitigating catastrophic forgetting at the structural level.
- Optimized Performance: 99% of iterations in GPU/RAM via vectorized operations and SQLite WAL mode.
- Semantic Purity Preservation: Maintains strict knowledge coherence (variance stability ~0.29) even when scaling from 600 to 3,900+ nodes.
Xctopus does not use static clustering. Instead, it implements Organic Knowledge Induction. The system grows and reshapes itself following four fundamental rules:
Every new piece of data is evaluated by the FilterBayesian. It calculates the probability of belonging to a node based on its gravitational pull (mass) and semantic distance.
A LocalFilter acts as a quality gate. If a data point is an outlier that would ruin a node's coherence (variance), it is rejected to keep the knowledge "pure."
Data that doesn't fit anywhere isn't lost. It goes to a Temporary Buffer. When enough similar ideas gather, a new Knowledge Node is born.
Nodes update their "memory" (centroid and variance) using Welford's Algorithm. This allows the system to learn incrementally without ever needing to re-train from scratch.
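The incremental update in the last rule can be sketched as follows. This is a minimal illustration of Welford's algorithm applied per dimension of an embedding, not the actual Xctopus implementation (which operates on FP16 tensors):

```python
import numpy as np

def welford_update(count, mean, m2, new_value):
    """One Welford step: fold a new embedding into the running count,
    centroid (mean), and sum of squared deviations (m2)."""
    count += 1
    delta = new_value - mean
    mean = mean + delta / count
    delta2 = new_value - mean
    m2 = m2 + delta * delta2
    return count, mean, m2

# Fold three embeddings into a running centroid/variance
rng = np.random.default_rng(0)
points = rng.normal(size=(3, 4))
count, mean, m2 = 0, np.zeros(4), np.zeros(4)
for p in points:
    count, mean, m2 = welford_update(count, mean, m2, p)

variance = m2 / count  # population variance per dimension
```

Because each step only touches the running statistics, the node never needs the raw history of embeddings to stay up to date, which is what makes retraining from scratch unnecessary.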
Visualization: The system maps knowledge as a galaxy of nodes where size reflects accumulated semantic mass. See the "Knowledge Galaxy Visualization" section below for a detailed view of the "Archipelago" structure in scientific domains.
- Dynamic Growth: Knowledge Nodes are created organically as new concepts emerge, not predefined
- Real-time Statistics: Centroids, mass, and variance are updated incrementally with each embedding
- Bayesian Intelligence: Routing decisions consider both similarity and node maturity (mass)
- Memory Efficient: No duplicate storage; Repository is the single source of truth
- Persistent Learning: System state is maintained across sessions via SQLite persistence
- Adaptive Granularity: Automatically adjusts clustering density based on domain characteristics (validated on conversational and scientific domains)
- Post-Clustering Fusion: Intelligent merging of similar nodes to reduce fragmentation while preserving semantic purity
Capa Clustering is the foundational layer of Xctopus, responsible for organic organization of embeddings into Knowledge Nodes through statistical routing and semantic coherence.
- SQLite-based persistence for Knowledge Node metadata
- Efficient FP16 tensor storage as BLOBs
- Buffer management for temporary embeddings
- Optimized queries with WAL mode and vectorized operations
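The FP16-BLOB storage pattern described above can be sketched with stdlib `sqlite3`, using numpy float16 in place of the project's torch tensors. The table and column names here are illustrative, not KNRepository's actual schema:

```python
import sqlite3
import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # enables WAL on a disk-backed database
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, centroid BLOB)")

# Store a 384-dim centroid as a compact FP16 BLOB (2 bytes per component)
centroid = np.random.default_rng(0).normal(size=384).astype(np.float16)
conn.execute("INSERT INTO nodes (centroid) VALUES (?)", (centroid.tobytes(),))

# Round-trip: read the BLOB back into an FP16 vector
blob = conn.execute("SELECT centroid FROM nodes").fetchone()[0]
restored = np.frombuffer(blob, dtype=np.float16)
```

Storing raw FP16 bytes keeps each signature at 768 bytes per node, which is what makes loading thousands of signatures at warmup cheap.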
- Core routing logic based on 4 Golden Rules:
  - Rule 1: Similarity Threshold (`S_MIN`)
  - Rule 2: Critical Mass (`log1p(mass) * LAMBDA_FACTOR`)
  - Rule 3: Variance Penalty
  - Rule 4: Statistical Stability
- Vectorized similarity calculations for performance
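A hedged sketch of how Rules 1 and 2 could combine into a routing score. The exact scoring formula inside FilterBayesian is not documented here, so `route_score` is illustrative; the constants mirror the defaults in `settings.py`:

```python
import numpy as np

S_MIN = 0.65          # Rule 1: minimum cosine similarity
LAMBDA_FACTOR = 0.1   # Rule 2: critical-mass attraction strength

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route_score(embedding, centroid, mass):
    """Gate on S_MIN (Rule 1), then boost by node mass (Rule 2)."""
    sim = cosine(embedding, centroid)
    if sim < S_MIN:
        return None  # fails Rule 1: candidate falls through to the buffer
    return sim + np.log1p(mass) * LAMBDA_FACTOR

e = np.array([1.0, 0.0])
c = np.array([0.9, 0.1])
score = route_score(e, c, mass=50)  # high similarity plus a mass bonus
```

The `log1p(mass)` term is what gives mature nodes their "gravitational pull": between two equally similar nodes, the heavier one wins the embedding.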
- Encapsulates statistical signature (centroid, mass, variance)
- Welford's algorithm for numerically stable updates (FP16-safe)
- Local filter for semantic purity validation
- Transformer/LoRA components (standby for future layers)
- Coordinates routing decisions and node lifecycle
- Intelligent refresh of FilterBayesian signatures (every `REFRESH_INTERVAL` iterations)
- Buffer aggregation (groups similar buffers before creating new ones)
- Warmup: loads existing nodes from Repository on startup
- Entry point for processing datasets
- Optimized processing loop (warmup + intelligent refresh)
- Rich console output with progress bars and formatted tables
- Batch commits for efficient database operations
- Post-clustering fusion protocol for consolidating similar Knowledge Nodes
- Vectorized similarity matrix calculations (optimized for large-scale analysis)
- Semantic adjacency matrix computation
- Automatic buffer reassignment after fusion operations
- Fusion potential diagnostics, optimized from O(n²) pairwise loops to vectorized matrix operations
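The adjacency-matrix computation can be sketched as a single matmul over L2-normalized centroids. The function name is illustrative; the threshold mirrors `FUSION_SIMILARITY_THRESHOLD` from the configuration:

```python
import numpy as np

FUSION_SIMILARITY_THRESHOLD = 0.85

def adjacency_matrix(centroids):
    """All-pairs cosine similarity in one vectorized pass, thresholded
    into a boolean adjacency matrix of fusion candidates."""
    normed = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = normed @ normed.T          # every O(n^2) pair in one matmul
    np.fill_diagonal(sims, 0.0)       # a node is never its own fusion candidate
    return sims >= FUSION_SIMILARITY_THRESHOLD

rng = np.random.default_rng(1)
centroids = rng.normal(size=(5, 8))   # 5 toy nodes, 8-dim centroids
adj = adjacency_matrix(centroids)
```

Each `True` cell marks a candidate pair; the fusion protocol would then check the mass and variance constraints before actually merging.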
- ✅ Optimized Performance: 99% of iterations in GPU/RAM, minimal disk I/O
- ✅ Numerical Stability: Welford's algorithm prevents FP16 overflow
- ✅ Memory Efficient: No duplicate embedding storage (Repository is single source of truth)
- ✅ Scalable: Vectorized operations handle large datasets efficiently
- ✅ Persistent: SQLite WAL mode for concurrent read/write operations
- ✅ Traceable: Comprehensive logging and structured output
- ✅ Adaptive Clustering: Automatically adjusts granularity based on domain characteristics
- ✅ Post-Clustering Fusion: Intelligent merging of similar Knowledge Nodes to reduce fragmentation
- ✅ Vectorized Diagnostics: Optimized similarity calculations for large-scale analysis
```
xctopus/
├── src/
│   └── xctopus/
│       ├── __init__.py          # Package initialization and exports
│       ├── settings.py          # Centralized configuration (NO hardcoded values)
│       ├── logger_config.py     # Logging setup
│       ├── main.py              # Entry point for Capa Clustering
│       ├── repository.py        # KNRepository: SQLite persistence
│       ├── filter_bayesian.py   # FilterBayesian: Routing logic
│       ├── knowledgenode.py     # KnowledgeNode: Core node logic
│       ├── orchestrator.py      # Orchestrator: Coordination layer
│       └── fusion.py            # Fusion Engine: Post-clustering consolidation
├── notebooks/                   # Jupyter notebooks for testing and analysis
│   └── quickstart.ipynb         # Main testing notebook
├── logs/                        # Log files (auto-generated)
├── knowledge_base.sqlite        # SQLite database (auto-generated)
├── pyproject.toml               # Dependencies and project config
├── .gitignore
└── README.md
```
- Python 3.8+
- PyTorch (CPU or CUDA)
- SQLite3 (usually included with Python)
```bash
# Clone the repository
git clone https://github.com/msancheza/xctopus-core.git
cd xctopus-core/xctopus

# Install dependencies
pip install -e .
```
For enhanced functionality:
```bash
# Enhanced console output (formatted tables, progress bars)
pip install "rich>=13.0.0"

# Or install all optional dependencies
pip install -e ".[all]"
```
Note: The system works without rich, but with reduced console formatting.
Your dataset should be a CSV file with embeddings. Each row should contain a single embedding vector (384 dimensions by default, configurable in settings.py).
Example CSV format:
```
embedding_0,embedding_1,embedding_2,...,embedding_383
0.123,0.456,0.789,...,0.321
...
```
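For a quick smoke test, a compliant file can be generated with random vectors standing in for real sentence embeddings (the filename and row count are arbitrary; the dimension must match `EMBEDDING_DIM` in settings.py):

```python
import csv
import numpy as np

EMBEDDING_DIM = 384  # must match EMBEDDING_DIM in settings.py

rng = np.random.default_rng(42)
rows = rng.normal(size=(10, EMBEDDING_DIM)).astype(np.float32)

with open("embeddings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Header row: embedding_0 ... embedding_383
    writer.writerow([f"embedding_{i}" for i in range(EMBEDDING_DIM)])
    writer.writerows(rows.tolist())
```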
```bash
# Process your dataset
python -m xctopus.main data/embeddings.csv
```

The system will:
- Load embeddings from the CSV
- Initialize components (Repository, FilterBayesian, Orchestrator)
- Process each embedding through the routing system
- Create Knowledge Nodes organically based on semantic similarity
- Execute post-clustering fusion to consolidate similar nodes
- Display progress and summary statistics
```python
from xctopus import KNRepository, FilterBayesian, Orchestrator
from xctopus.main import load_embeddings, process_dataset, initialize_components
import torch

# Initialize components
repository, filter_bayesian, orchestrator = initialize_components()

# Load embeddings from CSV
embeddings = load_embeddings("data/embeddings.csv")

# Process dataset
process_dataset(
    embeddings=embeddings,
    repository=repository,
    filter_bayesian=filter_bayesian,
    orchestrator=orchestrator,
)

# Access results
signatures = repository.get_all_signatures()
print(f"Created {len(signatures)} Knowledge Nodes")

# Optional: Run fusion to consolidate similar nodes
from xctopus.fusion import fuse_knowledge_nodes, diagnose_fusion_potential

# Diagnose fusion potential
diagnosis = diagnose_fusion_potential(repository)
print(f"Fusion potential: {diagnosis['similarity_pairs']}")

# Execute fusion
fusion_stats = fuse_knowledge_nodes(repository, orchestrator)
print(f"Fusion completed: {fusion_stats['fusions_performed']} nodes merged")
```
All configuration is centralized in src/xctopus/settings.py. No hardcoded values are allowed in the codebase.
```python
# Technical Configuration
DTYPE = torch.float16  # Half-precision for memory efficiency
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # Auto-detected

# Routing Parameters
S_MIN = 0.65           # Minimum cosine similarity threshold (optimized for diverse datasets)
LAMBDA_FACTOR = 0.1    # Critical mass attraction strength

# Structure Parameters
EMBEDDING_DIM = 384    # Embedding vector dimension
BUFFER_THRESHOLD = 3   # Embeddings needed to promote a buffer to a KN (reduced for faster concept validation)

# Persistence Parameters
DB_PATH = "knowledge_base.sqlite"
SAVE_BATCH_SIZE = 10   # Batch commits for efficiency

# Orchestrator Parameters
REFRESH_INTERVAL = 10  # Intelligent refresh frequency

# Fusion Parameters
FUSION_SIMILARITY_THRESHOLD = 0.85        # Minimum similarity for node fusion
FUSION_MIN_MASS = 10                      # Maximum mass for "Small Stable" nodes
FUSION_MAX_VARIANCE = 0.5                 # Maximum variance for stable nodes
FUSION_VARIANCE_INCREASE_THRESHOLD = 0.1  # Maximum variance increase after fusion
```
Edit src/xctopus/settings.py directly, or create a custom settings module:
```python
# custom_settings.py
import torch
from xctopus.settings import *

# Override specific parameters
S_MIN = 0.80
BUFFER_THRESHOLD = 10
REFRESH_INTERVAL = 20
```
Xctopus follows a layered architecture approach, where each layer builds upon the previous one:
Status: Fully implemented, optimized, and empirically validated
- ✅ Core Components: Repository, FilterBayesian, KnowledgeNode, Orchestrator
- ✅ Fusion Engine: Post-clustering consolidation of similar Knowledge Nodes
- ✅ Vectorized Diagnostics: Optimized similarity calculations for large-scale analysis
- ✅ Universal Validation: System validated on conversational and scientific domains
- ✅ Adaptive Granularity: Automatic adjustment of clustering density based on domain
- ✅ Performance Optimizations: Vectorized operations, intelligent refresh, batch commits
Validation Results:
- Processes 18,260 embeddings in ~15-16 minutes
- Maintains semantic purity (variance ~0.29) across diverse domains
- Demonstrates automatic adaptation (×ばつ granularity difference between domains)
Focus: Incremental training of Knowledge Nodes created in Layer 1
- Transformer/LoRA fine-tuning for each Knowledge Node
- Incremental learning protocols
- Knowledge persistence and retrieval
- Training state management
- Multi-layer orchestration
- Attention mechanisms between nodes
- Advanced morphological operations
- Cross-layer knowledge transfer
- Benchmark experiments on standard continual learning datasets
- Performance profiling and optimization
- Integration with external knowledge bases
The current implementation includes several critical optimizations:
- Warmup Initialization: Load signatures once at startup, not per iteration
- Intelligent Refresh: Update FilterBayesian signatures only when needed (every `REFRESH_INTERVAL` iterations)
- Vectorized Operations: Single SQL JOIN for buffer centroids (instead of N queries)
- WAL Mode: SQLite Write-Ahead Logging for concurrent read/write
- Batch Commits: Periodic database commits for efficiency
- Memory Optimization: No duplicate embedding storage (Repository is single source)
- Vectorized Similarity Calculations: Matrix operations for fusion diagnostics (replaces O(n²) pairwise loops with a single vectorized pass)
- Adaptive Granularity: System automatically adjusts clustering density based on domain characteristics
Result: Processes 18,260 embeddings in ~15-16 minutes with full fusion operations (vs. several hours without optimizations).
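The WAL and batch-commit optimizations can be sketched with stdlib `sqlite3`. The table and loop are illustrative; `SAVE_BATCH_SIZE` mirrors the setting of the same name:

```python
import sqlite3

SAVE_BATCH_SIZE = 10  # commit once per batch instead of once per insert

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # on disk, WAL allows concurrent readers
conn.execute("CREATE TABLE events (i INTEGER)")

pending = 0
for i in range(95):
    conn.execute("INSERT INTO events (i) VALUES (?)", (i,))
    pending += 1
    if pending >= SAVE_BATCH_SIZE:  # amortize fsync cost over the batch
        conn.commit()
        pending = 0
conn.commit()                       # flush the final partial batch

total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Committing per batch rather than per embedding is what keeps disk I/O out of the hot loop, so nearly all iterations stay in GPU/RAM.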
Xctopus has demonstrated a unique capability to detect and adapt to the "semantic fingerprint" of different domains without manual intervention:
The knowledge clusters into large thematic masses with dominant master nodes:
- Master Node Mass: 93 embeddings (×ばつ the average)
- Architecture: Few large hubs acting as semantic centers
- Pattern: Broad, general topics attract many related documents
The system generates a dense network of thousands of specialized islands:
- Total Knowledge Nodes: 3,901 (×ばつ more granular than the conversational domain)
- Master Node Mass: 28 embeddings (only ×ばつ the average)
- Architecture: Uniform distribution, no dominant hubs
- Pattern: Each technical concept forms its own compact, well-defined cluster
| Metric | Conversational Domain | Scientific Domain (arXiv) | Interpretation |
|---|---|---|---|
| Node Consolidation | 42.71% | 82.95% | Scientific domain shows extreme efficiency |
| Noise Reduction | Baseline | -79% | Better consolidation in structured domains |
| Semantic Purity (Variance) | 0.2909 | 0.2882 | <1% difference - purity preserved |
| Knowledge Nodes | 680 | 3,901 | Automatic granularity adjustment |
| Max Node Mass | 93 | 28 | Uniform distribution in scientific domain |
Interpretation: The stability of variance (difference < 1%) confirms that the system maintains knowledge purity regardless of whether the data is fluid conversational language or highly technical scientific content. The system automatically detects domain characteristics and adjusts its clustering density (×ばつ more nodes in scientific domain) while preserving semantic coherence.
Knowledge Galaxy Visualization
The "Knowledge Galaxy" visualization provides visual proof of the "Archipelago" structure in scientific domains - a perfectly organized network of specialized knowledge islands with uniform distribution and absence of dominant hubs.
Contributions are welcome! We appreciate your interest in helping improve Xctopus.
- 📖 Read our Contributing Guide to get started
- 📋 Review our Code of Conduct to understand our community standards
Whether you're fixing bugs, adding features, improving documentation, or suggesting ideas, all contributions are valued and recognized.
This project is licensed under the MIT License - see the LICENSE.md file for details.
- 🌐 Official website: xctopus.com
- 📚 Documentation: See the `docs/` folder for detailed guides
- 💬 Research Discussion: Approaches to Mitigate Catastrophic Forgetting in Modular Systems
This project is in an exploratory phase. The intention is to build an innovative architecture, inspired by adaptive and hierarchical systems, and open it to interested researchers when ready.
Current Focus: Capa Clustering (Layer 1) is complete and operational. Future layers will build upon this foundation.