Enterprise-Grade Multi-Tenant AI Memory System
A secure, high-performance backend for building AI applications with persistent memory, designed for large-scale enterprise deployments with strict multi-tenant isolation.
Quick Start · Features · Architecture · API Docs · Configuration
Building AI apps with memory is hard. Building enterprise-scale, multi-tenant AI apps with memory is harder.
Common challenges:
- ❌ Data isolation between users is complex and error-prone
- ❌ Vector databases grow indefinitely, storing irrelevant conversations
- ❌ RAG systems need memory, but how do you manage it at scale?
- ❌ Serial LLM round-trips for memory evaluation are too slow
- ❌ Pure vector search misses keyword-critical memories
MemChat solves all of these out of the box:
- ✅ Dual-layer isolation: JWT + database partition key enforcement
- ✅ Cognitive memory model: semantic / episodic / procedural / todo classification
- ✅ Single-call memory pipeline: 2 LLM calls → 1, ~50% latency reduction
- ✅ Hybrid retrieval: vector + keyword + time decay + importance scoring
- ✅ Session-aware working memory: natural multi-turn conversation context
- ✅ Embedding LRU cache: eliminates redundant inference for repeated text
- ✅ HNSW vector index: million-scale performance, enterprise-ready
Your AI assistant that truly remembers, across sessions, workspaces, and time.

    Day 1, Workspace "work":
    User: "I prefer TypeScript for backend, and I work at TikTok's infra team"
    AI: "Got it! I'll remember your stack preference and team context..."
    → Stores: semantic memory (preference), episodic memory (team info)

    Day 3, new session (memory auto-retrieved):
    User: "What language should I use for my new API?"
    AI: "Based on your TypeScript preference and TikTok infra context, consider..."
    → Long-term memory retrieved via hybrid search!

    Same session, continuing conversation:
    User: "Also, remind me about the architecture discussion we just had"
    AI: "Sure! Earlier you mentioned wanting to use microservices for..."
    → Working memory: no retrieval needed, context is in-session!

MemChat models memory after human cognitive science, with four distinct memory types:

    ┌──────────────────────────────────────────────────────────────┐
    │ Memory Taxonomy                                              │
    ├──────────────┬───────────────────────────────────────────────┤
    │ semantic     │ Stable facts: preferences, skills, background │
    │ episodic     │ Events: meetings, decisions, experiences      │
    │ procedural   │ Patterns: habits, workflows, behaviors        │
    │ todo         │ Tasks: reminders, deadlines, action items     │
    └──────────────┴───────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │ Memory Layer Architecture                                   │
    ├──────────────────────────┬──────────────────────────────────┤
    │ Working Memory (session) │ Last N turns, in-context         │
    │ Long-term Memory (Milvus)│ Persistent, RAG-retrieved        │
    └──────────────────────────┴──────────────────────────────────┘

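The taxonomy and layers above map naturally onto a few types. The following is a minimal sketch rather than the project's actual definitions: only the four category names come from the table, while `MemoryRecord`, its field names, and `isMemoryCategory` are hypothetical names chosen for illustration.

```typescript
// The four categories come from the taxonomy above; everything else is illustrative.
type MemoryCategory = "semantic" | "episodic" | "procedural" | "todo";

const MEMORY_CATEGORIES: readonly MemoryCategory[] = [
  "semantic",
  "episodic",
  "procedural",
  "todo",
];

// Hypothetical shape of one long-term memory row (field names are assumptions).
interface MemoryRecord {
  id: string;
  userId: string; // tenant key, taken from the verified JWT
  workspaceId: string;
  category: MemoryCategory;
  content: string;
  importanceScore: number; // assigned by the LLM
  createdAt: number; // epoch milliseconds
}

// Runtime guard, useful when the category arrives as untrusted LLM output.
function isMemoryCategory(value: unknown): value is MemoryCategory {
  return (
    typeof value === "string" &&
    (MEMORY_CATEGORIES as readonly string[]).indexOf(value) !== -1
  );
}

const exampleRecord: MemoryRecord = {
  id: "memory-uuid",
  userId: "alice",
  workspaceId: "work-project",
  category: "semantic",
  content: "User prefers TypeScript for backend",
  importanceScore: 8,
  createdAt: 0,
};
```
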

    Before (2 LLM calls, serial):
      embed(summary) → search → LLM: importance check → LLM: update decision
                                ↑ latency ~2s           ↑ latency ~1s

    After (1 LLM call):
      embed(userMessage) → search → LLM: extract facts + decide (all-in-one)
      ↑ cache hit likely            ↑ latency ~1s + batch multi-fact support

Key benefits:
- ~50% reduction in memory pipeline latency
- Batch fact extraction: extracts multiple facts per conversation
- Embedding cache hit: the RAG phase has already embedded the user message
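
A single-call pipeline depends on the model returning structured output that can be validated before anything is written. The sketch below assumes a JSON array of decisions; that contract, the `PipelineDecision` fields, and `parsePipelineOutput` are illustrative assumptions, and only the four action names (create, update, merge, skip) come from this README.

```typescript
// Assumed JSON contract for the single LLM call; the real prompt and schema may differ.
type PipelineAction = "create" | "update" | "merge" | "skip";

interface PipelineDecision {
  content: string;
  importance: number;
  action: PipelineAction;
  targetId?: string; // existing memory to update or merge into
}

const PIPELINE_ACTIONS: readonly string[] = ["create", "update", "merge", "skip"];

// Parses the model's reply defensively: malformed items are dropped instead of throwing.
function parsePipelineOutput(raw: string): PipelineDecision[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];

  const decisions: PipelineDecision[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) continue;
    const { content, importance, action, targetId } = item as Record<string, unknown>;
    if (typeof content !== "string" || content.trim() === "") continue;
    if (typeof importance !== "number" || !isFinite(importance)) continue;
    if (typeof action !== "string" || PIPELINE_ACTIONS.indexOf(action) === -1) continue;
    // update and merge only make sense when they point at an existing memory.
    const needsTarget = action === "update" || action === "merge";
    if (needsTarget && typeof targetId !== "string") continue;
    decisions.push({
      content: content.trim(),
      importance,
      action: action as PipelineAction,
      ...(typeof targetId === "string" ? { targetId } : {}),
    });
  }
  return decisions;
}
```

Dropping invalid items rather than failing the whole batch keeps one bad fact from blocking the others, which matters when several facts are extracted per conversation.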
Pure vector search is not enough. MemChat uses a multi-signal scoring pipeline:

    Retrieve topK × 3 candidates from Milvus
            ↓
    Multi-signal scoring:
    ┌─────────────────────────────────────────────────────┐
    │ vector_sim    × 0.50  (semantic similarity)         │
    │ keyword_score × 0.20  (BM25-inspired term overlap)  │
    │ time_decay    × 0.15  (Ebbinghaus forgetting curve) │
    │ importance    × 0.15  (LLM-assigned importance)     │
    └─────────────────────────────────────────────────────┘
            ↓
    Re-rank → return topK

Time decay formula (Ebbinghaus-inspired):

    score = exp(-ln(2) × age_days / half_life_days)

Memories fade naturally over time, just like human memory.
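
The weighted sum and the decay formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's scoring code: `vectorSim` and `importance` are assumed to be normalized to 0..1 already, and `keywordScore` is a simple term-overlap ratio standing in for the BM25-inspired score, whose exact form this README does not give.

```typescript
interface RetrievalCandidate {
  content: string;
  vectorSim: number; // assumed normalized to 0..1
  importance: number; // assumed normalized to 0..1 (e.g. importanceScore / 10)
  ageDays: number;
}

interface RetrievalWeights {
  vector: number;
  keyword: number;
  timeDecay: number;
  importance: number;
}

const DEFAULT_WEIGHTS: RetrievalWeights = {
  vector: 0.5,
  keyword: 0.2,
  timeDecay: 0.15,
  importance: 0.15,
};

// exp(-ln(2) * age / halfLife): 1.0 for a fresh memory, 0.5 at one half-life.
function timeDecay(ageDays: number, halfLifeDays: number): number {
  return Math.exp((-Math.LN2 * Math.max(0, ageDays)) / halfLifeDays);
}

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Simplified stand-in: fraction of distinct query terms that appear in the text.
function keywordScore(query: string, text: string): number {
  const queryTerms = tokenize(query).filter((t, i, all) => all.indexOf(t) === i);
  if (queryTerms.length === 0) return 0;
  const textTerms = tokenize(text);
  const matched = queryTerms.filter((t) => textTerms.indexOf(t) !== -1).length;
  return matched / queryTerms.length;
}

function hybridScore(
  query: string,
  c: RetrievalCandidate,
  halfLifeDays: number = 90,
  w: RetrievalWeights = DEFAULT_WEIGHTS,
): number {
  return (
    w.vector * c.vectorSim +
    w.keyword * keywordScore(query, c.content) +
    w.timeDecay * timeDecay(c.ageDays, halfLifeDays) +
    w.importance * c.importance
  );
}

// Scores every candidate, sorts by descending score, and keeps the best topK.
function rerank(query: string, candidates: RetrievalCandidate[], topK: number): RetrievalCandidate[] {
  return candidates
    .map((c) => ({ c, score: hybridScore(query, c) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((entry) => entry.c);
}
```

With equal vector similarity, a candidate that contains the query's keywords outranks one that does not, which is the case pure vector search tends to miss.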

    POST /api/chat { sessionId: "optional-client-id", message: "..." }

    LLM receives:
      [system: persona + long-term memories]
      [user: "turn 1"]           ← working memory
      [assistant: "turn 1"]      ← working memory
      [user: "turn 2"]           ← working memory
      [assistant: "turn 2"]      ← working memory
      [user: "current message"]  ← current turn

- Natural multi-turn: no need to repeat context in every message
- Session isolation: each sessionId maintains independent context
- Auto-expiry: sessions expire after a configurable TTL (default 2 hours)
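
The three properties above (bounded history, per-session isolation, TTL expiry) can be illustrated with an in-memory store. This is a sketch under assumptions: `WorkingMemoryStore`, `ChatTurn`, and `buildMessages` are hypothetical names, and the defaults mirror the documented 20 messages and 2-hour TTL rather than the actual service code.

```typescript
interface ChatTurn {
  role: "system" | "user" | "assistant";
  content: string;
}

class WorkingMemoryStore {
  private sessions = new Map<string, { turns: ChatTurn[]; lastActive: number }>();

  constructor(
    private maxMessages: number = 20,
    private ttlMs: number = 120 * 60 * 1000,
  ) {}

  // Returns a copy of the session history, or [] if the session is unknown or expired.
  getTurns(sessionId: string, now: number = Date.now()): ChatTurn[] {
    const entry = this.sessions.get(sessionId);
    if (!entry) return [];
    if (now - entry.lastActive > this.ttlMs) {
      this.sessions.delete(sessionId);
      return [];
    }
    return entry.turns.slice();
  }

  // Appends one turn and keeps only the most recent maxMessages turns.
  append(sessionId: string, turn: ChatTurn, now: number = Date.now()): void {
    const turns = this.getTurns(sessionId, now);
    turns.push(turn);
    this.sessions.set(sessionId, {
      turns: turns.slice(-this.maxMessages),
      lastActive: now,
    });
  }
}

// Assembles the message list in the order shown above: system, history, current turn.
function buildMessages(systemPrompt: string, history: ChatTurn[], currentMessage: string): ChatTurn[] {
  return [
    { role: "system", content: systemPrompt },
    ...history,
    { role: "user", content: currentMessage },
  ];
}
```

A process-local `Map` only works for a single server instance; the roadmap's Redis-backed working memory addresses multi-instance deployments.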

    chat() → embed(userMessage) → cache MISS → model inference → cache SET
        ↓
    processAndStoreMemory() → embed(userMessage) → cache HIT → instant return
        ↓
    Zero redundant inference!

- LRU eviction with configurable max size (default 2000 entries)
- Cache stats available via service for monitoring
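
An LRU cache keyed by the input text is enough to get the MISS-then-HIT behavior described above. The sketch below relies on the insertion order of a JavaScript `Map`; `EmbeddingLruCache` and `embedWithCache` are hypothetical names, and the embedder is kept synchronous for brevity even though real model inference would be asynchronous.

```typescript
class EmbeddingLruCache {
  private entries = new Map<string, number[]>();
  hits = 0;
  misses = 0;

  constructor(private maxSize: number = 2000) {}

  get(text: string): number[] | undefined {
    const vector = this.entries.get(text);
    if (vector === undefined) {
      this.misses++;
      return undefined;
    }
    // Re-insert so this key becomes the most recently used (Map keeps insertion order).
    this.entries.delete(text);
    this.entries.set(text, vector);
    this.hits++;
    return vector;
  }

  set(text: string, vector: number[]): void {
    if (this.maxSize <= 0) return;
    if (this.entries.has(text)) {
      this.entries.delete(text);
    } else if (this.entries.size >= this.maxSize) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(text, vector);
  }

  stats(): { size: number; hits: number; misses: number } {
    return { size: this.entries.size, hits: this.hits, misses: this.misses };
  }
}

// Wraps any embedding function so repeated text skips inference.
function embedWithCache(
  cache: EmbeddingLruCache,
  text: string,
  embed: (input: string) => number[],
): number[] {
  const cached = cache.get(text);
  if (cached !== undefined) return cached;
  const vector = embed(text);
  cache.set(text, vector);
  return vector;
}
```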

    Index:   IVF_FLAT          →  HNSW
             good for <100K       designed for millions
             exact search         approximate, high recall
             nprobe tuning        ef tuning (simpler)

    Params:  M=16, efConstruction=200   (build quality)
    Search:  ef=max(64, topK×4)         (query quality)

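
The index and search parameters above reduce to one constant and one small function. This is a sketch: the parameter names follow Milvus's HNSW naming (`M`, `efConstruction`, `ef`), but the object shape, the `L2` metric (inferred from the similarity threshold comment in the configuration), and the helper name `searchEf` are assumptions, since this README does not show how the values are passed to the SDK.

```typescript
// Build-time parameters; the object shape is an assumption, the values come from the text above.
const HNSW_INDEX_PARAMS = {
  index_type: "HNSW",
  metric_type: "L2", // assumed from the "L2 threshold" note in the configuration
  params: { M: 16, efConstruction: 200 },
};

// Query-time breadth: never below 64, and four times topK for larger requests.
function searchEf(topK: number): number {
  if (!Number.isInteger(topK) || topK < 1) {
    throw new Error("topK must be a positive integer");
  }
  return Math.max(64, topK * 4);
}
```

Because the result is never smaller than `topK`, the search always explores at least as many neighbors as it needs to return.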

    ┌──────────────────────────────────────────────────────┐
    │ Layer 1: JWT Authentication Middleware               │
    │ - Extracts user_id from signed token                 │
    │ - Rejects all unauthenticated requests               │
    └──────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌──────────────────────────────────────────────────────┐
    │ Layer 2: Milvus Partition Key Enforcement            │
    │ - user_id is Partition Key (physical data isolation) │
    │ - All queries force-filtered by user_id              │
    │ - Impossible to access another user's data           │
    └──────────────────────────────────────────────────────┘

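Security guarantee: even if a check in the application layer is missed, every Milvus query is still filtered by the user_id taken from the verified token, so the data layer is designed to block cross-tenant access on its own.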
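
The second layer comes down to never issuing a query without a `user_id` clause. The sketch below builds such a filter as a boolean expression string; `buildTenantFilter`, the `workspace_id` field name, and the allowlist pattern are assumptions made for illustration, and rejecting unexpected characters is used here instead of escaping so the example does not depend on quoting rules.

```typescript
// Conservative allowlist for identifiers that end up inside a filter expression.
const SAFE_ID = /^[A-Za-z0-9_.:-]{1,64}$/;

function assertSafeId(label: string, value: string): string {
  if (!SAFE_ID.test(value)) {
    throw new Error(`Invalid ${label}`);
  }
  return value;
}

// userIdFromJwt must come from the verified token, never from the request body.
function buildTenantFilter(userIdFromJwt: string, workspaceId?: string): string {
  const clauses = [`user_id == "${assertSafeId("user_id", userIdFromJwt)}"`];
  if (workspaceId !== undefined) {
    // workspace_id is a hypothetical field name for the workspace column.
    clauses.push(`workspace_id == "${assertSafeId("workspace_id", workspaceId)}"`);
  }
  return clauses.join(" && ");
}
```

Routing every search, update, and delete through one helper like this makes it hard to forget the tenant clause, and a guessed `workspaceId` still cannot widen the query beyond the caller's own `user_id`.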

    Level 0: Raw conversation chunks
        ↓ (greedy vector clustering + LLM summarization)
    Level 1: Topic summaries
        ↓ (same process)
    Level 2: High-level abstractions

    Trigger: when memories reach 50% of maxMemoriesPerUser
    Cleanup: when memories reach 90% of maxMemoriesPerUser
        → delete expired → compress → retention scoring → LLM evaluation

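
The two thresholds above amount to a small decision function. The following sketch uses the documented 50% and 90% triggers and the 70% cleanup target from the configuration; `chooseMaintenance`, `memoriesToRemove`, and the `MaintenancePolicy` shape are hypothetical names, not the project's API.

```typescript
type MaintenanceAction = "none" | "compress" | "cleanup";

interface MaintenancePolicy {
  maxMemoriesPerUser: number;
  compressAt: number; // fraction of capacity that triggers compression
  cleanupAt: number; // fraction of capacity that triggers cleanup
  cleanupTarget: number; // fraction of capacity to shrink to after cleanup
}

const DEFAULT_POLICY: MaintenancePolicy = {
  maxMemoriesPerUser: 1000,
  compressAt: 0.5,
  cleanupAt: 0.9,
  cleanupTarget: 0.7,
};

// Picks the strongest action whose threshold the current count has reached.
function chooseMaintenance(
  count: number,
  policy: MaintenancePolicy = DEFAULT_POLICY,
): MaintenanceAction {
  const usage = count / policy.maxMemoriesPerUser;
  if (usage >= policy.cleanupAt) return "cleanup";
  if (usage >= policy.compressAt) return "compress";
  return "none";
}

// How many memories cleanup has to remove to get back down to the target.
function memoriesToRemove(
  count: number,
  policy: MaintenancePolicy = DEFAULT_POLICY,
): number {
  const target = Math.round(policy.maxMemoriesPerUser * policy.cleanupTarget);
  return Math.max(0, count - target);
}
```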

    ┌──────────────────────────────────────────────────────────────────────┐
    │                          Client / Frontend                           │
    └──────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                          Express.js Server                           │
    │  ┌─────────────────┐  ┌──────────────────┐  ┌────────────────────┐   │
    │  │ Auth Middleware │  │   Controllers    │  │       Routes       │   │
    │  │  (JWT Verify)   │  │ (Business Logic) │  │    (Endpoints)     │   │
    │  └─────────────────┘  └──────────────────┘  └────────────────────┘   │
    └──────────────────────────────────────────────────────────────────────┘
                                     │
             ┌───────────────────────┼───────────────────────┐
             ▼                       ▼                       ▼
    ┌─────────────────┐  ┌───────────────────────┐  ┌──────────────────┐
    │   LLM Service   │  │    Memory Service     │  │Embedding Service │
    │ (OpenAI compat) │  │ ┌───────────────────┐ │  │  (Local Model)   │
    │                 │  │ │ Pipeline (1-call) │ │  │   + LRU Cache    │
    │   chat() with   │  │ │ Hybrid Retrieval  │ │  └──────────────────┘
    │ working memory  │  │ │ Time Decay Score  │ │
    └─────────────────┘  │ └───────────────────┘ │
                         └───────────────────────┘
             ┌───────────────────────┼───────────────────────┐
             ▼                       ▼                       ▼
    ┌─────────────────┐  ┌───────────────────────┐  ┌──────────────────┐
    │ Working Memory  │  │     Milvus (HNSW)     │  │   Compression    │
    │     Service     │  │     user_memories     │  │     Service      │
    │ (session store) │  │  (Partition by user)  │  │  (cluster+LLM)   │
    └─────────────────┘  └───────────────────────┘  └──────────────────┘

- Node.js 18+
- Docker & Docker Compose
- OpenAI API key (or compatible endpoint)

    # Clone the repository
    git clone https://github.com/your-username/memchat.git
    cd memchat

    # Start Milvus (vector database)
    docker-compose up -d

    # Install dependencies
    npm install

    # Configure environment
    cp .env.example .env
    # Edit .env with your API keys

    # Start development server
    npm run dev

Visit http://localhost:3000 for the interactive testing UI.

    # Server
    PORT=3000
    NODE_ENV=development

    # JWT
    JWT_SECRET=your-super-secret-key

    # Milvus
    MILVUS_ADDRESS=localhost:19530

    # LLM (OpenAI compatible)
    LLM_API_KEY=your-api-key
    LLM_BASE_URL=https://api.openai.com/v1
    LLM_MODEL=gpt-4

    # Max messages kept per session (user+assistant each count as 1)
    WORKING_MEMORY_MAX_MESSAGES=20
    # Session expiry in minutes (default: 2 hours)
    WORKING_MEMORY_TTL_MINUTES=120
    # Disable if not needed
    WORKING_MEMORY_ENABLED=true

    # Max cached embeddings (LRU eviction)
    EMBEDDING_CACHE_MAX_SIZE=2000
    # Disable for debugging
    EMBEDDING_CACHE_ENABLED=true


    # Weights must sum to ~1.0
    RETRIEVAL_VECTOR_WEIGHT=0.50       # Semantic similarity
    RETRIEVAL_KEYWORD_WEIGHT=0.20      # Keyword overlap (BM25-inspired)
    RETRIEVAL_TIME_DECAY_WEIGHT=0.15   # Recency (Ebbinghaus curve)
    RETRIEVAL_IMPORTANCE_WEIGHT=0.15   # LLM-assigned importance
    # Half-life for time decay (days): the time-decay term is ~0.5 at this age
    RETRIEVAL_HALF_LIFE_DAYS=90
    # Candidate pool multiplier (topK × this = candidates fetched for reranking)
    RETRIEVAL_CANDIDATE_MULTIPLIER=3

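
Since the retrieval weights are expected to sum to roughly 1.0, it can help to check that when the configuration is loaded. The sketch below reads the documented variable names with the documented defaults; `loadRetrievalSettings`, `numberFromEnv`, and the 0.01 tolerance are assumptions for illustration, not the contents of `src/config/index.ts`.

```typescript
type EnvLike = Record<string, string | undefined>;

// Reads a numeric variable, falling back to the default when it is unset or blank.
function numberFromEnv(env: EnvLike, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!isFinite(value)) {
    throw new Error(`${key} must be a number, got "${raw}"`);
  }
  return value;
}

interface RetrievalSettings {
  vectorWeight: number;
  keywordWeight: number;
  timeDecayWeight: number;
  importanceWeight: number;
  halfLifeDays: number;
  candidateMultiplier: number;
}

function loadRetrievalSettings(env: EnvLike): RetrievalSettings {
  const settings: RetrievalSettings = {
    vectorWeight: numberFromEnv(env, "RETRIEVAL_VECTOR_WEIGHT", 0.5),
    keywordWeight: numberFromEnv(env, "RETRIEVAL_KEYWORD_WEIGHT", 0.2),
    timeDecayWeight: numberFromEnv(env, "RETRIEVAL_TIME_DECAY_WEIGHT", 0.15),
    importanceWeight: numberFromEnv(env, "RETRIEVAL_IMPORTANCE_WEIGHT", 0.15),
    halfLifeDays: numberFromEnv(env, "RETRIEVAL_HALF_LIFE_DAYS", 90),
    candidateMultiplier: numberFromEnv(env, "RETRIEVAL_CANDIDATE_MULTIPLIER", 3),
  };
  const sum =
    settings.vectorWeight +
    settings.keywordWeight +
    settings.timeDecayWeight +
    settings.importanceWeight;
  // The 0.01 tolerance is an arbitrary choice for this sketch.
  if (Math.abs(sum - 1) > 0.01) {
    throw new Error(`Retrieval weights must sum to ~1.0, got ${sum}`);
  }
  return settings;
}
```

In a Node.js process this would typically be called once at startup as `loadRetrievalSettings(process.env)`, so a misconfigured weight fails fast instead of silently skewing rankings.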

    MAX_MEMORIES_PER_USER=1000
    MEMORY_CLEANUP_THRESHOLD=0.9       # Trigger cleanup at 90% capacity
    MEMORY_CLEANUP_TARGET=0.7          # Reduce to 70% after cleanup
    MEMORY_SIMILARITY_TOP_K=8          # Candidates for write dedup
    MEMORY_SIMILARITY_THRESHOLD=0.7    # L2 threshold for similarity

POST /api/auth/register - Register User

    curl -X POST http://localhost:3000/api/auth/register \
      -H "Content-Type: application/json" \
      -d '{"username": "alice"}'

Response:

    {
      "userId": "alice",
      "username": "alice",
      "token": "eyJhbGciOiJIUzI1NiIs..."
    }

POST /api/auth/login - Login

    curl -X POST http://localhost:3000/api/auth/login \
      -H "Content-Type: application/json" \
      -d '{"username": "alice"}'

POST /api/chat - Send Message

    curl -X POST http://localhost:3000/api/chat \
      -H "Authorization: Bearer YOUR_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "workspaceId": "work-project",
        "message": "I prefer TypeScript for backend development",
        "sessionId": "optional-session-uuid"
      }'

Response:

    {
      "response": "I'll remember that you prefer TypeScript...",
      "memoriesUsed": 2,
      "memoriesStored": 1,
      "sessionId": "alice:work-project"
    }

Upgraded Flow:
- Resolves or creates session (working memory)
- Retrieves long-term memories via hybrid search (vector + keyword + time decay)
- Calls LLM with session history + long-term memories + current message
- Updates working memory (sync)
- Async pipeline: single LLM call extracts facts + decides create/update/merge/skip
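
The five steps above can be sketched as one handler with its collaborators injected. This is an illustration, not the actual controller: the `ChatDeps` interface and `handleChat` are hypothetical, history is simplified to plain strings, and the default session id of `userId:workspaceId` is an assumption inferred from the example response.

```typescript
// Hypothetical collaborator interface; the real services have their own signatures.
interface ChatDeps {
  getHistory(sessionId: string): string[];
  retrieveMemories(userId: string, workspaceId: string, message: string): Promise<string[]>;
  callLlm(memories: string[], history: string[], message: string): Promise<string>;
  appendTurns(sessionId: string, userMessage: string, reply: string): void;
  processAndStoreMemory(userId: string, workspaceId: string, message: string, reply: string): Promise<void>;
}

interface ChatResult {
  response: string;
  memoriesUsed: number;
  sessionId: string;
}

async function handleChat(
  deps: ChatDeps,
  userId: string, // from the verified JWT
  workspaceId: string,
  message: string,
  sessionId?: string,
): Promise<ChatResult> {
  // 1. Resolve or create the session (default id format is an assumption).
  const resolvedSession = sessionId ?? `${userId}:${workspaceId}`;
  const history = deps.getHistory(resolvedSession);
  // 2. Retrieve long-term memories via hybrid search.
  const memories = await deps.retrieveMemories(userId, workspaceId, message);
  // 3. Call the LLM with history, memories, and the current message.
  const reply = await deps.callLlm(memories, history, message);
  // 4. Update working memory synchronously.
  deps.appendTurns(resolvedSession, message, reply);
  // 5. Start the memory pipeline without awaiting it, so the response is not delayed.
  deps
    .processAndStoreMemory(userId, workspaceId, message, reply)
    .catch((err) => console.error("memory pipeline failed", err));
  return { response: reply, memoriesUsed: memories.length, sessionId: resolvedSession };
}
```

Catching the pipeline's rejection keeps a failed background write from surfacing as an unhandled promise rejection after the response has been sent.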
GET /api/memories?workspaceId=xxx - List Memories

    curl "http://localhost:3000/api/memories?workspaceId=work-project" \
      -H "Authorization: Bearer YOUR_TOKEN"

Response:

    {
      "count": 3,
      "memories": [
        {
          "id": "memory-uuid",
          "content": "User prefers TypeScript for backend",
          "importanceScore": 8
        }
      ]
    }

PUT /api/memories/:id - Update Memory

    curl -X PUT http://localhost:3000/api/memories/memory-uuid \
      -H "Authorization: Bearer YOUR_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"content": "Updated content"}'

DELETE /api/memories/:id - Delete Memory

    curl -X DELETE http://localhost:3000/api/memories/memory-uuid \
      -H "Authorization: Bearer YOUR_TOKEN"

| Component | Technology | Notes |
|---|---|---|
| Runtime | Node.js + TypeScript | Type-safe throughout |
| Framework | Express.js | REST API |
| Vector DB | Milvus 2.4 | HNSW index, Partition Key isolation |
| Embeddings | @xenova/transformers (local) | MiniLM-L12-v2, 384-dim, with LRU cache |
| LLM | OpenAI API (or compatible) | Single-call pipeline |
| Auth | JWT | Stateless, multi-tenant |
| Container | Docker Compose | Milvus + etcd + MinIO |

    src/
    ├── config/
    │   └── index.ts                    # All config with env var overrides
    ├── middlewares/
    │   └── auth.middleware.ts          # JWT verification
    ├── services/
    │   ├── memory.service.ts           # Memory orchestration (pipeline + hybrid retrieval)
    │   ├── working-memory.service.ts   # Session-level short-term memory
    │   ├── milvus.service.ts           # Vector DB (HNSW, Partition Key isolation)
    │   ├── embedding.service.ts        # Embeddings + LRU cache
    │   ├── llm.service.ts              # LLM (chat with session history, pipeline)
    │   ├── chunking.service.ts         # Text chunking
    │   ├── compression.service.ts      # Cluster-based memory compression
    │   ├── memory-cleanup.service.ts   # Cleanup orchestration
    │   ├── cleanup.service.ts          # Periodic expired memory cleanup
    │   └── persona.service.ts          # AI persona management
    ├── controllers/
    │   ├── chat.controller.ts          # Chat endpoint with working memory
    │   ├── memory.controller.ts        # Memory CRUD
    │   ├── auth.controller.ts
    │   └── persona.controller.ts
    ├── routes/
    ├── types/
    │   └── index.ts                    # Full type definitions incl. MemoryCategory
    └── utils/

- Never trust client input: every `user_id` comes from the JWT, never from the request body
- Defense in depth: auth middleware + Milvus partition key (two independent layers)
- No plaintext secrets: environment variables only
- Input validation: TypeScript type checking on all endpoints
- Tenant isolation: even if one tenant guesses another's `workspaceId`, the `user_id` partition key blocks all cross-tenant queries
| Scenario | Before | After |
|---|---|---|
| Memory pipeline LLM calls | 2 (serial) | 1 |
| Embedding for same message | 2× inference | 1× (cache hit) |
| Search candidates for retrieval | topK exact | topK × 3 + rerank |
| Index type | IVF_FLAT | HNSW (million-scale) |
| Multi-turn context | Not supported | Working memory (last N turns) |
- Streaming responses (SSE)
- Knowledge graph layer (entity + relation extraction)
- Pluggable embedding model (support OpenAI, Cohere, etc.)
- Redis-backed working memory (for multi-instance deployments)
- Multi-modal memory (images, files)
- Admin dashboard with memory analytics
- Rate limiting per tenant
- Prometheus metrics endpoint
Contributions are welcome! Please read our Contributing Guide for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Milvus: High-performance vector database with HNSW support
- Transformers.js: Local multilingual embeddings
- OpenAI: LLM capabilities

⭐ If this project helped you, please give it a star! ⭐