sskingss/MemChat

🧠 MemChat

Enterprise-Grade Multi-Tenant AI Memory System

A secure, high-performance backend for building AI applications with persistent memory, designed for large-scale enterprise deployments with strict multi-tenant isolation.

TypeScript Express Milvus License

Quick Start · Features · Architecture · API Docs · Configuration


🎯 Why MemChat?

Building AI apps with memory is hard. Building enterprise-scale, multi-tenant AI apps with memory is harder.

Common challenges:

  • ❌ Data isolation between users is complex and error-prone
  • ❌ Vector databases grow indefinitely, storing irrelevant conversations
  • ❌ RAG systems need memory, but how do you manage it at scale?
  • ❌ Serial LLM round-trips for memory evaluation add too much latency
  • ❌ Pure vector search misses keyword-critical memories

MemChat solves all of these out of the box:

  • ✅ Dual-layer isolation: JWT + database partition key enforcement
  • ✅ Cognitive memory model: semantic / episodic / procedural / todo classification
  • ✅ Single-call memory pipeline: 2 LLM calls → 1, ~50% latency reduction
  • ✅ Hybrid retrieval: vector + keyword + time decay + importance scoring
  • ✅ Session-aware working memory: natural multi-turn conversation context
  • ✅ Embedding LRU cache: eliminates redundant inference for repeated text
  • ✅ HNSW vector index: million-scale performance, enterprise-ready

🎬 Demo

MemChat Demo

Your AI assistant that truly remembers, across sessions, workspaces, and time

✨ Persistent Memory Across Sessions

📍 Day 1, Workspace "work":
User: "I prefer TypeScript for backend, and I work at TikTok's infra team"
AI: "Got it! I'll remember your stack preference and team context..."
 → Stores: semantic memory (preference), episodic memory (team info)
📍 Day 3, new session (memory auto-retrieved):
User: "What language should I use for my new API?"
AI: "Based on your TypeScript preference and TikTok infra context, consider..."
 ↑ Long-term memory retrieved via hybrid search!
📍 Same session, continuing conversation:
User: "Also, remind me about the architecture discussion we just had"
AI: "Sure! Earlier you mentioned wanting to use microservices for..."
 ↑ Working memory: no retrieval needed, context is in-session!

🚀 Features

🧠 Cognitive Memory Architecture

MemChat models memory after human cognitive science, with four distinct memory types:

┌──────────────────────────────────────────────────────────────┐
│                       Memory Taxonomy                        │
├──────────────┬───────────────────────────────────────────────┤
│ semantic     │ Stable facts: preferences, skills, background │
│ episodic     │ Events: meetings, decisions, experiences      │
│ procedural   │ Patterns: habits, workflows, behaviors        │
│ todo         │ Tasks: reminders, deadlines, action items     │
└──────────────┴───────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│                  Memory Layer Architecture                   │
├──────────────────────────┬───────────────────────────────────┤
│ Working Memory (session) │ Last N turns, in-context          │
│ Long-term Memory (Milvus)│ Persistent, RAG-retrieved         │
└──────────────────────────┴───────────────────────────────────┘

⚡ Single-Call Memory Pipeline

Before (2 LLM calls, serial):

embed(summary) → search → LLM: importance check → LLM: update decision
                          ↑ latency ~2s           ↑ latency ~1s

After (1 LLM call):

embed(userMessage) → search → LLM: extract facts + decide (all-in-one)
↑ cache hit likely            ↑ latency ~1s + batch multi-fact support

Key benefits:

  • ~50% reduction in memory pipeline latency
  • Batch fact extraction: extracts multiple facts per conversation
  • Embedding cache hit: the RAG phase has already embedded the user message

🔍 Hybrid Retrieval with Reranking

Pure vector search is not enough. MemChat uses a multi-signal scoring pipeline:

Retrieve topK × 3 candidates from Milvus
        ↓
  Multi-signal scoring:
  ┌─────────────────────────────────────────────────────┐
  │ vector_sim    × 0.50  (semantic similarity)         │
  │ keyword_score × 0.20  (BM25-inspired term overlap)  │
  │ time_decay    × 0.15  (Ebbinghaus forgetting curve) │
  │ importance    × 0.15  (LLM-assigned importance)     │
  └─────────────────────────────────────────────────────┘
        ↓
  Re-rank → return topK

Time decay formula (Ebbinghaus-inspired):

score = exp(-ln(2) × age_days / half_life_days)

Memories fade naturally over time, just like human memory.
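
The scoring pipeline above can be sketched in a few lines. This is a minimal illustration, not MemChat's actual implementation: the `Candidate` shape, the function names, the simple term-overlap keyword score, and the 1 to 10 importance scale are all assumptions; only the weights, the half-life, and the decay formula come from this README.

```typescript
// Hypothetical types and names for illustration; not MemChat's actual code.
interface Candidate {
  id: string;
  content: string;
  vectorSim: number;  // assumed to be normalized to [0, 1] already
  importance: number; // LLM-assigned, assumed to be on a 1-10 scale
  ageDays: number;
}

// Weights and half-life as documented in this README.
const WEIGHTS = { vector: 0.5, keyword: 0.2, timeDecay: 0.15, importance: 0.15 };
const HALF_LIFE_DAYS = 90;

// exp(-ln(2) * age / halfLife): 1.0 at age 0, 0.5 at exactly one half-life.
function timeDecay(ageDays: number, halfLifeDays: number = HALF_LIFE_DAYS): number {
  return Math.exp((-Math.LN2 * ageDays) / halfLifeDays);
}

// Simplified stand-in for the keyword signal: the fraction of distinct
// query terms that also appear in the memory content.
function keywordScore(query: string, content: string): number {
  const tokenize = (s: string): string[] => s.toLowerCase().split(/\W+/).filter(Boolean);
  const queryTerms = tokenize(query).filter((t, i, all) => all.indexOf(t) === i);
  if (queryTerms.length === 0) return 0;
  const contentTerms = tokenize(content);
  const hits = queryTerms.filter((t) => contentTerms.indexOf(t) !== -1).length;
  return hits / queryTerms.length;
}

// Score every candidate with the weighted sum, then keep the best topK.
function rerank(query: string, candidates: Candidate[], topK: number): Candidate[] {
  const score = (c: Candidate): number =>
    WEIGHTS.vector * c.vectorSim +
    WEIGHTS.keyword * keywordScore(query, c.content) +
    WEIGHTS.timeDecay * timeDecay(c.ageDays) +
    WEIGHTS.importance * (c.importance / 10);
  return candidates.slice().sort((a, b) => score(b) - score(a)).slice(0, topK);
}
```

With a 90-day half-life, two memories that are otherwise identical are ordered by age, so the more recent one is returned first.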

💬 Session-Aware Working Memory

POST /api/chat { sessionId: "optional-client-id", message: "..." }
LLM receives:
 [system: persona + long-term memories]
 [user: "turn 1"] ← working memory
 [assistant: "turn 1"] ← working memory
 [user: "turn 2"] ← working memory
 [assistant: "turn 2"] ← working memory
 [user: "current message"] ← current turn
  • Natural multi-turn: no need to repeat context in every message
  • Session isolation: each sessionId maintains independent context
  • Auto-expiry: sessions expire after a configurable TTL (default 2 hours)
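
A session store with these three properties could look roughly like the sketch below. The class and method names are hypothetical, and the in-process object store is an assumption made for brevity; the defaults mirror WORKING_MEMORY_MAX_MESSAGES=20 and WORKING_MEMORY_TTL_MINUTES=120 from the configuration section.

```typescript
// Hypothetical in-memory session store; names are illustrative only.
type Role = "user" | "assistant";
interface Message { role: Role; content: string; }
interface Session { messages: Message[]; lastActive: number; }

class WorkingMemory {
  private sessions: Record<string, Session | undefined> = {};
  private maxMessages: number;
  private ttlMs: number;

  constructor(maxMessages: number = 20, ttlMs: number = 120 * 60 * 1000) {
    this.maxMessages = maxMessages;
    this.ttlMs = ttlMs;
  }

  // Add one message and trim the session to the most recent maxMessages.
  append(sessionId: string, message: Message, now: number = Date.now()): void {
    const session = this.live(sessionId, now) || { messages: [], lastActive: now };
    session.messages.push(message);
    session.messages = session.messages.slice(-this.maxMessages);
    session.lastActive = now;
    this.sessions[sessionId] = session;
  }

  // Return a copy of the history; empty for unknown or expired sessions.
  history(sessionId: string, now: number = Date.now()): Message[] {
    const session = this.live(sessionId, now);
    return session ? session.messages.slice() : [];
  }

  // Look up a session, dropping it first if its TTL has elapsed.
  private live(sessionId: string, now: number): Session | undefined {
    const session = this.sessions[sessionId];
    if (!session) return undefined;
    if (now - session.lastActive > this.ttlMs) {
      delete this.sessions[sessionId];
      return undefined;
    }
    return session;
  }
}
```

Passing `now` explicitly keeps the expiry logic easy to test; a production store would more likely sit behind a shared cache for multi-instance deployments, as the roadmap suggests.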

🚀 Embedding LRU Cache

chat() → embed(userMessage) → cache MISS → model inference → cache SET
   ↓
processAndStoreMemory() → embed(userMessage) → cache HIT → instant return
   ↓
Zero redundant inference!
  • LRU eviction with configurable max size (default 2000 entries)
  • Cache stats available via service for monitoring
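
For readers unfamiliar with the technique, here is a minimal LRU cache sketch built on the insertion order of a JavaScript `Map`. The class name and the `hits`/`misses` counters are assumptions for illustration and are not taken from MemChat's embedding service; the default size mirrors EMBEDDING_CACHE_MAX_SIZE=2000.

```typescript
// Hypothetical LRU cache; a Map iterates keys in insertion order, so the
// first key is always the least recently used entry.
class LruCache<V> {
  private entries = new Map<string, V>();
  private maxSize: number;
  hits = 0;
  misses = 0;

  constructor(maxSize: number = 2000) {
    this.maxSize = maxSize;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) {
      this.misses++;
      return undefined;
    }
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    this.hits++;
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxSize) {
      // Evict the least recently used entry (the oldest key in the Map).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

In the flow above, the chat path would call `set` after the first inference, and the memory pipeline would then get the same vector back from `get` without running the model again.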

🗄️ HNSW Vector Index

Index:  IVF_FLAT            →  HNSW
        ↑ good for <100K       ↑ designed for millions
        ↑ exact search         ↑ approximate, high recall
        ↑ nprobe tuning        ↑ ef tuning (simpler)
Params: M=16, efConstruction=200  (build quality)
Search: ef=max(64, topK × 4)      (query quality)
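
The parameters above can be written down as plain values. The sketch below is illustrative only: the object shape and the `searchEf` helper are hypothetical, and the L2 metric is an assumption inferred from the "L2 threshold" comment in the configuration section, not a confirmed setting.

```typescript
// Build-time HNSW parameters as listed above. The surrounding object shape
// is illustrative and not tied to a specific SDK call.
const hnswIndexParams = {
  index_type: "HNSW",
  metric_type: "L2", // assumption, inferred from the configuration comments
  params: { M: 16, efConstruction: 200 },
};

// Query-time ef: at least 64, and four times topK once topK exceeds 16.
function searchEf(topK: number): number {
  return Math.max(64, topK * 4);
}
```

Because topK × 4 is never smaller than topK, `ef` always stays at or above the number of results requested, while small queries still search a pool of at least 64 entries.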

🔐 Enterprise-Grade Multi-Tenancy

┌────────────────────────────────────────────────────────┐
│ Layer 1: JWT Authentication Middleware                 │
│   - Extracts user_id from signed token                 │
│   - Rejects all unauthenticated requests               │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ Layer 2: Milvus Partition Key Enforcement              │
│   - user_id is Partition Key (physical data isolation) │
│   - All queries force-filtered by user_id              │
│   - Impossible to access another user's data           │
└────────────────────────────────────────────────────────┘

Security guarantee: even if a request slips past the auth middleware, the data layer still filters every query by user_id, so cross-tenant reads are blocked there as well.
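
One way to picture the "force-filtered" rule is a helper that every data-layer query must go through. This is a sketch under assumptions: the `scopedFilter` name and the `workspace_id` field are hypothetical, and using `JSON.stringify` to quote values is a simplification that should be checked against Milvus's string-literal escaping rules before real use.

```typescript
// Hypothetical helper: builds a boolean filter expression that always
// pins the query to one user (and workspace), whatever else is requested.
function scopedFilter(userId: string, workspaceId: string, extra?: string): string {
  // JSON.stringify yields a double-quoted literal with quotes and backslashes escaped.
  const quote = (value: string): string => JSON.stringify(value);
  const base = `user_id == ${quote(userId)} && workspace_id == ${quote(workspaceId)}`;
  // Caller-supplied conditions can only narrow the result, never widen it.
  return extra ? `(${base}) && (${extra})` : base;
}
```

The `userId` argument is meant to come from the verified JWT, never from the request body, which matches the first rule in the security practices listed later in this README.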

🗜️ Hierarchical Memory Compression

Level 0: Raw conversation chunks
 ↓ (greedy vector clustering + LLM summarization)
Level 1: Topic summaries
 ↓ (same process)
Level 2: High-level abstractions
Trigger: when memories reach 50% of maxMemoriesPerUser
Cleanup: when memories reach 90% of maxMemoriesPerUser
 → delete expired → compress → retention scoring → LLM evaluation
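
The "greedy vector clustering" step can be illustrated as follows. This is one simple variant and not necessarily the one MemChat uses: the function names, the use of cosine similarity, and the 0.8 threshold are all assumptions. Each resulting cluster would then be handed to the LLM for summarization into the next level.

```typescript
// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical greedy clustering: each vector joins the first cluster whose
// seed (its first member) is similar enough, otherwise it starts a new one.
// Returns clusters as lists of indices into the input array.
function greedyCluster(vectors: number[][], threshold: number = 0.8): number[][] {
  const clusters: number[][] = [];
  for (let i = 0; i < vectors.length; i++) {
    const home = clusters.find(
      (members) => cosine(vectors[members[0]], vectors[i]) >= threshold
    );
    if (home) {
      home.push(i);
    } else {
      clusters.push([i]);
    }
  }
  return clusters;
}
```

A single pass like this is order-dependent and does not guarantee optimal groupings, which is usually acceptable when the clusters only serve as input to summarization.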

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                          Client / Frontend                           │
└──────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          Express.js Server                           │
│  ┌─────────────────┐  ┌──────────────────┐  ┌────────────────────┐   │
│  │ Auth Middleware │  │   Controllers    │  │       Routes       │   │
│  │  (JWT Verify)   │  │ (Business Logic) │  │    (Endpoints)     │   │
│  └─────────────────┘  └──────────────────┘  └────────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         ▼                       ▼                       ▼
┌─────────────────┐  ┌───────────────────────┐  ┌──────────────────┐
│   LLM Service   │  │    Memory Service     │  │Embedding Service │
│ (OpenAI compat) │  │ ┌───────────────────┐ │  │  (Local Model)   │
│                 │  │ │ Pipeline (1-call) │ │  │  + LRU Cache     │
│ chat() with     │  │ │ Hybrid Retrieval  │ │  └──────────────────┘
│ working memory  │  │ │ Time Decay Score  │ │
└─────────────────┘  │ └───────────────────┘ │
                     └───────────────────────┘
         ┌───────────────────────┼───────────────────────┐
         ▼                       ▼                       ▼
┌─────────────────┐  ┌───────────────────────┐  ┌──────────────────┐
│ Working Memory  │  │     Milvus (HNSW)     │  │   Compression    │
│ Service         │  │     user_memories     │  │     Service      │
│ (session store) │  │  (Partition by user)  │  │  (cluster+LLM)   │
└─────────────────┘  └───────────────────────┘  └──────────────────┘

📦 Quick Start

Prerequisites

  • Node.js 18+
  • Docker & Docker Compose
  • OpenAI API key (or compatible endpoint)

Installation

# Clone the repository
git clone https://github.com/your-username/memchat.git
cd memchat
# Start Milvus (vector database)
docker-compose up -d
# Install dependencies
npm install
# Configure environment
cp .env.example .env
# Edit .env with your API keys
# Start development server
npm run dev

Visit http://localhost:3000 for the interactive testing UI.


⚙️ Configuration

Core Environment Variables

# Server
PORT=3000
NODE_ENV=development
# JWT
JWT_SECRET=your-super-secret-key
# Milvus
MILVUS_ADDRESS=localhost:19530
# LLM (OpenAI compatible)
LLM_API_KEY=your-api-key
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4

Working Memory Tuning

# Max messages kept per session (user+assistant each count as 1)
WORKING_MEMORY_MAX_MESSAGES=20
# Session expiry in minutes (default: 2 hours)
WORKING_MEMORY_TTL_MINUTES=120
# Disable if not needed
WORKING_MEMORY_ENABLED=true

Embedding Cache Tuning

# Max cached embeddings (LRU eviction)
EMBEDDING_CACHE_MAX_SIZE=2000
# Disable for debugging
EMBEDDING_CACHE_ENABLED=true

Hybrid Retrieval Weights

# Weights must sum to ~1.0
RETRIEVAL_VECTOR_WEIGHT=0.50 # Semantic similarity
RETRIEVAL_KEYWORD_WEIGHT=0.20 # Keyword overlap (BM25-inspired)
RETRIEVAL_TIME_DECAY_WEIGHT=0.15 # Recency (Ebbinghaus curve)
RETRIEVAL_IMPORTANCE_WEIGHT=0.15 # LLM-assigned importance
# Half-life for time decay (days): memories at this age score ~0.5
RETRIEVAL_HALF_LIFE_DAYS=90
# Candidate pool multiplier (topK × this = candidates fetched for reranking)
RETRIEVAL_CANDIDATE_MULTIPLIER=3

Memory Management

MAX_MEMORIES_PER_USER=1000
MEMORY_CLEANUP_THRESHOLD=0.9 # Trigger cleanup at 90% capacity
MEMORY_CLEANUP_TARGET=0.7 # Reduce to 70% after cleanup
MEMORY_SIMILARITY_TOP_K=8 # Candidates for write dedup
MEMORY_SIMILARITY_THRESHOLD=0.7 # L2 threshold for similarity
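
To make the threshold and target settings concrete, the arithmetic can be sketched as below. The `cleanupPlan` function is hypothetical and only shows how the numbers relate; which memories get removed is decided by the cleanup steps described under Hierarchical Memory Compression.

```typescript
// Hypothetical helper: decide whether cleanup should run and how many
// memories must go to get back down to the target fill level.
function cleanupPlan(
  count: number,
  maxPerUser: number = 1000, // MAX_MEMORIES_PER_USER
  threshold: number = 0.9,   // MEMORY_CLEANUP_THRESHOLD
  target: number = 0.7       // MEMORY_CLEANUP_TARGET
): { shouldClean: boolean; toRemove: number } {
  const triggerAt = Math.round(maxPerUser * threshold); // 900 with the defaults
  const keep = Math.round(maxPerUser * target);         // 700 with the defaults
  if (count < triggerAt) return { shouldClean: false, toRemove: 0 };
  return { shouldClean: true, toRemove: count - keep };
}
```

With the defaults, a user holding 950 memories triggers cleanup, and 250 of them need to be removed or compressed to reach the 70% target.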

📖 API Endpoints

Authentication

POST /api/auth/register - Register User
curl -X POST http://localhost:3000/api/auth/register \
 -H "Content-Type: application/json" \
 -d '{"username": "alice"}'

Response:

{
 "userId": "alice",
 "username": "alice",
 "token": "eyJhbGciOiJIUzI1NiIs..."
}
POST /api/auth/login - Login
curl -X POST http://localhost:3000/api/auth/login \
 -H "Content-Type: application/json" \
 -d '{"username": "alice"}'

Chat

POST /api/chat - Send Message
curl -X POST http://localhost:3000/api/chat \
 -H "Authorization: Bearer YOUR_TOKEN" \
 -H "Content-Type: application/json" \
 -d '{
 "workspaceId": "work-project",
 "message": "I prefer TypeScript for backend development",
 "sessionId": "optional-session-uuid"
 }'

Response:

{
 "response": "I'll remember that you prefer TypeScript...",
 "memoriesUsed": 2,
 "memoriesStored": 1,
 "sessionId": "alice:work-project"
}

Upgraded Flow:

  1. Resolves or creates session (working memory)
  2. Retrieves long-term memories via hybrid search (vector + keyword + time decay)
  3. Calls LLM with session history + long-term memories + current message
  4. Updates working memory (sync)
  5. Async pipeline: single LLM call extracts facts + decides create/update/merge/skip

Memory Management

GET /api/memories?workspaceId=xxx - List Memories
curl "http://localhost:3000/api/memories?workspaceId=work-project" \
 -H "Authorization: Bearer YOUR_TOKEN"

Response:

{
 "count": 3,
 "memories": [
 {
 "id": "memory-uuid",
 "content": "User prefers TypeScript for backend",
 "importanceScore": 8
 }
 ]
}
PUT /api/memories/:id - Update Memory
curl -X PUT http://localhost:3000/api/memories/memory-uuid \
 -H "Authorization: Bearer YOUR_TOKEN" \
 -H "Content-Type: application/json" \
 -d '{"content": "Updated content"}'
DELETE /api/memories/:id - Delete Memory
curl -X DELETE http://localhost:3000/api/memories/memory-uuid \
 -H "Authorization: Bearer YOUR_TOKEN"

🛠️ Tech Stack

| Component  | Technology                   | Notes                                  |
|------------|------------------------------|----------------------------------------|
| Runtime    | Node.js + TypeScript         | Type-safe throughout                   |
| Framework  | Express.js                   | REST API                               |
| Vector DB  | Milvus 2.4                   | HNSW index, Partition Key isolation    |
| Embeddings | @xenova/transformers (local) | MiniLM-L12-v2, 384-dim, with LRU cache |
| LLM        | OpenAI API (or compatible)   | Single-call pipeline                   |
| Auth       | JWT                          | Stateless, multi-tenant                |
| Container  | Docker Compose               | Milvus + etcd + MinIO                  |

📁 Project Structure

src/
├── config/
│   └── index.ts                    # All config with env var overrides
├── middlewares/
│   └── auth.middleware.ts          # JWT verification
├── services/
│   ├── memory.service.ts           # Memory orchestration (pipeline + hybrid retrieval)
│   ├── working-memory.service.ts   # Session-level short-term memory
│   ├── milvus.service.ts           # Vector DB (HNSW, Partition Key isolation)
│   ├── embedding.service.ts        # Embeddings + LRU cache
│   ├── llm.service.ts              # LLM (chat with session history, pipeline)
│   ├── chunking.service.ts         # Text chunking
│   ├── compression.service.ts      # Cluster-based memory compression
│   ├── memory-cleanup.service.ts   # Cleanup orchestration
│   ├── cleanup.service.ts          # Periodic expired memory cleanup
│   └── persona.service.ts          # AI persona management
├── controllers/
│   ├── chat.controller.ts          # Chat endpoint with working memory
│   ├── memory.controller.ts        # Memory CRUD
│   ├── auth.controller.ts
│   └── persona.controller.ts
├── routes/
├── types/
│   └── index.ts                    # Full type definitions incl. MemoryCategory
└── utils/

🔒 Security Best Practices

  1. Never trust client input: user_id always comes from the JWT, never from the request body
  2. Defense in depth: auth middleware + Milvus partition key (two independent layers)
  3. No plaintext secrets: environment variables only
  4. Input validation: TypeScript type checking on all endpoints
  5. Tenant isolation: even if one tenant guesses another's workspaceId, the user_id partition key blocks all cross-tenant queries

📊 Performance Characteristics

| Scenario                        | Before        | After                         |
|---------------------------------|---------------|-------------------------------|
| Memory pipeline LLM calls       | 2 (serial)    | 1                             |
| Embedding for same message      | 2× inference  | 1× (cache hit)                |
| Search candidates for retrieval | topK exact    | topK × 3 + rerank             |
| Index type                      | IVF_FLAT      | HNSW (million-scale)          |
| Multi-turn context              | Not supported | Working memory (last N turns) |

🗺️ Roadmap

  • Streaming responses (SSE)
  • Knowledge graph layer (entity + relation extraction)
  • Pluggable embedding model (support OpenAI, Cohere, etc.)
  • Redis-backed working memory (for multi-instance deployments)
  • Multi-modal memory (images, files)
  • Admin dashboard with memory analytics
  • Rate limiting per tenant
  • Prometheus metrics endpoint

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Milvus: High-performance vector database with HNSW support
  • Transformers.js: Local multilingual embeddings
  • OpenAI: LLM capabilities

⭐ If this project helped you, please give it a star! ⭐

Report Bug · Request Feature
