Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational Memory) benchmark.
Memvid achieves 85.7% accuracy on LoCoMo, roughly a 28% relative improvement over leading memory systems such as Mem0 (see the comparison table below).
| Category | Accuracy | Questions |
|---|---|---|
| Single-hop | 80.14% | 282 |
| Multi-hop | 80.37% | 321 |
| Temporal | 71.88% | 96 |
| World-knowledge | 91.08% | 841 |
| Adversarial | 77.80% | 446 |
| Overall (Cat. 1-4) | 85.65% | 1,540 |
Following standard methodology, the adversarial category is excluded from the primary metric.
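The headline number is the question-weighted average over the four non-adversarial categories. A quick sanity check against the table above:

```typescript
// Question-weighted overall accuracy over categories 1-4 (adversarial excluded).
const categories = [
  { name: "Single-hop", accuracy: 0.8014, questions: 282 },
  { name: "Multi-hop", accuracy: 0.8037, questions: 321 },
  { name: "Temporal", accuracy: 0.7188, questions: 96 },
  { name: "World-knowledge", accuracy: 0.9108, questions: 841 },
];

const total = categories.reduce((n, c) => n + c.questions, 0);
const overall =
  categories.reduce((s, c) => s + c.accuracy * c.questions, 0) / total;

console.log(`${(overall * 100).toFixed(2)}% over ${total} questions`); // 85.65% over 1540 questions
```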
- Judge Model: gpt-4o-mini (lenient grading)
- Answering Model: gpt-4o
- Embedding Model: text-embedding-3-large
- Search Mode: Hybrid (BM25 + Semantic)
- Retrieval K: 60
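Memvid's retrieval internals are not part of this repo. Purely as an illustration of what a hybrid BM25 + semantic mode typically does, here is a minimal reciprocal-rank-fusion sketch; the function and parameters are hypothetical, not Memvid's actual code:

```typescript
// Hypothetical sketch: fuse a BM25 ranking and a semantic ranking with
// reciprocal rank fusion (RRF), then keep the top K = 60 document IDs.
function fuseRankings(
  bm25: string[],
  semantic: string[],
  k = 60,
  c = 60, // standard RRF damping constant
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of [bm25, semantic]) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (c + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([id]) => id);
}
```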
```bash
# Install
bun install

# Set environment variables
export OPENAI_API_KEY=sk-...
export MEMVID_API_KEY=mv2_...  # Get from memvid.dev
export MEMVID_MEMORY_ID=...    # Your memory ID

# Run full benchmark (~3 hours)
bun run bench:full

# Or quick test (100 questions, ~10 min)
bun run bench:quick
```
You can also create a `.env` file with these variables.
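For example (values are placeholders):

```bash
# .env
OPENAI_API_KEY=sk-...
MEMVID_API_KEY=mv2_...
MEMVID_MEMORY_ID=...
```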
```bash
# Full benchmark (1,986 questions)
bun run src/index.ts run -r my-run --force

# Limit to N questions
bun run src/index.ts run -r quick -l 100

# Sample N per category
bun run src/index.ts run -r sample -s 25

# List questions
bun run src/index.ts list -l 20

# Resume interrupted run
bun run src/index.ts run -r my-run
```
| Flag | Description |
|---|---|
| `-r, --run-id` | Run identifier (required) |
| `-j, --judge` | Judge model (default: `gpt-4o-mini`) |
| `-m, --model` | Answering model (default: `gpt-4o`) |
| `-l, --limit` | Limit total questions |
| `-s, --sample` | Sample N questions per category |
| `-t, --types` | Filter by question types (comma-separated) |
| `--force` | Clear checkpoint and start fresh |
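Flags can be combined. For instance (the run ID is arbitrary, and the type names here are guesses based on the category table, so check the actual type identifiers with `list` first):

```bash
# Sample 25 questions each from two categories, grading with gpt-4o-mini
bun run src/index.ts run -r temporal-probe -s 25 -t temporal,multi-hop -j gpt-4o-mini
```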
Results are saved to `data/runs/{run-id}/`:

```
data/runs/my-run/
├── checkpoint.json  # Full evaluation data
└── report.json      # Summary metrics
```
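A minimal sketch for inspecting a finished run; the field names in `report.json` are assumptions here, so adjust to the actual schema:

```typescript
// Hypothetical report.json reader; the field names are assumed, not verified.
import { readFileSync } from "node:fs";

interface Report {
  overallAccuracy: number;             // assumed field
  categories: Record<string, number>;  // assumed field: per-category accuracy
}

const report: Report = JSON.parse(
  readFileSync("data/runs/my-run/report.json", "utf8"),
);

console.log(`Overall: ${(report.overallAccuracy * 100).toFixed(2)}%`);
for (const [category, acc] of Object.entries(report.categories)) {
  console.log(`${category}: ${(acc * 100).toFixed(2)}%`);
}
```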
- Categories 1-4 accuracy (excludes adversarial)
- Lenient LLM-as-judge grading
- Standard evaluation prompt
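The benchmark's actual judge prompt lives in the source; as an illustration only, a lenient LLM-as-judge call typically looks like the sketch below (the prompt wording and `judgeAnswer` helper are hypothetical):

```typescript
// Hypothetical lenient LLM-as-judge sketch using the openai SDK.
// The prompt and parsing used by this benchmark may differ.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function judgeAnswer(
  question: string,
  goldAnswer: string,
  modelAnswer: string,
): Promise<boolean> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    messages: [
      {
        role: "user",
        content:
          `Question: ${question}\nGold answer: ${goldAnswer}\n` +
          `Candidate answer: ${modelAnswer}\n` +
          `Grade leniently: reply CORRECT if the candidate conveys the gold ` +
          `answer's meaning, even with extra detail or different wording; ` +
          `otherwise reply WRONG.`,
      },
    ],
  });
  return res.choices[0].message.content?.trim().startsWith("CORRECT") ?? false;
}
```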
| System | LoCoMo Accuracy |
|---|---|
| Memvid | 85.65% |
| Full-context | 72.90% |
| Mem0g | 68.44% |
| Mem0 | 66.88% |
| Zep | 65.99% |
| LangMem | 58.10% |
| OpenAI | 52.90% |
Baseline figures are taken from arXiv:2504.19413 (the Mem0 technical report); some vendors dispute these numbers.
- LoCoMo Dataset
- LoCoMo Paper: Maharana et al., "Evaluating Very Long-Term Conversational Memory of LLM Agents", ACL 2024
MIT