MemvidBench

Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational Memory) benchmark.

Results

Memvid achieves 85.65% accuracy on LoCoMo (categories 1-4), a relative improvement of roughly 28% over leading memory systems (see Comparison below).

Category             Accuracy   Questions
Single-hop           80.14%     282
Multi-hop            80.37%     321
Temporal             71.88%     96
World-knowledge      91.08%     841
Adversarial          77.80%     446
Overall (Cat. 1-4)   85.65%     1,540

Following standard methodology, the adversarial category is excluded from the primary metric.
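
The overall figure is the question-weighted average of categories 1-4 and can be verified directly from the table (a quick sanity check, not part of the benchmark tooling):

// Sanity check: overall (Cat. 1-4) = question-weighted average of the four categories.
const cats = [
  { acc: 0.8014, n: 282 }, // Single-hop
  { acc: 0.8037, n: 321 }, // Multi-hop
  { acc: 0.7188, n: 96 },  // Temporal
  { acc: 0.9108, n: 841 }, // World-knowledge
];
const total = cats.reduce((s, c) => s + c.n, 0);                   // 1,540 questions
const overall = cats.reduce((s, c) => s + c.acc * c.n, 0) / total;
console.log((overall * 100).toFixed(2) + "%");                     // "85.65%"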

Configuration

  • Judge Model: gpt-4o-mini (lenient grading)
  • Answering Model: gpt-4o
  • Embedding Model: text-embedding-3-large
  • Search Mode: Hybrid (BM25 + Semantic)
  • Retrieval K: 60
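
The judge and answering models correspond to the -j and -m CLI flags listed under Options below; an equivalent explicit invocation would look like the following (a sketch: the embedding model, search mode, and retrieval K are not documented as CLI flags here and are presumably configured on the Memvid side):

# Explicit form of the default configuration (judge and answering model flags only)
bun run src/index.ts run -r full -j gpt-4o-mini -m gpt-4o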

Quick Start

# Install
bun install
# Set environment variables
export OPENAI_API_KEY=sk-...
export MEMVID_API_KEY=mv2_... # Get from memvid.dev
export MEMVID_MEMORY_ID=... # Your memory ID
# Run full benchmark (~3 hours)
bun run bench:full
# Or quick test (100 questions, ~10 min)
bun run bench:quick

You can also create a .env file with these variables.
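
A .env file with placeholder values would look like this (replace the elided values with your own keys from OpenAI and memvid.dev):

# .env
OPENAI_API_KEY=sk-...
MEMVID_API_KEY=mv2_...
MEMVID_MEMORY_ID=...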

Commands

# Full benchmark (1,986 questions)
bun run src/index.ts run -r my-run --force
# Limit to N questions
bun run src/index.ts run -r quick -l 100
# Sample N per category
bun run src/index.ts run -r sample -s 25
# List questions
bun run src/index.ts list -l 20
# Resume interrupted run
bun run src/index.ts run -r my-run

Options

-r, --run-id Run identifier (required)
-j, --judge Judge model (default: gpt-4o-mini)
-m, --model Answering model (default: gpt-4o)
-l, --limit Limit total questions
-s, --sample Sample N per category
-t, --types Filter by types (comma-separated)
--force Clear checkpoint and start fresh
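
For example, the options can be combined to grade a small slice of the benchmark with a different judge (the category names accepted by -t are an assumption here; use the list command to see the actual type values):

# Hypothetical example: 50 temporal questions graded by gpt-4o
bun run src/index.ts run -r temporal-check -t temporal -j gpt-4o -l 50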

Output

Results are saved to data/runs/{run-id}/:

data/runs/my-run/
├── checkpoint.json # Full evaluation data
└── report.json # Summary metrics
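
The report schema is not documented here; a minimal post-processing sketch simply loads report.json with Bun and pretty-prints whatever it contains:

// Minimal sketch (report schema assumed, not documented here): print a run's summary report.
const report = await Bun.file("data/runs/my-run/report.json").json();
console.log(JSON.stringify(report, null, 2));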

Methodology

  • Categories 1-4 accuracy (excludes adversarial)
  • Lenient LLM-as-judge grading (sketched below)
  • Standard evaluation prompt
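
A minimal sketch of lenient LLM-as-judge grading, assuming the official OpenAI SDK; the benchmark's actual evaluation prompt and answer parsing may differ:

// Illustrative only: lenient LLM-as-judge grading with the OpenAI SDK.
import OpenAI from "openai";

const openai = new OpenAI();

async function judge(question: string, gold: string, candidate: string): Promise<boolean> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You are a lenient grader. Reply CORRECT if the candidate answer conveys " +
          "the same key information as the gold answer, even if worded differently; " +
          "otherwise reply WRONG.",
      },
      {
        role: "user",
        content: `Question: ${question}\nGold answer: ${gold}\nCandidate answer: ${candidate}`,
      },
    ],
  });
  return /CORRECT/i.test(res.choices[0]?.message.content ?? "");
}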

Comparison

System         LoCoMo Accuracy
Memvid         85.65%
Full-context   72.90%
Mem0g          68.44%
Mem0           66.88%
Zep            65.99%
LangMem        58.10%
OpenAI         52.90%

Baseline figures from arXiv:2504.19413. Some vendors dispute these results.

References

  • arXiv:2504.19413 (source of the baseline comparison figures above)

License

MIT
