
EverMemBench

A comprehensive benchmark for quantifying and diagnosing memory systems in large language models

English | 简体中文

📖 Project Description

EverMemBench is a benchmark designed to quantify and diagnose the memory systems of large language models. It introduces, for the first time, a three-tiered evaluation framework for memory systems: Factual Recall, Applied Memory, and Personalization Generalization.

This layered approach enables researchers to go beyond traditional retrieval-style evaluations and conduct fine-grained diagnostics of model capabilities, precisely locating performance bottlenecks in information extraction, contextual reasoning, or style adaptation. By offering a reproducible and standardized testing framework, EverMemBench not only reveals the significant shortcomings of current state-of-the-art models in achieving deep personalization, but also provides clear guidance for targeted optimization of memory systems.

🌟 Key Contributions

  1. Progressive memory evaluation framework: We partition memory-system capabilities into three hierarchical layers — Factual Recall, Applied Memory, and Personalization Generalization — establishing a clear progression from pure retrieval to context integration to persona-consistent generation, thereby facilitating precise identification of performance bottlenecks.

  2. Realistic and diagnostic long-horizon multi-party chat dataset: Grounded in real workplace communication scenarios, we construct a long-horizon corpus with a multi-role, multi-group, cross-context setting that explicitly models temporal persona drift and community-switching effects, enabling the assessment of memory robustness under concurrent topics and frequent context switches.

  3. Unified quantification and standardized evaluation protocol: We provide consistent task formulations and measurement interfaces across the three core dimensions, supporting reproducible and comparable cross-model evaluation while reducing experimental bias in comparisons across systems and models.

  4. Systematic cross-model empirical analysis: We comprehensively evaluate mainstream memory systems (e.g., MemOS, MemoryOS, Mem0, A-Mem) and state-of-the-art LLMs (e.g., GPT-4.5, GPT-4.1, Gemini-2.5-Pro), conducting side-by-side comparisons within a unified framework and revealing notable deficiencies in the memory capabilities of current advanced models.

🗂️ Benchmark Description

To systematically and reproducibly assess and diagnose LLM memory capabilities, we construct a long-horizon, multi-party group-chat dataset grounded in realistic workplace communication. The dataset centers on a "multi-role—multi-group—cross-context" communication setting, explicitly modeling the dynamism and context-dependence of individual profiles. In real work scenarios, a person’s behavior and communicative style may drift over time as conversations unfold; at the same time, the same individual may act differently across communities/teams due to role relations and power structures. For example, a department director may be more decisive and stern within a direct-report team chat, yet more restrained in a cross-department strategic group among peers. We embed such "time-varying" and "community-varying" personas and interaction patterns into the data construction process to faithfully reflect the complex and common communication ecology of enterprises.
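To make the "multi-role, multi-group, cross-context" setting concrete, the sketch below shows one way such a chat record could be organized. Every name and field in it (Message, Participant, persona_by_group, the example group IDs) is a hypothetical illustration, not the released schema; see the Benchmark Data section below for the official release.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Message:
    timestamp: str      # e.g. "2024-03-05T09:14:00"
    group_id: str       # which group chat the message was posted in
    speaker: str        # participant posting the message
    text: str           # message content

@dataclass
class Participant:
    name: str
    # The same person may present differently per group (community-varying persona);
    # time-varying drift can be captured by timestamping persona descriptions as well.
    persona_by_group: Dict[str, str] = field(default_factory=dict)

@dataclass
class ChatCorpus:
    participants: List[Participant]
    messages: List[Message]    # long-horizon, interleaved across groups and topics

# Example: a department director who is decisive with direct reports
# but more restrained in a cross-department strategy group (all values invented).
director = Participant(
    name="Director Li",
    persona_by_group={
        "team-direct-reports": "decisive, stern, gives short directives",
        "cross-dept-strategy": "restrained, consensus-seeking among peers",
    },
)
```

Keeping a per-group persona map alongside timestamped messages is one simple way to represent both community-varying and time-varying behavior when building or inspecting such a corpus.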

Benefiting from this design, the dataset supports fine-grained, diagnostic evaluation of model memory systems under long conversations, concurrent topics, and frequent context switches. We summarize memory capability assessment along three core dimensions (a hypothetical task sketch follows this list):

  1. Fine-grained Detailed Recall. Tests retrieval ability, requiring the model to accurately reconstruct concrete facts from prior context.

  2. Memory Awareness. Evaluates retrieval accompanied by understanding: the model must recall past events and integrate them to produce contextually appropriate answers.

  3. User Profile Understanding. Focuses on personalization and adaptive generation. The model is expected to develop a stable understanding of individual preferences, roles, and tone based on historical interactions, and to adjust content and expression accordingly—avoiding replies that contradict the persona or are overly generic.
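To illustrate how the three dimensions could translate into concrete tasks, here is a minimal, hypothetical sketch of task records and a per-dimension scoring loop. The task texts, field names, and the pluggable judge_fn are assumptions made for illustration and do not describe the official EverMemBench protocol or data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MemoryTask:
    dimension: str     # "detailed_recall" | "memory_awareness" | "profile_understanding"
    question: str      # query posed against the long chat history
    reference: str     # gold answer or expected-behavior description

# Hypothetical examples, one per dimension (content invented for illustration).
tasks = [
    MemoryTask("detailed_recall",
               "What deadline did the director set for the Q3 report?",
               "Friday, June 28"),
    MemoryTask("memory_awareness",
               "Given the budget cut discussed last week, should the team still book the offsite venue?",
               "No; the venue now exceeds the reduced budget"),
    MemoryTask("profile_understanding",
               "Draft a status update addressed to the director in the direct-report group chat.",
               "Concise, action-oriented tone consistent with the director's persona"),
]

def evaluate(answer_fn: Callable[[str], str],
             judge_fn: Callable[[str, str], float],
             tasks: List[MemoryTask]) -> Dict[str, float]:
    """Average a [0, 1] judge score per dimension; judge_fn compares prediction vs. reference."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for task in tasks:
        score = judge_fn(answer_fn(task.question), task.reference)
        totals[task.dimension] = totals.get(task.dimension, 0.0) + score
        counts[task.dimension] = counts.get(task.dimension, 0) + 1
    return {dim: totals[dim] / counts[dim] for dim in totals}
```

A pluggable judge (exact match, embedding similarity, or an LLM grader) keeps the three dimensions comparable under one reporting interface, which is the spirit of the unified protocol described above.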

[Main figure]

📊 Benchmark Data

Coming Soon...

🏗️ Benchmark Curation Pipeline

Coming Soon...

📈 Performance on EverMemBench

Based on EverMemBench, we conducted a comprehensive evaluation of mainstream memory systems (e.g., MemOS, MemoryOS, Mem0, A-Mem) and state-of-the-art LLMs (e.g., GPT-4.5, GPT-4.1, Gemini-2.5-Pro), performing standardized measurements and cross-model comparisons across three core dimensions.

License

MIT license
