Releases: playeriv65/EasyLocomo
v0.1.0: Legacy Logic Alignment & Baseline Freeze
Release v0.1.0 - Baseline Alignment Verified
We are excited to announce the initial release of EasyLocomo. The primary focus of this version is to provide a streamlined, easy-to-use interface for the LoCoMo benchmark while maintaining strict logical and data consistency with the original official repository.
🎯 Baseline Alignment Verification
We tested EasyLocomo extensively with gpt-4o-mini to verify that its results are consistent with the original author's implementation. The minor differences observed are primarily due to the non-deterministic nature of LLM outputs and to randomized option ordering in Category 5 questions.
Performance Comparison (Macro F1)
Note on Reproducibility: Two factors make run-to-run results unstable. First, the original implementation uses unordered set containers, which introduces prompt-level randomness. Second, the legacy F1 scoring logic does not recognize responses that are semantically equivalent but phrased differently. As a result, re-running the evaluation with identical models and code typically produces a variance of up to 5%. Exact, bit-level parity is therefore not attainable in practice, but macro-statistical alignment has been achieved.
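To illustrate the scoring limitation described in the note, here is a minimal sketch of token-overlap F1. It is a simplified stand-in and not the exact legacy LoCoMo scorer: the `normalize` and `token_f1` helpers and the example answers are assumptions made for illustration only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (hypothetical helper)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# The same date, phrased two ways.
exact = token_f1("7 May 2023", "7 May 2023")
paraphrase = token_f1("the seventh of May, 2023", "7 May 2023")
print(f"exact match: {exact:.4f}")
print(f"paraphrase:  {paraphrase:.4f}")
```

Both predictions name the same date, yet the paraphrase is penalized because "seventh" and "of" do not match any reference token. A model that changes its phrasing between runs can therefore move the score without being any more or less correct.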
The following table compares the F1 scores between the official LoCoMo logic (Original) and the EasyLocomo implementation:
| QA Category | Original (Official) F1 | EasyLocomo F1 | Difference |
|---|---|---|---|
| Temporal | 0.3439 | 0.3551 | +0.0112 |
| Single-hop | 0.2594 | 0.2885 | +0.0291 |
| Multi-hop | 0.3808 | 0.3935 | +0.0127 |
| Open-domain | 0.6231 | 0.6202 | -0.0029 |
| Adversarial | 0.1883 | 0.1794 | -0.0089 |
| Overall Accuracy | 0.4284 | 0.4301 | +0.0017 |
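The Difference column can be rechecked with a few lines of Python. This sketch simply re-enters the values from the table above; the `scores` dictionary is an illustrative structure and does not reflect the schema of the released JSON files.

```python
# (Original F1, EasyLocomo F1), copied from the table above.
scores = {
    "Temporal": (0.3439, 0.3551),
    "Single-hop": (0.2594, 0.2885),
    "Multi-hop": (0.3808, 0.3935),
    "Open-domain": (0.6231, 0.6202),
    "Adversarial": (0.1883, 0.1794),
    "Overall Accuracy": (0.4284, 0.4301),
}

# Difference = EasyLocomo minus Original, rounded to the table's precision.
differences = {
    name: round(new - old, 4) for name, (old, new) in scores.items()
}

# Row with the largest absolute gap between the two implementations.
largest = max(differences, key=lambda name: abs(differences[name]))

for name, diff in differences.items():
    print(f"{name:<18}{diff:+.4f}")
print(f"largest gap: {largest}")
```

The largest per-category gap is in Single-hop, while the overall score differs by less than 0.002.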
📦 Release Attachments
For full transparency, the following 6 JSON files containing raw predictions and statistical summaries are included in the release assets:
- `new_res.json`: Raw predictions from EasyLocomo.
- `new_res_stats.json`: Detailed per-question metrics for EasyLocomo.
- `new_res_summary.json`: Aggregated performance summary for EasyLocomo.
- `old_res.json`: Raw predictions from the original official code.
- `old_res_stats.json`: Detailed per-question metrics for the original code.
- `old_res_summary.json`: Aggregated performance summary for the original code.
🚀 Key Improvements in v0.1.0
- Streamlined Workflow: Unified environment management via `uv`.
- OpenAI Standard: Support for all OpenAI-compatible API endpoints.
- Robustness: Integrated breakpoint resumption and JSON-mode parsing error handling.
- Cost Control: Built-in token estimation utility.
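As an illustration of what breakpoint resumption involves, the sketch below checkpoints predictions to disk after every question and skips already-answered questions on restart. It is not EasyLocomo's actual implementation: `run_with_resume`, the checkpoint layout (a JSON object keyed by question id), and the stand-in answer functions are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable


def run_with_resume(
    questions: dict[str, str],
    answer_fn: Callable[[str], str],
    checkpoint_path: str,
) -> dict[str, str]:
    """Answer every question, reusing results saved by an earlier run."""
    results: dict[str, str] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            results = json.load(f)

    for question_id, question in questions.items():
        if question_id in results:
            continue  # already answered before the interruption
        results[question_id] = answer_fn(question)
        # Write to a temporary file, then rename, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "predictions.json")
    questions = {"q1": "first", "q2": "second", "q3": "third"}

    def fails_on_second(question: str) -> str:
        # Stand-in for a model call that dies partway through the run.
        if question == "second":
            raise RuntimeError("simulated API failure")
        return question.upper()

    try:
        run_with_resume(questions, fails_on_second, checkpoint)
    except RuntimeError:
        pass

    answered: list[str] = []

    def record(question: str) -> str:
        answered.append(question)
        return question.upper()

    final = run_with_resume(questions, record, checkpoint)

print(answered)
print(final)
```

On the second run only the unanswered questions reach the model, so an interrupted evaluation does not pay again for completed calls.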