Domain-adaptive fine-tune as orthogonal R@5 lift on top of MemPal raw #1249
Hi MemPal team,

We've been using LongMemEval to evaluate a small open-source library

What we measured
Same dataset (

Three findings worth flagging:

Possible integration shape
If interesting, a

We don't have strong feelings about the shape, happy to defer to

Reproduce
pip install adaptmem
git clone https://github.com/nakata-app/adaptmem
cd adaptmem
make bench-longmemeval  # FT-100 self-contained run

Three committed result JSONs in

Either outcome is fine
If this isn't a fit for mempal's direction, no problem, adaptmem

Thanks again for the open work, the project structure made

Nakata
Replies: 6 comments 1 reply
-
Congrats on v3.3.4 — the DB size reduction is impressive. Quick question: did the storage optimisation affect the index structure at all, or is the longmemeval_bench.py protocol identical to v3.3.3? The numbers in the post above were run against the previous release — want to check if a rerun against v3.3.4 is needed before the comparison goes stale.
-
I'm reading your work! Excited to learn more.
-
Quick follow-up on the May 1 question about v3.3.4+ protocol equivalence: I re-ran all three rows on v3.3.5 (latest release as of today) and also did a controlled v3.3.3 repro to isolate the source of any movement. Numbers below.

Three runs on v3.3.5 (full 500q, matched protocol)
Same

Three takeaways

What the deltas mean

Encoder fine-tune and hybrid retrieval are still adding lift on top of each other at v3.3.5. R@5 is ceiling-bounded (close to 1.000), so R@1 is the honest comparison and the orthogonality reads clearly there.

Reproduce
cd ~/Projects/mempalace && git checkout v3.3.5
cd ~/Projects/adaptmem
PYTHONPATH=/path/to/mempalace python benchmarks/mempal_bench_with_ft.py \
  --bench-script /path/to/mempalace/benchmarks/longmemeval_bench.py \
  --data-file /path/to/longmemeval_s_cleaned.json \
  --ft-model /path/to/minilm-lme-ft-300 \
  --mode {raw|hybrid_v4} \
  --out results.jsonl

The three v3.3.5 result JSONLs are committed in

If hybrid_v4 reruns on top of these numbers are useful to compare against your own internal measurements, happy to share the result JSONLs directly. Otherwise this is just to close the May 1 question with current numbers.
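For readers following the R@1 versus R@5 point, here is a minimal sketch of a hit-based recall@k. It assumes a query counts as a hit when any gold session id appears in the top k; the actual scoring in longmemeval_bench.py may differ, and the function names and toy data are hypothetical.

```python
from typing import Sequence, Set, Tuple, List


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Return 1.0 if any gold id is in the top-k ranked ids, else 0.0 (assumed definition)."""
    return 1.0 if any(doc_id in gold_ids for doc_id in ranked_ids[:k]) else 0.0


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average the per-query hit indicator over all queries."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Toy data, not benchmark output.
runs = [
    (["s3", "s1", "s7"], {"s1"}),  # gold at rank 2: misses R@1, hits R@5
    (["s2", "s9", "s4"], {"s2"}),  # gold at rank 1: hits both
    (["s5", "s6", "s8"], {"s0"}),  # gold never retrieved: misses both
]
r1 = mean_recall_at_k(runs, 1)
r5 = mean_recall_at_k(runs, 5)
```

Because R@5 forgives any rank from 1 to 5, it saturates sooner than R@1, which is why the stricter metric separates the rows more clearly near the ceiling.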
-
Quick update on the v3.3.5 rerun comment: running on the same matched-protocol harness, the ft-v4 encoder upgrade plus a three-stage rerank stack pushes the R@1 0.95 row to R@1 0.99 (5 fails / 500).
Stages on top of hybrid_v4 + ft-v4:
- trust-gated CE rerank: chat-ce-v3 (chat domain), margin=1.0 confidence gate. Pure-CE rerank had a measurable overcorrection bug (helped 7 / hurt 4 on preference); the trust gate keeps the bi-encoder top-1 unless CE's margin is high. Net: +0.010 R@1, 0 hurt.
- time-aware temporal proximity: same regex + gaussian proximity boost we had at v3.3.5; reuses the Sprint 1 task3 logic on the trust-gate output. Net: +0.004 R@1, 0 hurt.
- targeted LLM rerank on residual fails only: DeepSeek V4 Flash, 3-vote self-consistency, top-K=10. Only fires on the ≤10% of queries the deterministic stages leave with low CE confidence. Net: +0.004 R@1, 0 hurt.
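
The two deterministic stages above can be sketched roughly as follows. This is an illustration, not the adaptmem implementation: it assumes "margin" means the cross-encoder score gap between its preferred candidate and the bi-encoder top-1, and the names `trust_gated_top1`, `temporal_boost`, `sigma_days`, and `weight` are hypothetical.

```python
import math
from typing import Dict, List


def trust_gated_top1(bi_ranked: List[str], ce_scores: Dict[str, float],
                     margin: float = 1.0) -> str:
    """Keep the bi-encoder top-1 unless the cross-encoder prefers another
    candidate by at least `margin` (assumed reading of the gate)."""
    bi_top = bi_ranked[0]
    ce_best = max(bi_ranked, key=lambda doc_id: ce_scores[doc_id])
    if ce_best != bi_top and ce_scores[ce_best] - ce_scores[bi_top] >= margin:
        return ce_best
    return bi_top


def temporal_boost(score: float, doc_day: float, target_day: float,
                   sigma_days: float = 7.0, weight: float = 0.1) -> float:
    """Add a Gaussian bonus that peaks when the document date matches the
    date referenced in the query (sigma and weight are illustrative values)."""
    gap = doc_day - target_day
    return score + weight * math.exp(-(gap ** 2) / (2 * sigma_days ** 2))


# Toy scores, not benchmark output.
kept = trust_gated_top1(["a", "b", "c"], {"a": 2.0, "b": 2.5, "c": 0.1})
swapped = trust_gated_top1(["a", "b", "c"], {"a": 2.0, "b": 3.2, "c": 0.1})
```

In the first call the cross-encoder prefers "b" by only 0.5, so the gate keeps "a"; in the second the gap is 1.2 and the gate lets the swap through. This is the mechanism by which a gate of this kind avoids overcorrecting on low-confidence cases.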
Remaining 5 fails decompose as 1 abstain (_abs ground-truth, structural eval noise, unrecoverable) + 4 hard cases (cousin-wedding, chocolate-cake, milestone-4-weeks-ago, book-discount-trunc). Noise-adjusted ceiling looks like ~0.998.
Repo: nakata-app/adaptmem, results/sprint_0p99/SPRINT_4_FINAL.md has the per-stage numbers, fail diagnoses, and the three rerank scripts.
Two possible integration shapes if interesting: an opt-in mempal --rerank adaptmem plugin keeping mempal's API surface unchanged, or upstream PR of just the deterministic layers (trust gate + time-aware) without the paid-LLM dependency. The LLM stage is intentionally optional; V4 Flash costs ~$0.05 per 500-query benchmark, but plugin users get 0.987 from the free Llama-70B NIM fallback alone.
Happy to share JSONL artefacts and pipeline scripts under whichever direction fits.
-
jphein, thanks for the previous reply. I ran a paired bootstrap (10K resamples, 95% CI) on the 20-probe ablation, and I'm posting the numbers below so both sides can see them.

B vs A (heading-aware vs paragraph), on our corpus and probe set:

The rank comes out identical for every single probe. Paragraph and heading-aware chunking produce essentially the same drawer segmentation (3759 vs 3747 chunks @ cs=400). So on our probe set the markdown heading split doesn't "fire". I'm not saying your conceptual argument is wrong; I just can't measure it.

C vs A (AST vs paragraph), the opposite of your "complexity without lift" advice:

At cs=800, AST gives lift with the 95% CI above zero for both encoders. At cs=400 the lift disappears.

Request: our probe set is 20 hard-coded entries (

The probe YAML (the script's

Is the writeup of the "structured extraction + graph traversal" approach for code published? I'd like to see the pipeline written up; a parallel track could be useful on our retrieval surface.

Thanks,
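
For anyone wanting to reproduce the interval style used above, a paired bootstrap over per-probe scores can be sketched like this. It is a minimal percentile-bootstrap illustration with a hypothetical function name and toy data; the actual script and probe scores are not shown in this thread.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(scores_a: List[float], scores_b: List[float],
                        n_resamples: int = 10_000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean delta, CI low, CI high) for B minus A, resampling probes
    with replacement so each probe's A and B scores stay paired."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi


# Toy case: identical per-probe scores, as in the B-vs-A reading.
flat = paired_bootstrap_ci([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 1.0, 1.0])
```

When every probe has the same score under both variants, all deltas are zero and the interval collapses to a point at zero, which is what an "identical rank on every probe" result looks like under this procedure.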
-
@nakata-app, thanks for running the paired bootstrap with the CIs; the B-vs-A flat reading and the C-vs-A cs=800 lift on your 20-probe set both look defensible at the n you ran. Quick reply to your three asks, plus a cross-reference that may compose with the additive-axes story.

The n=200 probe set
Lives on the fork at

Shape: 200 questions, file-shaped

The YAML is self-contained, no

On the "structured extraction + graph traversal" question
The "skip chunking for code, do AST-extraction-into-graph" framing in this thread came from @xg-gh-25 on #1384, not from us; worth attributing there. That said, the parallel-track angle is reasonable because our fork is doing graph traversal at the substrate layer, just from a different starting point:

So we have the graph traversal substrate but not the AST-to-graph extraction step. xg-gh-25's pipeline note suggests the missing piece is upstream of the graph, not in it. Worth their own writeup; I'll let them speak to that.

FT-300 independent reproduction (just landed)
Cross-reference your additive-axes story directly: reproduced FT-300 end-to-end on katana this morning from

Same on 500q full (training questions included): R@5 = 0.9980 (5/6 categories saturate at 1.000; small dip on single-session-assistant at 0.9821). Wall clock 56s train + 18s test on the GPU. Reproduces inside published noise; your FT-300 protocol is portable.

Full writeup + reproducible split JSON:

For methodological completeness, three code-tuned variants from your

Composition direction worth checking next
Your matched-protocol numbers had hybrid_v4 + FT-300 + 3-stage rerank at R@1 = 0.99 in SPRINT_4_FINAL.md. Substrate-floor parity in our SME #9 thread confirms postgres+pgvector + MiniLM = chromadb + MiniLM byte-identically (R@5 = 0.9660, per-category exact match across all 6 qtypes). So in principle the FT-300 + hybrid_v4 + rerank stack should compose into our postgres substrate the same way it composes into upstream chromadb. We haven't measured that yet; the hybrid retrieval layer on the postgres backend is the next item that needs an SME-side reading. Will post when that lands.

Question back: your

🫏
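
As a reference for the held-out reading, a seeded 300/200 split over 500 question ids can be sketched as below. This is only an assumption about the general shape of such a split; the fork's actual split JSON and procedure are not reproduced here, and `split_questions` and the id format are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_questions(question_ids: Sequence[str], n_train: int = 300,
                    seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle ids with a fixed seed and cut into train / held-out test.
    Sorting first makes the result independent of input order."""
    ids = sorted(question_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Hypothetical ids standing in for the 500 LongMemEval questions.
all_ids = [f"q{i:03d}" for i in range(500)]
train_ids, test_ids = split_questions(all_ids)
```

Keeping the test ids disjoint from the fine-tune ids is what separates the held-out 200q number from the 500q full run, where training questions are included.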
-
@nakata-app, wanted to follow your additive-axes story with a cross-domain data point that I think extends it rather than challenges it. Short version: your in-domain lift reproduces and holds up for us; when we carried the same encoder to a different corpus it flattened; and a finding from our side suggests there may be a fourth orthogonal layer worth stacking on top.

First: the in-domain lift is real, and it reproduces
Your orthogonal-layers framing is compelling, and the numbers back it. Your published table has MemPal raw R@5 0.966 → +FT-300 0.980 → +hybrid_v4+FT-300 0.990, with R@1 climbing 0.806 → 0.862 → 0.916, with encoder fine-tune and hybrid retrieval each adding lift on top of the other. We reproduced the FT-300 leg end-to-end on our own hardware (katana, fresh seed=42 300/200 split) and the held-out 200q test hit R@5 = 1.000 (R@1 0.925), inside your published noise. So the in-domain encoder lift isn't a one-machine artifact; the protocol is portable and the R@5 lift toward ceiling is genuine. No argument from us there.

Where it gets interesting: a cross-domain transfer test
We then did something your thread hadn't covered: carried the same FT-300 encoder to a deliberately different corpus, jp-realm-v0.1, a 30-question probe set over a personal technical knowledge base (135k drawers of code, infra notes, RFCs), scored by substring

Here the lift didn't transfer: R@5 0.5172 → 0.5172, flat. 24 of 29 covered questions move exactly 0.0; the FT encoder ranks the same drawers as base. A from-recipe re-train of the fine-tune (third leg) landed within ±2pp of base too, so two independently trained FT encoders both no-op'd on this corpus. (One honest detail: the published FT-300 we have carries code/scientific-computing training content, so against a personal technical KB it's genuinely out-of-domain, which makes it the cleanest version of the test.)

Read together with your numbers, this is completely consistent if the lift is domain-specific: strong when the fine-tune corpus and eval corpus are the same family, flat across a corpus shift. That's not a knock on the method; it's a boundary on it. So the real question back, collaborator-to-collaborator: have you seen the orthogonal lift hold across a corpus shift, or does it want hard-negative re-mining on the target corpus to travel? Your

A possible fourth orthogonal layer
One more finding that I think composes with your encoder+hybrid stack rather than competing with it. On oracle LongMemEval (gold session pinned in context, retrieval held at its 0.974 R@5 ceiling) we measured reader QA at only ~50%: a ~45pp R@5→QA gap (our #116). The right evidence is in front of the reader and it still misses. So on that corpus an encoder lift driving R@5 from 0.966 toward 1.000 is real but doesn't, on its own, move end-to-end QA; the bottleneck has shifted downstream to the reader/consumption layer. On top of your stack (encoder-FT, hybrid retrieval, the rerank cascade) this reads like one more orthogonal layer: reader/prompt design. (Our stratified n=150 retrieval A/B also had graph/age fusion ~neutral, R@5 92.67% vs 92.00%, which is why we're now spending our attention on the reader rather than retrieval.)

Full writeups and the convergent findings are on our results page: https://techempower-org.github.io/multipass-structural-memory-eval/site/#benchmarks

Genuinely, the reproducibility of your protocol is what let us run the cross-domain test at all. Curious to hear if corpus-shift transfer is something you've poked at.

🫏
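
The "N of M questions move exactly 0.0" style of reading in the transfer test can be produced with a per-question comparison along these lines. This is a sketch on toy data; `summarize_deltas` and the score dictionaries are hypothetical and do not reproduce the jp-realm-v0.1 results.

```python
from typing import Dict, Tuple


def summarize_deltas(base: Dict[str, float], ft: Dict[str, float],
                     tol: float = 1e-9) -> Tuple[int, int, float]:
    """Compare per-question scores for questions present in both runs.
    Returns (unchanged count, compared count, mean delta of ft minus base)."""
    shared = sorted(set(base) & set(ft))
    deltas = [ft[q] - base[q] for q in shared]
    unchanged = sum(1 for d in deltas if abs(d) <= tol)
    return unchanged, len(deltas), sum(deltas) / len(deltas)


# Toy per-question R@5 values for a base and a fine-tuned encoder.
base_scores = {"q1": 1.0, "q2": 0.0, "q3": 1.0, "q4": 0.0}
ft_scores = {"q1": 1.0, "q2": 1.0, "q3": 1.0}  # q4 not covered in this run
unchanged, compared, mean_delta = summarize_deltas(base_scores, ft_scores)
```

Reporting the unchanged count alongside the aggregate makes it easier to tell a true no-op (identical rankings) from gains and losses that cancel out in the mean.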