Domain-adaptive fine-tune as orthogonal R@5 lift on top of MemPal raw · MemPalace/mempalace · Discussion #1249

nakata-app
Apr 28, 2026

Hi MemPal team,

We've been using LongMemEval to evaluate a small open-source library
called adaptmem, a 200-line hard-negative mining + contrastive
fine-tune wrapper around SentenceTransformers, and the numbers we got
line up cleanly with the work you've already published. Wanted to
share back and see if it's interesting.

What we measured

Same dataset (longmemeval_s_cleaned.json), same encoder family
(MiniLM-L6, ~90MB), run through your own longmemeval_bench.py
(monkey-patched to swap the encoder, zero changes to your eval logic).
Only the fine-tune step differs.

System	R@1	R@5	R@10	n
MemPal raw default (your bench script)	0.806	0.966	0.982	500
MemPal raw + adaptmem FT-300 (your bench script)	0.862	0.980	0.994	500
MemPal hybrid_v4 + adaptmem FT-300 (your bench script)	0.916	0.990	0.998	500

Three findings worth flagging:

Raw baseline R@5 = 0.966 matches your published number exactly.
Independent confirmation that your protocol is fully reproducible;
we didn't need any hints beyond the repo README.
FT-300 + raw mode: +5.6pt R@1, +1.4pt R@5. R@1 is where
contrastive fine-tuning moves the needle most: the model learns to
rank the right session first, not just in the top 5.
FT-300 + hybrid_v4: +11.0pt R@1, +2.4pt R@5 over the raw baseline.
Fine-tune and hybrid retrieval stack orthogonally; each adds lift on
top of the other.
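
For reference, a minimal sketch of how an R@k figure like the ones above can be computed, assuming the usual any-hit definition (a query counts as a hit when at least one gold session appears in the top k). The bench script's exact definition is not quoted in this thread, and the session ids below are toy data.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Return 1.0 if any gold id appears in the top-k ranked ids, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0

def mean_recall_at_k(results, k):
    """Average recall@k over (ranked_ids, gold_ids) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in results) / len(results)

# Toy data: three queries with their ranked session ids and gold session ids.
results = [
    (["s3", "s1", "s7"], ["s3"]),  # gold ranked first
    (["s2", "s9", "s4"], ["s4"]),  # gold ranked third
    (["s5", "s6", "s8"], ["s1"]),  # gold not retrieved
]
r1 = mean_recall_at_k(results, 1)
r3 = mean_recall_at_k(results, 3)
```

Under this definition R@1 can only rise when the gold session moves to the very first position, which is why it is the more sensitive metric once R@5 is near ceiling.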

Possible integration shape

If interesting, a mempal-adapt integration could look like:

mempal stays the storage / room / dialect / hybrid-retrieval layer.
adaptmem adds the encoder-side fine-tune step as an optional
"adapter": before ingestion, point adaptmem at the labelled-query
set (if available) and it produces a domain-tuned encoder that mempal
then uses for embedding.
No changes to the mempal API surface; the encoder swap happens at
config load time.
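
One way the config-load swap could look, as a sketch under stated assumptions: the `adapter`, `enabled`, and `encoder_path` keys are hypothetical and not part of any existing mempal config schema, and the path is the placeholder used elsewhere in this thread.

```python
DEFAULT_ENCODER = "all-MiniLM-L6-v2"

def resolve_encoder(config):
    """Pick the encoder to load: the tuned model if configured, else default.

    The `adapter` / `enabled` / `encoder_path` keys are hypothetical.
    """
    adapter = config.get("adapter") or {}
    model_path = adapter.get("encoder_path")
    if adapter.get("enabled") and model_path:
        return model_path
    return DEFAULT_ENCODER

plain = resolve_encoder({})
tuned = resolve_encoder(
    {"adapter": {"enabled": True, "encoder_path": "/path/to/minilm-lme-ft-300"}}
)
disabled = resolve_encoder(
    {"adapter": {"enabled": False, "encoder_path": "/path/to/minilm-lme-ft-300"}}
)
```

Falling back to the default whenever the adapter is absent or disabled is what would keep the existing API surface unchanged for users who never opt in.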

We don't have strong feelings about the shape, happy to defer to
your design preferences. The point of this thread is just to put
the numbers in front of you and see whether there's a productive
conversation here.

Reproduce

pip install adaptmem
git clone https://github.com/nakata-app/adaptmem
cd adaptmem
make bench-longmemeval # FT-100 self-contained run

Three committed result JSONs in benchmarks/:

results_minilm_baseline_400.json, raw protocol confirmation.
results_ft100_400.json, self-contained FT-100 reproduce.
results_ft300_direct.json, FT-300 reference run.

Either outcome is fine

If this isn't a fit for mempal's direction, no problem, adaptmem
will keep on as a standalone tool. Just thought it was worth showing
the numbers and the integration sketch given how cleanly the
protocol confirmation came out.

Thanks again for the open work, the project structure made
independent reproduction straightforward.

Nakata

Replies: 6 comments 1 reply

nakata-app
May 1, 2026
Author

Congrats on v3.3.4 — the DB size reduction is impressive. Quick question: did the storage optimisation affect the index structure at all, or is the longmemeval_bench.py protocol identical to v3.3.3? The numbers in the post above were run against the previous release — want to check if a rerun against v3.3.4 is needed before the comparison goes stale.

1 reply

jphein May 11, 2026
Collaborator

I'm reading your work! Excited to learn more.

nakata-app
May 13, 2026
Author

Quick follow-up on the May 1 question about v3.3.4+ protocol equivalence: I re-ran all three rows on v3.3.5 (latest release as of today) and also did a controlled v3.3.3 repro to isolate the source of any movement. Numbers below.

Three runs on v3.3.5 (full 500q, matched protocol)

Same longmemeval_bench.py, same FT-300 model file (mtime Apr 26, unchanged since the original post), encoder swap via the monkey-patch wrapper documented earlier.

System	R@1	R@5	R@10
MemPal raw default (v3.3.5)	0.806	0.966	0.982
MemPal raw + adaptmem FT-300 (v3.3.5)	0.932	0.992	0.996
MemPal hybrid_v4 + adaptmem FT-300 (v3.3.5)	0.950	0.998	1.000

Three takeaways

Raw default identical across versions. Raw mode R@1 = 0.806 / R@5 = 0.966 on v3.3.5 matches v3.3.3 bit-for-bit (controlled repro, same venv, only mempal HEAD switched). PR #1179 (BM25 hybrid rerank fix, shipped in 3.3.4) and PR #1306 (candidate_strategy="union" opt-in) don't touch the raw retrieval path, which is what we'd expect. Reproduction protocol is stable across the v3.3.3 → v3.3.5 window.
Hybrid_v4 + FT-300 went up: R@1 +0.034, R@5 +0.008, R@10 +0.002 relative to the Apr 28 run. This is consistent with the BM25 hybrid rerank fix present in v3.3.5: the rerank pass is FT-300-encoder-aware now in a way it wasn't before, and the encoder layer's lift composes with the fixed rerank rather than getting clipped by it. The encoder-as-its-own-axis framing from the #1384 thread (chunking-strategy ablation) holds up under v3.3.5.
Raw + FT-300 moved from 0.862 → 0.932 R@1. This one is not a mempal-side change: a controlled repro on v3.3.3 with today's venv reproduces 0.932 identically. The Apr 28 → today delta comes from upgraded dependency versions (chromadb 1.5.8, sentence-transformers 5.4.1, numpy 2.4.4 at present; the Apr 28 venv was older, exact versions not preserved). Flagging it explicitly so the Apr 28 numbers don't look retroactively restated without disclosure.
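
Since the Apr 28 dependency versions were not preserved, a small snapshot helper is one way to avoid the same gap on future runs. This is a sketch, not something the adaptmem repo is known to ship: `snapshot_versions` is a hypothetical helper, and the package list simply mirrors the three dependencies named above.

```python
import json
import platform
from importlib import metadata

def snapshot_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

snapshot = snapshot_versions(["chromadb", "sentence-transformers", "numpy"])
# Serialise next to the result file so every run carries its environment.
snapshot_json = json.dumps(snapshot, indent=2, sort_keys=True)
```

Writing `snapshot_json` alongside each result JSONL would let a later reader separate library-version movement from mempal-side movement without a controlled repro.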

What the deltas mean

Encoder alone (raw + FT-300 vs raw default): +0.126 R@1, +0.026 R@5.
Encoder + hybrid retrieval stacked (hybrid_v4 + FT-300 vs raw default): +0.144 R@1, +0.032 R@5.

Encoder fine-tune and hybrid retrieval are still adding lift on top of each other at v3.3.5. R@5 is ceiling-bounded (close to 1.000), so R@1 is the honest comparison and the orthogonality reads clearly there.
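
The deltas can be re-derived directly from the v3.3.5 table; a short check, with the row labels shortened for the dictionary keys:

```python
# Values copied from the v3.3.5 table above.
rows = {
    "raw": {"R@1": 0.806, "R@5": 0.966, "R@10": 0.982},
    "raw+ft300": {"R@1": 0.932, "R@5": 0.992, "R@10": 0.996},
    "hybrid_v4+ft300": {"R@1": 0.950, "R@5": 0.998, "R@10": 1.000},
}

def delta(system, baseline="raw"):
    """Per-metric difference between a system row and a baseline row."""
    return {
        metric: round(rows[system][metric] - rows[baseline][metric], 3)
        for metric in rows[baseline]
    }

encoder_only = delta("raw+ft300")
stacked = delta("hybrid_v4+ft300")
# Increment contributed by hybrid_v4 once the encoder is already fine-tuned.
hybrid_on_top = delta("hybrid_v4+ft300", baseline="raw+ft300")
```

The last line isolates what hybrid_v4 adds on top of raw + FT-300 (+0.018 R@1 on these numbers), which is the quantity the stacking claim rests on.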

Reproduce

cd ~/Projects/mempalace && git checkout v3.3.5
cd ~/Projects/adaptmem
PYTHONPATH=/path/to/mempalace python benchmarks/mempal_bench_with_ft.py \
 --bench-script /path/to/mempalace/benchmarks/longmemeval_bench.py \
 --data-file /path/to/longmemeval_s_cleaned.json \
 --ft-model /path/to/minilm-lme-ft-300 \
 --mode {raw|hybrid_v4} \
 --out results.jsonl

The three v3.3.5 result JSONLs are committed in benchmarks/v335/ in the adaptmem repo. The v3.3.3 controlled-repro JSONL (run4b_v333_raw_ft300.jsonl) is alongside them for anyone who wants to verify the version-equivalence claim independently.

If hybrid_v4 reruns on top of these numbers are useful to compare against your own internal measurements, happy to share the result JSONLs directly. Otherwise this is just to close the May 1 question with current numbers.

0 replies

nakata-app
May 16, 2026
Author

Quick update on the v3.3.5 rerun comment: on the same matched-protocol harness, the ft-v4 encoder upgrade plus a three-stage rerank stack lifts the hybrid_v4 row from R@1 0.950 to R@1 0.990 (5 fails / 500).

Stages on top of hybrid_v4 + ft-v4:

trust-gated CE rerank: chat-ce-v3 (chat domain), margin=1.0 confidence gate. A pure CE rerank had a measurable overcorrection bug (helped 7 / hurt 4 on preference); the trust gate keeps the bi-encoder top-1 unless CE's margin is high. Net: +0.010 R@1, 0 hurt.
time-aware temporal proximity: same regex + gaussian proximity boost we had at v3.3.5; reuses the Sprint 1 task3 logic on the trust-gate output. Net: +0.004 R@1, 0 hurt.
targeted LLM rerank on residual fails only: DeepSeek V4 Flash, 3-vote self-consistency, top-K=10. Only fires on the ≤10% of queries the deterministic stages leave with low CE confidence. Net: +0.004 R@1, 0 hurt.
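
A minimal sketch of the trust-gate idea from the first stage. The margin definition here is an assumption (cross-encoder score of its own favourite minus its score for the bi-encoder's top-1); the actual scripts in the repo may define it differently, and `trust_gated_rerank` is a hypothetical name.

```python
def trust_gated_rerank(bi_ranked, ce_scores, margin=1.0):
    """Keep the bi-encoder top-1 unless the cross-encoder clearly prefers another.

    bi_ranked: candidate ids in bi-encoder order (best first).
    ce_scores: cross-encoder score per candidate id.
    """
    bi_top = bi_ranked[0]
    ce_top = max(bi_ranked, key=lambda c: ce_scores[c])
    # Override only when the CE's favourite beats the bi-encoder's favourite
    # by at least `margin`; otherwise trust the bi-encoder ordering.
    if ce_top != bi_top and ce_scores[ce_top] - ce_scores[bi_top] >= margin:
        return [ce_top] + [c for c in bi_ranked if c != ce_top]
    return list(bi_ranked)

confident = trust_gated_rerank(["a", "b", "c"], {"a": 0.2, "b": 1.5, "c": 0.1})
hesitant = trust_gated_rerank(["a", "b", "c"], {"a": 0.2, "b": 0.9, "c": 0.1})
```

In the first call the cross-encoder's lead is 1.3, so "b" is promoted; in the second it is 0.7, so the bi-encoder order is kept. That asymmetry is what prevents the overcorrection a pure CE rerank showed.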

Remaining 5 fails decompose as 1 abstain (_abs ground-truth, structural eval noise, unrecoverable) + 4 hard cases (cousin-wedding, chocolate-cake, milestone-4-weeks-ago, book-discount-trunc). Noise-adjusted ceiling looks like ~0.998.

Repo: nakata-app/adaptmem, results/sprint_0p99/SPRINT_4_FINAL.md has the per-stage numbers, fail diagnoses, and the three rerank scripts.

Two possible integration shapes if interesting: an opt-in mempal --rerank adaptmem plugin keeping mempal's API surface unchanged, or upstream PR of just the deterministic layers (trust gate + time-aware) without the paid-LLM dependency. The LLM stage is intentionally optional; V4 Flash costs ~$0.05 per 500-query benchmark, but plugin users get 0.987 from the free Llama-70B NIM fallback alone.

Happy to share JSONL artefacts and pipeline scripts under whichever direction fits.

0 replies

nakata-app
May 17, 2026
Author

jphein,

Thanks for the earlier reply. I ran a paired bootstrap (10K resamples, 95% CI) on the 20-probe ablation and am posting the numbers below so both sides can see them.

B vs A (heading-aware vs paragraph), on our corpus and probe set:

encoder	cs	95% CI
default (MiniLM)	400	[0, 0]
default	800	[0, 0]
FT-300 (code-FT)	400	[0, 0]
FT-300	800	[0, 0]

For every single probe the rank comes out identical. Paragraph and heading-aware produce essentially the same drawer segmentation (3759 vs 3747 chunks @ cs=400). So on our probe set the markdown heading split does not "fire". I'm not saying the conceptual argument is wrong, only that I can't measure it.

C vs A (AST vs paragraph), which runs against your "complexity without lift" advice:

encoder	cs	ΔMRR	95% CI	p_rev
default	400	0.0000	[0, 0]	n/a
default	800	+0.0750	[+0.008, +0.167]	0.013
FT-300	400	-0.0292	[-0.100, +0.013]	0.36
FT-300	800	+0.0400	[+0.004, +0.096]	0.011

At cs=800, AST gives a lift whose 95% CI sits above zero with both encoders. At cs=400 it disappears.
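
For anyone who wants to rerun the analysis, a minimal sketch of a paired percentile bootstrap on per-probe reciprocal ranks. It assumes one relevant-item rank per probe per strategy; the function names are illustrative, and it covers only the ΔMRR interval, not the p_rev column.

```python
import random

def reciprocal_rank(rank):
    """rank is 1-based; None means the relevant item was not retrieved."""
    return 0.0 if rank is None else 1.0 / rank

def paired_bootstrap_ci(ranks_a, ranks_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(RR_b - RR_a), resampling probes with replacement."""
    diffs = [
        reciprocal_rank(b) - reciprocal_rank(a) for a, b in zip(ranks_a, ranks_b)
    ]
    rng = random.Random(seed)
    n = len(diffs)
    # Each resample draws n per-probe differences, keeping the pairing intact.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

same = paired_bootstrap_ci([1, 2, 3, None], [1, 2, 3, None], n_resamples=1000)
better = paired_bootstrap_ci([2, 2, 2, 2], [1, 1, 1, 1], n_resamples=1000)
```

When both strategies return identical ranks for every probe, every per-probe difference is zero, so the interval collapses to [0, 0] regardless of the resample count. That is the pattern in the B vs A table.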

Request: our probe set is 20 hard-coded entries (chunk_strategy_ablation.py:PROBES). If you have run with a larger probe set on your side (50+, or an auto-generated set under evals/), I'd like to run the same bootstrap analysis on it. Two factors that could explain the divergence:

Probe mix. 15/20 of our probes target .py files and only 5/20 target .md. That may be blinding B.
Corpus difference. We mine the mempal package. If you run against the full repo (docs, RFCs, scratch), B's heading signal may be coming from there.

If you share the probe YAML (the script's --probes flag is in the docstring but not in the parser; I can open a small PR to add it) or a raw question list, I'll run it through the same harness via monkey-patch and send the numbers back.

Is there a published write-up of the "structured extraction + graph traversal" approach for code? I'd like to see the pipeline in writing; a parallel track on our retrieval surface could be useful.

Thanks,
Atakan

0 replies

jphein
May 17, 2026
Collaborator

@nakata-app — thanks for running the paired bootstrap with the CIs; the B-vs-A flat reading and the C-vs-A cs=800 lift on your 20-probe set both look defensible at the n you ran. Quick reply to your three asks, plus a cross-reference that may compose with the additive-axes story.

The n=200 probe set

Lives on the fork at techempower-org/multipass-structural-memory-eval, sme/corpora/mempalace_git_probes_v2/questions.yaml. The construction is deterministic — scripts/derive_probes_from_git.py walks the techempower-org/mempalace commit log (14-month window) and produces (commit subject, primary changed file) pairs. Each probe carries the source commit hash in why: so anything you find can be traced back to a single commit.

Shape: 200 questions, file-shaped expected_sources (136 × .py, 48 × .md, 16 misc). The mix is still Python-heavy (136/200 is roughly 13.6/20, against the 15/20 in your probe-mix concern), but the markdown slice is the same shape your B-vs-A claim hinges on, so the bootstrap on the 48-.md subset should give B a fair test at higher n.

The YAML is self-contained — no --probes flag wiring needed. If you'd like a thin loader to plug it into your chunk_strategy_ablation.py harness as-is, happy to PR one against nakata-app/adaptmem; or just yaml.safe_load + map expected_sources to your relevant-doc structure.
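
A rough sketch of the "map expected_sources to your relevant-doc structure" step. It operates on plain dicts in the shape yaml.safe_load would return for a list of mappings. Only `expected_sources` and `why` are field names mentioned in this thread; the `question` key and the file paths are assumptions made up for the example.

```python
from pathlib import PurePosixPath

def to_relevance_map(probes):
    """Map each probe question to the set of file paths counted as relevant."""
    return {p["question"]: set(p["expected_sources"]) for p in probes}

def filter_by_suffix(probes, suffix):
    """Keep probes where every expected source has the given file suffix."""
    return [
        p for p in probes
        if p["expected_sources"]
        and all(PurePosixPath(s).suffix == suffix for s in p["expected_sources"])
    ]

# Toy records; the `question` key and these paths are illustrative only.
probes = [
    {"question": "q1", "expected_sources": ["pkg/search.py"], "why": "abc123"},
    {"question": "q2", "expected_sources": ["docs/guide.md"], "why": "def456"},
    {"question": "q3", "expected_sources": ["Makefile"], "why": "0a1b2c"},
]
relevance = to_relevance_map(probes)
md_only = filter_by_suffix(probes, ".md")
```

Running the bootstrap on `md_only` alone would correspond to the 48-.md subset test described above.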

On the "structured extraction + graph traversal" question

The "skip chunking for code, do AST-extraction-into-graph" framing in this thread came from @xg-gh-25 on #1384, not from us — worth attributing there. That said, the parallel-track angle is reasonable because our fork is doing graph traversal at the substrate layer, just from a different starting point:

Apache AGE Cypher queries in-database on the postgres backend (mempalace.backends.postgres + AGE extension). The KG triples produced by mempalace.entity_detector land in AGE alongside drawer rows; a Cypher MATCH can pull related-entity neighborhoods as a retrieval candidate set before any vector search runs. Not AST-derived, but the shape of "retrieve via graph, not similarity" is the same.
The mempalace_traverse MCP tool in palace-daemon exposes this as a retrieval mode: take a seed entity, follow tunnel edges k hops, return all reachable drawers. Live but unmeasured against the LongMemEval shape — it's set up for "what's connected to X" rather than the matched-protocol retrieval the bench measures.
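
The "seed entity, k hops, return everything reachable" shape can be illustrated with a plain breadth-first search. This is only a stand-in: the in-memory adjacency dict replaces the AGE graph, the entity names are invented, and the real tool's edge semantics (direction, edge types) are not described here.

```python
from collections import deque

def reachable_within(edges, seed, k):
    """Return nodes reachable from `seed` in at most k hops (seed excluded)."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # hop budget spent on this branch
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(seed)
    return seen

# Toy directed graph; entity names are illustrative only.
edges = {
    "auth": ["tokens", "sessions"],
    "tokens": ["rotation"],
    "rotation": ["cron"],
}
one_hop = reachable_within(edges, "auth", 1)
two_hop = reachable_within(edges, "auth", 2)
```

The result is an unranked candidate set, which is why this mode answers "what's connected to X" rather than producing the ranked list that R@k scoring needs.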

So we have the graph traversal substrate but not the AST-to-graph extraction step. xg-gh-25's pipeline note suggests the missing piece is upstream of the graph, not in it. Worth their own writeup; I'll let them speak to that.

FT-300 independent reproduction (just landed)

Cross-reference your additive-axes story directly: reproduced FT-300 end-to-end on katana this morning from nakata-app/adaptmem upstream. Same longmemeval_eval.py --mode train recipe, fresh seed=42 300/200 split, --device cuda for the fine-tune.

Metric	Your published FT-300 result	Our katana repro (200q test)
R@1	0.915	0.925
R@5	0.995	1.000
R@10	0.995	1.000

Same on 500q full (training questions included): R@5 = 0.9980 (5/6 categories saturate at 1.000; small dip on single-session-assistant at 0.9821). Wall clock 56s train + 18s test on the GPU. Reproduces inside published noise — your FT-300 protocol is portable.

Full writeup + reproducible split JSON: docs/benchmarks/2026-05-17-adaptmem-ft300-reproduction.md.

For methodological completeness — three code-tuned variants from your codesearchnet_train_colab.py line (one ft300 set we had locally cached at ~/Projects/adaptmem-cache/, plus ft300 and ft1000 from a separate download) gave us 0.9280 / 0.9660 / 0.9560 R@5 respectively on the same 500q full set. Same algorithm, different training corpus → swing in test recall ranging from -3.8pp (the cached ft300 variant) up to +3.4pp once retrained on LongMemEval-domain data (the FT-300 result above). Two code-ft300 weight sets produced different test recall despite identical training data — small-N MultipleNegativesRankingLoss is noticeably stochastic. Companion writeup at docs/benchmarks/2026-05-17-adaptmem-encoder-swap.md.

Composition direction worth checking next

Your matched-protocol numbers had hybrid_v4 + FT-300 + 3-stage rerank at R@1 = 0.99 in SPRINT_4_FINAL.md. Substrate-floor parity in our SME #9 thread confirms postgres+pgvector + MiniLM = chromadb + MiniLM byte-identically (R@5 = 0.9660, per-category exact match across all 6 qtypes). So in principle the FT-300 + hybrid_v4 + rerank stack should compose into our postgres substrate the same way it composes into upstream chromadb. We haven't measured that yet — the hybrid retrieval layer on the postgres backend is the next item that needs an SME-side reading. Will post when that lands.

Question back: your sprint_0p99 stack uses chat-ce-v3 as the trust-gated CE reranker. The cross-encoder's training distribution presumably matters the same way the bi-encoder's does — is chat-ce-v3 conversational-domain, and if so does the same domain-mismatch curve we just measured for the bi-encoder apply to the CE? If yes, the rerank-axis additivity story has a parallel "domain-match required" qualifier.

🫏

0 replies

jphein
May 29, 2026
Collaborator

@nakata-app — wanted to follow your additive-axes story with a cross-domain data point that I think extends it rather than challenges it. Short version: your in-domain lift reproduces and holds up for us; when we carried the same encoder to a different corpus it flattened; and a finding from our side suggests there may be a fourth orthogonal layer worth stacking on top.

First — the in-domain lift is real, and it reproduces

Your orthogonal-layers framing is compelling, and the numbers back it. Your published table has MemPal raw R@5 0.966 → +FT-300 0.980 → +hybrid_v4+FT-300 0.990, with R@1 climbing 0.806 → 0.862 → 0.916 — encoder fine-tune and hybrid retrieval each adding lift on top of the other. We reproduced the FT-300 leg end-to-end on our own hardware (katana, fresh seed=42 300/200 split) and the held-out 200q test hit R@5 = 1.000 (R@1 0.925) — inside your published noise. So the in-domain encoder lift isn't a one-machine artifact; the protocol is portable and the R@5 lift toward ceiling is genuine. No argument from us there.

Where it gets interesting — a cross-domain transfer test

We then did something your thread hadn't covered: carried the same FT-300 encoder to a deliberately different corpus — jp-realm-v0.1, a 30-question probe set over a personal technical knowledge base (135k drawers of code, infra notes, RFCs), scored by substring expected_sources recall against a frozen ChromaDB backup, fully offline. Cross-domain, not cross-machine.

Leg	Encoder	R@1	R@5	R@10
A	`all-MiniLM-L6-v2` (base)	0.3448	0.5172	0.6207
B	FT-300 (your published model)	0.3621	0.5172	0.6034

Here the lift didn't transfer: R@5 0.5172 → 0.5172, flat. 24 of 29 covered questions move exactly 0.0 — the FT encoder ranks the same drawers as base. A from-recipe re-train of the fine-tune (third leg) landed within ±2pp of base too, so two independently trained FT encoders both no-op'd on this corpus. (One honest detail: the published FT-300 we have carries code/scientific-computing training content, so against a personal technical KB it's genuinely out-of-domain — the cleanest version of the test.)
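
For clarity on the scoring, a sketch of substring expected_sources recall as described above. It assumes a question counts as a hit when any expected substring occurs in the source path of any top-k drawer; the exact rule used for jp-realm-v0.1 may differ, and the paths below are invented.

```python
def substring_hit(retrieved_sources, expected_sources, k):
    """True if any expected substring occurs in any of the top-k source paths."""
    top = retrieved_sources[:k]
    return any(exp in src for exp in expected_sources for src in top)

def substring_recall_at_k(runs, k):
    """Fraction of questions with at least one substring hit in the top k."""
    return sum(substring_hit(r, e, k) for r, e in runs) / len(runs)

# Toy data: (ranked drawer source paths, expected source substrings).
runs = [
    (["notes/infra/dns.md", "code/deploy.py"], ["deploy.py"]),
    (["rfc/0007-cache.md", "notes/cache.md"], ["0007-cache"]),
    (["code/util.py", "notes/misc.md"], ["backup.sh"]),
]
r1 = substring_recall_at_k(runs, 1)
r2 = substring_recall_at_k(runs, 2)
```

Because matching is by substring on source paths, two encoders that surface different chunks from the same file score identically, which is worth keeping in mind when reading a flat R@5.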

Read together with your numbers, this is completely consistent if the lift is domain-specific: strong when the fine-tune corpus and eval corpus are the same family, flat across a corpus shift. That's not a knock on the method — it's a boundary on it. So the real question back, collaborator-to-collaborator: have you seen the orthogonal lift hold across a corpus shift, or does it want hard-negative re-mining on the target corpus to travel? Your chat-ce-v3 trust-gate work already gestures at a "domain-match required" qualifier on the cross-encoder axis, so I suspect this rhymes with something you've already noticed.

A possible fourth orthogonal layer

One more finding that I think composes with your encoder+hybrid stack rather than competing with it. On oracle LongMemEval — gold session pinned in context, retrieval held at its 0.974 R@5 ceiling — we measured reader QA at only ~50%: a ~45pp R@5→QA gap (our #116). The right evidence is in front of the reader and it still misses. So on that corpus an encoder lift driving R@5 from 0.966 toward 1.000 is real but doesn't, on its own, move end-to-end QA — the bottleneck has shifted downstream to the reader/consumption layer. On top of your stack — encoder-FT, hybrid retrieval, the rerank cascade — this reads like one more orthogonal layer: reader/prompt design. (Our stratified n=150 retrieval A/B also had graph/age fusion ~neutral — R@5 92.67% vs 92.00% — which is why we're now spending our attention on the reader rather than retrieval.)

Full writeups and the convergent findings are on our results page: https://techempower-org.github.io/multipass-structural-memory-eval/site/#benchmarks

Genuinely — the reproducibility of your protocol is what let us run the cross-domain test at all. Curious to hear if corpus-shift transfer is something you've poked at.

🫏

0 replies
