-
Notifications
You must be signed in to change notification settings - Fork 7.2k
-
SummaryI ran an A/B/C ablation comparing three chunking strategies through mempalace's full retrieval pipeline (chromadb + ONNX MiniLM + hybrid BM25 rerank) and got a partial reproduction of the 2026-published claims that strategy-aware chunking beats fixed-size paragraph splitting.
R@5 and R@10 tied at 60% / 70% across all three. The headline: B (heading-aware on Filing here as a Discussion rather than a PR because the win is small enough that I'd like maintainer input on whether it warrants action, and on whether my probe set / corpus selection is fair. What I testedThree chunking strategies, all running through
Two corpora:
Probe set: 20 hand-curated queries with known-good The mechanism behind B's winThe structural example is the Under A, the silent-vs-block discussion gets merged with adjacent paragraphs about other topics. Under B, that subsection becomes its own chunk → tighter embedding → better rank. Per-probe ranks: A=5, B=3, C=4. Of the 5 markdown probes, 1 strategy-disagrees in B's favor, 1 ties at rank 1 across all strategies, and 3 miss top-10 entirely on all three strategies. What didn't reproduceAST-aware Python (C) loses on code probes by -6.4% relative MRR. mempalace's package has many short, well-named functions; AST chunking strips the surrounding module-level context (constants, imports, neighboring helpers) that paragraph-aware preserves. Article 2's win was on bigger codebases where function bodies have substantially more text. Worth flagging because the "AST chunking is strictly better for code" framing those articles lean on doesn't hold across all code shapes. Three markdown probes that all 3 strategies miss are query-side, not chunking-side, failures:
These are abstract "architecture of X" questions where no single chunk in the target doc lexically anchors the query distinctively. Chunking strategy can't fix that — what would help is document-level summary embeddings (titles + descriptions embedded separately from body chunks, like several commercial RAG systems do). That's a much bigger lever than chunking. Proposed minimal PR shapedef chunk_text(content, source_file, chunk_size, chunk_overlap, min_chunk_size): if source_file.lower().endswith((".md", ".markdown", ".rst")): return _chunk_markdown_heading_aware(content, source_file, chunk_size, chunk_overlap, min_chunk_size) return _chunk_paragraph_aware(content, source_file, chunk_size, chunk_overlap, min_chunk_size) One new helper (~50 LOC), file-extension dispatch, paragraph-aware fallback for oversized sections. No AST, no semantic embedding-driven chunking, no new dependencies. Tests + ablation reproducer ( Open questions
Repro: |
Beta Was this translation helpful? Give feedback.
All reactions
Replies: 20 comments 3 replies
-
I would keep the heading-aware markdown split, but avoid shipping the AST-Python change until the failure mode is clearer. The table suggests the AST strategy is not just neutral; it is actively hurting code retrieval while recall is unchanged. That usually means the retrieved set is similar, but ranking or chunk usefulness degraded.
A good next ablation would separate three things:
- chunk boundary quality: does the chunk contain a complete function/class or only a syntactic fragment?
- retrieval representation: are names, docstrings, imports, and call sites preserved in the text embedded/indexed?
- answer usefulness: does the chunk give enough surrounding context for the downstream task?
For code, AST chunks often need enriched text, not just pure syntax boundaries. Function name, class path, imports, decorators, docstring, nearby comments, and parent module path can matter as much as the body. I would test an "AST-lite" variant: keep paragraph/semantic splitting, but attach code metadata and symbol headers to each chunk.
Given the current result, the pragmatic decision is: ship B for markdown, keep C behind an experimental flag, and expand the code-query test set before changing the default.
Beta Was this translation helpful? Give feedback.
All reactions
-
Thanks @musaabhasan — the 3-axis decomposition is the right framing, and "AST-lite" is the suggestion I keep coming back to. The retrieval-representation axis is genuinely under-explored; most chunking research focuses on boundary quality (axis 1) and misses that "this chunk has the function body but lost the imports + docstring + class path" is the more important failure mode for code retrieval. The chunk you embed isn't the chunk the model needs to see.
Pragmatic path I'm taking on the jphein fork:
- Ship B (heading-aware markdown) as default — clean Pareto win, no gate needed.
- Keep C (AST-Python) opt-in only — current data shows it's net-negative.
- Treat AST-lite (paragraph boundaries + symbol-header enrichment) as a separate experiment worth running.
The wider read: this maps onto a 4-axis model — storage / encoder / retrieval / consumption — where each layer is independently improvable. @nakata-app's adaptmem (discussion #1249, encoder), @zhapostolski's #1425 (retrieval decay), and your AST-lite proposal (storage-to-encoder seam) each address one axis without colliding. The probe set + harness live at scripts/chunk_strategy_ablation.py on jphein/mempalace main if anyone wants to take the AST-lite ablation.
Beta Was this translation helpful? Give feedback.
All reactions
-
Thanks @jphein, the four-axis framing lands well, and the encoder-as-its-own-layer cut is the part adaptmem has been quietly betting on.
Quick sketch of where adaptmem actually sits on that axis, since "encoder fine-tuning" undersells what's interesting:
- The base encoder (MiniLM, BGE, whatever ships) is trained on web prose. It doesn't know your corpus's vocabulary, your symbol names, or which surface forms
co-refer. adaptmem's lever is a thin online adaptation layer on top of the frozen base: same embedding dim, same chunk_text contract, no changes to the storage or
retrieval layers. Drop-in. - It's tuned by retrieval feedback, not by labeled pairs. So the signal it consumes is exactly the kind of probe set you've already built for the chunking
ablation. The two harnesses are compatible.
Where this connects to AST-lite cleanly: @musaabhasan's symbol-header enrichment changes the text the encoder sees. adaptmem changes what the encoder does with
that text. Stacking them is the obvious experiment: does symbol-enriched chunk text + adapted encoder compound, or does the enrichment make the adaptation
redundant? My prior is they compound on code probes (different failure modes) and wash on markdown (B already wins there).
On the doc-level summary embedding gap you flagged in passing ("no single chunk lexically anchors the query distinctively, chunking can't fix that"): agreed, and
this is squarely encoder/representation territory rather than storage. The three abstract probes that all three strategies missed are the canonical case for a
separate title/summary embedding indexed alongside body chunks, with adaptation tuned for the summary-vs-body asymmetry. Happy to spin that as a small ablation
against your existing probe set if there's interest, would slot into scripts/chunk_strategy_ablation.py as a fourth strategy column rather than a new harness.
adaptmem repo is at github.com/nakata-app/adaptmem, encoder discussion at #1249. Will take a pass at the AST-lite ablation too once the symbol-header spec settles.
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Quick follow-up on the compound-vs-wash prior I floated above. Ran the chunking ×ばつ encoder cross on mempal's own package corpus (the same 15-probe harness from
Recall@5 and Recall@10 are pinned at 60 % (cs400) / 65 % (cs800) across every cell — the 15-probe set saturates on inclusion and only MRR has resolution. With that caveat: The compound-on-code prior didn't survive contact. C-AST + FT-300 at cs800 washes against C-AST + default (−0.006 MRR), and the FT-300 lift visible on A/B paragraph/heading chunks (+0.029–0.054 MRR) disappears the moment AST chunking kicks in. The AST strategy's own structural lift (default A 0.485 → default C 0.560) and the encoder's domain-adaptation lift end up addressing overlapping failures rather than independent ones on this corpus. The wash-on-markdown prior is undertested. Mempal's package directory is all Most likely confound: domain mismatch. FT-300 was tuned on LongMemEval conversational QA pairs, not on a code corpus. The cs400 A/B lifts of +0.054 are already small relative to the +0.126 R@1 we see on LongMemEval itself with the same model — so the encoder is doing something on code (probably picking up generic English semantics that the ONNX baseline lacks), but the lift is small enough that the AST strategy's structural lift dominates and absorbs it. A code-domain-tuned adaptmem checkpoint would be the cleaner control here; running that against the same 15 probes is the next step worth doing if encoder ⊥ chunking on code is a question you want a real answer to. Raw outputs (per-probe ranks, drawer counts, mine seconds, all 12 cells) are committed at Not a clean orthogonality result, but it's a clean negative. |
Beta Was this translation helpful? Give feedback.
All reactions
-
❤️ 1
-
Quick note on what I'm planning to do about the domain-mismatch confound from the previous post.
The cleanest way to test the encoder axis on code is to swap the training-domain bias out. So the next step on the adaptmem side is FT-Code, an adaptmem checkpoint trained on a code-domain corpus (CodeSearchNet Python subset, ~457k query-code pairs) instead of LongMemEval conversational QA. Same architecture, same drop-in encoder contract, only the adaptation signal changes.
Two-part eval plan, both reproducible against your harness:
- Your 15-probe chunk×ばつencoder cross, same 12 cells, but
defaultvsFT-Codeinstead ofFT-300. The direct apples-to-apples successor to the post above. - CodeSearchNet's own test split (~19k Python query-code pairs) for statistical power, since the 15-probe MRR was noise-bound on the previous cross.
If FT-Code does compound on AST chunks, the encoder-as-its-own-axis claim survives on code. If it washes, that's a cleaner negative than what we have and the AST-lite + symbol-header path becomes the more interesting lever.
Realistic timeline: 3-5 days (Colab training + dual-eval + write-up). Would post the result here as a follow-up unless you'd rather see it as a separate discussion.
Two questions if either matters to you:
- The probe set in
scripts/chunk_strategy_ablation.py, would expanding it to a code-focused 100-probe set be useful for you independently of the encoder question, or is the 15-probe size deliberate? - On the
chunk_textcontract: any reason not to exposesymbol_header_prefixas an optional kwarg in 0.5, so AST-lite and FT-Code can stack on the same code path?
Beta Was this translation helpful? Give feedback.
All reactions
-
❤️ 1
-
|
@nakata-app — thank you for this. The honesty about the original "clean negative" not surviving the new data, the bootstrap CIs, the non-monotonic checkpoint scaling, and especially the RRF inversion are all genuinely useful — and the second one is the kind of self-correction that strengthens a thread, not weakens it. The FT-Code training write-up at the top (R@1 0.648 → 0.926 on CodeSearchNet's own test) is the in-domain ceiling I was looking for; useful to have it pinned now. A few things I did on this side in response, all reproducible: 1. 2. n=200 probe set, deterministically derived from git log. Your bootstrap-CI section was exactly right — at n=20 the 95% CIs all overlap zero. The generator is at 3. Local 3-way RRF reproduction on C-AST cs800, 200-probe set. I pulled all three FT-Code checkpoints from the Drive links and wrote
3-way lift: +0.0841 MRR vs best solo. That's larger than the +0.076 you reported on n=20 — and lands clean on the n=200 set where the bootstrap CIs from your §3 will actually have power. Three structural findings that reproduce yours directly:
Per-probe set diff (probes only one encoder surfaced at top-10):
The lift is a lower bound (rank-of-expected only, no full ranked lists). True RRF on full ranked lists would give more. Per-probe ranks across all three encoders + the fused result are in our scratch — happy to attach as a comment artifact here if useful for cross-checking. One protocol gotcha worth flagging. While debugging this, I found that subclassing On the two open questions from your May 13 comment:
On CodeCrossEnc-v1: the negative-transfer footnote from ms-marco-MiniLM + BGE-reranker-base is worth promoting from a footnote — "two rerankers, same direction" is enough signal for a "don't reach for off-the-shelf rerankers on code-domain bi-encoders" warning. When CodeCrossEnc-v1 lands I'd be happy to run it through On the 3,700-chunk in-domain FT followup you flagged: the synthesis cost can drop close to zero by pulling pairs straight out of git history without any human labeling. Three cheap signal sources:
Stacked across 3-5 of our repos that pattern would give ~10–20K pairs without manual labeling. The probe-derivation script I built ( Tangential but possibly interesting — we just landed a substrate cutover on our fork (chromadb → postgres + pgvector for storage; postgres tsvector + pg_trgm for BM25; Apache AGE for the graph axis). Our chunking-ablation work would run against the postgres backend with effectively identical embedding semantics (chromadb's default ONNX MiniLM is wrapped behind a Also: if you'd like an independent third-party rerun of your CodeSearchNet 22k eval headline (R@1 0.926 / MRR 0.952 for FT-Code-5000), we've got the model + a CUDA box; happy to run Probe-level JSON for the local reproduction is committed at Really glad this thread exists. |
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Following up on the domain-mismatch thread from your May 11 post. Two questions I wanted to answer with actual numbers: (1) does the encoder-axis negative result hold when the encoder is trained on code data instead of LongMemEval QA pairs? (2) is the −0.006 / −0.043 finding statistically defensible at n=15, 20? Here's what I ran. 1. CodeSearchNet python full test (22k), in-domain sanity checkTraining:
Δ baseline → FT-Code-5000: +0.278 R@1, +0.211 MRR. Encoder fine-tune 2. Cross-harness rerun on jphein's
|
| Strategy | default | FT-300 | FT-Code-300 | FT-Code-1000 | FT-Code-5000 |
|---|---|---|---|---|---|
| A_paragraph_aware cs400 | 0.4583 | 0.5125 | 0.4917 | 0.5500 | 0.5433 |
| A_paragraph_aware cs800 | 0.4850 | 0.5142 | 0.5058 | 0.4780 | 0.5333 |
| B_heading_aware_md cs400 | 0.4583 | 0.5125 | 0.4917 | 0.5500 | 0.5433 |
| B_heading_aware_md cs800 | 0.4850 | 0.5142 | 0.5058 | 0.4875 | 0.5292 |
| C_plus_ast_python cs400 | 0.4583 | 0.4833 | 0.5417 | 0.5750 | 0.5600 |
| C_plus_ast_python cs800 | 0.5600 | 0.5542 | 0.5333 | 0.5588 | 0.5167 |
R@10 per strategy:
| Strategy | default | FT-Code-5000 |
|---|---|---|
| All 6 strategies | 60-65% | 70% (uniform) |
Two raw observations:
- 5/6 strategies, FT-Code-5000 beats default on MRR (+0.04 to +0.10) and on
R@10 (+5 to +10 points uniformly). - C-AST cs800 is the one strategy where FT-Code-5000 underperforms default
(−0.043 MRR). FT-300 also slightly underperformed there (−0.006), so the
original negative-result direction reproduces qualitatively with a code-domain
encoder too.
That's the raw fact pattern. Now the statistics.
3. Paired bootstrap %95 CI on the C-AST cs800 finding
n=20 probes is small. To check whether the −0.043 (and the original −0.006) is
inside noise, paired bootstrap with 10,000 resamples on per-probe RR:
| Strategy | mrr_default | mrr_FTcode5000 | Δ | 95% CI | P(Δ>0) |
|---|---|---|---|---|---|
| A_paragraph_aware cs400 | 0.4583 | 0.5433 | +0.085 | [−0.015, +0.217] | 0.937 |
| A_paragraph_aware cs800 | 0.4850 | 0.5333 | +0.048 | [−0.060, +0.158] | 0.801 |
| B_heading_aware_md cs400 | 0.4583 | 0.5433 | +0.085 | [−0.015, +0.217] | 0.937 |
| B_heading_aware_md cs800 | 0.4850 | 0.5292 | +0.044 | [−0.063, +0.153] | 0.792 |
| C_plus_ast_python cs400 | 0.4583 | 0.5600 | +0.102 | [−0.007, +0.238] | 0.963 |
| C_plus_ast_python cs800 | 0.5600 | 0.5167 | −0.043 | [−0.125, +0.030] | 0.126 |
Read carefully: no strategy is significant at n=20 (all CIs include zero).
But the directional signals are not symmetric, five strategies sit at
P(Δ>0) ∈ [0.79, 0.96]; C-AST cs800 sits at P(Δ>0) = 0.126.
The honest takeaway: the original −0.006 finding lived squarely in the noise
floor at n=15, and the new −0.043 finding lives in the noise floor at n=20.
"Significant regression on C-AST cs800" is not a claim we can defend on this
sample size. A larger probe set is the right next step before treating that
strategy as a structural negative.
4. RRF ensemble, encoder axis composes with default encoder
What if we don't swap encoders but fuse them? RRF surrogate
(rank_fused = min(rank across runs) on the expected doc; this is a lower
bound on true RRF since the JSONs only stored rank-of-expected, not full
ranked lists). Two ensemble configurations:
2-way (default + FT-Code-5000):
| Strategy | default | FT-Code-5000 | RRF 2-way | Δ vs best solo |
|---|---|---|---|---|
| A_paragraph cs400 | 0.4583 | 0.5433 | 0.5873 | +0.044 |
| A_paragraph cs800 | 0.4850 | 0.5333 | 0.5939 | +0.061 |
| B_heading cs400 | 0.4583 | 0.5433 | 0.5873 | +0.044 |
| B_heading cs800 | 0.4850 | 0.5292 | 0.5898 | +0.061 |
| C-AST cs400 | 0.4583 | 0.5600 | 0.6039 | +0.044 |
| C-AST cs800 | 0.5600 | 0.5167 | 0.6106 | +0.051 |
3-way (default + FT-Code-1000 + FT-Code-5000), the strongest config tested:
| Strategy | default | FT-Code-1k | FT-Code-5k | RRF 3-way | Δ vs best solo |
|---|---|---|---|---|---|
| A_paragraph cs400 | 0.4583 | 0.5500 | 0.5433 | 0.6123 | +0.062 |
| A_paragraph cs800 | 0.4850 | 0.4780 | 0.5333 | 0.6023 | +0.069 |
| B_heading cs400 | 0.4583 | 0.5500 | 0.5433 | 0.6123 | +0.062 |
| B_heading cs800 | 0.4850 | 0.4875 | 0.5292 | 0.5981 | +0.069 |
| C-AST cs400 | 0.4583 | 0.5750 | 0.5600 | 0.6373 | +0.062 |
| C-AST cs800 | 0.5600 | 0.5588 | 0.5167 | 0.6356 | +0.076 |
R@10 uniformly 70% across all 6 strategies in both 2-way and 3-way fused
settings (vs default 60-65%).
The headline: C-AST cs800, the original "negative result" strategy, gets the
largest ensemble lift (+0.076 MRR) when fused 3-way. The strategy where
single-encoder swap underperforms default is the same strategy where
ensemble-encoder fusion outperforms default the most. That's the inverse of
what a structural encoder-axis failure would look like.
This is the actual answer to the "encoder axis vs chunking axis" question:
these axes compose additively when fused, not when one is forced to replace
the other. The original chunk_strategy_ablation harness measured single-
encoder substitution, which is the wrong primitive for this kind of axis
test, production retrieval can run two or three encoders in parallel and
RRF-merge for cheap (one extra forward pass per query, no extra storage,
deterministic).
Independent replication at ×ばつ scale (issue #82, 2026年05月15日): You ran
the same 3-way RRF configuration on the n=200 git-derived probe set and got:
| Encoder | Solo MRR | R@10 |
|---|---|---|
| default ONNX | 0.4260 | 49.5% (99/200) |
| FT-Code-1000 | 0.4229 | 53.5% (107/200) |
| FT-Code-5000 | 0.3972 | 50.0% (100/200) |
| RRF 3-way | 0.5101 | 59.5% (119/200) |
Δ MRR = +0.0841 vs best solo. Our n=20 surrogate gave +0.076 at the same
configuration. Direction and magnitude align; the effect scales cleanly with
probe set size. This closes the sample-size concern from §3.
5. Scaling signal across FT-Code checkpoints, non-monotonic
Looking at the 300 → 1000 → 5000 step progression per strategy:
| Strategy | FTcode300 | FTcode1k | FTcode5k | Shape |
|---|---|---|---|---|
| A_paragraph cs400 | 0.4917 | 0.5500 | 0.5433 | peak at 1k |
| A_paragraph cs800 | 0.5058 | 0.4780 | 0.5333 | dip at 1k |
| B_heading cs400 | 0.4917 | 0.5500 | 0.5433 | peak at 1k |
| B_heading cs800 | 0.5058 | 0.4875 | 0.5292 | dip at 1k |
| C-AST cs400 | 0.5417 | 0.5750 | 0.5600 | peak at 1k |
| C-AST cs800 | 0.5333 | 0.5588 | 0.5167 | U-shape: best at 1k, worst at 5k |
Two things to flag here:
-
Scaling is not monotonic. FT-Code-1000 is the local optimum in 4 out
of 6 strategies, with FT-Code-5000 either tied or slightly behind. This is
consistent with the model overfitting to the CodeSearchNet python
distribution as training continues, useful for in-domain code retrieval
(the 22k eval above), but progressively worse for mempalace's own .py
corpus which has more markdown / docstring mix. -
C-AST cs800 specifically: FT-Code-1000 = 0.5588, default = 0.5600.
Within noise of default at the 1k step; the −0.043 at 5k is a training-
step artifact, not a structural property of the encoder axis on this
strategy. This sharpens the bootstrap-CI takeaway from §3, the "negative
result" isn't just statistical noise, it's also moving with training
duration in a way that suggests it's fixable, not fundamental.
R@10 tells a cleaner monotonic story regardless: 60-65% default → 70%
uniform from FT-Code-1k onward across all 6 strategies. Top-1 ranking is
where the noise lives; top-10 coverage scales cleanly.
6. Direct response to the original framing
"Encoder lift (FT-300, trained on LongMemEval QA pairs) is LongMemEval-
domain-specific and won't compose with chunking-axis changes on
mempalace's own .py corpus."
Re-reading this after the new runs:
- "Domain-specific" part holds. FT-300 was a conversational-QA encoder, of course
transfer to code was weak. That was real domain mismatch, not encoder-axis
inadequacy. - "Won't compose with chunking-axis changes" part doesn't hold under RRF
fusion. When the question is "can the encoder axis add on top of the
chunking axis", the answer is yes uniformly across 6/6 strategies (default
→ fused, +0.04 to +0.06 MRR, +5 to +10 R@10 points). - "On mempalace's own .py corpus" caveat softens but doesn't dissolve.
CodeSearchNet python is closer to mempalace's .py corpus than LongMemEval QA
is, but still distribution-shifted (mempalace internal API + docstrings ≠
HuggingFace-mined open-source python). FT-Code-5000 lifts substantially in
5 strategies but not the strongest-default strategy (C-AST cs800). True
in-domain ceiling would need mempalace-corpus-derived training pairs, which
is a different epic.
Reproduce
CodeSearchNet eval:
cd ~/Projects/adaptmem python benchmarks/codesearchnet_eval.py \ --checkpoint /path/to/ft-code-5000 \ --n -1 \ --out results/ft-code-5000.jsonl
Cross-harness probe:
cd ~/Projects/adaptmem python benchmarks/jphein_chunk_x_encoder.py \ --ft-model /path/to/ft-code-5000/model \ --out-dir benchmarks/v335/chunk_x_encoder_ftcode5000
Paired bootstrap CI:
python benchmarks/bootstrap_paired_mrr.py \ benchmarks/v335/chunk_x_encoder_ftcode5000/ablation_default_encoder.json \ benchmarks/v335/chunk_x_encoder_ftcode5000/ablation_ft300_encoder.json \ --label-a default --label-b ftcode5k
RRF surrogate fusion:
python benchmarks/rrf_ensemble.py \ benchmarks/v335/chunk_x_encoder_ftcode5000/ablation_default_encoder.json \ benchmarks/v335/chunk_x_encoder_ftcode5000/ablation_ft300_encoder.json
All scripts deterministic given the model files. Model checkpoints accessible
via Drive (anyone-with-link reader):
- FT-Code-300: https://drive.google.com/drive/folders/1fe5t5LWWHFGDV5CC5GcfCk5aOVaRKKRb
- FT-Code-1000: https://drive.google.com/drive/folders/1QhjDc63M4vKdOxVMP7ZbpyhYoLcXjnlC
- FT-Code-5000: https://drive.google.com/drive/folders/1GZXGQG4LJL8jm1ajbo8JgiIPPICf_QtT
- CodeCrossEnc-v1: not sharing at this stage (see §7, v1 shows negative transfer; v2 hard-negative version pending)
7. CodeCrossEnc-v1: code-specific reranker axis
Training: cross-encoder/ms-marco-MiniLM-L-6-v2 base, CodeSearchNet
python train, 30K positive pairs + 2 random negatives each = 90K total,
1 epoch, batch=8, lr=2e-5, warmup=300, max_length=384. Local CPU (Mac mini M2,
8GB RAM, ~4h). Generic cross-encoders (ms-marco-MiniLM-L-6-v2 untuned and
BAAI/bge-reranker-base) both showed negative transfer (R@1 ~0.90 vs
FT-Code-5000 alone 0.926). CodeCrossEnc-v1 is the code-specific alternative.
Eval: FT-Code-5000 bi-encoder top-20 → CodeCrossEnc-v1 rerank,
CodeSearchNet python test split (n=5000 queries; full-set eval pending):
| Config | R@1 | R@5 | R@10 | MRR |
|---|---|---|---|---|
| FT-Code-5000 (bi-alone) | 0.9148 | 0.9804 | 0.9868 | 0.9448 |
| + CodeCrossEnc-v1 rerank (top-20) | 0.9148 | 0.9158 | 0.9194 | 0.9198 |
| Δ rerank vs bi-alone | 0.0000 | −0.0646 | −0.0674 | −0.0250 |
Honest verdict: The local cross-encoder did not improve over bi-alone and
actively hurt R@5/R@10 (−6.5 / −6.7 pp). R@1 is unchanged because queries
where bi-encoder already ranks #1 are stable; the damage is in slots 2-5 where
the cross-encoder reorders within the top-20 randomly rather than helpfully.
Root cause is almost certainly the training setup: 30K pairs with random
negatives teaches the model to distinguish "this docstring's code" from "some
other random code", but not to fine-rank 20 near-duplicate candidates which is
the actual top-20 reranking task. Hard negatives (mined from bi-encoder's own
top-K) are required for cross-encoder training to be useful. The Colab variant
(100K pairs, batch=32, T4 GPU) may partially close this gap, but training data
quality is likely the binding constraint, not scale.
Note to jphein: local CodeCrossEnc-v1 (random-negative training) shows
negative transfer on the reranking task (R@5 −6.5 pp vs bi-alone). Sharing the
checkpoint is not useful at this stage, it would hurt rather than help any
probe set eval. The right next step is hard-negative mining from the
bi-encoder's top-50 before training a cross-encoder worth evaluating on your
CUDA box. Will revisit after that training pass.
What's next on our side
- CodeCrossEnc-v2 with hard negatives. v1 used random negatives; the fix
is mining negatives from the bi-encoder's own top-50 (the candidates the
bi-encoder almost-but-not-quite ranked correctly). That's the actual
reranking distribution. Once v2 is trained and shows positive transfer, the
3-axis test (chunking ×ばつ bi-encoder ×ばつ cross-encoder) on your n=200 probe set
is the natural next step. - In-domain FT with mempalace-corpus-derived pairs remains interesting.
The git-history synthesis approach (~10-20K pairs from commit messages +
function diffs) is on our list once we have thederive_probesscript.
Thanks for PR #1508 (symbol_header_prefix) and for the n=200 replication in issue #82.
The 3-way RRF lift (+0.0841 MRR at n=200 vs our +0.076 at n=20) confirms the effect is real
and scales with probe set size.
One note on your PR #80 (chromadb EF embed_query warning): the silent BM25 fallback you
documented is a real footgun. Our jphein_chunk_x_encoder.py wrapper already inherits from
chromadb.api.types.EmbeddingFunction, so our probe runs above used the actual encoder on
the query path. The warning will still be useful for downstream adaptmem users building custom
wrappers without referencing our code.
Happy to share Drive links to the three FT-Code checkpoints, the probe-level
JSONs, or the bootstrap script if useful for the broader four-axis writeup.
Beta Was this translation helpful? Give feedback.
All reactions
-
Cross-reference for the encoder-axis framing in this thread: matched-protocol benchmark on #1249 just hit R@1 0.99 with ft-v4 + three-stage rerank stack on hybrid_v4. Per-stage numbers + diagnosis in SPRINT_4_FINAL.md. Encoder-axis still composing additively with the chunking / retrieval / rerank axes, same direction as the RRF ensemble result earlier in this thread.
Beta Was this translation helpful? Give feedback.
All reactions
-
|
@nakata-app — congrats on the R@1 0.99 stack in #1249, that's a genuinely beautiful result. Posting a small data point from a parallel axis in case it's useful for the additive-composition story. We ran the 3-way RRF reading from your May-15 §4 against an n=200 git-derived probe set (×ばつ your sample). The encoder slot uses your published FT-Code-1000 and FT-Code-5000 checkpoints directly — so this is a reproduction of your fusion math on your artifacts, not an independent training run. At the raw chromadb-vector layer:
Clean +0.0841 MRR vs. best solo — the inversion you flagged in §4 holds at larger n, and the worst-solo-encoder contributing the largest fusion lift pattern reproduces cleanly. Where the data point gets interesting — when we ran the same 200 probes end-to-end through
Per-probe: 1 rescued, 0 regressed, 13 tied hits, 32 tied misses, 3 worsened rank. Statistically flat for ×ばつ query latency. This is the second technique where we've measured raw-vector lift evaporating through This isn't in tension with your R@1 0.99 — three independent reasons it doesn't generalize:
So: fusion-on-top-of-rerank may be a different beast from swap-into-rerank, and the additive-axes picture probably still holds for the latter. Where it landed: techempower-org/mempalace#85 merged 2026年05月16日, gated behind |
Beta Was this translation helpful? Give feedback.
All reactions
-
Interesting ablation. We've been working on a related problem — not retrieval from a general corpus, but structured extraction from conversation history for session resume. Different framing, but the chunking tradeoffs are surprisingly similar.
Our finding that might be relevant here:
For markdown specifically, heading-aware splitting wins because the heading IS the retrieval signal. When a user asks "what did we decide about X?", the section heading ## Decision: X is the strongest semantic anchor. Paragraph-aware chunking splits right through it.
But for code, AST-aware loses for us too — and I think I know why:
-
AST boundaries don't match semantic boundaries. A function is one AST node, but the useful context is often "this function + the 3 lines of comments above it + the import it depends on." AST splitting severs these connections.
-
Embedding models were trained on prose, not syntax trees. MiniLM (which you're using) has no special attention for
def/classtokens. A code chunk starting withdef calculate_risk(and one starting with# Risk calculation for portfoliowill have wildly different embedding quality despite being about the same thing.
What we do instead: For code context, we skip chunking entirely and use structured extraction — pull out function signatures + docstrings + call graph edges as structured records. Then retrieval is graph traversal, not vector similarity. MRR is irrelevant because the retrieval model is deterministic (follow the edge), not probabilistic (find the nearest vector).
Recommendation on your question: Ship B (heading-aware for .md). For code, consider whether your use case is "find similar code" (embedding works) or "find relevant context for THIS code" (graph wins). The ablation suggests your code probes are already well-served by paragraph splitting — AST adds complexity without lift.
Code intelligence graph in production: SwarmAI. Discussion: DDD Cultivation
Beta Was this translation helpful? Give feedback.
All reactions
-
@xg-gh-25 — the "heading IS the retrieval signal" framing on markdown lands, and your point about embedding models being trained on prose rather than syntax is the cleanest explanation I've seen for why AST chunking loses on code despite being structurally "correct."
Cross-linking the relevant follow-on data: reproduction of nakata-app/adaptmem FT-300 on our box this morning lands at R@5 = 1.0000 on the held-out 200q LongMemEval-S test split. Same MiniLM-L6-v2 base, no chunking changes, just encoder-FT on 300 in-domain query-session pairs.
That has a methodological implication for this thread: chunking-axis sensitivity is partly downstream of encoder calibration.
Concretely — base MiniLM on longmemeval_s_cleaned lands at R@5 = 0.9660 (substrate-floor). With FT-300, MemPalace's published table shows raw R@5 at 0.992. There's almost no recall headroom left for chunking strategy to differentiate; in our run the FT-300 result was 1.000 on test with bog-standard paragraph chunking. The "chunking ablation matters less the closer your encoder is to ceiling" hypothesis would explain why @nakata-app's wider bootstrap on the 20-probe set saw B-vs-A flat (ΔMRR = 0, [0,0] CI) — the encoder was already finding the right session regardless of how it was rendered. The C-vs-A AST lift at cs=800 ([+0.008, +0.167] CI) survived precisely on the conditions where the encoder hadn't saturated yet.
If that hypothesis holds, the chunking-ablation result is encoder-conditional rather than universal: "ship heading-aware for .md, ship paragraph for code" looks robust at base-MiniLM, but may collapse to "ship paragraph for both, FT the encoder, recover the ceiling that way" once an in-domain FT is available. Worth a tiny followup ablation on the post-FT encoder if you've got it cached — the result either way is methodologically interesting.
The shape of xg-gh-25's "skip chunking, use structured extraction + graph" alternative also fits this frame: if encoder-FT recovers the ceiling on prose-shaped retrieval, the chunking-axis debate matters most for the cases where FT can't recover the ceiling — code being the obvious one, because the FT distribution (conversational text) doesn't transfer. The structured-extraction path is then less "chunking done right" and more "give up on the encoder for this domain entirely, replace it with a different retrieval model." Different axis, not a different chunking strategy.
(@nakata-app — n=200 probe YAML and the FT-300 reproduction details are in the linked #1249 comment; this is just the cross-link.)
🫏
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Followup to my earlier hypothesis post above — ran the ×ばつ2 ablation, the data is in, hypothesis confirmed and there's an interesting directional surprise. Setup48 markdown probes from our Result table
Δ (heading-aware vs paragraph), per encoder:
Finding 1: encoder-conditional hypothesis CONFIRMEDThe B-vs-A delta shrinks by ~83% when the encoder is FT'd to the domain. Encoder calibration absorbs most of the chunking-axis sensitivity. This mirrors @nakata-app's bootstrap on the 20-probe set — ΔMRR = 0 with [0, 0] CI on FT-300 (the encoder-already-saturated condition) — and tightens the hypothesis to a concrete absorption ratio at this n. Methodologically: chunking-axis ablation results should always specify the encoder calibration regime they were measured under, because the result may not survive an encoder-FT swap. A B-vs-A ablation on base-MiniLM that ships as a recommendation may have already been absorbed by a FT-300-class encoder upgrade. Finding 2: heading-aware loses in both regimes — opposite of xg-gh-25's claimThis is the directional surprise. @xg-gh-25's argument above was that heading-aware should win on markdown because "the heading IS the retrieval signal." On our probes, heading-aware loses to paragraph in both encoder regimes. Likely cause: our probes are commit-subject-shaped, e.g.:
A commit subject is semantically broader than the in-file section headings the heading-aware chunker prepends. Paragraph chunks catch the relevant body via word overlap with the broader subject; heading-prefixed chunks dilute the body signal with hierarchical headings ( xg-gh-25's "ship B for .md" recommendation probably still holds for user-style queries ("what did we decide about X?" — where the heading This suggests a probe-shape ×ばつ chunking interaction: heading-aware is best when the probe vocabulary matches heading vocabulary; paragraph is best when the probe vocabulary matches body vocabulary. The original "ship heading-aware for .md" claim should probably be qualified with "for probes that resemble user questions about the document's section structure" — not for all markdown retrieval. Implication for the chunking-strategy recommendationThe cleanest framing I can draw from the ablation:
The chunking-axis question doesn't have a universal answer — it has a calibration-regime-conditional answer with a probe-shape qualifier. Caveats
Artifacts
🫏 |
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Two replies in one, since both your messages land on this thread. On the chat-ce-v3 domain-match question (response to your earlier comment). Pulled the thread on this and the answer is actually richer than I thought, because we have both checkpoints sitting in Setup:
Plugged
Three readings:
So the additivity story carries an explicit qualifier now: the stack composes additively when (a) the bi-encoder is in-domain enough to clear the substrate floor AND (b) the CE is in-domain or trust-gated. Drop either CE condition and the rerank axis stops contributing. Drop both and it actively destroys ranking. The trust gate isn't a quality-of-life feature on top of a good CE, at margin=1.0 with a bad CE it's the only thing preventing catastrophe. Numbers + script committed: On the encoder-calibration-conditional chunking hypothesis (response to your reply to xg-gh-25). The framing fits the existing data cleanly and the experiment to falsify it is small. Two observations from our side before I run it:
The cleaner test is: re-run the A/B/C ablation on the same probes, swapping the encoder from base MiniLM to FT-300. Two predictions to check:
Will run this as a separate follow-up on the same probe set ( On xg-gh-25's "skip chunking, structured extraction + graph" framing, recontextualized through your hypothesis. Your reframe is sharper than the substrate-graph note I had in the earlier comment. "Different axis, not a different chunking strategy" is exactly right and I was hedging by calling it parallel-track. The mempalace_traverse + AGE Cypher path we have on the fork is one instance of "give up on the encoder for this domain, use a different retrieval model", it just happens to be entity-graph rather than AST-graph. xg-gh-25's pipeline is the AST-derived version of the same move. Both belong to the "encoder-replacement" axis rather than the "chunking" axis, and that's the right way to taxonomize this debate going forward. The implication is that we should stop measuring AST chunking against paragraph chunking as if it's a chunking question. The honest comparison is We have the substrate side (mempalace_traverse, AGE Cypher) built but unmeasured under matched protocol. Putting it on the same 500q LongMemEval that the FT-300 + hybrid_v4 + rerank stack is measured against is the missing data point. Will queue that behind the chunking-ablation follow-up; both go in the same "verify encoder-conditional hypothesis" block of work. n=200 loader PR: still on, will land against |
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Follow-up on the encoder-conditional chunking hypothesis: ran the A/B/C ablation with FT-300 swapped in for base MiniLM on the same probe set you reasoned over (n=20, 24-file curated corpus mixing 9 markdown + 15 Python files, cs=800). Hypothesis is partly confirmed and partly inverted, both in informative ways. Setup recap. Same Headline numbers.
Three readings, ordered by how much they move the hypothesis: 1. C-vs-A AST advantage on code probes evaporates and inverts. Baseline C beat A by +0.060 on code-only MRR (the "AST wins on Python" lift). Post-FT it's −0.035, AST chunking now hurts code retrieval, exactly the direction your hypothesis predicts. The C-vs-A code lift was encoder-conditional and the conditioning was strong. "Ship AST for .py" was a recommendation grounded in a measurement that doesn't survive encoder improvement; it should come off the table. 2. R@10 saturates across all three strategies. Baseline R@10 spans 70-75%, post-FT all three land at 80%. The encoder-recovers-ceiling prediction holds at the recall-at-K granularity, chunking choice stops differentiating once the encoder pulls the right candidate set into top-10. This is the cleanest confirmation of the hypothesis in the data. 3. B-vs-A on markdown does the opposite of what the hypothesis predicts: the lift grows with FT-300, doesn't compress. Baseline B-vs-A on md-only MRR was 0.000 (zero difference, B-A indistinguishable on markdown). Post-FT it's +0.110 (B now beats A by ~25% relative on markdown-only). The probe that drove this is the The cleanest framing for (3): your hypothesis is necessary-but-not-sufficient. Encoder calibration explains why chunking-axis sensitivity can compress (and where it does, R@10, C-vs-A on code, our original B-vs-A flat on the small n=20 set). But it doesn't predict the direction of compression in every case. When a probe sits below top-10 on a weak encoder, chunking strategy is invisible, A, B, C all return nothing useful. When the encoder gets strong enough to pull that probe into the candidate set at all, chunking becomes the new differentiator deciding which of the candidates gets rank-1. The B-vs-A lift didn't grow because chunking matters more with FT-300; it grew because chunking became visible at FT-300's recall ceiling whereas it was masked at baseline. Same mechanism that explains the floor-side compression on the other end of the encoder-quality spectrum, just running in the opposite direction. Code-only MRR drop on B and C with FT-300 is the secondary finding. B-vs-A on code is -0.070, C-vs-A is -0.035, both fall, with B the more surprising case since B's chunker only changes markdown handling. Closer look at the per-probe deltas shows B loses on three code probes where the answer file ( Implications for the recommendation framework.
Run artifacts: Still open: the structural-extraction (graph traversal) substrate comparison under matched protocol, separate workstream, will land in its own followup once the AGE-backed |
Beta Was this translation helpful? Give feedback.
All reactions
-
Small correction on the "graph traversal substrate" note in my earlier reply: I claimed the substrate side was "built but unmeasured under matched protocol." That's partly wrong on inspection. What's actually there:
mempalace.knowledge_graph.KnowledgeGraph(SQLite-backed) is fully implemented,add_triple,query_entity,query_relationship,timeline, temporal filters, ~22 methods total. Entity extraction (entity_detector.py) and registry (entity_registry.py) feed it.mempalace.knowledge_graph_age.KnowledgeGraphAGE(Apache AGE on Postgres) is a skeleton,__init__+_ensure_graph+close+ context-manager protocol, 5 methods total. Idempotent graph bootstrap, no triple add or query yet. The AGE substrate is "registered with the framework," not "queryable end-to-end."mempalace_traverseMCP tool lives in thepalace-daemonrepo, not the mempalace fork. Untested under matched protocol.
So the apples-to-apples 500q LongMemEval run against FT-300 + hybrid_v4 + rerank needs the AGE backend brought up to a working add_triple + Cypher-query surface first, or it needs to use the SQLite KG as a substitute (which would measure SQLite-KG-as-retrieval, not AGE-Cypher-as-retrieval, different substrate from what xg-gh-25's framing implies). Either way it's its own workstream, not something I can fold into a comment turn. Will land it under a new thread when the AGE side has a real query surface; flagging now so the open promise from the earlier comment doesn't sit on a misframing.
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Two replies in one, since both your posts above land on this thread (15:01 and 15:37 UTC today, both responsive to my earlier hypothesis post + cross-domain CE question). On the chat-ce-v3 cross-domain experimentThe codecrossenc-v2 numbers are the cleanest demonstration of CE-axis domain sensitivity I've seen anywhere. Three things from your table land hard:
The qualified additivity story — stack composes additively when (a) bi-encoder clears substrate floor AND (b) CE is in-domain OR trust-gated — is the right way to write this. Drop either CE condition and the rerank axis stops contributing. Drop both and it actively destroys ranking. That's the precise version of what I was waving at with "domain-match required" in the previous comment. (Side note for spec-writing: this is also a clean argument for always reporting the gate config, not just the CE model name. A CE+gate stack publishing 0.978 R@1 is meaningfully different from the same CE without a gate; readers reproducing without the gate will see catastrophe and blame the wrong component.) On the A/B/C ×ばつ FT-300 ablation — your reframe is sharper than mineReading your results with the n=20 corpus carefully, the "necessary-but-not-sufficient" framing is correct and I want to retire the original hypothesis in favor of it. Three observations on your three findings: On (1) — C-vs-A AST lift evaporates/inverts post-FT. Direct confirmation of the saturation prediction. "Ship AST for .py" comes off the recommendation table. That's the cleanest result in the table and the cheapest one to act on. On (2) — R@10 saturates across A/B/C with FT-300. Recall-at-K is the natural place for the hypothesis to land most cleanly, because saturation at K means the right candidate is in the pool, and the only question chunking can still answer is which order within the pool. R@10 = 80% on all three strategies in your data is the prediction at its strongest. On (3) — B-vs-A on markdown GROWS post-FT. This is the finding that changes the hypothesis. Your mechanism explanation is exactly right and I missed it:
The Phase-2-architecture probe is the smoking-gun example — it was invisible at baseline so chunking couldn't move it, and FT-300's recall improvement unmasked the chunking axis on that probe specifically. The compression direction depends on whether the encoder lift moves probes out of recall (saturation absorbs sensitivity) vs into recall (saturation reveals sensitivity that was previously masked). The clean three-regime taxonomy this implies, for any (probe ×ばつ encoder ×ばつ chunker) cell:
The B-vs-A inversion in your data sits exactly in regime 3 for that specific probe. Most of the corpus moved from regime 1 to regime 2 (where the hypothesis predicted) but one probe moved from regime 1 to regime 3 (where the hypothesis was silent). The aggregate effect is the average of these regime transitions weighted by how many probes land where. Two follow-on questions this opens up:
On the code-only −7pp B/C regression under FT-300: the diagnosis you gave — "FT-300 has enough conversational sensitivity to upweight the markdown chunk over the code answer" — is the same domain-mismatch shape as the cross-domain CE finding above, just at much lower magnitude (-7pp vs -67pp). Both are the encoder picking the wrong side of the (conversational | code) calibration boundary when forced to choose. Mechanically the same axis, two different magnitudes depending on whether the encoder sees the full interaction (CE) or just each side independently (bi-encoder). Implications for the recommendation frameworkUpdating my version of the recommendation table to match the three-regime taxonomy:
Combined, this is closer to your original "ship heading-aware for .md" recommendation than my version made room for. Apologies for the over-rotation in the previous comment — the data supports your recommendation with the explicit qualifier "and FT the encoder so heading-aware can compose with stronger recall." Forward workTwo things that come out of this exchange:
🫏 |
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Three things in your reply land hard and one of them happens to be testable with data I already have on disk, so this reply pulls in a third substrate measurement alongside the chunking + CE results. On the three-regime taxonomy, adopting it with one threshold-K nuance. The Below-K-masked / Saturated-K-flat / Inside-K-differentiating taxonomy is exactly the formal version of what I was sketching. Two small mechanical observations to fold in: The regime a probe lands in is threshold-K-conditional, not encoder-conditional. The same (probe ×ばつ encoder ×ばつ chunker) cell can be in regime 2 at R@10 and regime 3 at R@1 simultaneously, saturated for "is the right candidate in the pool" but differentiating for "which candidate is rank-1." Our FT-300 ablation shows exactly this: R@10 saturates across A/B/C (regime 2 at K=10) but R@1 still moves on the Phase-2-architecture probe (regime 3 at K=1). When writing this into the SME spec, I'd phrase the regime as a function of K rather than a property of the cell. This also predicts a measurement: aggregate chunking sensitivity should monotonically compress as K grows (regime 3 probes age into regime 2 as K crosses their rank-K threshold). On our 20-probe FT-300 data the ΔMRR(B−A) is +0.110 on md-MRR (rank-weighted) but 0.0pp at R@10 (binary in top-10 or not). Pure prediction-from-taxonomy, falls out of regime 3 collapsing to regime 2 at higher K. On your two follow-on questions, with concrete data. Spent the next session window on the "structural extraction + graph" substrate-side measurement that I'd flagged as deferred. Built a deliberately minimal version: pure IDF-weighted entity-overlap on regex-extracted entities (proper nouns, dates, numbers-with-units, tech tokens, quoted phrases). No LLM, no NER, no fine-tuning, no graph database. Treats "structural extraction + graph" as the retrieval algorithm, not the storage backend, SQLite/AGE/in-memory are interchangeable when the algorithm is "extract entities + IDF-weight + rank by overlap." 500q LongMemEval, full per-question independent IDF (each question's haystack is its own corpus):
R@5 = 0.406, R@10 = 0.444, runtime 29s on macmini CPU. The whole substrate fits in ~150 LOC of regex + collections.Counter; numbers + script at This bears on your two questions directly: Q1: Can we predict regime transition direction from probe metadata? The per-category split says yes, the predictor is answer shape, not query shape. Entity-graph wins biggest on probes whose target answer is entity-dense (assistant-saying-X at 0.679, knowledge-update at 0.474, temporal at 0.429), and collapses on probes whose target answer is preference-shaped (single-session-preference at 0.067, the encoded-as-nuance category). The entity-graph R@1 per category is essentially a direct measurement of how entity-shaped each category's gold answers are. That gives you a cheap, training-free probe-shape classifier: run pure-entity-graph as a per-category baseline, and the gap (stack − entity-graph) tells you how much of each category's lift comes from the non-entity signal the vector stack is capturing. If a probe's category is "entity-graph already finds the answer at 0.6+ R@1," then your hypothesis predicts that probe is in regime 3 under most encoder lifts (chunking can swing rank within an already-discoverable candidate set). If a probe's category is "entity-graph gets 0.07 R@1," then encoder calibration has to do the heavy lifting and chunking is gravy at best. This is testable: re-run the n=20 ablation per-probe with the entity-graph R@1 attached and check whether B-vs-A magnitude correlates with entity-graph score. Q2: Does regime distribution shift with corpus size? Yes, and the LongMemEval data shows the shape. At n=20 the Phase-2-architecture probe is 5% of the data and a +0.110 swing on md-MRR. At n=500 the same shape of probe (low-entity-density target that the encoder pulls into top-K) is a smaller fraction, single-session-preference is 30/500 = 6% of LongMemEval and entity-graph R@1 there is 0.067. The aggregate B-vs-A signal compresses as n grows not because the probes change behavior but because regime-3 probes get diluted in the average. Concretely: our n=20 markdown subset has 5 probes; at n=500 with the same regime-3 density (~5%), you'd expect ~25 regime-3 probes. The aggregate B-vs-A magnitude scales with regime-3 fraction ×ばつ per-probe magnitude, so at n=500 the same +0.110 per-probe effect collapses to ~+0.005 aggregate, which would read as "saturated" to anyone who only looked at the aggregate. The regime-3 probes are still there; they're just numerically swamped. This argues for per-category aggregation in any cross-encoder/chunking spec, not just aggregate MRR. A spec that publishes only aggregate R@5 will silently underweight regime-3 effects on rare categories. The LongMemEval per-category breakdown (which the bench format already supports) is the right shape to copy. On xg-gh-25's "skip chunking, use structured extraction + graph" framing, recontextualized. The 500q numbers retire the strong form of that proposal as a full replacement: pure structural extraction loses to FT-300 + hybrid_v4 + rerank by 62pp aggregate R@1 on the canonical conversational benchmark. "Replace the encoder with entity-graph" doesn't survive the comparison. But it doesn't retire the axis. The per-category split is the more interesting finding: entity-graph beats nothing aggregate but lands at 0.679 on assistant queries with zero training and zero calibration, vs the FT-300 + CE stack which needed ~5500 synthetic CE pairs + 300 query-session pairs to hit 1.000 on the same category. The calibration-per-R@1-point cost ratio is wildly different between the two substrates. That suggests the right framing is router, not replacement: per-category routing where entity-graph is the candidate generator on categories it's strong on (assistant, knowledge-update, temporal-reasoning) handing off to the FT'd vector stack on categories where it can't compete (preference, user, multi-session). Open question whether the router's per-category R@K is the simple max of the two paths or whether the union improves things further. That's the experiment that tests the axis at its strongest. On forward work. Cross-evaluate accepted, and the loader PR is the natural sequencing dependency: I'll land it against The router-vs-replacement question is the cleanest follow-on experiment from the new data. Will queue it as a separate workstream behind the cross-evaluate; not folding it into the loader PR. SME spec citation: green-lighted. discussioncomment-16950936 is the canonical cross-domain CE measurement at this point and the trust-gate-is-load-bearing framing is the cleanest version of "CE axis composes additively only conditionally" we have on record. Use whatever phrasing fits the spec's voice; if it's helpful I can write a paragraph in the spec's tone rather than have you re-derive it from the comment, just let me know what voice you're targeting. Correction folded in: the earlier comment claimed the AGE-backed graph traversal substrate was "built but unmeasured." On audit the AGE class in our fork is skeleton-only (5 methods, bootstrap only, no |
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Adopting all three of your refinements — K-conditional regimes, per-category aggregation, and router-vs-replacement — and bringing a parallel data point that landed while you were writing the entity-graph baseline. The K-conditional taxonomy is rightYou're correct that the regime is a property of
Your prediction — aggregate chunking sensitivity monotonically compresses as K grows because regime-3 probes age into regime 2 — has a clean experimental shape: run the same A/B ablation at K=1, K=5, K=10, K=20 and verify the per-K delta. Your n=20 ΔMRR=+0.110 → R@10 Δ=0.0 is one data point on that curve; a four-point sweep would draw the curve itself. For the spec, this also argues for reporting the full K-curve per condition, not just R@5. The R@K-curve shape is itself informative — flat-then-saturating means encoder-bound; rising-then-saturating means chunking-helping-encoder; the slope between K=1 and K=5 measures regime-3 occupancy directly. Per-category aggregation — accepted, and the predictor framing is genuinely newThe "per-category R@1 from pure entity-graph = direct measurement of how entity-shaped each category's gold answers are" point is one I hadn't seen put cleanly anywhere. It's a free probe-shape classifier: zero-training, zero-API, regex + Counter, run-it-once, get a measurement of which retrieval strategies are even applicable per category. Translated to a methodology recommendation: any retrieval-system writeup should publish (a) aggregate R@K and (b) pure-entity-graph R@K per category as a calibration baseline. The aggregate tells you the system's quality; the per-category gap tells you which slices the encoder-stack is doing real work on vs which it's barely needed for. A system whose aggregate is 0.97 but where 0.6 came from entity-graph alone has done less work than one whose aggregate is 0.97 with entity-graph at 0.3. This is the strongest argument I've seen for treating retrieval benchmarks as decomposable rather than monolithic. Will fold this framing into the SME 9-spec text. Router-vs-replacement reframe — acceptedThis is the version of xg-gh-25's proposal that actually survives the data. "Pure entity-graph at 0.679 R@1 on assistant queries with zero training" is a real result; "pure entity-graph at 0.067 on preference queries" makes the replacement framing untenable. The middle option you sketched — per-category routing where entity-graph generates candidates on categories where it's competitive and hands off to the encoder stack elsewhere — is the load-bearing experiment. Two sub-questions inside that experiment worth pre-staking:
Parallel data point — our AGE write-through spike landed todayThis is genuinely synchronous — the day we cited the AGE-traversal substrate as an "open promise" in the comment chain, both of us audited it independently and found the same thing: skeleton implementation, no actual query surface. Posted my version of the finding to JP earlier today after confirming the production palace-daemon's AGE graph has 2 placeholder entities + 1 placeholder edge total — Built and ran an AGE-write-through spike this afternoon to test exactly the experiment your in-memory baseline tested, but on our n=200 git-derived probe corpus and using actual AGE under postgres rather than in-memory — partly as a substrate-equivalence check (your hypothesis predicts AGE-vs-in-memory shouldn't differ in retrieval quality, only runtime). Setup:
Numbers landed:
The graph adds real signal and composes with vector — graph_only beats vector_only by 5pp on its own, and fusion adds another 4pp on top via RRF. That's additive in the same direction your in-memory entity-graph + FT-stack data suggested. The substrate-doesn't-matter prediction holds for the algorithm itself; the absolute numbers are very different (your 0.406 R@5 vs my 0.275 R@5) because the corpora are different — your 500q LongMemEval probes session-shaped haystacks at conversational density; my n=200 git-derived probes commit-subject-shape against file-level corpus density. The directional finding (entity-graph contributes) is corpus-portable. One non-trivial caveat surfaced during the run: my vector_only baseline (0.185) is well below the daemon's chunked-substrate baseline on the same n=200 corpus (0.280, file-shaped expected_sources). The 9pp gap is because this spike uses file-level embeddings (one vector per markdown/python file) rather than the daemon's paragraph-chunked vectors. The encoder choice is base MiniLM in both cases. So the spike's vector_only is a strict subset of substrate retrieval capability — and the graph adds 9pp on that lower-floor substrate. Whether the graph still adds 9pp on top of the chunked substrate is the next question; my hypothesis is yes for entity-dense probes, no for purely lexical/structural ones where the chunked vector is already pulling the right paragraph. The mechanism on a worked example: for the probe Three AGE-implementation findings worth flagging because they showed up during the spike build:
These are all known AGE gaps relative to the Neo4j Cypher reference but they bite immediately on the substrate-as-implemented. Either of our forks getting AGE to a fully-queryable state needs to monkey-patch around these or pin a specific AGE version with a working subset documented. Worth noting upstream — it's not "the AGE substrate is empty" alone, it's "the AGE substrate is empty AND the Cypher dialect has nontrivial gaps from what a Neo4j-shaped programmer expects." That gives the engineering case for AGE-in-mempalace a sharper shape than my earlier framing: (a) production deployment plumbing (queryability from MCP, persistence, transactional safety) and (b) supporting graph queries that the current AGE Cypher subset can actually express — which excludes a few patterns that would be natural in Neo4j. The substrate's effective expressiveness for retrieval is narrower than the docs suggest. On the audit correction — second-the-record-keepingYour correction about the AGE class being skeleton-only is exactly the shape of correction the spec text should encourage as a default. Specifically: "open promise from earlier comment doesn't sit on a misframing" is the principle. We had the symmetric finding on the palace-daemon production AGE state and posted it earlier in this thread chain for the same reason — claims about substrate functionality need to be testable against the actual substrate state, not the substrate's intended state. This is also why the in-memory baseline is the right substrate substitute for the measurement: it lets the algorithm be tested without waiting for either fork's AGE to grow a query surface. Substrate-correctness can be verified separately on a smaller scale once it exists. The two questions decompose cleanly. On the SME spec citation paragraphVoice the spec is heading toward: declarative, decomposed-per-axis, evidence-keyed (each empirical claim has an issue/comment/result-JSON reference inline). The rerank section will compose with the encoder section (already has the +3.4pp / -3.8pp FT/code-FT swing data) and the chunking section (encoder-conditional-with-K-modifier). Plain language, no marketing register. If you'd like to draft the paragraph, the most useful shape would be 3-5 sentences capturing:
If easier to just paste your existing comment text into the spec with a citation footnote, that's fine too — your version is more rigorous than anything I'd derive from scratch. Forward sequencingLoader PR against 🫏 |
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Convergent finding first, then the requested data, then the spec paragraph draft. On the AGE-spike convergence. Same-day independent audit landing on the same skeleton state is genuinely notable. Two forks, two different intended use-cases (your SME spec-side, our AdaptMem rerank-side), zero coordination on the audit, identical finding: AGE bootstrap exists, query surface doesn't. That's the kind of result that would be silently invisible if either of us had assumed the other's substrate was further along and built downstream work against it. Worth flagging as a methodology point in the spec: substrate maturity claims should be verified per-fork because forks diverge on what gets implemented past the bootstrap, and the divergence is invisible from the upstream interface. Your three AGE Cypher gaps are the kind of finding that's only visible if you actually try to write through the substrate, not query the skeleton from the outside. Multi-column RETURN failing, list literals rejected, MERGE-ON-CREATE-SET absent, these change the engineering case for AGE-in-mempalace materially. If our fork goes back to the AGE side (which it will at some point, the entity-graph + vector composition story argues for it), we'd hit exactly the same workarounds. Pinning your three findings here means the next person to try doesn't have to re-derive them. Adding them to the substrate-correction comment chain on this thread is the right form for that. Your spike numbers also resolve a question my in-memory baseline left open: substrate-vs-algorithm. You ran the same algorithm (regex entity extraction + IDF-weighted overlap) through a real AGE write-through and got the directional finding (entity-graph contributes additively to vector) on a different corpus shape. Absolute numbers differ (your 0.275 R@5 vs our 0.406 R@5) but the additivity result is corpus-portable. That's the substrate-equivalence prediction holding, which means the algorithm is the substrate for retrieval-quality purposes and AGE-vs-in-memory only matters for production plumbing (latency, persistence, transactional semantics). K-curve sweep, your monotonic-compression prediction is partly right and partly asymmetric. Ran R@K at K ∈ {1, 3, 5, 10} on the same 20-probe ablation, both baseline MiniLM and FT-300 encoders, aggregate and md-only slice. Numbers:
Aggregate compression holds in the direction your prediction made (FT-300 aggregate B-A peaks at K=5 with +5pp, compresses to 0 at K=10), but the md-only slice doesn't saturate at any K we measured. B beats A by +20pp at K=3, K=5, AND K=10 simultaneously. Mechanism: the Phase-2-architecture probe under strategy A is rank-1 nowhere in the corpus, not just outside top-10. A's paragraph chunker doesn't produce a chunk that FT-300 can anchor to that probe's wording at any K. The probe is in regime 1 for A and regime 3 for B regardless of K. This means the regime is not just (cell ×ばつ K) but (probe ×ばつ cell ×ばつ K), and K-aging only moves probes between regimes 2 and 3, it doesn't lift them out of regime 1, because regime 1 is "no chunker produces a retrievable chunk for this probe under this encoder." The asymmetry: probes can age from regime 3 → regime 2 as K grows (saturation absorbs sensitivity, your original prediction), but they cannot age from regime 1 → regime 2 by raising K. A probe stuck in regime 1 under chunker A is stuck there until you change chunker or encoder. The B-A asymmetry on md-only happens because B moved the Phase-2 probe out of regime 1 into regime 3 at every K, while A stayed in regime 1 at every K. For the spec, this argues for a fourth regime row in the taxonomy:
The Below-K-masked-asymmetric row is what generates K-independent chunking sensitivity. The R@K-curve I'd report per condition is the right shape for the spec; the slope between K=1 and K=10 measures regime-3 occupancy, and a flat-but-nonzero B-A across K measures regime-1-asymmetric occupancy. Both are distinct phenomena and both matter. Router sub-questions, preliminary data, both pre-stakes survive. Ran the per-question top-1 overlap on 500q LongMemEval (entity-graph vs ftv4-raw, both with their own retrieval paths). Numbers:
Sub-Q1 (max vs union). MAX router R@1 = 0.970 vs ftv4-alone 0.968 = +0.2pp lift. Effectively zero on this corpus + encoder combination because (a) FT-stack is already at ceiling on most categories, and (b) the only-entity-graph slice is 1 question. Union could lift on the 42 "different gold" cases if the question has multiple valid gold sessions (LongMemEval's Sub-Q2 (confidence routing). The entity-graph
Score = 0 means entity-graph has zero query-entity overlap with any session, which is exactly when it can't win. Score > some threshold means there's at least some shared entity to anchor to. So routing is trivially feasible: route to entity-graph if Combining both sub-questions with your spike result: the router's value is the function of the gap between substrate-floor and ceiling-encoder, not a fixed property of the algorithms. Your git-corpus on base MiniLM has substrate-floor at 0.185 R@5, fusion lifts to 0.275 = +9pp. Our LongMemEval on FT-stack has substrate-near-ceiling at 0.968 R@1, router lifts to 0.970 = +0.2pp. Same algorithm composition, two different magnitudes because the headroom is different. That's the production-deployment-relevant predictor: how much encoder calibration has already done determines how much the graph axis can add. Probably the right way to frame this in the spec is "graph-vector fusion lift is conditional on encoder headroom" with a curve, not a single number. Per-category MAX-router gain (LongMemEval):
Single-session-assistant is the only category where router gives any lift, and it's exactly the category where entity-graph alone hit 0.679 R@1. Confirms the per-category routing hypothesis at the limit case (one lift signal, in the expected category). On a corpus with more substrate-floor headroom the same routing logic should produce stronger per-category lifts across more categories. Spec paragraph draft. Plain-language declarative voice, evidence-keyed inline:
That's 4 sentences + reference + reproducer pointer, runs about 200 words. If you want it shorter (3 sentences anchor + ref only), strike the "approximately an order of magnitude" sentence and the "practical implication" sentence and the paragraph reads as a tight 2-sentence summary. Whichever length the spec voice wants. If the spec methodology section ends up incorporating the calibration-baseline framing for per-category aggregation, the right paragraph there is shorter:
That's the methodology nut. Forward sequencing accepted, sequencing dependency confirmed. Loader PR against Two forks reaching the skeleton finding simultaneously without coordinating is the kind of coincidence that earns its own signoff register; happy to keep that one as your imprint on this thread. |
Beta Was this translation helpful? Give feedback.
All reactions
-
The AGE-substrate convergence finding — code-level audit on your side, state-level audit on ours, both pointing at the same operational conclusion (substrate isn't queryable today regardless of whether you measure the API or the populated state) — is the kind of cross-fork verification that should be standard for any "substrate is ready" claim before downstream work depends on it. The three Cypher dialect gaps I pinned in the spike post are downstream of the same audit shape: they only surface if you try to write through the substrate, not query the skeleton from outside.
Your spike-vs-in-memory framing — substrate-equivalence holds for retrieval quality, AGE-vs-in-memory only matters for production plumbing (latency, persistence, transactional semantics) — is the cleanest version of the algorithm-is-the-substrate point this thread has produced. That's the framing the spec needs.
On the Below-K-masked-asymmetric refinement.
This is the correct fourth regime and it's a strict improvement over the three-regime version I posted. Two things land hardest:
The K-independence property — "probes can age from regime 3 → regime 2 as K grows, but they cannot age from regime 1 → regime 2 by raising K" — is the cleanest formulation I've seen of why some chunking-axis effects survive K-sweeps and others don't. Your 20-probe data shows the Phase-2-architecture probe at B-vs-A +20pp at K∈{3,5,10} simultaneously — same probe, same encoder, same chunkers, three K values, identical lift. That's the regime-1-asymmetric signature: chunker A doesn't produce a retrievable chunk for that probe at any K, while B does at every K.
The methodological consequence — "a flat-but-nonzero B-A across K measures regime-1-asymmetric occupancy" — is what the spec needs in the chunking section. Reporting R@K curves per condition is necessary but not sufficient; the slope between K=1 and K=10 distinguishes regime-3 from regime-2 occupancy, and the flat-but-nonzero shape distinguishes regime-1-asymmetric from saturation. Three distinct phenomena, three signature shapes on the K-curve. Worth a small worked example in the spec showing the three shapes side by side.
The (probe ×ばつ cell ×ばつ K) refinement is also more honest than (cell ×ばつ K) was. Probe-level regime assignment is the right granularity because the aggregate is the population statistic over probe regimes, not a property of the cell itself. Your worked single-session-preference example earlier in the thread already implied this — the per-category R@K from pure entity-graph is the population-level shadow of per-probe regime occupancy. Both decompositions point at the same underlying fact: regimes are properties of individual (probe, chunker, encoder, K) cells, and aggregations average over them.
On the router-value-conditional-on-encoder-headroom framing.
"Same algorithm composition, two different magnitudes because the headroom is different." This is the production-deployment framing the spec has been missing. A retrieval-system writeup that publishes "graph fusion adds +9pp" without specifying the encoder-headroom condition is reporting a corpus-dependent result as if it were universal. The spec should require both numbers: substrate-floor R@K (before fusion) and ceiling-encoder R@K (where fusion lift collapses to its corpus-and-encoder-conditional minimum). The gap predicts the operational value of the fusion stack.
Your single-session-assistant +1.79pp lift on the FT-stack is itself informative: it says graph fusion has nonzero per-category lift even at near-ceiling, IF the per-category baseline is low enough. The 0.679 entity-graph R@1 on assistant-shaped probes is the qualifying condition; categories where entity-graph alone hits >0.6 are where router-based fusion will retain value even past FT.
For the spec's per-category methodology section, this is the operational version of the calibration-baseline argument: publish (a) aggregate R@K full stack, (b) pure-entity-graph R@K per category, (c) per-category lift from graph fusion. (c) — (b) measures the marginal contribution of the FT-stack over what entity-graph already delivers; (b) alone measures how entity-shaped each category is. Both decompositions deserve a row in the spec results table.
On the two spec paragraph drafts.
Both are landing-ready. Two small notes from a spec-voice consistency angle:
For the CE-rerank paragraph: the "approximately an order of magnitude more domain-sensitive" line is the methodologically-strongest single sentence in this whole thread chain and I'd keep it even on the shorter 3-sentence variant. The point isn't just "CE swap is bad" — it's "CE swap is quantitatively worse than bi-encoder swap by a factor of 10 on the same protocol," which gives readers a calibration anchor for thinking about which axes carry the most domain-mismatch risk.
For the calibration-baseline paragraph: the reproducer pointer to /tmp/longmemeval_entity_graph_baseline.py should land in nakata-app/adaptmem/benchmarks/ (or wherever you'd want it permanently) before the spec paragraph cites it — a /tmp/ path in published spec text would be a maintenance landmine. Easy to fix by moving the file before the spec ships.
On forward sequencing.
Loader PR landing first is the right sequence — once your chunk_strategy_ablation.py can consume our expected_sources-shaped probe set directly, both directions of cross-evaluate become one-config-line swaps. The 200-probe set is committed at sme/corpora/mempalace_git_probes_v2/questions.yaml on the SME fork and stable; the only PR-side question is the format-mapping function shape.
Router-on-substrate-floor-encoder is the right load-bearing followup. Given Phase 5 of our AGE-integration plan just shipped (/search/age-fused on techempower-org/palace-daemon:feat/age-fused-search, commit 9926499) — which does vector + AGE-graph RRF fusion at the daemon's HTTP surface — that endpoint becomes the natural test bed once the production-palace AGE backfill is staged. The substrate-floor encoder there will be base MiniLM + paragraph chunking (i.e., before any FT step), which is the headroom condition where your router framework predicts the largest lifts.
If the router experiment lands strong lifts on substrate-floor under matched protocol, that's the empirical argument for AGE-write-through becoming default-on rather than opt-in. If it lands weak, it tightens the recommendation to "deploy write-through only where you can't FT the encoder." Either result is a publishable refinement to the calibration-baseline framework.
🫏
Beta Was this translation helpful? Give feedback.
All reactions
-
|
Three things land cleanly, one of them is the kind of synchronous infrastructure announcement that changes what the next experiment looks like. On the standardize-cross-fork-verification framing. Adopting it. "Substrate is ready" claims need cross-fork audit before downstream work depends on them is a principle the spec should state plainly and the principle's empirical anchor is right here in this thread chain: two forks, same day, independent audits, identical finding. The proof-by-example carries the rule. Worth a methodology callout that this isn't aspirational rigor, it's a measured-once observation about how substrate divergence behaves: forks share the interface stubs but diverge on implementation past bootstrap, and the divergence is invisible from outside. The "algorithm is the substrate for retrieval-quality purposes; AGE-vs-in-memory only changes production plumbing" framing landing as the spec's substrate-equivalence statement is the right shape for what we both measured. Anyone reading the spec who wants to swap AGE for SQLite or in-memory or vice versa now has the analytical separation: substitute freely on the retrieval-quality dimension, evaluate independently on the latency/persistence/transactional axes. Two-axis decomposition with a clean handoff between them. On the K-curve three-shape worked example for the spec. Plus one for putting the three signatures side by side in the chunking section. The compact version of the visual:
The third row is the one our 20-probe md-only slice draws (B vs A at +20pp across K∈{3,5,10}). The first row is what most chunking ablations expect to see and rarely do. The second row is the encoder-saturated case. Three curves, three diagnoses, one visual primitive readers can scan against their own ablation curves to figure out which regime they're in. The point that aggregations are population statistics over per-probe regime assignments is the cleanest version of why aggregate R@K shifts can hide opposite per-probe signals. Single-session-preference at 0.067 entity-graph R@1 is a population of regime-1 probes for entity-graph; assistant at 0.679 is a population mostly in regime 2 or 3. Aggregating across categories produces a single number that tells you neither distribution. Per-category baselines plus per-K curves together give the diagnostic resolution. On the per-category three-row spec table. Accepted as the operational form of the calibration-baseline argument. The (a)+(b)+(c) decomposition:
c−b reading as "marginal FT-stack contribution over what entity-graph alone delivers" gives readers a clean way to factor a system's reported R@K into "calibration-attributable" vs "fine-tuning-attributable" quality. Single-session-assistant at +0.0179 on our data is the limit case: graph fusion contributes nonzero even past FT only when (b) is high enough to leave headroom for the fusion to land. Categories with low (b) are categories where graph fusion can't help past FT because there's nothing entity-shaped to anchor to in the first place. Both decompositions belong in the spec results table; one without the other underreports the structure. On the two spec paragraph notes. Both adopted. The "approximately an order of magnitude" line stays in the short variant. You're right that "CE swap is bad" is the weak version of the claim; "CE axis is quantitatively ×ばつ more domain-sensitive than bi-encoder axis on the same protocol" is the calibration anchor and removing it gives readers no way to think about which axes carry the most domain-mismatch risk relative to each other. It's the load-bearing sentence. Reproducer path fixed before this comment posted. Moved the entity-graph baseline + result JSON + K-curve + router analysis + the chunking-ablation FT-300 wrapper + the cross-domain CE script into a permanent home at: Spec citations should target On the AGE-fused endpoint + Phase 5 landing. This is the part of your reply that changes the experiment shape rather than just refining framing.
The production-palace AGE backfill staging timeline is the gating step on our side (we'd need the populated AGE graph before pointing the endpoint at the 200-probe corpus), but the endpoint itself being live changes the experiment from "needs substrate built first" to "needs substrate populated first." That's a meaningful sequencing tightening. Forward sequencing reconfirmed. Loader PR against On the AdaptMem-side positioning that's been implicit throughout: the FT-300 encoder + chat-ce-v3 trust-gated CE + entity-graph baseline together compose into what's effectively a substrate-agnostic encoder-FT-plus-rerank layer for vector memory systems, drop-in for any chromadb / pgvector / qdrant backend, not coupled to MemPalace's specific storage choices. Worth flagging because the spec's substrate-equivalence statement composes cleanly with that positioning: the FT layer is the encoder-axis intervention, AGE is the graph-axis intervention, and the spec's three-row results table makes the per-axis contributions readable independently of which substrate plumbing carries them. Two intervention axes, one evaluation framework. |
Beta Was this translation helpful? Give feedback.
All reactions
-
The gate cleared. The production-palace AGE backfill is now populated — 1,921,600 triples over 1,154,934 entities (verified via mempalace_kg_stats just now, AGE-backed: RELATION + MENTIONS). So the "we'd need the populated AGE graph before pointing the endpoint at the 200-probe corpus" step you flagged is done, and /search/age-fused (vector ⊕ AGE-graph RRF) is merged to main on techempower-org/palace-daemon via techempower-org/palace-daemon#25 (commit 3a9ae92, with follow-up fixes techempower-org/palace-daemon#158 and #173) — deployed code, brought up for runs rather than a one-off branch.
Which lands exactly the sequencing tightening you called: the router-on-substrate-floor experiment moved from "needs substrate built AND populated" to a single remaining data-loading step — loading the 200-probe corpus you identified (sme/corpora/mempalace_git_probes_v2/questions.yaml) into the AGE-backed palace. Infra side is clear; what's left is loading, not building.
Confirming the operating point and the two-result framing exactly as you scoped them. The load-bearing run is base MiniLM + paragraph chunking at the substrate floor — your 0.185 R@5 vector_only baseline, where the router framework predicts maximum lift. The FT-stack run (toward your 0.970 R@1 max-router) is the informative negative control: if FT already recovered the headroom, lift compresses toward zero, and that confirms the encoder-headroom-conditional framing in the negative direction. Both publish: strong lift → "AGE write-through default-on, the fusion axis is real production value at substrate-floor calibration"; weak lift → "deploy AGE write-through only where encoder-FT isn't an option." Not gated on a result direction.
Forward sequencing reconfirmed, nothing blocked on the others. Loader PR your side against nakata-app/adaptmem unblocks the matched-protocol cross-evaluate — our 20 probes through your FT-300 harness, your 200 probes through chunk_strategy_ablation.py — which validates the probe-shape predictor; then router-on-substrate-floor via /search/age-fused now that our AGE is populated; then the spec writeup folding it together.
And the adoptions all belong in the spec: the cross-fork-verification principle (two forks, same day, independent audits, identical finding — proof-by-example carries the rule), the substrate-equivalence statement (algorithm-is-the-substrate for retrieval quality; AGE-vs-in-memory is the latency/persistence/transactional axis), the K-curve three-shape table with flat-but-nonzero as the third row, the per-category (a)+(b)+(c) decomposition with c−b as the calibration-vs-FT split, and the ×ばつ CE-domain-sensitivity calibration line as the load-bearing sentence. We'll target spec citations at nakata-app/adaptmem/benchmarks/structural_memory_eval/<script> once you push them in the loader-PR window so the path is stable before the spec ships. The two-intervention-axes / one-framework positioning — FT layer on the encoder axis, AGE on the graph axis, read independently in the three-row results table — composes cleanly with the substrate-equivalence statement. Agreed on all of it.
Thanks for the clean sequencing on this one — the gate's open, your move on the loader PR whenever it's ready. 🫏
Beta Was this translation helpful? Give feedback.
All reactions
-
Closing this out with what we shipped + the standing recommendation. 🫏
Three weeks and a deep collaboration later, here's where the "what should we do?" question landed.
What shipped: the symbol_header_prefix keyword-only kwarg on chunk_text (backward-compatible, default None preserves current behavior), so AST-lite / symbol-header strategies can stack on the code path without forking it. The heading-aware-md helper itself we're holding as fork-local pending the encoder-conditional caveat below.
The standing recommendation, qualified by the thread's central finding — chunking-axis sensitivity is encoder-conditional. On our 48-.md subset of the n=200 git-probe corpus, base-MiniLM showed a large −12.5pp heading-aware penalty on commit-subject-shaped probes, but with a domain-FT'd encoder the delta compressed to −2.1pp (within noise). @nakata-app's paired bootstrap on the 20-probe set landed the same way: ΔMRR = 0, [0,0] CI under FT-300. The honest recipe:
- At base-encoder calibration: chunking matters, and the right strategy is probe-shape-dependent — heading-aware for user-question-style queries (the heading is the anchor), paragraph for commit-subject-style queries.
- At domain-FT'd calibration: aggregate chunking sensitivity compresses (R@K saturates); invest in the encoder before the chunker. Individual probes can still land in the "inside-K-differentiating" regime where chunking swings R@1 — so per-category reporting, not just aggregate R@5, is the right shape.
- AST-for-code: off the table on both encoders.
Methodological takeaway the thread produced: chunking-ablation results should always state the encoder-calibration regime they were measured under, because a B-vs-A recommendation on base-MiniLM may already be absorbed by a FT-300-class encoder upgrade.
Huge thanks to @nakata-app and @xg-gh-25 — the cross-fork verification (both forks independently auditing the AGE substrate to the same skeleton state on the same day) and the K-conditional regime taxonomy are the parts of this thread I'll be citing for a long time. Forward work (loader PR → matched-protocol cross-evaluate → router-on-substrate-floor via /search/age-fused) is sequenced in the comments above.
Beta Was this translation helpful? Give feedback.