synthesize_concepts writes concept pages that are never chunked/embedded → unreachable by retrieval despite source-boost ×1.3 #2163
Description
Summary
synthesize_concepts writes concepts/ pages via putPage, but those pages end up with 0 content_chunks and a NULL embedding — so they are invisible to query (vector retrieval) and effectively unreachable. This is despite src/core/search/source-boost.ts assigning 'concepts/': 1.3, i.e. the retrieval layer is explicitly designed to rank concept pages above bulk content. In practice that boost is dead — there are no concept chunks to boost.
The CHANGELOG's own worked example shows the intended behavior (concepts/widget-pattern → "returned (rank 1)"), but a deployment whose concept pages are never embedded can never reproduce it.
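To make the "dead boost" point concrete, here is a minimal sketch, not the actual contents of source-boost.ts. It assumes the boost is a per-prefix multiplier applied to candidates that retrieval has already found. Only the 'concepts/': 1.3 entry and the concepts/widget-pattern slug come from this report; the `Candidate` type, `applySourceBoost`, the notes/widgets slug, and the scores are hypothetical.

```typescript
// Hypothetical candidate shape; the real retrieval types may differ.
type Candidate = { slug: string; score: number };

// Only the 'concepts/' entry (1.3) is taken from the issue text.
const SOURCE_BOOST: Array<[string, number]> = [["concepts/", 1.3]];

// Multiply each candidate's score by the boost of the first matching prefix,
// then sort by descending score.
function applySourceBoost(candidates: Candidate[]): Candidate[] {
  const boosted = candidates.map((c) => {
    for (const [prefix, factor] of SOURCE_BOOST) {
      if (c.slug.startsWith(prefix)) {
        return { slug: c.slug, score: c.score * factor };
      }
    }
    return c;
  });
  return boosted.sort((a, b) => b.score - a.score);
}

// If the concept page had chunks, it would enter the candidate set and the
// boost could lift it above a slightly better-scoring bulk page.
const withConcept = applySourceBoost([
  { slug: "notes/widgets", score: 0.8 },
  { slug: "concepts/widget-pattern", score: 0.7 },
]);

// With zero chunks the concept page is never a candidate, so there is
// nothing for the multiplier to act on.
const withoutConcept = applySourceBoost([{ slug: "notes/widgets", score: 0.8 }]);

console.log(withConcept[0].slug, withoutConcept.length);
```

A multiplier can only reorder what retrieval has already returned; it cannot add pages that were never indexed.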
Environment
Observed on a v0.42.x deployment. Verified the gap still exists on main at v0.42.42: src/core/cycle/synthesize-concepts.ts still only calls engine.putPage('concepts/${title}', ...) with no chunk/embed step, and the header comment still reads // ... dedup-by-embedding-similarity ship in v0.42+.
Evidence (deployment with ~1,700 concept pages)
- SELECT count(*) FROM content_chunks c JOIN pages p ON p.id = c.page_id WHERE p.slug LIKE 'concepts/%' → 0
- Every concepts/ page has embedding_signature IS NULL.
- A 20-question retrieval probe (questions chosen to match concept topics) returned a concept page in top-5 in 0/20 cases.
- FTS search also fails to surface them even on near-verbatim title queries: the pages do have search_vector populated, but being short they are out-ranked by longer pages and fall below the result limit.
- Control: distilled/ and dream-cycle-summaries/ pages (also dream-generated) do have chunks. So the dream_generated anti-loop guard is not the cause; the gap is specific to the synthesize_concepts (and extract_atoms) write path.
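The control comparison can be generalized into a per-prefix coverage count. The sketch below runs over in-memory rows rather than a live database. The row shapes mirror the pages and content_chunks columns used in the query above; the function name `chunkCoverage` and the sample rows are hypothetical and exist only for illustration.

```typescript
// Hypothetical row shapes, limited to the columns the evidence query uses.
type PageRow = { id: number; slug: string };
type ChunkRow = { page_id: number };
type Coverage = { prefix: string; pages: number; chunked: number };

// For each slug prefix, count the matching pages and how many of them
// own at least one chunk.
function chunkCoverage(
  pages: PageRow[],
  chunks: ChunkRow[],
  prefixes: string[],
): Coverage[] {
  const chunkedIds = new Set<number>();
  for (const c of chunks) {
    chunkedIds.add(c.page_id);
  }
  return prefixes.map((prefix) => {
    const matching = pages.filter((p) => p.slug.startsWith(prefix));
    const chunked = matching.filter((p) => chunkedIds.has(p.id)).length;
    return { prefix, pages: matching.length, chunked };
  });
}

// Made-up sample rows: two concept pages without chunks, one distilled
// page with two chunks.
const samplePages: PageRow[] = [
  { id: 1, slug: "concepts/widget-pattern" },
  { id: 2, slug: "concepts/another-concept" },
  { id: 3, slug: "distilled/some-summary" },
];
const sampleChunks: ChunkRow[] = [{ page_id: 3 }, { page_id: 3 }];

const coverage = chunkCoverage(samplePages, sampleChunks, ["concepts/", "distilled/"]);
console.log(coverage);
```

On the deployment described here, the concepts/ row would show roughly 1,700 pages and 0 chunked, while the distilled/ row would show a non-zero chunked count.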
Expected vs actual
- Expected: concept pages, the curated "distilled framework" output that the schema treats as first-class and source-boost.ts prioritizes at ×1.3, should be in the retrieval surface (chunked + embedded) so query/search can return them.
- Actual: synthesize_concepts writes them but they are never chunked/embedded, so they are unreachable via the primary retrieval paths.
Possible directions (maintainers to weigh)
- Have synthesize_concepts enqueue chunking/embedding for the concept pages it writes. This is the natural home and it composes with the planned v0.42+ embedding-similarity dedup, which will need concept embeddings anyway.
- Alternatively, document that concepts need a separate reindex/embed pass, and add a doctor check that flags "concept pages present but unchunked" so operators know to run it.
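A rough sketch of the first direction follows, under heavy assumptions. Only the method name putPage and the concepts/ slug prefix come from this report. The `PageStore` and `IndexQueue` interfaces, the page id returned by putPage, the synchronous signatures, and the in-memory fakes are all hypothetical; the real engine is presumably asynchronous and may expose chunking and embedding differently.

```typescript
// Hypothetical interfaces; not the project's actual engine API.
interface PageStore {
  putPage(slug: string, body: string): number; // assumed to return a page id
}
interface IndexQueue {
  enqueue(pageId: number): void; // assumed to schedule chunking + embedding
}

// Write the concept page, then immediately queue it for chunking/embedding
// so it joins the retrieval surface.
function writeConcept(
  store: PageStore,
  queue: IndexQueue,
  title: string,
  body: string,
): number {
  const pageId = store.putPage(`concepts/${title}`, body);
  queue.enqueue(pageId);
  return pageId;
}

// In-memory fakes used only to exercise the sketch.
class MemoryStore implements PageStore {
  slugs: string[] = [];
  putPage(slug: string, _body: string): number {
    this.slugs.push(slug);
    return this.slugs.length; // 1-based id
  }
}
class MemoryQueue implements IndexQueue {
  pending: number[] = [];
  enqueue(pageId: number): void {
    this.pending.push(pageId);
  }
}

const memStore = new MemoryStore();
const memQueue = new MemoryQueue();
const conceptId = writeConcept(memStore, memQueue, "widget-pattern", "A short concept body.");
console.log(memStore.slugs, memQueue.pending, conceptId);
```

Coupling the enqueue to the write keeps the two from drifting apart, which is the failure described in this report. A separate reindex pass would instead rely on operators remembering to run it.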
Note
atoms/ are intentionally substrate (no source-boost entry) and presumably should not be embedded — this report is specifically about concepts/.