Issue #2163 · garrytan/gbrain
synthesize_concepts writes concept pages that are never chunked/embedded → unreachable by retrieval despite source-boost 1.3× #2163

Open

Description

Summary

synthesize_concepts writes concepts/ pages via putPage, but those pages end up with 0 content_chunks and a NULL embedding — so they are invisible to query (vector retrieval) and effectively unreachable. This is despite src/core/search/source-boost.ts assigning 'concepts/': 1.3, i.e. the retrieval layer is explicitly designed to rank concept pages above bulk content. In practice that boost is dead — there are no concept chunks to boost.

The CHANGELOG's own worked example shows the intended behavior (concepts/widget-pattern → "returned (rank 1)"), but a deployment whose concept pages are never embedded can never reproduce it.
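To illustrate why the boost cannot help here, a minimal sketch follows. It assumes the boost is a multiplier applied to candidates that vector retrieval has already returned; the Candidate shape, the applySourceBoost function, and the notes/ slug are hypothetical and are not taken from source-boost.ts. Only the 'concepts/': 1.3 entry and the concepts/widget-pattern slug come from this report.

```typescript
// Hypothetical candidate shape for illustration; not gbrain's actual type.
interface Candidate {
  slug: string;
  score: number; // raw similarity score from vector retrieval
}

// Prefix/multiplier pairs. Only the 'concepts/' entry (1.3) comes from the issue.
const SOURCE_BOOSTS: Array<[string, number]> = [["concepts/", 1.3]];

// Multiply each candidate's score by the boost of the first matching prefix,
// then sort by boosted score, highest first.
function applySourceBoost(candidates: Candidate[]): Candidate[] {
  return candidates
    .map((c) => {
      const hit = SOURCE_BOOSTS.find(([prefix]) => c.slug.startsWith(prefix));
      const factor = hit ? hit[1] : 1;
      return { ...c, score: c.score * factor };
    })
    .sort((a, b) => b.score - a.score);
}

// With an embedded concept chunk in the candidate set, the boost lifts it to rank 1.
const withConcept = applySourceBoost([
  { slug: "notes/long-page", score: 0.8 },
  { slug: "concepts/widget-pattern", score: 0.7 },
]);

// With no concept chunks, retrieval never produces a concept candidate,
// so there is nothing for the multiplier to act on.
const withoutConcept = applySourceBoost([
  { slug: "notes/long-page", score: 0.8 },
]);
```

Under these assumptions the multiplier only reorders candidates that already exist, which is why a page with no chunks stays invisible regardless of its boost.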

Environment

Observed on a v0.42.x deployment. Verified the gap still exists on main at v0.42.42: src/core/cycle/synthesize-concepts.ts still only calls engine.putPage('concepts/${title}', ...) with no chunk/embed step, and the header comment still reads // ... dedup-by-embedding-similarity ship in v0.42+.

Evidence (deployment with ~1,700 concept pages)

  • SELECT count(*) FROM content_chunks c JOIN pages p ON p.id = c.page_id WHERE p.slug LIKE 'concepts/%' returns 0.
  • Every concepts/ page has embedding_signature IS NULL.
  • A 20-question retrieval probe (questions chosen to match concept topics) returned a concept page in top-5 in 0/20 cases.
  • FTS search also fails to surface them even on near-verbatim title queries: the pages do have search_vector populated, but being short they are out-ranked by longer pages and fall below the result limit.
  • Control: distilled/ and dream-cycle-summaries/ pages (also dream-generated) do have chunks. So the dream_generated anti-loop guard is not the cause — it is specific to the synthesize_concepts (and extract_atoms) write path.
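For anyone who wants to repeat the retrieval probe, the metric can be sketched roughly as below. This is not the script used for the numbers above; topKHitRate and its input shape (one ranked list of slugs per probe question) are hypothetical names chosen for illustration.

```typescript
// Fraction of probe questions whose top-k results contain at least one slug
// starting with the given prefix. Input: one ranked slug list per question.
function topKHitRate(
  resultsPerQuestion: string[][],
  prefix: string,
  k: number
): number {
  if (resultsPerQuestion.length === 0) return 0;
  const hits = resultsPerQuestion.filter((slugs) =>
    slugs.slice(0, k).some((slug) => slug.startsWith(prefix))
  ).length;
  return hits / resultsPerQuestion.length;
}
```

A 0/20 result corresponds to topKHitRate(results, "concepts/", 5) returning 0 for the 20 ranked lists.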

Expected vs actual

  • Expected: concept pages (the curated "distilled framework" output that the schema treats as first-class and source-boost.ts prioritizes at 1.3×) should be chunked and embedded, and therefore part of the retrieval surface, so query/search can return them.
  • Actual: synthesize_concepts writes them but they are never chunked/embedded, so they are unreachable via the primary retrieval paths.

Possible directions (maintainers to weigh)

  1. Have synthesize_concepts enqueue chunking/embedding for the concept pages it writes. This is the natural home and it composes with the planned v0.42+ embedding-similarity dedup, which will need concept embeddings anyway.
  2. Alternatively, document that concepts need a separate reindex/embed pass, and add a doctor check that flags "concept pages present but unchunked" so operators know to run it.
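A rough sketch of both directions, for discussion only. Every name here (PageStore, IndexQueue, writeConceptAndEnqueue, PageIndexState, findUnindexedConcepts) is hypothetical. In particular, putPage(slug, body) is a simplified stand-in, since the real arguments to engine.putPage are elided in this report, and the enqueue step assumes some chunk/embed queue exists to hand the slug to.

```typescript
// Hypothetical, simplified interfaces; not gbrain's actual engine API.
interface PageStore {
  putPage(slug: string, body: string): Promise<void>;
}
interface IndexQueue {
  enqueue(slug: string): Promise<void>; // assumed chunk + embed job queue
}

// Direction 1: write the concept page, then enqueue it for chunking/embedding.
async function writeConceptAndEnqueue(
  store: PageStore,
  queue: IndexQueue,
  title: string,
  body: string
): Promise<string> {
  const slug = `concepts/${title}`;
  await store.putPage(slug, body);
  await queue.enqueue(slug);
  return slug;
}

// Direction 2: a doctor-style check over per-page index state.
// Field names mirror the evidence above (chunk count, embedding_signature).
interface PageIndexState {
  slug: string;
  chunkCount: number;
  embeddingSignature: string | null;
}

// Return the slugs of concept pages that are present but unchunked or unembedded.
function findUnindexedConcepts(pages: PageIndexState[]): string[] {
  return pages
    .filter(
      (p) =>
        p.slug.startsWith("concepts/") &&
        (p.chunkCount === 0 || p.embeddingSignature === null)
    )
    .map((p) => p.slug);
}
```

The check in direction 2 is a pure function over rows, so it could be fed from a query like the one in the evidence section without touching the write path.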

Note

atoms/ are intentionally substrate (no source-boost entry) and presumably should not be embedded — this report is specifically about concepts/.
