Show & tell: a Postgres+pgvector+AGE fork of MemPalace, an HTTP/MCP daemon, an eval harness — and where the bottleneck actually is #1659

jphein started this conversation in Show and tell

@jphein jphein

May 30, 2026

· 0 comments


jphein
May 30, 2026
Collaborator

Sharing back a cluster of work that grew out of frustration with Claude forgetting things: we started running MemPalace, then tried to measure it honestly. Three forks, many collaborations, and one finding that surprised us. 🫏

1. A Postgres + pgvector + Apache AGE fork

techempower-org/mempalace moved the substrate from ChromaDB to PostgreSQL + pgvector (vector + tsvector/pg_trgm BM25) with Apache AGE as the graph layer. The knowledge graph is live and non-trivial — ~1.9M triples (1,921,600) over ~1.15M entities right now — which is what makes graph-fused retrieval (below) a real candidate generator rather than a toy.

The migration is substrate-equivalent on retrieval, which we checked before trusting anything built on top of it: on LongMemEval-S (500q, same MiniLM encoder, same scoring), Postgres+pgvector and upstream ChromaDB agree to four decimals — R@5 = 0.9660, byte-identical per-question rankings across all six question types. So the backend swap costs no recall; everything downstream is measured against a floor we know matches upstream.
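A minimal sketch of the kind of parity check described above, assuming a hit-rate style R@k (a question counts as a hit if any gold session id appears in the top k). The function names and the dict-of-rankings layout are hypothetical, chosen for illustration; the harness's real scoring code may differ.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """1.0 if any gold id is in the top-k ranked ids, else 0.0 (assumed R@k definition)."""
    return 1.0 if any(doc_id in gold for doc_id in ranked[:k]) else 0.0


def mean_recall_at_k(
    rankings: Dict[str, Sequence[str]], gold: Dict[str, Set[str]], k: int = 5
) -> float:
    """Average the per-question hit indicator over every question that has gold labels."""
    return sum(recall_at_k(rankings.get(q, []), gold[q], k) for q in gold) / len(gold)


def differing_questions(
    a: Dict[str, Sequence[str]], b: Dict[str, Sequence[str]], k: int = 5
) -> List[str]:
    """Question ids whose top-k ranking differs between two backends (empty list = parity)."""
    return sorted(q for q in a if list(a[q][:k]) != list(b.get(q, [])[:k]))


# Toy data standing in for per-question rankings from two backends (hypothetical ids).
pg_rankings = {"q1": ["s1", "s2", "s3"], "q2": ["s7", "s8", "s9"]}
chroma_rankings = {"q1": ["s1", "s2", "s3"], "q2": ["s7", "s8", "s9"]}
gold_sessions = {"q1": {"s2"}, "q2": {"s4"}}

pg_score = mean_recall_at_k(pg_rankings, gold_sessions)
chroma_score = mean_recall_at_k(chroma_rankings, gold_sessions)
mismatches = differing_questions(pg_rankings, chroma_rankings)
```

Comparing the rankings themselves, rather than only the aggregate score, is the stronger check: two backends can reach the same R@5 while disagreeing on individual questions.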

2. An HTTP/MCP gateway — palace-daemon

techempower-org/palace-daemon (a fork of @rboarescu's palace-daemon) is the gateway that serves the palace over HTTP and MCP: /search, /search/age-fused (vector + AGE-graph RRF fusion at the HTTP surface), FlashRank cross-encoder rerank, and the MCP tool surface (mempalace_search, mempalace_traverse, diary, drawers, KG). It's what lets multiple clients — agents, evals, an MCP host — share one palace without stepping on each other.
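For readers unfamiliar with RRF, here is a minimal sketch of reciprocal rank fusion over two candidate lists, one from vector search and one from a graph traversal. This is the textbook formula (each list contributes 1 / (k + rank) per document), not the daemon's actual code; the function name, the k=60 constant, and the tie-breaking rule are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, limit: int = 5) -> List[str]:
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each list adds 1 / (k + rank) to a document's score, with rank starting at 1.
    k=60 is the commonly used default, assumed here rather than taken from the daemon.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:limit]


vector_hits = ["d1", "d2", "d3"]  # hypothetical ids from the vector search
graph_hits = ["d3", "d1", "d4"]   # hypothetical ids from the graph traversal
fused = rrf_fuse([vector_hits, graph_hits])
```

Because RRF only looks at ranks, it needs no score normalization between the vector and graph sides, which is one reason it is a common choice for fusing heterogeneous retrievers.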

3. An eval harness — multipass SME

techempower-org/multipass-structural-memory-eval (a fork of M0nkeyFl0wer's multipass-structural-memory-eval) is the structural-memory eval framework that produced the benchmarks here. Its posture is diagnostic, not leaderboard: deltas under controlled A/B/C/D conditions, multi-corpus, locally runnable. It's what we used to measure the fork instead of guessing.

4. An encoder-FT collaboration with @nakata-app

The most productive thread on this repo for us was @nakata-app's adaptmem domain-adaptive fine-tune work. We reproduced it end-to-end and then stress-tested its boundary:

In-domain, it's real and reproduces. nakata's published FT-300 lifts LongMemEval R@5 from 0.966 (raw) → 0.980 (raw+FT) → 0.990 (hybrid_v4+FT). We retrained from his recipe on our own GPU (seed=42, 300/200 split) and the held-out 200q test hit R@5 = 1.000 (R@1 0.925) — inside published noise. The protocol is portable; the in-domain encoder lift is genuine.
It's domain-specific — no free cross-domain transfer. We carried the same FT encoder to a deliberately different corpus (a personal technical KB, code + infra notes + RFCs) and the lift flattened to a robust null: best delta +1.7pp at R@1, R@5 dead flat, 24/29 probes moving exactly 0.0. Two independently trained FT encoders both no-op'd. Read together with nakata's numbers, that's completely consistent: the lift is strong when fine-tune corpus and eval corpus are the same family, and it wants hard-negative re-mining on the target corpus to travel. A boundary on the method, not a knock on it.
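The in-domain lifts and the cross-domain null above reduce to simple arithmetic on the quoted figures. A small sketch, using only numbers stated in this post (the variable and function names are illustrative):

```python
def lift_pp(before: float, after: float) -> float:
    """Absolute lift in percentage points, rounded to one decimal."""
    return round((after - before) * 100, 1)


# In-domain LongMemEval R@5 figures quoted above for nakata's FT-300.
stages = [("raw", 0.966), ("raw+FT", 0.980), ("hybrid_v4+FT", 0.990)]
baseline = stages[0][1]
in_domain_lifts = {name: lift_pp(baseline, score) for name, score in stages[1:]}

# Cross-domain: 24 of 29 probes moved exactly 0.0.
unchanged_share = round(24 / 29, 3)
```

So the in-domain lift is +1.4pp from the fine-tune alone and +2.4pp with hybrid_v4 on top, while on the different corpus roughly 83% of probes did not move at all.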

5. Locating the QA gap: retrieval delivery, not reasoning

Retrieval is near-ceiling. A category-stratified LongMemEval-S A/B (n=150) puts plain /search at R@5 = 0.9267 and /search/age-fused at R@5 = 0.9200 (graph fusion is neutral on representative data: a targeted re-ranker rather than a blanket win); the oracle retrieval ceiling is R@5 = 0.974.
Once retrieval is solved, where does QA go? We first read a ~0.61 "oracle" QA ceiling as a reasoning limit — but that "oracle" wasn't true oracle: context reached the reader through /search at limit=5, and for single-session-assistant questions our upstream-parity ingest dropped the assistant-authored turns (USERONLY, which upstream's own README recommends against). Hand the reader the actual gold (true oracle, evidence sessions verbatim) and five of six categories recover — single-session-assistant 0.32→0.98, temporal 0.36→0.75, knowledge-update 0.70→0.91, multi-session 0.71→0.87, single-session-preference 0.80→0.93 — lifting the ceiling from 0.610 to 0.868, within 0.2pp of the published GPT-4o oracle (0.870). (The canonical GPT-4o-style judge is now wired in; it fixed a real preference/abstention scoring confound, but the true-oracle test is what moved the number.)
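To make the two points above concrete, the sketch below works through the arithmetic using only figures quoted in this post: what the A/B rates mean as question counts at n=150, and how much each category gained when the reader was handed the true-oracle context. Names are illustrative, and the per-category question counts are not given here, so no weighted average is attempted.

```python
def hits_from_rate(rate: float, n: int) -> int:
    """Convert a reported hit rate back to a whole number of questions."""
    return round(rate * n)


# A/B at n=150: R@5 of 92.67% (plain /search) vs 92.00% (/search/age-fused).
plain_hits = hits_from_rate(0.9267, 150)
fused_hits = hits_from_rate(0.9200, 150)
ab_gap_questions = plain_hits - fused_hits

# Reader QA accuracy per category: (retrieval-delivered context, true-oracle context).
oracle_shift = {
    "single-session-assistant": (0.32, 0.98),
    "temporal": (0.36, 0.75),
    "knowledge-update": (0.70, 0.91),
    "multi-session": (0.71, 0.87),
    "single-session-preference": (0.80, 0.93),
}
gains = {cat: round(after - before, 2) for cat, (before, after) in oracle_shift.items()}
largest_gain = max(gains, key=gains.get)

overall_gain = round(0.868 - 0.610, 3)
gap_to_gpt4o_pp = round((0.870 - 0.868) * 100, 1)
```

Read this way, the A/B difference is a single question out of 150, which is why graph fusion reads as neutral here, and the largest recovery sits in the category where the ingest had dropped the assistant-authored turns.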

This surfaced from a reader floor-lift experiment that returned a null — and trusting the null enough to chase it is exactly what exposed the confound and corrected a number we'd already published.

The bottom line

Pulling it together: MemPalace is a competitive, honestly-measured memory system, and the headroom to frontier-quality memory is in its own hands. Retrieval matches the field (R@5 0.927 representative, 0.966 in byte-identical parity with upstream — the Postgres+pgvector+AGE move cost zero recall); handed the right context the reader essentially matches GPT-4o's oracle (0.868 vs 0.870); so the lever is what the memory system delivers — ingestion fidelity and retrieval breadth — not a bigger model. The part worth keeping is the method: SME can say that precisely because it decomposes the QA gap layer by layer and scores structural qualities — ingestion integrity, gap-detection, ontology coherence, invocation discipline — that no leaderboard captures, and because it was willing to chase a null and correct its own number in public to get there. 🫏

Full writeups, per-category tables, and reproducers are on the results page: https://techempower-org.github.io/multipass-structural-memory-eval/site/#benchmarks

Happy to share artifacts, scripts, or run independent reproductions. Thanks to the upstream maintainers each of these builds on: the MemPalace team for an architecture clean enough to fork and measure, @rboarescu for the palace-daemon the gateway forks, @M0nkeyFl0wer for the multipass-structural-memory-eval framework the harness forks, and @nakata-app for a protocol reproducible enough to push to its boundary.

A special thanks to all of you, MemPalace contributors and users alike! We have more PRs than issues, 3 to 2! 🫏
