kostadis
May 11, 2026

MemPalace parallelism rollout — fork notes

Authors: Kostadis Roussos + Claude (Sonnet/Opus via Claude Code, claude-opus-4-7)
Branch: kostadis/mempalace:kostadis-dev
Status: shipped + verified at scale on a real palace, ready for upstream review/feedback.

What

Six PRs that take MemPalace's serial mining + refinement paths and make them
parallel-capable, with one shared ParallelPipeline harness and a --workers N
flag. The architecture preserves ChromaDB's HNSW single-writer invariant by
routing every collection write through one consumer thread while N producer
threads do the IO-bound work (embedding HTTP, LLM chat HTTP).

| PR | Scope | Commit |
|---|---|---|
| #6 | Remote embedding providers (Ollama + OpenAI-compat); landed earlier, prerequisite | `79c09fe` |
| #7 | LLM config-mirror (`MEMPALACE_LLM_*` env / config keys) + parallelism design doc + `.gitignore` hygiene | `be1e759` etc. |
| #9 | `ParallelPipeline` harness + parallel `miner._mine_impl` + `--workers` CLI + `urllib3.PoolManager` keep-alive on embedding clients | `cf8e2bb` etc. |
| #10 | Parallel `llm_refine.refine_entities` + `closet_llm.regenerate_closets` + LLM-client keep-alive | `38b0add` etc. |
| #11 | Parallel `convo_miner.mine_convos` | `a2151df` |
| #12 | Pre-existing pagination bug in `closet_llm`, surfaced at >32K drawers | `2dd4729` |

Design doc lives in-tree at docs/design/embrace-parallelism.md.

Why

The trigger was a calibration exercise on a DGX Spark (GB10, 128 GB unified
memory) running vLLM for embeddings + LLM. With a single vLLM client and a
serial mempalace mine, the GPU was idle most of the wallclock. The hypothesis
was that mempalace's serial loops were the bottleneck and parallel HTTP would
unlock the GPU.

The hypothesis was half right (see findings).

How (architecture)

mempalace/parallel.py is a ~250-line module that connects two kinds of threads:

- N producer threads (ThreadPoolExecutor) do file IO + chunking + the
  embedding/LLM HTTP call.
- 1 consumer thread drains a bounded queue.Queue and runs the
  side effect (collection.upsert(embeddings=...), dict merge, etc.).

The single-writer guarantee is the caller's responsibility: the harness only
promises that consumer_fn is invoked from exactly one thread. The miner makes
that thread the only path to collection.upsert(embeddings=...). Passing
precomputed embeddings bypasses ChromaDB's collection-level embedding-function
(EF) call and keeps hnswlib happy with num_threads=1.
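
The producer/consumer shape described above can be sketched roughly as follows. This is a minimal illustration, not the code in mempalace/parallel.py: `run_pipeline`, `fake_embed`, `fake_upsert`, and the `max_queue` parameter are hypothetical names, and the real harness handles more cases (KeyboardInterrupt in particular).

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel telling the consumer to stop


def run_pipeline(items, producer_fn, consumer_fn, workers=8, max_queue=64):
    """Run producer_fn on N threads; call consumer_fn from one thread only."""
    q = queue.Queue(maxsize=max_queue)  # bounded: producers block when full
    errors = []

    def consume():
        while True:
            result = q.get()
            if result is _DONE:
                return
            try:
                consumer_fn(result)  # the only place side effects happen
            except Exception as exc:
                errors.append(exc)  # keep draining so producers never block forever

    def produce(item):
        q.put(producer_fn(item))

    consumer = threading.Thread(target=consume, name="consumer")
    consumer.start()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces iteration so producer exceptions surface here
            list(pool.map(produce, items))
    finally:
        q.put(_DONE)
        consumer.join()
    if errors:
        raise errors[0]


# Demo with stand-ins for the embedding call and the collection write.
seen_threads = set()
store = {}


def fake_embed(text):  # hypothetical stand-in for the embedding HTTP call
    return text, [float(len(text))]


def fake_upsert(result):  # hypothetical stand-in for collection.upsert(...)
    seen_threads.add(threading.get_ident())
    key, vector = result
    store[key] = vector


run_pipeline(["a", "bb", "ccc"], fake_embed, fake_upsert, workers=4)
```

After the demo runs, `seen_threads` holds a single thread id no matter how many producers ran, which is the property the miner relies on.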

KeyboardInterrupt propagates correctly through the pool — pinned by tests.

Worker count defaults to MempalaceConfig().workers:

- 1 for embedding_provider=onnx (GIL-bound, no win from concurrency).
- 8 for ollama / openai-compat (HTTP releases the GIL on the socket wait).

Override via --workers N CLI flag or MEMPALACE_WORKERS env.
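
As a rough sketch of how that resolution could look: the function names below are hypothetical, the provider strings are taken from the text above, and the precedence (CLI flag over env over provider default) plus the fallback of 1 for unknown providers are assumptions rather than confirmed behaviour of MempalaceConfig.

```python
import os

# Providers that spend their time waiting on HTTP (names as written in this post).
REMOTE_PROVIDERS = {"ollama", "openai-compat"}


def default_workers(embedding_provider):
    """Provider-based default: 8 for remote HTTP providers, otherwise 1."""
    return 8 if embedding_provider in REMOTE_PROVIDERS else 1


def resolve_workers(embedding_provider, cli_workers=None, env=None):
    """Assumed precedence: --workers flag, then MEMPALACE_WORKERS, then default."""
    env = os.environ if env is None else env
    if cli_workers is not None:
        return max(1, int(cli_workers))
    raw = env.get("MEMPALACE_WORKERS")
    if raw:
        return max(1, int(raw))
    return default_workers(embedding_provider)


onnx_default = resolve_workers("onnx", env={})
remote_default = resolve_workers("ollama", env={})
env_override = resolve_workers("onnx", env={"MEMPALACE_WORKERS": "4"})
cli_override = resolve_workers("ollama", cli_workers=2, env={"MEMPALACE_WORKERS": "4"})
```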

Empirical findings — Spark + vLLM

This is the interesting part. The architecture works; whether you see wallclock
speedup depends on whether the endpoint is throughput-bound or
latency-bound.

Mining (embedding): ~1×, no speedup, GPU already saturated

| Run | Wallclock |
|---|---|
| `workers=1`, CG corpus (183 files, 3550 drawers) | 1m 19s |
| `workers=8`, same corpus | 1m 23s |

Total embed tokens ≈ 700K. Serial HTTP wait ≈ 55s → 12.7K tok/s, right at the
~11.4K tok/s ceiling we measured separately for nomic-embed-text-v1.5 on a
single Spark GPU. One vLLM embedding client already saturates the GPU.
Eight clients can't extract more aggregate throughput.
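
The throughput figure is plain arithmetic on the approximate numbers above. Both inputs are rounded, so landing slightly over 100% of the separately measured ceiling is within measurement noise rather than evidence of headroom.

```python
embed_tokens = 700_000     # approximate total embed tokens for the run
serial_http_wait_s = 55    # approximate serial HTTP wait, in seconds
measured_ceiling = 11_400  # tok/s ceiling measured separately

throughput = embed_tokens / serial_http_wait_s
utilization = throughput / measured_ceiling

print(f"{throughput:,.0f} tok/s, {utilization:.0%} of measured ceiling")
# prints: 12,727 tok/s, 112% of measured ceiling
```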

Confirmed at scale: a from-scratch re-mine of 906 source files / 90,435
drawers (CG + mytools + this repo) ran in 24m 33s at workers=8 — same band
as workers=1. Parallelism gives correct results, not faster results, on this
workload.

LLM refinement / closet regeneration: roughly 4× to 6× speedup against vLLM chat

LLM token generation is autoregressive — one stream uses a fraction of the
GPU's compute waiting on each token. vLLM's continuous batching extracts real
concurrency gains on this workload.

Synthetic bench (200 fake candidates, 8 batches), refine_entities against
vLLM-chat / Qwen2.5-14B-Instruct-AWQ:

| Workers | Wallclock |
|---|---|
| 1 | 530.85s |
| 8 | 127.12s |

About 4.2× speedup (530.85s / 127.12s).

Real-corpus bench (closet_llm on 16 real source files from campaign-dev,
heterogeneous sizes from 790 chars to 56K chars):

| Workers | Wallclock | Succeeded | Failed |
|---|---|---|---|
| 1 | 5m 09s | 15/16 | 1 |
| 8 | 52s | 15/16 | 1 |

About 5.9× speedup (309s / 52s). This beats the synthetic bench because
heterogeneous prompt sizes give vLLM's continuous batching better utilization
(small responses don't block large ones, gaps fill naturally).
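
For reference, the speedup ratios fall straight out of the wallclock numbers in the tables above (serial time divided by parallel time):

```python
def speedup(serial_s, parallel_s):
    """Wallclock speedup: values above 1.0 mean the parallel run was faster."""
    return serial_s / parallel_s


mining = speedup(60 + 19, 60 + 23)       # 1m 19s vs 1m 23s
synthetic = speedup(530.85, 127.12)      # refine_entities synthetic bench
real_corpus = speedup(5 * 60 + 9, 52)    # closet_llm real-corpus bench

for name, value in [("mining", mining), ("synthetic", synthetic), ("real corpus", real_corpus)]:
    print(f"{name}: {value:.2f}x")
```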

Honest gotchas

- The very first attempt to verify Phase 3 went via Ollama on port 11434, not
  vLLM on port 8001. Ollama serializes chat by default (OLLAMA_NUM_PARALLEL
  applies to chat but had model-load contention in our test), so we measured
  almost no speedup, a misleading result until we realized we'd benched the
  wrong endpoint. Lesson: read your own setup doc before benchmarking.
- One source file in the closet_llm bench (planning.py, 56K chars) failed in
  both runs, which we attribute to qwen2.5-14B-AWQ's 32K context window.
  Pre-existing scaling issue, not a parallelism bug. Workaround would be
  content-aware chunking before sending; not done here.
- closet_llm had a latent pagination bug: drawers_col.get(limit=total)
  binds one SQL parameter per drawer id and exceeds SQLITE_MAX_VARIABLE_NUMBER
  at ~32K drawers. PR #12 fixes it. Independent of this rollout but surfaced
  by it.
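
To illustrate the general shape of a paginated read (not necessarily what PR #12 does), here is a sketch that pages through a collection with the `limit` / `offset` arguments of ChromaDB's `Collection.get`. `iter_drawers`, the `page_size` of 1000, and the in-memory `FakeCollection` are all hypothetical; the sketch also assumes ordering is stable across pages.

```python
def iter_drawers(collection, page_size=1000, include=("documents", "metadatas")):
    """Yield (id, document, metadata) in fixed-size pages instead of one big get()."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset, include=list(include))
        ids = page["ids"]
        if not ids:
            return
        docs = page.get("documents") or [None] * len(ids)
        metas = page.get("metadatas") or [None] * len(ids)
        yield from zip(ids, docs, metas)
        if len(ids) < page_size:  # short page means we reached the end
            return
        offset += len(ids)


class FakeCollection:
    """In-memory stand-in exposing only get(limit=, offset=, include=)."""

    def __init__(self, n):
        self._ids = [f"drawer-{i}" for i in range(n)]

    def get(self, limit=None, offset=0, include=None):
        ids = self._ids[offset:offset + limit]
        return {
            "ids": ids,
            "documents": [f"doc for {i}" for i in ids],
            "metadatas": [{} for _ in ids],
        }


rows = list(iter_drawers(FakeCollection(2500), page_size=1000))
```

Each request stays well below the SQLite variable limit regardless of how many drawers the palace holds.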

Test footprint

| Metric | Baseline (pre-PR-#9) | After #11 + #12 |
|---|---|---|
| Passing | 1702 | 1748 |
| Failing | 6 (pre-existing, tracked in issue #8) | 6 (same) |
| New tests added | n/a | +46 |
| Regressions introduced | n/a | 0 |

New test surfaces:

- tests/test_parallel_pipeline.py: unit tests for the harness, including
  the single-thread-consumer invariant.
- tests/test_miner_parallel.py: integration test that wraps
  ChromaCollection.upsert to assert all calls share one thread id.
- tests/test_embedding_pool.py: pool-path tests for the urllib3 keep-alive
  in both embedding clients + the LLM client.
- Plumbing assertions in tests/test_config.py, tests/test_cli.py,
  tests/test_convo_miner_unit.py, tests/test_llm_refine.py.
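
The thread-id assertion can be sketched like this. It is not the code in tests/test_miner_parallel.py: `FakeCollection`, `mine`, and `upsert_thread_ids` are hypothetical stand-ins so the example runs without ChromaDB, and the real test wraps ChromaCollection.upsert instead.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from unittest import mock


class FakeCollection:
    """Stand-in for the collection; only upsert(ids=, embeddings=) exists."""

    def __init__(self):
        self.rows = {}

    def upsert(self, ids, embeddings):
        self.rows.update(zip(ids, embeddings))


def mine(collection, texts, workers=4):
    """Tiny stand-in for the miner: producers embed, one thread upserts."""
    q = queue.Queue(maxsize=8)

    def writer():
        while (item := q.get()) is not None:
            collection.upsert(ids=[item[0]], embeddings=[item[1]])

    t = threading.Thread(target=writer)
    t.start()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda s: q.put((s, [float(len(s))])), texts))
    finally:
        q.put(None)  # sentinel: tell the writer to stop
        t.join()


def upsert_thread_ids(texts, workers):
    """Run mine() with upsert wrapped so every call records its thread id."""
    collection = FakeCollection()
    seen = []
    real_upsert = collection.upsert

    def recording_upsert(*args, **kwargs):
        seen.append(threading.get_ident())
        return real_upsert(*args, **kwargs)

    with mock.patch.object(collection, "upsert", side_effect=recording_upsert):
        mine(collection, texts, workers=workers)
    return seen, collection.rows


seen, rows = upsert_thread_ids([f"t{i}" for i in range(10)], workers=4)
assert len(set(seen)) == 1, "all upserts must come from one thread"
```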

What's deferred

- Retire the MEMPALACE_HTTP_KEEPALIVE=0 env shim: it exists so existing
  tests that patch mempalace.embedding_*.urlopen keep passing without
  modification. Should be removed once those tests are ported to patch
  urllib3.PoolManager.request. Mechanical follow-up.
- dedup.py and repair.py still have serial read-heavy chromadb loops.
  Thread-pool friendly, lower priority.
- Pre-existing failures in issue #8 (multiple test_corpus_origin_integration
  failures, plus test_hnsw_capacity and test_save_hook_mines): all unrelated
  to this work but tracked.

Open questions for the dev team

- Workers default: current code sets workers=8 for remote embedding
  providers. With the GPU-saturation finding, that's wasteful on the
  embedding path (no benefit) but right for LLM. Is there appetite for a
  separate embedding_workers / llm_workers split?
- Test-shim removal: happy to do the port to urllib3.PoolManager.request
  patching as a follow-up, but wanted to land the parallelism behind the
  shim first to keep each PR focused.
- Upstream-ability: this work lives on kostadis-dev. If MemPalace
  would take it, what does the upstream review process look like?

So I decided to hook up mempalace with my DGX Spark and made some of the system go in parallel #1446
