So I decided to hook up mempalace with my DGX Spark and made some of the system go in parallel #1446

kostadis started this conversation in Ideas

MemPalace parallelism rollout — fork notes

Authors: Kostadis Roussos + Claude (Sonnet/Opus via Claude Code, claude-opus-4-7)
Branch: kostadis/mempalace:kostadis-dev
Status: shipped + verified at scale on a real palace, ready for upstream review/feedback.

What

Six PRs that take MemPalace's serial mining + refinement paths and make them
parallel-capable, with one shared ParallelPipeline harness and a --workers N
flag. The architecture preserves ChromaDB's HNSW single-writer invariant by
routing every collection write through one consumer thread while N producer
threads do the IO-bound work (embedding HTTP, LLM chat HTTP).

| PR | Scope | Commit |
| --- | --- | --- |
| #6 | Remote embedding providers (Ollama + OpenAI-compat); landed earlier, pre-req | 79c09fe |
| #7 | LLM config-mirror (MEMPALACE_LLM_* env / config keys) + parallelism design doc + .gitignore hygiene | be1e759 etc. |
| #9 | ParallelPipeline harness + parallel miner._mine_impl + --workers CLI + urllib3.PoolManager keep-alive on embedding clients | cf8e2bb etc. |
| #10 | Parallel llm_refine.refine_entities + closet_llm.regenerate_closets + LLM-client keep-alive | 38b0add etc. |
| #11 | Parallel convo_miner.mine_convos | a2151df |
| #12 | Pre-existing pagination bug in closet_llm, surfaced at >32K drawers | 2dd4729 |

Design doc lives in-tree at docs/design/embrace-parallelism.md.

Why

The trigger was a calibration exercise on a DGX Spark (GB10, 128 GB unified
memory) running vLLM for embeddings + LLM. With a single vLLM client and a
serial mempalace mine, the GPU sat idle for most of the wallclock time. The
hypothesis was that mempalace's serial loops were the bottleneck and that
parallel HTTP would unlock the GPU.

The hypothesis was half right (see findings).

How (architecture)

mempalace/parallel.py is a ~250-line module that bridges:

  • N producer threads (ThreadPoolExecutor) do file IO + chunking + the
    embedding/LLM HTTP call.
  • 1 consumer thread drains a bounded queue.Queue and runs the
    side-effect (collection.upsert(embeddings=...), dict merge, etc.).

The single-writer guarantee is the caller's responsibility — the harness only
promises that consumer_fn is invoked from exactly one thread. The miner uses
that thread to be the only path to collection.upsert(embeddings=...), which
bypasses ChromaDB's collection-level EF call and keeps hnswlib happy with
num_threads=1.
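
To make the producer/consumer split concrete, here is a minimal sketch of that shape using only the standard library. It is not the code in mempalace/parallel.py: the names run_pipeline, producer_fn, queue_size, and the error handling are assumptions chosen for illustration. It only shows N pool threads feeding a bounded queue that a single thread drains.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel that tells the consumer thread to stop


def run_pipeline(items, producer_fn, consumer_fn, workers=8, queue_size=32):
    """Run producer_fn on `workers` threads; call consumer_fn from one thread.

    Hypothetical helper, not the real ParallelPipeline API.
    """
    results = queue.Queue(maxsize=queue_size)  # bounded: applies backpressure
    errors = []

    def consume():
        # Keep draining even after a failure so producers never block forever.
        while True:
            item = results.get()
            if item is _DONE:
                return
            try:
                consumer_fn(item)  # the only place side effects happen
            except Exception as exc:
                errors.append(exc)

    def produce(item):
        results.put(producer_fn(item))  # blocks while the queue is full

    consumer = threading.Thread(target=consume, name="pipeline-consumer")
    consumer.start()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces iteration so a producer exception re-raises here.
            list(pool.map(produce, items))
    finally:
        results.put(_DONE)
        consumer.join()
    if errors:
        raise errors[0]
```

In this sketch producer_fn would wrap the chunking plus the embedding HTTP call, and consumer_fn would wrap collection.upsert(embeddings=...). The bounded queue keeps fast producers from buffering an unbounded number of embedded chunks in memory while the writer catches up.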

KeyboardInterrupt propagates correctly through the pool — pinned by tests.

Worker count defaults to MempalaceConfig().workers:

  • 1 for embedding_provider=onnx (GIL-bound, no win from concurrency).
  • 8 for ollama / openai-compat (HTTP releases the GIL on the socket wait).

Override via --workers N CLI flag or MEMPALACE_WORKERS env.
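
A rough sketch of how such a default could be resolved is below. The function name resolve_workers and the precedence order (CLI flag, then MEMPALACE_WORKERS, then the provider default) are assumptions for illustration; the post does not spell out how MempalaceConfig orders the two overrides.

```python
import os

# Defaults as described above: 1 for onnx, 8 for the HTTP-backed providers.
_DEFAULT_WORKERS = {"onnx": 1, "ollama": 8, "openai-compat": 8}


def resolve_workers(provider, cli_workers=None, env=None):
    """Pick a worker count. Hypothetical helper, not MempalaceConfig itself."""
    env = os.environ if env is None else env
    if cli_workers is not None:          # assumed: --workers N wins
        return max(1, int(cli_workers))
    raw = env.get("MEMPALACE_WORKERS")   # assumed: env var comes second
    if raw:
        return max(1, int(raw))
    return _DEFAULT_WORKERS.get(provider, 1)  # unknown provider: stay serial
```

For example, resolve_workers("ollama", env={}) gives 8, while resolve_workers("ollama", env={"MEMPALACE_WORKERS": "3"}) gives 3.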

Empirical findings — Spark + vLLM

This is the interesting part. The architecture works; whether you see wallclock
speedup depends on whether the endpoint is throughput-bound or
latency-bound.

Mining (embedding): about 1× (no speedup, GPU already saturated)

| Run | Wallclock |
| --- | --- |
| workers=1, CG corpus (183 files, 3550 drawers) | 1m 19s |
| workers=8, same corpus | 1m 23s |

Total embed tokens ≈ 700K. Serial HTTP wait ≈ 55s → 12.7K tok/s, right at the
~11.4K tok/s ceiling we measured separately for nomic-embed-text-v1.5 on a
single Spark GPU. One vLLM embedding client already saturates the GPU.
Eight clients can't extract more aggregate throughput.
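
The saturation argument is just arithmetic on the numbers above, sketched here for anyone who wants to plug in their own measurements. The variable names are ours; the inputs are the rounded figures quoted in this section.

```python
# Rounded figures quoted in this section.
total_embed_tokens = 700_000   # approximate tokens embedded in the run
serial_http_wait_s = 55        # approximate time spent waiting on HTTP, workers=1
measured_ceiling = 11_400      # tok/s ceiling measured separately on one GPU

# Throughput a single serial client already achieves.
serial_throughput = total_embed_tokens / serial_http_wait_s

# A ratio near (or above) 1.0 means one client is already at the ceiling,
# so adding clients cannot raise aggregate throughput.
ratio_to_ceiling = serial_throughput / measured_ceiling

print(f"{serial_throughput:,.0f} tok/s, {ratio_to_ceiling:.2f}x of ceiling")
```

Because both inputs are approximate, a ratio slightly above 1.0 should be read as "at the ceiling", not as exceeding it.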

Confirmed at scale: a from-scratch re-mine of 906 source files / 90,435
drawers (CG + mytools + this repo) ran in 24m 33s at workers=8 — same band
as workers=1. Parallelism gives correct results, not faster results, on this
workload.

LLM refinement / closet regeneration: 4.2× to 5.9× speedup against vLLM chat

LLM token generation is autoregressive — one stream uses a fraction of the
GPU's compute waiting on each token. vLLM's continuous batching extracts real
concurrency gains on this workload.

Synthetic bench (200 fake candidates, 8 batches), refine_entities against
vLLM-chat / Qwen2.5-14B-Instruct-AWQ:

| Workers | Wallclock |
| --- | --- |
| 1 | 530.85s |
| 8 | 127.12s |

About 4.2× speedup (530.85s / 127.12s).

Real-corpus bench (closet_llm on 16 real source files from campaign-dev,
heterogeneous sizes from 790 chars to 56K chars):

| Workers | Wallclock | Succeeded | Failed |
| --- | --- | --- | --- |
| 1 | 5m 09s | 15/16 | 1 |
| 8 | 52s | 15/16 | 1 |

About 5.9× speedup (309s / 52s). This beats the synthetic bench because
heterogeneous prompt sizes give vLLM's continuous batching better utilization
(small responses don't block large ones, gaps fill naturally).
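
For reference, the speedup figures in these benches are plain wallclock ratios. The short sketch below recomputes them from the tables in this section; the helper names are ours.

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a 'Xm Ys' wallclock reading to seconds."""
    return minutes * 60 + seconds


def speedup(serial_s, parallel_s):
    """Wallclock speedup of the parallel run over the serial run."""
    return serial_s / parallel_s


# Mining (embedding): workers=1 vs workers=8 on the CG corpus.
mining = speedup(to_seconds(1, 19), to_seconds(1, 23))

# Synthetic refine_entities bench.
synthetic = speedup(530.85, 127.12)

# Real-corpus closet_llm bench.
real_corpus = speedup(to_seconds(5, 9), to_seconds(0, 52))

for name, value in [("mining", mining), ("synthetic", synthetic),
                    ("real corpus", real_corpus)]:
    print(f"{name}: {value:.2f}x")
```

A value just under 1.0 for mining means the 8-worker run was marginally slower, which is consistent with a GPU that is already saturated by one client.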

Honest gotchas

  • The very first attempt to verify Phase 3 went via Ollama port 11434, not
    vLLM port 8001. Ollama serializes chat by default (OLLAMA_NUM_PARALLEL
    applies to chat but had model-load contention in our test), so we measured
    a misleading "almost no speedup" until we realized we'd benched the
    wrong endpoint. Lesson: read your own setup doc before benchmarking.

  • One source file in the closet_llm bench (planning.py, 56K chars) failed in
    both runs, which we attribute to qwen2.5-14B-AWQ's 32K context window.
    This is a pre-existing scaling issue, not a parallelism bug. A workaround
    would be content-aware chunking before sending; not done here.

  • closet_llm had a latent pagination bug (drawers_col.get(limit=total)
    binds one SQL parameter per drawer id, which exceeds
    SQLITE_MAX_VARIABLE_NUMBER at ~32K drawers). PR #12 fixes it. Independent
    of this rollout but surfaced by it.
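
For readers hitting the same limit, one common way to avoid a single huge get() is to page through the collection with limit and offset, which ChromaDB's Collection.get accepts. The sketch below shows that pattern only; it is not necessarily how PR #12 fixes it, and iter_drawer_pages and the page size of 1000 are illustrative assumptions.

```python
def iter_drawer_pages(collection, page_size=1000, include=None):
    """Yield get() results in fixed-size pages instead of one big request.

    Hypothetical helper. `collection` is anything with a ChromaDB-style
    get(limit=..., offset=..., include=...) that returns a dict with "ids".
    """
    offset = 0
    while True:
        kwargs = {"limit": page_size, "offset": offset}
        if include is not None:
            kwargs["include"] = include
        page = collection.get(**kwargs)
        ids = page.get("ids") or []
        if not ids:
            return              # nothing left
        yield page
        if len(ids) < page_size:
            return              # short page: this was the last one
        offset += len(ids)
```

A caller would then loop with `for page in iter_drawer_pages(drawers_col): ...` and process page["ids"] per batch, keeping each underlying query well below the per-statement parameter limit.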

Test footprint

| Metric | Baseline (pre-PR-#9) | After #11 + #12 |
| --- | --- | --- |
| Passing | 1702 | 1748 |
| Failing | 6 (pre-existing, tracked in issue #8) | 6 (same) |
| New tests added | | +46 |
| Regressions introduced | | 0 |

New test surfaces:

  • tests/test_parallel_pipeline.py — unit tests for the harness, including
    the single-thread-consumer invariant.
  • tests/test_miner_parallel.py — integration test that wraps
    ChromaCollection.upsert to assert all calls share one thread id.
  • tests/test_embedding_pool.py — pool-path tests for the urllib3 keep-alive
    in both embedding clients + the LLM client.
  • Plumbing assertions in tests/test_config.py, tests/test_cli.py,
    tests/test_convo_miner_unit.py, tests/test_llm_refine.py.
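
The single-writer assertion can be sketched roughly as follows. This is an illustration of the idea, not the code in tests/test_miner_parallel.py: record_threads and FakeCollection are hypothetical names, and the sketch records thread objects instead of raw ids because Python may recycle a thread id after a thread exits.

```python
import threading
from functools import wraps


def record_threads(fn, seen):
    """Wrap fn so every call records the calling thread in `seen`."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        seen.add(threading.current_thread())
        return fn(*args, **kwargs)
    return wrapper


class FakeCollection:
    """Stand-in for a collection; only counts upsert calls."""
    def __init__(self):
        self.upserts = 0

    def upsert(self, **kwargs):
        self.upserts += 1


def demo():
    seen = set()
    col = FakeCollection()
    col.upsert = record_threads(col.upsert, seen)  # spy on the write path

    # One dedicated writer thread performs every upsert.
    writer = threading.Thread(
        target=lambda: [col.upsert(ids=[str(i)]) for i in range(5)]
    )
    writer.start()
    writer.join()
    return col.upserts, len(seen)
```

In a real test the wrapper would be patched onto the collection before mining with workers > 1, and the assertion would be that exactly one thread was recorded.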

What's deferred

  • Retire the MEMPALACE_HTTP_KEEPALIVE=0 env shim: it exists so existing
    tests that patch mempalace.embedding_*.urlopen keep passing without
    modification. Should be removed once those tests are ported to patch
    urllib3.PoolManager.request. Mechanical follow-up.
  • dedup.py and repair.py still have serial read-heavy chromadb loops.
    Thread-pool friendly, lower priority.
  • Pre-existing failures in issue #8 (repeated test_corpus_origin_integration
    failures + test_hnsw_capacity + test_save_hook_mines): all unrelated to
    this work but tracked.

Open questions for the dev team

  1. Workers default — current code sets workers=8 for remote embedding
    providers. With the GPU-saturation finding, that's wasteful on the
    embedding path (no benefit) but right for LLM. Is there appetite for a
    separate embedding_workers / llm_workers split?
  2. Test-shim removal — happy to do the port to urllib3.PoolManager.request
    patching as a follow-up, but wanted to land the parallelism behind the
    shim first to keep each PR focused.
  3. Upstream-ability — this work lives on kostadis-dev. If MemPalace
    would take it, what does the upstream review process look like?