Releases: wolverin0/memorymaster
v3.19.0 — Phase 0 hardening (H1+H2+H3+H4)
39ac4ff
Highlights
Phase 0 hardening release. Closes all four security/ops gaps from the GPT-5.4 review against docs/ROADMAP.md Phase 0. All four mechanisms ship opt-in by default — zero breaking changes for callers that don't set the new env vars.
- H1 (#113) — per-cycle LLM budget caps with reason-coded hard stops + per-provider circuit breaker
- H2 (#114) — dashboard HTTP auth (viewer/operator roles) + CSRF + bind-safety refusal
- H3 (#115) — webhook HMAC-SHA-256 signing + timestamp + 5-min replay window
- H4 (#116) — MCP db/workspace path allowlist + admin-mode bypass
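As a rough illustration of the H3 mechanism, a webhook receiver could verify a signed payload along the following lines. The release notes only state HMAC-SHA-256, a timestamp, and a 5-minute replay window; the `<timestamp>.<body>` signing string and the `sign` / `verify` function names are assumptions made for this sketch, not the project's actual wire format.

```python
import hashlib
import hmac
import time
from typing import Optional

REPLAY_WINDOW_SECONDS = 5 * 60  # 5-minute replay window from the release notes


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hypothetical scheme: HMAC-SHA-256 over b"<timestamp>.<body>"."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Reject stale timestamps first, then compare digests in constant time."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > REPLAY_WINDOW_SECONDS:
        return False  # outside the replay window
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Binding the timestamp into the signed message means an attacker cannot replay an old body with a fresh timestamp, and `hmac.compare_digest` avoids leaking how many leading characters matched.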
Env vars reference
| Env var | Default | Purpose |
|---|---|---|
| `MEMORYMASTER_MAX_LLM_CALLS_PER_CYCLE` | 0 (unlimited) | H1 cycle call cap |
| `MEMORYMASTER_MAX_TOKENS_PER_CYCLE` | 0 (unlimited) | H1 cycle token cap |
| `MEMORYMASTER_MAX_PROVIDER_FAILURES_PER_CYCLE` | 0 (unlimited) | H1 per-provider breaker |
| `MEMORYMASTER_DASHBOARD_TOKEN_VIEWER` | unset (legacy) | H2 read-only bearer |
| `MEMORYMASTER_DASHBOARD_TOKEN_OPERATOR` | unset (legacy) | H2 mutating bearer |
| `MEMORYMASTER_DASHBOARD_UNSAFE_BIND` | unset (refuse) | H2 non-loopback escape |
| `MEMORYMASTER_WEBHOOK_SECRET` | unset (no sig) | H3 HMAC signing key |
| `MEMORYMASTER_MCP_DB_ALLOWLIST` | unset (allow all) | H4 DB path allowlist |
| `MEMORYMASTER_MCP_WORKSPACE_ALLOWLIST` | unset (allow all) | H4 workspace allowlist |
| `MEMORYMASTER_MCP_ADMIN_MODE` | unset (enforce) | H4 allowlist bypass |
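The "0 (unlimited)" convention for the H1 caps can be sketched as follows. This is a minimal illustration, assuming a hypothetical `CycleBudget` class and made-up reason codes; only the two environment variable names and the zero-means-unlimited default come from the table.

```python
import os
from dataclasses import dataclass
from typing import Optional


def _cap(name: str) -> int:
    """Read an integer cap from the environment; unset, empty, or 0 means unlimited."""
    return int(os.environ.get(name, "0") or "0")


@dataclass
class CycleBudget:  # hypothetical helper, not the project's class
    max_calls: int = 0
    max_tokens: int = 0
    calls: int = 0
    tokens: int = 0

    @classmethod
    def from_env(cls) -> "CycleBudget":
        return cls(
            max_calls=_cap("MEMORYMASTER_MAX_LLM_CALLS_PER_CYCLE"),
            max_tokens=_cap("MEMORYMASTER_MAX_TOKENS_PER_CYCLE"),
        )

    def record(self, tokens_used: int) -> None:
        """Account for one completed LLM call."""
        self.calls += 1
        self.tokens += tokens_used

    def stop_reason(self) -> Optional[str]:
        """Return a reason code once a cap is reached, else None."""
        if self.max_calls and self.calls >= self.max_calls:
            return "max_llm_calls"  # hypothetical reason code
        if self.max_tokens and self.tokens >= self.max_tokens:
            return "max_tokens"  # hypothetical reason code
        return None
```

A cycle loop would call `stop_reason()` before each LLM request and halt with that code when it is not `None`; with both caps left at 0 the check never fires, which matches the opt-in default.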
Tests
63 new tests, zero regressions on pre-existing suites.
- `test_llm_budget.py`: 8 tests
- `test_dashboard_auth.py`: 25 tests (19 unit + 6 end-to-end HTTP)
- `test_webhook_hmac.py`: 13 tests
- `test_mcp_path_policy.py`: 17 tests (12 unit + 5 chokepoint integration)
What's next
- v3.20.0 — Phase 1 storage discipline (versioned migrations + SQLite/Postgres parity gate)
- A1 full LongMemEval-S QA-accuracy publication run — mechanism shipped in v3.18.0 (#109), now safer with H1 budget caps in place
v3.18.0 - LongMemEval-S R@5 0.972, claude_cli judge unlocked
f9e7b82
Highlights
Retrieval R@5 lifted 0.966 → 0.972 (+0.006) with per-question-type weight profiles (PR #110). Single-session-preference bucket alone: 0.80 → 0.90 (+0.10), every other bucket unchanged. First non-NULL retrieval improvement since v3.15.0.
claude_cli judge provider (PR #109) unblocks the full LongMemEval-S QA-accuracy bench for OAuth-only environments — no API keys required.
ROADMAP added (docs/ROADMAP.md) — pivots away from benchmark grinding toward differentiator surfaces (governance UI, MCP observability, postgres parity).
What's in this release
Added
- Per-question-type retrieval weight profiles: `MEMORYMASTER_RETRIEVAL_PROFILE_<TYPE>=lex,conf,fresh,vec` env-var family. Opt-in and isolated: no behavior change when unset.
- `claude_cli` judge provider in `tests/bench_longmemeval.py:JudgeClient`, routing through Claude Code OAuth.
Improved
- Single-session-preference bucket R@5: 0.8000 → 0.9000 with the validated `SINGLE_SESSION_PREFERENCE=0.10,0.10,0.10,0.70` profile.
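To make the profile format concrete, the sketch below parses one `lex,conf,fresh,vec` value and applies it as a weighted sum. The function names, the hyphen-to-underscore normalisation of the question type, and the weighted-sum scoring are assumptions for illustration; the notes only specify the env-var family and the four-field order.

```python
import os
from typing import Dict, Mapping, Optional

WEIGHT_NAMES = ("lex", "conf", "fresh", "vec")  # field order from the env-var format


def load_profile(question_type: str,
                 env: Mapping[str, str] = os.environ) -> Optional[Dict[str, float]]:
    """Return the weight profile for a question type, or None when unset."""
    suffix = question_type.upper().replace("-", "_")  # assumed normalisation
    raw = env.get(f"MEMORYMASTER_RETRIEVAL_PROFILE_{suffix}")
    if not raw:
        return None  # unset: the caller keeps its default weights
    values = [float(part) for part in raw.split(",")]
    if len(values) != len(WEIGHT_NAMES):
        raise ValueError(f"expected {len(WEIGHT_NAMES)} weights, got {len(values)}")
    return dict(zip(WEIGHT_NAMES, values))


def combined_score(signals: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Assumed scoring rule: plain weighted sum of the four signals."""
    return sum(weights[name] * signals[name] for name in WEIGHT_NAMES)
```

With the validated profile, the vector signal carries 70% of the weight, so a candidate that is strong on vector similarity but weak lexically still ranks highly for that bucket.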
Notes
- `docs/longmemeval-results.md` per-bucket table is stale (will refresh on next publication pass)
- A1 full 500q overnight publication run deferred
PRs
- #109 - feat(bench): claude_cli judge provider (OAuth, no API key required)
- #110 - feat(retrieval): per-question-type weight profiles (S3, +0.006 R@5)
- #111 - chore(release): v3.18.0
Competitive position
LongMemEval-S R@5 = 0.972 leads the small published set (MemPalace 0.966, agentmemory 0.952).
v3.17.1 - steward auto-ingest hook for daydream
552f9e4 Opt-in env-gated hook (MEMORYMASTER_DAYDREAM_INGEST_DIR=/Daydreams) so the existing 6h steward cron also auto-ingests daydream insights. Default OFF. Error-isolated — never breaks the cycle. 4 safety tests pass. See CHANGELOG.md.
v3.17.0 - daydream insights ingest pipeline
718437f Closes the loop: vault -> daydream -> MemoryMaster candidate claims -> steward validation -> wiki-absorb -> vault. New CLI: python -m memorymaster --db <db> ingest-daydream <vault>/Daydreams. See CHANGELOG.md and docs/daydream-integration.md.
v3.16.0 - S1 architectural unblock + S2 honest-null
d22dd53 v3.16.0 ships the retrieval-weights architectural unblock (S1, KEEP) and a documented honest-null on RRF-as-tiebreaker (S2, NULL). R@5 unchanged at 0.966, still leading agentmemory. See CHANGELOG.md.
v3.15.1 - README benchmark chart + v3.16 roadmap
161246a Docs-only release on top of v3.15.0. Adds:
- README Benchmarks section with inline SVG chart comparing v3.14, v3.15, and agentmemory
- docs/benchmark-longmemeval.svg (chart asset)
- docs/v316-roadmap.md with ranked next-step levers (S1-S3 high-leverage, A1-A3 worth running, B1-B3 deferred)
R@5 = 0.966 from v3.15.0 stands. Production retrieval code unchanged.
v3.15.0 - LongMemEval-S R@5 0.894 -> 0.966
4e0c899 v3.15.0 ships the E01 bench-harness fix (PR #94) that lifted R@5 by +0.072 to 0.966. 5 follow-up experiments documented as honest-null/harm. MemoryMaster now leads agentmemory R@5 + MRR with retrieval-only. See CHANGELOG.md for full breakdown.
v3.14.0 — 29-PR auto-harness session
5b8531d v3.14.0 release notes
29 PRs merged this session (#61-#89):
D-hypothesis bug fixes (from PR #62 scenario explorer)
- #65 compactor writes artifact before flipping status to archived
- #68 dedup treats object_value mismatch as conflict, not duplicate
- #69 compact_summaries redacts claim text via sensitivity filter before LLM egress
Features
- #71 dashboard /healthz + /readyz operational endpoints
- #73 sensitivity filter broadened (bearer, JWT, DB URIs, AWS keys, GitHub PATs)
- #77 vault_linter detects orphan wiki articles
- #83 observability prometheus-style counters
- #84 CLI --dry-run flag for compact / dedup / decay
- #85 snapshot round-trip parity tests
- #87 mcp ingest_claim per-source-agent rate limit
- #88 federated_query cross-tenant sensitivity safety
Tests
- #70 dream-bridge sensitivity filter regression suite
- #72 lifecycle supersede ↔ replaced_by symmetry invariant
- #75 wiki-absorb idempotency
- #78 mcp filter-bypass attempt suite (5 categories)
- #79 concurrent supersession race safety
- #80 db_merge deterministic conflict resolution by updated_at
- #82 decay respects pinned=true
Docs / Audits
- #61 STRIDE + OWASP audit (15 findings)
- #62 lifecycle edge-case scenario explorer (24 scenarios across 12 dimensions)
- #63 Windows + git hooks + codex sandbox troubleshooting playbook
- #64 wiki frontmatter compliance audit
- #66 ADR for wiki auto-promote on N validations
- #67 cross-project federation contract
- #74 architecture refresh (module map + data flow)
- #76 SQLite ↔ Postgres schema parity audit
- #81 MCP tool compliance audit (citation / source_agent / sensitivity)
- #86 CLI cookbook (90 subcommands documented)
Quality
- #89 ruff lint clean (46 → 11 errors; remaining 11 are intentional E402)
Generated via auto-harness pattern: 7 dispatch waves of 2-5 codex exec processes in parallel git worktrees, ~4 hours wall-clock orchestration time.
v3.13.0 — Atlas Inbox V1 + audit cleanup + wiki UX + Windows steward fix
24 commits, 65 files, +9 862 lines since v3.12.0. Headline: Atlas Inbox V1 ships as a versioned backend contract for downstream consumers (LifeAgent, etc.). Plus the full F-audit cleanup batch, real provider adapters, wiki UX upgrades, and a critical Windows fix.
Atlas Inbox V1 — WhatsApp → claims/actions → Super Productivity (new headline feature)
End-to-end backend slice for ingesting external sources, extracting candidate claims and reviewable action proposals, and exporting approved tasks. Backend contract only — UI lives in consuming projects.
- Storage (SQLite + Postgres parity): `external_sources`, `source_items`, `evidence_items`, `action_proposals`, `media_retry_queue` tables; `sensitivity` column on source/evidence with allowed values `{none, low, medium, high, redacted}`
- 17 new CLI subcommands: `import-whatsapp`, `extract-atlas-claims`, `propose-actions`, `action-proposals`, `resolve-action-proposal`, `edit-action-proposal`, `label-source-item`, `label-evidence-item`, `enqueue-media-retry`, `process-media-retry-queue`, `record-media-retry-outcome`, `list-media-retries`, `transcribe-source-item`, `ocr-source-item`, `export-actions`, `atlas-version` + `init-db` brought into the contract
- 3 new dashboard endpoints: `GET /api/action-proposals`, `POST /api/action-proposals/status`, `GET /api/atlas/version`
- Versioned contract at v1.5.1: every Atlas envelope carries `meta.atlas_contract_version` and `meta.atlas_subcommand`. Semver enforced; consumers must refuse on major mismatch. Spec: `docs/atlas-api-contract-v1.md`. Machine-pinned by 41 contract tests.
- Real provider adapters behind existing Protocols: `OpenAIWhisperTranscriptionProvider` (stdlib-only urllib + multipart, uses `OPENAI_API_KEY` + `OPENAI_BASE_URL`) and `TesseractOcrProvider` (lazy-import optional dep). Mock providers remain default.
- Media retry queue for connectors that fetch external URLs (e.g. wacli media). Consumer owns the fetch; MemoryMaster owns durable state. Atomic claim via `FOR UPDATE SKIP LOCKED` on Postgres.
- Sensitivity-on-reimport preservation: operator-set labels survive `import-whatsapp` re-runs.
PRs: #20 (vertical slice) + #27 (contract chain v1.0.0 → v1.5.1 collapsed merge of 7 commits).
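On the consumer side, the "refuse on major mismatch" rule might look like the sketch below. The `check_envelope` helper and its error messages are hypothetical; the `meta.atlas_contract_version` field and the 1.5.1 floor are taken from these notes. Pre-release or build suffixes are not handled here.

```python
from typing import Any, Mapping, Tuple

SUPPORTED_MAJOR = 1  # assumption: this consumer targets contract v1.x


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string into a comparable tuple."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def check_envelope(envelope: Mapping[str, Any],
                   minimum: Tuple[int, int, int] = (1, 5, 1)) -> Tuple[int, int, int]:
    """Raise if the envelope's contract version cannot be consumed safely."""
    version = parse_semver(envelope["meta"]["atlas_contract_version"])
    if version[0] != SUPPORTED_MAJOR:
        raise RuntimeError(f"unsupported Atlas contract major version: {version[0]}")
    if version < minimum:
        raise RuntimeError(f"Atlas contract {version} is older than required {minimum}")
    return version
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "1.10.0" as older than "1.5.1".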
Wiki UX (PR #28)
Two patterns borrowed from shannhk/llm-wiki, picked for signal-to-noise:
- `explored: true|false` frontmatter on every wiki article. Operator-set human-review marker, distinct from `confidence`. Default `false` on new articles; preserved on re-absorb when an operator flipped to `true` (operator decisions are sticky).
- Inline `> [!contradiction]` Obsidian callouts rendered at the top of each article body during `wiki-absorb`. Detection shared with `vault_linter._detect_contradictions` so wiki and lint agree.
Skipped: mandatory "Counter-arguments" / "Data gaps" sections — lint-vault already finds these globally.
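The sticky `explored` rule can be expressed as a small merge step at re-absorb time. This is a sketch only: `merge_frontmatter` is a hypothetical name, and representing frontmatter as plain dictionaries is an assumption about how it is held in memory.

```python
from typing import Any, Dict, Optional


def merge_frontmatter(existing: Optional[Dict[str, Any]],
                      regenerated: Dict[str, Any]) -> Dict[str, Any]:
    """Build frontmatter for a (re)absorbed article.

    New articles default to explored=False; an operator-set True on the
    existing article survives regeneration.
    """
    merged = dict(regenerated)  # copy so the caller's dict is not mutated
    merged["explored"] = bool(existing) and existing.get("explored") is True
    return merged
```

Everything else in the frontmatter is taken from the regenerated article, so fields such as `confidence` keep tracking the latest absorb while the human-review marker stays under operator control.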
Audit cleanup batch — 10 F-fixes
The full overnight-audit findings shipped as atomic PRs #10–#19:
- F-1 (#11) `llm_steward.py`: scope filter uses source claim's scope, not literal `"project"`
- F-2 (#10) `auto-ingest` hook delegates to canonical `redact_text` filter (no local regex copy)
- F-3 (#19) `verbatim_store.search_verbatim`: hybrid-merge dedup keys on row id, not `content[:100]` (prevented 25 894-row collisions on templated content)
- F-4 (#15) Sensitivity filter covers `role` + `source_agent` + content jointly
- F-5 (#12) `observe()` default scope auto-derives from cwd via `scope_from_cwd`, not literal `"project"`
- F-6 (#16) `precompact` reads autosave marker via auto-ingest's sanitize form (no double-fire)
- F-7 (#14) Dead `memorymaster-observe.py` removed
- F-8 (#13) `llm_steward` shadow mode treats `would_archive` as terminal (no double-count)
- F-9 (#17) `compactor` distinguishes `scope=None` from `scope='project'` (no false aliasing)
- F-10 (#18) `decay` records event when claim has future `updated_at` (no silent skip)
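The F-3 fix is easiest to see with templated rows that share a long common prefix. The sketch below contrasts the two dedup keys; the row shape and both function names are hypothetical and only illustrate why keying on a content prefix collapses distinct rows.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def merge_by_prefix(rows: Iterable[Row]) -> List[Row]:
    """Old behaviour (as described): dedup on the first 100 characters."""
    seen: Dict[str, Row] = {}
    for row in rows:
        seen.setdefault(row["content"][:100], row)
    return list(seen.values())


def merge_by_id(rows: Iterable[Row]) -> List[Row]:
    """Fixed behaviour (as described): dedup on the row id."""
    seen: Dict[int, Row] = {}
    for row in rows:
        seen.setdefault(row["id"], row)
    return list(seen.values())


# Templated content: identical 100-character header, different payloads.
HEADER = "=" * 100
lexical_hits = [{"id": 1, "content": HEADER + "alpha"}, {"id": 2, "content": HEADER + "beta"}]
vector_hits = [{"id": 1, "content": HEADER + "alpha"}, {"id": 3, "content": HEADER + "gamma"}]
combined = lexical_hits + vector_hits
```

Here `merge_by_prefix(combined)` keeps a single row because every header is identical, while `merge_by_id(combined)` keeps the three distinct rows and still drops the genuine duplicate returned by both searches.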
v3.13.x precision: Jaccard dedupe lifecycle
The dedupe mechanism that's now active in production (PRs #1, #4, #5, #6, #7, #8):
- PR #1 Pre-steward Jaccard dedupe: skip LLM on near-duplicate candidates
- PR #4 Record would-archive pair details in shadow mode (audit trail before going active)
- PR #5 v3.13.1: wire dedupe into `MemoryService.run_cycle` (the actual cron path; was only in standalone CLI)
- PR #6 v3.13.2: match dedupe scan size to validator (200, not `policy_limit`)
- PR #7 `scripts/audit_dedupe_precision.py` for ongoing precision audits
- PR #8 Repair broken dedup query (was scanning FTS5 for hash)
Steward hooks promoted dedupe shadow mode → active on 2026-05-01 after 5/5 spot-check precision on real data.
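For readers unfamiliar with the technique, a pre-steward Jaccard check can be sketched as below. The word-level tokenisation, the 0.8 threshold, and the function names are assumptions for illustration; the notes do not state the production tokeniser or threshold.

```python
import re
from typing import Iterable, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens (assumed tokenisation)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_near_duplicate(candidate: str, existing: Iterable[str],
                      threshold: float = 0.8) -> bool:
    """True when the candidate is close enough to any existing claim to skip the LLM."""
    return any(jaccard(candidate, other) >= threshold for other in existing)
```

Because the check is pure set arithmetic, it is cheap enough to run before every steward LLM call, which is the point of placing it ahead of the judgment step.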
Windows steward fix (PR #30)
Critical for headless scheduled-task contexts:
- `_call_claude_cli` now passes `subprocess.CREATE_NO_WINDOW` on Windows
- Without this, when a parent process has no console (`pythonw.exe`, services), Windows creates a NEW console for every `subprocess.run` child, producing one popup window per claim during steward LLM-judgment cycles
- No-op on POSIX (creationflags is Windows-specific)
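The pattern described above can be sketched in a few lines. `run_headless` is a hypothetical wrapper written for this example, not the project's `_call_claude_cli`; it shows one way to pass the flag on Windows while staying valid on POSIX, where a non-zero `creationflags` is rejected.

```python
import subprocess
import sys
from typing import List


def run_headless(cmd: List[str]) -> "subprocess.CompletedProcess[str]":
    """Run a child process without letting Windows allocate a console window."""
    # CREATE_NO_WINDOW is only defined on Windows builds of Python,
    # so fall back to 0 (no flags) everywhere else.
    flags = getattr(subprocess, "CREATE_NO_WINDOW", 0) if sys.platform == "win32" else 0
    return subprocess.run(cmd, capture_output=True, text=True, creationflags=flags)
```

Calling `run_headless([sys.executable, "-c", "print('ok')"])` behaves identically on both platforms; the only difference is that a console-less Windows parent no longer spawns a visible window for the child.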
Codex review cleanup (PR #29)
- `*.db.bak*` and `*.stackdump` added to `.gitignore`
- `extract_entities.py`: hardcoded `G:/_OneDrive/...` paths replaced with argparse + repo-relative defaults
- `extract_l2_refined.py`: regex `\n`/`\v` (control chars) corrected to `\b` (word boundaries); the control characters had been silently dropping `agent coordination` and `vector search` matches
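The notes do not show the original pattern, so the following is only an illustration of the general failure mode: a control character sitting where a word boundary was intended makes the pattern silently stop matching. One common way this happens in Python is a non-raw string literal, where `"\b"` is a backspace character rather than the regex word-boundary escape. The sample text and variable names are made up.

```python
import re

text = "supports agent coordination and vector search"

# Non-raw literal: "\b" becomes a backspace (0x08), so the pattern
# demands literal control characters around the phrase.
broken = re.compile("\bvector search\b")

# Raw literal: "\b" reaches the regex engine as a word boundary.
fixed = re.compile(r"\bvector search\b")

broken_match = broken.search(text)
fixed_match = fixed.search(text)
```

No exception is raised in the broken case; the search simply returns `None`, which is why this class of bug tends to surface as missing matches rather than as an error.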
Other
- PR #28 wiki UX as above
- PR #9 `precompact` bypasses block when auto-ingest fired recently (avoids race)
- PR #2 lazy-import numpy so package imports without `[ml]` extra
- PR #3 enable v3.13 dedupe in shadow mode by default (now superseded; active mode since 2026-05-01)
Numbers
| | v3.12.0 | v3.13.0 |
|---|---|---|
| Tests | 1 029 | 1 953 |
| MCP tools | 22 | 24 |
| CLI subcommands | 64 | 86 |
Upgrade
```
pip install --upgrade memorymaster
memorymaster --db memorymaster.db init-db  # idempotent: adds new tables/columns
```

Existing Atlas DBs are forward-migrated automatically (3-phase `_ensure_atlas_source_schema`: tables → ALTER ADD COLUMN → indexes). LifeAgent and other Atlas consumers should pin to `meta.atlas_contract_version >= 1.5.1`.
🤖 Generated with Claude Code
v3.12.0 — Wider GT (top-50) confirms feature null on lexical corpus
Definitive end of the recall-feature track. v3.10 hypothesised the labeled GT was too narrow to detect lift from new candidates. v3.12 tested it: re-labeled 953 prompts against the top-50 candidates per prompt (was top-15).
Result: baseline jumps 0.104 → 0.470 (+0.366), confirming the GT-coverage bottleneck was real. But every v3.9-v3.11 feature is STILL NEGATIVE on the wider GT. F1 −0.001, F6 boost-only −0.005 to −0.013, F5+F8 −0.033 to −0.058. The features contribute no signal at top-5 on this lexically clean corpus, regardless of label coverage.
Defaults stay at 0.0 / OFF across F1/F5/F6/F8. Machinery + new wider GT ship as infrastructure for future investigations on different corpora.
Top-15 vs top-50 GT comparison
| Metric | top-15 GT (v3.10/3.11) | top-50 GT (v3.12) |
|---|---|---|
| Non-empty prompts | 248 (26%) | 646 (67.8%) |
| Total label IDs | ~750 | 2,734 |
| Baseline precision@5 | 0.104 | 0.470 |
| Baseline MAP@5 | 0.184 | 0.568 |
| Baseline hit@5 | 0.235 | 0.667 |
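These notes do not define the three metrics, so the sketch below uses common formulations and should be read as an assumption about how they are computed, in particular the `min(k, |relevant|)` normaliser for average precision. The function names are illustrative.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the top-k slots filled by labelled-relevant IDs."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """1.0 if any labelled-relevant ID appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Mean of precision-at-rank over the ranks where a relevant ID appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denominator = min(k, len(relevant))  # assumed normaliser
    return total / denominator if denominator else 0.0
```

MAP@5 is then the mean of `average_precision_at_k` over prompts. Under any of these definitions, a prompt whose correct claims sit below the labelling cutoff scores zero, which is why widening the ground truth from top-15 to top-50 raises the baseline without any change to retrieval.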
Sweep results vs baseline 0.470
| Config | precision@5 | Δ |
|---|---|---|
| F1 W=0.1 | 0.469 | -0.001 |
| F1 W=1.0 | 0.467 | -0.003 |
| F6 boost-only W=0.1 | 0.465 | -0.005 |
| F6 boost-only W=1.0 | 0.457 | -0.013 |
| F5+F8 W=0.1 | 0.437 | -0.033 |
| F5+F8 W=1.0 | 0.412 | -0.058 |
| Combined v3.11 best knobs | 0.421 | -0.049 |
Confirmed
- The GT-coverage hypothesis was REAL: top-15 GT depressed baseline because most labelled-correct claims were below rank 15.
- The recall-feature hypothesis is REFUTED: features don't help even when labels capture them.
Added
- `artifacts/real-prompts-1000-top50.jsonl` + `-labels.json`: 953 prompts, 646 non-empty (67.8%), 2,734 IDs. Wider GT for future evals.
- `artifacts/label-batches-top50/`: raw chunk in/outs.
- `artifacts/recall-measurement-top50-2026-04-27.md`: full sweep + v3.13+ research directions.
v3.13+ research moves AWAY from re-ranking
- Real-world recall capture — instrument hook for query+clicked-IDs corpus.
- Vector recall — W_VECTOR=0.0 today (no Qdrant). Different signal source.
- Compaction/dedup — reduce noise rather than re-rank.
Pure additive — no breaking changes, no schema changes.