Releases: jundot/omlx

0.4.4rc1

11 Jun 10:44
@jundot jundot

This release candidate focuses on DiffusionGemma support, DeepSeek V4 oQ/MTP support, and major cache-reuse correctness improvements for agent workloads.

Highlights

  • Added DiffusionGemma support via @Blaizzy's mlx-vlm, currently without cache. oMLX can now serve DiffusionGemma models through the mlx-vlm path.
  • Added DeepSeek V4 oQ quantization and MTP support. oMLX now supports fractional oQ levels, pre-quantized DeepSeek V4 oQ tensors, and safer DeepSeek V4 MTP loading and rollback behavior.
  • Dramatically improved agent cache reuse for Gemma and Qwen models. Paged SSD cache and prefix-cache correctness fixes now prevent stale layer-cache reuse, handle rotating-family prefix restore more safely, and strip superseded rotating-tip payloads so agent-style repeated prompts can reuse cache much more reliably. by @cfbraun in #1815 and @hojin12312 in #1807
  • Improved Memory Guard and preflight accounting. Scheduler preflight now avoids counting hot-cache bytes incorrectly, includes chunk KV growth, mirrors MLX SDPA fallback behavior more closely, and handles tracked model memory before load admission.

Improvements and Fixes

  • Added VLM MTP support with an external Qwen MTP drafter. VLM models can now use a separate Qwen MTP draft model for speculative decoding, with admin UI support for the new drafter settings. by @imi4u36d in #1791
  • Added a scheduler-facing output parser abstraction for Harmony, Gemma 4, and Cohere2 MoE-style streamed output and tool parsing.
  • Fixed VLM MTP hidden-output preservation and made MTP draft rollback atomic.
  • Fixed DeepSeek V4 MTP rollback/loading and rejected unsupported DeepSeek V4 fp16 oQ configurations.
  • Added fractional oQ levels and support for pre-quantized DeepSeek V4 oQ tensors.
  • Fixed TurboQuant cache handling after hybrid cache restore and chunked prefill cache insertion. (#1793)
  • Fixed benchmark loading so VLM MTP benchmark paths are not forced through LM-only loading. by @imi4u36d in #1813
  • Added a benchmark option to force mlx-lm loading when needed.
  • Fixed paged SSD cache invalidation for stale layer cache signatures. by @cfbraun in #1815
  • Fixed prefix-cache restore for rotating-family models and stripped superseded rotating-tip payloads. by @hojin12312 in #1807
  • Added a prefix-cache divergence probe for always-miss diagnosis. by @popfido in #1784
  • Fixed scheduler preflight accounting for hot-cache bytes, chunk KV growth, SDPA fallback, and scheduler memory-guard test doubles. (#1796, #1797)
  • Fixed engine-pool settle waits so other serving engines are not delayed unnecessarily. by @JimStenstrom in #1785
  • Fixed logits_processors rows being dropped to None during batch merge. by @efortin in #1799
  • Improved API-key safety by fingerprinting rejected keys in logs and rejecting non-ASCII configured API keys at validation time. by @richgoodson in #1751 and #1804
  • Fixed admin global profile form synchronization. (#1816)
  • Refined speculative draft model dropdowns and removed the experimental label from VLM MTP.
  • Added cohere2_moe support via mlx-vlm. (#1809)
  • Improved native embedding, reranker, DFlash, and audio model paths, including BGE/XLM-R/BERT serving, DFlash memory/cache accounting, TTS language forwarding, Gemma 4 Unified discovery, and NeMo ASR discovery.
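
The API-key safety item above could look roughly like the following. This is a minimal sketch, not oMLX's actual code: `fingerprint_key` and `validate_configured_key` are hypothetical helper names, and the choice of SHA-256 with a 12-character prefix is an assumption made for illustration.

```python
import hashlib


def fingerprint_key(raw_key: str, length: int = 12) -> str:
    """Return a short, non-reversible tag for a rejected key (hypothetical helper)."""
    # errors="replace" keeps hashing from raising on malformed input such as lone surrogates.
    digest = hashlib.sha256(raw_key.encode("utf-8", errors="replace")).hexdigest()
    return f"sha256:{digest[:length]}"


def validate_configured_key(key: str) -> str:
    """Reject empty or non-ASCII keys at configuration time (hypothetical helper)."""
    if not key or not key.isascii():
        raise ValueError("API key must be a non-empty ASCII string")
    return key
```

A log line would then carry `fingerprint_key(presented_key)` instead of the raw value, so repeated attempts with the same bad key can still be correlated without exposing it.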

Thanks

Thanks to @richgoodson, @paalolav, @JimStenstrom, @apetersson, @popfido, @FaisalFehad, @scaryrawr, @hojin12312, @efortin, @imi4u36d, and @cfbraun for the reports and fixes that shaped this release.

New Contributors

Thank you to @paalolav, @hojin12312, and @efortin for making their first contributions in this release.

Full Changelog: v0.4.3...v0.4.4rc1

0.4.4.dev1

10 Jun 02:39
@jundot jundot

Pre-release

This development release improves memory-pressure handling, paged SSD cache reliability, native embedding/reranker serving, DFlash memory/cache accounting, and audio model support.

  • Improved emergency memory handling for pinned workloads. Active requests are only aborted as a last resort when memory exceeds the real ceiling.
  • Improved paged SSD cache write-back reliability. Dirty hot-cache blocks now fall back to inline SSD writes instead of being dropped.
  • Improved API-key log safety. Rejected API keys are logged as fingerprints instead of raw values. by @richgoodson in #1751
  • Improved native BGE/XLM-R/BERT serving. bf16 reranker loads, embedding eval mode, and CLS pooling are handled correctly. by @paalolav in #1767
  • Improved DFlash prefill memory guarding. DFlash primary mode now applies the prefill memory guard before admission. by @JimStenstrom in #1770
  • Improved native embedding and reranker inference. Native paths now match shared serving behavior more closely.
  • Improved DFlash preflight memory safety. Unsafe MLX telemetry calls were removed from the preflight guard path.
  • Added TTS language forwarding. The audio speech language field now reaches mlx-audio lang_code. by @apetersson in #1773
  • Improved DFlash cache accounting. Prefix-cache hits are reported in prompt_tokens_details.cached_tokens. by @popfido in #1768
  • Fixed TTS argument forwarding. TTS engine argument order is preserved when language is forwarded.
  • Improved Gemma 4 Unified discovery. gemma4_unified models are detected as VLMs even without vision_config. by @FaisalFehad in #1744
  • Improved NeMo ASR discovery. NeMo ASR models are detected as speech-to-text models. by @scaryrawr in #1742
  • Improved pre-load memory admission. Tracked model memory now participates in LRU eviction decisions before loading another model. by @popfido in #1766
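
The pre-load memory admission item can be pictured as an LRU pass over the loaded models before a new one is admitted. The sketch below assumes a simple byte budget and an ordered map of tracked model sizes; `plan_evictions` and its parameters are hypothetical and do not mirror oMLX internals.

```python
from collections import OrderedDict


def plan_evictions(loaded: "OrderedDict[str, int]", incoming_bytes: int, budget_bytes: int) -> list:
    """Pick least-recently-used models to evict so the incoming model fits.

    `loaded` maps model id -> tracked bytes, ordered from least to most recently used.
    """
    used = sum(loaded.values())
    victims = []
    for model_id, size in loaded.items():
        if used + incoming_bytes <= budget_bytes:
            break  # enough room already, stop evicting
        victims.append(model_id)
        used -= size
    if used + incoming_bytes > budget_bytes:
        raise MemoryError("model does not fit even after evicting every loaded model")
    return victims
```

The point of the sketch is only that tracked model memory feeds the decision before the load starts, rather than after memory pressure appears.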

0.4.3

09 Jun 13:22
@jundot jundot

This hotfix release focuses on macOS 27 compatibility, throughput recovery for Qwopus and other affected models, and Memory Guard optimization and correctness fixes.

Highlights

  • Added macOS 27 beta compatibility. oMLX now handles the larger HOST_VM_INFO64 response shape used by macOS 27 and avoids fragile psutil memory-stat paths on macOS. (#1748, #1749)
  • Fixed slow streaming decode on Qwopus and related models. Active Memory Guard polling no longer calls MLX/Metal telemetry from the background thread during active requests, removing a major source of decode stalls. (#1745)
  • Improved decode performance. In a single-run Qwen 3.6-35B-A3B tg512 check, throughput improved from 77.5 to 79.0 tok/s (+1.9%) compared with 0.4.2.
  • Improved per-model MTP behavior. MTP decode eligibility is now stored on each loaded model instance, so loading a non-MTP model later no longer disables MTP decode on an already-loaded Qwen/Qwopus MTP model. (#1758)
  • Optimized Memory Guard preflight estimates. TurboQuant KV, hybrid cache models, fused SDPA, and tiled SDPA scratch memory are now accounted for more accurately, reducing false rejections and avoiding unsafe underestimates. (#1763, #1764)

Improvements and Fixes

  • Fixed Memory Guard active-request polling so scheduler-recorded MLX memory samples are reused instead of querying MLX telemetry from the enforcer loop.
  • Fixed macOS memory detection so system memory and process enforcement remain stable when HOST_VM_INFO64 sizing changes on macOS 27 beta.
  • Fixed TurboQuant KV preflight accounting so Memory Guard no longer overestimates KV peak memory by several times on TurboQuant-enabled models. (#1763)
  • Fixed preflight support for hybrid ArraysCache models with TurboQuant enabled.
  • Fixed fused SDPA memory estimation so MLX fused attention is treated as linear-memory for all head_dim values where applicable. by @fqx (#1764)
  • Added tiled SDPA scratch accounting for high-head-dimension prefill paths so large VLM/Qwen-style models are guarded more accurately.
  • Fixed prefill Memory Guard errors to return a client-visible failure path instead of surfacing as an internal server failure.
  • Fixed DFlash fallback scheduler resolution and bumped dflash-mlx for the Qwen wrapper compatibility fix.
  • Fixed Llama 4 batch cache offsets. (#1752)
  • Fixed max_completion_tokens handling as an alias for max_tokens. (#1759)
  • Fixed Harmony encoding loading by retrying transient tokenizer/encoding load failures.
  • Fixed stored MarkItDown file placeholders so existing uploaded-file references remain usable after 0.4.2. (#1750)
  • Fixed logits_processors=None handling to avoid mlx-lm crashes. by @monroewilliams (#1747)
  • Added Thaw menu bar manager support. by @youvegotmoxie (#1743)
  • Bumped the mlx-lm, mlx-vlm, and dflash-mlx pins to include upstream compatibility fixes used by this hotfix.
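
For the max_completion_tokens fix above, alias handling can be sketched as a small request-normalization step. `resolve_max_tokens` is a hypothetical helper, and preferring max_completion_tokens when both fields are present is an assumption; the release notes do not state the precedence.

```python
from typing import Optional


def resolve_max_tokens(body: dict, default: Optional[int] = None) -> Optional[int]:
    """Treat max_completion_tokens as an alias for max_tokens (hypothetical helper)."""
    # Assumption: the newer field wins if a client sends both.
    for field in ("max_completion_tokens", "max_tokens"):
        value = body.get(field)
        if value is not None:
            return int(value)
    return default
```

With this shape, a request carrying only max_completion_tokens is limited the same way as one carrying max_tokens.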

Thanks

Thanks to @Collinw24, @ritbl, @orangeseasun205, @smkzw, @fqx, @monroewilliams, and @youvegotmoxie for the reports and fixes that shaped this release.

New Contributors

Thank you to @youvegotmoxie for making their first contribution in this release.

Full Changelog: v0.4.2...v0.4.3

Contributors

monroewilliams, youvegotmoxie, and 5 other contributors

0.4.2

08 Jun 16:21
@jundot jundot

This release focuses on native MarkItDown document processing, Qwen throughput and DFlash stability, adaptive Burst Decode throughput, Gemma 4 unified multimodal support, and broad cache/server reliability fixes.

Highlights

  • Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style file_data support.
  • Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings.
  • Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
  • Improved Qwen DFlash stability. DFlash Qwen target ops now stay pointed at the real text wrapper after the mlx-lm pipeline wrapper update, and idle DFlash engines are isolated across model switches.
  • Added adaptive Burst Decode. oMLX can now coalesce multiple decode steps per executor hand-off to improve fast single-request decode throughput, with bounded responsiveness and Off / Light / Balanced / Aggressive controls.
  • Added Gemma 4 unified audio input support. Gemma 4 unified models can accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
  • Improved long-context cache reliability. SSD cache pending-write saturation is tuned by block size and model KV size, transient writer backlog waits before dropping blocks, and hot-cache memory is reclaimed after model unload.
  • Improved model and server controls. Server-wide context window caps, comma-separated bind addresses, embedding context fallback, and better engine teardown behavior are now supported.

Performance

  • Burst Decode further reduces per-token executor overhead on fast local decode paths. Tokens may arrive in small bursts; the default Balanced mode can be changed from Global Settings -> Advanced.
  • Internal tg512 measurements show the main Qwen regression recovered while keeping Gemma performance stable:

[Screenshot: internal tg512 measurements, captured 2026-06-09]

Fixes

  • Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of BatchKVCache until a second request is actually appended.
  • Fixed Qwen DFlash output corruption after model switches by patching dflash-mlx Qwen target wrapper detection and unloading other idle DFlash engines before loading a new DFlash model. (#1707)
  • Fixed prompt-prefix token seeding in BatchGenerator so penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed.
  • Fixed generation recovery for MLX __next_prime overflow errors by resetting decode state and retrying affected requests serially. (#1725)
  • Fixed chunked prefill admission so prefilling requests count against the configured concurrency cap. (#1704)
  • Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
  • Fixed SSD cache write saturation for long-context workloads by tuning pending-write capacity from real block/KV size and waiting through transient writer backlog. by @cfbraun (#1627)
  • Fixed SSD cache hit decode overhead by materializing restored cache backing arrays before decode starts.
  • Fixed scalar mRoPE cache offsets for cached VLM prefixes.
  • Fixed hot-cache memory retained after model unload and made the admin hot-cache clear action reclaim orphaned hot-cache owners and MLX buffers. by @khsd6327 (#1713)
  • Fixed engine close fallback paths so SSD cache managers are still released when shutdown/deep reset raises.
  • Fixed stuck engine teardown by treating long teardown stalls as fatal so a supervisor can restart from a clean process.
  • Fixed embedding context length handling so /v1/embeddings uses request limits, configured context caps, or the model's own context length instead of falling back to 512 tokens. by @JimStenstrom (#1718)
  • Fixed non-ASCII API keys returning 500; invalid credentials now return 401. by @richgoodson (#1719)
  • Fixed response_format downgrade visibility by returning a client-visible Warning header when grammar-constrained output cannot be enforced. by @richgoodson (#1564)
  • Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
  • Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
  • Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
  • Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
  • Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
  • Fixed STT language handling so ISO language codes are preserved for backends that expect codes, while Qwen3-ASR-style backends still receive language names. (#1733)
  • Fixed mlx-audio resample export compatibility for input audio.
  • Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
  • Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
  • Bumped the mlx-vlm pin to include Gemma 4 shared-KV/load fixes, Qwen quantized KV prompt-state fixes, Qwen3-VL visual mask alignment, Phi 3.5 VL EOS fixes, and prior unified audio/MTP fixes.
  • Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
  • Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
  • Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
  • Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
  • Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)
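
The raw tool-call JSON recovery fix above concerns model output that contains literal tabs or newlines inside JSON string values, which a strict JSON parser rejects. One way to tolerate that in Python is the standard library's `strict=False` decoder option. This is a minimal sketch and not necessarily how oMLX recovers the payload; `parse_tool_arguments` is a hypothetical name.

```python
import json


def parse_tool_arguments(raw: str):
    """Parse tool-call JSON, tolerating raw control characters inside strings."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # strict=False lets the decoder accept literal tabs/newlines in string values.
        return json.loads(raw, strict=False)
```

Trying the strict parse first keeps well-formed output on the normal path and only relaxes the decoder when the strict parse fails.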

macOS App and Admin UI

  • Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
  • Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
  • Added the server-wide context window cap to the admin settings UI.
  • Added the Burst Decode setting to Global Settings -> Advanced.
  • Added support for comma-separated bind addresses in the Host setting, including validation and alias detection. by @fqx (#1606)
  • Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
  • Fixed the Settings menu so it stays available when the server is stopped.
  • Fixed the login page so Auto theme honors the system dark-mode preference. by @monroewilliams (#1728)
  • Fixed localized Memory Guard strings so placeholder interpolation no longer leaves stale tokens or duplicated units. by @fqx (#1730)
  • Improved process naming so the server appears as omlx-server. by @iamckun (#1658)
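
The comma-separated bind address support in the Host setting can be illustrated with a small parser. The sketch below is an assumption-laden illustration: `parse_bind_hosts` is hypothetical, it accepts only IP literals, and it treats exact duplicates as the only form of alias, which is likely narrower than the validation and alias detection the release describes.

```python
import ipaddress


def parse_bind_hosts(value: str) -> list:
    """Split a comma-separated Host setting into validated, de-duplicated IPs."""
    hosts = []
    for part in value.split(","):
        text = part.strip()
        if not text:
            continue
        # ip_address raises ValueError for anything that is not an IPv4/IPv6 literal.
        host = str(ipaddress.ip_address(text))
        if host not in hosts:
            hosts.append(host)
    if not hosts:
        raise ValueError("at least one bind address is required")
    return hosts
```

For example, an input such as `127.0.0.1, ::1` would yield two bind addresses, one per address family.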

New Contributors

Thank you to everyone making their first contribution in this release:

@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.

Full Changelog: v0.4.1...v0.4.2

0.4.2rc1

06 Jun 18:09
@jundot jundot

This release candidate focuses on native MarkItDown document processing, Qwen throughput recovery (1.48x), Gemma 4 unified multimodal support, and cache, scheduler, and server stability improvements.

Highlights

  • Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style file_data support.
  • Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings (currently web dashboard only).
  • Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
  • Added Gemma 4 unified audio input support. Gemma 4 unified models can now accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
  • Improved model and server controls. A server-wide context window cap policy was added, embedding requests now respect the effective model context length, and server processes now show an omlx-server process title.

Performance

Internal tg512 measurements show the main Qwen regression recovered while keeping Gemma performance stable:

[Image: internal tg512 measurements]

Fixes

  • Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of BatchKVCache until a second request is actually appended.
  • Fixed prompt-prefix token seeding in BatchGenerator so penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed.
  • Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
  • Fixed SSD cache pressure handling by unlinking LRU files outside the bounded write queue and preserving capped-eviction observability. by @cfbraun (#1451)
  • Fixed cache-store backpressure and aborted prefill cleanup so new prefills wait safely while cache cleanup is full.
  • Fixed an engine-pool acquire-vs-use eviction race and active-request counter leak for embedding and rerank engines. by @Cmerrill1713 (#1668)
  • Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
  • Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
  • Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
  • Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
  • Fixed response_format downgrade visibility by returning a client-visible Warning header when grammar-constrained output cannot be enforced. by @richgoodson (#1564)
  • Fixed embedding context length handling so /v1/embeddings uses the effective model context window instead of falling back to 512 tokens. by @jackwh (#1694)
  • Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
  • Fixed mlx-audio resample export compatibility for input audio.
  • Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
  • Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
  • Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
  • Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
  • Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
  • Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
  • Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)

macOS App

  • Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
  • Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
  • Added the server-wide context window cap to the admin settings UI.
  • Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
  • Fixed the Settings menu so it stays available when the server is stopped.
  • Improved process naming so the server appears as omlx-server. by @iamckun (#1658)

New Contributors

Thank you to everyone making their first contribution in this release:

@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.

0.4.2.dev3

05 Jun 17:40
@jundot jundot

Pre-release

This development release adds native MarkItDown document processing and VLM-based PDF processing in oMLX, improves Gemma 4 tool-call stability, and hardens multimodal precision, cache, memory, and engine scheduling.

[Video: oMLX_0.4.2_MarkItDown_v2.mp4]
  • Added native MarkItDown document processing and VLM-based PDF processing. Uploaded files can now be converted through MarkItDown, and PDFs can use either MarkItDown or VLM OCR from the selected processing engine.
  • Improved Gemma 4 tool-call stability. Multi-turn Gemma 4 MoE tool conversations now strip stray tool-call close markers before re-rendering conversation history. by @kreeger in #1665
  • Improved raw tool-call JSON recovery. Tool calls with raw tabs or newlines inside generated JSON string values are now recovered and returned as valid structured tool calls.
  • Improved multimodal oQ precision. Protected vision and audio tensors are preserved in float32 during oQ conversion to avoid FP16 overflow and multimodal quality loss. by @dodams258 in #1682
  • Improved engine eviction safety. Embedding and rerank engines are now leased while in use, preventing acquire-vs-use eviction races and resetting leaked activity counters on teardown. by @Cmerrill1713 in #1668
  • Improved cache and prefill backpressure. Hot-cache budget is shared across models, cache-heavy prefills wait while cache-store cleanup is full, and idle wakeups are guarded for partial engine cores.
  • Improved small-system memory behavior. Sub-24GB Apple Silicon systems now use the small-system reserve path, reducing over-reservation from tiered defaults.
  • Reduced idle CPU overhead. Loaded models now avoid unnecessary idle wakeups while remaining ready for requests.

Contributors

kreeger, sje397, and 2 other contributors

0.4.2.dev2

04 Jun 16:32
@jundot jundot

Pre-release

This development release fixes Gemma4 Unified image understanding and OCR quality in oMLX, improves Gemma4 Unified vision feature cache handling, and fixes Gemma4 Unified MTP compatibility.

  • Fixed Gemma4 Unified VLM prefill handling so multimodal token-type IDs are preserved through the oMLX external prefill path.
  • Restored Gemma4 Unified 12B image understanding and OCR quality, including text-reading prompts that previously degraded into hallucinated descriptions.
  • Improved Gemma4 Unified vision feature cache correctness for position-aware and compacted vision features.
  • Fixed Gemma4 Unified MTP compatibility.

0.4.2.dev1

04 Jun 09:56
@jundot jundot

Pre-release

This development release updates mlx-vlm to 0.6.1 (041f889) to add Gemma4 Unified (12B) support, adds cohere2_moe (Command A+) support, and includes several bug fixes.

  • Updated mlx-vlm to 0.6.1 (041f889) with Gemma4 Unified long-text prefill fixes.
  • Added support for Gemma4 Unified (12B) through the updated mlx-vlm backend.
  • Added support for cohere2_moe models, including Command A+.
  • Fixed streaming detokenization for raw VLM/Gemma4 paths so byte-fallback tokenizers no longer emit replacement characters in split UTF-8 output.
  • Added VLM MTP controls to the macOS model settings UI.
  • Exposed advanced DFlash model settings in the macOS app, including the ddtree verify mode.
  • Fixed speculative-control state in the macOS app while VLM MTP is enabled.
  • Restored the interactive Claude model picker in omlx launch.
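
The streaming detokenization fix above addresses byte-fallback tokenizers whose tokens can split a multi-byte UTF-8 character across two streamed pieces. A common way to avoid emitting replacement characters in that situation is an incremental decoder that holds back incomplete sequences. The sketch below uses Python's standard `codecs` module and is an illustration of the idea only; `stream_text` is a hypothetical name and the byte chunks stand in for per-token bytes.

```python
import codecs


def stream_text(byte_chunks):
    """Yield text pieces, holding back incomplete UTF-8 sequences between chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush at end of stream; a truly truncated sequence becomes U+FFFD only here.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Decoding each chunk independently would instead turn the two halves of a split character into separate replacement characters.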

v0.4.1

03 Jun 15:48
@jundot jundot

For the major 0.4.0 upgrade notes, please see the 0.4.0 release notes.

This patch release focuses on memory-pressure stability, robust model discovery, managed server lifecycle controls, and macOS app and CLI quality-of-life improvements.

Highlights

  • Improved prefill memory handling. Static memory reserves were reduced, throttle tiers were simplified, and prefill chunk floors were raised for more stable throughput.
  • Idle-model eviction before prefill throttling. Loaded but idle models can now be freed before new work is rejected or throttled.
  • Managed server lifecycle controls. The macOS app and admin surfaces now expose server lifecycle control, and when the macOS app is installed the server can also be controlled from the CLI with omlx start, omlx stop, and omlx restart.
  • Copyable model IDs in the SwiftUI app. Model ID copy buttons were added to the Models list and per-model detail header.

Fixes

  • Fixed TurboQuant KV cache selection for MLA models by excluding incompatible MLA models from TurboQuant cache paths. by @popfido (#1626)
  • Fixed scheduler handling so prefill error outputs are preserved during decode. by @ken-zzzzz (#1622)
  • Fixed inaccessible secondary model directories so they are tolerated instead of breaking model discovery or admin reload flows.
  • Fixed LFM2 pythonic tool parser support.
  • Fixed a macOS crash when deleting chat template kwargs. by @nethbotheju (#1634)
  • Fixed shell integration behavior so editing shell init files requires explicit user consent. (#1633)
  • Fixed app CLI wrapper symlink resolution in bundled macOS builds.

macOS App

  • Added model ID copy affordances in the SwiftUI Models screen and model detail screen.
  • Improved server lifecycle integration between the macOS app and local control server.
  • Improved shell integration consent flow and related UI copy.

New Contributors

Thank you to everyone making their first contribution in this release:

@ken-zzzzz, @nethbotheju.

Contributors

popfido, nethbotheju, and ken-zzzzz

v0.4.0

02 Jun 17:11
@jundot jundot

0.4.1 is now available. If you reached this page from a direct link, please use the latest patch release instead.

0.4.0 is the first official release of the native Swift macOS app. The old PyObjC menubar app has been retired, and the macOS bundle now ships as a Swift app with a redesigned onboarding flow, settings UI, status surfaces, model management, and GitHub Releases based updater.

This Swift transition was driven by excellent work from @popfido, with follow-up polish and release-path fixes folded in after the initial merge. Thank you for the huge amount of thoughtful work here — this is the biggest user-facing macOS change oMLX has shipped so far, and it substantially raises the quality of the desktop app.

[Screenshot: oMLX 0.4.0 native Swift macOS app]

Highlights

  • Native Swift macOS app. The old PyObjC menubar app has been replaced by a native Swift/SwiftUI app, with new onboarding, settings, status, model management, downloads, integrations, and update flows. by @popfido
  • Improved menubar and app status. Live port/status updates, StatusKit fixes, version display, supervised-server handling, and cleaner running-state behavior. by @popfido
  • Standard Hugging Face cache model directory support. oMLX can now discover models from the standard Hugging Face cache location, with controls for toggling HF cache discovery and managing local model directories.
  • Safer update flow. App updates now honor the selected update channel and require confirmation before download.
  • Browser chat UI received a major usability overhaul and follow-up message/action fixes. by @beamivalice
  • xgrammar is bundled into the venvstacks export with the no-torch stub path. by @cfbraun
  • Memory guard tuning relaxed throttle/eviction thresholds, improved Custom tier behavior, and added CLI options for memory guard configuration.
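
To illustrate what discovering models in the standard Hugging Face cache can look like, here is a minimal sketch. It assumes the documented Hugging Face hub layout: a `hub` directory under `HF_HOME` (default `~/.cache/huggingface`), overridable with `HF_HUB_CACHE`, containing folders named `models--<org>--<name>`. The helper names `hf_hub_cache_dir` and `discover_cached_models` are hypothetical and are not oMLX's actual API; oMLX's own discovery logic may differ.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None) -> Path:
    """Resolve the hub cache directory (simplified: ignores XDG_CACHE_HOME)."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def discover_cached_models(cache_dir: Path) -> list[str]:
    """Return repo ids for folders named models--<org>--<name>."""
    if not cache_dir.is_dir():
        return []
    repo_ids = []
    for entry in cache_dir.iterdir():
        if entry.is_dir() and entry.name.startswith("models--"):
            # "models--org--name" -> "org/name"
            repo_ids.append(entry.name[len("models--"):].replace("--", "/"))
    return sorted(repo_ids, key=str.casefold)
```

Calling `discover_cached_models(hf_hub_cache_dir())` lists whatever model repos are present in the local cache; dataset and space folders are skipped because they use different prefixes.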

Runtime, cache, and scheduler

  • Per-engine MLX threads eliminate cross-engine stream contamination. by @ivaniguarans
  • Store-cache and boundary snapshot paths now materialize lazy arrays on the owning thread before async byte extraction. by @aeyeopsdev
  • Boundary snapshot cleanup races and stale snapshot handling were fixed. by @cfbraun
  • Predictive prefill throttling and reclaim/requeue behavior reduce mid-stream OOM failures. by @sdiamanEXUS
  • Paged cache references are released correctly on preflight/prefill rejection paths. by @cfbraun
  • Paged cache now disables itself cleanly when SSD initialization fails instead of breaking startup. by @lvsijian8
  • VLM, SpecPrefill, and draft-model lazy state is materialized on loader threads to avoid stream errors. by @cfbraun
  • Engine stop now yields back to the event loop so shutdown/restart paths do not monopolize the loop. by @fqx
  • Unreadable model directories are handled during startup instead of aborting discovery.
  • DMG builds now preserve engine commit metadata.
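
The event-loop item above (engine stop yielding back to the loop) can be illustrated with a generic asyncio sketch. This is not oMLX's engine code: `stop_engines` and `heartbeat` are hypothetical names, and the teardown work is a stand-in. It only shows the general technique of awaiting `asyncio.sleep(0)` between steps so that other tasks are scheduled while a long shutdown runs.

```python
import asyncio


async def stop_engines(engines: list[str], log: list[str]) -> None:
    """Stop each (hypothetical) engine, yielding to the event loop between steps."""
    for name in engines:
        log.append(f"stopped {name}")  # stand-in for synchronous teardown work
        await asyncio.sleep(0)  # hand control back so other tasks can run


async def heartbeat(log: list[str], beats: int) -> None:
    """A competing task that should make progress while engines stop."""
    for _ in range(beats):
        log.append("heartbeat")
        await asyncio.sleep(0)


async def main() -> list[str]:
    log: list[str] = []
    await asyncio.gather(
        stop_engines(["a", "b", "c"], log),
        heartbeat(log, 3),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Without the `await asyncio.sleep(0)` in `stop_engines`, the whole shutdown loop would finish before `heartbeat` ran at all, which is the kind of loop monopolization the fix is described as avoiding.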

MTP, oQ, TurboQuant, and model compatibility

  • Safe row-wise MTP decoding is enabled for aligned batches, with fallback for unsafe late-join batches.
  • Qwen3.6 MXFP4 mixed norm conventions and MTP preservation are handled more safely. by @scubamount
  • TurboQuant now supports batched KV-cache compression and fixes batch merge edge cases. by @popfido
  • DFlash/MTP transition restores Qwen GQA attention hooks.
  • LFM text MoE model discovery is classified correctly as LLM instead of mlx-audio STS. by @samfenwick
  • Step 3.7 Flash support is patched through the mlx-lm compatibility path.

API and integrations

  • Guided grammar is now exposed as a model setting and maps into the existing structured-output grammar path. by @MrNiceRicee
  • Anthropic cache-control accounting and model context length reporting were fixed. by @richgoodson
  • tool_choice: "none" is respected for MCP tools. by @lvsijian8
  • Tool call function names are trimmed while preserving type validation behavior. by @palvaleri
  • Wildcard bind addresses such as 0.0.0.0 are normalized to usable local client addresses. by @monroewilliams
  • Top-level omlx imports are lazy-loaded to improve startup compatibility, including NumPy 2.x environments. by @fparrav
  • Claude Code compatibility was updated for newer request behavior. by @lx1229
  • CLI shutdown handles KeyboardInterrupt cleanly. by @fry69
  • Integration launch context was unified across external tool integrations.
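
As a sketch of the wildcard bind address item above: a server bound to `0.0.0.0` listens on every interface, but `0.0.0.0` is not a reliable address for a client to connect to. The hypothetical helper below (`client_host` is an illustrative name, not oMLX's function) uses the standard-library `ipaddress` module to map unspecified addresses to loopback and leaves everything else untouched. How oMLX actually performs this normalization is not shown in these notes.

```python
import ipaddress


def client_host(bind_host: str) -> str:
    """Map a wildcard bind address to a loopback address a local client can dial."""
    try:
        addr = ipaddress.ip_address(bind_host)
    except ValueError:
        return bind_host  # a hostname such as "localhost": leave unchanged
    if addr.is_unspecified:  # 0.0.0.0 or ::
        return "127.0.0.1" if addr.version == 4 else "::1"
    return bind_host
```

For example, a client URL could be built as `f"http://{client_host(host)}:{port}"` for an IPv4 bind address; IPv6 literals would additionally need square brackets in a URL.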

Admin UI and macOS UI

  • Downloads now include a model card sheet with metadata, files, and tags. by @popfido
  • Local Models sorting is now case-insensitive ascending. by @MwC-Trexx
  • SwiftUI model lists now also sort case-insensitively.
  • Active Models layout works better on narrow screens. by @samfenwick
  • Model settings table headers are aligned. by @ilukashin
  • Server/app settings apply behavior and live port display were cleaned up. by @popfido
  • Light mode settings contrast was restored.
  • Mac app CLI launch shim and CLI wrapper signing were restored.
  • Admin custom-tier memory text is synced with server behavior.
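
The case-insensitive ascending sort mentioned for the model lists can be sketched in a few lines. This is an illustrative Python equivalent only (the app's lists are SwiftUI and web UI code), and `sort_model_names` is a hypothetical name.

```python
def sort_model_names(names: list[str]) -> list[str]:
    """Sort names ascending, ignoring case; equal keys keep their input order."""
    return sorted(names, key=str.casefold)
```

A plain `sorted(names)` orders every uppercase-initial name before lowercase ones, so `Gemma` and `Llama` would land ahead of `deepseek`; keying on `str.casefold` avoids that.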

Packaging, CI, and tests

  • The venvstacks driver is pinned/detected more reproducibly. by @popfido
  • The mlx-framework venvstacks layer was renamed to mlx-base. by @popfido
  • A CI workflow and broader unit-test coverage were added. by @Mearman, @cfbraun, @fry69
  • Python 3.14 was added to the CI matrix. by @fry69
  • Formula automation and release URL substitution were corrected.
  • paroquant dev dependency was bumped to 0.1.15.

New Contributors

Thank you to everyone making their first contribution in this release:

@cfbraun, @chenqianhe, @jcalvert, @MwC-Trexx, @azhangd, @scubamount, @sdiamanEXUS, @ilukashin, @tylerliu, @MrNiceRicee, @lx1229, @palvaleri, @monroewilliams.
