Releases: turboderp-org/exllamav3
0.0.43
- Fix error when MTP drafting in TP mode
- Faster quantization
Full Changelog: v0.0.42...v0.0.43
Assets 52
- sha256:f0964a8debe78d47538b4ae37ca5e5162f5cb09b48ad0531ffdc4400e8c12981 (185 MB, 2026-06-14T17:32:18Z)
- sha256:312062a54b7aa997fb07d588c7e695d1b46c4f43d0c7522757c13e4c38fd8d0d (169 MB, 2026-06-14T17:39:48Z)
- sha256:417de48608be60629938fdd2cc5981094b744f0cfa309453a37d8d939cf8e6c0 (185 MB, 2026-06-14T17:31:54Z)
- sha256:2305f66bfa614f67dc3d9efa5f4f1237abf6eda179dc69920d76dd6a5026eb58 (169 MB, 2026-06-14T17:37:45Z)
- sha256:2b9ac05af7f7c61738651ae0d4007e565fe519e6e3ba2c7e293aacefd7684b42 (185 MB, 2026-06-14T17:33:44Z)
- sha256:b46da4707268bbfe49ac764e636fa985f9de9207316c9f19f2052c83569dc60e (169 MB, 2026-06-14T17:38:35Z)
- sha256:0ce569a472c18ce07a2e2ed783b9c73d0ae592ad416d048f93d1bfb220c845d4 (185 MB, 2026-06-14T17:27:28Z)
- sha256:0807eeb21f614074e6c55a792f881391d45d0604a2445ad6828fa362e2037145 (169 MB, 2026-06-14T17:37:15Z)
- sha256:31694d44a3ec0d104c5d1710945521c4fa9bc41b8b6b9776a8a50ec98c6f6ad6 (185 MB, 2026-06-14T17:30:59Z)
- sha256:e368d7a07a532837505db8579aeb034724b8fac5abddc30165bfd50cb21eba06 (171 MB, 2026-06-14T17:38:52Z)
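A downloaded asset can be checked against its published sha256 digest before installing. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the wheel filename in the usage note is hypothetical because this page does not list asset names.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large wheels are never fully held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published):
    """Compare a file's digest with a 'sha256:<hex>' string from the release page."""
    expected = published.removeprefix("sha256:").strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

For example, `matches_published_digest("downloaded_wheel.whl", "sha256:f0964a8d...")` (with the full digest pasted in, and the placeholder filename replaced by the real one) returns `True` only when the file on disk matches the digest shown for that asset.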
0.0.42
- Fix MTP drafting when MTP model is not on the target model's output device
Full Changelog: v0.0.41...v0.0.42
0.0.41
- Add MTP support for Qwen3.5/3.6
Full Changelog: v0.0.40...v0.0.41
0.0.40
- Support Gemma4UnifiedForConditionalGeneration
Full Changelog: v0.0.39...v0.0.40
0.0.39
- Add Step3p7ForConditionalGeneration
Full Changelog: v0.0.38...v0.0.39
0.0.38
- Support Lfm2MoeForCausalLM (LFM 2.5)
- Fix regression in GDN inference when bsz > 1
- Fix issue causing DFlash to break in TP mode when cudaMallocAsync backend was used
- QoL improvements
Full Changelog: v0.0.37...v0.0.38
0.0.37
- Another small bugfix
Full Changelog: v0.0.36...v0.0.37
0.0.36
- Fix small SD regression
Full Changelog: v0.0.35...v0.0.36
0.0.35
- Tensor parallel mode for Qwen3.5/3.6
- New recurrent state manager avoids dynamic allocation of recurrent states (reduces fragmentation and keeps VRAM overhead constant for Qwen3.5 etc.)
- Improved checkpointing decision to make better use of available cache space
- Perform reconstruct-GEMM in slices for large layers, greatly reducing VRAM overhead
- Fix race condition causing streaming token output to lag behind generation in some situations
- Support new tensor keys in Mistral 3.5 Medium
- KL-div and perplexity kernels for eval scripts
- More tuning
- Lots of bugfixes
Full Changelog: v0.0.34...v0.0.35
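The "reconstruct-GEMM in slices" item above can be illustrated with a toy example. This is only a sketch of the general idea, not exllamav3's kernel: the names `reconstruct_cols` and `sliced_gemm` are hypothetical, and the "code times scale" dequantization is a stand-in assumption for the real weight format. The point is that only one slice of the reconstructed weight matrix exists at a time, so temporary memory scales with the slice width rather than the full layer.

```python
def reconstruct_cols(codes, scale, start, stop):
    """Hypothetical dequantization: rebuild float columns [start, stop) of W."""
    return [[c * scale for c in row[start:stop]] for row in codes]


def matmul(a, b):
    """Plain dense product of two row-major matrices."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def sliced_gemm(x, codes, scale, slice_cols):
    """Compute x @ W, reconstructing at most slice_cols columns of W at once."""
    n = len(codes[0])
    y = [[] for _ in x]
    for start in range(0, n, slice_cols):
        stop = min(start + slice_cols, n)
        w_slice = reconstruct_cols(codes, scale, start, stop)  # temporary buffer
        # Each slice produces a disjoint block of output columns.
        for out_row, part in zip(y, matmul(x, w_slice)):
            out_row.extend(part)
    return y


x = [[1.0, 2.0], [3.0, 4.0]]
codes = [[1, 2, 3], [4, 5, 6]]  # toy integer codes, W = codes * 0.5
result = sliced_gemm(x, codes, 0.5, slice_cols=2)
```

Slicing along the output columns means no accumulation across slices is needed, so the result is identical to reconstructing the whole matrix in one pass (`slice_cols=3` here).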
0.0.34
- Fix regression causing extra VRAM usage during prefill
- Add CUDA 13.2.0 wheels (built with cu132 against torch==2.11.0+cu130)
Full Changelog: v0.0.33...v0.0.34