Releases: ModelCloud/GPTQModel

GPT-QModel v7.1.0

08 Jun 08:04

@Qubitium Qubitium

v7.1.0

d49b9bd

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified

GPT-QModel v7.1.0 Latest

What's Changed

[CI] fix release action cannot find uv & clean actions by @CSY-ModelCloud in #2837
[CI] install timm for internvl chat by @CSY-ModelCloud in #2839
Add Laguna model support by @Qubitium in #2836
[MODEL] support ernie4_5_vl_moe by @ZX-ModelCloud in #2838
fix AttributeError: 'NoneType' object has no attribute 'from_pretrained' by @CSY-ModelCloud in #2840
Add GSM8K Platinum to Laguna regression by @Qubitium in #2841
docs: update hardware support table by @Qubitium in #2842
[FIX] ci test by @ZX-ModelCloud in #2843
Add NPU quant method coverage by @Qubitium in #2845
[FIX] AWQ device placement to follow planner target devices by @ZX-ModelCloud in #2847
add nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 support by @CSY-ModelCloud in #2846
fix test_subset by @CSY-ModelCloud in #2849
new template for Nemotron_3 test by @CSY-ModelCloud in #2851
update internvl_chat pkgs by @CSY-ModelCloud in #2850
add inclusionAI/Ling-2.6-flash support by @CSY-ModelCloud in #2844
Refactor LazyTurtle checkpoint tensor resolution by @ZX-ModelCloud in #2852
fix AttributeError: '_DummyConfig' object has no attribute 'model_type' by @CSY-ModelCloud in #2853
fix apply_moe_config was not found by @CSY-ModelCloud in #2854
added "qwen3_5_moe_text" definition and "qwen3_5_text" definition by @ZX-ModelCloud in #2857
fix first layer was asserted, but only last 2 layers are quanted by @CSY-ModelCloud in #2861
[MODEL] support "glm4v_moe" by @ZX-ModelCloud in #2862
add LogBar progress to sync_all_meta() writes by @ZX-ModelCloud in #2860
fix: handle float8 tensor serialization in streaming safetensors saves by @ZX-ModelCloud in #2863
move causal_conv1d to gptqmodel/hf_kernels, not in root by @CSY-ModelCloud in #2864
store logs file in ./logs, not in root dir by @CSY-ModelCloud in #2865
[CI] fix dust or dir may not exist by @CSY-ModelCloud in #2867
fix tests/models/test_qwen3_5_text_only by @ZX-ModelCloud in #2866
fix TestVoxtral::test_voxtral - IndexError: index out of range in self by @CSY-ModelCloud in #2870
[CI] fix envs conflict on one host runner by @CSY-ModelCloud in #2871
[FIX] test_hymba by @ZX-ModelCloud in #2872
fix KeyError: 'type' & AutoTokenizer was not found by @CSY-ModelCloud in #2873
[MODEL] support zamba and zamba2 by @ZX-ModelCloud in #2868
no need to import AutoTokenizer which is unused by @CSY-ModelCloud in #2874
[FIX] FP8 dequantization for cross-shard scales and partial edge blocks by @ZX-ModelCloud in #2875
add retry to fix remote files missing in cache dir by @CSY-ModelCloud in #2876
fix: preserve mtp.* tensors for dense Qwen3.5/Qwen3.6 models by @erm14254 in #2869
fix 2 deps on CI by @CSY-ModelCloud in #2881
fix assert was checking last layers by @CSY-ModelCloud in #2880
fix rope_parameters is not inited by @CSY-ModelCloud in #2882
fix AttributeError: 'FakeGPTQModel' object has no attribute '_sanitiz... by @CSY-ModelCloud in #2885
[MODEL] support minicpmv_4_6 by @ZX-ModelCloud in #2884
improve JIT extension failure diagnostics for CI flakiness by @CSY-ModelCloud in #2886
log stack trace for marlin jit error by @CSY-ModelCloud in #2887
fix Qwen3OmniMoe throws get_input_embeddings NotImplementedError by @CSY-ModelCloud in #2888
Fix dequantization for ignored layers, padded FP8 scales, and non-4D tensors by @ZX-ModelCloud in #2889
fix command got wrong args by @CSY-ModelCloud in #2891
print expected in error by @CSY-ModelCloud in #2890
Fix InternVL tokenizer compat on transformers 5 by @CSY-ModelCloud in #2892
[MODEL] support deepseek_v4 by @ZX-ModelCloud in #2877
add kimi 2.5 support by @CSY-ModelCloud in #2858
[CI] use torch 2.12.0 & python 3.14t as default on CI by @CSY-ModelCloud in #2894
[MODEL] support mimo_v2 by @ZX-ModelCloud in #2893
[CI] auto clean cache & retry by @CSY-ModelCloud in #2895
[FIX] ovis incompatibility with transformers v5 by @ZX-ModelCloud in #2896
[MODEL] support ovis2_5 by @ZX-ModelCloud in #2897
fix paroquant was not included in release by @CSY-ModelCloud in #2900
[CI] install release pkg instead of source by @CSY-ModelCloud in #2901
[MODEL] support ovis2 6 moe by @ZX-ModelCloud in #2899
[MODEL] support interns1 by @ZX-ModelCloud in #2902
[MODEL] support ovis2_6_next by @ZX-ModelCloud in #2904
[MODEL] support hrm_text by @ZX-ModelCloud in #2905
ascend kernel compat update: cann 9.1beta1 by @Qubitium in #2906
fix test_subset.py by @ZX-ModelCloud in #2907
Ascend tests by @Qubitium in #2908
[MODEL] support nemotron_labs_diffusion by @ZX-ModelCloud in #2909
[MODEL] support hunyuan_v1_dense and hunyuan_v1_moe by @ZX-ModelCloud in #2910
[FIX] Avoid invoking tensor.transpose(0, 1).contiguous() when the shapes already match by @ZX-ModelCloud in #2913
[FIX] DeepSeek-V4-Pro experts module can now be correctly dequantized to BF16 by @ZX-ModelCloud in #2914
[FIX] weight_only_looper did not support multi-GPU quantization. by @ZX-ModelCloud in #2915
fix(hub): import create_repo from huggingface_hub (transformers dropped the passthrough) by @Anai-Guo in #2917
[FIX] non-existent import in transformers.utils.hub with the latest transformers by @ZX-ModelCloud in #2918
prep for v7.1.0 release by @Qubitium in #2919
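
The FP8 dequantization fixes listed above mention padded scales and partial edge blocks. As a rough illustration only, the sketch below applies one scale per block to a 2D matrix whose last row and column blocks may be smaller than the block size. It assumes a block-scaled layout in which dequantization multiplies each value by its block's scale; the function name, the pure-Python lists, and the default block size are hypothetical and are not taken from GPTQModel's code.

```python
import math


def dequant_blockwise(qvalues, scales, block=128):
    """Multiply each element by the scale of the block it falls in.

    qvalues: list of rows (values already decoded to float), rows x cols.
    scales:  list of rows, ceil(rows / block) x ceil(cols / block).
    Assumption: dequantized value = value * block scale.
    """
    rows = len(qvalues)
    cols = len(qvalues[0]) if rows else 0
    # The scale grid must cover partial edge blocks, hence ceil().
    expected = (math.ceil(rows / block), math.ceil(cols / block))
    actual = (len(scales), len(scales[0]) if scales else 0)
    if actual != expected:
        raise ValueError(f"scale grid {actual} does not match expected {expected}")
    return [
        [qvalues[r][c] * scales[r // block][c // block] for c in range(cols)]
        for r in range(rows)
    ]


# 3x5 matrix with 2x2 blocks: the last row block and last column block are partial.
q = [[1.0] * 5 for _ in range(3)]
s = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out = dequant_blockwise(q, s, block=2)
```

Indexing the scale grid with integer division means a partial edge block reuses the same lookup as a full block, so no special case is needed once the grid is sized with `ceil`.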

New Contributors

@erm14254 made their first contribution in #2869
@Anai-Guo made their first contribution in #2917

Full Changelog: v7.0.0...v7.1.0

Contributors

Qubitium, ZX-ModelCloud, and 3 other contributors

🚀 GPTQModel v7.0.0

28 Apr 20:37

@Qubitium Qubitium

v7.0.0

f731429

🔥 Major

New Huawei Ascend NPU quantization support with torch-based kernels for inference
All CUDA/ROCm compiled kernels are now JIT (just-in-time) compiled on first use
Pip/UV install no longer requires the --no-build-isolation flag

🧠 New model support and compatibility wins

Added support for GLM 5/5.1, GLM OCR, GLM ASR, Gemma 3n, Falcon Mamba, and InternVL Chat.
Extended OpenVINO GPTQ patching to understand GPTQModel's newer kernels.
Fixed Qwen3 dtype handling, Qwen3.5 MoE module-tree assertions, Qwen2-VL calibration input capture, and Qwen 3.6 MoE regressions.
Fixed Llama4Router replacement behavior, Phi-3 defused MLP module mapping, Phi-4 runtime requirements, Instella rope-scaling compatibility, Ling compatibility, Mixtral MoE checkpoint module names, Brumby thread safety, Baichuan compatibility, and Gemma 3 saving.
Fixed exllamav3_torch import under meta-device context.

⚡ Kernels, JIT, and hardware acceleration

Moved all kernels that require compilation to JIT compilation on first use, and cleaned up Marlin import probing, CUDA header handling, nvcc flag checks, and Torch/CUDA mismatch handling.
Synced Marlin/Machete kernels with upstream and added hardware-specific Marlin boost paths.
Guarded CUTLASS version mismatches and fixed generated-kernel staleness.
Added global kernel rebuild support for CI and safer shared extension locks.
Added Ascend NPU support.
Fixed AWQ JIT cache invalidation, illegal memory access, SM120 execution, GEMM_Fast shared-memory launch, and BF16 bias validation.
Fixed BACKEND.MARLIN loading for gptq_v2 format and added Marlin import coverage.

🔥 Quantization, AWQ, FP8, and dequant

Added FP8/FP4 CPU dequant and DeepSeek FP8 .scale dequant export.
Added dtype auto-decoding and decode path updates.
Reduced AWQ scale-search activation memory and split AWQ integration tests for cleaner coverage.
Fail fast on unsupported act-group-aware GPTQ shapes instead of continuing into invalid layouts.
Fixed INT3 qzero format conversion, GAR width compatibility, and GPTQ batched keep-mask handling.
Improved AWQ W4A8 and BF16 validation paths, plus post-quant MoE routing behavior.
Used loader device selection for EoRA adapter generation.

🐢 LazyTurtle, loading, and model plumbing

Refactored input capture into BaseQModel and model-specific QModels for cleaner replay and calibration flows.
Renamed and hardened the turtle path into LazyTurtle, with stricter materialization failures and better expected-skip handling.
Fixed LazyTurtle materialization for non-square fused experts, PhiMoE, nested HF weight renames, reversed WeightRenaming semantics, and non-Safetensors checkpoints.
Improved out-of-model tensor handling for MTP prefix/files paths.
Removed BaseModel.loader_requires_dtype and normalized config dtype handling through get_hf_config_dtype().
Fixed multi-GPU replay output retention, GPTQ finalizer overlap, and quantization OOMs from retained callable cache keys.

🧰 CI, packaging, and developer workflow

Cleaned up CI shell logic, environment setup, UV cache handling, reusable Torch tests, CPU-only grouping, runner selection, retry behavior, and offload temp paths.
Kept CI and Torch CUDA versions aligned, moved to newer Docker images, and surfaced real exit codes and GPU names.
Removed lm-eval, deprecated tests, deprecated artifact IDs, pause UI lifecycle code, and tabulate from CI/test paths.
Migrated more regex usage to pcre/pcre2.
Replaced temp path helpers with tempfile.TemporaryDirectory() for automatic cleanup.
Updated requirements, dependencies, setuptools compatibility, and install-with-Torch validation.
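
As a small illustration of the `tempfile.TemporaryDirectory()` item above, the sketch below shows the automatic-cleanup behavior from the Python standard library. The helper name, directory prefix, and file name are hypothetical and are not taken from the GPTQModel test suite.

```python
import tempfile
from pathlib import Path


def write_scratch_file(payload: str) -> tuple[Path, str]:
    """Write payload to a scratch file and read it back.

    The directory and everything in it are deleted when the
    `with` block exits, including on early return or exception.
    """
    with tempfile.TemporaryDirectory(prefix="offload_") as tmp:
        scratch = Path(tmp) / "shard.txt"
        scratch.write_text(payload)
        return Path(tmp), scratch.read_text()


tmp_dir, content = write_scratch_file("hello")
print(content, tmp_dir.exists())
```

Because cleanup is tied to the context manager, no separate teardown helper is needed to remove the directory afterwards.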

💥 Breaking and removed

Kernel loading behavior has shifted heavily toward JIT compilation, so custom deployment environments should verify compiler/CUDA compatibility.
lm-eval references were removed from CI and test/docs paths.
Deprecated tests, artifact handling, and pause UI lifecycle code were removed.
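
Since kernels are now compiled on first use, a deployment image that lacks a compiler toolchain may only fail at runtime. A minimal pre-flight check could look like the sketch below. The tool list is an assumption (the release notes do not enumerate the required tools, and ROCm environments would look for different binaries), and the helper name is hypothetical.

```python
import shutil


def find_build_tools(names=("nvcc", "g++", "ninja")):
    """Return {tool: absolute path or None} for each tool looked up on PATH.

    Assumption: these names are only illustrative of a CUDA JIT toolchain.
    """
    return {name: shutil.which(name) for name in names}


tools = find_build_tools()
missing = [name for name, path in tools.items() if path is None]
print("missing build tools:", missing or "none")
```

Running a check like this at image build time surfaces a missing compiler before the first kernel load triggers JIT compilation.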

Full Changelog:

Refactor input capture flow into BaseQModel and model-specific QModels by @ZX-ModelCloud in #2666
[CI] adjust venv logic by @CSY-ModelCloud in #2667
[CI] remove verbose log flag for build by @CSY-ModelCloud in #2669
Move more kernels to JIT compile path by @Qubitium in #2668
Kernels migrate to jit compile by @Qubitium in #2670
remove hf kernels dependency for cpu by @Qubitium in #2671
fix marlin import paths probe by @Qubitium in #2673
fix failed test by @Qubitium in #2674
Extend OpenVINO's GPTQ patcher to understand GPTQModel new kernels. by @ZX-ModelCloud in #2675
[CI] use same cuda version for CI & torch by @CSY-ModelCloud in #2676
Handle mtp prefix/files in out_of_model_tensors by @ZX-ModelCloud in #2677
bonsai refactor by @Qubitium in #2672
glm 5/5.1 support by @Qubitium in #2680
Normalize config dtype to torch.dtype in get_hf_config_dtype() by @ZX-ModelCloud in #2681
[FIX] Qwen3ForCausalLM does not require the dtype argument. by @ZX-ModelCloud in #2682
fix jit error because torch's cuda mismatches local nvcc version by @CSY-ModelCloud in #2683
fix: rotary_embed init by @Qubitium in #2684
remove BaseModel.loader_requires_dtype by @ZX-ModelCloud in #2686
[CI] no need build step by @CSY-ModelCloud in #2688
refactor turtle to lazy by @Qubitium in #2687
[CI] fix jobs are skipped by @CSY-ModelCloud in #2689
[FIX] multi-GPU replay output retention OOM by @ZX-ModelCloud in #2692
fp8/fp4 cpu dequant by @Qubitium in #2691
refactor all monkeypatches to use same lock by @ZX-ModelCloud in #2693
[CI] decrease max parallel jobs to 4 by @CSY-ModelCloud in #2695
[FIX] All cpp extensions should share the same lock instead of using a map of locks by @ZX-ModelCloud in #2696
dtype auto decoder by @Qubitium in #2690
Decode update by @Qubitium in #2698
refactor processors by @Qubitium in #2697
fix cuda header path conflict by @Qubitium in #2701
[CI] add prefix for env name by @CSY-ModelCloud in #2704
ignore .codex by @CSY-ModelCloud in #2703
fix: stabilize baichuan compat test by @Qubitium in #2702
Fix Qwen3.5 MoE module tree assertion by @Qubitium in #2705
[CI] remove UV_INDEX_URL by @CSY-ModelCloud in #2706
Fix: LazyTurtle materialization for non-square fused experts by @ZX-ModelCloud in #2707
[CI] uv won't R/W /monster now by @CSY-ModelCloud in #2708
Fix AWQ JIT cache invalidation by @Qubitium in #2709
Split AWQ integration tests by @Qubitium in #2710
Fix CI to install ModelCloud deps from git by @Qubitium in #2711
migrate stdlib.re to pcre2 by @Qubitium in #2712
[CI] show real exit code by @CSY-ModelCloud in #2713
Remove lm-eval from CI and test/docs references by @Qubitium in #2714
Remove pause UI controller lifecycle by @Qubitium in #2715
[CI] re-mount /monster for uv by @CSY-ModelCloud in #2718
Sync Marlin/Machete Kernel with upstream by @Qubitium in #2717
Fix GPU CI allocation and streaming regressions by @Qubitium in #2719
Guard CUTLASS version mismatches by @Qubitium in #2720
Fix Marlin generated kernel staleness by @Qubitium in #2721
Fix balanced MoE vram usage by @Qubitium in #2716
Fix bias dtype and validate AWQ bf16 ops by @Qubitium in #2722
Raise on LazyTurtle materialization failures and silence expected skips by @ZX-ModelCloud in #2723
HW specific boost for Marlin by @Qubitium in #2724
Update requirements.txt by @Qubitium in #2725
[CI] share common venvs & add lock wh...

Contributors

Qubitium, dblundell, and 2 other contributors

GPT-QModel v6.0.3

02 Apr 23:56

@Qubitium Qubitium

v6.0.3

6a65d69

Notable Changes:

Quantization and inference

Major ParoQuant improvements across speed, inference, and accuracy.
Added Paro inference support and a new layer optimizer.
Auto-enables AMP for the fast Paro implementation to better match reference behavior.
Added Paro rotation autotuning and fixed BF16 rotation support for the fused CUDA kernel.
Improved Paro stability with seeding fixes, cleanup, learned channel scale clamping, and contiguous tensor handling fixes.
Fixed a layer output replay/re-capture regression.
Added FOEM (First-Order Error Matters) for more accurate quantized LLM compensation, plus follow-up fixes to its data processing pipeline.
Replaced the old marlin_fp16 backend behavior with environment-flag control for FP32 reduction.

Model and backend support

Added support for Gemma4, MiniCPMO, MiniCPMV, and GLM4-MoE-Lite.
Added PrismML/Bonsai model support for inference.
Fixed Qwen3_5QModel definition issues.
Fixed Qwen 3.5 rotary embedding behavior.
Fixed AWQ layer grouping for qwen3_5_moe, llama4, qwen2_moe, and qwen3_next.
Fixed awq_processor.dynamic so skipped layers are handled correctly.
Improved dtype compatibility.
Hugging Face kernels are now gated off on Python no-GIL builds until upstream wheel support is fixed.
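
The no-GIL gating mentioned above can be approximated with standard-library checks. The sketch below is not GPTQModel's actual gate: `should_use_hf_kernels` is a hypothetical name, and the policy it encodes (disable on any free-threaded build) is an assumption based on the wording of the note.

```python
import sys
import sysconfig


def is_free_threaded_build() -> bool:
    """True when the interpreter was built with the GIL made optional."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_is_enabled() -> bool:
    """True when the GIL is active in this process.

    sys._is_gil_enabled() exists on Python 3.13+; older versions always
    run with the GIL, so a missing attribute is treated as enabled.
    """
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


def should_use_hf_kernels() -> bool:
    # Hypothetical policy: skip the optional kernels on free-threaded builds.
    return not is_free_threaded_build()


print(is_free_threaded_build(), gil_is_enabled(), should_use_hf_kernels())
```

The build-time flag and the runtime GIL state are reported separately because a free-threaded build can still re-enable the GIL at runtime.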

Evaluation, calibration, and usability

Integrated Evalution into the workflow.
Added evalution.VLLM and evalution.SGLang backends.
Fixed SGLang evaluation engine initialization.
Automatically determines MODEL_COMPAT_FAST_LAYER_COUNT.
Improved calibration data device handling.
Updated tokenizer handling, and collation now respects tokenizer padding_size.
Improved import performance by lazy-loading _DEVICE_THREAD_POOL.
Cleaned up warning behavior and added an option to suppress warnings.
Removed forced random seed overrides.

Dependency and compatibility updates

Updated pypcre to 0.2.14.
Pinned logbar to >=0.4.1.
Updated transformers and defuser package versions.
Fixed SAVE_PATH handling and import path resolution issues.

Breaking and removed

Removed GPTQModel.upload_to_hub().
Removed MLX export support.
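
With `GPTQModel.upload_to_hub()` removed, one possible way to publish a saved quantized model is to call `huggingface_hub` directly. The release notes do not prescribe a replacement, so the sketch below is only an assumption about a reasonable workflow. The helper name and its parameters are hypothetical, and it expects an authenticated environment (for example a token in `HF_TOKEN`).

```python
def push_quantized_folder(folder_path: str, repo_id: str, private: bool = True) -> str:
    """Upload a locally saved model folder to the Hugging Face Hub.

    Hypothetical helper: not part of GPTQModel.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # Create the repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, private=private, exist_ok=True)
    # Upload every file under folder_path to the repository root.
    api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="model")
    return repo_id
```

A call such as `push_quantized_folder("./my-quantized-model", "my-org/my-quantized-model")` would be run after the quantized model has been saved to disk; both names here are placeholders.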

What's Changed

[CI] fix pkgs' order & fix flashinfer version was overridden by @CSY-ModelCloud in #2575
allow to disable warning by @CSY-ModelCloud in #2576
lazy load _DEVICE_THREAD_POOL, to speed up import by @CSY-ModelCloud in #2577
remove disable env check by @CSY-ModelCloud in #2578
[CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2579
Update pypcre version to 0.2.14 by @Qubitium in #2581
Nothing to see here... by @Qubitium in #2456
dtype compat by @Qubitium in #2582
fix test_moe_config by @ZX-ModelCloud in #2583
fix new format test by @ZX-ModelCloud in #2586
[CI] add test config by @CSY-ModelCloud in #2587
fix Qwen3_5QModel definition by @ZX-ModelCloud in #2588
speed up paroquant quant speed and resolve accuracy issues by @Qubitium in #2590
append last commit to version by @CSY-ModelCloud in #2591
speedup paroquant test by @ZX-ModelCloud in #2592
[CI] generate release matrix from torch registry by @CSY-ModelCloud in #2593
Evalution integration by @Qubitium in #2585
move eval.sh to tests by @Qubitium in #2594
remove warning by @Qubitium in #2595
[CI] use new docker image by @CSY-ModelCloud in #2596
[CI] install required pkg by @CSY-ModelCloud in #2597
Automatically Determine MODEL_COMPAT_FAST_LAYER_COUNT by @ZX-ModelCloud in #2598
[CI] no need to set MAX_JOBS by @CSY-ModelCloud in #2599
Fix: Paroquant impl accuracy by @Qubitium in #2601
remove forced random seed override in cls proper by @Qubitium in #2603
Paro test by @Qubitium in #2604
[FIX] incorrect SAVE_PATH by @ZX-ModelCloud in #2605
pin logbar to >= 0.4.1 by @Qubitium in #2606
Update the evalution scores by @ZX-ModelCloud in #2600
Paro: auto enable amp for fast impl to sync with reference by @Qubitium in #2607
paro: fix seeding and cleanup by @Qubitium in #2609
gate hf kernel to non-nogil builds of python until upstream fixes wheels by @Qubitium in #2610
[CI] use Ubuntu 24.04 docker image by @CSY-ModelCloud in #2612
Fix layer output re-capture (replay) regression by @Qubitium in #2611
remove legacy ppl codes by @Qubitium in #2613
replace marlin_fp16 backend with env flag control for fp32 reduction ... by @Qubitium in #2614
[CI] default py 3.14t & install latest Evalution by @CSY-ModelCloud in #2616
[CI] fix Evalution is private by @CSY-ModelCloud in #2617
update tokenicer by @Qubitium in #2618
make collate respect tokenizer padding_size by @Qubitium in #2620
paro: clamp learned channel scales to avoid collapse by @Qubitium in #2622
Calibration data device by @avtc in #2608
[FIX] qwen3_5 rotary_embedding by @ZX-ModelCloud in #2624
Temporarily disable gptqmodel spit_by feature by @ZX-ModelCloud in #2625
use evalution.VLLM by @CSY-ModelCloud in #2615
use evalution.SGLang by @ZX-ModelCloud in #2626
paro: enter the dragon by @Qubitium in #2623
[CI] use torch 2.11 by @CSY-ModelCloud in #2627
[FIX] sglang evaluation engine initialization error. by @ZX-ModelCloud in #2629
[MODEL] Add minicpmo support by @ZX-ModelCloud in #2630
[CI] update CI path by @CSY-ModelCloud in #2633
[FIX] qwen3_5_moe / llama4 / qwen2_moe / qwen3_next awq layer grouping by @ZX-ModelCloud in #2634
Remove GPTQModel.upload_to_hub() api by @ZX-ModelCloud in #2635
remove export to mlx option by @ZX-ModelCloud in #2636
[MODEL] supports minicpmv by @ZX-ModelCloud in #2637
Paro: layer optimizer by @Qubitium in #2628
Paro inference by @Qubitium in #2638
PrismML/Bonsai Model Support (inference only) by @Qubitium in #2640
Update README.md by @Qubitium in #2641
Update transformers and defuser package versions by @Qubitium in #2642
[CI] install gguf for test_local_model_paths by @CSY-ModelCloud in #2645
fix imported path not found by @CSY-ModelCloud in #2646
[MODEL] support glm4_moe_lite by @ZX-ModelCloud in #2644
[FEATURE] Add FOEM: First-Order Error Matters; Accurate Compensation for Quantized LLM by @Xingyu-Zheng in #2639
Revise README with latest news and article references by @Qubitium in #2647
FIX paroquant bf16 rotation support for fused cuda kernel by @Qubitium in #2648
paroquant rotation autotune by @Qubitium in #2649
[FIX] In awq_processor, dynamic did not correctly skip layers. by @ZX-ModelCloud in #2650
ruff fix by @Qubitium in #2651
Ruff fix by @Qubitium in #2652
update readme by @Qubitium in #2653
fix: ensure contiguous tensors by @Qubitium in #2655
fix failed test by @ZX-ModelCloud in https://github.com/ModelCl...

Contributors

Qubitium, avtc, and 3 other contributors

GPT-QModel v5.8.0

19 Mar 16:35

@Qubitium Qubitium

v5.8.0

9980f01

Notable Changes

Transformers 5.3.0 compatibility.
Video Quantization Support
- Added support for video input during quantization.
MoE & Model Support
- Added support for Qwen 3.5 and Qwen 3.5 MoE.
- Expanded compatibility for Qwen 3 variants including MoE / VL / Omni / Next.
- Added support for LLaDA2 block diffusion LLM models.
- Improved compatibility for Mixtral, Phi-4, Nemotron Ultra, BaiChuan, ChatGLM, Yi, and GLM4V.
- Fixed multiple MoE-specific AWQ and multi-GPU issues, including routing, module tree, position embeddings, and device mismatches.
AWQ / GPTQ Kernels
- Added CPU fused AWQ kernels for torch_fused and hf_kernel.
- Added torch_int8 AWQ kernel.
- Added BitBLAS AWQ kernel.
- Ported Intel int8 GPTQ/AWQ kernels.
- Updated kernel selection to prefer HF kernels where they provide the best performance and compatibility.
- Added BitBLAS fallback protection and fixed BitBLAS accuracy and qzero remap regressions.
Quantization Improvements
- Replaced greedy search with ternary search in SmoothBSE.
- Fixed SmoothMAD overly aggressive clipping.
- Added layer-level dynamic skip for fast quantization.
- Added early stop when all remaining layers are skipped during quantization.
- Fixed AWQ OOM and dequantization-related issues.
Runtime & Dequantization
- Added optional CPU int64 g_idx cache for TorchQuantLinear dequantization.
- Improved TorchFused dequantization and fp32 dtype support.
- Removed unnecessary symmetric handling in dequantize_gemm.
- Fixed rotary embedding device mismatch by storing per-device rotary copies.
- Added warmup protection for threaded timing.
Defuser Integration
- Integrated defuser.convert_hf_model().
- Integrated defuser.materialize_model().
- Integrated defuser.replace_fused_blocks().
- Improved defuser meta/offload compatibility and fused block handling.
Compatibility Fixes
- Improved compatibility with older and newer Hugging Face Transformers / Optimum versions.
- Fixed import compatibility issues in models/utils.
- Fixed rotary / embedding config compatibility with older HF and model variants.
- Improved tokenizer and model compatibility updates related to tokenicer.
- Fixed OSS compatibility issues.
Kernel / Backend Changes
- Hard deprecated ExLLaMA v1 kernel.
- Exposed the Triton patcher as an externally callable API.
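
The switch from greedy to ternary search noted under Quantization Improvements can be illustrated with a generic ternary search over a unimodal function. This sketch does not reproduce SmoothBSE: the objective, bounds, and tolerance are placeholders chosen for the example, and the function name is hypothetical.

```python
def ternary_search_min(f, lo, hi, tol=1e-6, max_iter=200):
    """Approximate the minimizer of a unimodal function f on [lo, hi]."""
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            # The minimum cannot lie in (m2, hi].
            hi = m2
        else:
            # The minimum cannot lie in [lo, m1).
            lo = m1
    return (lo + hi) / 2


# Placeholder objective with its minimum at x = 2.
best = ternary_search_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
print(round(best, 4))
```

Each iteration discards a third of the interval, so the number of objective evaluations grows logarithmically with the required precision. This only holds if the objective is unimodal on the search interval.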

What's Changed

support video input for quantization by @techshoww in #2386
feat: moe-router-bypass-batch-size by @avtc in #2349
[CI] use UV as python manager by @CSY-ModelCloud in #2415
[CI] fix deps installation & gpu service api path by @CSY-ModelCloud in #2416
[CI] auto release GPU if job has sth wrong or unrecoverable by @CSY-ModelCloud in #2417
[CI] save log to disk & fix deps installation by @CSY-ModelCloud in #2418
Replace Greedy with Ternary Search for SmoothBSE by @namgyu-youn in #2419
Feature/LLada2 support: Block Diffusion LLM by @blazingbhavneek in #2422
Bump the github-actions group with 2 updates by @dependabot[bot] in #2426
[MODEL] supports qwen3_5 by @ZX-ModelCloud in #2427
[FIX] eval bug for qwen3_5 quantized model by @ZX-ModelCloud in #2428
[MODEL] supports qwen3_5_moe by @ZX-ModelCloud in #2433
Update tokenicer dependency version to 0.0.7 by @Qubitium in #2434
Optional CPU g_idx int64 cache for TorchQuantLinear dequant path by @Qubitium in #2431
fix import compat issues for models/utils that is locked to higher ve... by @Qubitium in #2436
call defuser.convert_hf_model() by @ZX-ModelCloud in #2437
Update defuser dependency version to 0.0.3 by @Qubitium in #2439
quantize mlp experts module for qwen3_5_moe by @ZX-ModelCloud in #2443
Fix typo in setup.py causing wheel build failure (sys.abiflag -> sys.abiflags) by @beomchan0 in #2444
call defuser's materialize_model() by @ZX-ModelCloud in #2446
Update defuser dependency version to 0.0.4 by @Qubitium in #2447
port intel's int8 gptq/awq kernel over by @Qubitium in #2438
expose triton patcher as externally callable by @Qubitium in #2448
docs by @Qubitium in #2449
Add AWQ support for CPU fused kernels (torch_fused & hf_kernel) by @jiqing-feng in #2445
Cleanupx by @Qubitium in #2450
Make HF kernels for gptq/awq highest priority as they are the highest... by @Qubitium in #2451
rm sym in dequantize_gemm by @jiqing-feng in #2452
fix awq rotary device mismatch. store per-device copy of rotary by @Qubitium in #2453
add torch_int8 awq kernel by @Qubitium in #2454
[CI] move check log to a new step by @CSY-ModelCloud in #2455
cleanup hf kernel gptq/awq post_init loading by @Qubitium in #2457
fix SmoothMAD overly-aggressive clipping by @Qubitium in #2459
upgrade defuser version to 0.0.5 by @ZX-ModelCloud in #2460
[FIX] test_qwen3_5_moe by @ZX-ModelCloud in #2461
Update defuser dependency version to 0.0.6 by @Qubitium in #2462
fix awq oom by @CSY-ModelCloud in #2458
[CI] CUDA 131 + Torch 2.10.0 + Python 3.13 by @CSY-ModelCloud in #2463
Fix the module_tree in Qwen3_5_Moe to correctly support AWQ by @ZX-ModelCloud in #2464
[CI] fix git link cannot be installed by uv by @CSY-ModelCloud in #2465
[FIX] GEMM can't pack by @ZX-ModelCloud in #2466
[CI] add peft for test_asym_gptq_v1 & check log after test by @CSY-ModelCloud in #2467
[CI] get path error from log & install pre-compiled bitblas by @CSY-ModelCloud in #2468
[CI] fix log files were saved with wrong runid by @CSY-ModelCloud in #2469
[FIX] where qwen3_5_moe got incorrect position_embeddings during AWQ quantization by @ZX-ModelCloud in #2470
Update pypcre version to 0.2.13 by @CSY-ModelCloud in #2471
read dependencies from requirements.txt by @CSY-ModelCloud in #2472
add setuptools to requirements.txt by @CSY-ModelCloud in #2474
set minimum setuptools version to 78.1.1 by @CSY-ModelCloud in #2475
[FIX] device mismatch issue that occurred during multi-GPU AWQ quantization in moe Model by @ZX-ModelCloud in #2476
[CI] auto uninstall unneeded pkgs by @CSY-ModelCloud in #2478
fix ci failed tests by @ZX-ModelCloud in #2477
update mixtral's module_tree by @ZX-ModelCloud in #2480
Fix CI by @Qubitium in #2481
[CI] add pypi as backup by @CSY-ModelCloud in #2482
Ci fixes 2 by @Qubitium in #2483
CI Tests Fix 3 by @Qubitium in #2484
[CI] fix old models need old transformers by @CSY-ModelCloud in #2485
fix failed test by @ZX-ModelCloud in #2486
[CI] install latest bitblas & fix missing pkgs by @CSY-ModelCloud in #2487
BaiChuan fix by @Qubitium in #2488
Ci fix 5 by @Qubitium in #2489
Shell/Src module buffer registration mismatch + Qwen 2.5 VL patch by @Qubitium in #2490
[CI] install latest evalplus wheel by @CSY-ModelCloud in #2492
[CI] throw error for fast check by @CSY-ModelCloud in #2493
[FIX] test_post_quant_eora by @ZX-ModelCloud in https://github.com/ModelCloud/GPTQ...

Contributors

Qubitium, avtc, and 8 other contributors

GPT-QModel v5.7.0

10 Feb 10:09

@Qubitium Qubitium

v5.7.0

ed96f2e

Notable Changes:

Feature: MoE.Routing control (Bypass or Override) by @avtc in #2235
Feature: Use FailSafe Naive Quantization when GPTQ fails due to MoE uneven routing by @ZX-ModelCloud in #2293
Feature: ability to pause/resume quantization via 'p' key by @avtc in #2294
Glm4v support by @LRL2-ModelCloud in #2303
Failsafe smoothers by @Qubitium in #2304
New median strategy and SmoothPercentileAsymmetric smoother by @Qubitium in
Support for Qwen2.5-Omni calibration data that includes audio by @ChenShisen in #2309
Add Smooth trigger based on group_size by @Qubitium in #2312
Voxtral support by @LRL2-ModelCloud in #2315
Better compat with triton-windows and other alternative triton packages by @Qubitium in #2395
Dynamically map format/backend to kernel by @Qubitium in #2353
Add EXAONE4 support by @namgyu-youn in #2405

What's Changed

[FIX] unittest by @ZX-ModelCloud in #2291
[FIX] marlin forward by @ZX-ModelCloud in #2296
FIX fast_hadamard_transform import by @LRL2-ModelCloud in #2298
do not log moe errors if failsafe enabled by @Qubitium in #2299
[CI] allow cancel action by @CSY-ModelCloud in #2300
Fix non-rtn packing by @Qubitium in #2302
log q vs weight abs.mean for loss column by @Qubitium in #2306
fix inverted failsafe log condition by @Qubitium in #2310
#2311
Allow failsafe to be none by @Qubitium in #2313
move non-inference affecting fields to meta on save by @Qubitium in #2314
[FIX] GPTQModel.load() can now correctly load non-quantized models. by @ZX-ModelCloud in #2317
FIX hf kernel by @jiqing-feng in #2319
[CI] test_qwen3_moe add eval task: GSM8K_PLATINUM_COT and MMLU_STEM by @ZX-ModelCloud in #2320
Release 5.7 Prep by @Qubitium in #2318
[FIX] Exclude unrouted MoE experts on load by @ZX-ModelCloud in #2321
[FIX] Skip empty subset by @ZX-ModelCloud in #2322
[FIX] GLM-4.5-Air quantize fail by @ZX-ModelCloud in #2323
fix: offload_to_disk=True uses more vram than offload_to_disk=False by @avtc in #2325
Fix import no_init_weights from transformers by @jiqing-feng in #2329
[FIX] qqq quantize by @ZX-ModelCloud in #2330
cherry pick: attempt to fix terminal state after pause/resume handlers by @avtc in #2327
[FIX] quantization failure for non-MoE models by @ZX-ModelCloud in #2333
Device check by @jiqing-feng in #2334
FIX moe flag passing not passing nested ci test by @Qubitium in #2337
Use safer checks for nullable properties where they may not exists at... by @Qubitium in #2338
Fix unit test by @Qubitium in #2339
Group module_tree/subsection parsing related tests to module_tree folder by @Qubitium in #2340
Group kernel tests by @Qubitium in #2341
Lifecycle: Move awq.pack_module to submodule_finalize() from process() by @ZX-ModelCloud in #2335
Partial Revert 2235: temp remove moe bypass by @ZX-ModelCloud in #2343
Re apply compute device filter by @Qubitium in #2345
Re-apply moe routing bypass by @ZX-ModelCloud in #2347
Fix: Zero point underflow in AWQ Exllama v2 kernel by @12345txy in #2351
Remove unnecessary +1/-1 inference/packing zero point offset for AWQ Exllama v2 kernel by @Qubitium in #2352
Normalize AWQ.qcfg zero_point to sym property by @Qubitium in #2355
FIX sym True with AWQ by @ZX-ModelCloud in #2357
Prepare for 5.7 by @Qubitium in #2358
[FIX] self_attn.q_proj was not quantized in the Moonlight Model by @ZX-ModelCloud in #2360
[FIX] torch_fused inference error by @ZX-ModelCloud in #2362
[FIX] FORMAT.LLM_AWQ was incorrectly quantized as FORMAT.GEMM by @ZX-ModelCloud in #2364
[CI] load all tests include sub dirs & merge some small tests in to one file by @CSY-ModelCloud in #2363
Fix evalplus output filename mismatch by @juraev in #2365
[FIX] FORMAT.GEMV and FORMAT.GEMV_FAST could not be quantized by @ZX-ModelCloud in #2366
[CI] add deps config for CI tests by @CSY-ModelCloud in #2368
[FIX] unittest by @ZX-ModelCloud in #2370
[FIX] In AWQProcessor, the failsafe threshold_value should be calculated based on the scale group, not the entire layer by @ZX-ModelCloud in #2369
[CI] fix ci didn't read correct yaml by @CSY-ModelCloud in #2371
[FIX] ci unittest by @ZX-ModelCloud in #2372
[FIX] test_q4_bitblas and test_qqq by @ZX-ModelCloud in #2373
[CI] add test_integration deps by @CSY-ModelCloud in #2374
[CI] fix torch version was upgraded by deps by @CSY-ModelCloud in #2377
select_quant_linear should always receive a non-null device by @Qubitium in #2376
[CI] uninstall pynvml by @CSY-ModelCloud in #2378
[FIX] failed ci test by @ZX-ModelCloud in #2380
[FIX] test_gptq by @ZX-ModelCloud in #2382
[FIX] correct has_captured_input_ids() logic by using > 0 check by @ZX-ModelCloud in #2383
[FIX] test_model by @ZX-ModelCloud in #2384
[FIX] unit test by @ZX-ModelCloud in #2385
[CI] use new docker by @CSY-ModelCloud in #2387
[FIX] ci test by @ZX-ModelCloud in #2388
[FIX] unittest by @ZX-ModelCloud in #2389
[FIX] missing ExllamaV2 kernels initialization in AutoRound by @ZX-ModelCloud in #2390
[CI] keep uv up to date by @CSY-ModelCloud in #2391
[FIX] test_awq by @ZX-ModelCloud in #2392
[FIX] Incorrectly selected device by @ZX-ModelCloud in #2394
[FIX] quantization failure for Qwen2/2.5/3 VL models with FlashAttention-2 by @ZX-ModelCloud in #2396
[FIX] test_ovis2 and test_ovis_1_6_llama by @ZX-ModelCloud in #2397
[FIX] test_stage_modules by @ZX-ModelCloud in #2398
[CI] list test files with py file & fix duplicated test names by @CSY-ModelCloud in #2399
[FIX] test_pause_resume by @ZX-ModelCloud in #2400
[CI] update sort, root test files first by @CSY-ModelCloud in #2401
[FIX] exllama_v1 kernel crash by @ZX-ModelCloud in #2402
[FIX] test_chatglm by @ZX-ModelCloud in #2406
set tokenicer>=0.0.6 by @CSY-ModelCloud in #2407
Fix tokenizer_class incompatibility with transformers 5.0 by @juraev in #2403
[FIX] model_test by @ZX-ModelCloud in #2410
fixed ValueError: invalid pyproject.toml config: project.license. con... by @CSY-ModelCloud in htt...

Contributors

Qubitium, avtc, and 8 other contributors

GPT-QModel v5.6.12

17 Dec 11:28

@Qubitium Qubitium

v5.6.12

1a19cd0

GPT-QModel v5.6.12

Notable Changes:

uv compat
Both uv and pip installs now display UI progress for external wheel/dependency downloads.

What's Changed

[FIX] failed unittest by @ZX-ModelCloud in #2286
fix wheel name mismatches with version name by @CSY-ModelCloud in #2288
Setup download progress by @Qubitium in #2289
Update latest news section in README.md by @Qubitium in #2290

Full Changelog: v5.6.10...v5.6.12

Contributors

Qubitium, ZX-ModelCloud, and CSY-ModelCloud

GPT-QModel v5.6.10

16 Dec 10:13

@Qubitium Qubitium

v5.6.10

70a507d

GPT-QModel v5.6.10

Notable Changes:

Triton check by @Qubitium in #2274
Fix bitblas support for gptq_v2 format by @xxxxyu in #2281
Fix awq triton kernel has invalid properties by @Qubitium in #2279

What's Changed

Add kernel selection log by @ZX-ModelCloud in #2275
Update README.md by @Qubitium in #2276
Update pypcre depend by @Qubitium in #2277
Update version.py by @Qubitium in #2278
Add macos unit tests by @CSY-ModelCloud in #2282
Update README.md by @Qubitium in #2283

New Contributors

@xxxxyu made their first contribution in #2281

Full Changelog: v5.6.6...v5.6.10

Contributors

Qubitium, xxxxyu, and 2 other contributors

GPT-QModel v5.6.8

16 Dec 04:11

@Qubitium Qubitium

v5.6.8

711b214

GPT-QModel v5.6.8

Notable Changes:

Fix Triton check/import by @Qubitium in #2274

What's Changed

Add kernel selection log by @ZX-ModelCloud in #2275
Update README.md by @Qubitium in #2276

Full Changelog: v5.6.6...v5.6.8

Contributors

Qubitium and ZX-ModelCloud

v5.6.6

15 Dec 10:35

@Qubitium Qubitium

v5.6.6

9a79b62

v5.6.6

Notable Changes:

Use static cuda ctx for triton kernel launch by @Qubitium in #2269
Remove random-word depend by @LRL2-ModelCloud in #2266
Update PyPcre depend from 0.2.7 to 0.2.8 by @Qubitium in #2267

What's Changed

Bump the github-actions group with 2 updates by @dependabot[bot] in #2265
Update version.py by @Qubitium in #2268
Ready 5.6.6 by @Qubitium in #2270

Full Changelog: v5.6.2...v5.6.6

Contributors

Qubitium, dependabot, and LRL2-ModelCloud

GPT-QModel v5.6.4

15 Dec 08:27

@Qubitium Qubitium

v5.6.4

61e5e7f

GPT-QModel v5.6.4

What's Changed

Bump the github-actions group with 2 updates by @dependabot[bot] in #2265
remove random-word depend by @LRL2-ModelCloud in #2266
Update pypcre version from 0.2.7 to 0.2.8 by @Qubitium in #2267
Update version.py by @Qubitium in #2268

Full Changelog: v5.6.2...v5.6.4

Contributors

Qubitium, dependabot, and LRL2-ModelCloud
