Releases: ModelCloud/GPTQModel

GPT-QModel v7.1.0

08 Jun 08:04
@Qubitium Qubitium
d49b9bd
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194

Full Changelog: v7.0.0...v7.1.0

Contributors

Qubitium, ZX-ModelCloud, and 3 other contributors

🚀 GPTQModel v7.0.0

28 Apr 20:37
@Qubitium Qubitium
f731429

🔥 Major

  • New Huawei Ascend NPU quantization support, with Torch-based kernels for inference
  • All CUDA/ROCm compiled kernels are now JIT (just-in-time) compiled on first use
  • Pip/UV install no longer requires the --no-build-isolation flag

🧠 New model support and compatibility wins

  • Added support for GLM 5/5.1, GLM OCR, GLM ASR, Gemma 3n, Falcon Mamba, and InternVL Chat.
  • Extended OpenVINO GPTQ patching to understand GPTQModel's newer kernels.
  • Fixed Qwen3 dtype handling, Qwen3.5 MoE module-tree assertions, Qwen2-VL calibration input capture, and Qwen 3.6 MoE regressions.
  • Fixed Llama4Router replacement behavior, Phi-3 defused MLP module mapping, Phi-4 runtime requirements, Instella rope-scaling compatibility, Ling compatibility, Mixtral MoE checkpoint module names, Brumby thread safety, Baichuan compatibility, and Gemma 3 saving.
  • Fixed exllamav3_torch import under meta-device context.

⚡ Kernels, JIT, and hardware acceleration

  • Moved all kernels that require compilation to JIT compilation on first use, and cleaned up Marlin import probing, CUDA header handling, nvcc flag checks, and Torch/CUDA mismatch handling.
  • Synced Marlin/Machete kernels with upstream and added hardware-specific Marlin boost paths.
  • Guarded CUTLASS version mismatches and fixed generated-kernel staleness.
  • Added global kernel rebuild support for CI and safer shared extension locks.
  • Added Ascend NPU support.
  • Fixed AWQ JIT cache invalidation, illegal memory access, SM120 execution, GEMM_Fast shared-memory launch, and BF16 bias validation.
  • Fixed BACKEND.MARLIN loading for gptq_v2 format and added Marlin import coverage.
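
The first-use compilation and cache-staleness items above follow a common pattern: build lazily, and key the cached artifact on everything that can change the compiled output. The sketch below illustrates that pattern in plain Python. It is not GPTQModel's implementation; `cache_key`, `LazyKernel`, and `build_fn` are hypothetical names introduced only for illustration.

```python
import hashlib
import threading
from typing import Callable, Dict, Sequence


def cache_key(sources: Sequence[str], flags: Sequence[str], toolchain: str) -> str:
    """Hash every input that can change the compiled output."""
    h = hashlib.sha256()
    for part in (*sources, "\0flags", *flags, "\0toolchain", toolchain):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent parts cannot merge
    return h.hexdigest()


class LazyKernel:
    """Builds on first use and rebuilds when the cache key changes."""

    def __init__(self, build_fn: Callable[[], Callable[..., object]]):
        self._build_fn = build_fn  # hypothetical: stands in for the real compiler call
        self._built: Dict[str, Callable[..., object]] = {}
        self._lock = threading.Lock()  # serializes concurrent first use

    def get(self, key: str) -> Callable[..., object]:
        with self._lock:
            if key not in self._built:
                self._built.clear()  # drop artifacts built for an older key
                self._built[key] = self._build_fn()
            return self._built[key]
```

Including the toolchain string in the key means that, in this sketch, a CUDA or compiler upgrade triggers a rebuild rather than reusing a stale artifact.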

🔥 Quantization, AWQ, FP8, and dequant

  • Added FP8/FP4 CPU dequant and DeepSeek FP8 .scale dequant export.
  • Added dtype auto-decoding and decode path updates.
  • Reduced AWQ scale-search activation memory and split AWQ integration tests for cleaner coverage.
  • Now fails fast on unsupported act-group-aware GPTQ shapes instead of continuing into invalid layouts.
  • Fixed INT3 qzero format conversion, GAR width compatibility, and GPTQ batched keep-mask handling.
  • Improved AWQ W4A8 and BF16 validation paths, plus post-quant MoE routing behavior.
  • Used loader device selection for EoRA adapter generation.
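
To make the FP8 CPU dequant item above concrete, here is a minimal pure-Python sketch of decoding FP8 bytes and applying a scale. It assumes the E4M3 finite-only encoding (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7) and a single per-tensor multiplier; the release notes do not state the encoding, the scale granularity, or whether the stored scale is a multiplier or its inverse, so treat those as assumptions. `decode_e4m3` and `dequantize` are hypothetical helper names, not GPTQModel functions.

```python
from typing import Iterable, List


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 (finite-only variant) byte to a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")  # the only NaN pattern; this variant has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize(raw: Iterable[int], scale: float) -> List[float]:
    """Apply one scale to a run of FP8 bytes (per-tensor scaling assumed)."""
    return [decode_e4m3(b) * scale for b in raw]
```

For example, `dequantize([0x38, 0x40], 0.5)` decodes the bytes to 1.0 and 2.0 and then scales them to 0.5 and 1.0.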

🐢 LazyTurtle, loading, and model plumbing

  • Refactored input capture into BaseQModel and model-specific QModels for cleaner replay and calibration flows.
  • Renamed and hardened the turtle path into LazyTurtle, with stricter materialization failures and better expected-skip handling.
  • Fixed LazyTurtle materialization for non-square fused experts, PhiMoE, nested HF weight renames, reversed WeightRenaming semantics, and non-Safetensors checkpoints.
  • Improved out-of-model tensor handling for MTP prefix/files paths.
  • Removed BaseModel.loader_requires_dtype and normalized config dtype handling through get_hf_config_dtype().
  • Fixed multi-GPU replay output retention, GPTQ finalizer overlap, and quantization OOMs from retained callable cache keys.

🧰 CI, packaging, and developer workflow

  • Cleaned up CI shell logic, environment setup, UV cache handling, reusable Torch tests, CPU-only grouping, runner selection, retry behavior, and offload temp paths.
  • Kept CI and Torch CUDA versions aligned, moved to newer Docker images, and surfaced real exit codes and GPU names.
  • Removed lm-eval, deprecated tests, deprecated artifact IDs, pause UI lifecycle code, and tabulate from CI/test paths.
  • Migrated more regex usage to pcre/pcre2.
  • Replaced temp path helpers with tempfile.TemporaryDirectory() for automatic cleanup.
  • Updated requirements, dependencies, setuptools compatibility, and install-with-Torch validation.
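
The move to `tempfile.TemporaryDirectory()` noted above relies on the context manager deleting the directory when the block exits. A minimal sketch of that behavior follows; `write_scratch` and the file name are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def write_scratch(payload: bytes) -> Tuple[int, Path]:
    """Write payload into a scratch directory; return its size and former path."""
    with tempfile.TemporaryDirectory(prefix="offload_") as tmp:
        target = Path(tmp) / "shard.bin"  # hypothetical file name
        target.write_bytes(payload)
        size = target.stat().st_size
    # Leaving the `with` block removed the directory and everything in it.
    return size, target
```

The returned path no longer exists once the function returns, which is the cleanup guarantee that hand-written temp path helpers have to reimplement.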

💥 Breaking and removed

  • Kernel loading behavior has shifted heavily toward JIT compilation, so custom deployment environments should verify compiler/CUDA compatibility.
  • lm-eval references were removed from CI and test/docs paths.
  • Deprecated tests, artifact handling, and pause UI lifecycle code were removed.
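
For the compiler/CUDA verification suggested above, one possible pre-deployment probe is sketched below. It is not a GPTQModel utility: `probe_toolchain` is a hypothetical helper that only reports whether `nvcc` is on `PATH` and which CUDA version the installed PyTorch build targets, leaving the compatibility decision to the reader.

```python
import shutil
import subprocess
from typing import Dict, Optional


def probe_toolchain() -> Dict[str, Optional[str]]:
    """Collect compiler and CUDA details relevant to JIT kernel builds."""
    info: Dict[str, Optional[str]] = {
        "nvcc_path": shutil.which("nvcc"),
        "nvcc_release": None,
        "torch_cuda": None,
    }
    if info["nvcc_path"]:
        result = subprocess.run(
            [info["nvcc_path"], "--version"],
            capture_output=True, text=True, check=False,
        )
        # Keep the line that names the release, if nvcc printed one.
        info["nvcc_release"] = next(
            (line.strip() for line in result.stdout.splitlines() if "release" in line),
            None,
        )
    try:
        import torch
        info["torch_cuda"] = torch.version.cuda  # None on builds without CUDA
    except ImportError:
        pass  # PyTorch not installed; leave the field as None
    return info
```

A missing `nvcc_path` or a `torch_cuda` value that disagrees with the `nvcc` release would be a reason to inspect the environment before relying on first-use compilation.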

Full Changelog:

Contributors

Qubitium, dblundell, and 2 other contributors

GPT-QModel v6.0.3

02 Apr 23:56
@Qubitium Qubitium
6a65d69

Notable Changes:

Quantization and inference

  • Major ParoQuant improvements across speed, inference, and accuracy.
  • Added Paro inference support and a new layer optimizer.
  • Auto-enables AMP for the fast Paro implementation to better match reference behavior.
  • Added Paro rotation autotuning and fixed BF16 rotation support for the fused CUDA kernel.
  • Improved Paro stability with seeding fixes, cleanup, learned channel scale clamping, and contiguous tensor handling fixes.
  • Fixed a layer output replay/re-capture regression.
  • Added FOEM (First-Order Error Matters) for more accurate quantized LLM compensation, plus follow-up fixes to its data processing pipeline.
  • Replaced the old marlin_fp16 backend behavior with environment-flag control for FP32 reduction.

Model and backend support

  • Added support for Gemma4, MiniCPMO, MiniCPMV, and GLM4-MoE-Lite.
  • Added PrismML/Bonsai model support for inference.
  • Fixed Qwen3_5QModel definition issues.
  • Fixed Qwen 3.5 rotary embedding behavior.
  • Fixed AWQ layer grouping for qwen3_5_moe, llama4, qwen2_moe, and qwen3_next.
  • Fixed awq_processor.dynamic so skipped layers are handled correctly.
  • Improved dtype compatibility.
  • Hugging Face kernels are now gated off on Python no-GIL builds until upstream wheel support is fixed.

Evaluation, calibration, and usability

  • Integrated Evalution into the workflow.
  • Added evalution.VLLM and evalution.SGLang backends.
  • Fixed SGLang evaluation engine initialization.
  • Automatically determines MODEL_COMPAT_FAST_LAYER_COUNT.
  • Improved calibration data device handling.
  • Updated tokenizer handling, and collation now respects tokenizer padding_size.
  • Improved import performance by lazy-loading _DEVICE_THREAD_POOL.
  • Cleaned up warning behavior and added an option to suppress warnings.
  • Removed forced random seed overrides.
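
The import-time improvement from lazy-loading `_DEVICE_THREAD_POOL` can be pictured with the general lazy-initialization pattern below. This is a sketch under assumptions, not the library's code: `get_thread_pool`, the worker count, and the thread name prefix are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

_pool: Optional[ThreadPoolExecutor] = None
_pool_lock = threading.Lock()


def get_thread_pool() -> ThreadPoolExecutor:
    """Create the shared pool on first call; later calls reuse it."""
    global _pool
    if _pool is None:  # fast path once the pool exists
        with _pool_lock:
            if _pool is None:  # re-check under the lock to avoid a double build
                _pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="device")
    return _pool
```

Because nothing is constructed at module import, code paths that never submit work do not pay for creating the pool.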

Dependency and compatibility updates

  • Updated pypcre to 0.2.14.
  • Pinned logbar to >=0.4.1.
  • Updated transformers and defuser package versions.
  • Fixed SAVE_PATH handling and import path resolution issues.

Breaking and removed

  • Removed GPTQModel.upload_to_hub().
  • Removed MLX export support.


Contributors

Qubitium, avtc, and 3 other contributors

GPT-QModel v5.8.0

19 Mar 16:35
@Qubitium Qubitium
9980f01

Notable Changes

  • Transformers 5.3.0 compatibility.

  • Video Quantization Support

    • Added support for video input during quantization.
  • MoE & Model Support

    • Added support for Qwen 3.5 and Qwen 3.5 MoE.
    • Expanded compatibility for Qwen 3 variants including MoE / VL / Omni / Next.
    • Added support for LLada2 block diffusion LLM models.
    • Improved compatibility for Mixtral, Phi-4, Nemotron Ultra, BaiChuan, ChatGLM, Yi, and GLM4V.
    • Fixed multiple MoE-specific AWQ and multi-GPU issues, including routing, module tree, position embeddings, and device mismatches.
  • AWQ / GPTQ Kernels

    • Added CPU fused AWQ kernels for torch_fused and hf_kernel.
    • Added torch_int8 AWQ kernel.
    • Added BitBLAS AWQ kernel.
    • Ported Intel int8 GPTQ/AWQ kernels.
    • Updated kernel selection to prefer HF kernels where they provide the best performance and compatibility.
    • Added BitBLAS fallback protection and fixed BitBLAS accuracy and qzero remap regressions.
  • Quantization Improvements

    • Replaced greedy search with ternary search in SmoothBSE.
    • Fixed SmoothMAD overly aggressive clipping.
    • Added layer-level dynamic skip for fast quantization.
    • Added early stop when all remaining layers are skipped during quantization.
    • Fixed AWQ OOM and dequantization-related issues.
  • Runtime & Dequantization

    • Added optional CPU int64 g_idx cache for TorchQuantLinear dequantization.
    • Improved TorchFused dequantization and fp32 dtype support.
    • Removed unnecessary symmetric handling in dequantize_gemm.
    • Fixed rotary embedding device mismatch by storing per-device rotary copies.
    • Added warmup protection for threaded timing.
  • Defuser Integration

    • Integrated defuser.convert_hf_model().
    • Integrated defuser.materialize_model().
    • Integrated defuser.replace_fused_blocks().
    • Improved defuser meta/offload compatibility and fused block handling.
  • Compatibility Fixes

    • Improved compatibility with older and newer Hugging Face Transformers / Optimum versions.
    • Fixed import compatibility issues in models/utils.
    • Fixed rotary / embedding config compatibility with older HF and model variants.
    • Improved tokenizer and model compatibility through tokenicer-related updates.
    • Fixed OSS compatibility issues.
  • Kernel / Backend Changes

    • Hard-deprecated the ExLLaMA v1 kernel.
    • Exposed the Triton patcher as an externally callable API.
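
The switch from greedy search to ternary search listed under Quantization Improvements can be illustrated with the textbook algorithm. The sketch below minimizes a unimodal function of one variable; the notes do not describe the objective that SmoothBSE actually searches over, so the quadratic loss in the usage note is a stand-in, and `ternary_search_min` is a hypothetical name.

```python
from typing import Callable


def ternary_search_min(
    f: Callable[[float], float], lo: float, hi: float, tol: float = 1e-6
) -> float:
    """Return an x in [lo, hi] near the minimum of a unimodal function f."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2  # the minimum cannot lie in (m2, hi]
        else:
            lo = m1  # the minimum cannot lie in [lo, m1)
    return (lo + hi) / 2.0
```

Calling `ternary_search_min(lambda s: (s - 0.7) ** 2, 0.0, 1.0)` converges toward 0.7. Each iteration discards a third of the interval, so the number of evaluations grows logarithmically with the required precision, provided the objective is unimodal on the interval.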

GPT-QModel v5.7.0

10 Feb 10:09
@Qubitium Qubitium
ed96f2e

GPT-QModel v5.6.12

17 Dec 11:28
@Qubitium Qubitium
1a19cd0

Notable Changes:

  • Added uv compatibility.
  • Both uv and pip installs now display UI progress for external wheel and dependency downloads.

Full Changelog: v5.6.10...v5.6.12

Contributors

Qubitium, ZX-ModelCloud, and CSY-ModelCloud

GPT-QModel v5.6.10

16 Dec 10:13
@Qubitium Qubitium
70a507d

Full Changelog: v5.6.6...v5.6.10

Contributors

Qubitium, xxxxyu, and 2 other contributors

GPT-QModel v5.6.8

16 Dec 04:11
@Qubitium Qubitium
711b214

Full Changelog: v5.6.6...v5.6.8

Contributors

Qubitium and ZX-ModelCloud

v5.6.6

15 Dec 10:35
@Qubitium Qubitium
9a79b62

Full Changelog: v5.6.2...v5.6.6

Contributors

Qubitium, dependabot, and LRL2-ModelCloud

GPT-QModel v5.6.4

15 Dec 08:27
@Qubitium Qubitium
61e5e7f

Full Changelog: v5.6.2...v5.6.4

Contributors

Qubitium, dependabot, and LRL2-ModelCloud