add Qwen Image support #851
Conversation
SeanTater
commented
Sep 24, 2025
Thanks for adding this! I got it working on CPU on my machine, but as you would expect, it's quite slow.
I tried compiling with Vulkan, which compiles, but segfaults immediately as it starts the diffusion. Are you already working on that?
FWIW, Codex suggests changing ggml_vk_build_graph, which does get it to compute something - but it produces nonsense. I get garbled output that doesn't appear to depend on the prompt. It's the same with or without --diffusion-fa. With VAE tiling, I get a floating point exception.
2025-09-23T12:54:42-04:00
When running the VAE on the CPU instead, we have a different problem: we get a tiled field like this, whose exact color varies.
(image: vulkan-variation-01-seed1000)
I suspect there may be an as-yet-unimplemented op that is being silently stubbed out.
jeffbolznv
commented
Sep 24, 2025
Where/how does it crash with Vulkan?
> Where/how does it crash with Vulkan?
Testing it here, I get:
$ ./sd --diffusion-model ./qwen-image-Q4_0.gguf --vae ./Qwen_Image-VAE.safetensors --qwen2vl ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf -p 一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。" --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 512 -W 512 --diffusion-fa --flow-shift 3
Option:
n_threads: 4
mode: img_gen
model_path:
wtype: unspecified
clip_l_path:
clip_g_path:
clip_vision_path:
t5xxl_path:
qwen2vl_path: ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf
diffusion_model_path: ./qwen-image-Q4_0.gguf
high_noise_diffusion_model_path:
vae_path: ./Qwen_Image-VAE.safetensors
taesd_path:
esrgan_path:
control_net_path:
embedding_dir:
photo_maker_path:
pm_id_images_dir:
pm_id_embed_path:
pm_style_strength: 20.00
output_path: output.png
init_image_path:
end_image_path:
mask_image_path:
control_image_path:
ref_images_paths:
control_video_path:
increase_ref_index: false
offload_params_to_cpu: true
clip_on_cpu: false
control_net_cpu: false
vae_on_cpu: false
diffusion flash attention: true
diffusion Conv2d direct: false
vae_conv_direct: false
control_strength: 0.90
prompt: 一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。"
negative_prompt:
clip_skip: -1
width: 512
height: 512
sample_params: (txt_cfg: 2.50, img_cfg: 2.50, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0)
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: -1, eta: 0.00, shifted_timestep: 0)
moe_boundary: 0.875
flow_shift: 3.00
strength(img2img): 0.75
rng: cuda
seed: 42
batch_count: 1
vae_tiling: false
upscale_repeats: 1
chroma_use_dit_mask: true
chroma_use_t5_mask: false
chroma_t5_mask_pad: 1
video_frames: 1
vace_strength: 1.00
fps: 16
System Info:
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:153 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] stable-diffusion.cpp:209 - loading diffusion model from './qwen-image-Q4_0.gguf'
[INFO ] model.cpp:1071 - load ./qwen-image-Q4_0.gguf using gguf format
[DEBUG] model.cpp:1088 - init from './qwen-image-Q4_0.gguf'
[INFO ] stable-diffusion.cpp:256 - loading qwen2vl from './Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] model.cpp:1071 - load ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf using gguf format
[DEBUG] model.cpp:1088 - init from './Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] stable-diffusion.cpp:263 - loading vae from './Qwen_Image-VAE.safetensors'
[INFO ] model.cpp:1074 - load ./Qwen_Image-VAE.safetensors using safetensors format
[DEBUG] model.cpp:1181 - init from './Qwen_Image-VAE.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:275 - Version: Qwen Image
[INFO ] stable-diffusion.cpp:306 - Weight type: bf16
[INFO ] stable-diffusion.cpp:307 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:308 - Diffusion model weight type: bf16
[INFO ] stable-diffusion.cpp:309 - VAE weight type: NONE
[DEBUG] stable-diffusion.cpp:311 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:350 - Using flash attention in the diffusion model
[DEBUG] qwenvl.hpp:137 - merges size 151387
[DEBUG] qwenvl.hpp:159 - vocab size: 151665
[DEBUG] ggml_extend.hpp:1738 - qwenvl2.5 params backend buffer size = 3607.26 MB(RAM) (338 tensors)
[DEBUG] ggml_extend.hpp:1738 - qwen_image params backend buffer size = 11303.54 MB(RAM) (1933 tensors)
[DEBUG] ggml_extend.hpp:1738 - wan_vae params backend buffer size = 139.84 MB(RAM) (108 tensors)
[DEBUG] stable-diffusion.cpp:583 - loading weights
[DEBUG] model.cpp:2069 - loading tensors from ./qwen-image-Q4_0.gguf
|=======================================> | 1933/2465 - 804.75it/s
[DEBUG] model.cpp:2069 - loading tensors from ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf
|==============================================> | 2271/2465 - 222.34it/s
[DEBUG] model.cpp:2069 - loading tensors from ./Qwen_Image-VAE.safetensors
|==============================================> | 2283/2465 - 223.49it/s[INFO ] model.cpp:2339 - unknown tensor 'first_stage_model.conv1.weight | bf16 | 4 [1, 1, 1, 1024, 1]' in model file
|================================================> | 2393/2465 - 229.76it/s[INFO ] model.cpp:2339 - unknown tensor 'first_stage_model.conv1.bias | bf16 | 1 [32, 1, 1, 1, 1]' in model file
|==================================================| 2465/2465 - 232.22it/s
[INFO ] model.cpp:2307 - loading tensors completed, taking 10.65s (process: 0.04s, read: 9.94s, memcpy: 0.00s, convert: 0.10s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:664 - total params memory size = 15050.64MB (VRAM 15050.64MB, RAM 0.00MB): text_encoders 3607.26MB(VRAM), diffusion_model 11303.55MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:726 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:750 - finished loaded file
[DEBUG] stable-diffusion.cpp:2328 - generate_image 512x512
[INFO ] stable-diffusion.cpp:2441 - TXT2IMG
init (f32): shape(64, 64, 16, 1)
[INFO ] stable-diffusion.cpp:899 - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:919 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:920 - prompt after extract and remove lora: "一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。""
[DEBUG] conditioner.hpp:1416 - parse '<|im_start|>system
Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>
<|im_start|>user
一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。"<|im_end|>
<|im_start|>assistant
' to [['<|im_start|>system
Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>
<|im_start|>user
一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔�
[INFO ] ggml_extend.hpp:1661 - qwenvl2.5 offload params (3607.26 MB, 338 tensors) to runtime backend (Vulkan0), taking 1.49s
[DEBUG] ggml_extend.hpp:1563 - qwenvl2.5 compute buffer size: 30.06 MB(VRAM)
Segmentation fault (core dumped)
gdb shows just this:
Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x000055555587569d in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool, bool) ()
I'll try on a debug build. @jeffbolznv , anything more specific I could check?
@SeanTater @wbruna This is likely because GGML Vulkan doesn’t support im2col_3d. I’ve updated GGML, so you can pull the latest code and try again.
@leejet , unfortunately a3a2b2d (with ggml 553c44706c) crashes too:
The last output lines:
ggml_backend_vk_buffer_init_tensor(0x55555963f8f0 (0x555559b0ce40), 0x7ffbc551c020)
ggml_backend_vk_buffer_init_tensor(0x55555963f8f0 (0x555559b0ce40), 0x7ffbc551c1d0)
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc548e060, 0x555559602820, 0, 4)
ggml_vk_buffer_write(4)
ggml_vk_buffer_write_2d(4, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(4, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc54a24c0, 0x7ffbe003f3a0, 0, 160)
ggml_vk_buffer_write(160)
ggml_vk_buffer_write_2d(160, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(160, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_vk_command_pool_cleanup()
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc54a2670, 0x55555ab88040, 0, 640)
ggml_vk_buffer_write(640)
ggml_vk_buffer_write_2d(640, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(640, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_backend_vk_graph_compute(1154 nodes)
ggml_vk_build_graph(0x7ffbc54a2820, RESHAPE)
ggml_vk_build_graph(0x7ffbc54a29d0, RESHAPE)
ggml_vk_build_graph(0x7ffbc54a2b80, GET_ROWS)
ggml_pipeline_request_descriptor_sets(
Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x00007ffff7d54b24 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb)
GDB backtrace
(gdb) bt
#0 0x00007ffff7d54b24 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00005555558c25b2 in ggml_pipeline_request_descriptor_sets (ctx=0x5555594e62e0, pipeline=std::shared_ptr<vk_pipeline_struct> (empty) = {...}, n=1)
at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:1653
#2 0x00005555559a38f3 in ggml_vk_build_graph (ctx=0x5555594e62e0, cgraph=0x7ffbc548e210, node_idx=2, node_begin=0x0, node_idx_begin=0, dryrun=true,
last_node=false, almost_ready=false, submit=false) at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:10650
#3 0x00005555559a9e1e in ggml_backend_vk_graph_compute (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:11743
#4 0x0000555555a79755 in ggml_backend_graph_compute_async (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
at ./ggml/src/ggml-backend.cpp:359
#5 0x0000555555a796e5 in ggml_backend_graph_compute (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
at ./ggml/src/ggml-backend.cpp:352
#6 0x0000555555682db2 in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) (this=0x555559d6b550,
get_graph=..., n_threads=1, free_compute_buffer_immediately=true, output=0x7fffffffb3a8, output_ctx=0x55555942bf60)
at ./ggml_extend.hpp:1824
#7 0x0000555555694938 in Qwen::Qwen2_5_VLRunner::compute (this=0x555559d6b550, n_threads=1, input_ids=0x7ffbe003f210, output=0x7fffffffb3a8,
output_ctx=0x55555942bf60) at ./qwenvl.hpp:603
#8 0x00005555556a7e18 in Qwen2_5_VLCLIPEmbedder::get_learned_condition_common (this=0x55555942b9b0, work_ctx=0x55555942bf60, n_threads=1,
token_and_weights=std::tuple containing = {...}, clip_skip=-1, zero_out_masked=false) at ./conditioner.hpp:1452
#9 0x00005555556a8287 in Qwen2_5_VLCLIPEmbedder::get_learned_condition (this=0x55555942b9b0, work_ctx=0x55555942bf60, n_threads=1, text="flower",
clip_skip=-1, width=512, height=512, adm_in_channels=768, zero_out_masked=false) at ./conditioner.hpp:1500
#10 0x000055555565f7d0 in generate_image_internal (sd_ctx=0x55555942a480, work_ctx=0x55555942bf60, init_latent=0x7ffbdffff060, prompt="flower",
negative_prompt="", clip_skip=-1, guidance=..., eta=0, shifted_timestep=0, width=512, height=512, sample_method=EULER,
sigmas=std::vector of length 21, capacity 32 = {...}, seed=42, batch_count=1, control_image=..., control_strength=0.899999976, pm_params=...,
ref_latents=std::vector of length 0, capacity 0, increase_ref_index=false, concat_latent=0x0, denoise_mask=0x0)
at ./stable-diffusion.cpp:2086
#11 0x0000555555661c6b in generate_image (sd_ctx=0x55555942a480, sd_img_gen_params=0x7fffffffbe70)
at ./stable-diffusion.cpp:2492
#12 0x00005555555add61 in main (argc=25, argv=0x7fffffffd828) at ./examples/cli/main.cpp:1392
jeffbolznv
commented
Sep 24, 2025
What are the src and dst types for the GET_ROWS that crashes?
(gdb) frame 2
#2 0x00005555559a38f3 in ggml_vk_build_graph (ctx=0x5555594e62e0, cgraph=0x7ffbc548e210, node_idx=2, node_begin=0x0, node_idx_begin=0, dryrun=true,
last_node=false, almost_ready=false, submit=false) at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:10650
10650 ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
(gdb) print src0->type
6ドル = GGML_TYPE_Q4_K
(gdb) print src1->type
7ドル = GGML_TYPE_I32
(gdb) print src2->type
Cannot access memory at address 0x0
(gdb) print node->op
8ドル = GGML_OP_GET_ROWS
Interesting... the model files are qwen-image-Q4_0.gguf and Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf.
jeffbolznv
commented
Sep 24, 2025
Thanks. We're missing the K quants, but I don't think there's any reason for that. I'll add them.
jeffbolznv
commented
Sep 24, 2025
Please try ggml-org/llama.cpp#16235.
After applying the change from Jeff's PR in llama.cpp to the ggml submodule in stable-diffusion.cpp, it does run without crashing. But I get garbled output, even though it does recognize the devices:
[DEBUG] stable-diffusion.cpp:153 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: Found 2 Vulkan devices:
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 1 = Intel(R) Graphics (RPL-S) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
... and it claims it places them in VRAM ...
[INFO ] stable-diffusion.cpp:664 - total params memory size = 16634.26MB (VRAM 16634.26MB, RAM 0.00MB): text_encoders 4034.09MB(VRAM), diffusion_model 12460.33MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
rocm-smi disagrees:
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
================================================================================================================
0 1 0x744c, 57282 37.0°C 6.0W N/A, N/A, 0 0Mhz 96Mhz 0% auto 327.0W 11% 0%
It does finish in 17 seconds per step, as opposed to about 70 for successful CPU sampling, but I think that may be a red herring, since the output is garbage and the GPU is idle.
After applying ggml-org/llama.cpp@9073a73 and ggml-org/llama.cpp#16235, I got a broken image too:
(image: testqwen01)
./sd --diffusion-model ./Qwen_Image_Distill-Q4_0.gguf --vae ./Qwen_Image-VAE-f16.gguf --qwen2vl ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf -p '(...)' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 512 -W 512 --diffusion-fa --steps 20
VAE tiling also crashes, with a floating point exception (core dumped).
jeffbolznv
commented
Sep 25, 2025
I'm seeing similar corruption. I'll try to debug it.
I updated ggml to the latest commit and optimized the handling of embedding weights, so there’s no need to use k_quant’s get_rows. I’m not sure if this will fix the Vulkan issue.
jeffbolznv
commented
Sep 25, 2025
I don't think it's related to get_rows. Setting GGML_VK_DISABLE_FUSION=1 seems to fix it. I'll continue to narrow it down.
jeffbolznv
commented
Sep 25, 2025
Oops, I think I mixed up my experiments. I think it's forcing GGML_PREC_F32 for matrix-matrix multiplies that's fixing it. I don't know which multiplies, I just forced it for all of them.
Thanks for the quick work on a first implementation! Note this currently terminates with ggml_abort() on macOS (ARM) under Metal (compiled with -DSD_METAL=ON), both for the original pull request and the current version.
Last lines of output:
[INFO] stable-diffusion.cpp:2185 - generating image: 1/1 - seed 42
[ERROR] ggml_extend.hpp:71 - ggml_metal_encode_node: error: unsupported op 'MUL_MAT'
.../stable-diffusion.cpp/ggml/src/ggml-metal/ggml-metal.m: 2068: unsupported op
Command:
sd --diffusion-model Qwen_Image-Q4_0.gguf --vae Qwen_Image-VAE.safetensors --qwen2vl Qwen2.5-VL-7B-Instruct-Q4_0.gguf -p "(my test prompt)" --steps 2 --width 640 --height 480
Stack trace from crash log:
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x1844e2388 __pthread_kill + 8
1 libsystem_pthread.dylib 0x18451b88c pthread_kill + 296
2 libsystem_c.dylib 0x184424a3c abort + 124
3 sd 0x10464d144 ggml_abort + 160
4 sd 0x10464a758 ggml_metal_encode_node + 27288
5 sd 0x104643c2c __ggml_backend_metal_set_n_cb_block_invoke + 596
6 sd 0x1046436c4 ggml_backend_metal_graph_compute + 368
7 sd 0x104663098 ggml_backend_graph_compute + 32
8 sd 0x1045179c4 GGMLRunner::compute(std::__1::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) + 648
9 sd 0x10454d498 QwenImageModel::compute(int, DiffusionParams, ggml_tensor**, ggml_context*) + 152
> I updated ggml to the latest commit and optimized the handling of embedding weights, so there’s no need to use k_quant’s get_rows. I’m not sure if this will fix the Vulkan issue.
Unfortunately, I get similar bad results with 94f4f29.
ROCm fails, too (fully black images).
jeffbolznv
commented
Sep 25, 2025
This is the minimum precision change I could make to get this working in Vulkan. It's the img_mlp in QwenImageTransformerBlock that needs fp32 accumulation.
diff --git a/common.hpp b/common.hpp
index 9c8aba1..1e01825 100644
--- a/common.hpp
+++ b/common.hpp
@@ -242,7 +242,8 @@ public:
FeedForward(int64_t dim,
int64_t dim_out,
int64_t mult = 4,
- Activation activation = Activation::GEGLU) {
+ Activation activation = Activation::GEGLU,
+ bool force_prec_f32 = false) {
int64_t inner_dim = dim * mult;
if (activation == Activation::GELU) {
@@ -252,7 +253,7 @@ public:
}
// net_1 is nn.Dropout(), skip for inference
- blocks["net.2"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim_out));
+ blocks["net.2"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim_out, true, false, force_prec_f32));
}
struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
diff --git a/ggml_extend.hpp b/ggml_extend.hpp
index 965b979..e3b4926 100644
--- a/ggml_extend.hpp
+++ b/ggml_extend.hpp
@@ -933,8 +933,12 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_group_norm_32(struct ggml_context* ct
__STATIC_INLINE__ struct ggml_tensor* ggml_nn_linear(struct ggml_context* ctx,
struct ggml_tensor* x,
struct ggml_tensor* w,
- struct ggml_tensor* b) {
+ struct ggml_tensor* b,
+ bool force_prec_f32 = false) {
x = ggml_mul_mat(ctx, w, x);
+ if (force_prec_f32) {
+ ggml_mul_mat_set_prec(x, GGML_PREC_F32);
+ }
if (b != NULL) {
x = ggml_add_inplace(ctx, x, b);
}
@@ -1947,6 +1951,7 @@ protected:
int64_t out_features;
bool bias;
bool force_f32;
+ bool force_prec_f32;
void init_params(struct ggml_context* ctx, const String2GGMLType& tensor_types = {}, const std::string prefix = "") {
enum ggml_type wtype = get_type(prefix + "weight", tensor_types, GGML_TYPE_F32);
@@ -1964,11 +1969,13 @@ public:
Linear(int64_t in_features,
int64_t out_features,
bool bias = true,
- bool force_f32 = false)
+ bool force_f32 = false,
+ bool force_prec_f32 = false)
: in_features(in_features),
out_features(out_features),
bias(bias),
- force_f32(force_f32) {}
+ force_f32(force_f32),
+ force_prec_f32(force_prec_f32) {}
struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
struct ggml_tensor* w = params["weight"];
@@ -1976,7 +1983,7 @@ public:
if (bias) {
b = params["bias"];
}
- return ggml_nn_linear(ctx, x, w, b);
+ return ggml_nn_linear(ctx, x, w, b, force_prec_f32);
}
};
diff --git a/qwen_image.hpp b/qwen_image.hpp
index 2f5dad8..ab16b82 100644
--- a/qwen_image.hpp
+++ b/qwen_image.hpp
@@ -196,7 +196,7 @@ namespace Qwen {
blocks["img_norm1"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim, eps, false));
blocks["img_norm2"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim, eps, false));
- blocks["img_mlp"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim, 4, FeedForward::Activation::GELU));
+ blocks["img_mlp"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim, 4, FeedForward::Activation::GELU, true));
// txt_mod.0 is nn.SiLU()
blocks["txt_mod.1"] = std::shared_ptr<GGMLBlock>(new Linear(dim, 6 * dim, true));
> This is the minimum precision change I could make to get this working in Vulkan. It's the img_mlp in QwenImageTransformerBlock that needs fp32 accumulation.
It worked 😀
(image: test4-vk-working)
SeanTater
commented
Sep 26, 2025
This is a great improvement! There's still a bug with terrible results if you don't include --diffusion-fa, but with it on I do get good results.