add Qwen Image support #851
Conversation
SeanTater
commented
Sep 24, 2025
Thanks for adding this! I got it working on CPU on my machine, but as you would expect, it's quite slow.
I tried compiling with Vulkan, which compiles, but segfaults immediately as it starts the diffusion. Are you already working on that?
FWIW, Codex suggests changing ggml_vk_build_graph, which does get it to compute something - but it produces nonsense. I get garbled output that doesn't appear to depend on the prompt. It's the same with or without --diffusion-fa. With VAE tiling, I get a floating point exception.
2025-09-23T12:54:42-04:00
When running the VAE on the CPU instead, we have a different problem: we get a tiled field like this, whose exact color varies.
(image: vulkan-variation-01-seed1000)
I suspect there may be an as-yet-unimplemented op that is being silently stubbed out.
jeffbolznv
commented
Sep 24, 2025
Where/how does it crash with Vulkan?
> Where/how does it crash with Vulkan?
Testing it here, I get:
$ ./sd --diffusion-model ./qwen-image-Q4_0.gguf --vae ./Qwen_Image-VAE.safetensors --qwen2vl ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf -p 一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。" --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 512 -W 512 --diffusion-fa --flow-shift 3
Option:
n_threads: 4
mode: img_gen
model_path:
wtype: unspecified
clip_l_path:
clip_g_path:
clip_vision_path:
t5xxl_path:
qwen2vl_path: ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf
diffusion_model_path: ./qwen-image-Q4_0.gguf
high_noise_diffusion_model_path:
vae_path: ./Qwen_Image-VAE.safetensors
taesd_path:
esrgan_path:
control_net_path:
embedding_dir:
photo_maker_path:
pm_id_images_dir:
pm_id_embed_path:
pm_style_strength: 20.00
output_path: output.png
init_image_path:
end_image_path:
mask_image_path:
control_image_path:
ref_images_paths:
control_video_path:
increase_ref_index: false
offload_params_to_cpu: true
clip_on_cpu: false
control_net_cpu: false
vae_on_cpu: false
diffusion flash attention: true
diffusion Conv2d direct: false
vae_conv_direct: false
control_strength: 0.90
prompt: 一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。"
negative_prompt:
clip_skip: -1
width: 512
height: 512
sample_params: (txt_cfg: 2.50, img_cfg: 2.50, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0)
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: -1, eta: 0.00, shifted_timestep: 0)
moe_boundary: 0.875
flow_shift: 3.00
strength(img2img): 0.75
rng: cuda
seed: 42
batch_count: 1
vae_tiling: false
upscale_repeats: 1
chroma_use_dit_mask: true
chroma_use_t5_mask: false
chroma_t5_mask_pad: 1
video_frames: 1
vace_strength: 1.00
fps: 16
System Info:
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:153 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] stable-diffusion.cpp:209 - loading diffusion model from './qwen-image-Q4_0.gguf'
[INFO ] model.cpp:1071 - load ./qwen-image-Q4_0.gguf using gguf format
[DEBUG] model.cpp:1088 - init from './qwen-image-Q4_0.gguf'
[INFO ] stable-diffusion.cpp:256 - loading qwen2vl from './Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] model.cpp:1071 - load ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf using gguf format
[DEBUG] model.cpp:1088 - init from './Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] stable-diffusion.cpp:263 - loading vae from './Qwen_Image-VAE.safetensors'
[INFO ] model.cpp:1074 - load ./Qwen_Image-VAE.safetensors using safetensors format
[DEBUG] model.cpp:1181 - init from './Qwen_Image-VAE.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:275 - Version: Qwen Image
[INFO ] stable-diffusion.cpp:306 - Weight type: bf16
[INFO ] stable-diffusion.cpp:307 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:308 - Diffusion model weight type: bf16
[INFO ] stable-diffusion.cpp:309 - VAE weight type: NONE
[DEBUG] stable-diffusion.cpp:311 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:350 - Using flash attention in the diffusion model
[DEBUG] qwenvl.hpp:137 - merges size 151387
[DEBUG] qwenvl.hpp:159 - vocab size: 151665
[DEBUG] ggml_extend.hpp:1738 - qwenvl2.5 params backend buffer size = 3607.26 MB(RAM) (338 tensors)
[DEBUG] ggml_extend.hpp:1738 - qwen_image params backend buffer size = 11303.54 MB(RAM) (1933 tensors)
[DEBUG] ggml_extend.hpp:1738 - wan_vae params backend buffer size = 139.84 MB(RAM) (108 tensors)
[DEBUG] stable-diffusion.cpp:583 - loading weights
[DEBUG] model.cpp:2069 - loading tensors from ./qwen-image-Q4_0.gguf
|=======================================> | 1933/2465 - 804.75it/s
[DEBUG] model.cpp:2069 - loading tensors from ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf
|==============================================> | 2271/2465 - 222.34it/s
[DEBUG] model.cpp:2069 - loading tensors from ./Qwen_Image-VAE.safetensors
|==============================================> | 2283/2465 - 223.49it/s[INFO ] model.cpp:2339 - unknown tensor 'first_stage_model.conv1.weight | bf16 | 4 [1, 1, 1, 1024, 1]' in model file
|================================================> | 2393/2465 - 229.76it/s[INFO ] model.cpp:2339 - unknown tensor 'first_stage_model.conv1.bias | bf16 | 1 [32, 1, 1, 1, 1]' in model file
|==================================================| 2465/2465 - 232.22it/s
[INFO ] model.cpp:2307 - loading tensors completed, taking 10.65s (process: 0.04s, read: 9.94s, memcpy: 0.00s, convert: 0.10s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:664 - total params memory size = 15050.64MB (VRAM 15050.64MB, RAM 0.00MB): text_encoders 3607.26MB(VRAM), diffusion_model 11303.55MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:726 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:750 - finished loaded file
[DEBUG] stable-diffusion.cpp:2328 - generate_image 512x512
[INFO ] stable-diffusion.cpp:2441 - TXT2IMG
init (f32): shape(64, 64, 16, 1)
[INFO ] stable-diffusion.cpp:899 - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:919 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:920 - prompt after extract and remove lora: "一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。""
[DEBUG] conditioner.hpp:1416 - parse '<|im_start|>system
Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>
<|im_start|>user
一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。"<|im_end|>
<|im_start|>assistant
' to [['<|im_start|>system
Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>
<|im_start|>user
一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔�
[INFO ] ggml_extend.hpp:1661 - qwenvl2.5 offload params (3607.26 MB, 338 tensors) to runtime backend (Vulkan0), taking 1.49s
[DEBUG] ggml_extend.hpp:1563 - qwenvl2.5 compute buffer size: 30.06 MB(VRAM)
Segmentation fault (core dumped)
gdb shows just this:
Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x000055555587569d in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool, bool) ()
I'll try on a debug build. @jeffbolznv , anything more specific I could check?
@SeanTater @wbruna This is likely because GGML Vulkan doesn’t support im2col_3d. I’ve updated GGML, so you can pull the latest code and try again.
@leejet , unfortunately a3a2b2d (with ggml 553c44706c) crashes too:
The last output lines:
ggml_backend_vk_buffer_init_tensor(0x55555963f8f0 (0x555559b0ce40), 0x7ffbc551c020)
ggml_backend_vk_buffer_init_tensor(0x55555963f8f0 (0x555559b0ce40), 0x7ffbc551c1d0)
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc548e060, 0x555559602820, 0, 4)
ggml_vk_buffer_write(4)
ggml_vk_buffer_write_2d(4, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(4, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc54a24c0, 0x7ffbe003f3a0, 0, 160)
ggml_vk_buffer_write(160)
ggml_vk_buffer_write_2d(160, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(160, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_vk_command_pool_cleanup()
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc54a2670, 0x55555ab88040, 0, 640)
ggml_vk_buffer_write(640)
ggml_vk_buffer_write_2d(640, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(640, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_backend_vk_graph_compute(1154 nodes)
ggml_vk_build_graph(0x7ffbc54a2820, RESHAPE)
ggml_vk_build_graph(0x7ffbc54a29d0, RESHAPE)
ggml_vk_build_graph(0x7ffbc54a2b80, GET_ROWS)
ggml_pipeline_request_descriptor_sets(
Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x00007ffff7d54b24 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb)
GDB backtrace
(gdb) bt
#0 0x00007ffff7d54b24 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00005555558c25b2 in ggml_pipeline_request_descriptor_sets (ctx=0x5555594e62e0, pipeline=std::shared_ptr<vk_pipeline_struct> (empty) = {...}, n=1)
at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:1653
#2 0x00005555559a38f3 in ggml_vk_build_graph (ctx=0x5555594e62e0, cgraph=0x7ffbc548e210, node_idx=2, node_begin=0x0, node_idx_begin=0, dryrun=true,
last_node=false, almost_ready=false, submit=false) at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:10650
#3 0x00005555559a9e1e in ggml_backend_vk_graph_compute (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:11743
#4 0x0000555555a79755 in ggml_backend_graph_compute_async (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
at ./ggml/src/ggml-backend.cpp:359
#5 0x0000555555a796e5 in ggml_backend_graph_compute (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
at ./ggml/src/ggml-backend.cpp:352
#6 0x0000555555682db2 in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) (this=0x555559d6b550,
get_graph=..., n_threads=1, free_compute_buffer_immediately=true, output=0x7fffffffb3a8, output_ctx=0x55555942bf60)
at ./ggml_extend.hpp:1824
#7 0x0000555555694938 in Qwen::Qwen2_5_VLRunner::compute (this=0x555559d6b550, n_threads=1, input_ids=0x7ffbe003f210, output=0x7fffffffb3a8,
output_ctx=0x55555942bf60) at ./qwenvl.hpp:603
#8 0x00005555556a7e18 in Qwen2_5_VLCLIPEmbedder::get_learned_condition_common (this=0x55555942b9b0, work_ctx=0x55555942bf60, n_threads=1,
token_and_weights=std::tuple containing = {...}, clip_skip=-1, zero_out_masked=false) at ./conditioner.hpp:1452
#9 0x00005555556a8287 in Qwen2_5_VLCLIPEmbedder::get_learned_condition (this=0x55555942b9b0, work_ctx=0x55555942bf60, n_threads=1, text="flower",
clip_skip=-1, width=512, height=512, adm_in_channels=768, zero_out_masked=false) at ./conditioner.hpp:1500
#10 0x000055555565f7d0 in generate_image_internal (sd_ctx=0x55555942a480, work_ctx=0x55555942bf60, init_latent=0x7ffbdffff060, prompt="flower",
negative_prompt="", clip_skip=-1, guidance=..., eta=0, shifted_timestep=0, width=512, height=512, sample_method=EULER,
sigmas=std::vector of length 21, capacity 32 = {...}, seed=42, batch_count=1, control_image=..., control_strength=0.899999976, pm_params=...,
ref_latents=std::vector of length 0, capacity 0, increase_ref_index=false, concat_latent=0x0, denoise_mask=0x0)
at ./stable-diffusion.cpp:2086
#11 0x0000555555661c6b in generate_image (sd_ctx=0x55555942a480, sd_img_gen_params=0x7fffffffbe70)
at ./stable-diffusion.cpp:2492
#12 0x00005555555add61 in main (argc=25, argv=0x7fffffffd828) at ./examples/cli/main.cpp:1392
jeffbolznv
commented
Sep 24, 2025
What are the src and dst types for the GET_ROWS that crashes?
(gdb) frame 2
#2 0x00005555559a38f3 in ggml_vk_build_graph (ctx=0x5555594e62e0, cgraph=0x7ffbc548e210, node_idx=2, node_begin=0x0, node_idx_begin=0, dryrun=true,
last_node=false, almost_ready=false, submit=false) at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:10650
10650 ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
(gdb) print src0->type
6ドル = GGML_TYPE_Q4_K
(gdb) print src1->type
7ドル = GGML_TYPE_I32
(gdb) print src2->type
Cannot access memory at address 0x0
(gdb) print node->op
8ドル = GGML_OP_GET_ROWS
Interesting... the model files are qwen-image-Q4_0.gguf and Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf.
jeffbolznv
commented
Sep 24, 2025
Thanks. We're missing the K quants, but I don't think there's any reason for that. I'll add them.
jeffbolznv
commented
Sep 24, 2025
Please try ggml-org/llama.cpp#16235.
After applying the change from Jeff's PR in llama.cpp to the ggml submodule in stable-diffusion.cpp, it does run without crashing. But I get garbled output, even though it does recognize the devices:
[DEBUG] stable-diffusion.cpp:153 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: Found 2 Vulkan devices:
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 1 = Intel(R) Graphics (RPL-S) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
... and it claims it places them in VRAM ...
[INFO ] stable-diffusion.cpp:664 - total params memory size = 16634.26MB (VRAM 16634.26MB, RAM 0.00MB): text_encoders 4034.09MB(VRAM), diffusion_model 12460.33MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
rocm-smi disagrees:
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
================================================================================================================
0 1 0x744c, 57282 37.0°C 6.0W N/A, N/A, 0 0Mhz 96Mhz 0% auto 327.0W 11% 0%
It does finish in 17 seconds per step, as opposed to about 70 for successful CPU sampling, but I think that may be a red herring, since the output is garbage and the GPU is idle.
After applying ggml-org/llama.cpp@9073a73 and ggml-org/llama.cpp#16235, I got a broken image too:
(image: testqwen01)
./sd --diffusion-model ./Qwen_Image_Distill-Q4_0.gguf --vae ./Qwen_Image-VAE-f16.gguf --qwen2vl ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf -p '(...)' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 512 -W 512 --diffusion-fa --steps 20
VAE tiling also crashes, with a floating point exception (core dumped).
jeffbolznv
commented
Sep 25, 2025
I'm seeing similar corruption. I'll try to debug it.
I updated ggml to the latest commit and optimized the handling of embedding weights, so there’s no need to use k_quant’s get_rows. I’m not sure if this will fix the Vulkan issue.
jeffbolznv
commented
Sep 25, 2025
I don't think it's related to get_rows. Setting GGML_VK_DISABLE_FUSION=1 seems to fix it. I'll continue to narrow it down.
jeffbolznv
commented
Sep 25, 2025
Oops, I think I mixed up my experiments. I think it's forcing GGML_PREC_F32 for matrix-matrix multiplies that's fixing it. I don't know which multiplies, I just forced it for all of them.
Thanks for the quick work on a first implementation! Note this currently terminates with ggml_abort() on macOS (ARM) under Metal (compiled with -DSD_METAL=ON), both for the original pull request and the current version.
Last lines of output:
[INFO] stable-diffusion.cpp:2185 - generating image: 1/1 - seed 42
[ERROR] ggml_extend.hpp:71 - ggml_metal_encode_node: error: unsupported op 'MUL_MAT'
.../stable-diffusion.cpp/ggml/src/ggml-metal/ggml-metal.m: 2068: unsupported op
Command:
sd --diffusion-model Qwen_Image-Q4_0.gguf --vae Qwen_Image-VAE.safetensors --qwen2vl Qwen2.5-VL-7B-Instruct-Q4_0.gguf -p "(my test prompt)" --steps 2 --width 640 --height 480
Stack trace from crash log:
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x1844e2388 __pthread_kill + 8
1 libsystem_pthread.dylib 0x18451b88c pthread_kill + 296
2 libsystem_c.dylib 0x184424a3c abort + 124
3 sd 0x10464d144 ggml_abort + 160
4 sd 0x10464a758 ggml_metal_encode_node + 27288
5 sd 0x104643c2c __ggml_backend_metal_set_n_cb_block_invoke + 596
6 sd 0x1046436c4 ggml_backend_metal_graph_compute + 368
7 sd 0x104663098 ggml_backend_graph_compute + 32
8 sd 0x1045179c4 GGMLRunner::compute(std::__1::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) + 648
9 sd 0x10454d498 QwenImageModel::compute(int, DiffusionParams, ggml_tensor**, ggml_context*) + 152
> I updated ggml to the latest commit and optimized the handling of embedding weights, so there’s no need to use k_quant’s get_rows. I’m not sure if this will fix the Vulkan issue.
Unfortunately, I get similar bad results with 94f4f29.
ROCm fails, too (fully black images).
jeffbolznv
commented
Sep 25, 2025
This is the minimum precision change I could make to get this working in Vulkan. It's the img_mlp in QwenImageTransformerBlock that needs fp32 accumulation.
diff --git a/common.hpp b/common.hpp
index 9c8aba1..1e01825 100644
--- a/common.hpp
+++ b/common.hpp
@@ -242,7 +242,8 @@ public:
FeedForward(int64_t dim,
int64_t dim_out,
int64_t mult = 4,
- Activation activation = Activation::GEGLU) {
+ Activation activation = Activation::GEGLU,
+ bool force_prec_f32 = false) {
int64_t inner_dim = dim * mult;
if (activation == Activation::GELU) {
@@ -252,7 +253,7 @@ public:
}
// net_1 is nn.Dropout(), skip for inference
- blocks["net.2"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim_out));
+ blocks["net.2"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim_out, true, false, force_prec_f32));
}
struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
diff --git a/ggml_extend.hpp b/ggml_extend.hpp
index 965b979..e3b4926 100644
--- a/ggml_extend.hpp
+++ b/ggml_extend.hpp
@@ -933,8 +933,12 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_group_norm_32(struct ggml_context* ct
__STATIC_INLINE__ struct ggml_tensor* ggml_nn_linear(struct ggml_context* ctx,
struct ggml_tensor* x,
struct ggml_tensor* w,
- struct ggml_tensor* b) {
+ struct ggml_tensor* b,
+ bool force_prec_f32 = false) {
x = ggml_mul_mat(ctx, w, x);
+ if (force_prec_f32) {
+ ggml_mul_mat_set_prec(x, GGML_PREC_F32);
+ }
if (b != NULL) {
x = ggml_add_inplace(ctx, x, b);
}
@@ -1947,6 +1951,7 @@ protected:
int64_t out_features;
bool bias;
bool force_f32;
+ bool force_prec_f32;
void init_params(struct ggml_context* ctx, const String2GGMLType& tensor_types = {}, const std::string prefix = "") {
enum ggml_type wtype = get_type(prefix + "weight", tensor_types, GGML_TYPE_F32);
@@ -1964,11 +1969,13 @@ public:
Linear(int64_t in_features,
int64_t out_features,
bool bias = true,
- bool force_f32 = false)
+ bool force_f32 = false,
+ bool force_prec_f32 = false)
: in_features(in_features),
out_features(out_features),
bias(bias),
- force_f32(force_f32) {}
+ force_f32(force_f32),
+ force_prec_f32(force_prec_f32) {}
struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
struct ggml_tensor* w = params["weight"];
@@ -1976,7 +1983,7 @@ public:
if (bias) {
b = params["bias"];
}
- return ggml_nn_linear(ctx, x, w, b);
+ return ggml_nn_linear(ctx, x, w, b, force_prec_f32);
}
};
diff --git a/qwen_image.hpp b/qwen_image.hpp
index 2f5dad8..ab16b82 100644
--- a/qwen_image.hpp
+++ b/qwen_image.hpp
@@ -196,7 +196,7 @@ namespace Qwen {
blocks["img_norm1"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim, eps, false));
blocks["img_norm2"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim, eps, false));
- blocks["img_mlp"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim, 4, FeedForward::Activation::GELU));
+ blocks["img_mlp"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim, 4, FeedForward::Activation::GELU, true));
// txt_mod.0 is nn.SiLU()
blocks["txt_mod.1"] = std::shared_ptr<GGMLBlock>(new Linear(dim, 6 * dim, true));
> This is the minimum precision change I could make to get this working in Vulkan. It's the img_mlp in QwenImageTransformerBlock that needs fp32 accumulation.
It worked 😀
(image: test4-vk-working)
SeanTater
commented
Sep 26, 2025
This is a great improvement! There's still a bug with terrible results if you don't include --diffusion-fa, but with it on I do get good results.