add Qwen Image support #851


Open
leejet wants to merge 12 commits into master from qwen_image

Conversation

Owner

@leejet leejet commented Sep 22, 2025

txt2img

.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\qwen-image-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors --qwen2vl ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p '一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。"' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 1024 -W 1024 --diffusion-fa --flow-shift 3
qwen_image_t2i

img2img

.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\qwen-image-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors --qwen2vl ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu --diffusion-fa --flow-shift 3 -i ..\assets\flux\flux1-dev-q8_0.png -p "a lovely cat"
qwen_image_i2i

WIP

  • Qwen Image Edit

@leejet leejet mentioned this pull request Sep 22, 2025

Thanks for adding this! I got it working on CPU on my machine, but as you would expect, it's quite slow.
I tried compiling with Vulkan, which compiles, but segfaults immediately as it starts the diffusion. Are you already working on that?

FWIW, Codex suggests changing ggml_vk_build_graph, which does get it to compute something, but the results are nonsense: I get garbled output that doesn't appear to depend on the prompt. It's the same with or without --diffusion-fa. With VAE tiling, I get a floating point exception.
When doing the VAE on the CPU instead, we have a different problem: we get a tiled field like this, the exact color of which varies.
vulkan-variation-01-seed1000

I suspect there's an as-yet-unimplemented op that's basically just being stubbed.


Where/how does it crash with Vulkan?

Contributor

wbruna commented Sep 24, 2025

Where/how does it crash with Vulkan?

Testing it here, I get:

$ ./sd --diffusion-model ./qwen-image-Q4_0.gguf --vae ./Qwen_Image-VAE.safetensors --qwen2vl ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf -p 一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。" --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 512 -W 512 --diffusion-fa --flow-shift 3
Option: 
 n_threads: 4
 mode: img_gen
 model_path: 
 wtype: unspecified
 clip_l_path: 
 clip_g_path: 
 clip_vision_path: 
 t5xxl_path: 
 qwen2vl_path: ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf
 diffusion_model_path: ./qwen-image-Q4_0.gguf
 high_noise_diffusion_model_path: 
 vae_path: ./Qwen_Image-VAE.safetensors
 taesd_path: 
 esrgan_path: 
 control_net_path: 
 embedding_dir: 
 photo_maker_path: 
 pm_id_images_dir: 
 pm_id_embed_path: 
 pm_style_strength: 20.00
 output_path: output.png
 init_image_path: 
 end_image_path: 
 mask_image_path: 
 control_image_path: 
 ref_images_paths:
 control_video_path: 
 increase_ref_index: false
 offload_params_to_cpu: true
 clip_on_cpu: false
 control_net_cpu: false
 vae_on_cpu: false
 diffusion flash attention: true
 diffusion Conv2d direct: false
 vae_conv_direct: false
 control_strength: 0.90
 prompt: 一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。"
 negative_prompt: 
 clip_skip: -1
 width: 512
 height: 512
 sample_params: (txt_cfg: 2.50, img_cfg: 2.50, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0)
 high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: -1, eta: 0.00, shifted_timestep: 0)
 moe_boundary: 0.875
 flow_shift: 3.00
 strength(img2img): 0.75
 rng: cuda
 seed: 42
 batch_count: 1
 vae_tiling: false
 upscale_repeats: 1
 chroma_use_dit_mask: true
 chroma_use_t5_mask: false
 chroma_t5_mask_pad: 1
 video_frames: 1
 vace_strength: 1.00
 fps: 16
System Info: 
 SSE3 = 1
 AVX = 1
 AVX2 = 1
 AVX512 = 0
 AVX512_VBMI = 0
 AVX512_VNNI = 0
 FMA = 1
 NEON = 0
 ARM_FMA = 0
 F16C = 1
 FP16_VA = 0
 WASM_SIMD = 0
 VSX = 0
[DEBUG] stable-diffusion.cpp:153 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] stable-diffusion.cpp:209 - loading diffusion model from './qwen-image-Q4_0.gguf'
[INFO ] model.cpp:1071 - load ./qwen-image-Q4_0.gguf using gguf format
[DEBUG] model.cpp:1088 - init from './qwen-image-Q4_0.gguf'
[INFO ] stable-diffusion.cpp:256 - loading qwen2vl from './Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] model.cpp:1071 - load ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf using gguf format
[DEBUG] model.cpp:1088 - init from './Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] stable-diffusion.cpp:263 - loading vae from './Qwen_Image-VAE.safetensors'
[INFO ] model.cpp:1074 - load ./Qwen_Image-VAE.safetensors using safetensors format
[DEBUG] model.cpp:1181 - init from './Qwen_Image-VAE.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:275 - Version: Qwen Image 
[INFO ] stable-diffusion.cpp:306 - Weight type: bf16
[INFO ] stable-diffusion.cpp:307 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:308 - Diffusion model weight type: bf16
[INFO ] stable-diffusion.cpp:309 - VAE weight type: NONE
[DEBUG] stable-diffusion.cpp:311 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:350 - Using flash attention in the diffusion model
[DEBUG] qwenvl.hpp:137 - merges size 151387
[DEBUG] qwenvl.hpp:159 - vocab size: 151665
[DEBUG] ggml_extend.hpp:1738 - qwenvl2.5 params backend buffer size = 3607.26 MB(RAM) (338 tensors)
[DEBUG] ggml_extend.hpp:1738 - qwen_image params backend buffer size = 11303.54 MB(RAM) (1933 tensors)
[DEBUG] ggml_extend.hpp:1738 - wan_vae params backend buffer size = 139.84 MB(RAM) (108 tensors)
[DEBUG] stable-diffusion.cpp:583 - loading weights
[DEBUG] model.cpp:2069 - loading tensors from ./qwen-image-Q4_0.gguf
 |=======================================> | 1933/2465 - 804.75it/s
[DEBUG] model.cpp:2069 - loading tensors from ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf
 |==============================================> | 2271/2465 - 222.34it/s
[DEBUG] model.cpp:2069 - loading tensors from ./Qwen_Image-VAE.safetensors
 |==============================================> | 2283/2465 - 223.49it/s[INFO ] model.cpp:2339 - unknown tensor 'first_stage_model.conv1.weight | bf16 | 4 [1, 1, 1, 1024, 1]' in model file
 |================================================> | 2393/2465 - 229.76it/s[INFO ] model.cpp:2339 - unknown tensor 'first_stage_model.conv1.bias | bf16 | 1 [32, 1, 1, 1, 1]' in model file
 |==================================================| 2465/2465 - 232.22it/s
[INFO ] model.cpp:2307 - loading tensors completed, taking 10.65s (process: 0.04s, read: 9.94s, memcpy: 0.00s, convert: 0.10s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:664 - total params memory size = 15050.64MB (VRAM 15050.64MB, RAM 0.00MB): text_encoders 3607.26MB(VRAM), diffusion_model 11303.55MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:726 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:750 - finished loaded file
[DEBUG] stable-diffusion.cpp:2328 - generate_image 512x512
[INFO ] stable-diffusion.cpp:2441 - TXT2IMG
init (f32): shape(64, 64, 16, 1)
[INFO ] stable-diffusion.cpp:899 - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:919 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:920 - prompt after extract and remove lora: "一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。""
[DEBUG] conditioner.hpp:1416 - parse '<|im_start|>system
Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>
<|im_start|>user
一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 "一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。"<|im_end|>
<|im_start|>assistant
' to [['<|im_start|>system
Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>
<|im_start|>user
一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔�
[INFO ] ggml_extend.hpp:1661 - qwenvl2.5 offload params (3607.26 MB, 338 tensors) to runtime backend (Vulkan0), taking 1.49s
[DEBUG] ggml_extend.hpp:1563 - qwenvl2.5 compute buffer size: 30.06 MB(VRAM)
Segmentation fault (core dumped)

gdb shows just this:

Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x000055555587569d in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool, bool) ()

I'll try on a debug build. @jeffbolznv , anything more specific I could check?

Owner Author

leejet commented Sep 24, 2025

@SeanTater @wbruna This is likely because GGML Vulkan doesn’t support im2col_3d. I’ve updated GGML, so you can pull the latest code and try again.


Contributor

wbruna commented Sep 24, 2025

@leejet, unfortunately a3a2b2d (with ggml 553c44706c) crashes too:

the last output lines
ggml_backend_vk_buffer_init_tensor(0x55555963f8f0 (0x555559b0ce40), 0x7ffbc551c020)
ggml_backend_vk_buffer_init_tensor(0x55555963f8f0 (0x555559b0ce40), 0x7ffbc551c1d0)
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc548e060, 0x555559602820, 0, 4)
ggml_vk_buffer_write(4)
ggml_vk_buffer_write_2d(4, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(4, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc54a24c0, 0x7ffbe003f3a0, 0, 160)
ggml_vk_buffer_write(160)
ggml_vk_buffer_write_2d(160, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(160, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_vk_command_pool_cleanup()
ggml_backend_vk_buffer_set_tensor(0x55555963f8f0, 0x7ffbc54a2670, 0x55555ab88040, 0, 640)
ggml_vk_buffer_write(640)
ggml_vk_buffer_write_2d(640, 1)
ggml_vk_create_temporary_context(0x55555a1ff900)
ggml_vk_ctx_begin(Vulkan1)
ggml_vk_create_cmd_buffer()
ggml_vk_buffer_write_2d_async(640, 1)
STAGING
ggml_vk_sync_buffers()
ggml_vk_ctx_end(0x55555a1ff900, 1)
ggml_vk_submit(0x55555a1ff900, 0x55555959c410)
ggml_vk_queue_command_pools_cleanup()
ggml_backend_vk_graph_compute(1154 nodes)
ggml_vk_build_graph(0x7ffbc54a2820, RESHAPE)
ggml_vk_build_graph(0x7ffbc54a29d0, RESHAPE)
ggml_vk_build_graph(0x7ffbc54a2b80, GET_ROWS)
ggml_pipeline_request_descriptor_sets(
Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x00007ffff7d54b24 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
 from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb)
GDB backtrace
(gdb) bt
#0 0x00007ffff7d54b24 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
 from /lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00005555558c25b2 in ggml_pipeline_request_descriptor_sets (ctx=0x5555594e62e0, pipeline=std::shared_ptr<vk_pipeline_struct> (empty) = {...}, n=1)
 at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:1653
#2 0x00005555559a38f3 in ggml_vk_build_graph (ctx=0x5555594e62e0, cgraph=0x7ffbc548e210, node_idx=2, node_begin=0x0, node_idx_begin=0, dryrun=true, 
 last_node=false, almost_ready=false, submit=false) at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:10650
#3 0x00005555559a9e1e in ggml_backend_vk_graph_compute (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
 at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:11743
#4 0x0000555555a79755 in ggml_backend_graph_compute_async (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
 at ./ggml/src/ggml-backend.cpp:359
#5 0x0000555555a796e5 in ggml_backend_graph_compute (backend=0x555559547bb0, cgraph=0x7ffbc548e210)
 at ./ggml/src/ggml-backend.cpp:352
#6 0x0000555555682db2 in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) (this=0x555559d6b550, 
 get_graph=..., n_threads=1, free_compute_buffer_immediately=true, output=0x7fffffffb3a8, output_ctx=0x55555942bf60)
 at ./ggml_extend.hpp:1824
#7 0x0000555555694938 in Qwen::Qwen2_5_VLRunner::compute (this=0x555559d6b550, n_threads=1, input_ids=0x7ffbe003f210, output=0x7fffffffb3a8, 
 output_ctx=0x55555942bf60) at ./qwenvl.hpp:603
#8 0x00005555556a7e18 in Qwen2_5_VLCLIPEmbedder::get_learned_condition_common (this=0x55555942b9b0, work_ctx=0x55555942bf60, n_threads=1, 
 token_and_weights=std::tuple containing = {...}, clip_skip=-1, zero_out_masked=false) at ./conditioner.hpp:1452
#9 0x00005555556a8287 in Qwen2_5_VLCLIPEmbedder::get_learned_condition (this=0x55555942b9b0, work_ctx=0x55555942bf60, n_threads=1, text="flower", 
 clip_skip=-1, width=512, height=512, adm_in_channels=768, zero_out_masked=false) at ./conditioner.hpp:1500
#10 0x000055555565f7d0 in generate_image_internal (sd_ctx=0x55555942a480, work_ctx=0x55555942bf60, init_latent=0x7ffbdffff060, prompt="flower", 
 negative_prompt="", clip_skip=-1, guidance=..., eta=0, shifted_timestep=0, width=512, height=512, sample_method=EULER, 
 sigmas=std::vector of length 21, capacity 32 = {...}, seed=42, batch_count=1, control_image=..., control_strength=0.899999976, pm_params=..., 
 ref_latents=std::vector of length 0, capacity 0, increase_ref_index=false, concat_latent=0x0, denoise_mask=0x0)
 at ./stable-diffusion.cpp:2086
#11 0x0000555555661c6b in generate_image (sd_ctx=0x55555942a480, sd_img_gen_params=0x7fffffffbe70)
 at ./stable-diffusion.cpp:2492
#12 0x00005555555add61 in main (argc=25, argv=0x7fffffffd828) at ./examples/cli/main.cpp:1392


What are the src and dst types for the GET_ROWS that crashes?

Contributor

wbruna commented Sep 24, 2025

(gdb) frame 2
#2 0x00005555559a38f3 in ggml_vk_build_graph (ctx=0x5555594e62e0, cgraph=0x7ffbc548e210, node_idx=2, node_begin=0x0, node_idx_begin=0, dryrun=true, 
 last_node=false, almost_ready=false, submit=false) at ./ggml/src/ggml-vulkan/ggml-vulkan.cpp:10650
10650 ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
(gdb) print src0->type
6ドル = GGML_TYPE_Q4_K
(gdb) print src1->type
7ドル = GGML_TYPE_I32
(gdb) print src2->type
Cannot access memory at address 0x0
(gdb) print node->op
8ドル = GGML_OP_GET_ROWS

Interesting... the model files are qwen-image-Q4_0.gguf and Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf.


Thanks. We're missing the K quants but I don't think there's any reason for this. I'll add it.



SeanTater commented Sep 25, 2025
edited
Loading

After applying the change from Jeff's llama.cpp PR to the ggml submodule in stable-diffusion.cpp, it does run, no crash. But I get garbled output, even though it does recognize the devices:

[DEBUG] stable-diffusion.cpp:153 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: Found 2 Vulkan devices:
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
[DEBUG] ggml_extend.hpp:62 - ggml_vulkan: 1 = Intel(R) Graphics (RPL-S) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none

.. and it swears it places them on VRAM ..

[INFO ] stable-diffusion.cpp:664 - total params memory size = 16634.26MB (VRAM 16634.26MB, RAM 0.00MB): text_encoders 4034.09MB(VRAM), diffusion_model 12460.33MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)

rocm-smi disagrees:

Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% 
 (DID, GUID) (Edge) (Avg) (Mem, Compute, ID) 
================================================================================================================
0 1 0x744c, 57282 37.0°C 6.0W N/A, N/A, 0 0Mhz 96Mhz 0% auto 327.0W 11% 0% 

It does finish in 17 seconds per step, as opposed to about 70 for successful CPU sampling, but I think that may be a red herring, since the output is garbage and the GPU is idle.

Contributor

wbruna commented Sep 25, 2025

After applying ggml-org/llama.cpp@9073a73 and ggml-org/llama.cpp#16235, I got a broken image too:

./sd --diffusion-model ./Qwen_Image_Distill-Q4_0.gguf --vae ./Qwen_Image-VAE-f16.gguf --qwen2vl ./Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf -p '(...)' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 512 -W 512 --diffusion-fa --steps 20

testqwen01

VAE tiling also crashes, with a Floating point exception (core dumped).


I'm seeing similar corruption. I'll try to debug it.

Owner Author

leejet commented Sep 25, 2025

I updated ggml to the latest commit and optimized the handling of embedding weights, so there’s no need to use k_quant’s get_rows. I’m not sure if this will fix the Vulkan issue.


I don't think it's related to get_rows. Setting GGML_VK_DISABLE_FUSION=1 seems to fix it. I'll continue to narrow it down.



Oops, I think I mixed up my experiments. I think it's forcing GGML_PREC_F32 for matrix-matrix multiplies that's fixing it. I don't know which multiplies, I just forced it for all of them.



dekstop commented Sep 25, 2025

Thanks for the quick work on a first implementation! Note this currently terminates with ggml_abort() on macOS (ARM) under Metal (compiled with -DSD_METAL=ON), both for the original pull request and the current version.

Last lines of output:

[INFO] stable-diffusion.cpp:2185 - generating image: 1/1 - seed 42
[ERROR] ggml_extend.hpp:71 - ggml_metal_encode_node: error: unsupported op 'MUL_MAT'
.../stable-diffusion.cpp/ggml/src/ggml-metal/ggml-metal.m: 2068: unsupported op

Command:

sd --diffusion-model Qwen_Image-Q4_0.gguf --vae Qwen_Image-VAE.safetensors --qwen2vl Qwen2.5-VL-7B-Instruct-Q4_0.gguf -p "(my test prompt)" --steps 2 --width 640 --height 480

Stack trace from crash log:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 	 0x1844e2388 __pthread_kill + 8
1 libsystem_pthread.dylib 	 0x18451b88c pthread_kill + 296
2 libsystem_c.dylib 	 0x184424a3c abort + 124
3 sd 	 0x10464d144 ggml_abort + 160
4 sd 	 0x10464a758 ggml_metal_encode_node + 27288
5 sd 	 0x104643c2c __ggml_backend_metal_set_n_cb_block_invoke + 596
6 sd 	 0x1046436c4 ggml_backend_metal_graph_compute + 368
7 sd 	 0x104663098 ggml_backend_graph_compute + 32
8 sd 	 0x1045179c4 GGMLRunner::compute(std::__1::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) + 648
9 sd 	 0x10454d498 QwenImageModel::compute(int, DiffusionParams, ggml_tensor**, ggml_context*) + 152

Contributor

wbruna commented Sep 25, 2025

I updated ggml to the latest commit and optimized the handling of embedding weights, so there’s no need to use k_quant’s get_rows. I’m not sure if this will fix the Vulkan issue.

Unfortunately, I get similar bad results with 94f4f29.

ROCm fails, too (fully black images).


This is the minimum precision change I could make to get this working in Vulkan. It's the img_mlp in QwenImageTransformerBlock that needs fp32 accumulation.

diff --git a/common.hpp b/common.hpp
index 9c8aba1..1e01825 100644
--- a/common.hpp
+++ b/common.hpp
@@ -242,7 +242,8 @@ public:
 FeedForward(int64_t dim,
 int64_t dim_out,
 int64_t mult = 4,
- Activation activation = Activation::GEGLU) {
+ Activation activation = Activation::GEGLU,
+ bool force_prec_f32 = false) {
 int64_t inner_dim = dim * mult;
 
 if (activation == Activation::GELU) {
@@ -252,7 +253,7 @@ public:
 }
 
 // net_1 is nn.Dropout(), skip for inference
- blocks["net.2"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim_out));
+ blocks["net.2"] = std::shared_ptr<GGMLBlock>(new Linear(inner_dim, dim_out, true, false, force_prec_f32));
 }
 
 struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
diff --git a/ggml_extend.hpp b/ggml_extend.hpp
index 965b979..e3b4926 100644
--- a/ggml_extend.hpp
+++ b/ggml_extend.hpp
@@ -933,8 +933,12 @@ __STATIC_INLINE__ struct ggml_tensor* ggml_group_norm_32(struct ggml_context* ct
 __STATIC_INLINE__ struct ggml_tensor* ggml_nn_linear(struct ggml_context* ctx,
 struct ggml_tensor* x,
 struct ggml_tensor* w,
- struct ggml_tensor* b) {
+ struct ggml_tensor* b,
+ bool force_prec_f32 = false) {
 x = ggml_mul_mat(ctx, w, x);
+ if (force_prec_f32) {
+ ggml_mul_mat_set_prec(x, GGML_PREC_F32);
+ }
 if (b != NULL) {
 x = ggml_add_inplace(ctx, x, b);
 }
@@ -1947,6 +1951,7 @@ protected:
 int64_t out_features;
 bool bias;
 bool force_f32;
+ bool force_prec_f32;
 
 void init_params(struct ggml_context* ctx, const String2GGMLType& tensor_types = {}, const std::string prefix = "") {
 enum ggml_type wtype = get_type(prefix + "weight", tensor_types, GGML_TYPE_F32);
@@ -1964,11 +1969,13 @@ public:
 Linear(int64_t in_features,
 int64_t out_features,
 bool bias = true,
- bool force_f32 = false)
+ bool force_f32 = false,
+ bool force_prec_f32 = false)
 : in_features(in_features),
 out_features(out_features),
 bias(bias),
- force_f32(force_f32) {}
+ force_f32(force_f32),
+ force_prec_f32(force_prec_f32) {}
 
 struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
 struct ggml_tensor* w = params["weight"];
@@ -1976,7 +1983,7 @@ public:
 if (bias) {
 b = params["bias"];
 }
- return ggml_nn_linear(ctx, x, w, b);
+ return ggml_nn_linear(ctx, x, w, b, force_prec_f32);
 }
 };
 
diff --git a/qwen_image.hpp b/qwen_image.hpp
index 2f5dad8..ab16b82 100644
--- a/qwen_image.hpp
+++ b/qwen_image.hpp
@@ -196,7 +196,7 @@ namespace Qwen {
 
 blocks["img_norm1"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim, eps, false));
 blocks["img_norm2"] = std::shared_ptr<GGMLBlock>(new LayerNorm(dim, eps, false));
- blocks["img_mlp"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim, 4, FeedForward::Activation::GELU));
+ blocks["img_mlp"] = std::shared_ptr<GGMLBlock>(new FeedForward(dim, dim, 4, FeedForward::Activation::GELU, true));
 
 // txt_mod.0 is nn.SiLU()
 blocks["txt_mod.1"] = std::shared_ptr<GGMLBlock>(new Linear(dim, 6 * dim, true));

Contributor

wbruna commented Sep 25, 2025

This is the minimum precision change I could make to get this working in Vulkan. It's the img_mlp in QwenImageTransformerBlock that needs fp32 accumulation.

It worked 😀

test4-vk-working


This is a great improvement! There's still a bug that produces terrible results if you don't include --diffusion-fa, but with it on I do get good results.
