Qwen image edit doesn't always work on AMD HIP/ROCm #133
-
Hi! I'm a huge fan of this project; it's helped me with video generation on my RX 9070 XT. Recently, I created a workflow for Qwen image editing and wanted to use DistorchLoader to load the GGUF. The problem is that when it comes to processing the prompt, it loads the CLIP but also allocates memory for the model, because Power LoRA takes both the model and the CLIP as input. This behaviour completely breaks the workflow: instead of reusing the memory it just allocated, it tries to reload the model. Ultimately, at the KSampler, it just hangs indefinitely.
-
I recently reverted a few changes that were causing some memory issues.
Can you please try again? If you get the same results, post a workflow and I will attempt to replicate your issue.
Cheers!
-
My workflow:
Test.json
Settings:
pytorch version: 2.9.0.dev20250908+rocm6.4
AMD arch: gfx1201
ROCm version: (6, 4)
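For reference, a minimal sketch of how these settings can be queried from a ROCm build of PyTorch (assumes a GPU is visible; the `gcnArchName` attribute may not exist on every build, hence the fallback):
```python
# Sketch: print the PyTorch/ROCm details listed above (assumes a ROCm build of PyTorch).
import torch

print("pytorch version:", torch.__version__)      # e.g. 2.9.0.dev20250908+rocm6.4
print("HIP/ROCm version:", torch.version.hip)     # None on CUDA-only builds
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("AMD arch:", getattr(props, "gcnArchName", "n/a"))  # e.g. gfx1201
    print("total VRAM (GB):", round(props.total_memory / 1024**3, 2))
```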
The memory situation when it doesn't work: [screenshot]
The memory situation when it works: [screenshot]
I tried the update, but unfortunately it didn't change anything. I can tell that the GPU actually crashes.
-
I think this is relevant, from the Power LoRA:
[MultiGPU Initialization] current_device set to: cuda:0
[MultiGPU_DisTorch2] Successfully patched ModelPatcher.partially_load
gguf qtypes: F32 (1087), BF16 (6), Q5_K (28), Q4_K (580), Q6_K (232)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
[MultiGPU_DisTorch2] Settings changed for model 9613af3a. Previous settings hash: None, New settings hash: 0c1d17df681ee861c5929ebf28e73a7361eec3e5dd27031928de093b6cc63b23. Forcing reload.
gguf qtypes: F32 (1087), BF16 (6), Q5_K (28), Q4_K (580), Q6_K (232)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
[MultiGPU_DisTorch2] Full allocation string: #cuda:0;15.0;cpu
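The "Settings changed for model ... Forcing reload" line above implies a hash-compare-and-reload pattern. Purely as an illustration of that pattern (this is not MultiGPU's actual code; all names here are invented):
```python
# Hypothetical sketch of the reload check implied by the log line above.
# NOT ComfyUI-MultiGPU's actual implementation; names are invented for illustration.
import hashlib, json

stored_hashes = {}  # model_id -> last-seen settings hash

def needs_reload(model_id: str, settings: dict) -> bool:
    """Hash the current settings and force a reload when they differ from the stored hash."""
    new_hash = hashlib.sha256(json.dumps(settings, sort_keys=True).encode()).hexdigest()
    old_hash = stored_hashes.get(model_id)   # None on the first load, as in the log
    stored_hashes[model_id] = new_hash
    return old_hash != new_hash              # missing or differing hash -> force reload

print(needs_reload("9613af3a", {"allocation": "#cuda:0;15.0;cpu"}))  # True (no previous hash)
```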
And later, at the KSampler:
[MultiGPU_DisTorch2] Using static allocation for model 9613af3a
[MultiGPU_DisTorch2] Compute Device: cuda:0
[MultiGPU_DisTorch2] Expert String Examples:
Direct(byte) Mode - cuda:0,500mb;cuda:1,3.0g;cpu,5gb* -> '*' cpu = over/underflow device, put 0.50gb on cuda0, 3.00gb on cuda1, and 5.00gb (or the rest) on cpu
Ratio(%) Mode - cuda:0,8%;cuda:1,8%;cpu,4% -> 8:8:4 ratio, put 40% on cuda0, 40% on cuda1, and 20% on cpu
===============================================
DisTorch2 Model Virtual VRAM Analysis
===============================================
Object Role Original(GB) Total(GB) Virt(GB)
-----------------------------------------------
cuda:0 recip 15.92GB 30.92GB +15.00GB
cpu donor 30.95GB 15.95GB -15.00GB
-----------------------------------------------
model model 12.17GB 0.00GB -15.00GB
[MultiGPU_DisTorch2] Final Allocation String: cuda:0,0.0000;cpu,0.4847
==================================================
DisTorch2 Model Device Allocations
==================================================
Device VRAM GB Dev % Model GB Dist %
--------------------------------------------------
cuda:0 15.92 0.0% 0.00 0.0%
cpu 30.95 48.5% 15.00 100.0%
--------------------------------------------------
DisTorch2 Model Layer Distribution
--------------------------------------------------
Layer Type Layers Memory (MB) % Total
--------------------------------------------------
Linear 846 12568.20 100.0%
RMSNorm 241 0.07 0.0%
LayerNorm 241 0.00 0.0%
--------------------------------------------------
DisTorch2 Model Final Device/Layer Assignments
--------------------------------------------------
Device Layers Memory (MB) % Total
--------------------------------------------------
cuda:0 (<0.01%) 484 0.83 0.0%
cpu 844 12567.43 100.0%
--------------------------------------------------
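The "Expert String Examples" near the top of that log describe the two allocation modes (direct bytes vs. ratio). For illustration only, a hypothetical parser for strings in that format; it is not the extension's actual implementation, and the unit handling is an assumption based on the examples shown:
```python
# Hypothetical parser for the expert allocation string format shown in the log above.
# For illustration only; NOT ComfyUI-MultiGPU's actual code.
import re

def parse_expert_string(expert: str):
    """Return (device, value, unit, is_overflow) tuples from e.g. 'cuda:0,500mb;cpu,5gb*'."""
    entries = []
    for part in expert.split(";"):
        part = part.strip()
        if not part:
            continue
        overflow = part.endswith("*")            # '*' marks the over/underflow device
        device, amount = part.rstrip("*").split(",")
        m = re.fullmatch(r"([\d.]+)\s*(%|mb|gb|g)?", amount.strip().lower())
        value, unit = float(m.group(1)), (m.group(2) or "gb")
        entries.append((device.strip(), value, unit, overflow))
    return entries

print(parse_expert_string("cuda:0,500mb;cuda:1,3.0g;cpu,5gb*"))
# [('cuda:0', 500.0, 'mb', False), ('cuda:1', 3.0, 'g', False), ('cpu', 5.0, 'gb', True)]
```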
-
Hey, @wasd-tech - I am @italbar over on Discord if you are on that platform and are interested in likely more timely interactions.
Still trying to get your workflow going. The CLIP .gguf I am using is giving the workflow fits. I'll post once I can get a standard generation to run.
-
Of course I can join a Discord; just leave the link.
"The CLIP .gguf I am using is giving the workflow fits"
You need another file to use qwen2.5vl with the GGUF format: https://huggingface.co/QuantStack/Qwen-Image-Edit-GGUF
I'm afraid this is a problem exclusively related to AMD. If you reach that conclusion, don't hesitate to ask me to run tests or anything else.
-
Hey, @wasd-tech, I know you have seen some variable results on your end. I am happy to keep this issue open if there is something we can work on that looks like a MultiGPU-specific issue. 😸
-
@pollockjj thanks a lot for the help 👍. After some testing, it seems the problem is Qwen image edit and not the Power LoRA, so I changed the title of the issue.
-
Since this seems to be a problem specific to my AMD GPU (ROCm fails to allocate and use that much memory; it only works with a virtual VRAM of about 10 GB), I suggest leaving this open for future reference so people won't open multiple issues about the same thing. I will keep testing and see if anything changes, maybe with the major release of ROCm 7.0 or with different versions of PyTorch.
@pollockjj Are you okay with this?
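As a rough diagnostic for that kind of testing (a sketch, assuming a ROCm build of PyTorch; the sizes below are arbitrary and should be adjusted for the card), one can probe how large a single GPU allocation succeeds before HIP fails:
```python
# Diagnostic sketch: probe how large a single allocation ROCm/HIP will accept.
# Assumes a ROCm build of PyTorch; the sizes below are arbitrary examples.
import torch

device = torch.device("cuda:0")
for gb in (2, 4, 8, 10, 12, 14, 15):
    try:
        buf = torch.empty(int(gb * 1024**3), dtype=torch.uint8, device=device)
        print(f"{gb} GB allocation: OK")
        del buf
        torch.cuda.empty_cache()
    except RuntimeError as err:
        print(f"{gb} GB allocation: FAILED ({err})")
        break
```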
-
Migrating this to a discussion - that way it remains an open, ongoing resource and can stay open indefinitely.
Hope that works, @wasd-tech.