Open
@artur-ag
Description
I compiled ggml with -DGGML_CUBLAS=ON, built clip.cpp against it, and used it to compute text encodings, but the GPU is not being used: the code takes the same amount of time as the CPU-only build. Is this expected? Does clip_text_encode always run on the CPU no matter what, or did I forget a step?
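For reference, the build followed the standard ggml CMake flow (directory names here are illustrative; `-DGGML_CUBLAS=ON` is the only flag I added):

```shell
# Configure ggml with cuBLAS enabled, then build in Release mode.
# Build-directory name is illustrative; adjust to your checkout.
cmake -B build -DGGML_CUBLAS=ON
cmake --build build --config Release
```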
Details:
ggml is detecting the GPU without problem (Nvidia AGX Orin):
```
$ ./myapp
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7
```
Simplified version of my code:
```cpp
#include "clip.h"

// ...
std::string model = "clip-vit-base-patch32_ggml-text-model-f16.gguf";
clip_ctx *ctx = clip_model_load(model.c_str(), verbosity);

for (int i = 0; i < 1000; i++) {
    clip_tokenize(ctx, "person", &tokens);  // string literal is already a const char*
    float txt_vec[512];
    clip_text_encode(ctx, /*threads:*/ 4, &tokens, txt_vec, true);
}
```
This takes 8 seconds to finish. While it runs I have jtop open, and the GPU is only active during the first ~3 seconds, when ggml queries the device name and compute capability to print them. After that the GPU goes offline, and GPU usage stays at 0%.