Open
@artur-ag
Description
I compiled ggml with -DGGML_CUBLAS=ON, built clip.cpp against it, and used it to compute text encodings, but the GPU is not being used: the code takes the same amount of time as the CPU-only build. Is this expected? Does clip_text_encode always run on the CPU no matter what, or did I forget a step?
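For reference, the build followed the standard ggml CMake flow (directory names here are illustrative; `-DGGML_CUBLAS=ON` is the only flag I added):

```shell
# Configure ggml with cuBLAS enabled, then build in Release mode.
# Build-directory name is illustrative; adjust to your checkout.
cmake -B build -DGGML_CUBLAS=ON
cmake --build build --config Release
```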
Details:
ggml is detecting the GPU without problem (Nvidia AGX Orin):
```
$ ./myapp
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7
```
Simplified version of my code:
```cpp
#include "clip.h"

// ...
std::string model = "clip-vit-base-patch32_ggml-text-model-f16.gguf";
clip_ctx *ctx = clip_model_load(model.c_str(), verbosity);

for (int i = 0; i < 1000; i++) {
    clip_tokenize(ctx, "person", &tokens);  // string literal is already a const char*
    float txt_vec[512];
    clip_text_encode(ctx, /*threads:*/ 4, &tokens, txt_vec, true);
}
```
This takes 8 seconds to finish. While it runs I have jtop open, and the GPU is only active during the first ~3 seconds, when ggml queries the device name and compute capability to print them. After that the GPU goes offline, and GPU usage stays at 0%.