GPU optimization across different cards #1427

Pinned
JohannesGaessler started this conversation in Ideas
Discussion options

During the implementation of CUDA-accelerated token generation there was a problem when optimizing performance: different people with different GPUs were getting vastly different results in terms of which implementation was fastest. For example, @ggerganov did an alternative implementation that was 1.4 times faster on his RTX 4080 but 2 times slower on my GTX 1070. The point of this discussion is to work out how to resolve this issue.

I personally believe that there should be some sort of config files for different GPUs. The user could then use a CLI argument like --gpu gtx1070 to get the GPU kernel, CUDA block size, etc. that provide optimal performance. The determination of the optimal configuration could then be outsourced to users: they would not need programming knowledge to find the optimal parameters for a specific GPU, only to edit a config file and check whether the program becomes slower or faster.
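
To make the idea concrete, here is a minimal sketch of what such a per-GPU parameter lookup could look like. Everything in it is hypothetical: the struct fields, the table entries, and the tuned values are placeholders (as would be the proposed --gpu override), not anything that exists in the code today; only cudaGetDeviceProperties is a real API call.

```cpp
// Hypothetical sketch of per-GPU tuning parameters. The fields and the
// example entries are made up; real values would come from user-contributed
// config files (or a CLI override such as the proposed --gpu flag).
#include <cstring>
#include <cuda_runtime.h>

struct gpu_config {
    const char * name;      // substring matched against the CUDA device name
    int  cuda_block_size;   // threads per block for the token-generation kernels
    bool use_alt_kernel;    // choose between alternative kernel implementations
};

// Placeholder entries for illustration only.
static const gpu_config k_configs[] = {
    { "GTX 1070", 256, false },
    { "RTX 4080", 128, true  },
};

static gpu_config select_config(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);  // prop.name, e.g. "NVIDIA GeForce GTX 1070"
    for (const gpu_config & cfg : k_configs) {
        if (strstr(prop.name, cfg.name) != nullptr) {
            return cfg;  // first tuned entry that matches the detected GPU
        }
    }
    return { "default", 256, false };  // fallback when no tuned entry exists
}
```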

Replies: 8 comments 8 replies

Comment options

It could be determined at runtime by timing and tuning some parameters; not optimal, but acceptable for most users.

Or there could be another utility that outputs a config file. These files could be added to the repo by PRs to share with other users. This is pretty much how CLBlast does it.
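
As a rough sketch of the runtime-tuning idea: time the kernel at a few candidate block sizes with CUDA events and keep the fastest. The kernel below is a trivial stand-in, not one of the actual llama.cpp kernels, and the candidate list is arbitrary.

```cpp
// Sketch of runtime autotuning: benchmark a kernel at several block sizes
// and keep the fastest. A real tuner would time the actual dequantization /
// mat-vec kernels on realistic tensor shapes instead of this placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(const float * x, float * y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = 2.0f * x[i] + 1.0f;  // placeholder work
    }
}

static int pick_block_size(const float * d_x, float * d_y, int n) {
    const int candidates[] = { 64, 128, 256, 512 };
    int   best_bs   = candidates[0];
    float best_time = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int bs : candidates) {
        const int grid = (n + bs - 1) / bs;
        dummy_kernel<<<grid, bs>>>(d_x, d_y, n);  // warm-up launch
        cudaEventRecord(start);
        for (int iter = 0; iter < 10; ++iter) {
            dummy_kernel<<<grid, bs>>>(d_x, d_y, n);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_time) {
            best_time = ms;
            best_bs   = bs;
        }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("best block size: %d (%.3f ms for 10 launches)\n", best_bs, best_time);
    return best_bs;
}
```

The chosen value could then be cached or written out as a config file to be shared via PRs, similar to how CLBlast stores its tuning results.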

0 replies
Comment options

Some numbers for my old 1080Ti for 13B, and shiny new 3090Ti for 13B and 30B:

$ ldd ./main | grep blas
	libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x00007fc822800000)

13B, 1080Ti:

$ CUDA_VISIBLE_DEVICES=2 ./main -m ./models/13B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 8 -n 64 -t 8 -ngl 1000
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 546 (08737ef)
main: seed = 8
llama.cpp: loading model from ./models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 7660 MB
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0
 I believe the meaning of life is to be found in the mysterious relationship between mind and body. To me, it means that the purpose of life is to experience all aspects of human potential—the physical, emotional, intellectual, creative, and spiritual dimensions.
A healthy mind and body are interdependent. If we ignore or repress
llama_print_timings: load time = 4872.46 ms
llama_print_timings: sample time = 42.85 ms / 64 runs ( 0.67 ms per token)
llama_print_timings: prompt eval time = 1920.90 ms / 8 tokens ( 240.11 ms per token)
llama_print_timings: eval time = 7254.80 ms / 63 runs ( 115.16 ms per token)
llama_print_timings: total time = 12187.07 ms

13B, 3090Ti:

$ CUDA_VISIBLE_DEVICES=0 ./main -m ./models/13B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 8 -n 64 -t 8 -ngl 1000
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 546 (08737ef)
main: seed = 8
llama.cpp: loading model from ./models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 7660 MB
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0
 I believe the meaning of life is to be found in the mysterious relationship between mind and body. To me, it means that the purpose of life is to experience all aspects of human potential—the physical, emotional, intellectual, creative, and spiritual dimensions.
A healthy mind and body are interdependent. If we ignore or repress
llama_print_timings: load time = 3956.45 ms
llama_print_timings: sample time = 52.94 ms / 64 runs ( 0.83 ms per token)
llama_print_timings: prompt eval time = 415.26 ms / 8 tokens ( 51.91 ms per token)
llama_print_timings: eval time = 2686.38 ms / 63 runs ( 42.64 ms per token)
llama_print_timings: total time = 6716.90 ms

30B, 3090Ti:

$ CUDA_VISIBLE_DEVICES=0 ./main -m ./models/30B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 8 -n 64 -t 8 -ngl 1000
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 546 (08737ef)
main: seed = 8
llama.cpp: loading model from ./models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 135.75 KB
llama_model_load_internal: mem required = 21695.49 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 19261 MB
llama_init_from_file: kv self size = 3120.00 MB
system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0
 I believe the meaning of life is to be happy, and happiness can be achieved only through serving others.
—Dalai Lama
The first time I met His Holiness the Dalai Lama was in 1987. It was a brief meeting, but one that has stayed with me throughout my life. At the time,
llama_print_timings: load time = 13960.70 ms
llama_print_timings: sample time = 44.82 ms / 64 runs ( 0.70 ms per token)
llama_print_timings: prompt eval time = 1027.37 ms / 8 tokens ( 128.42 ms per token)
llama_print_timings: eval time = 5416.17 ms / 63 runs ( 85.97 ms per token)
llama_print_timings: total time = 19439.58 ms
0 replies
Comment options

So, are you considering adding both implementations to the current build (switchable with a single argument), and then asking people to report the speed difference between them?

0 replies
Comment options

More or less, but it's also about e.g. determining which block sizes work well for specific cards.

0 replies
Comment options

PC - 12400/3070ti, model Wizard-Vicuna-13B-Uncensored.ggml.q8_0.
~360ms per token without -ngl
~300ms per token with -ngl 16
CUDA load ~20-30%

It's faster, but not by much. Maybe I did something wrong?

0 replies
Comment options

Some additional comparative numbers for 13B, indicating that the performance improvement is much more pronounced on the 3090Ti than on the 1080Ti. Even so, the 1080Ti is appreciably improved.

Before:

$ git log | head -3
commit 7f15c5c477d9933689a9d1c40794483e350c2f19
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Fri Apr 28 21:32:52 2023 +0300

13B, 1080Ti, 16xPCIe Gen3:

llama_print_timings: load time = 2252.10 ms
llama_print_timings: sample time = 42.16 ms / 64 runs ( 0.66 ms per run)
llama_print_timings: prompt eval time = 1458.86 ms / 8 tokens ( 182.36 ms per token)
llama_print_timings: eval time = 13105.05 ms / 63 runs ( 208.02 ms per run)
llama_print_timings: total time = 15400.48 ms

13B, 3090Ti, 16xPCIe Gen3:

llama_print_timings: load time = 2992.57 ms
llama_print_timings: sample time = 41.59 ms / 64 runs ( 0.65 ms per run)
llama_print_timings: prompt eval time = 1458.74 ms / 8 tokens ( 182.34 ms per token)
llama_print_timings: eval time = 13272.89 ms / 63 runs ( 210.68 ms per run)
llama_print_timings: total time = 16308.19 ms

After:

$ git log | head -3
commit 08737ef720f0510c7ec2aa84d7f70c691073c35d
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Sat May 13 17:40:58 2023 +0300

13B, 1080Ti, 16xPCIe Gen3:

llama_print_timings: load time = 4539.69 ms
llama_print_timings: sample time = 42.74 ms / 64 runs ( 0.67 ms per token)
llama_print_timings: prompt eval time = 1860.12 ms / 8 tokens ( 232.51 ms per token)
llama_print_timings: eval time = 6957.73 ms / 63 runs ( 110.44 ms per token)
llama_print_timings: total time = 11556.87 ms

13B, 3090Ti, 16xPCIe Gen3:

llama_print_timings: load time = 3856.21 ms
llama_print_timings: sample time = 42.73 ms / 64 runs ( 0.67 ms per token)
llama_print_timings: prompt eval time = 413.36 ms / 8 tokens ( 51.67 ms per token)
llama_print_timings: eval time = 2572.92 ms / 63 runs ( 40.84 ms per token)
llama_print_timings: total time = 6488.52 ms
0 replies
Comment options

@ggerganov as requested, here are benchmarks on an A6000. The CPU is an AMD Ryzen Threadripper 3960X 24-Core Processor. The CUDA toolkit is 11.8.

Setup
root@51a789f0df2d:~/llama.cpp# git log | head -1
commit 08737ef720f0510c7ec2aa84d7f70c691073c35d
root@51a789f0df2d:~/llama.cpp# make clean && LLAMA_CUBLAS=1 make -j
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
llama.cpp:2686:27: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2686 | kin3d->data = (void *) inp;
 | ^~~~~~~~~~~~
llama.cpp:2690:27: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2690 | vin3d->data = (void *) inp;
 | ^~~~~~~~~~~~
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
==== Run ./main -h for help. ====
root@51a789f0df2d:~/llama.cpp#

30B

30B: 75.31 ms per token

30B log
root@51a789f0df2d:~/llama.cpp# ./main -m /workspace/huggy30.ggml.q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 8 -n 64 -t 8 -ngl 1000
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 546 (08737ef)
main: seed = 8
llama.cpp: loading model from /workspace/huggy30.ggml.q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 135.75 KB
llama_model_load_internal: mem required = 21695.49 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 19261 MB
llama_init_from_file: kv self size = 3120.00 MB
system_info: n_threads = 8 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0
 I believe the meaning of life is to be happy, and happiness can be achieved only through serving others.
—Dalai Lama
The first time I met His Holiness the Dalai Lama was in 1987. It was a brief meeting, but one that has stayed with me throughout my life. At the time,
llama_print_timings: load time = 7813.07 ms
llama_print_timings: sample time = 36.00 ms / 64 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 1096.81 ms / 8 tokens ( 137.10 ms per token)
llama_print_timings: eval time = 4744.71 ms / 63 runs ( 75.31 ms per token)
llama_print_timings: total time = 12606.04 ms
root@51a789f0df2d:~/llama.cpp#

65B

65B: 173.08 ms per token

65B log

root@51a789f0df2d:~/llama.cpp# ./main -m /workspace/huggy65B.ggml.q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 8 -n 64 -t 12 -ngl 1000
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 546 (08737ef)
main: seed = 8
llama.cpp: loading model from /workspace/huggy65B.ggml.q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 180.75 KB
llama_model_load_internal: mem required = 42501.71 MB (+ 5120.00 MB per state)
llama_model_load_internal: [cublas] offloading 80 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 38756 MB
llama_init_from_file: kv self size = 5120.00 MB
system_info: n_threads = 12 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0
 I believe the meaning of life is to be happy and make others happy. My husband and I have always been a team; we’ve worked hard, raised two children with good values, and now are enjoying our grandchildren.
My life has not been all roses and sunshine. There were times in my early childhood when my
llama_print_timings: load time = 9865.45 ms
llama_print_timings: sample time = 38.29 ms / 64 runs ( 0.60 ms per token)
llama_print_timings: prompt eval time = 2518.71 ms / 8 tokens ( 314.84 ms per token)
llama_print_timings: eval time = 10904.31 ms / 63 runs ( 173.08 ms per token)
llama_print_timings: total time = 20822.38 ms
root@51a789f0df2d:~/llama.cpp#

Wow!

I am really impressed by these figures. llama.cpp is getting really close to PyTorch GPU inference.

I went on to test one of the models I uploaded to HF recently, gpt4-alpaca-lora_mlp-65B.

q4_0 GGML:

~/llama.cpp/main -m /workspace/models/GGML/gpt4-alpaca-lora_mlp-65B.ggml.q4_0.bin -p "### Instruction: write a story about llamas\n### Response:" -c 2048 -n -1 --ignore-eos -s 8 -n 64 -t 8 -ngl 1000

Result: 6.86 tokens/s (145.73 ms per run)

4bit GPTQ, tested with AutoGPTQ in CUDA mode

Result: 12.46 tokens/s

4bit GPTQ, tested with AutoGPTQ in Triton mode

Result: 6.12 tok/s

So llama.cpp is getting really competitive with PyTorch/Transformers GPU inference, and it is even beating the Triton code. Well done!

1 reply
Comment options

@TheBloke

Thank you for the data - very useful!

Btw, I think I saw that you were working on code to perform the exact same perplexity computation for AutoGPTQ as we do here in llama.cpp. If that is true, definitely let us know about your progress and whether you succeed in computing perplexity results for the GPTQ models that we can compare with the ggml Q4 and Q5 methods.

Comment options

Quick and dirty test using Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin

GPU: RTX 3060 12 GB - CPU: Ryzen 5600X - RAM: 32 GB - OS: Linux Mint.

Starting prompt:
USER: What is 4x8?
ASSISTANT:
And then asking:
write a 2 paragraph story about two llamas being good friends.

Before:

llama_print_timings: load time = 2752,82 ms
llama_print_timings: sample time = 47,11 ms / 120 runs ( 0,39 ms per token)
llama_print_timings: prompt eval time = 3017,18 ms / 36 tokens ( 83,81 ms per token)
llama_print_timings: eval time = 26119,80 ms / 120 runs ( 217,67 ms per token)
llama_print_timings: total time = 66963,68 ms

After (with --n-gpu-layers 40):

llama_print_timings: load time = 3195,00 ms
llama_print_timings: sample time = 71,59 ms / 182 runs ( 0,39 ms per token)
llama_print_timings: prompt eval time = 3467,63 ms / 36 tokens ( 96,32 ms per token)
llama_print_timings: eval time = 13326,25 ms / 182 runs ( 73,22 ms per token)
llama_print_timings: total time = 47430,71 ms

In some previous runs with --n-gpu-layers 40 I had even faster times in some cases. Overall, a great jump in inference speed.

7 replies
Comment options

There are so many moving parts in terms of GPU speed, GPU memory bandwidth, PCIe bandwidth, CPU speed, and CPU memory bandwidth that the ./perplexity test is likely a better way to get a definitive performance comparison. I haven't yet had time to run ./perplexity tests with ngl > 0.

Perf is the ETA (in hours) for the first 406 lines of wikitext.raw. Efficiency is a metric I devised as 1 / (Perf * Perplexity):

| Model | Size | Quantization | Perf | Perplexity | Efficiency | Model Src | Pull / Commit | Hardware |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama | 7B | q4_1 | 0.03 | 6.2 | 4.8 | FB | 08737ef, ngl=0 | AMD 1950X 3.4GHz 16-Core, NVidia GTX 3090Ti, 16x PCIe Gen3 |
| llama | 7B | q4_0 | 0.08 | 6.5 | 1.9 | FB | 7f15c5c | AMD 1950X 3.4GHz 16-Core, NVidia GTX 1080Ti, 16x PCIe Gen3 |
| llama | 7B | q4_0 | 0.53 | 6.5 | 0.3 | FB | be87b6e | AMD 1950X 3.4GHz 16-Core |
Comment options

JohannesGaessler May 15, 2023
Collaborator Author

Perplexity calculations are pure prompt processing though, right? So llama.cpp dequantizes the weights and then uses cuBLAS to do matrix-matrix multiplications, which means the implementation that I did for token generation is not used at all. In fact, I implemented specialized dequantization + matrix-vector multiplication kernels precisely because dequantizing and then doing a general matrix-matrix multiplication is too slow for token generation.
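
For anyone who hasn't looked at the kernels, a heavily simplified sketch of the fused dequantize + matrix-vector idea is below. The block layout (plain float scale, 32 weights per block) and the one-thread-per-row mapping are illustrative only, not the actual Q4_0 format or the real llama.cpp kernel, which also parallelizes within rows.

```cpp
// Simplified sketch: each thread walks one row of a quantized matrix,
// dequantizes the 4-bit values on the fly and accumulates the dot product
// with the input vector, so full-precision weights are never materialized.
// Launch with one thread per row, e.g. <<<(nrows + 255) / 256, 256>>>.
#include <cstdint>
#include <cuda_runtime.h>

#define QK 32  // weights per quantization block

struct block_q4_sketch {
    float   d;           // scale (illustrative; the real format uses fp16)
    uint8_t qs[QK / 2];  // 32 x 4-bit values, two per byte
};

// y = W * x, where W is (nrows x ncols) stored as rows of quantized blocks.
__global__ void dequant_mat_vec(const block_q4_sketch * W, const float * x,
                                float * y, int nrows, int ncols) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }
    const int blocks_per_row = ncols / QK;
    const block_q4_sketch * row_blocks = W + (size_t) row * blocks_per_row;

    float sum = 0.0f;
    for (int b = 0; b < blocks_per_row; ++b) {
        const float d = row_blocks[b].d;
        for (int j = 0; j < QK / 2; ++j) {
            const uint8_t q = row_blocks[b].qs[j];
            // two 4-bit values per byte, centered around 8
            const float v0 = ((q & 0x0F) - 8) * d;
            const float v1 = ((q >>   4) - 8) * d;
            sum += v0 * x[b * QK + 2 * j + 0];
            sum += v1 * x[b * QK + 2 * j + 1];
        }
    }
    y[row] = sum;
}
```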

Comment options

Ahh. Good to know. We need another benchmark.

Comment options

JohannesGaessler May 15, 2023
Collaborator Author

I just generate a bunch of tokens with an empty prompt.
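
For reference, using only the flags that already appear in the logs above, such a run could look something like this (the model path and flag values are just whatever you want to test locally):

$ ./main -m ./models/13B/ggml-model-q4_0.bin -p "" -n 128 --ignore-eos -s 8 -t 8 -ngl 1000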

Comment options

That sounds like a good microbenchmark. Maybe we want to add it to the repo to collect some data across a wide range of GPUs?

I'm more interested in real-world end-to-end performance comparisons, for which admittedly ./perplexity was always a poor candidate, but it was the best that was easily available to all.

I'm looking at running EleutherAI/lm-evaluation-harness via llama-cpp-python. lm-eval already supports GPT-3, so it might offer the possibility of providing a standardized performance and quality benchmark.
