So I'd like to get a performance boost above and beyond standard Julia matrix-vector multiply, using my Intel HD Graphics 4000 1536 MB GPU, but instead my kernel runs about an order of magnitude slower. The kernel I'm using is based on this approach. I don't know how ArrayFire gets such fast speeds; they're clearly using some sort of black magic, but I don't know what it is. Anyway, here is the test code:
using OpenCL
const cl = OpenCL
function mvmulJulia(M::Int32, N::Int32)
    srand(1)
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    t = @elapsed A * x
    println(t, " seconds for Julia")
    nothing
end
function mvmulGPU(M::Int32, N::Int32, P::Int32)
    @assert N % P == 0
    srand(1)
    TPG = div(N, P)  # threads per group
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    device, ctx, queue = cl.create_compute_context()
    ctx = cl.Context(device)
    queue = cl.CmdQueue(ctx, :profile)  # recreate the queue with profiling enabled
    A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
    x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
    y_buff = cl.Buffer(Float32, ctx, :w, M)
    mvmulkernel = """
    kernel void mvmul(int M,
                      int N,
                      int P,
                      int TPG,
                      const global float *A,
                      const global float *x,
                      global float *y)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);
        int tpg = get_local_id(1);
        local float sums[$(TPG)];
        float sum = 0.0f;
        for (int p=0; p<P; p++)
        {
            sum += A[M * (TPG * p + tpg) + i] * x[TPG * p + tpg];
        }
        sums[tpg] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);
        if (j == 0)
        {
            float sumtotal = 0.0f;
            for (int t=0; t<TPG; t++)
            {
                sumtotal += sums[t];
            }
            y[i] = sumtotal;
        }
    }
    """
    program = cl.Program(ctx, source=mvmulkernel) |> cl.build!
    kernel = cl.Kernel(program, "mvmul")
    evt = cl.call(queue, kernel, (M, N), (1, P), M, N, P, TPG, A_buff, x_buff, y_buff)
    t = round(evt[:profile_duration] * 1e-9, 6)  # nanoseconds to seconds
    println(t, " seconds on GPU")
    y = cl.read(queue, y_buff)
    println(isapprox(y, A * x))
    nothing
end
M = Int32(4000)
N = Int32(300)
P = Int32(50)
mvmulJulia(M, N)
mvmulGPU(M, N, P)
You can try out different M, N, and P at your leisure.
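For example (these particular sizes are just for illustration; the only constraint the script enforces is that N be divisible by P):
M = Int32(2000)
N = Int32(512)
P = Int32(64)  # 512 % 64 == 0, so the @assert in mvmulGPU passes
mvmulJulia(M, N)
mvmulGPU(M, N, P)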
1 Answer
OK, so I figured out what I was doing wrong. Basically, I was completely misunderstanding how the work was supposed to be divided up into work-groups and work-items.
What I had in the code in my original post was one thread for each element of the matrix A, with each row of the matrix broken up into work-groups of size P.
What I was supposed to do instead was have P threads per row, for a total of (M, P) threads, and then collect each row into a single work-group (of size P). So instead of having N/P work-groups of size P per row, I now have a single work-group of size P for each row. Hopefully that makes sense to everyone.
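To make the decomposition concrete, here is a little plain-Julia sketch (the variable names are mine; the sizes match the script below) of how many work-groups each launch creates per row:
# Sizes used in the corrected script below.
M, N, P = Int32(30), Int32(300), Int32(30)

# Original post: global size (M, N), local size (1, P)
groups_per_row_old = div(N, P)  # N / P = 10 work-groups per row

# Corrected version: global size (M, P), local size (1, P)
groups_per_row_new = div(P, P)  # exactly 1 work-group per row

println("old: ", groups_per_row_old, " work-groups per row; new: ", groups_per_row_new)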
So here is the corrected code. I didn't put it into functions this time, so just run the script as is. The script includes both mvmul1 and mvmul2 from the website I linked to in my original post.
using OpenCL
const cl = OpenCL
srand(1)
M = Int32(30)
N = Int32(300)
P = Int32(30)
A = rand(Float32, M, N)
x = rand(Float32, N)
mvmul1 = """
kernel void mvmul1(int M,
                   int N,
                   const global float *A,
                   const global float *x,
                   global float *y)
{
    int i = get_global_id(0);
    float acc = 0.0f;
    for (int j=0; j<N; j++)
    {
        acc += A[M * j + i] * x[j];
    }
    y[i] = acc;
}
"""
mvmul2 = """
kernel void mvmul2(int M,
                   int N,
                   int P,
                   const global float *A,
                   const global float *x,
                   global float *y)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    local float sums[$(P)];
    float sum = 0.0f;
    for (int q=0; q<(N / P); q++)
    {
        sum += A[M * (j + P * q) + i] * x[j + P * q];
    }
    sums[j] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (j == 0)
    {
        float sumtotal = 0.0f;
        for (int p=0; p<P; p++)
        {
            sumtotal += sums[p];
        }
        y[i] = sumtotal;
    }
}
"""
device, ctx, queue = cl.create_compute_context()
ctx = cl.Context(device)
queue = cl.CmdQueue(ctx, :profile)  # recreate the queue with profiling enabled

# mvmul1
A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)
program = cl.Program(ctx, source=mvmul1) |> cl.build!
kernel = cl.Kernel(program, "mvmul1")
evt = cl.call(queue, kernel, M, nothing, M, N, A_buff, x_buff, y_buff)
tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul1 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))

# mvmul2
A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)
program = cl.Program(ctx, source=mvmul2) |> cl.build!
kernel = cl.Kernel(program, "mvmul2")
evt = cl.call(queue, kernel, (M, P), (1, P), M, N, P, A_buff, x_buff, y_buff)
tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul2 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))
You'll notice that for M < N, mvmul2 does well while mvmul1 doesn't. But if you change M, N, and P such that M >= N, then mvmul1 does well while mvmul2 doesn't. Interesting! Presumably the extra per-row parallelism in mvmul2 only pays off when there are too few rows to keep the device busy on their own.
- Oscar Smith (Oct 30, 2016 at 2:11): Just out of curiosity, what's the performance like now?
- Set (Oct 30, 2016 at 2:14): @OscarSmith, honestly I have no idea; it must have been better, but it was too long ago. My best advice would be to compare and contrast yourself.