
I'd like to get a performance boost over standard Julia matrix-vector multiplication by using my Intel HD Graphics 4000 (1536 MB) GPU, but the best I can manage is roughly an order of magnitude slower than the plain Julia version.

The kernel I'm using is based on this approach. I don't know how ArrayFire achieves such high speeds; they're clearly using some sort of black magic, but I don't know what it is. Anyway, here is the test code:

using OpenCL
const cl = OpenCL
function mvmulJulia(M::Int32, N::Int32)
    srand(1)
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    t = @elapsed A * x
    println(t, " seconds for Julia")
    nothing
end

function mvmulGPU(M::Int32, N::Int32, P::Int32)
    @assert N % P == 0
    srand(1)
    TPG = div(N, P)
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    device, ctx, queue = cl.create_compute_context()
    ctx = cl.Context(device)
    queue = cl.CmdQueue(ctx, :profile)   # recreate the queue with profiling enabled
    A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
    x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
    y_buff = cl.Buffer(Float32, ctx, :w, M)
    mvmulkernel = """
    kernel void mvmul(int M,
                      int N,
                      int P,
                      int TPG,
                      const global float *A,
                      const global float *x,
                      global float *y)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);
        int tpg = get_local_id(1);
        local float sums[$(TPG)];
        float sum = 0.0f;
        for (int p = 0; p < P; p++)
        {
            sum += A[M * (TPG * p + tpg) + i] * x[TPG * p + tpg];
        }
        sums[tpg] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);
        if (j == 0)
        {
            float sumtotal = 0.0f;
            for (int t = 0; t < TPG; t++)
            {
                sumtotal += sums[t];
            }
            y[i] = sumtotal;
        }
    }
    """
    program = cl.Program(ctx, source=mvmulkernel) |> cl.build!
    kernel = cl.Kernel(program, "mvmul")
    evt = cl.call(queue, kernel, (M, N), (1, P), M, N, P, TPG, A_buff, x_buff, y_buff)
    t = round(evt[:profile_duration] * 1e-9, 6)
    println(t, " seconds on GPU")
    y = cl.read(queue, y_buff)
    println(isapprox(y, A * x))
    nothing
end
M = Int32(4000)
N = Int32(300)
P = Int32(50)
mvmulJulia(M, N)
mvmulGPU(M, N, P)

You can try out different M, N, and P at your leisure.
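For example, a quick sweep over a few shapes could look like the sketch below (the shapes are just examples I picked; it assumes the two functions above are defined and that N is divisible by P):

# Sketch: compare the CPU and GPU paths over a few example shapes.
# Assumes mvmulJulia and mvmulGPU from above are in scope and that N % P == 0.
for (M, N, P) in [(4000, 300, 50), (1000, 1000, 50), (300, 4000, 100)]
    println("M = $M, N = $N, P = $P")
    mvmulJulia(Int32(M), Int32(N))
    mvmulGPU(Int32(M), Int32(N), Int32(P))
end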

asked Jul 14, 2016 at 22:53

1 Answer


OK, so I figured out what I was doing wrong. Basically, I was completely misunderstanding how the work was supposed to be divided into work-groups and work-items.

The code in my original post launches one thread for each element of the matrix A, and then breaks each row of the matrix into work-groups of size P.

What I should have done instead is use P threads per row, for a total of (M, P) threads, and collect each row into a single work-group of size P. So instead of N/P work-groups of size P per row, there is now a single work-group of size P for each row. Hopefully that makes sense.
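Here is a plain-Julia sketch of that decomposition (CPU-only, just to illustrate the indexing; the function and its name are mine and are not part of the script below):

# Sketch (plain Julia, not GPU code): the decomposition described above.
# One "work-group" of P threads handles row i; "thread" j sums every P-th
# column starting at j, and the partial sums are then reduced, mirroring
# what the j == 0 branch does in mvmul2 below.
function rowwise_groups(A::Matrix{Float32}, x::Vector{Float32}, P::Int)
    M, N = size(A)
    @assert N % P == 0
    y = zeros(Float32, M)
    for i in 1:M                     # one work-group per row
        sums = zeros(Float32, P)     # "local memory": one slot per thread
        for j in 1:P                 # the P threads in the group
            for q in 0:div(N, P) - 1
                sums[j] += A[i, j + P * q] * x[j + P * q]
            end
        end
        y[i] = sum(sums)             # the reduction done by thread j == 0
    end
    return y
end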

So here is the corrected code. I didn't put it into functions this time, so just run the script as is. The script includes both mvmul1 and mvmul2 from that website I linked to in my original post.

using OpenCL
const cl = OpenCL
srand(1)
M = Int32(30)
N = Int32(300)
P = Int32(30)
A = rand(Float32, M, N)
x = rand(Float32, N)
mvmul1 = """
kernel void mvmul1(int M,
 int N,
 const global float *A,
 const global float *x,
 global float *y)
 { 
 int i = get_global_id(0);
 float acc = 0.0f;
 for (int j=0; j<N; j++)
 {
 acc += A[M * j + i] * x[j];
 }
 y[i] = acc;
 }
"""
mvmul2 = """
kernel void mvmul2(int M,
 int N,
 int P,
 const global float *A,
 const global float *x,
 global float *y)
 { 
 int i = get_global_id(0);
 int j = get_global_id(1);
 local float sums[$(P)];
 float sum = 0.0f;
 for (int q=0; q<(N / P); q++)
 {
 sum += A[M * (j + P * q) + i] * x[j + P * q];
 }
 sums[j] = sum;
 barrier(CLK_LOCAL_MEM_FENCE);
 if (j == 0)
 {
 float sumtotal = 0.0f;
 for (int p=0; p<P; p++)
 {
 sumtotal += sums[p];
 }
 y[i] = sumtotal;
 }
 }
"""
device, ctx, queue = cl.create_compute_context()
ctx = cl.Context(device)
queue = cl.CmdQueue(ctx, :profile)
"""mvmul1"""
A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)
program = cl.Program(ctx, source=mvmul1) |> cl.build!
kernel = cl.Kernel(program, "mvmul1")
evt = cl.call(queue, kernel, M, nothing, M, N, A_buff, x_buff, y_buff)
tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul1 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))
"""mvmul2"""
A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)
program = cl.Program(ctx, source=mvmul2) |> cl.build!
kernel = cl.Kernel(program, "mvmul2")
evt = cl.call(queue, kernel, (M, P), (1, P), M, N, P, A_buff, x_buff, y_buff)
tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul2 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))

You'll notice that for M < N, mvmul2 does well while mvmul1 doesn't.

But if you change M, N, and P such that M >= N, then mvmul1 does well while mvmul2 doesn't, which is interesting. A quick sweep like the one sketched below makes the crossover easy to check.
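A small helper along these lines can rerun both kernels for several shapes (a sketch only: the compare function is mine; it reuses ctx, queue, P, and the kernel strings from the script above, and P is kept fixed because the local array size was interpolated into the mvmul2 source):

# Sketch: rerun both kernels for a few (M, N) shapes to see the crossover.
# Assumes ctx, queue, P, mvmul1, and mvmul2 from the script above are in scope;
# P is not varied here because $(P) was baked into the mvmul2 source string.
function compare(M::Int32, N::Int32)
    @assert N % P == 0
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
    x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
    y_buff = cl.Buffer(Float32, ctx, :w, M)
    k1 = cl.Kernel(cl.Program(ctx, source=mvmul1) |> cl.build!, "mvmul1")
    k2 = cl.Kernel(cl.Program(ctx, source=mvmul2) |> cl.build!, "mvmul2")
    e1 = cl.call(queue, k1, M, nothing, M, N, A_buff, x_buff, y_buff)
    e2 = cl.call(queue, k2, (M, P), (1, P), M, N, P, A_buff, x_buff, y_buff)
    t = @elapsed A * x
    println("M = $M, N = $N:  mvmul1 ", round(t / (e1[:profile_duration] * 1e-9), 3),
            "x, mvmul2 ", round(t / (e2[:profile_duration] * 1e-9), 3), "x as fast as Julia")
end
for (M, N) in [(30, 300), (300, 300), (3000, 300)]
    compare(Int32(M), Int32(N))
end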

answered Jul 15, 2016 at 15:03
  • Just out of curiosity, what's the performance like now? (Commented Oct 30, 2016 at 2:11)
  • @OscarSmith: Honestly, I have no idea; it must have been better, but it was too long ago. My best advice would be to compare and contrast yourself. (Commented Oct 30, 2016 at 2:14)
