So I'd like to get a performance boost above and beyond standard Julia matrix-vector multiply, using my Intel HD Graphics 4000 1536 MB GPU, but instead my kernel runs about an order of magnitude slower. The kernel I'm using is based on this approach. I don't know how ArrayFire gets such fast speeds; they're clearly using some sort of black magic, but I don't know what it is. Anyway, here is the test code:
using OpenCL
const cl = OpenCL
function mvmulJulia(M::Int32, N::Int32)
    srand(1)
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    t = @elapsed A * x
    println(t, " seconds for Julia")
    nothing
end
function mvmulGPU(M::Int32, N::Int32, P::Int32)
    @assert N % P == 0
    srand(1)
    TPG = div(N, P)  # threads per group
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    device, ctx, queue = cl.create_compute_context()
    ctx = cl.Context(device)
    queue = cl.CmdQueue(ctx, :profile)  # recreate the queue with profiling enabled
    A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
    x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
    y_buff = cl.Buffer(Float32, ctx, :w, M)
    mvmulkernel = """
    kernel void mvmul(int M,
                      int N,
                      int P,
                      int TPG,
                      const global float *A,
                      const global float *x,
                      global float *y)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);
        int tpg = get_local_id(1);
        local float sums[$(TPG)];
        float sum = 0.0f;
        for (int p=0; p<P; p++)
        {
            sum += A[M * (TPG * p + tpg) + i] * x[TPG * p + tpg];
        }
        sums[tpg] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);
        if (j == 0)
        {
            float sumtotal = 0.0f;
            for (int t=0; t<TPG; t++)
            {
                sumtotal += sums[t];
            }
            y[i] = sumtotal;
        }
    }
    """
    program = cl.Program(ctx, source=mvmulkernel) |> cl.build!
    kernel = cl.Kernel(program, "mvmul")
    evt = cl.call(queue, kernel, (M, N), (1, P), M, N, P, TPG, A_buff, x_buff, y_buff)
    t = round(evt[:profile_duration] * 1e-9, 6)  # nanoseconds to seconds
    println(t, " seconds on GPU")
    y = cl.read(queue, y_buff)
    println(isapprox(y, A * x))
    nothing
end
M = Int32(4000)
N = Int32(300)
P = Int32(50)
mvmulJulia(M, N)
mvmulGPU(M, N, P)
You can try out different M, N, and P at your leisure.
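For example (these particular sizes are just for illustration; the only constraint the script enforces is that N be divisible by P):
M = Int32(2000)
N = Int32(512)
P = Int32(64)  # 512 % 64 == 0, so the @assert in mvmulGPU passes
mvmulJulia(M, N)
mvmulGPU(M, N, P)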
1 Answer
OK, so I figured out what I was doing wrong. Basically, I was completely misunderstanding how the work was supposed to be divided up into work-groups and work-items.
What I had in the code in my original post was one thread for each element of the matrix A, with each row of the matrix broken up into work-groups of size P.
What I was supposed to do instead was have P threads per row, for a total of (M, P) threads, and then collect each row into a single work-group (of size P). So instead of having N/P work-groups of size P per row, I now have a single work-group of size P for each row. Hopefully that makes sense to everyone.
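To make the decomposition concrete, here is a little plain-Julia sketch (the variable names are mine; the sizes match the script below) of how many work-groups each launch creates per row:
# Sizes used in the corrected script below.
M, N, P = Int32(30), Int32(300), Int32(30)

# Original post: global size (M, N), local size (1, P)
groups_per_row_old = div(N, P)  # N / P = 10 work-groups per row

# Corrected version: global size (M, P), local size (1, P)
groups_per_row_new = div(P, P)  # exactly 1 work-group per row

println("old: ", groups_per_row_old, " work-groups per row; new: ", groups_per_row_new)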
So here is the corrected code. I didn't put it into functions this time, so just run the script as is. The script includes both mvmul1 and mvmul2 from the website I linked to in my original post.
using OpenCL
const cl = OpenCL
srand(1)
M = Int32(30)
N = Int32(300)
P = Int32(30)
A = rand(Float32, M, N)
x = rand(Float32, N)
mvmul1 = """
kernel void mvmul1(int M,
                   int N,
                   const global float *A,
                   const global float *x,
                   global float *y)
{
    int i = get_global_id(0);
    float acc = 0.0f;
    for (int j=0; j<N; j++)
    {
        acc += A[M * j + i] * x[j];
    }
    y[i] = acc;
}
"""
mvmul2 = """
kernel void mvmul2(int M,
                   int N,
                   int P,
                   const global float *A,
                   const global float *x,
                   global float *y)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    local float sums[$(P)];
    float sum = 0.0f;
    for (int q=0; q<(N / P); q++)
    {
        sum += A[M * (j + P * q) + i] * x[j + P * q];
    }
    sums[j] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (j == 0)
    {
        float sumtotal = 0.0f;
        for (int p=0; p<P; p++)
        {
            sumtotal += sums[p];
        }
        y[i] = sumtotal;
    }
}
"""
device, ctx, queue = cl.create_compute_context()
ctx = cl.Context(device)
queue = cl.CmdQueue(ctx, :profile)  # recreate the queue with profiling enabled

# mvmul1
A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)
program = cl.Program(ctx, source=mvmul1) |> cl.build!
kernel = cl.Kernel(program, "mvmul1")
evt = cl.call(queue, kernel, M, nothing, M, N, A_buff, x_buff, y_buff)
tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul1 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))

# mvmul2
A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)
program = cl.Program(ctx, source=mvmul2) |> cl.build!
kernel = cl.Kernel(program, "mvmul2")
evt = cl.call(queue, kernel, (M, P), (1, P), M, N, P, A_buff, x_buff, y_buff)
tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul2 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))
You'll notice that for M < N, mvmul2 does well while mvmul1 doesn't. But if you change M, N, and P such that M >= N, then mvmul1 does well while mvmul2 doesn't. Interesting! Presumably the extra per-row parallelism in mvmul2 only pays off when there are too few rows to keep the device busy on their own.
- Oscar Smith (Oct 30, 2016 at 2:11): Just out of curiosity, what's the performance like now?
- Set (Oct 30, 2016 at 2:14): @OscarSmith, honestly I have no idea; it must have been better, but it was too long ago. My best advice would be to compare and contrast yourself.