I'm trying to develop a better intuition of the mapping between OpenCL's abstraction and the actual hardware. For instance, using the late-2011 Macbook pro's configuration:

1)

Radeon 6770M GPU: http://www.amd.com/us/products/notebook/graphics/amd-radeon-6000m/amd-radeon-6700m-6600m/Pages/amd-radeon-6700m-6600m.aspx#2

"480 Stream Processors" I guess is the important number there.

2)

On the other hand the OpenCL API gives me these numbers:

DEVICE_NAME = ATI Radeon HD 6770M
DRIVER_VERSION = 1.0
DEVICE_VENDOR = AMD
DEVICE_VERSION = OpenCL 1.1 
DEVICE_MAX_COMPUTE_UNITS = 6
DEVICE_MAX_CLOCK_FREQUENCY = 675
DEVICE_GLOBAL_MEM_SIZE = 1073741824
DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_ADDRESS_BITS = 32
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 0
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 0
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = (1024, 1024, 1024)

Querying the work-group size and preferred multiple for a trivial kernel (a pass-through that copies a float4 from input to output global memory) gives:

CL_KERNEL_PREFERRED_WORKGROUP_SIZE_MULTIPLE = 64
CL_KERNEL_WORK_GROUP_SIZE = 256

3)

The OpenCL specification states that an entire work group must be able to run concurrently on a device's compute unit.

4)

OpenCL also gives the device's SIMD width through the preferred multiple, which is 64 in the above case.

Somehow I cannot relate the "6", the "480" and the powers of two to one another. If the number of compute units is 6 and the SIMD width is 64, I get 384.

Can anybody explain how these numbers relate, especially to hardware?

asked Feb 7, 2013 at 22:03

1 Answer

In this GPU, each "compute unit" is a core executing one or more work-groups.

The maximum size of each work-group is 256 for your specific kernel (obtained with clGetKernelWorkGroupInfo). It can be less if your kernel requires more resources (registers, local memory).

In each core, 16 work-items are physically active at any given time and execute the same "large instruction" (see VLIW5), which is mapped onto 5 arithmetic units (ALUs). That gives 5*16 = 80 ALUs per core, or 6*80 = 480 "stream processors" for the 6 cores.
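
The arithmetic above can be checked with a few lines. This is only a sketch: the variable names are made up for illustration, and the values are the ones quoted in the question and in this answer.

```python
# Values quoted in the question and this answer; names are illustrative only.
compute_units = 6        # CL_DEVICE_MAX_COMPUTE_UNITS
lanes_per_cu = 16        # work-items physically active per compute unit
alus_per_lane = 5        # VLIW5: one "large instruction" drives 5 ALUs

alus_per_cu = lanes_per_cu * alus_per_lane
stream_processors = compute_units * alus_per_cu

print(alus_per_cu)        # 80
print(stream_processors)  # 480

# The estimate in the question multiplies the compute units by the
# preferred multiple (64) instead, which counts something different.
preferred_multiple = 64   # CL_KERNEL_PREFERRED_WORKGROUP_SIZE_MULTIPLE
print(compute_units * preferred_multiple)  # 384
```

So the 480 comes from ALUs (6 * 16 * 5), while the 384 counts work-items (6 * 64); the two numbers measure different things and are not expected to match.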

Work-items are actually executed in blocks of 64 (a "wavefront" in AMD terminology): all 64 work-items execute the same VLIW5 instruction and are physically processed in 4 passes of 16. This is why you get a preferred work-group size multiple of 64.
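
As a rough sketch of how these numbers fit together, the snippet below derives the number of passes per wavefront and the number of wavefronts in a 256-item work-group, and pads a global size to a multiple of 64. The helper names are hypothetical (they are not part of the OpenCL API), and padding the global size is a common practice assumed here rather than something the API requires.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b

def round_up_to_multiple(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n (hypothetical helper)."""
    return ceil_div(n, multiple) * multiple

wavefront_size = 64     # CL_KERNEL_PREFERRED_WORKGROUP_SIZE_MULTIPLE
lanes_per_cu = 16       # work-items physically active per compute unit
work_group_size = 256   # CL_KERNEL_WORK_GROUP_SIZE for the trivial kernel

passes_per_wavefront = ceil_div(wavefront_size, lanes_per_cu)
wavefronts_per_group = ceil_div(work_group_size, wavefront_size)

print(passes_per_wavefront)            # 4
print(wavefronts_per_group)            # 4
print(round_up_to_multiple(1000, 64))  # 1024
```

If the global size is padded this way, the kernel needs a bounds check so that the extra work-items do not read or write past the end of the buffers.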

Recent AMD GPUs have switched to a VLIW4 model, where each instruction maps to only 4 ALUs.

answered Feb 7, 2013 at 23:56

1 Comment

Even more recent AMD GPUs have switched away from VLIW4 to a "RISC"-like SIMD architecture.
