OpenCL abstraction and actual hardware

Question 1

I'm trying to develop a better intuition for the mapping between OpenCL's abstraction and the actual hardware. For instance, using the late-2011 MacBook Pro's configuration:

1)

Radeon 6770M GPU: http://www.amd.com/us/products/notebook/graphics/amd-radeon-6000m/amd-radeon-6700m-6600m/Pages/amd-radeon-6700m-6600m.aspx#2

"480 Stream Processors" I guess is the important number there.

2)

On the other hand the OpenCL API gives me these numbers:

DEVICE_NAME = ATI Radeon HD 6770M
DRIVER_VERSION = 1.0
DEVICE_VENDOR = AMD
DEVICE_VERSION = OpenCL 1.1 
DEVICE_MAX_COMPUTE_UNITS = 6
DEVICE_MAX_CLOCK_FREQUENCY = 675
DEVICE_GLOBAL_MEM_SIZE = 1073741824
DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_ADDRESS_BITS = 32
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 0
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 0
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = (1024, 1024, 1024)

And querying the work-group size and multiple for a trivial kernel (pass-through float4 from input to output global memory) gives:

CL_KERNEL_PREFERRED_WORKGROUP_SIZE_MULTIPLE = 64
CL_KERNEL_WORK_GROUP_SIZE = 256

3)

The OpenCL specification states that an entire work group must be able to run concurrently on a device's compute unit.

4)

OpenCL also gives the device's SIMD width through the multiple, which is 64 in the above case.

Somehow I cannot relate the "6", the "480", and the powers of two to each other. If the number of compute units is 6 and the SIMD width is 64, I get 384.

Can anybody explain how these numbers relate, especially to hardware?

Eric Bainville · Accepted Answer · 2013-02-07 23:56:55Z

In this GPU, each "compute unit" is a core executing one or more work-groups.

The maximum size of each work-group is 256 for your specific kernel (obtained with clGetKernelWorkGroupInfo). It can be less if your kernel requires more resources (registers, local memory).
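As a rough illustration of how that limit interacts with the preferred multiple reported in the question, the sketch below lists the work-group sizes that are multiples of 64 and still fit under the kernel's maximum of 256. The helper name `candidate_local_sizes` is hypothetical, and both numbers are simply the ones from the question; for another kernel or device they would have to be queried again.

```python
def candidate_local_sizes(kernel_max_wg_size, preferred_multiple):
    """Hypothetical helper: work-group sizes that are multiples of
    preferred_multiple and do not exceed kernel_max_wg_size."""
    return list(range(preferred_multiple, kernel_max_wg_size + 1, preferred_multiple))

# Values reported in the question for the trivial pass-through kernel.
KERNEL_WORK_GROUP_SIZE = 256   # CL_KERNEL_WORK_GROUP_SIZE
PREFERRED_MULTIPLE = 64        # CL_KERNEL_PREFERRED_WORKGROUP_SIZE_MULTIPLE

sizes = candidate_local_sizes(KERNEL_WORK_GROUP_SIZE, PREFERRED_MULTIPLE)
print(sizes)  # [64, 128, 192, 256]
```

If a heavier kernel reported a smaller maximum, the same call would just return a shorter list.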

In each core, 16 work-items are physically active at a given time and execute the same "large instruction" (see VLIW5) mapped onto 5 arithmetic units (ALUs). That gives 5*16 = 80 ALUs per core, or 480 "stream processors" for the 6 cores.
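The counting argument can be checked in a few lines. This is only a sketch of the arithmetic; the constant names are made up for illustration, and the figures (6 compute units, 16 lanes, 5 ALUs per lane) are the ones given in the question and in this answer.

```python
COMPUTE_UNITS = 6      # CL_DEVICE_MAX_COMPUTE_UNITS reported in the question
LANES_PER_UNIT = 16    # work-items physically active per compute unit
ALUS_PER_LANE = 5      # VLIW5: one wide instruction feeds 5 ALUs

alus_per_unit = LANES_PER_UNIT * ALUS_PER_LANE
stream_processors = COMPUTE_UNITS * alus_per_unit

print(alus_per_unit)      # 80
print(stream_processors)  # 480
```

The 384 from the question (6 * 64) counts something different: the work-items of one wavefront per compute unit, not physical ALUs.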

Work-items are actually executed in blocks of 64 (a "wavefront" in AMD terminology): all 64 work-items execute the same VLIW5 instruction and are physically executed in 4 passes of 16. This is why you get a preferred work-group size multiple of 64.
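To make the wavefront bookkeeping concrete, here is a small sketch, assuming the 64-wide wavefront and 16 lanes described above. The function names are hypothetical, and the model is deliberately simplified: it only counts slots and says nothing about scheduling or latency.

```python
WAVEFRONT_SIZE = 64   # work-items per wavefront on this GPU
LANES_PER_UNIT = 16   # work-items physically executed per pass

def passes_per_wavefront(wavefront_size=WAVEFRONT_SIZE, lanes=LANES_PER_UNIT):
    """Number of passes needed to push one wavefront through the lanes."""
    return -(-wavefront_size // lanes)  # ceiling division

def wavefronts_needed(work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts a work-group of the given size occupies."""
    return -(-work_group_size // wavefront_size)

def idle_slots(work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Wavefront slots left unused when the size is not a multiple of 64."""
    return wavefronts_needed(work_group_size, wavefront_size) * wavefront_size - work_group_size

print(passes_per_wavefront())   # 4
print(wavefronts_needed(256))   # 4
print(idle_slots(100))          # 28
```

Under this simplified model, a work-group of 100 work-items occupies two wavefronts and leaves 28 slots unused, which is the practical reason to stick to multiples of 64.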

Recent AMD GPUs have switched to a VLIW4 model, where each instruction maps to only 4 ALUs.

Even more recent AMD GPUs have switched away from VLIW4 to a "RISC"-like SIMD architecture.
