OpenCL-Wrapper

OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. I created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance.

Works in Windows, Linux and Android with C++17.

Use-case example: FluidX3D builds entirely on top of this OpenCL-Wrapper.

Getting started:

Install GPU Drivers and OpenCL Runtime (click to expand section)
  • Windows

    GPUs
    • Download and install the AMD/Intel/Nvidia GPU Drivers, which contain the OpenCL Runtime.
    • Reboot.
    CPUs
  • Linux

    AMD GPUs
    • Download and install AMD GPU Drivers, which contain the OpenCL Runtime, with:
      sudo apt update && sudo apt upgrade -y
      sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev
      mkdir -p ~/amdgpu
      wget -P ~/amdgpu https://repo.radeon.com/amdgpu-install/6.4.2.1/ubuntu/noble/amdgpu-install_6.4.60402-1_all.deb
      sudo apt install -y ~/amdgpu/amdgpu-install*.deb
      sudo amdgpu-install -y --usecase=graphics,rocm,opencl --opencl=rocr
      sudo usermod -a -G render,video $(whoami)
      rm -r ~/amdgpu
      sudo shutdown -r now
    Intel GPUs
    • Intel GPU Drivers are already included in the Linux kernel since version 6.2, but they don't contain the OpenCL Runtime.
    • The OpenCL Runtime has to be installed separately with:
      sudo apt update && sudo apt upgrade -y
      sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev intel-opencl-icd
      sudo usermod -a -G render $(whoami)
      sudo shutdown -r now
    Nvidia GPUs
    • Download and install Nvidia GPU Drivers, which contain the OpenCL Runtime, with:
      sudo apt update && sudo apt upgrade -y
      sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev nvidia-driver-580
      sudo shutdown -r now
    CPUs
    • Option 1: Download and install the oneAPI DPC++ Compiler and oneTBB with:
      export OCLV="oclcpuexp-2025.20.6.0.04_224945_rel"
      export TBBV="oneapi-tbb-2022.2.0"
      sudo apt update && sudo apt upgrade -y
      sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev
      sudo mkdir -p ~/cpurt /opt/intel/${OCLV} /etc/OpenCL/vendors /etc/ld.so.conf.d
      sudo wget -P ~/cpurt https://github.com/intel/llvm/releases/download/2025-WW27/${OCLV}.tar.gz
      sudo wget -P ~/cpurt https://github.com/uxlfoundation/oneTBB/releases/download/v2022.2.0/${TBBV}-lin.tgz
      sudo tar -zxvf ~/cpurt/${OCLV}.tar.gz -C /opt/intel/${OCLV}
      sudo tar -zxvf ~/cpurt/${TBBV}-lin.tgz -C /opt/intel
      echo /opt/intel/${OCLV}/x64/libintelocl.so | sudo tee /etc/OpenCL/vendors/intel_expcpu.icd
      echo /opt/intel/${OCLV}/x64 | sudo tee /etc/ld.so.conf.d/libintelopenclexp.conf
      sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbb.so /opt/intel/${OCLV}/x64
      sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/${OCLV}/x64
      sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbb.so.12 /opt/intel/${OCLV}/x64
      sudo ln -sf /opt/intel/${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/${OCLV}/x64
      sudo ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf
      sudo rm -r ~/cpurt
    • Option 2: Download and install PoCL with:
      sudo apt update && sudo apt upgrade -y
      sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev pocl-opencl-icd
  • Android

    ARM GPUs
    • Download the Termux .apk and install it.
    • In the Termux app, run:
      apt update && apt upgrade -y
      apt install -y clang git make

Download+unzip the source code or git clone https://github.com/ProjectPhysX/OpenCL-Wrapper.git

Compiling on Windows (click to expand section)
Compiling on Linux / macOS / Android (click to expand section)
  • Compile and run with:
    chmod +x make.sh
    ./make.sh
  • Compiling requires g++ with C++17, which is supported since version 8 (check with g++ --version).
  • Operating system (Linux/macOS/Android) is detected automatically.
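  • For reference, make.sh essentially boils down to a single g++ call. The following is only a rough sketch with assumed paths (C++ sources in src/, bundled OpenCL headers and ICD loader under src/OpenCL/); check make.sh itself if your copy differs:
    mkdir -p bin
    g++ src/*.cpp -o bin/OpenCL-Wrapper -std=c++17 -pthread -I./src/OpenCL/include -L./src/OpenCL/lib -lOpenCL
    ./bin/OpenCL-Wrapper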

Key simplifications:

  1. select a Device with 1 line
    • automatically select fastest device / device with most memory / device with specified ID from a list of all devices

    • easily get device information (performance in TFLOPs/s, amount of memory and cache, FP64/FP16 capabilities, etc.)

    • automatic OpenCL C code compilation when creating the Device object

      • automatically enable FP64/FP16 capabilities in OpenCL C code
      • automatically print log to console if there are compile errors
      • easy option to generate PTX assembly for Nvidia GPUs and save that in a .ptx file
    • contains all device-specific workarounds/patches to make OpenCL fully cross-compatible
      • AMD
        • fix for wrong device name reporting on AMD GPUs
        • fix for wrong reporting of dual-CU count as CU count on AMD RDNA+ GPUs
        • fix for maximum buffer allocation size limit for AMD GPUs
      • Intel
        • enable >4GB single buffer VRAM allocations on Intel Arc GPUs
        • fix for wrong VRAM capacity reporting on Intel Arc GPUs
        • fix for maximum buffer allocation size limit in Intel CPU Runtime for OpenCL
        • fix for false dp4a reporting on Intel
      • Nvidia
        • enable basic FP16 support on Nvidia Pascal and newer GPUs with driver 520 or newer
      • ARM
        • disable broken zero-copy on ARM iGPUs
        • fix for terrible fma performance on ARM GPUs
      • other
        • enable FP64, FP16 and INT64 atomics support on supported devices
        • fix for unreliable OpenCL C version reporting
        • always compile for latest supported OpenCL C standard
  2. create a Memory object with 1 line
    • one object for both host and device memory
    • easy host <-> device memory transfer (also for 1D/2D/3D grid domains)
    • easy handling of multi-dimensional vectors
    • can also be used to only allocate memory on host or only allocate memory on device
    • automatically tracks total global memory usage of device when allocating/deleting memory
    • automatically uses zero-copy buffers on CPUs/iGPUs
  3. create a Kernel with 1 line
    • Memory objects and constants are linked to OpenCL C kernel parameters during Kernel creation
    • a list of Memory objects and constants can be added to Kernel parameters in one line (add_parameters(...))
    • Kernel parameters can be edited (set_parameters(...))
    • easy Kernel execution: kernel.run();
    • Kernel function calls can be daisy chained, for example: kernel.set_parameters(3u, time).run(); (see the sketch after this list)
    • failsafe: you'll get an error message if kernel parameters mismatch between C++ and OpenCL code
  4. OpenCL C code is embedded into C++
    • syntax highlighting in the code editor is retained
    • notes / peculiarities of this workaround (illustrated in the sketch after this list):
      • the #define R(...) string(" "#__VA_ARGS__" ") stringification macro converts its arguments to string literals; '\n' is converted to ' ' in the process
      • these string literals cannot be arbitrarily long, so interrupt them periodically with )+R(
      • to use unbalanced round brackets '('/')', exit the R(...) macro and insert a string literal manually: )+"void function("+R( and )+") {"+R(
      • to use preprocessor switch macros, exit the R(...) macro and insert a string literal manually: )+"#define TEST"+R( and )+"#endif"+R( // TEST
      • preprocessor replacement macros (for example #define VARIABLE 42) don't work; hand these to the Device constructor directly instead
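
The following two sketches illustrate points 3 and 4 from the list above; the kernel names and parameter layouts are made up for illustration, only the calls themselves (add_parameters(...), set_parameters(...), run(), the R(...) macro tricks) are the ones named in the list.

Parameter helpers and daisy chaining (point 3), assuming a Device device, a size N and Memory objects A, B, C as in the example further below:

	float time = 0.0f; // hypothetical constant kernel parameter
	Kernel update_kernel(device, N, "update_kernel"); // hypothetical kernel, created here without binding parameters yet
	update_kernel.add_parameters(A, B, C); // link a list of Memory objects to the kernel parameters in one line
	update_kernel.add_parameters(time); // constants can be added the same way
	update_kernel.run(); // execute the kernel on the device
	time += 1.0f;
	update_kernel.set_parameters(3u, time).run(); // edit parameter 3, then run again, daisy-chained

Embedded OpenCL C workarounds (point 4), combining string interruption, unbalanced brackets and a preprocessor switch in one made-up fragment:

	string opencl_c_container() { return R( // begin of OpenCL C code
	)+"#define USE_OFFSET"+R( // preprocessor switch: exit the R(...) macro and insert it as a manual string literal
	)+"kernel void example_kernel("+R( // unbalanced '(': exit the R(...) macro and insert a manual string literal
		global float* A, const float offset // balanced content can stay inside R(...)
	)+") {"+R( // unbalanced ')': manual string literal again
		const uint n = get_global_id(0);
	)+"#ifdef USE_OFFSET"+R(
		A[n] += offset;
	)+"#endif"+R( // USE_OFFSET
	}
	);} // end of OpenCL C code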

No need to:

  • have code overhead for selecting a platform/device, passing the OpenCL C code, etc.
  • keep track of length and data type for buffers
  • have duplicate code for host and device buffers
  • keep track of total global memory usage
  • keep track of global/local range for kernels
  • bother with Queue, Context, Source, Program
  • load a .cl file at runtime
  • bother with device-specific workarounds/patches

Example (OpenCL vector addition)

main.cpp

#include "opencl.hpp"
int main() {
	Device device(select_device_with_most_flops()); // compile OpenCL C code for the fastest available device
	const uint N = 1024u; // size of vectors
	Memory<float> A(device, N); // allocate memory on both host and device
	Memory<float> B(device, N);
	Memory<float> C(device, N);
	Kernel add_kernel(device, N, "add_kernel", A, B, C); // kernel that runs on the device
	for(uint n=0u; n<N; n++) {
		A[n] = 3.0f; // initialize memory
		B[n] = 2.0f;
		C[n] = 1.0f;
	}
	print_info("Value before kernel execution: C[0] = "+to_string(C[0]));
	A.write_to_device(); // copy data from host memory to device memory
	B.write_to_device();
	add_kernel.run(); // run add_kernel on the device
	C.read_from_device(); // copy data from device memory to host memory
	print_info("Value after kernel execution: C[0] = "+to_string(C[0]));
	wait();
	return 0;
}

kernel.cpp

#include "kernel.hpp" // note: unbalanced round brackets () are not allowed and string literals can't be arbitrarily long, so periodically interrupt with )+R(
string opencl_c_container() { return R( // ########################## begin of OpenCL C code ####################################################################
kernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
	const uint n = get_global_id(0);
	C[n] = A[n]+B[n];
}
);} // ############################################################### end of OpenCL C code #####################################################################

For comparison, the very same OpenCL vector addition example looks like this when directly using the OpenCL C++ bindings:

#define CL_HPP_MINIMUM_OPENCL_VERSION 100
#define CL_HPP_TARGET_OPENCL_VERSION 300
#include <CL/opencl.hpp>
#include "utilities.hpp"
#define WORKGROUP_SIZE 64
int main() {
	// 1. select device
	vector<cl::Device> cl_devices; // get all devices of all platforms
	{
		vector<cl::Platform> cl_platforms; // get all platforms (drivers)
		cl::Platform::get(&cl_platforms);
		for(uint i=0u; i<(uint)cl_platforms.size(); i++) {
			vector<cl::Device> cl_devices_available;
			cl_platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &cl_devices_available);
			for(uint j=0u; j<(uint)cl_devices_available.size(); j++) {
				cl_devices.push_back(cl_devices_available[j]);
			}
		}
	}
	cl::Device cl_device; // select fastest available device
	{
		float best_value = 0.0f;
		uint best_i = 0u; // index of fastest device
		for(uint i=0u; i<(uint)cl_devices.size(); i++) { // find device with highest (estimated) floating point performance
			const string name = trim(cl_devices[i].getInfo<CL_DEVICE_NAME>()); // device name
			const string vendor = trim(cl_devices[i].getInfo<CL_DEVICE_VENDOR>()); // device vendor
			const uint compute_units = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
			const uint clock_frequency = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
			const bool is_gpu = cl_devices[i].getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
			const int vendor_id = (int)cl_devices[i].getInfo<CL_DEVICE_VENDOR_ID>(); // AMD=0x1002, Intel=0x8086, Nvidia=0x10DE, Apple=0x1027F00
			float cores_per_cu = 1.0f;
			if(vendor_id==0x1002) { // AMD GPU/CPU
				const bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
				const bool amd_256_cores_per_dualcu = contains(to_lower(name), "gfx11"); // identify RDNA3 GPUs where dual CUs are reported
				cores_per_cu = is_gpu ? (amd_256_cores_per_dualcu ? 256.0f : amd_128_cores_per_dualcu ? 128.0f : 64.0f) : 0.5f; // 64 cores/CU (GCN, CDNA), 128 cores/dualCU (RDNA, RDNA2), 256 cores/dualCU (RDNA3), 1/2 core/CU (CPUs)
			} else if(vendor_id==0x8086) { // Intel GPU/CPU
				const bool intel_16_cores_per_cu = contains_any(to_lower(name), {"gpu max", "140v", "130v", "b580", "b570"}); // identify PVC/Xe2 GPUs
				cores_per_cu = is_gpu ? (intel_16_cores_per_cu ? 16.0f : 8.0f) : 0.5f; // Intel GPUs have 16 cores/CU (PVC) or 8 cores/CU (integrated/Arc), Intel CPUs (with HT) have 1/2 core/CU
			} else if(vendor_id==0x10DE||vendor_id==0x13B5) { // Nvidia GPU/CPU
				const uint nvidia_compute_capability = 10u*(uint)cl_devices[i].getInfo<CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV>()+(uint)cl_devices[i].getInfo<CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV>();
				const bool nvidia__32_cores_per_cu = (nvidia_compute_capability <30); // identify Fermi GPUs
				const bool nvidia_192_cores_per_cu = (nvidia_compute_capability>=30&&nvidia_compute_capability<50); // identify Kepler GPUs
				const bool nvidia__64_cores_per_cu = (nvidia_compute_capability>=70&&nvidia_compute_capability<80)||contains_any(to_lower(name), {"p100", "a100", "a30"}); // identify Volta, Turing, P100, A100, A30
				cores_per_cu = is_gpu ? (nvidia__32_cores_per_cu ? 32.0f : nvidia_192_cores_per_cu ? 192.0f : nvidia__64_cores_per_cu ? 64.0f : 128.0f) : 1.0f; // 32 (Fermi), 192 (Kepler), 64 (Volta, Turing, P100, A100, A30), 128 (Maxwell, Pascal, Ampere, Hopper, Ada, Blackwell) or 1 (CPUs)
			} else if(vendor_id==0x1027F00) { // Apple iGPU
				cores_per_cu = 128.0f; // Apple ARM GPUs usually have 128 cores/CU
			} else if(vendor_id==0x1022||vendor_id==0x10006||vendor_id==0x6C636F70) { // x86 CPUs with PoCL runtime
				cores_per_cu = 0.5f; // CPUs typically have 1/2 cores/CU due to SMT/hyperthreading
			} else if(contains(to_lower(vendor), "arm")) { // ARM
				cores_per_cu = is_gpu ? 8.0f : 1.0f; // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU
			}
			const uint ipc = is_gpu ? 2u : 32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
			const uint cores = to_uint((float)compute_units*cores_per_cu); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
			const float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency; // estimated device floating point performance in TeraFLOPs/s
			if(tflops>best_value) {
				best_value = tflops;
				best_i = i;
			}
		}
		const string name = trim(cl_devices[best_i].getInfo<CL_DEVICE_NAME>()); // device name
		cl_device = cl_devices[best_i];
		print_info(name); // print device name
	}
	// 2. embed OpenCL C code (raw string literal breaks syntax highlighting)
	string opencl_c_code = R"(
		kernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
			const uint n = get_global_id(0);
			C[n] = A[n]+B[n];
		}
	)";
	// 3. compile OpenCL C code
	cl::Context cl_context;
	cl::Program cl_program;
	cl::CommandQueue cl_queue;
	{
		cl_context = cl::Context(cl_device);
		cl_queue = cl::CommandQueue(cl_context, cl_device); // queue to push commands for the device
		cl::Program::Sources cl_source;
		cl_source.push_back({ opencl_c_code.c_str(), opencl_c_code.length() });
		cl_program = cl::Program(cl_context, cl_source);
		int error = cl_program.build({ cl_device }, "-cl-finite-math-only -cl-no-signed-zeros -cl-mad-enable -w"); // compile OpenCL C code, disable warnings
		if(error) print_warning(cl_program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(cl_device)); // print build log
		if(error) print_error("OpenCL C code compilation failed.");
		else print_info("OpenCL C code successfully compiled.");
	}
	// 4. allocate memory on host and device
	const uint N = 1024u;
	float* host_A;
	float* host_B;
	float* host_C;
	cl::Buffer device_A;
	cl::Buffer device_B;
	cl::Buffer device_C;
	{
		host_A = new float[N];
		host_B = new float[N];
		host_C = new float[N];
		for(uint i=0u; i<N; i++) {
			host_A[i] = 0.0f; // zero all buffers
			host_B[i] = 0.0f;
			host_C[i] = 0.0f;
		}
		int error = 0;
		device_A = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float), nullptr, &error);
		if(error) print_error("OpenCL Buffer allocation failed with error code "+to_string(error)+".");
		device_B = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float), nullptr, &error);
		if(error) print_error("OpenCL Buffer allocation failed with error code "+to_string(error)+".");
		device_C = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float), nullptr, &error);
		if(error) print_error("OpenCL Buffer allocation failed with error code "+to_string(error)+".");
		cl_queue.enqueueWriteBuffer(device_A, true, 0u, N*sizeof(float), (void*)host_A); // have to keep track of buffer range and buffer data type
		cl_queue.enqueueWriteBuffer(device_B, true, 0u, N*sizeof(float), (void*)host_B);
		cl_queue.enqueueWriteBuffer(device_C, true, 0u, N*sizeof(float), (void*)host_C);
	}
	// 5. create Kernel object and link input parameters
	cl::NDRange cl_range_global, cl_range_local;
	cl::Kernel cl_kernel;
	{
		cl_kernel = cl::Kernel(cl_program, "add_kernel");
		cl_kernel.setArg(0, device_A);
		cl_kernel.setArg(1, device_B);
		cl_kernel.setArg(2, device_C);
		cl_range_local = cl::NDRange(WORKGROUP_SIZE);
		cl_range_global = cl::NDRange(((N+WORKGROUP_SIZE-1)/WORKGROUP_SIZE)*WORKGROUP_SIZE); // make global range a multiple of local range
	}
	// 6. finally run the actual program
	{
		for(uint i=0u; i<N; i++) {
			host_A[i] = 3.0f; // initialize buffers on host
			host_B[i] = 2.0f;
			host_C[i] = 1.0f;
		}
		print_info("Value before kernel execution: C[0] = "+to_string(host_C[0]));
		cl_queue.enqueueWriteBuffer(device_A, true, 0u, N*sizeof(float), (void*)host_A); // copy A and B to device
		cl_queue.enqueueWriteBuffer(device_B, true, 0u, N*sizeof(float), (void*)host_B); // have to keep track of buffer range and buffer data type
		cl_queue.enqueueNDRangeKernel(cl_kernel, cl::NullRange, cl_range_global, cl_range_local); // have to keep track of kernel ranges
		cl_queue.finish(); // don't forget to finish the queue
		cl_queue.enqueueReadBuffer(device_C, true, 0u, N*sizeof(float), (void*)host_C);
		print_info("Value after kernel execution: C[0] = "+to_string(host_C[0]));
	}
	wait();
	return 0;
}
