OpenCL
Open Computing Language (OpenCL) is a popular API for running compute tasks on a variety of devices (other CPUs, GPUs, FPGAs, etc.) in a standardized way. OpenCL provides heterogeneous parallelization, where one host process runs tasks on other devices (or even in threads of the same process, depending on the platform).
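As a rough illustration of this model, the following minimal host program (a sketch only; the file name, kernel, and problem size are made up for illustration, and error handling is omitted) compiles a small kernel at runtime and runs it on whichever device the loader finds first. How to load the loader module and compile such code on the clusters is described in the sections below.

/*
 * vecadd.c -- minimal sketch of the OpenCL host/device model: the host
 * process picks a device, sends it a kernel plus data, and reads the
 * result back. Names and sizes are illustrative only.
 */
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

/* OpenCL C kernel source, compiled at runtime for whichever device is used. */
static const char *source =
    "__kernel void vecadd(__global const float *a,\n"
    "                     __global const float *b,\n"
    "                     __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Take the first platform and first device the ICD loader finds. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, NULL, &err);

    /* Build the kernel for this device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vecadd", &err);

    /* Copy the inputs to the device, run the kernel, and read back the result. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(a), a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(b), b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dc);
    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[10] = %g (expected 30)\n", c[10]);

    /* Release everything. */
    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}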
Loader
Unlike other parallelization APIs, OpenCL in principle lets one load and use devices from multiple platforms simultaneously. It does this using an ICD (Installable Client Driver) loader, which looks for ICD files provided by the platforms that tell the loader how to load each platform.
Due to limitations in how the NHR Modules software stack is built, only one platform can be used at a time.
Rather than relying on a loader from the OS, the ocl-icd loader is provided in the software stack itself as a module. It is needed to use any of the platforms, compile OpenCL code, check available devices, etc.
To load a specific version, run
module load ocl-icd/VERSION
and for the default version, run
module load ocl-icd
Compiling OpenCL Code
While it is possible to build OpenCL code against a single platform, it is generally best to compile it against the ocl-icd loader so that it is easy to change the particular platform used at runtime.
Simply load the module for it as in the above section and the OpenCL headers and library become available.
The path to the headers directory is automatically added to the INCLUDE, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH, and CPATH environment variables used by some C/C++ compilers. If you need to pass the path manually for some reason, the headers are in the $OPENCL_C_HEADERS_MODULE_INSTALL_PREFIX/include directory.
The path to the directory containing the libOpenCL.so library is automatically added to the LD_RUN_PATH environment variable (you might have to add the -Wl,-rpath -Wl,$LD_RUN_PATH argument to your C/C++ compiler if it can't find the library).
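To check that the headers and library are picked up as described above, a small program along the following lines (a sketch; the file name and the gcc invocation in the comment are just examples) can be compiled and run to print the platforms and devices the ocl-icd loader can see:

/*
 * list_platforms.c -- minimal check that the OpenCL headers and library
 * from the ocl-icd module are usable (a sketch; names are examples only).
 *
 * Compile roughly like this after `module load ocl-icd`:
 *   gcc list_platforms.c -o list_platforms -lOpenCL -Wl,-rpath -Wl,$LD_RUN_PATH
 */
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    /* Ask the ICD loader how many platforms it can find. */
    cl_uint num_platforms = 0;
    if (clGetPlatformIDs(0, NULL, &num_platforms) != CL_SUCCESS || num_platforms == 0) {
        fprintf(stderr, "No OpenCL platforms found.\n");
        return 1;
    }
    cl_platform_id platforms[16];
    if (num_platforms > 16)
        num_platforms = 16;
    clGetPlatformIDs(num_platforms, platforms, NULL);

    for (cl_uint i = 0; i < num_platforms; i++) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);

        /* List every device of every type on this platform. */
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices) != CL_SUCCESS)
            continue;
        cl_device_id devices[16];
        if (num_devices > 16)
            num_devices = 16;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);
        for (cl_uint j = 0; j < num_devices; j++) {
            char dev_name[256];
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(dev_name), dev_name, NULL);
            printf("  Device %u: %s\n", j, dev_name);
        }
    }
    return 0;
}

Its output should roughly match what clinfo reports (see the sections below).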
Platforms
The available OpenCL platforms are listed in the table below.
Simply load the module for a platform to use it. To use the Nvidia driver platform, have no other platform modules loaded (you don't even have to load the cuda module).
Note that PoCL is provided in two variants, a CPU-only variant and a CPU + Nvidia GPU variant.
| Platform | Devices | NHR Modules name |
|---|---|---|
| Nvidia driver | Nvidia GPU | no other platform loaded |
| PoCL | CPU | pocl |
| PoCL | CPU, Nvidia GPU | pocl/VERSION_cuda-CUDAMAJORVERSION |
Checking OpenCL Devices
You can use clinfo to walk through the OpenCL platforms that ocl-icd can find and their devices, printing information about each device it finds.
To load a specific version, run
module load clinfo/VERSION
and for the default version, run
module load clinfo
Then you can simply run it as
clinfo
Quick Benchmark to Check Performance
One of the major choices in OpenCL codes is what vector size to use for each type (integer, float, double, half, etc.), which can vary from platform to platform and from device to device. You can get what the vendor/platform thinks is the best size from clinfo (the “preferred size”), but in many cases one must resort to empirical testing. While it is best to test with the actual code to be used, you can get a crude guess using clpeak, which runs quick benchmarks for the different vector sizes. It also benchmarks other important things like transfer bandwidth, kernel latency, etc.
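If you want to read these preferred (and native) vector widths programmatically instead of from clinfo, a sketch along the following lines (file name made up; error handling omitted) queries them for the first device of the first platform the loader finds:

/*
 * preferred_widths.c -- query the preferred and native vector widths the
 * platform itself reports (the same numbers clinfo prints). A sketch only.
 */
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_uint pref_float, native_float, pref_double, pref_half;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(pref_float), &pref_float, NULL);
    clGetDeviceInfo(device, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT,
                    sizeof(native_float), &native_float, NULL);
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                    sizeof(pref_double), &pref_double, NULL);
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF,
                    sizeof(pref_half), &pref_half, NULL);

    printf("preferred float width  : %u\n", pref_float);
    printf("native float width     : %u\n", native_float);
    printf("preferred double width : %u\n", pref_double);
    printf("preferred half width   : %u\n", pref_half);
    return 0;
}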
To load a specific version of clpeak, run
module load clpeak/VERSION
and for the default version, run
module load clpeak
And then to benchmark all platforms and devices it can find, run it as
clpeak
or for a specific platform and device
clpeak -p PLATFORM -d DEVICE
where you have gotten the PLATFORM and DEVICE numbers from clinfo.
Example jobs to benchmark the platforms, and their results, are given below.
A job to benchmark PoCL on the CPUs of an Emmy Phase 2 node is
#!/usr/bin/env bash
#SBATCH --job-name=clpeak-emmyp2
#SBATCH -p standard96:el8
#SBATCH -t 00:15:00
#SBATCH -N 1
#SBATCH -n 1
module load clpeak
module load pocl
clpeak
and the result is
Platform: Portable Computing Language
Device: cpu-cascadelake-Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Driver version : 5.0 (Linux x64)
Compute units : 192
Clock frequency : 3800 MHz
Global memory bandwidth (GBPS)
float : 5.06
float2 : 10.69
float4 : 18.82
float8 : 28.35
float16 : 30.73
Single-precision compute (GFLOPS)
float : 106.05
float2 : 215.18
float4 : 430.83
float8 : 854.67
float16 : 1660.30
Half-precision compute (GFLOPS)
half : 27.61
half2 : 48.37
half4 : 93.31
half8 : 185.87
half16 : 344.81
Double-precision compute (GFLOPS)
double : 109.07
double2 : 208.74
double4 : 429.89
double8 : 822.74
double16 : 1438.83
Integer compute (GIOPS)
int : 210.39
int2 : 165.35
int4 : 327.13
int8 : 629.76
int16 : 1107.31
Integer compute Fast 24bit (GIOPS)
int : 216.43
int2 : 165.50
int4 : 320.48
int8 : 634.08
int16 : 1101.22
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 4.80
enqueueReadBuffer : 4.64
enqueueWriteBuffer non-blocking : 4.77
enqueueReadBuffer non-blocking : 5.12
enqueueMapBuffer(for read) : 4234.00
memcpy from mapped ptr : 4.57
enqueueUnmap(after write) : 9485.35
memcpy to mapped ptr : 5.19
Kernel launch latency : 797.62 us
from which we could conclude that a vector size of 16 is a good initial guess for the optimum vector size. Also notice how poor the half-precision performance is in comparison to single and double precision. For half-precision computation, it is generally better to use the GPUs (like Grete) or CPUs with better built-in half-precision support (like Emmy Phase 3).
A job to benchmark PoCL on the CPUs of an Emmy Phase 3 node is
#!/usr/bin/env bash
#SBATCH --job-name=clpeak-emmyp3
#SBATCH -p medium96s
#SBATCH -t 00:15:00
#SBATCH -N 1
#SBATCH -n 1
module load clpeak
module load pocl
clpeak
and the result is
Platform: Portable Computing Language
Device: cpu-sapphirerapids-Intel(R) Xeon(R) Platinum 8468
Driver version : 5.0 (Linux x64)
Compute units : 192
Clock frequency : 3800 MHz
Global memory bandwidth (GBPS)
float : 72.84
float2 : 85.54
float4 : 88.44
float8 : 97.76
float16 : 105.01
Single-precision compute (GFLOPS)
float : 126.41
float2 : 254.00
float4 : 512.32
float8 : 1028.37
float16 : 1839.51
Half-precision compute (GFLOPS)
half : 106.61
half2 : 217.59
half4 : 449.15
half8 : 878.24
half16 : 1842.21
Double-precision compute (GFLOPS)
double : 125.22
double2 : 249.81
double4 : 502.12
double8 : 881.17
double16 : 1515.57
Integer compute (GIOPS)
int : 247.44
int2 : 167.32
int4 : 333.63
int8 : 687.28
int16 : 1229.16
Integer compute Fast 24bit (GIOPS)
int : 250.40
int2 : 167.86
int4 : 340.70
int8 : 679.93
int16 : 1228.72
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 9.89
enqueueReadBuffer : 9.83
enqueueWriteBuffer non-blocking : 8.45
enqueueReadBuffer non-blocking : 8.76
enqueueMapBuffer(for read) : 973.92
memcpy from mapped ptr : 8.85
enqueueUnmap(after write) : 2446.44
memcpy to mapped ptr : 9.91
Kernel launch latency : 822.36 us
from which we could conclude that a vector size of 16 is a good initial guess for the optimum vector size.
A job to benchmark the Nvidia driver platform on the Nvidia GPUs, as well as PoCL on both the CPUs and the Nvidia GPUs, of a Grete node is
#!/usr/bin/env bash
#SBATCH --job-name=clpeak-grete
#SBATCH -p grete
#SBATCH -t 00:15:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -G A100:4
# Force the software stack to be nhr-lmod even before it becomes the default.
export PREFERRED_SOFTWARE_STACK=nhr-lmod
. /etc/profile
module load clpeak
# Just Nvidia driver
echo '################################################################################'
echo '#'
echo '# Platform: Nvidia driver'
echo '#'
echo '################################################################################'
echo ''
echo ''
clpeak -p 0 -d 0
# PoCL
module load pocl/5.0_cuda-11
echo ''
echo ''
echo ''
echo ''
echo ''
echo '################################################################################'
echo '#'
echo '# Platform: PoCL - CUDA'
echo '#'
echo '################################################################################'
echo ''
echo ''
POCL_DEVICES=cuda clpeak -p 0 -d 0
echo ''
echo ''
echo ''
echo ''
echo ''
echo '################################################################################'
echo '#'
echo '# Platform: PoCL - CPU'
echo '#'
echo '################################################################################'
echo ''
echo ''
POCL_DEVICES=cpu clpeak
and the result is
################################################################################
#
# Platform: Nvidia driver
#
################################################################################
Platform: NVIDIA CUDA
Device: NVIDIA A100-SXM4-40GB
Driver version : 535.104.12 (Linux x64)
Compute units : 108
Clock frequency : 1410 MHz
Global memory bandwidth (GBPS)
float : 1305.59
float2 : 1377.32
float4 : 1419.30
float8 : 1443.96
float16 : 1464.56
Single-precision compute (GFLOPS)
float : 19352.94
float2 : 19386.81
float4 : 19351.17
float8 : 19274.51
float16 : 19104.15
No half precision support! Skipped
Double-precision compute (GFLOPS)
double : 9721.90
double2 : 9706.06
double4 : 9681.32
double8 : 9615.65
double16 : 9533.18
Integer compute (GIOPS)
int : 19276.27
int2 : 19318.02
int4 : 19260.69
int8 : 19341.64
int16 : 19333.69
Integer compute Fast 24bit (GIOPS)
int : 19302.15
int2 : 19297.12
int4 : 19294.55
int8 : 19217.17
int16 : 19033.39
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 14.48
enqueueReadBuffer : 12.99
enqueueWriteBuffer non-blocking : 14.16
enqueueReadBuffer non-blocking : 12.76
enqueueMapBuffer(for read) : 20.25
memcpy from mapped ptr : 19.63
enqueueUnmap(after write) : 26.76
memcpy to mapped ptr : 20.63
Kernel launch latency : 9.07 us
################################################################################
#
# Platform: PoCL - CUDA
#
################################################################################
Platform: Portable Computing Language
Device: NVIDIA A100-SXM4-40GB
Driver version : 5.0 (Linux x64)
Compute units : 108
Clock frequency : 1410 MHz
Global memory bandwidth (GBPS)
float : 1301.55
float2 : 1368.14
float4 : 1405.72
float8 : 1438.37
float16 : 1459.01
Single-precision compute (GFLOPS)
float : 19369.37
float2 : 19358.33
float4 : 19357.20
float8 : 19278.51
float16 : 19135.89
Half-precision compute (GFLOPS)
half : 19368.83
half2 : 73221.23
half4 : 66732.34
half8 : 60351.88
half16 : 62031.69
Double-precision compute (GFLOPS)
double : 9700.11
double2 : 9687.95
double4 : 9675.77
double8 : 9644.12
double16 : 9565.58
Integer compute (GIOPS)
int : 12937.51
int2 : 12943.77
int4 : 13225.68
int8 : 12975.10
int16 : 13058.68
Integer compute Fast 24bit (GIOPS)
int : 12937.18
int2 : 12943.10
int4 : 13225.43
int8 : 12975.01
int16 : 13032.36
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 20.21
enqueueReadBuffer : 20.04
enqueueWriteBuffer non-blocking : 20.22
enqueueReadBuffer non-blocking : 20.01
enqueueMapBuffer(for read) : 142689.94
memcpy from mapped ptr : 20.62
enqueueUnmap(after write) : 16.03
memcpy to mapped ptr : 20.65
Kernel launch latency : -83.58 us
################################################################################
#
# Platform: PoCL - CPU
#
################################################################################
Platform: Portable Computing Language
Device: cpu-znver3-AMD EPYC 7513 32-Core Processor
Driver version : 5.0 (Linux x64)
Compute units : 128
Clock frequency : 3681 MHz
Global memory bandwidth (GBPS)
float : 20.33
float2 : 43.19
float4 : 45.07
float8 : 57.86
float16 : 43.87
Single-precision compute (GFLOPS)
float : 138.16
float2 : 280.69
float4 : 584.32
float8 : 1093.85
float16 : 1866.94
Half-precision compute (GFLOPS)
half : 33.58
half2 : 57.93
half4 : 118.63
half8 : 241.08
half16 : 418.97
Double-precision compute (GFLOPS)
double : 143.13
double2 : 275.95
double4 : 484.71
double8 : 951.55
double16 : 1663.51
Integer compute (GIOPS)
int : 213.32
int2 : 455.37
int4 : 942.36
int8 : 1711.27
int16 : 2909.55
Integer compute Fast 24bit (GIOPS)
int : 194.02
int2 : 520.80
int4 : 851.99
int8 : 1827.18
int16 : 2845.90
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 12.72
enqueueReadBuffer : 14.27
enqueueWriteBuffer non-blocking : 11.82
enqueueReadBuffer non-blocking : 10.53
enqueueMapBuffer(for read) : 3315.04
memcpy from mapped ptr : 11.59
enqueueUnmap(after write) : 3206.16
memcpy to mapped ptr : 19.58
Kernel launch latency : 146.01 us
from which we could conclude that the GPUs (regardless of platform) vastly outperform the CPUs in this benchmark, that the vector size makes little difference on the Nvidia GPUs except for half precision (where a size of 2 is a good initial guess), and that a vector size of 16 is a good initial guess for the optimum vector size on the CPUs. Notice in particular how half precision on the GPUs is considerably faster than single and double precision, but slower on the CPUs. For half-precision computation, it is generally better to use the GPUs or CPUs with better built-in half-precision support (like Emmy Phase 3).