OpenCL

Open Computing Language (OpenCL) is a popular API for running compute tasks on a variety of devices (CPUs, GPUs, FPGAs, etc.) in a standardized way. OpenCL provides heterogeneous parallelization, where one process dispatches tasks to other devices (or even to threads in the same process, depending on the platform).

Loader

Unlike other parallelization APIs, OpenCL in principle lets one load and use devices from multiple platforms simultaneously. It does this using an ICD (Installable Client Driver) loader, which looks for ICD files provided by the platforms that tell the loader how to load each platform.
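As a rough sketch of how this works (the path and file name below are only illustrative, not necessarily the ones used in the NHR software stack), each ICD file is a small text file containing the name of, or path to, the platform's OpenCL library, which the loader then opens at runtime:

cat /etc/OpenCL/vendors/pocl.icd

which would print something like

libpocl.so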

Warning

Due to limitations in how the NHR Modules software stack is built, only one platform can be used at a time.

Rather than relying on a loader from the OS, the ocl-icd loader is provided in the software stack itself as a module. It is needed to use any of the platforms, compile OpenCL code, check available devices, etc. It is loaded by

load OpenCL Loader:

For a specific version, run

module load ocl-icd/VERSION

and for the default version, run

module load ocl-icd

Compiling OpenCL Code

While it is possible to build OpenCL code against a single platform, it is generally best to compile it against the ocl-icd loader so that the particular platform used at runtime is easy to change. Simply load the module as in the section above and the OpenCL headers and library become available. The path to the headers directory is automatically added to the INCLUDE, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH, and CPATH environment variables used by some C/C++ compilers. If you need to pass the path manually for some reason, the headers are in the $OPENCL_C_HEADERS_MODULE_INSTALL_PREFIX/include directory. The path to the directory containing the libOpenCL.so library is automatically added to the LD_RUN_PATH environment variable (you might have to add the -Wl,-rpath -Wl,$LD_RUN_PATH argument to your C/C++ compiler if it can’t find it).
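For example, a minimal sketch of compiling a single-file OpenCL host program (the file name my_program.c is hypothetical) against the loader would be

module load ocl-icd
gcc -o my_program my_program.c -lOpenCL -Wl,-rpath -Wl,"$LD_RUN_PATH"

where the -Wl,-rpath -Wl,$LD_RUN_PATH arguments can be dropped if the compiler and dynamic linker already find libOpenCL.so on their own.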

Platforms

The available OpenCL platforms are listed in the table below. Simply load the module for a platform to use it; to use the Nvidia driver platform, have no other platform modules loaded (you don’t even have to load the cuda module). Note that PoCL is provided in two variants, a CPU-only variant and a CPU + Nvidia GPU variant.

Platform        Devices            NHR Modules name
Nvidia driver   Nvidia GPU         no other platform loaded
PoCL            CPU                pocl
PoCL            CPU, Nvidia GPU    pocl/VERSION_cuda-CUDAMAJORVERSION
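For example, to use the CPU-only PoCL platform with the default module versions, one could run

module load ocl-icd
module load pocl

after which OpenCL programs started in that session will see the PoCL platform and its CPU device.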

Checking OpenCL Devices

You can use clinfo to walk through the OpenCL platforms that ocl-icd can find and through each platform’s devices, printing information about every device it finds.

It is loaded by

load clinfo:

For a specific version, run

module load clinfo/VERSION

and for the default version, run

module load clinfo

And then you can just run it as

clinfo
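If you only need a compact overview, clinfo can also print just the platforms and their devices (handy for picking the PLATFORM and DEVICE numbers used by clpeak in the next section), assuming your version supports the list option:

clinfo -l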

Quick Benchmark to Check Performance

One of the major choices in OpenCL codes is what vector size to use for each type (integer, float, double, half, etc.), which can vary from platform to platform and from device to device. You can get what the vendor/platform thinks is the best size from clinfo (the “preferred size”), but in many cases one must resort to empirical testing. While it is best to test with the actual code to be used, you can get a crude guess using clpeak, which runs quick benchmarks over the different vector sizes. It also benchmarks other important things like transfer bandwidth, kernel latency, etc.

It is loaded by

load clpeak:

For a specific version, run

module load clpeak/VERSION

and for the default version, run

module load clpeak

And then to benchmark all platforms and devices it can find, run it as

clpeak

or for a specific platform and device

clpeak -p PLATFORM -d DEVICE

where you have gotten the PLATFORM and DEVICE numbers from clinfo.

Example jobs to benchmark the platforms and their results are

opencl benchmarks:

A job to benchmark PoCL on the CPUs of an Emmy Phase 2 node is

#!/usr/bin/env bash

#SBATCH --job-name=clpeak-emmyp2
#SBATCH -p standard96:el8
#SBATCH -t 00:15:00
#SBATCH -N 1
#SBATCH -n 1

module load clpeak
module load pocl

clpeak

and the result is

Platform: Portable Computing Language
  Device: cpu-cascadelake-Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
    Driver version  : 5.0 (Linux x64)
    Compute units   : 192
    Clock frequency : 3800 MHz

    Global memory bandwidth (GBPS)
      float   : 5.06
      float2  : 10.69
      float4  : 18.82
      float8  : 28.35
      float16 : 30.73

    Single-precision compute (GFLOPS)
      float   : 106.05
      float2  : 215.18
      float4  : 430.83
      float8  : 854.67
      float16 : 1660.30

    Half-precision compute (GFLOPS)
      half   : 27.61
      half2  : 48.37
      half4  : 93.31
      half8  : 185.87
      half16 : 344.81

    Double-precision compute (GFLOPS)
      double   : 109.07
      double2  : 208.74
      double4  : 429.89
      double8  : 822.74
      double16 : 1438.83

    Integer compute (GIOPS)
      int   : 210.39
      int2  : 165.35
      int4  : 327.13
      int8  : 629.76
      int16 : 1107.31

    Integer compute Fast 24bit (GIOPS)
      int   : 216.43
      int2  : 165.50
      int4  : 320.48
      int8  : 634.08
      int16 : 1101.22

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 4.80
      enqueueReadBuffer               : 4.64
      enqueueWriteBuffer non-blocking : 4.77
      enqueueReadBuffer non-blocking  : 5.12
      enqueueMapBuffer(for read)      : 4234.00
        memcpy from mapped ptr        : 4.57
      enqueueUnmap(after write)       : 9485.35
        memcpy to mapped ptr          : 5.19

    Kernel launch latency : 797.62 us

from which we could conclude that a vector size of 16 is a good initial guess for the optimum vector size. Also notice how poor half-precision performance is in comparison to single and double precision. Half-precision computation is generally better done on the GPUs (like Grete) or on CPUs with better built-in half-precision support (like Emmy Phase 3).

A job to benchmark PoCL on the CPUs of an Emmy Phase 3 node is

#!/usr/bin/env bash

#SBATCH --job-name=clpeak-emmyp3
#SBATCH -p medium96s
#SBATCH -t 00:15:00
#SBATCH -N 1
#SBATCH -n 1

module load clpeak
module load pocl

clpeak

and the result is

Platform: Portable Computing Language
  Device: cpu-sapphirerapids-Intel(R) Xeon(R) Platinum 8468
    Driver version  : 5.0 (Linux x64)
    Compute units   : 192
    Clock frequency : 3800 MHz

    Global memory bandwidth (GBPS)
      float   : 72.84
      float2  : 85.54
      float4  : 88.44
      float8  : 97.76
      float16 : 105.01

    Single-precision compute (GFLOPS)
      float   : 126.41
      float2  : 254.00
      float4  : 512.32
      float8  : 1028.37
      float16 : 1839.51

    Half-precision compute (GFLOPS)
      half   : 106.61
      half2  : 217.59
      half4  : 449.15
      half8  : 878.24
      half16 : 1842.21

    Double-precision compute (GFLOPS)
      double   : 125.22
      double2  : 249.81
      double4  : 502.12
      double8  : 881.17
      double16 : 1515.57

    Integer compute (GIOPS)
      int   : 247.44
      int2  : 167.32
      int4  : 333.63
      int8  : 687.28
      int16 : 1229.16

    Integer compute Fast 24bit (GIOPS)
      int   : 250.40
      int2  : 167.86
      int4  : 340.70
      int8  : 679.93
      int16 : 1228.72

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 9.89
      enqueueReadBuffer               : 9.83
      enqueueWriteBuffer non-blocking : 8.45
      enqueueReadBuffer non-blocking  : 8.76
      enqueueMapBuffer(for read)      : 973.92
        memcpy from mapped ptr        : 8.85
      enqueueUnmap(after write)       : 2446.44
        memcpy to mapped ptr          : 9.91

    Kernel launch latency : 822.36 us

from which we could conclude that a vector size of 16 is a good initial guess for the optimum vector size.

A job to benchmark the Nvidia driver on the Nvidia GPUs, and PoCL on the CPUs and Nvidia GPUs, of a Grete node is

#!/usr/bin/env bash

#SBATCH --job-name=clpeak-grete
#SBATCH -p grete
#SBATCH -t 00:15:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -G A100:4

# Force the software stack to be nhr-lmod even before it becomes the default.
export PREFERRED_SOFTWARE_STACK=nhr-lmod
. /etc/profile

module load clpeak

# Just Nvidia driver

echo '################################################################################'
echo '#'
echo '# Platform: Nvidia driver'
echo '#'
echo '################################################################################'
echo ''
echo ''

clpeak -p 0 -d 0

# PoCL

module load pocl/5.0_cuda-11

echo ''
echo ''
echo ''
echo ''
echo ''
echo '################################################################################'
echo '#'
echo '# Platform: PoCL - CUDA'
echo '#'
echo '################################################################################'
echo ''
echo ''

POCL_DEVICES=cuda clpeak -p 0 -d 0

echo ''
echo ''
echo ''
echo ''
echo ''
echo '################################################################################'
echo '#'
echo '# Platform: PoCL - CPU'
echo '#'
echo '################################################################################'
echo ''
echo ''

POCL_DEVICES=cpu clpeak

and the result is

################################################################################
#
# Platform: Nvidia driver
#
################################################################################



Platform: NVIDIA CUDA
  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 535.104.12 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1305.59
      float2  : 1377.32
      float4  : 1419.30
      float8  : 1443.96
      float16 : 1464.56

    Single-precision compute (GFLOPS)
      float   : 19352.94
      float2  : 19386.81
      float4  : 19351.17
      float8  : 19274.51
      float16 : 19104.15

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9721.90
      double2  : 9706.06
      double4  : 9681.32
      double8  : 9615.65
      double16 : 9533.18

    Integer compute (GIOPS)
      int   : 19276.27
      int2  : 19318.02
      int4  : 19260.69
      int8  : 19341.64
      int16 : 19333.69

    Integer compute Fast 24bit (GIOPS)
      int   : 19302.15
      int2  : 19297.12
      int4  : 19294.55
      int8  : 19217.17
      int16 : 19033.39

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.48
      enqueueReadBuffer               : 12.99
      enqueueWriteBuffer non-blocking : 14.16
      enqueueReadBuffer non-blocking  : 12.76
      enqueueMapBuffer(for read)      : 20.25
        memcpy from mapped ptr        : 19.63
      enqueueUnmap(after write)       : 26.76
        memcpy to mapped ptr          : 20.63

    Kernel launch latency : 9.07 us






################################################################################
#
# Platform: PoCL - CUDA
#
################################################################################



Platform: Portable Computing Language
  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 5.0 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1301.55
      float2  : 1368.14
      float4  : 1405.72
      float8  : 1438.37
      float16 : 1459.01

    Single-precision compute (GFLOPS)
      float   : 19369.37
      float2  : 19358.33
      float4  : 19357.20
      float8  : 19278.51
      float16 : 19135.89

    Half-precision compute (GFLOPS)
      half   : 19368.83
      half2  : 73221.23
      half4  : 66732.34
      half8  : 60351.88
      half16 : 62031.69

    Double-precision compute (GFLOPS)
      double   : 9700.11
      double2  : 9687.95
      double4  : 9675.77
      double8  : 9644.12
      double16 : 9565.58

    Integer compute (GIOPS)
      int   : 12937.51
      int2  : 12943.77
      int4  : 13225.68
      int8  : 12975.10
      int16 : 13058.68

    Integer compute Fast 24bit (GIOPS)
      int   : 12937.18
      int2  : 12943.10
      int4  : 13225.43
      int8  : 12975.01
      int16 : 13032.36

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 20.21
      enqueueReadBuffer               : 20.04
      enqueueWriteBuffer non-blocking : 20.22
      enqueueReadBuffer non-blocking  : 20.01
      enqueueMapBuffer(for read)      : 142689.94
        memcpy from mapped ptr        : 20.62
      enqueueUnmap(after write)       : 16.03
        memcpy to mapped ptr          : 20.65

    Kernel launch latency : -83.58 us






################################################################################
#
# Platform: PoCL - CPU
#
################################################################################



Platform: Portable Computing Language
  Device: cpu-znver3-AMD EPYC 7513 32-Core Processor
    Driver version  : 5.0 (Linux x64)
    Compute units   : 128
    Clock frequency : 3681 MHz

    Global memory bandwidth (GBPS)
      float   : 20.33
      float2  : 43.19
      float4  : 45.07
      float8  : 57.86
      float16 : 43.87

    Single-precision compute (GFLOPS)
      float   : 138.16
      float2  : 280.69
      float4  : 584.32
      float8  : 1093.85
      float16 : 1866.94

    Half-precision compute (GFLOPS)
      half   : 33.58
      half2  : 57.93
      half4  : 118.63
      half8  : 241.08
      half16 : 418.97

    Double-precision compute (GFLOPS)
      double   : 143.13
      double2  : 275.95
      double4  : 484.71
      double8  : 951.55
      double16 : 1663.51

    Integer compute (GIOPS)
      int   : 213.32
      int2  : 455.37
      int4  : 942.36
      int8  : 1711.27
      int16 : 2909.55

    Integer compute Fast 24bit (GIOPS)
      int   : 194.02
      int2  : 520.80
      int4  : 851.99
      int8  : 1827.18
      int16 : 2845.90

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 12.72
      enqueueReadBuffer               : 14.27
      enqueueWriteBuffer non-blocking : 11.82
      enqueueReadBuffer non-blocking  : 10.53
      enqueueMapBuffer(for read)      : 3315.04
        memcpy from mapped ptr        : 11.59
      enqueueUnmap(after write)       : 3206.16
        memcpy to mapped ptr          : 19.58

    Kernel launch latency : 146.01 us

from which we could conclude that the GPUs (regardless of platform) vastly outperform the CPUs in this benchmark, that the vector size makes little difference on the Nvidia GPUs except for half precision (where a size of 2 is a good initial guess), and that a vector size of 16 is a good initial guess for the optimum vector size on the CPUs. Notice in particular how half precision is considerably faster than single and double precision on the GPUs, but slower on the CPUs. Half-precision computation is generally better done on the GPUs or on CPUs with better built-in half-precision support (like Emmy Phase 3).