# PoCL
Portable Computing Language (PoCL) is a widely used OpenCL platform, best known for exposing the host CPUs as an OpenCL device. PoCL also supports other devices, such as Nvidia GPUs via CUDA. Two variants are provided, one with support for Nvidia GPUs and one without. The module names are given in the table below.
| Devices | NHR Modules name |
|---|---|
| CPU | `pocl` |
| CPU, Nvidia GPU | `pocl/VERSION_cuda-CUDAMAJORVERSION` |
For example, `pocl/5.0` would be the non-GPU variant and `pocl/5.0_cuda-11` would be a GPU variant using CUDA 11.x.
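To see which PoCL variants and versions are actually installed, query the module system (a standard Environment Modules/Lmod command):

```sh
# List all installed PoCL modules, both GPU and non-GPU variants.
module avail pocl
```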
Due to limitations in how the NHR Modules software stack is built, having a module for another OpenCL platform loaded prevents this platform from being used.
To load a specific version, run

```sh
module load pocl/VERSION
```

and for the default (non-GPU) version, run

```sh
module load pocl
```
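For instance, loading the CUDA-enabled variant from the example above would look like this (the version shown is only an example; use one listed by `module avail pocl`):

```sh
# Load a CUDA-enabled PoCL variant (example version).
module load pocl/5.0_cuda-11
```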
## Controlling Runtime Behavior
PoCL uses a variety of environment variables to control its runtime behavior, which are described in the PoCL documentation.
Two important environment variables are `POCL_DEVICES` and `POCL_MAX_CPU_CU_COUNT`.
By default, PoCL provides access to the `cpu` device and all non-CPU devices it was compiled for. Setting `POCL_DEVICES` to a space-separated list of device names limits PoCL to providing access to only those kinds of devices. The relevant device names are given in the table below. For example, setting `POCL_DEVICES=cuda` would limit PoCL to Nvidia GPUs only, while `POCL_DEVICES="cpu cuda"` would limit it to the host CPUs (using threads) and Nvidia GPUs.
| Name for `POCL_DEVICES` | Description |
|---|---|
| `cpu` | All CPUs on the host, using threads |
| `cuda` | Nvidia GPUs, using CUDA |
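To verify which devices PoCL actually exposes after setting `POCL_DEVICES`, one option is a quick check like the following (this assumes the `clinfo` utility is available; any OpenCL program that enumerates platforms and devices would work as well):

```sh
# Restrict PoCL to the host CPUs and Nvidia GPUs.
export POCL_DEVICES="cpu cuda"

# List the OpenCL platforms and devices now visible.
clinfo
```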
At present, PoCL is unable to determine how many CPUs it should use from the limits set by Slurm. Instead, it tries to use one thread for every core it sees on the host, including hyperthread cores, even if the Slurm job was run with, say, `-c 1`. To override the number of CPUs that PoCL sees (and therefore the number of threads it uses for the `cpu` device), set the environment variable `POCL_MAX_CPU_CU_COUNT`.
This oversubscription is particularly bad for jobs that do not use all the cores of a shared node. In that case, it usually makes the most sense to first run

```sh
export POCL_MAX_CPU_CU_COUNT="$SLURM_CPUS_PER_TASK"
```

if one wants to use all hyperthreads, or

```sh
export POCL_MAX_CPU_CU_COUNT="$(( SLURM_CPUS_PER_TASK / 2 ))"
```

if one wants only one thread per physical core (not using hyperthreads).
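Putting these pieces together, a minimal job script might look like the following sketch (the requested resources and the `./my_opencl_app` binary are placeholders; adapt them to the actual job):

```sh
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Load the default (non-GPU) PoCL variant.
module load pocl

# Expose only the cpu device through PoCL.
export POCL_DEVICES=cpu

# Match PoCL's CPU thread count to the Slurm allocation
# (one thread per allocated hyperthread).
export POCL_MAX_CPU_CU_COUNT="$SLURM_CPUS_PER_TASK"

# Run the OpenCL application (placeholder name).
./my_opencl_app
```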