Gaudi2 Getting Started
This section provides step-by-step instructions for first-time users to run machine learning workloads on Gaudi2 HPUs on the Future Technology Platform.
Initial Setup (One-Time Configuration)
1. Container Environment Setup
# Allocate resources for building the container
# (Gaudi2 nodes can only be accessed in exclusive mode on FTP)
salloc -p gaudi --reservation=Gaudi2 --time=01:00:00 --mem=128G --job-name=apptainer-build
# Load Apptainer and build the PyTorch-Habana container
module load apptainer
apptainer build ~/pytorch-habana.sif docker://vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
mkdir -p "$HOME/datasets" "$HOME/tmp" "$HOME/habana_logs"
Gaudi uses its own fork of PyTorch, so the recommended approach is to pull the matching Gaudi Docker image and convert it into a custom .sif file for use with Apptainer.
You can find the latest PyTorch Docker image tags here: Gaudi Docker Images
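To confirm which PyTorch build the image ships, you can query the versions from inside the freshly built container; a quick check along these lines:
module load apptainer
apptainer exec --cleanenv "$HOME/pytorch-habana.sif" \
python3 -c "import torch; print(torch.__version__)"
apptainer exec --cleanenv "$HOME/pytorch-habana.sif" \
bash -lc "pip list 2>/dev/null | grep -i habana"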
2. System Verification
# Enter the container to check system specifications
apptainer shell --cleanenv --contain \
--bind "$HOME/habana_logs:/var/log/habana_logs" \
--bind /dev:/dev \
--env HABANA_LOGS=/var/log/habana_logs \
"$HOME/pytorch-habana.sif"
# Verify CPU configuration
lscpu | grep -E '^CPU\(s\):|^Socket|^Core'
nproc
# Confirm HPU devices are accessible
ls /dev/accel*
# Test PyTorch HPU integration
python3 -c "import habana_frameworks.torch.hpu as hthpu; print(f'HPU devices: {hthpu.device_count()}')"
# Test HPU management system
hl-smi
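# Optional smoke test: run a small tensor op on the HPU. A minimal sketch;
# importing habana_frameworks.torch.core registers the hpu device, and
# mark_step() flushes the accumulated lazy-mode graph.
python3 - <<PY
import torch
import habana_frameworks.torch.core as htcore

x = torch.randn(2, 3).to("hpu")   # move a tensor to the first HPU
y = (x * 2).sum()                 # queued in the lazy-mode graph
htcore.mark_step()                # flush/execute the graph
print("HPU result:", y.cpu().item())
PY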
exit # Exit container shell
3. Model Repository and Directory Setup
Several official examples are available in the HabanaAI GitHub organization. You can also use our direct links to the examples here.
# Clone the official model reference repository
git clone https://github.com/HabanaAI/Model-References
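The repository's release tags generally track the Gaudi software version; to keep the examples aligned with the 1.21.2 container built above, you can optionally check out a matching tag (tag names vary between releases, so list them first):
cd ~/Model-References
git tag | tail -n 10   # list the most recent release tags
# git checkout <tag-matching-your-container-version>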
Make sure you exit the interactive allocation from step 1 before submitting the batch jobs below, or adjust the commands accordingly. The snippet below shows how to check for and release a lingering allocation.
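# List your active jobs; the salloc from step 1 appears as "apptainer-build"
squeue -u "$USER"
# Type "exit" in the salloc shell to release it, or cancel it by job id:
# scancel <jobid>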
Single HPU Example: MNIST Classification
This example demonstrates basic single-HPU usage with the classic MNIST handwritten digit classification task.
sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
-J mnist-single-hpu -o mnist-single-hpu_%j.out -e mnist-single-hpu_%j.err \
--wrap "/bin/bash -lc 'module load apptainer; \
apptainer exec --cleanenv --contain \
--bind \$HOME:\$HOME \
--bind \$HOME/habana_logs:/var/log/habana_logs \
--bind \$HOME/datasets:/datasets \
--bind \$HOME/tmp:/worktmp \
--bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
--env HABANA_LOGS=/var/log/habana_logs,PT_HPU_LAZY_MODE=1,HABANA_INITIAL_WORKSPACE_SIZE_MB=8192,TMPDIR=/worktmp,TORCH_HOME=\$HOME/.cache/torch,PYTHONNOUSERSITE=1 \
--pwd \$HOME/Model-References/PyTorch/examples/computer_vision/hello_world \
\$HOME/pytorch-habana.sif python3 mnist.py --epochs 5 --batch-size 128 --data-path /datasets/mnist'"
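Once submitted, you can watch the queue and follow the job's output; a minimal sketch (log file names depend on the assigned job id):
squeue -u "$USER" -n mnist-single-hpu
tail -f mnist-single-hpu_*.out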
Multi-HPU Example: YOLOX Object Detection
This comprehensive example demonstrates distributed training across multiple HPUs using the YOLOX object detection model with the COCO 2017 dataset.
1. YOLOX Dependencies Installation
# Set up environment variables
export SIF="$HOME/pytorch-habana.sif"
export YOLOX_DIR="$HOME/Model-References/PyTorch/computer_vision/detection/yolox"
export TMPDIR_HOST="$HOME/tmp"
mkdir -p "$TMPDIR_HOST" "$HOME/habana_logs"
module load apptainer
# Install YOLOX requirements in the container
apptainer exec --cleanenv --contain \
--bind "$HOME:$HOME" \
--bind "$HOME/habana_logs:/var/log/habana_logs:rw" \
--bind "$TMPDIR_HOST:/worktmp" \
--env HOME=$HOME,TMPDIR=/worktmp,PIP_CACHE_DIR=/worktmp/pip,PIP_TMPDIR=/worktmp,XDG_CACHE_HOME=/worktmp \
--pwd "$YOLOX_DIR" \
"$SIF" bash -lc '
python3 -m pip install --user --no-cache-dir --prefer-binary -r requirements.txt
python3 -m pip install -v -e .
python3 - <<PY
import site, loguru
print("USER_SITE:", site.getusersitepackages())
print("loguru:", loguru.__version__)
PY'
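To confirm the editable YOLOX install is importable before queueing a training job, a quick check along these lines (PYTHONPATH is set explicitly here as a belt-and-braces measure):
apptainer exec --cleanenv --contain \
--bind "$HOME:$HOME" \
--env HOME=$HOME,PYTHONPATH=$YOLOX_DIR \
"$SIF" python3 -c "import yolox, loguru; print('yolox at', yolox.__file__)"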
2. COCO 2017 Dataset Download
export DATA_COCO="$HOME/datasets/COCO"
mkdir -p "$DATA_COCO"
# Write the annotation patch script to the host first. Keeping the Python out
# of the sbatch --wrap string avoids fragile nested quoting. The patch adds a
# missing 'info' key to the COCO annotation files, which otherwise triggers a
# pycocotools KeyError.
cat > "$HOME/patch_coco_info.py" <<'PY'
import json
import os

paths = [
    "/data/COCO/annotations/instances_train2017.json",
    "/data/COCO/annotations/instances_val2017.json",
]
for p in paths:
    with open(p, "r", encoding="utf-8") as f:
        d = json.load(f)
    if "info" not in d or not isinstance(d["info"], dict):
        d["info"] = {"description": "COCO 2017", "version": "1.0"}
        with open(p, "w", encoding="utf-8") as f:
            json.dump(d, f)
        print("Patched", os.path.basename(p) + ": added 'info'")
    else:
        print(os.path.basename(p), "already has 'info'")
PY
# Submit the download job (relies on SIF, YOLOX_DIR and TMPDIR_HOST exported in step 1)
sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
-J coco-download -o coco-download_%j.out -e coco-download_%j.err \
--wrap "/bin/bash -lc '
set -euo pipefail
module load apptainer
apptainer exec --cleanenv --contain \
--bind $HOME:$HOME \
--bind $DATA_COCO:/data/COCO \
--bind $TMPDIR_HOST:/worktmp \
--env TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO \
--pwd $YOLOX_DIR \
\"$SIF\" bash -lc \"set -euo pipefail
echo Using YOLOX_DATADIR=\\\$YOLOX_DATADIR
source download_dataset.sh
# Sanity check: ensure the annotation files exist
test -f /data/COCO/annotations/instances_train2017.json
test -f /data/COCO/annotations/instances_val2017.json
# Patch annotations to guarantee an info key (prevents pycocotools KeyError)
python3 $HOME/patch_coco_info.py
# Quick listing
ls -l /data/COCO
ls -l /data/COCO/annotations
ls -l /data/COCO/train2017 | head -n 5
ls -l /data/COCO/val2017 | head -n 5
\"
'"
3. Training Configurations
Single-HPU Training (Baseline)
sbatch -p gaudi --reservation=Gaudi2 --time=01:00:00 --exclusive \
-J yolox-1hpu-training -o yolox-1hpu-training_%j.out -e yolox-1hpu-training_%j.err \
--wrap "/bin/bash -lc '
SIF=\$HOME/pytorch-habana.sif
YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
DATA_COCO=\$HOME/datasets/COCO
TMPDIR_HOST=\$HOME/tmp
module load apptainer || true
apptainer exec --cleanenv --contain \
--bind \$HOME:\$HOME \
--bind \$DATA_COCO:/data/COCO \
--bind \$HOME/habana_logs:/var/log/habana_logs:rw \
--bind \$TMPDIR_HOST:/worktmp \
--bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
--env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
--pwd \$YOLOX_DIR \
\"\$SIF\" bash -lc \"python3 -u tools/train.py --name yolox-s --devices 1 --batch-size 64 --data_dir /data/COCO --hpu \
steps 100 output_dir ./yolox_output\"
'"
4-HPU Distributed Training (mpirun)
sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
-J yolox-4hpu-training -o yolox-4hpu-training_%j.out -e yolox-4hpu-training_%j.err \
--wrap "/bin/bash -lc '
SIF=\$HOME/pytorch-habana.sif
YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
DATA_COCO=\$HOME/datasets/COCO
TMPDIR_HOST=\$HOME/tmp
module load apptainer || true
apptainer exec --cleanenv --contain \
--bind \$HOME:\$HOME \
--bind \$DATA_COCO:/data/COCO \
--bind \$HOME/habana_logs:/var/log/habana_logs:rw \
--bind \$TMPDIR_HOST:/worktmp \
--bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
--env HOME=\$HOME,PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
--pwd \$YOLOX_DIR \
\"\$SIF\" bash -lc \"mpirun -n 4 --bind-to core --rank-by core --report-bindings --allow-run-as-root \
python3 -u tools/train.py --name yolox-s --devices 4 --batch-size 64 --data_dir /data/COCO --hpu \
steps 100 output_dir ./yolox_output\"
'"
4-HPU Distributed Training (torchrun)
sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
-J yolox-4hpu-torchrun -o yolox-4hpu-torchrun_%j.out -e yolox-4hpu-torchrun_%j.err \
--wrap "/bin/bash -lc '
SIF=\$HOME/pytorch-habana.sif
YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
DATA_COCO=\$HOME/datasets/COCO
TMPDIR_HOST=\$HOME/tmp
module load apptainer || true
apptainer exec --cleanenv --contain \
--bind \$HOME:\$HOME \
--bind \$DATA_COCO:/data/COCO \
--bind \$HOME/habana_logs:/var/log/habana_logs:rw \
--bind \$TMPDIR_HOST:/worktmp \
--bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
--env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
--pwd \$YOLOX_DIR \
\"\$SIF\" bash -lc \"mpirun -n 4 --bind-to core --rank-by core --report-bindings \
python3 -u tools/train.py --name yolox-s --devices 4 --batch-size 64 --data_dir /data/COCO --hpu \
steps 100 output_dir ./yolox_output\"
'"
8-HPU Maximum Scale Training (mpirun)
sbatch -p gaudi --reservation=Gaudi2 --time=05:00:00 --exclusive \
-J yolox-8hpu-training -o yolox-8hpu-training_%j.out -e yolox-8hpu-training_%j.err \
--wrap "/bin/bash -lc '
SIF=\$HOME/pytorch-habana.sif
YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
DATA_COCO=\$HOME/datasets/COCO
TMPDIR_HOST=\$HOME/tmp
module load apptainer || true
apptainer exec --cleanenv --contain \
--bind \$HOME:\$HOME \
--bind \$DATA_COCO:/data/COCO \
--bind \$HOME/habana_logs:/var/log/habana_logs:rw \
--bind \$TMPDIR_HOST:/worktmp \
--bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
--env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
--pwd \$YOLOX_DIR \
\"\$SIF\" bash -lc \"mpirun -n 8 --bind-to core --rank-by core \
python3 -u tools/train.py --name yolox-s --devices 8 --batch-size 64 --data_dir /data/COCO --hpu \
steps 100 output_dir ./yolox_output eval_interval 1000000 data_num_workers 2\"
'"
8-HPU Maximum Scale Training (torchrun)
sbatch -p gaudi --reservation=Gaudi2 --time=05:00:00 --exclusive \
-J yolox-8hpu-torchrun -o yolox-8hpu-torchrun_%j.out -e yolox-8hpu-torchrun_%j.err \
--wrap "/bin/bash -lc '
SIF=\$HOME/pytorch-habana.sif
YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
DATA_COCO=\$HOME/datasets/COCO
TMPDIR_HOST=\$HOME/tmp
module load apptainer || true
apptainer exec --cleanenv --contain \
--bind \$HOME:\$HOME \
--bind \$DATA_COCO:/data/COCO \
--bind \$HOME/habana_logs:/var/log/habana_logs:rw \
--bind \$TMPDIR_HOST:/worktmp \
--bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
--env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
--pwd \$YOLOX_DIR \
\"\$SIF\" bash -lc \"torchrun --nproc_per_node=8 tools/train.py --name yolox-s --devices 8 --batch-size 64 --data_dir /data/COCO --hpu \
steps 100 output_dir ./yolox_output eval_interval 1000000 data_num_workers 2\"
'"