Gaudi2 Getting Started

This section provides step-by-step instructions for first-time users running machine learning workloads on Gaudi2 HPUs on the Future Technology Platform (FTP).

Initial Setup (One-Time Configuration)

1. Container Environment Setup

# Allocate resources for building the container
salloc -p gaudi --reservation=Gaudi2 --time=01:00:00 --mem=128G --job-name=apptainer-build
# Gaudi2 can only be accessed in exclusive mode on FTP

# Load Apptainer and build the PyTorch-Habana container
module load apptainer
apptainer build ~/pytorch-habana.sif docker://vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
mkdir -p "$HOME/datasets" "$HOME/tmp" "$HOME/habana_logs"

Gaudi uses its own fork of PyTorch, so it is best to pull the latest Gaudi Docker image and convert it into a custom .sif file for use with Apptainer, as done above.

You can find the latest PyTorch Docker images here: Gaudi Docker Images
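
Once the build finishes, a quick sanity check is to print the PyTorch version baked into the image; it should correspond to the 2.6.0 installer tag used above.

# Sanity check: print the PyTorch version shipped in the image
module load apptainer
apptainer exec "$HOME/pytorch-habana.sif" python3 -c "import torch; print(torch.__version__)"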

2. System Verification

# Enter the container to check system specifications
apptainer shell --cleanenv --contain \
  --bind "$HOME/habana_logs:/var/log/habana_logs" \
  --bind /dev:/dev \
  --env HABANA_LOGS=/var/log/habana_logs \
  "$HOME/pytorch-habana.sif"
# Verify CPU configuration
lscpu | grep -E '^CPU\(s\):|^Socket|^Core'
nproc
# Confirm HPU devices are accessible
ls /dev/accel*
# Test PyTorch HPU integration
python -c "import habana_frameworks.torch.core as ht; print(f'HPU devices: {ht.hpu.device_count()}')"
# Test HPU management system
hl-smi
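# Optional end-to-end check: run a tiny tensor op on one HPU (a minimal sketch;
# importing habana_frameworks.torch.core registers the hpu device with PyTorch)
python3 -c "import torch, habana_frameworks.torch.core; x = torch.ones(2, 2, device='hpu'); print((x + x).to('cpu'))"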
exit  # Exit container shell

3. Model Repository and Directory Setup

Several official examples are available in the Habana AI GitHub. You can also use our direct links to the examples here.

# Clone the official model reference repository
git clone https://github.com/HabanaAI/Model-References
Warning

Make sure you exit the reservation (leave the salloc shell from the setup step) before continuing with the code below, or adjust the commands accordingly.
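
To confirm the interactive allocation has been released, list your jobs with standard Slurm tooling:

squeue -u "$USER"   # the apptainer-build allocation should no longer be listed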

Single HPU Example: MNIST Classification

This example demonstrates basic single-HPU usage with the classic MNIST handwritten digit classification task.

sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
  -J mnist-single-hpu -o mnist-single-hpu_%j.out -e mnist-single-hpu_%j.err \
  --wrap "/bin/bash -lc 'module load apptainer; \
  apptainer exec --cleanenv --contain \
    --bind \$HOME:\$HOME \
    --bind \$HOME/habana_logs:/var/log/habana_logs \
    --bind \$HOME/datasets:/datasets \
    --bind \$HOME/tmp:/worktmp \
    --bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
    --env HABANA_LOGS=/var/log/habana_logs,PT_HPU_LAZY_MODE=1,HABANA_INITIAL_WORKSPACE_SIZE_MB=8192,TMPDIR=/worktmp,TORCH_HOME=\$HOME/.cache/torch,PYTHONNOUSERSITE=1 \
    --pwd \$HOME/Model-References/PyTorch/examples/computer_vision/hello_world \
    \$HOME/pytorch-habana.sif python3 mnist.py --epochs 5 --batch-size 128 --data-path /datasets/mnist'"
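
Once submitted, you can follow the job with standard Slurm tooling. The file names below follow the -J/-o pattern used above; replace <jobid> with the ID printed by sbatch.

squeue -u "$USER" -n mnist-single-hpu    # check queue state
tail -f mnist-single-hpu_<jobid>.out     # stream training output once the job is running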

Multi-HPU Example: YOLOX Object Detection

This example demonstrates distributed training across multiple HPUs using the YOLOX object detection model and the COCO 2017 dataset.

1. YOLOX Dependencies Installation

# Set up environment variables
export SIF="$HOME/pytorch-habana.sif"
export YOLOX_DIR="$HOME/Model-References/PyTorch/computer_vision/detection/yolox"
export TMPDIR_HOST="$HOME/tmp"

mkdir -p "$TMPDIR_HOST" "$HOME/habana_logs" 
module load apptainer

# Install YOLOX requirements in the container
apptainer exec --cleanenv --contain \
  --bind "$HOME:$HOME" \
  --bind "$HOME/habana_logs:/var/log/habana_logs:rw" \
  --bind "$TMPDIR_HOST:/worktmp" \
  --env HOME=$HOME,TMPDIR=/worktmp,PIP_CACHE_DIR=/worktmp/pip,PIP_TMPDIR=/worktmp,XDG_CACHE_HOME=/worktmp \
  --pwd "$YOLOX_DIR" \
  "$SIF" bash -lc '
    python3 -m pip install --user --no-cache-dir --prefer-binary -r requirements.txt
    python3 -m pip install -v -e .
    python3 - <<PY
import site,loguru
print("USER_SITE:", site.getusersitepackages())
print("loguru:", loguru.__version__)
PY'
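
Before moving on, it is worth confirming that the editable install is importable from inside the container; the --user install lands in ~/.local, which is bound via $HOME. A quick check, mirroring the bind and env flags used above:

apptainer exec --cleanenv --contain --bind "$HOME:$HOME" --env HOME=$HOME \
  "$SIF" python3 -c "import yolox; print('yolox from:', yolox.__file__)"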

2. COCO 2017 Dataset Download

export DATA_COCO="$HOME/datasets/COCO"
mkdir -p "$DATA_COCO"

# Write the annotation patch script once on the host; the job below runs it
# inside the container, where the dataset is bound at /data/COCO.
# It guarantees an 'info' key in the annotation files (prevents a pycocotools KeyError).
cat > "$HOME/patch_coco_info.py" <<'PY'
import json, os

paths = [
    '/data/COCO/annotations/instances_train2017.json',
    '/data/COCO/annotations/instances_val2017.json',
]
for p in paths:
    with open(p, 'r', encoding='utf-8') as f:
        d = json.load(f)
    if 'info' not in d or not isinstance(d['info'], dict):
        d['info'] = {'description': 'COCO 2017', 'version': '1.0'}
        with open(p, 'w', encoding='utf-8') as f:
            json.dump(d, f)
        print(f"Patched {os.path.basename(p)}: added 'info'")
    else:
        print(f"{os.path.basename(p)} already has 'info'")
PY

sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
  -J coco-download -o coco-download_%j.out -e coco-download_%j.err \
  --wrap "/bin/bash -lc '
    set -euo pipefail
    module load apptainer || true
    apptainer exec --cleanenv --contain \
      --bind $HOME:$HOME \
      --bind $DATA_COCO:/data/COCO \
      --bind $TMPDIR_HOST:/worktmp \
      --env TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO \
      --pwd $YOLOX_DIR \
      \"$SIF\" bash -lc \"set -euo pipefail
        echo Using YOLOX_DATADIR=\\\$YOLOX_DATADIR
        source download_dataset.sh

        # Sanity check: ensure the annotation files exist, then patch them
        test -f /data/COCO/annotations/instances_train2017.json
        test -f /data/COCO/annotations/instances_val2017.json
        python3 $HOME/patch_coco_info.py

        # Quick listing
        ls -l /data/COCO
        ls -l /data/COCO/annotations
        ls -l /data/COCO/train2017 | head -n 5
        ls -l /data/COCO/val2017   | head -n 5
      \"
  '"

3. Multi-HPU Training Configurations

Single HPU Training

sbatch -p gaudi --reservation=Gaudi2 --time=01:00:00 --exclusive \
  -J yolox-1hpu-training -o yolox-1hpu-training_%j.out -e yolox-1hpu-training_%j.err \
  --wrap "/bin/bash -lc '
    SIF=\$HOME/pytorch-habana.sif
    YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
    DATA_COCO=\$HOME/datasets/COCO
    TMPDIR_HOST=\$HOME/tmp
    module load apptainer || true
    apptainer exec --cleanenv --contain \
      --bind \$HOME:\$HOME \
      --bind \$DATA_COCO:/data/COCO \
      --bind \$HOME/habana_logs:/var/log/habana_logs:rw \
      --bind \$TMPDIR_HOST:/worktmp \
      --bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
      --env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
      --pwd \$YOLOX_DIR \
      \"\$SIF\" bash -lc \"python3 -u tools/train.py --name yolox-s --devices 1 --batch-size 64 --data_dir /data/COCO --hpu \
        steps 100 output_dir ./yolox_output\"
  '"

4-HPU Distributed Training (MPIrun)

sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
  -J yolox-4hpu-training -o yolox-4hpu-training_%j.out -e yolox-4hpu-training_%j.err \
  --wrap "/bin/bash -lc '
    SIF=\$HOME/pytorch-habana.sif
    YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
    DATA_COCO=\$HOME/datasets/COCO
    TMPDIR_HOST=\$HOME/tmp
    module load apptainer || true
    apptainer exec --cleanenv --contain \
      --bind \$HOME:\$HOME \
      --bind \$DATA_COCO:/data/COCO \
      --bind \$HOME/habana_logs:/var/log/habana_logs:rw \
      --bind \$TMPDIR_HOST:/worktmp \
      --bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
      --env HOME=\$HOME,PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
      --pwd \$YOLOX_DIR \
      \"\$SIF\" bash -lc \"mpirun -n 4 --bind-to core --rank-by core --report-bindings --allow-run-as-root \
        python3 -u tools/train.py --name yolox-s --devices 4 --batch-size 64 --data_dir /data/COCO --hpu \
        steps 100 output_dir ./yolox_output\"
  '"

4-HPU Distributed Training (Torchrun)

sbatch -p gaudi --reservation=Gaudi2 --time=02:00:00 --exclusive \
  -J yolox-4hpu-torchrun -o yolox-4hpu-torchrun_%j.out -e yolox-4hpu-torchrun_%j.err \
  --wrap "/bin/bash -lc '
    SIF=\$HOME/pytorch-habana.sif
    YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
    DATA_COCO=\$HOME/datasets/COCO
    TMPDIR_HOST=\$HOME/tmp
    module load apptainer || true
    apptainer exec --cleanenv --contain \
      --bind \$HOME:\$HOME \
      --bind \$DATA_COCO:/data/COCO \
      --bind \$HOME/habana_logs:/var/log/habana_logs:rw \
      --bind \$TMPDIR_HOST:/worktmp \
      --bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
      --env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
      --pwd \$YOLOX_DIR \
      \"\$SIF\" bash -lc \"mpirun -n 4 --bind-to core --rank-by core --report-bindings \
        python3 -u tools/train.py --name yolox-s --devices 4 --batch-size 64 --data_dir /data/COCO --hpu \
        steps 100 output_dir ./yolox_output\"
  '"

8-HPU Maximum Scale Training (MPIrun)

sbatch -p gaudi --reservation=Gaudi2 --time=05:00:00 --exclusive \
  -J yolox-8hpu-training -o yolox-8hpu-training_%j.out -e yolox-8hpu-training_%j.err \
  --wrap "/bin/bash -lc '
    SIF=\$HOME/pytorch-habana.sif
    YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
    DATA_COCO=\$HOME/datasets/COCO
    TMPDIR_HOST=\$HOME/tmp
    module load apptainer || true
    apptainer exec --cleanenv --contain \
      --bind \$HOME:\$HOME \
      --bind \$DATA_COCO:/data/COCO \
      --bind \$HOME/habana_logs:/var/log/habana_logs:rw \
      --bind \$TMPDIR_HOST:/worktmp \
      --bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
      --env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
      --pwd \$YOLOX_DIR \
      \"\$SIF\" bash -lc \"mpirun -n 8 --bind-to core --rank-by core \
        python3 -u tools/train.py --name yolox-s --devices 8 --batch-size 64 --data_dir /data/COCO --hpu \
        steps 100 output_dir ./yolox_output eval_interval 1000000 data_num_workers 2\"
  '"

8-HPU Maximum Scale Training (Torchrun)

sbatch -p gaudi --reservation=Gaudi2 --time=05:00:00 --exclusive \
  -J yolox-8hpu-torchrun -o yolox-8hpu-torchrun_%j.out -e yolox-8hpu-torchrun_%j.err \
  --wrap "/bin/bash -lc '
    SIF=\$HOME/pytorch-habana.sif
    YOLOX_DIR=\$HOME/Model-References/PyTorch/computer_vision/detection/yolox
    DATA_COCO=\$HOME/datasets/COCO
    TMPDIR_HOST=\$HOME/tmp
    module load apptainer || true
    apptainer exec --cleanenv --contain \
      --bind \$HOME:\$HOME \
      --bind \$DATA_COCO:/data/COCO \
      --bind \$HOME/habana_logs:/var/log/habana_logs:rw \
      --bind \$TMPDIR_HOST:/worktmp \
      --bind /dev:/dev --bind /sys/class/accel:/sys/class/accel --bind /sys/kernel/debug:/sys/kernel/debug \
      --env PT_HPU_LAZY_MODE=1,TMPDIR=/worktmp,YOLOX_DATADIR=/data/COCO,MASTER_ADDR=localhost,MASTER_PORT=12355,PYTHONPATH=\$YOLOX_DIR:\$HOME/.local/lib/python3.10/site-packages:\$PYTHONPATH \
      --pwd \$YOLOX_DIR \
      \"\$SIF\" bash -lc \"torchrun --nproc_per_node=8 tools/train.py --name yolox-s --devices 8 --batch-size 64 --data_dir /data/COCO --hpu \
        steps 100 output_dir ./yolox_output eval_interval 1000000 data_num_workers 2\"
  '"