<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Gaudi2 :: Documentation for HPC</title><link>https://docs.hpc.gwdg.de/services/ftp/gaudi2/index.html</link><description>Introduction Gaudi2 is Intel’s second-generation deep learning accelerator, developed by Habana Labs (now part of Intel). Unlike traditional GPUs, Gaudi2 has been designed from the ground up for large-scale AI training. Each device is powered by Habana Processing Units (HPUs), its purpose-built AI training cores. The memory-centric architecture and Ethernet-based scale-out enable efficient training of today’s large and complex models, while offering a favorable power-to-performance ratio. The platform provides 96 GB of on-package high-bandwidth memory (HBM2E) per device, together with 24×100 Gbps standard Ethernet interfaces. This combination eliminates the need for proprietary interconnects and allows flexible integration into existing cluster infrastructures. On the FTP, we currently host a single Gaudi2 node equipped with 8 HL-225 HPUs, available for researchers and developers to evaluate distributed AI training.</description><generator>Hugo</generator><language>en</language><atom:link href="https://docs.hpc.gwdg.de/services/ftp/gaudi2/index.xml" rel="self" type="application/rss+xml"/><item><title>Gaudi2 Getting Started</title><link>https://docs.hpc.gwdg.de/services/ftp/gaudi2/getongaudi2/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.hpc.gwdg.de/services/ftp/gaudi2/getongaudi2/index.html</guid><description>This section provides step-by-step instructions for first-time users to run machine learning workloads on Gaudi2 HPUs on the Future Technology Platform.
Initial Setup (One-Time Configuration)
1. Container Environment Setup
# Allocate resources for building the container
salloc -p gaudi --reservation=Gaudi2 --time=01:00:00 --mem=128G --job-name=apptainer-build
# Gaudi2 can only be accessed in exclusive mode on the FTP
# Load Apptainer and build the PyTorch-Habana container
module load apptainer
apptainer build ~/pytorch-habana.sif docker://vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
mkdir -p "$HOME/datasets" "$HOME/tmp" "$HOME/habana_logs"
Gaudi uses its own fork of PyTorch, so it is best to extract the latest version from the Gaudi Docker image and build a custom .sif file to use with Apptainer.</description></item><item><title>Gaudi2 Tutorials and Examples</title><link>https://docs.hpc.gwdg.de/services/ftp/gaudi2/gaudi_examples/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://docs.hpc.gwdg.de/services/ftp/gaudi2/gaudi_examples/index.html</guid><description>Here are some official tutorials and examples. You will need to have access to the FTP and to complete the one-time setup before trying any of these.
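Once the one-time setup is done, work on the Gaudi2 node is typically submitted as a batch job that runs inside the container built above. A minimal sketch of such a job script follows; the partition, reservation, and exclusive-mode flags are taken from the setup step, while the bind paths and the script name train.py are placeholders for your own workload:

```shell
#!/bin/bash
#SBATCH -p gaudi                  # Gaudi2 partition (see setup step)
#SBATCH --reservation=Gaudi2      # reservation used in the setup step
#SBATCH --exclusive               # Gaudi2 can only be accessed in exclusive mode
#SBATCH --time=01:00:00
#SBATCH --job-name=hpu-train

module load apptainer

# Run a training script inside the PyTorch-Habana container built earlier.
# The bind targets and train.py are hypothetical placeholders; adjust them
# to your own datasets, log directory, and workload.
apptainer exec \
    --bind "$HOME/datasets" \
    --bind "$HOME/habana_logs" \
    "$HOME/pytorch-habana.sif" \
    python train.py
```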
Hello World / MNIST
Getting Started with DeepSpeed
DeepSpeed User Guide
Inferencing on Gaudi
Running vLLM
Gaudi-compatible vLLM fork
ResNet-50
U-Net Segmentation
YOLOX object detection
BERT NLP Training
DeepSpeed BERT Training
Wav2Vec2 speech recognition inferencing
Stable Diffusion Training
MLPerf Training 4.0 GPT3
MLPerf Training 4.0 Llama 70B LoRA
MLPerf Inference 4.0 Llama 70B
MLPerf Inference 4.0 Stable Diffusion XL</description></item></channel></rss>