LLMs on Grete

This guide walks you through deploying and running Large Language Models (LLMs) on the Grete GPU cluster at GWDG. Whether you are fine-tuning a transformer or serving a chatbot, the steps below cover connecting to the cluster, setting up an environment, and submitting GPU jobs.

Prerequisites

Before you begin, ensure the following:

  • You have an active HPC account with access to Grete (a quick check is shown after this list).
  • You are familiar with SSH, Slurm, and module environments.
  • Your project has sufficient GPU quota and storage.
  • You have a working Python environment.
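
If you are unsure whether your account can see the GPU partitions, a quick check from a login node is the following (the partition name pattern is an assumption; consult the GWDG documentation for the current list):

sinfo -o "%P %D %G" | grep -i grete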

1. Connect to Grete

ssh u12345@glogin-gpu.hpc.gwdg.de

Replace the username u12345 with your own project user name. Once logged in, you can submit jobs to the GPU nodes via Slurm.
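
If you connect often, an entry in your local ~/.ssh/config saves typing; the host alias and key path below are only examples:

Host grete
    HostName glogin-gpu.hpc.gwdg.de
    User u12345
    IdentityFile ~/.ssh/id_ed25519

With this in place, ssh grete is enough.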

2. Set Up Your Environment

Create a virtual environment, for example with conda. To use conda, first load the corresponding module.

module load miniforge3
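
If this module name is not found (names occasionally change between software stack revisions), search for it first:

module avail miniforge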

Once the module is loaded, create the environment.

conda create -n llm-env python=3.11

After creation, activate the environment. This will change how your prompt looks; that is normal.

source activate llm-env
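
You can verify which environment is active at any time; the asterisk marks the active one:

conda env list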

Finally, install the required packages.

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate
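
To verify that PyTorch was installed with CUDA support, run the following one-liner. Note that torch.cuda.is_available() reports True only on a node with a visible GPU, so expect False on nodes without one:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"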

3. Prepare Your Script

Here’s a minimal example called run_llm.py:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)

# Prepare input
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt").to(device)  # Move inputs to same device

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Make sure your model fits into GPU memory. For larger models, accelerate (device placement and offloading) or bitsandbytes (quantization) can reduce the memory footprint.
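
As a minimal sketch of quantized loading with bitsandbytes (this assumes pip install bitsandbytes in your environment; distilgpt2 does not need it, but the pattern carries over to larger models):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 4-bit precision to reduce GPU memory usage
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "distilgpt2",
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the weights
)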

4. Submit a Slurm Job

Create a job script run_llm.sh:

#!/bin/bash
#SBATCH --job-name=llm-run
#SBATCH --partition=grete:interactive
#SBATCH -G 1g.20gb:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=40G
#SBATCH --time=02:00:00
#SBATCH --output=llm_output_%J.log
#SBATCH -C inet

module load miniforge3

source activate llm-env
python run_llm.py

Submit with:

sbatch run_llm.sh
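
While the job runs, you can check its state and follow its log; replace <jobid> with the ID that sbatch prints:

squeue -u $USER
tail -f llm_output_<jobid>.log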

5. Tips for Scaling

  • Use DeepSpeed, FSDP, or accelerate for multi-GPU training.
  • For inference, consider model quantization or ONNX export.
  • Monitor GPU usage with nvidia-smi (see the example below).
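
One way to run nvidia-smi on the node of a running job is to attach an extra job step to it (this assumes a Slurm version that supports --overlap):

srun --jobid=<jobid> --overlap --pty watch -n 2 nvidia-smi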

6. Using GWDG’s LLM Service

If you prefer not to run your own model, GWDG offers a hosted LLM service with a selection of ready-to-use models. See the GWDG LLM Service Overview for details.
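
If the service exposes an OpenAI-compatible API (check the overview page for whether and how), a query could look like the sketch below; the endpoint URL, model name, and API key are placeholders, not actual service values:

from openai import OpenAI

# Placeholder endpoint and credentials; take the real values from the service documentation
client = OpenAI(base_url="https://<service-endpoint>/v1", api_key="<your-api-key>")
response = client.chat.completions.create(
    model="<model-name>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)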

7. An example for training a model

The GWDG Academy offers a course about training models on GPUs called Deep Learning with GPU Cores. The course also contains a simple example that can be run quickly. Feel free to check it out and clone the repository:

git clone https://gitlab-ce.gwdg.de/hpc-team-public/deep-learning-with-gpu-cores.git

Support

For help, contact the HPC support team at the indicated support addresses.