Running Large Language Models (LLMs) on Grete

This guide walks you through deploying and running Large Language Models (LLMs) on the Grete GPU cluster at GWDG. Whether you are fine-tuning a transformer or serving a chatbot, the steps below take you from environment setup to a running Slurm job.

Prerequisites

Before you begin, ensure the following:

  • You have an active HPC account with access to Grete.
  • You are familiar with SSH, Slurm, and module environments.
  • Your project has sufficient GPU quota and storage.
  • You have a working conda or virtualenv environment (recommended).

💡 For account setup and access, refer to the Getting Started guide.


1. Connect to Grete

ssh <username>@glogin-gpu.hpc.gwdg.de

Once logged in, you can submit jobs to the GPU nodes via Slurm.


2. Load Required Modules

Grete provides pre-installed software through environment modules. Load miniforge3 to make conda available:

module load miniforge3

🧠 Tip: Use module spider to explore available versions.


3. Set Up Your Environment

Create a conda environment inside your project directory:

cd $PROJECT
conda create --prefix ./llm-env python=3.11
source activate ./llm-env
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate
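
Before moving on, it is worth checking (on a GPU node rather than the login node) that PyTorch was installed with CUDA support and can see the allocated GPUs. A minimal check:

import torch

# Report the PyTorch build and whether a CUDA-capable GPU is visible
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# List every GPU that Slurm has allocated to the job
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")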

4. Prepare Your Script

Here’s a minimal example; save it as run_llm.py so the job script in step 5 can run it:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)

# Prepare input
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt").to(device)  # Move inputs to same device

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

⚠️ Make sure your model fits into GPU memory. For larger models, use accelerate for automatic device placement or bitsandbytes for quantization.
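
For models that do not fit in full precision, 4-bit quantization through bitsandbytes is one option. A minimal sketch, assuming bitsandbytes is installed (pip install bitsandbytes); the model name is just an example of a larger public checkpoint and can be swapped for whatever you actually use:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute; roughly quarters the memory needed for weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "facebook/opt-6.7b"  # example only; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)

Note that downloading weights from the Hugging Face Hub requires internet access on the compute node, which is why the job script below requests the inet constraint.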


5. Submit a Slurm Job

Create a job script run_llm.sh:

#!/bin/bash
#SBATCH --job-name=llm-run
#SBATCH --partition=kisski
#SBATCH -G A100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=40G
#SBATCH --time=02:00:00
#SBATCH --output=llm_output.log
#SBATCH -C inet

module load miniforge3

source activate $PROJECT/llm-env
python run_llm.py

Submit with:

sbatch run_llm.sh

6. Tips for Scaling

  • Use DeepSpeed, FSDP, or accelerate for multi-GPU training (a minimal accelerate sketch follows this list).
  • For inference, consider model quantization or ONNX export.
  • Monitor GPU usage with nvidia-smi.
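
As a rough illustration of the accelerate route, a training loop needs only a few changes to run on multiple GPUs. This is a minimal sketch: the linear model, random data, and optimizer are stand-ins purely for illustration, and the script would typically be started with accelerate launch (or srun) rather than plain python:

import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # detects the GPUs assigned by Slurm

# Toy model and data purely for illustration
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=32)

# accelerate moves everything to the right devices and wraps the model for DDP/FSDP
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()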

7. Using GWDG’s LLM Service

If you prefer not to run your own model, GWDG also offers a hosted LLM service with a selection of ready-to-use models. See the GWDG LLM Service Overview for details.
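
If the hosted service exposes an OpenAI-compatible API, it can be queried from Python with the openai client. This is only a sketch under that assumption; the base URL, API key, and model name below are placeholders, and the actual values (and how to obtain a key) are documented in the service overview:

from openai import OpenAI  # pip install openai

# Placeholder endpoint and credentials; take the real values from the service documentation
client = OpenAI(
    base_url="https://<llm-service-endpoint>/v1",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="<model-name>",
    messages=[{"role": "user", "content": "Summarize what Grete is in one sentence."}],
)
print(response.choices[0].message.content)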


Support

For help, contact the HPC support team at:

📧 hpc-support@gwdg.de