Running Large Language Models (LLMs) on Grete
This guide walks you through deploying and running Large Language Models (LLMs) on the Grete GPU cluster at GWDG. Whether you’re fine-tuning a transformer or serving a chatbot, the steps below take you from logging in to submitting and scaling GPU jobs.
Prerequisites
Before you begin, ensure the following:
- You have an active HPC account with access to Grete.
- You are familiar with SSH, Slurm, and module environments.
- Your project has sufficient GPU quota and storage.
- You have a working conda or virtualenv environment (recommended).
💡 For account setup and access, refer to the Getting Started guide.
1. Connect to Grete
```bash
ssh <username>@glogin-gpu.hpc.gwdg.de
```
Once logged in, you can submit jobs to the GPU nodes via Slurm.
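For example, you can check the state of the GPU partition used later in this guide before submitting anything:

```bash
# Show node states and limits for the partition used in this guide's job script
sinfo -p kisski
```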
2. Load Required Modules
Grete provides pre-installed software modules. Load the conda distribution:

```bash
module load miniforge3
```

🧠 Tip: Use `module spider` to explore available modules and versions.
3. Set Up Your Environment
Create a conda environment in your project directory and install the required packages:

```bash
cd $PROJECT
conda create --prefix ./llm-env python=3.11
source activate ./llm-env
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate
```
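A quick way to verify the setup is a short check on a GPU node (CUDA is not available on the login nodes, so run this inside a job):

```python
import torch

# Confirm that PyTorch was built with CUDA support and can see a GPU.
print(torch.__version__)
print(torch.cuda.is_available())  # True on a GPU node, False on login nodes
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```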
4. Prepare Your Script
Here’s a minimal example, saved as `run_llm.py` (the name the job script below expects):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)

# Prepare input and move it to the same device as the model
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
⚠️ Make sure your model fits into GPU memory. Libraries such as `accelerate` (automatic device placement and offloading) or `bitsandbytes` (8-bit/4-bit quantization) can reduce the memory footprint.
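As a rough sketch of what quantized loading can look like with the Hugging Face stack (assuming `bitsandbytes` is installed via pip; the 7B model name here is only an illustration, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes substantially reduces GPU memory use
# compared to fp16. The model name below is only an example.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)
```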
5. Submit a Slurm Job
Create a job script `run_llm.sh`:
```bash
#!/bin/bash
#SBATCH --job-name=llm-run
#SBATCH --partition=kisski
#SBATCH -G A100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=40G
#SBATCH --time=02:00:00
#SBATCH --output=llm_output.log
#SBATCH -C inet

module load miniforge3
source activate $PROJECT/llm-env
python run_llm.py
```
Submit with:
```bash
sbatch run_llm.sh
```
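You can then monitor the job with standard Slurm commands:

```bash
squeue -u $USER            # show your queued and running jobs
tail -f llm_output.log     # follow the job's output while it runs
scancel <jobid>            # cancel the job if needed
```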
6. Tips for Scaling
- Use DeepSpeed, FSDP, or `accelerate` for multi-GPU training (see the sketch after this list).
- For inference, consider model quantization or ONNX export.
- Monitor GPU usage with `nvidia-smi`.
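Building on the multi-GPU tip above, a minimal sketch of a multi-GPU launch with `accelerate` (the training script name is a placeholder, and the GPU count must match your Slurm request):

```bash
# In the job script, request several GPUs instead of one, e.g.:
#SBATCH -G A100:4

# Launch one process per GPU with accelerate (train.py is a placeholder)
accelerate launch --multi_gpu --num_processes 4 train.py
```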
7. Using GWDG’s LLM Service
If you prefer not to run your own model, GWDG offers a hosted LLM service with a selection of ready-to-use models. See the GWDG LLM Service Overview for details.
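If the service exposes an OpenAI-compatible API, access from Python is a minimal sketch like the following; the base URL and model name below are assumptions, so check the service overview for the real endpoint, available models, and how to obtain an API key:

```python
from openai import OpenAI

# Endpoint and model name are assumptions; consult the GWDG LLM service
# documentation for the real values and for how to obtain an API key.
client = OpenAI(
    base_url="https://llm-service.example.gwdg.de/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="example-model",  # hypothetical model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```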
Support
For help, contact the GWDG HPC support team.