LLMs on Grete
This guide walks you through deploying and running Large Language Models (LLMs) on the Grete GPU cluster at GWDG. Whether you are fine-tuning a transformer or serving a chatbot, it covers the steps needed to make good use of Grete’s GPUs.
Prerequisites
Before you begin, ensure the following:
- You have an active HPC account with access to Grete.
- You are familiar with SSH, Slurm, and module environments.
- Your project has sufficient GPU quota and storage.
- You have a working Python environment.
1. Connect to Grete
ssh u12345@glogin-gpu.hpc.gwdg.de
Remember to replace the username u12345 with your own project username.
Once logged in, you can submit jobs to the GPU nodes via Slurm.
2. Set Up Your Environment
Create a virtual environment using, for example, conda. In order to use it, first you need to load the correct module.
module load miniforge3
Once this is done, follow these steps to set up the environment.
conda create -n llm-env python=3.11
After creation, activate the environment. The next step will change how the prompt looks; this is normal.
source activate llm-env
Finally, install the required packages.
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate
3. Prepare Your Script
Here’s a minimal example called run_llm.py:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)
# Prepare input
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt").to(device) # Move inputs to same device
# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Make sure your model fits into GPU memory.
Use accelerate or bitsandbytes to reduce the memory footprint; a quantization sketch follows below.
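As a rough sketch of the bitsandbytes route, the snippet below loads the same model with 4-bit quantized weights. It assumes bitsandbytes is installed in your environment (pip install bitsandbytes); the quantization settings shown are just one reasonable choice, and distilgpt2 is kept only as a stand-in, since quantization really pays off for larger models.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization shrinks the weight memory roughly 4x compared to fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained(
    "distilgpt2",
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the weights on the available GPU
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same pattern applies to models that would otherwise not fit into a single GPU slice.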
4. Submit a Slurm Job
Create a job script run_llm.sh:
#!/bin/bash
#SBATCH --job-name=llm-run
# interactive GPU partition
#SBATCH --partition=grete:interactive
# one 1g.20gb GPU slice (a MiG partition of a full GPU)
#SBATCH -G 1g.20gb:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=40G
#SBATCH --time=02:00:00
# %J expands to the Slurm job ID
#SBATCH --output=llm_output_%J.log
# request a node with internet access (e.g. to download models)
#SBATCH -C inet
module load miniforge3
source activate llm-env
python run_llm.py
Submit with:
sbatch run_llm.sh
5. Tips for Scaling
- Use DeepSpeed, FSDP, or accelerate for multi-GPU training (see the sketch after this list).
- For inference, consider model quantization or ONNX export.
- Monitor GPU usage with nvidia-smi.
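As a minimal sketch of the accelerate approach to multi-GPU training: the model, data, and hyperparameters below are dummy placeholders, only the Accelerator calls are the point.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the distributed configuration from the launcher

# Dummy model and data purely for illustration
model = torch.nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
dataloader = DataLoader(dataset, batch_size=64)

# prepare() moves everything to the right device(s) and wraps the model for data parallelism
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # use this instead of loss.backward()
    optimizer.step()

Launched with, for example, accelerate launch train.py inside a Slurm job that requests several GPUs, the same script runs data-parallel across the allocated devices; single-GPU runs work unchanged.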
6. Using GWDG’s LLM Service
If you prefer not to run your own model, GWDG offers a hosted LLM service with a selection of ready-to-use models. See the GWDG LLM Service Overview for details.
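If the hosted service exposes an OpenAI-compatible API (the service overview explains whether and how), it can typically be queried with the openai Python client. Everything below, from the base URL to the model name and key, is a placeholder to be replaced with the values from the service documentation:

from openai import OpenAI

# Placeholder endpoint, key, and model name; take the real values from the
# GWDG LLM Service documentation.
client = OpenAI(
    base_url="https://llm-service.example.gwdg.de/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "Summarize what the Grete cluster is in one sentence."}],
)
print(response.choices[0].message.content)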
7. An example of training a model
The GWDG Academy offers a course on training models on GPUs called Deep Learning with GPU Cores. This course also contains a nice and simple example that can be run quickly. Feel free to check it out and clone the repository:
git clone https://gitlab-ce.gwdg.de/hpc-team-public/deep-learning-with-gpu-cores.git
Support
For help, contact the HPC support team at the indicated support addresses.