Self-Resubmitting jobs

Do your jobs need more time than the time limit allows? Do you regularly open tickets to extend your jobs' run time, or wait longer in the queue because you use QOS=long? Self-resubmitting jobs might be a solution for you!

The requirement is that the program you are running produces checkpoints, or can be updated to do so, and is able to restart from any checkpoint after a forced stop. Many programs, such as GROMACS or OpenFOAM, already have checkpointing options. Turn those on and update your batch script so that it resubmits itself to continue the run.
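What "restart from a checkpoint" means can be sketched with a toy example. The function below is a hypothetical stand-in for a real computation, not part of any program mentioned here: it counts steps, records the last completed step in a checkpoint file, and resumes from that file when called again. The budget argument is an assumption that mimics a job hitting its time limit after a fixed number of steps.

```shell
#!/bin/bash
# Hypothetical stand-in for a restartable computation.
# Usage: run_steps CHECKPOINT_FILE TOTAL_STEPS STEPS_PER_RUN
run_steps() {
    local checkpoint=$1 total=$2 budget=$3
    local step=0 done_now=0

    # Resume from the last completed step if a checkpoint exists.
    if [ -f "$checkpoint" ]; then
        step=$(cat "$checkpoint")
    fi

    while [ "$step" -lt "$total" ] && [ "$done_now" -lt "$budget" ]; do
        step=$((step + 1))
        done_now=$((done_now + 1))
        # Write to a temporary file first, then rename it, so that a kill
        # during the write never leaves a half-written checkpoint behind.
        echo "$step" > "$checkpoint.tmp"
        mv "$checkpoint.tmp" "$checkpoint"
    done

    # Report the last completed step.
    echo "$step"
}
```

Calling run_steps repeatedly with the same checkpoint file continues where the previous call stopped, which is the behavior a self-resubmitting job relies on.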

Note

In general, it is very important that your jobs write checkpoints. No one can promise 100% availability of a node, and every job that stops due to a failed node is wasted energy. We strive for very high energy efficiency and urge every user to be able to recover from a failed job without issues. Even short jobs that fail are a waste of energy if the run needs to be repeated!

The two examples below run a Python script called big-computation.py. In the first example, the script creates a checkpoint every 10 minutes and can restart from the last checkpoint file. In the second example, it continuously writes to an output file, which can be copied into a checkpoint and used to restart the computation.

#!/bin/bash
#SBATCH -p standard96
#SBATCH -t 12:00:00
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -o output_%j.txt
#SBATCH -e error_%j.txt

if [ ! -f "finished" ] ; then
	sbatch --dependency=afterany:$SLURM_JOBID resub_job.sh
else
	exit 0
fi

# big-computation creates automatic checkpoints
# and automatically starts from the most recent 
# checkpoint
srun ./big-computation.py --checkpoint-time=10min --checkpoint-file=checkpoint.dat input.dat

# If big-computation.py is not canceled due to time out
# write the finished file to stop the loop.
touch finished

#!/bin/bash
#SBATCH -p standard96
#SBATCH -t 12:00:00
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -o output_%j.txt
#SBATCH -e error_%j.txt
#SBATCH --signal=B:10@300
# Send signal 10 (SIGUSR1) to the batch shell 5 minutes before the time limit

trap 'pkill -15 -f big-computation.py ; cp output.dat checkpoint.dat ; exit 1' SIGUSR1

# --parsable makes sbatch print only the job ID, so it can be passed to scancel
JOBID=$(sbatch --parsable --dependency=afternotok:$SLURM_JOBID resub_job_trap.sh)

if [ -f "checkpoint.dat" ] ; then
	INPUT=checkpoint.dat
else
	INPUT=input.dat
fi

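# Start from the last checkpoint if there is one, otherwise from the
# original input. The program runs in the background because bash only
# handles a trapped signal once a foreground command has returned.
srun ./big-computation.py $INPUT &
wait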

scancel $JOBID
exit 0

Both scripts rely on the --dependency flag of the sbatch command. The first example runs until the scheduler kills the process. Because of the afterany dependency, the next job starts once this one has ended, whether it was killed or finished normally. Only when the computation finishes successfully is a file called finished written, which breaks the loop.

Because the first script always submits its successor before it starts the computation, one more job is still queued when the computation finishes. That job starts, finds the finished file, and exits immediately.
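The chain of jobs produced by the first script can be traced with a dry run. The function below is only a sketch and submits nothing: the sbatch call and the finished file are replaced by shell variables, and the amount of work per job is a made-up number that stands in for the time limit.

```shell
#!/bin/bash
# Dry run of the resubmission chain; no jobs are submitted.
# Usage: simulate_chain WORK_PER_JOB TOTAL_WORK
simulate_chain() {
    local per_job=$1 total=$2
    local done_work=0 finished=no job=0
    local queued=1              # the first job is submitted by hand

    while [ "$queued" -eq 1 ]; do
        queued=0
        job=$((job + 1))

        # Same check as at the top of the batch script.
        if [ "$finished" = "no" ]; then
            queued=1            # stands in for: sbatch --dependency=afterany:...
        else
            echo "job $job: finished marker found, exiting"
            continue
        fi

        done_work=$((done_work + per_job))
        if [ "$done_work" -ge "$total" ]; then
            finished=yes        # stands in for: touch finished
            echo "job $job: computation complete"
        else
            echo "job $job: hit time limit"
        fi
    done
}

simulate_chain 12 30
```

With 12 units of work per job and 30 units in total, the third job completes the computation, and a fourth job still starts only to find the marker and exit.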

The second script traps a signal to stop the computation. Observe the option #SBATCH --signal=B:10@300, which tells the scheduler to send the signal SIGUSR1 (10) to the batch shell 5 minutes before the job is killed due to the time limit. The trap command catches this signal, stops the program with the SIGTERM (15) signal, copies the output file into a checkpoint from which the computation can resume, and exits the script with code 1 (an error code). The afternotok dependency in the resubmitting sbatch command makes the next job start only if this job exited with an error code. Once started, the program uses either the last checkpoint file or, if there is none, the input file as its input, and the script only exits successfully if the computation finishes.
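The signal handling can be tried without a scheduler. The following sketch makes several assumptions: sleep stands in for the long computation, a background subshell plays the role of Slurm by sending SIGUSR1 after one second, and the files live in a temporary directory. The computation is started in the background and the script blocks in wait, because bash defers a trap handler until a foreground command has returned, whereas wait returns as soon as a trapped signal arrives.

```shell
#!/bin/bash
workdir=$(mktemp -d)
echo "partial result" > "$workdir/output.dat"
interrupted=no

on_usr1() {
    # Stop the computation, then turn the current output into a checkpoint.
    kill -TERM "$child" 2>/dev/null
    cp "$workdir/output.dat" "$workdir/checkpoint.dat"
    interrupted=yes
}
trap on_usr1 USR1

# Stand-in for the long computation, started in the background.
sleep 30 &
child=$!

# Stand-in for the scheduler: signal this shell after one second.
( sleep 1; kill -USR1 $$ ) &

# wait is interrupted by the trapped signal and the handler runs right after.
wait "$child" || true

echo "interrupted: $interrupted"
```

In a real batch script the handler would end with exit 1, so that a dependent job submitted with afternotok is released.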

Without further action, the second script would leave the last submitted job in the queue even after the previous job finished successfully. It would remain pending with reason DependencyNeverSatisfied until it is canceled manually with the scancel command. Therefore, the job script saves the ID of the submitted job and cancels it as soon as the program exits normally.
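For scancel to work, the saved variable must hold only the numeric job ID. By default sbatch prints a line of the form "Submitted batch job 12345", while sbatch --parsable prints only the ID, optionally followed by a semicolon and the cluster name. The helper below is a hypothetical sketch (extract_jobid is not a Slurm command) that reduces either form to the bare ID.

```shell
#!/bin/bash
# Hypothetical helper: reduce the text printed by sbatch to the bare job ID.
extract_jobid() {
    local out=$1
    out=${out##* }      # drop everything up to the last space, if any
    out=${out%%;*}      # drop a trailing ";cluster" part, if any
    echo "$out"
}

extract_jobid "Submitted batch job 12345"
extract_jobid "12345;cluster"
```

Both calls print 12345.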

Program specific examples

These examples take the ideas explored above and apply them to specific programs.

#!/bin/bash
#SBATCH -p standard96
#SBATCH -t 12:00:00
#SBATCH -N 1
#SBATCH -c 4
#SBATCH -o output_%j.txt
#SBATCH -e error_%j.txt

# --parsable makes sbatch print only the job ID, so it can be passed to scancel
JOBID=$(sbatch --parsable --dependency=afternotok:$SLURM_JOBID resub_job_gromacs.sh)

module load impi
module load gromacs

mpirun -np 1 gmx_mpi mdrun -s input.tpr -cpi checkpoint.cpt

scancel $JOBID
exit 0