Resource Monitoring and Reports

When debugging and optimizing your application, it is important to know what is actually happening on the compute node. Most information has to be collected during the runtime of the job, but a few key metrics are still available after a job has finished.

During Runtime

While a job is running, you can use ssh to log in to the node(s) allocated to your job. Use squeue --me to see which nodes your job is running on. Once logged in to the compute node, you can inspect the resource usage with standard Linux commands such as htop, ps, or free. Please keep in mind that most commands will show ALL resources of the node, not just those allocated to you.
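
For example, a session along the following lines sketches the typical workflow (the job ID, node name, partition, and prompts are only placeholders; your own squeue output will look different):

# Find out on which node(s) your job is running
gwdu101:42 10:15:00 ~ > squeue --me
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12701482    medium  myjob.sh   jdoe42  R       0:42      1 dge001

# Log in to the allocated node and watch your processes interactively
gwdu101:43 10:15:10 ~ > ssh dge001
dge001:1 10:15:20 ~ > htop -u $USER   # limit the view to your own processes
dge001:2 10:16:05 ~ > free -h         # overall memory usage of the whole node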

After the Job Finished / Reports

To get resource usage information about your job after it has finished, you can use the tool reportseff. This tool queries Slurm for your allocated resources and compares them to the resources actually used (as reported by Slurm). All the information reportseff uses for its reports can also be obtained manually with sacct; reportseff just collates it in a readable fashion.
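
If you want to look at the raw accounting data yourself, an sacct call along these lines retrieves the kind of fields reportseff builds on (the job ID is a placeholder and the field selection is just one reasonable choice):

# Raw accounting data for a single job, straight from Slurm
gwdu101:122 14:01:00 ~ > sacct -j 12701482 --format=JobID,State,Elapsed,Timelimit,AllocCPUS,TotalCPU,ReqMem,MaxRSS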

A reportseff report gives you a good overview of your usage: Did I make use of all my cores and all my memory? Was my time limit too long?

Usage example:

# Display your recent jobs
gwdu101:121 14:00:00 ~ > module load py-reportseff
gwdu101:121 14:00:05 ~ > reportseff -u $USER
     JobID    State       Elapsed  TimeEff   CPUEff   MemEff
  12671730  COMPLETED    00:00:01   0.0%      ---      0.0%
  12671731  COMPLETED    00:00:00   0.0%      ---      0.0%
  12701482  CANCELLED    00:04:20   7.2%     49.6%     0.0%

# Query a specific job ID:
gwdu102:29 14:07:17 ~ > reportseff 12701482
     JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
  12701482  CANCELLED    00:04:20   7.2%     49.6%     0.0%  

As you can see in the example, the job ran for only 4 minutes and 20 seconds of the 1 hour allocated, resulting in a time efficiency of 7.2%. Only half of the allocated cores were used (two allocated, one used), and essentially none of the allocated memory. For the next similar job, we should reduce the time limit, request one core fewer, and definitely request less memory.
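
Based on these numbers, the next submission of a comparable job could request resources roughly like this (the concrete values below are only an illustration and have to be adapted to your actual workload):

#SBATCH --time=00:10:00      # comfortable margin above the observed 4:20 runtime
#SBATCH --cpus-per-task=1    # only one core was actually busy
#SBATCH --mem=2G             # choose a value close to the memory usage you observed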