Resource Monitoring and Reports

When debugging and optimizing your application, it is important to know what is actually happening on the compute node. Most information has to be collected during the runtime of the job, but a few key metrics are still available after a job has finished.

During Runtime

While a job is running, you can use ssh to log in to the node(s) allocated to your job. Use squeue --me to see which nodes your job is running on. Once logged in to the compute node, you can inspect the resource usage with standard Linux commands such as htop, ps, or free. Please keep in mind that most commands will show ALL resources of the node, not just those allocated to you.
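
For example, a session along the following lines sketches the typical workflow (the job ID, node name, partition, and prompts are only placeholders; your own squeue output will look different):

# Find out on which node(s) your job is running
gwdu101:42 10:15:00 ~ > squeue --me
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12701482    medium  myjob.sh   jdoe42  R       0:42      1 dge001

# Log in to the allocated node and watch your processes interactively
gwdu101:43 10:15:10 ~ > ssh dge001
dge001:1 10:15:20 ~ > htop -u $USER   # limit the view to your own processes
dge001:2 10:16:05 ~ > free -h         # overall memory usage of the whole node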

After the Job Finished / Reports

To get resource usage information about your job after it has finished, you can use the tool reportseff. This tool queries Slurm for your allocated resources and compares them to the resources actually used (as reported by Slurm). All the information reportseff uses for its reports can also be obtained manually with sacct; reportseff just collates it in a readable fashion.
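
If you want to look at the raw accounting data yourself, an sacct call along these lines retrieves the kind of fields reportseff builds on (the job ID is a placeholder and the field selection is just one reasonable choice):

# Raw accounting data for a single job, straight from Slurm
gwdu101:122 14:01:00 ~ > sacct -j 12701482 --format=JobID,State,Elapsed,Timelimit,AllocCPUS,TotalCPU,ReqMem,MaxRSS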

A reportseff report gives you a good overview of your usage: Did I make use of all my cores and all my memory? Was my time limit too long?

Usage example:

# Display your recent jobs
gwdu101:121 14:00:00 ~ > module load py-reportseff
gwdu101:121 14:00:05 ~ > reportseff -u $USER
     JobID    State       Elapsed  TimeEff   CPUEff   MemEff
  12671730  COMPLETED    00:00:01   0.0%      ---      0.0%
  12671731  COMPLETED    00:00:00   0.0%      ---      0.0%
  12701482  CANCELLED    00:04:20   7.2%     49.6%     0.0%

# Query a specific job ID:
gwdu102:29 14:07:17 ~ > reportseff 12701482
     JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
  12701482  CANCELLED    00:04:20   7.2%     49.6%     0.0%  

As you can see in the example, the job ran for only 4 minutes and 20 seconds of the 1 hour allocated, resulting in a time efficiency of 7.2%. Only half of the allocated cores were used (two allocated, one used), and essentially none of the allocated memory. For the next similar job, we should reduce the time limit, request one core fewer, and definitely request less memory.
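
Based on these numbers, the next submission of a comparable job could request resources roughly like this (the concrete values below are only an illustration and have to be adapted to your actual workload):

#SBATCH --time=00:10:00      # comfortable margin above the observed 4:20 runtime
#SBATCH --cpus-per-task=1    # only one core was actually busy
#SBATCH --mem=2G             # choose a value close to the memory usage you observed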