VTune Profiler (extended workshop version)
Quick performance evaluation with the VTune Application Performance Snapshot (APS), or detailed hotspot, memory, or threading analysis with VTune Profiler.
First load the environment module:
module load vtune/XXXX
Intro:
- Presentation
- Getting started

Manuals:
- Intel_APS.pdf
- VTune_features_for_HPC.pdf
- vtune -help
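As a quick sanity check after loading the module, the following sketch can be used (the version number is only an example; pick one listed by module avail vtune):
module load vtune/2022
vtune -version        # confirm the command-line interface is available
vtune -help collect   # list the analysis types this installation offers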
Run VTune via the command line interface
Run your application with the APS or VTune wrapper as follows (see the Intel docs):
mpirun -np 4 aps ./path-to_your/app.exe args_of_your_app
# after completion, explore the results:
aps-report aps_result_*
mpirun -np 4 vtune -collect hotspots -result-dir vtune_hotspot ./path-to_your/app.exe args_of_your_app
# after completion, explore the results:
vtune -report summary -r vtune_*
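For non-interactive profiling, the same commands can go into a Slurm batch job. A minimal sketch, assuming the partition, module versions, and rank count used elsewhere on this page (adjust them to your project and site):
#!/bin/bash
#SBATCH --partition=standard96:test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:20:00

module load intel/19.0.5 impi/2019.9 vtune/2022

# hotspot collection for all MPI ranks
mpirun -np 4 vtune -collect hotspots -result-dir vtune_hotspot ./path-to_your/app.exe args_of_your_app

# plain-text summary once the collection has finished
vtune -report summary -r vtune_hotspot*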
Run VTune-GUI (not recommended)
Login with X-window support (ssh -X) and then start:
vtune-gui
Run VTune-GUI remotely on your local browser (recommended)
Login to the supercomputer with local port forwarding and start your VTune server on an exclusive compute node (1h job):
ssh -L 127.0.0.1:55055:127.0.0.1:55055 glogin.hlrn.de
salloc -p standard96:test -t 01:00:00
ssh -L 127.0.0.1:55055:127.0.0.1:55055 $SLURM_NODELIST
module add intel/19.0.5 impi/2019.9 vtune/2022
vtune-backend --web-port=55055 --enable-server-profiling &
Open 127.0.0.1:55055 in your browser (allow the security exception and set an initial password).
In 1st “Welcome” VTune tab (run MPI parallel Performance Snapshot):
Click: Configure Analysis
-> Set application: /path-to-your-application/program.exe
-> Check: Use app. dir. as work dir.
-> In case of MPI parallelism, expand “Advanced”: keep defaults but paste the following wrapper script and check “Trace MPI”:
#!/bin/bash
# Run VTune collector (here with 4 MPI ranks)
mpirun -np 4 "$@"
Under HOW, run: Performance Snapshot.
(After completion/result finalization a 2nd result tab opens automatically.)
In 2nd “r0…” VTune tab (explore Performance Snapshot results):
-> Here you find several analysis results, e.g. the HPC Performance Characterization.
-> Under Performance Snapshot, depending on the snapshot outcome, VTune suggests more detailed follow-up analysis types (shown with percentages in the snapshot summary).
--> For example, select/run a Hotspot analysis:
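The suggested follow-up analyses can also be launched from the command line instead of the GUI. A sketch using analysis types listed by vtune -help collect (the result-directory names are arbitrary):
mpirun -np 4 vtune -collect hotspots -result-dir vtune_hs ./path-to_your/app.exe args_of_your_app
mpirun -np 4 vtune -collect memory-access -result-dir vtune_ma ./path-to_your/app.exe args_of_your_app
mpirun -np 4 vtune -collect threading -result-dir vtune_thr ./path-to_your/app.exe args_of_your_app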
In 3rd “r0…” VTune tab (Hotspot analysis):
-> Expand sub-tab **Top-down Tree**
--> In **Function Stack** expand the "_start" function and keep expanding down to the "main" function (the first one with an entry in the Source File column)
--> In the Source File column, double-click on "filename.c" of the "main" function
-> In the new sub-tab "filename.c", scroll down to the line with the largest "CPU Time: Total" to find the hotspots in the main function
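The same drill-down also works without the browser GUI. A sketch using the command-line report action on an existing result directory (name taken from the -result-dir example above):
vtune -report hotspots -r vtune_hotspot*   # function-level hotspots
vtune -report top-down -r vtune_hotspot*   # caller/callee tree, comparable to the Top-down Tree sub-tab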
To quit the profiling session, press “Exit” in the VTune “Hamburger Menu” (the upper-left symbol of three horizontal bars). Then close the browser page. Leave your compute node via CTRL+D and cancel your interactive job:
squeue -l -u $USER
scancel <your-job-id>
Example
Performance optimization of a simple MPI-parallel program
/*
 Matthias Laeuter, Zuse Institute Berlin
 v00 24.03.2023
 Lewin Stein, Zuse Institute Berlin
 v01 26.04.2023
 Compile example: mpiicc -cc=icx -std=c99 prog.c
*/
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <math.h>

#define DIMENSION 1

int main(int argc, char *argv[])
{
    const int rep = 10;
    int size;
    int rank;
    int xbegin, xend;

    MPI_Init(NULL,NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // read input data
    int dofnum[DIMENSION];
    if( argc < 2 )
    {
        for(int i=0; i<DIMENSION; i++) dofnum[i] = 10000000;
    }
    else
    {
        assert( argc == DIMENSION + 1 );
        for(int i=0; i<DIMENSION; i++) dofnum[i] = atoi(argv[i+1]);
    }
    const double dx = 1. / (double) dofnum[0];

    // domain decomposition
    xbegin = (dofnum[0] / size) * rank;
    xend   = (dofnum[0] / size) * (rank+1);
    if( rank+1 == size ) xend = dofnum[0];

    // memory allocation
    const int ldofsize = xend-xbegin;
    double* xvec = malloc(ldofsize * sizeof(double));
    for(int d=0; d<ldofsize; d++) xvec[d] = 0.0;

    // integration
    size_t mem = 0;
    size_t flop = 0;
    const double start = MPI_Wtime();
    double gsum;
    for(int r=0; r<rep; r++) {
        for(int i=xbegin; i<xend; i++) {
            const double x = dx * i;
            const int loci = i-xbegin;
            xvec[loci] = sin(x);
        }
        double sum = 0.;
        for(int d=0; d<ldofsize; d++)
            sum += xvec[d] * dx;
        MPI_Reduce(&sum,&gsum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    }
    const double reptime = MPI_Wtime() - start;
    const double deltatime = reptime / rep;
    mem += ldofsize * sizeof(double);
    flop += 2*ldofsize;

    // memory release
    free( xvec );

    // profiling
    double maxtime = 0.0;
    MPI_Reduce(&deltatime,&maxtime,1,MPI_DOUBLE,MPI_MAX,0,MPI_COMM_WORLD);
    const double locgb = ( (double) mem ) / 1024. / 1024. / 1024.;
    double gb = 0.0;
    MPI_Reduce(&locgb,&gb,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    const double locgflop = ( (double) flop ) / 1024. / 1024. / 1024.;
    double gflop = 0.0;
    MPI_Reduce(&locgflop,&gflop,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);

    // output
    if( rank == 0 ) {
        printf("ranks=%i\n",size);
        printf("int_0^1sinX=%.10f\n",1.0-cos(1.0));
        printf("integral=%.10f dofnum=%i ",gsum,dofnum[0]);
        printf("nt=%i T=%0.4e GB=%0.4e Gflops=%0.4e\n",size,maxtime,gb,gflop/maxtime);
    }

    MPI_Finalize();
    return 0;
}
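For orientation when reading the program output: the inner loops build a left Riemann sum of sin(x) over [0,1] with step dx = 1/dofnum[0], so the printed integral should approach the analytic reference that the program prints as int_0^1sinX:
\int_0^1 \sin x \, dx = 1 - \cos 1 \approx 0.4597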
# gnu compilation (the -lm flag links the math library)
ml -impi -intel gcc/9.3.0 openmpi/gcc.9/4.1.4
mpicc -lm -g -O0 integration.c -o prog.bin
mpicc -lm -g -O2 integration.c -o progAVX.bin
# intel compilation
ml -openmpi -gcc intel/2022.2.1 impi/2021.7.1
mpiicc -cc=icx -std=c99 -g -O0 integration.c -o iprog.bin
mpiicc -cc=icx -std=c99 -g -O2 integration.c -o iprogAVX.bin
mpiicc -cc=icx -std=c99 -g -O2 -xHost integration.c -o iprogAVXxHost.bin
Test run with MPI_Wtime() timings:
| export & module load (environment) | executed (on standard96) | reference (min. s) | reference (max. s) |
|---|---|---|---|
| gcc/9.3.0 openmpi/gcc.9/4.1.4 | mpirun -n 4 ./prog.bin | 0.072 | 0.075 |
| | mpirun -n 4 ./progAVX.bin | 0.040 | 0.043 |
| intel/2022.2.1 impi/2021.7.1 | mpirun -n 4 ./iprog.bin | 0.034 | 0.039 |
| intel/2022.2.1 impi/2021.7.1 I_MPI_DEBUG=3 | mpirun -n 4 ./iprog.bin | 0.034 | 0.039 |
| | mpirun -n 4 ./iprogAVX.bin | 0.008 | 0.011 |
| intel/2022.2.1 impi/2021.7.1 I_MPI_DEBUG=3 SLURM_CPU_BIND=none I_MPI_PIN_DOMAIN=core | I_MPI_PIN_ORDER=scatter mpirun -n 4 ./iprogAVX.bin | 0.009 | 0.011 |
| | I_MPI_PIN_ORDER=compact mpirun -n 4 ./iprogAVX.bin | 0.009 | 0.009 (0.011) |
| | I_MPI_PIN_ORDER=compact mpirun -n 4 ./iprogAVXxHost.bin | 0.0058 | 0.0060 (0.0067) |
| intel/2022.2.1 impi/2021.7.1 I_MPI_DEBUG=3 | mpirun -n 4 ./iprogAVXxHost.bin | 0.0052 | 0.0066 |
The process-to-core pinning during MPI startup is printed because the I_MPI_DEBUG level is ≥ 3. During runtime the core load can be watched live with htop.
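To reproduce the pinned runs from the table interactively, the pinning variables can simply be exported before mpirun. A sketch using the settings from the table rows above (the time limit is just an example):
salloc -p standard96:test -t 00:20:00
ssh $SLURM_NODELIST                # jump onto the allocated compute node
module add intel/2022.2.1 impi/2021.7.1
export I_MPI_DEBUG=3               # print the process-to-core pinning at startup
export SLURM_CPU_BIND=none         # hand pinning control over to Intel MPI
export I_MPI_PIN_DOMAIN=core
export I_MPI_PIN_ORDER=compact
mpirun -n 4 ./iprogAVXxHost.bin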
VTune’s Performance Snapshot:
| export & module load (environment) | executed (on standard96) | vectorization | threading (e.g. OpenMP) | memory bound |
|---|---|---|---|---|
| intel/2022.2.1 impi/2021.7.1 vtune/2022 | mpirun -n 4 ./iprog.bin | 0.0% | 1.0% | 33.5% |
| | mpirun -n 4 ./iprogAVX.bin | 97.3% | 0.9% | 44.4% |
| | mpirun -n 4 ./iprogAVXxHost.bin | 99.9% | 0.8% | 49.0% |
| intel/2022.2.1 impi/2021.7.1 vtune/2022 SLURM_CPU_BIND=none I_MPI_PIN_DOMAIN=core | I_MPI_PIN_ORDER=compact mpirun -n 4 ./iprogAVXxHost.bin | 99.9% | 0.4% | 6.3% |
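The snapshot metrics can also be collected without the GUI by running the example under APS on the command line. A sketch with the modules from the table above:
module add intel/2022.2.1 impi/2021.7.1 vtune/2022
mpirun -n 4 aps ./iprogAVXxHost.bin
aps-report aps_result_*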