# Interactive Jobs
## What is an interactive job? Why use interactive jobs?
An interactive job requests resources from a partition and immediately opens a session on the assigned nodes so you can work interactively. This is usually done on specially designated interactive or test partitions, which have little to no wait time (but also usually low maximum resource allocations and short maximum session times), so your session can start immediately. You can also attempt this on normal partitions, but then you have to be present at the terminal whenever your job actually starts.
There are multiple use cases for interactive jobs:
- Performing trial runs of a program that should not be done on a login node. Remember, login nodes are in principle just for logging in!
- Testing a new setup, for example a new Conda configuration or a Snakemake workflow, in a realistic node environment. This saves you from waiting in the queue of a regular partition only for your job to fail because the wrong packages were loaded.
- Testing a new submission script or SLURM configuration.
- Running heavy installation or compilation jobs or Apptainer container builds.
- Running small jobs that don’t have large resource requirements, thus reducing waiting time.
- Doing quick benchmarks to determine the best partition to run further computations in (e.g. is the code just as fast on Emmy Phase 2 nodes as on Emmy Phase 3 nodes?).
- Testing resource allocation, which can sometimes be tricky, particularly for GPUs. Start a GPU interactive job and test whether your session can see the number and type of GPUs you expected with `nvidia-smi`. The same goes for other resources such as CPU count and RAM allocation (see the sketch after this list). Do remember that interactive partitions usually have low resource maximums or use older hardware, so this testing is not perfect!
- Running the rare interactive-only tools and programs.
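For the resource-allocation check mentioned above, a minimal sketch of what to run inside the interactive session could look like this (which tools are available depends on the node image):

```bash
# Run inside the interactive job to verify the allocation:
nvidia-smi     # number, type, and memory of the visible GPUs
nproc          # number of CPU cores visible to this job
free -h        # total host RAM of the node (not the per-job limit)

# What SLURM actually granted to this job:
scontrol show job "$SLURM_JOB_ID" | grep -E 'NumCPUs|TRES'
```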
The Jupyter-HPC service is also provided for full graphical interactive JupyterHub, RStudio, IDE, and Desktop sessions.
## How to start an interactive job
To start a (proper) interactive job:

```bash
srun -p jupyter --pty -n 1 -c 16 bash
```
This will block your terminal while the job starts, which should happen within a few minutes. If this is taking too long, or the request is denied (for example, because you used the wrong partition or exceeded its resource limits), you can abort the request with `Ctrl-c`.
In the above command:

- `-p` starts the job in the designated partition (here, `jupyter`)
- `--pty` runs the job in pseudo-terminal mode (critical for interactive shells)
- `-n 1 -c 16` are the usual SLURM resource allocation options (here, one task with 16 cores)
- `bash` is the command to run, starting a shell session
Don’t forget to specify the time limit with `-t LIMIT` if the job will be short, so that it is more likely to start earlier. `LIMIT` can be in the form of `MINUTES`, `HOURS:MINUTES:SECONDS`, `DAYS-HOURS`, etc. (see the `srun` man page for all available formats). This is especially important on partitions not specialized for interactive and test jobs. If your time limit is less than or equal to 2 hours, you can also add `--qos=2h` to use the 2 hour QOS and further reduce the likely wait time.
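For example, a short interactive job combining a tight time limit with the 2 hour QOS could look like the following sketch (`standard96` is just the partition used in the examples below; substitute your own):

```bash
# Request a 30-minute interactive shell; the short limit and the 2h QOS
# make an early start more likely on a busy partition.
srun -p standard96 -t 30 --qos=2h --pty -n 1 -c 16 bash
```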
If you want to run a GUI application in the interactive job, you will almost certainly need X11 forwarding. To do that, you must have SSH-ed into the login node with X11 forwarding enabled and add the `--x11` option to the `srun` command for starting the interactive job. This forwards X11 from the interactive job all the way to your machine via the login node. Though, in some cases, it might make more sense to use the Jupyter-HPC service instead.
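A minimal sketch of the two steps (the login node address is a placeholder; use your cluster’s actual hostname):

```bash
# On your machine: connect to the login node with X11 forwarding.
ssh -X u12345@glogin.example.org

# On the login node: add --x11 to the srun command.
srun -p jupyter --x11 --pty -n 1 -c 16 bash
```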
You should see something like the following after your command. Notice how the command line prompt (unless you have customized it) changes after the job starts up and logs you into the node:
```
u12345@glogin5 ~ $ srun -p standard96:test --pty -n 1 -c 16 bash
srun: job 6892631 queued and waiting for resources
srun: job 6892631 has been allocated resources
u12345@gcn2020 ~ $
```
To stop an interactive session and return to the login node:

```bash
exit
```
## Which partitions are interactive?
Any partition can be used interactively if it is empty enough, but some are specialized for interactive use, with shorter wait times, and are thus better suited for interactive jobs. These interactive partitions can change as new partitions are added or retired. Check the list of partitions for the most current information. Partitions whose names match the following patterns are specialized for shorter wait times:
- `*:interactive` and `*:test`, which have shorter maximum job times
- `jupyter*`, which is shared with the Jupyter-HPC service and is overprovisioned (e.g. your job may share cores with other jobs)
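One way to see which partitions currently match these patterns (a sketch; adjust the pattern to your cluster’s naming) is to list all partition names and filter them:

```bash
# Print each partition name once, keeping only the interactive-style ones.
sinfo -h -o "%P" | sort -u | grep -E 'interactive|test|jupyter'
```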
You can also just look for other partitions with nodes in the `idle` state (or mixed nodes, if your job doesn’t require a full node on a shared partition) with `sinfo -p PARTITION`. For example, if we check the `scc-gpu` partition:
```
[scc_agc_test_accounts] u12283@glogin6 ~ $ sinfo -p scc-gpu
PARTITION AVAIL  TIMELIMIT NODES STATE  NODELIST
scc-gpu      up 2-00:00:00     1 inval  ggpu194
scc-gpu      up 2-00:00:00     3 mix-   ggpu[135,138,237]
scc-gpu      up 2-00:00:00     1 plnd   ggpu145
scc-gpu      up 2-00:00:00     1 down*  ggpu150
scc-gpu      up 2-00:00:00     1 comp   ggpu199
scc-gpu      up 2-00:00:00     1 drain  ggpu152
scc-gpu      up 2-00:00:00     1 resv   ggpu140
scc-gpu      up 2-00:00:00    11 mix    ggpu[139,141,147-149,153-155,195-196,212]
scc-gpu      up 2-00:00:00     6 alloc  ggpu[136,142-144,146,211]
scc-gpu      up 2-00:00:00     4 idle   ggpu[151,156,197-198]
```
we can see that there are 4 idle nodes and 11 mixed nodes. This means that an interactive job using a single node should start rather quickly, particularly if it only requires part of a node, since then one of the mixed nodes might be able to run it too.
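If you only care about nodes an interactive job could start on right away, you can also filter by node state directly (a minimal sketch using `sinfo`’s state filter):

```bash
# Show only the idle and partially allocated (mixed) nodes of the partition.
sinfo -p scc-gpu -t idle,mix
```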
## Pseudo-interactive jobs
If you have a job currently running on a given node, you can actually SSH into that node. This can be useful for debugging and checking on your programs and workflows. For example, you can check the live GPU load with `nvidia-smi` or monitor the CPU processes and the host memory allocation with `btop`. Some of these checks are easier and more informative when performed live rather than through after-job reports such as the job output files or `sacct`.
```
u12345@glogin5 ~ $ squeue --me
  JOBID PARTITION   NAME    USER   ACCOUNT   STATE  TIME NODES NODELIST(REASON)
6892631 standard96  bash  u12345 myaccount RUNNING 11:33     1 gcn2020
u12345@glogin5 ~ $ ssh gcn2020
u12345@gcn2020 ~ $
```
If you try this on a node where you don’t currently have a job running, it will fail, since the node’s resources have not been allocated to your user!
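You can also run a single monitoring command on the node over SSH without opening a full shell there, for example (assuming the running job from above):

```bash
# Run nvidia-smi on the compute node and print its output locally.
ssh gcn2020 nvidia-smi
```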
## GPU Interactive Jobs
See: GPU Usage.