Interactive jobs are mostly useful for testing purposes, as they let you experiment with executing your programs and immediately see any resulting error messages.

You can start an interactive job by running srun with the --pty flag and a shell as the command to execute, e.g.:

srun -n 1 -c 16 -t 1:00:00 --pty /bin/bash 

The interactive job is put in the queue, like any other batch job. Note therefore that, depending on the load of the system, it can take a while before your interactive session starts. Once a node is available, the terminal in which you submitted the interactive job automatically starts a bash shell on the node allocated to you. You can then interactively execute commands within this bash instance on the node.
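
A sketch of what this looks like in practice; the job ID, node name (tcn21) and prompts below are illustrative:

# Submit the interactive job from a login node
snellius paulm@int3 10:00 ~$ srun -n 1 -c 16 -t 1:00:00 --pty /bin/bash
srun: job 1572100 queued and waiting for resources
srun: job 1572100 has been allocated resources

# Once the job starts, we are in a bash shell on the allocated compute node
snellius paulm@tcn21 10:01 ~$ hostname
tcn21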

As with regular batch jobs, the walltime determines how long the node is reserved. When the walltime expires, you are logged out automatically and your terminal returns to the login node. If you log out before the walltime expires, the interactive job finishes automatically; so if you submitted an interactive job of 1 hour but log out after 5 minutes, your budget is only charged for 5 minutes.
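
For example, logging out of the interactive shell before the walltime expires ends the job and returns you to the login node (prompts again illustrative):

# Leave the interactive shell on the compute node; the terminal returns to the login node
snellius paulm@tcn21 10:05 ~$ exit
snellius paulm@int3 10:05 ~$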

If desired, you can request multiple nodes. For example, assuming the nodes you are requesting have 16 CPU cores each, the following requests 32 tasks and therefore two nodes:

srun -n 32 -t 4:00:00 -W 0 --pty /bin/bash

This may be useful when you are debugging a job script or software that runs on multiple nodes. At the start of your interactive session you are logged in to one of the two nodes allocated to you, but you can inspect the SLURM_JOB_NODELIST environment variable to check which other nodes were allocated. You can then use ssh to log in to one of these other nodes, as shown below.
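
A sketch of how to inspect the node list and reach another node in the allocation; the node names and prompts below are illustrative:

# Inside the interactive session: show the compact node list of this job
snellius paulm@tcn21 10:10 ~$ echo $SLURM_JOB_NODELIST
tcn[21-22]

# Expand the compact list into individual hostnames
snellius paulm@tcn21 10:10 ~$ scontrol show hostnames $SLURM_JOB_NODELIST
tcn21
tcn22

# Log in to the other node of the allocation
snellius paulm@tcn21 10:10 ~$ ssh tcn22
snellius paulm@tcn22 10:10 ~$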

Using salloc

Alternatively, SLURM provides a way to allocate resources with the salloc command, e.g.:

salloc -n 32 -t 4:00:00

This command will allocate the requested resources for the specified time. Once the allocation has been granted, you can find the hostname of the reserved nodes with the squeue command:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            111111       gpu interact   user1  R      20:30      1 tcn21

and log in to the node to work there interactively for the specified time. For example, for the case above:

$ ssh tcn21

You can read more about the salloc command and its available flags in the SLURM documentation.

salloc starts a new shell

The salloc command normally takes a user command to execute in the allocation, e.g. salloc -t 2:00:00 -p thin ./mycomputation. If you do not provide such a command then salloc will start your default shell instead:

# Case 1: salloc with a command to execute

# Note the "tcn426" line, the output of "srun hostname" on the allocated node
snellius paulm@int3 12:36 ~$ salloc -t 00:00:05 -p thin srun hostname
salloc: Pending job allocation 1572110
salloc: job 1572110 queued and waiting for resources
salloc: job 1572110 has been allocated resources
salloc: Granted job allocation 1572110
salloc: Waiting for resource configuration
salloc: Nodes tcn426 are ready for job
tcn426
salloc: Relinquishing job allocation 1572110
salloc: Job allocation 1572110 has been revoked.

# salloc is done at this point, and its process is no longer running
snellius paulm@int3 12:39 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3313407  0.0  0.0  53620  5668 pts/168  R+   12:40   0:00  |           \_ ps faux
paulm    3313408  0.0  0.0  12136  1076 pts/168  S+   12:40   0:00  |           \_ grep --color=auto paulm

# Case 2: salloc without a command to execute

snellius paulm@int3 12:40 ~$ salloc -t 00:00:05 -p thin 
salloc: Pending job allocation 1572114
salloc: job 1572114 queued and waiting for resources
salloc: job 1572114 has been allocated resources
salloc: Granted job allocation 1572114
salloc: Waiting for resource configuration
salloc: Nodes tcn370 are ready for job

# We're still on the int3 login node, but the current shell we're working in
# (process 3316940) has been started by salloc
snellius paulm@int3 12:41 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3314101  0.0  0.0 127380  6728 pts/168  Sl   12:40   0:00  |           \_ salloc -t 00:00:05 -p thin
paulm    3316940  0.3  0.0  22572  7480 pts/168  S    12:41   0:00  |               \_ /bin/bash
paulm    3318849  0.0  0.0  53624  5720 pts/168  R+   12:41   0:00  |                   \_ ps faux
paulm    3318850  0.0  0.0  12140  1148 pts/168  S+   12:41   0:00  |                   \_ grep --color=auto paulm

# Note the output! The hostname command is not executed on int3, but on the allocated node tcn370
snellius paulm@int3 12:41 ~$ srun hostname
tcn370

# As we only requested a very short wallclock time, the allocation quickly ends
snellius paulm@int3 12:41 ~$ salloc: Job 1572114 has exceeded its time limit and its allocation has been revoked.

# Since there's no allocation anymore, the srun command fails
snellius paulm@int3 12:43 ~$ srun hostname
srun: error: Slurm job 1572114 has expired
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 1572114

# However, we're currently still working within the shell (process 3316940) that was started by salloc!
snellius paulm@int3 12:43 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3314101  0.0  0.0 127380  6728 pts/168  Sl   12:40   0:00  |           \_ salloc -t 00:00:05 -p thin
paulm    3316940  0.0  0.0  22572  7484 pts/168  S    12:41   0:00  |               \_ /bin/bash
paulm    3327419  0.0  0.0  53624  5664 pts/168  R+   12:43   0:00  |                   \_ ps faux
paulm    3327420  0.0  0.0  12140  1080 pts/168  S+   12:43   0:00  |                   \_ grep --color=auto paulm

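# Any process we start now (e.g. the background sleep below) runs as a child of
# the shell created by salloc, not of the original login shell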
snellius paulm@int3 12:51 ~$ sleep 10 &
[1] 3367698

snellius paulm@int3 12:51 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3314101  0.0  0.0 127380  6728 pts/168  Sl   12:40   0:00  |           \_ salloc -t 00:00:05 -p thin
paulm    3316940  0.0  0.0  22572  7484 pts/168  S    12:41   0:00  |               \_ /bin/bash
paulm    3367698  0.0  0.0   7312   904 pts/168  S    12:51   0:00  |                   \_ sleep 10 <-------------------
paulm    3367994  0.0  0.0  53624  5696 pts/168  R+   12:51   0:00  |                   \_ ps faux
paulm    3367995  0.0  0.0  12140  1048 pts/168  S+   12:51   0:00  |                   \_ grep --color=auto paulm

So if you run salloc without a command, you need to type exit after the allocation expires to stop the shell launched by salloc. Otherwise, when repeatedly using salloc, you might end up keeping alive a whole hierarchy of salloc processes.
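
For example, cleaning up such a leftover shell, continuing the session above (the timestamp is illustrative):

# Exit the shell that salloc started; this returns us to the original login shell (process 3289621 above)
snellius paulm@int3 12:52 ~$ exit
snellius paulm@int3 12:52 ~$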
