SLURM

Quickstart – demo

VSC-3/4: script examples/05_submitting_batch_jobs/job.sh:


#!/bin/bash

#SBATCH -J test
#SBATCH -N 1

module purge            # recommended
# module load <modules>

echo 
echo 'Hello from node: '$HOSTNAME
echo 'Number of nodes: '$SLURM_JOB_NUM_NODES
echo 'Tasks per node:  '$SLURM_TASKS_PER_NODE
echo 'Partition used:  '$SLURM_JOB_PARTITION
echo 'Using the nodes: '$SLURM_JOB_NODELIST
echo 
sleep 30 # <do_my_work>

submission:

sbatch job.sh

check what is going on:

squeue -u $USER   ! sq

output:

slurm-<job_id>.out

cancel jobs:

scancel <job_id>
scancel <job_name>
scancel -u $USER

SLURM – basic concepts

Queueing system

  • job/batch script
    • a shell script – starts with #! –
      that does everything needed to run your calculation
    • independent of the queueing system
    • use simple scripts
      max ~50 lines, i.e., put complicated logic elsewhere
    • load modules from scratch
      purge first, then load
  • tell the scheduler where/how to run the job (see the sketch below)
    • number of nodes (or cores)
    • node type (i.e., partition & QOS)
  • the scheduler manages allocation of jobs to compute nodes
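
A minimal sketch of such a batch script, assuming the VSC-4 defaults shown later in this section (QOS and partition mem_0096); the job name, node count, modules and work step are placeholders:

#!/bin/bash

#SBATCH -J my_job                 # job name
#SBATCH -N 2                      # number of nodes
#SBATCH --qos=mem_0096            # node type: QOS ...
#SBATCH --partition=mem_0096      # ... and matching partition

module purge                      # start from a clean module environment
# module load <modules>           # load only what this job needs

# <do_my_work>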

SLURM – account and user

#SBATCH --account=<account>   ! use a specific account/project p7.... (other than default)

SLURM – partition and QOS (quality of service)

#SBATCH --qos=<qos>               ! specify quality of service   ! always provide:
#SBATCH --partition=<partition>   ! specify type of hardware     ! qos & partition
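
For example, to run on the 384 GB VSC-4 nodes listed in the hardware tables below, QOS and partition are given as a matching pair (a sketch based on those tables):

#SBATCH --qos=mem_0384
#SBATCH --partition=mem_0384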


VSC hardware overview

VSC hardware details

VSC-4
QOS (standard)   partition    RAM (GB)   CPU                                Cores   IB (HCA)   #Nodes
mem_0096         mem_0096*          96   2x Intel Platinum 8174 @ 3.10GHz   2x24    1xEDR         688
mem_0384         mem_0384          384   2x Intel Platinum 8174 @ 3.10GHz   2x24    1xEDR          78
mem_0768         mem_0768          768   2x Intel Platinum 8174 @ 3.10GHz   2x24    1xEDR          12

* default partition, EDR: Intel Omni-Path (100 Gbit/s)
effective: 10/2020
 
VSC-3
QOS (standard)   partition       RAM (GB)   CPU                             Cores   IB (HCA)   #Nodes
normal_0064      mem_0064*             64   2x Intel E5-2650 v2 @ 2.60GHz   2x8     2xQDR        1849
normal_0128      mem_0128             128   2x Intel E5-2650 v2 @ 2.60GHz   2x8     2xQDR         140
normal_0256      mem_0256             256   2x Intel E5-2650 v2 @ 2.60GHz   2x8     2xQDR          50
vsc3plus_0064    vsc3plus_0064         64   2x Intel E5-2660 v2 @ 2.20GHz   2x10    1xFDR         816
vsc3plus_0256    vsc3plus_0256        256   2x Intel E5-2660 v2 @ 2.20GHz   2x10    1xFDR          48
normal_binf      binf            512-1536   2x Intel E5-2690 v4 @ 2.60GHz   2x14    1xFDR          17
GPU nodes: see later

* default partition, QDR: Intel TrueScale Infinipath (40 Gbit/s), FDR: Mellanox ConnectX-3 (56 Gbit/s)
effective: 10/2018
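
Taken together with the #SBATCH lines above: to request one of the 256 GB VSC-3 node types from this table, a matching QOS/partition pair would be (a sketch, not the only valid combination):

#SBATCH --qos=normal_0256
#SBATCH --partition=mem_0256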

VSC hardware – display information

VSC-4:

VSC-4 >  sinfo
VSC-4 >  sinfo -o %P
VSC-4 >  scontrol show partition mem_0096
VSC-4 >  scontrol show node n401-001


VSC-3:

VSC-3 >  sinfo
VSC-3 >  sinfo -o %P
VSC-3 >  scontrol show partition vsc3plus_0064
VSC-3 >  scontrol show node n351-001
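
A compact per-partition summary can also be requested with a custom output format (a sketch; this format string is just one possible choice):

sinfo -o "%P %D %c %m"   ! partition, node count, CPUs per node, memory per node (MB)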

QOS – account/project assignment

1.+2.:

VSC-4 >  sqos -acc
===================================================..
Your jobs can run with the following account(s) and..

default_account:        p70824
        account:        p70824              

    default_qos:      mem_0096              
            qos:       jupyter              
                      mem_0096              
                      mem_0384              
                      mem_0768


VSC-3 >  sqos -acc   ! ==> much longer list...
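
Using the output above, a non-default combination can then be requested explicitly in the job script (a sketch with the account shown above; any QOS listed for that account works analogously):

#SBATCH --account=p70824
#SBATCH --qos=mem_0384
#SBATCH --partition=mem_0384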

QOS – partition assignment

3.:

VSC-4 >  sqos
 qos_name total  used  free     walltime   priority partitions  
==============================================================
  jupyter    20     0    20   3-00:00:00       1000 jupyter     
 mem_0096   654   680   -26   3-00:00:00       1000 mem_0096    
 mem_0384    73    11    62   3-00:00:00       1000 mem_0384    
 mem_0768    10     8     2   3-00:00:00       1000 mem_0768    


VSC-3 >  sqos   ! ==> much longer list...
#SBATCH --account=<account>       ! specify account/project p7....
#SBATCH --qos=<qos>               ! specify quality of service   ! always provide:
#SBATCH --partition=<partition>   ! specify type of hardware     ! qos & partition

Sample batch job

#!/bin/bash

#SBATCH -J <jobname>
#SBATCH -N <number_of_nodes>

#SBATCH --account=<account>       ! default project
#SBATCH --qos=<qos>               ! default @VSC-4: mem_0096 @VSC-3: normal_0064
#SBATCH --partition=<partition>   ! default @VSC-4: mem_0096 @VSC-3: mem_0064

module purge                      # recommended to be done in all jobs !!!!!
# module load <modules>           # load only modules actually needed by job

echo 'Hello from node: '$HOSTNAME
echo 'Number of nodes: '$SLURM_JOB_NUM_NODES
echo 'Tasks per node:  '$SLURM_TASKS_PER_NODE
echo 'Partition used:  '$SLURM_JOB_PARTITION
echo 'Using the nodes: '$SLURM_JOB_NODELIST
# <do_my_work>

Single-core (few-core) jobs – VSC-4 shared compute nodes

#!/bin/bash

#SBATCH -J test
#SBATCH -n 1                      ! specify number of cores 
#SBATCH --mem=2G                  ! memory limit in Gigabytes

#SBATCH --account=<account>       ! default project

module purge                      # recommended to be done in all jobs !!!!!
# module load <modules>           # load only modules actually needed by job

echo 
echo 'Hello from node: '$HOSTNAME
echo 'Number of nodes: '$SLURM_JOB_NUM_NODES
echo 'Tasks per node:  '$SLURM_TASKS_PER_NODE
echo 'Partition used:  '$SLURM_JOB_PARTITION
echo 'Using the nodes: '$SLURM_JOB_NODELIST
echo 
# <do_my_work>
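
The same pattern scales to a few cores on a shared node; a sketch (the core count and memory value are only an illustration):

#SBATCH -n 4                      ! four cores
#SBATCH --mem=8G                  ! total memory for the job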

Job submission

sbatch job.sh
sbatch <SLURM_PARAMETERS> job.sh <JOB_PARAMETERS>

Parameters on the command line are specified exactly as in the job script; command-line parameters override those set in the job script.
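
For example, the quickstart job.sh from above can be redirected to other hardware at submit time without editing the script (the QOS/partition pair here is just an illustration):

sbatch --qos=mem_0384 --partition=mem_0384 job.sh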

squeue -u $USER     ! alias sq='squeue -u $USER'
slurm-<job_id>.out
scancel <job_id>
scancel <job_name>
scancel -u $USER

Exercises (1/2)

VSC-4 >  source ~training/start_using_training     ! during course only
VSC-3 >  source ~training/start_using_devel_0128   ! during course only

 

VSC-3/4: script examples/05_submitting_batch_jobs/job.sh

sbatch job.sh
squeue -u $USER   ! sq
scancel <job_id>
! output in: slurm-<job_id>.out
sqos -acc
sqos
sinfo
sinfo -o %P
scontrol show partition ...
scontrol show node ...

Exercises (2/2)

source ~training/switch_2_default     ! during course only

 

VSC-4: script examples/05_submitting_batch_jobs/job_single_core_vsc4.sh

hostname
free 

 

VSC-4 >  source ~training/start_using_training     ! during course only
VSC-3 >  source ~training/start_using_devel_0128   ! during course only
