SLURM setup

overview of services

compilation

module purge
module load gnu7/7.2.0
cd /opt/install/src/slurm/slurm-17.11.0
./configure --prefix=/opt/ohpc/pub/slurm
make -j 8 
make install
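
Whether the build really ended up under the chosen prefix can be checked by asking one of the installed binaries for its version, e.g.:

/opt/ohpc/pub/slurm/bin/sinfo --version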

Prerequisites:

adjustments:

slurm config directory:

/opt/ohpc/pub/slurm/etc
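
If commands or daemons are started without this directory as their built-in default, the configuration file can also be pointed to explicitly via the SLURM_CONF environment variable; a minimal sketch:

export SLURM_CONF=/opt/ohpc/pub/slurm/etc/slurm.conf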

db setup

MySQL/MariaDB:

create database slurm_acct_db;
create user 'slurm'@'localhost' identified by 'password';
grant all on slurm_acct_db.* TO 'slurm'@'localhost';
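
The storage-related entries in slurmdbd.conf then have to match the database and user created above; a minimal sketch (the password has to match the one chosen above):

StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=password
StorageLoc=slurm_acct_db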

config files

update procedure

backup

 mysqldump --all-databases | /bin/gzip > slurm_complete-$(date +\%Y\%m\%d\%H\%M).sql.gz
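
For regular dumps this can be put into a cron entry, e.g. in /etc/cron.d (the backup directory /var/backups/slurm is just an assumption here):

# nightly dump of all databases at 03:00
0 3 * * * root mysqldump --all-databases | /bin/gzip > /var/backups/slurm/slurm_complete-$(date +\%Y\%m\%d\%H\%M).sql.gz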

recovery of database

create user 'slurm'@'localhost' identified by 'password';
grant all on slurm_acct_db.* TO 'slurm'@'localhost';
zcat slurm_complete-xxxxxxx | mysql

pam slurm

account    required     pam_slurm.so
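
pam_slurm restricts SSH logins on compute nodes to users who have a running job on that node. The line is typically appended to the node's SSH PAM stack, e.g. (the exact path may differ between distributions):

# /etc/pam.d/sshd on the compute nodes
account    required     pam_slurm.so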

pam user add

multifactor priority plugin

basic setup

PriorityParameters      = (null)
PriorityDecayHalfLife   = 14-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = FAIR_TREE
PriorityMaxAge          = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightFairShare = 1000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 1000
PriorityWeightTRES      = (null)
these values can be displayed with:

scontrol show config | grep -i prio

FAIR_TREE

sshare

sacctmgr modify account vsctest set RawUsage=0
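
The accumulated raw usage and the resulting fairshare factors can be inspected with sshare, e.g. for a single account before and after the reset:

sshare -l -A vsctest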

priority calculation

  • multifactor priority:
PRIORITY_MF = AGE + FAIRSHARE + JOBSIZE + PARTITION + QOS + TRES
  • total priority:
PRIORITY = PRIORITY_MF - NICE
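
A small worked example, with assumed factor values between 0 and 1 and the weights from the basic setup above (all 1000):

PRIORITY_MF = 1000*0.3 (age) + 1000*0.8 (fairshare) + 1000*0.1 (jobsize)
            + 1000*1.0 (partition) + 1000*0.2 (qos) + 0 (tres)
            = 2400
PRIORITY    = 2400 - 0 (nice) = 2400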

nice

sprio

Only for pending jobs:

sprio -l

Set priority explicitly (overrides nice and multifactor):

scontrol update job <job_id> priority=1

Restore multifactor:

cgroups plugin

slurm.conf: adding nodes

adding nodes requires an updated slurm.conf (identical on all nodes) and a restart of slurmctld:

service slurmctld restart

slurm.conf: adding nodes

NodeName=c1-[00-11] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2
NodeName=c2-[00-07] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2
NodeName=c3-[00-07] Sockets=1 CoresPerSocket=64 ThreadsPerCore=4
NodeName=c4-[00-15] Sockets=1 CoresPerSocket=6  ThreadsPerCore=2
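
After slurmctld has been restarted, the new node definitions can be verified, e.g.:

scontrol show node c1-00
sinfo -N -l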

Additional:

Feature=MySuperFeature
GresTypes=My_GRES
AccountingStorageTRES=gres/My_GRES
Gres=My_GRES:42
RealMemory=64393

Redundant information in gres.conf and slurm.conf!
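
A matching gres.conf sketch for the hypothetical My_GRES resource; the count has to agree with the Gres= entry in slurm.conf:

# gres.conf
NodeName=c1-[00-11] Name=My_GRES Count=42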

slurm.conf: adding partitions

PartitionName=normal Nodes=c3-[00-07],c4-[00-15] Default=YES MaxTime=24:00:00 State=UP TRESBillingWeights=CPU=1.0
PartitionName=test Nodes=c1-[00-11] MaxTime=24:00:00 State=UP TRESBillingWeights=CPU=4.0
PartitionName=test2 Nodes=c2-[00-07] MaxTime=24:00:00 State=UP TRESBillingWeights=CPU=400.0
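
The resulting partition layout can be checked with, e.g.:

scontrol show partition normal
sinfo -s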

optional parameters:

SLURM: Accounts and Users

SLURM: Partition and Quality of Service

QOS-Account/Project assignment

1.+2.:

sqos -acc
default_account:              p70824
        account:              p70824                    

    default_qos:         normal_0064                    
            qos:          devel_0128                    
                            goodluck                    
                      gpu_gtx1080amd                    
                    gpu_gtx1080multi                    
                   gpu_gtx1080single                    
                            gpu_k20m                    
                             gpu_m60                    
                                 knl                    
                         normal_0064                    
                         normal_0128                    
                         normal_0256                    
                         normal_binf                    
                       vsc3plus_0064                    
                       vsc3plus_0256

QOS-Partition assignment

3.:

sqos
            qos_name total  used  free     walltime   priority partitions  
=========================================================================
         normal_0064  1782  1173   609   3-00:00:00       2000 mem_0064    
         normal_0256    15    24    -9   3-00:00:00       2000 mem_0256    
         normal_0128    93    51    42   3-00:00:00       2000 mem_0128    
          devel_0128    10    20   -10     00:10:00      20000 mem_0128    
            goodluck     0     0     0   3-00:00:00       1000 vsc3plus_0256,vsc3plus_0064,amd
                 knl     4     1     3   3-00:00:00       1000 knl         
         normal_binf    16     5    11   1-00:00:00       1000 binf        
    gpu_gtx1080multi     4     2     2   3-00:00:00       2000 gpu_gtx1080multi
   gpu_gtx1080single    50    18    32   3-00:00:00       2000 gpu_gtx1080single
            gpu_k20m     2     0     2   3-00:00:00       2000 gpu_k20m    
             gpu_m60     1     1     0   3-00:00:00       2000 gpu_m60     
       vsc3plus_0064   800   781    19   3-00:00:00       1000 vsc3plus_0064
       vsc3plus_0256    48    44     4   3-00:00:00       1000 vsc3plus_0256
      gpu_gtx1080amd     1     0     1   3-00:00:00       2000 gpu_gtx1080amd

naming convention:

QOS      Partition
*_0064   mem_0064

Specification in job script

#SBATCH --account=xxxxxx
#SBATCH --qos=xxxxx_xxxx
#SBATCH --partition=mem_xxxx

If any of these lines is omitted, the corresponding default is used (see the previous slides); the default partition is “mem_0064”.

Licenses

slic

Within the job script, add the license flags as shown by ‘slic’, e.g. for using both Matlab and Mathematica:

#SBATCH -L matlab@vsc,mathematica@vsc

Intel licenses are needed only for compiling code, not for running it!
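
In addition to ‘slic’, the licenses known to Slurm itself and their current usage can be listed with:

scontrol show licenses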

Reservations of compute nodes

  • check for reservations:
scontrol show reservations
  • use it:
#SBATCH --reservation=
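
The reservation name can also be given on the sbatch command line instead of in the job script, e.g.:

sbatch --reservation=<reservation_name> job.sh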

QOS

job submission

sbatch <sbatch_params> job.sh <job_script_params>

in job.sh

#!/bin/bash
#SBATCH -J myjobname
#SBATCH -N 2 
#SBATCH --qos=test_qos
#SBATCH --account=vsctest

...
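
Options given directly on the sbatch command line take precedence over the corresponding #SBATCH lines in the script, e.g.:

sbatch --qos=test_qos --account=vsctest -N 4 job.sh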

hold and release job

only admin can release:

scontrol hold id_0

user can release:

scontrol uhold id_0 id_1 
scontrol release id_0

account and user setup

sacctmgr add account account_name parent=root share=1
sacctmgr add user 70032 account=account_name
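
The resulting associations can be listed with:

sacctmgr show associations where account=account_name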

adding qos

sacctmgr add qos test2 GrpTRES=cpu=300
sacctmgr update qos test2 set MaxTRESPU=node=4

Remove a constraint:

sacctmgr update qos test2 set MaxTRESPU=node=-1
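
The current QOS limits can be displayed with, e.g.:

sacctmgr show qos format=Name,GrpTRES,MaxTRESPU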

trackable resources (tres)

sacctmgr show tres

reservations

scontrol create reservation starttime=$START \
        endtime=$END \
        partitionname=$PARTITION \
        Accounts=$ACCOUNT \
        Users=$USERS \
        Nodecnt=$NODECOUNT
scontrol show res

optional, e.g. for maintenance windows: Flags=MAINT
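
A reservation is removed again with:

scontrol delete ReservationName=<reservation_name>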
    

sreport

show usage statistics

sreport cluster utilization
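
Reporting period and output format can be given explicitly, e.g. utilization for one month in percent:

sreport cluster utilization start=2018-01-01 end=2018-02-01 -t percent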

accounting

job stats daily

  Column   |            Type             | 
-----------+-----------------------------+
 project   | text                        | 
 username  | text                        | 
 day       | timestamp without time zone | 
 usage     | numeric                     | 
 partition | text                        | 
 qos       | text                        | 
 licenses  | text                        | 
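
A hypothetical query against this table, assuming it is called job_stats_daily and lives in a database named slurm_accounting (both names are assumptions), e.g. summing the usage per project for one day:

psql slurm_accounting <<'SQL'
SELECT project, sum("usage")
  FROM job_stats_daily
 WHERE day = '2018-03-01'
 GROUP BY project;
SQL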

hardware availability

    Column     |  Type   |
---------------+---------+
 id            | integer |
 date          | date    |
 cpus          | integer | 
 entitytype_id | integer | 

cluster utilization

    Column     |  Type   |
---------------+---------+
 id            | integer |
 date          | date    |
 trescount     | bigint  |
 allocated     | bigint  |
 down          | bigint  |
 idle          | bigint  |
 overcommited  | bigint  |
 planneddown   | bigint  |
 reserved      | bigint  |
 reported      | bigint  |
 entitytype_id | integer |