
Job script examples

General blueprint for a jobscript

You can save the following example to a file (e.g. run.sh) on scicluster. Comment out the two cp commands that are only there for illustration (the ones copying myfile.txt and my_output) and change the SBATCH directives where applicable. You can then submit the script by typing:

sbatch run.sh

Please note that all values that you define with SBATCH directives are hard limits. If, for example, you ask for 6000 MB of memory (--mem=6000MB) and your job uses more than that, the job will be killed automatically by the resource manager.
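If you are unsure how much memory a job actually needed, you can query Slurm's accounting database after the job has finished (assuming job accounting is enabled on the cluster), for example:

sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State

MaxRSS shows the peak memory used by the largest task of each job step.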

Important: Please note that in the example below the standard out and err streams of the code are redirected to a file, even though standard out and err are already specified for the job. This matters whenever stdout/stderr from your code exceeds more than a few MB: the job output is spooled locally on the execution node and copied to the user working directory only after the job completes. Since the spool size is small (a few GB), you can fill up the disk and crash all the jobs on the node. The redirection approach avoids this and, in addition, lets you monitor out.txt during runtime.
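For example, while the job is running you can follow the redirected output of the blueprint below with something like (assuming the /scratch1 file system of the execution node is reachable from where you are logged in; otherwise ssh to the execution node first):

tail -f /scratch1/${USER}/<jobid>/out.txt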

#!/bin/bash -l

##############################
#       Job blueprint        #
##############################

# Give your job a name, so you can recognize it in the queue overview
#SBATCH --job-name=example

# Define how many nodes you need. Here, we ask for 1 node.
#SBATCH --nodes=1
# You can further define the number of tasks with --ntasks-per-*
# See "man sbatch" for details, e.g. --ntasks=4 will ask for 4 CPUs.

# Define how long the job will run in real time. This is a hard cap, meaning
# that if the job runs longer than what is written here, it will be
# force-stopped by the server. If you make the expected time too long, it will
# take longer for the job to start. Here, we say the job will take 5 minutes.
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# Define the partition on which the job shall run, e.g. long
#SBATCH --partition=long

# How much memory you need.
# --mem will define memory per node and
# --mem-per-cpu will define memory per CPU/core. Choose one of those.
#SBATCH --mem-per-cpu=1500MB
##SBATCH --mem=5GB    # this one is not in effect, due to the double hash

#SBATCH --output="stdout.txt" # standard output
#SBATCH --error="stderr.txt"  # standard error <-- This is our last SBATCH directive

# You may not place any commands before the last SBATCH directive

# Define and create a unique scratch directory for this job
SCRATCH_DIRECTORY=/scratch1/${USER}/${SLURM_JOBID}
mkdir -p ${SCRATCH_DIRECTORY}
cd ${SCRATCH_DIRECTORY}

# You can copy everything you need to the scratch directory
# ${SLURM_SUBMIT_DIR} points to the path where this script was submitted from
cp ${SLURM_SUBMIT_DIR}/myfile.txt ${SCRATCH_DIRECTORY}

ml purge # it's a good practice to first unload all modules
ml your_modules # then load what module you need, if any

# This is where the actual work is done.
./my_code >& out.txt

# After the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp ${SCRATCH_DIRECTORY}/my_output ${SLURM_SUBMIT_DIR}


# After everything is saved to your home directory, it's recommended to delete the
# scratch directory to save space on /scratch1
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}

# Finish the script
exit 0
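After submitting the script you can check the status of your jobs in the queue and, if needed, cancel one:

squeue --user=${USER}
scancel <jobid>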

Job arrays

Running many sequential jobs in parallel using job arrays

In this example we wish to run many similar sequential jobs in parallel using job arrays. We use Python as an example, but the language does not matter for job arrays:

#!/usr/bin/env python

import time

print('start at ' + time.strftime('%H:%M:%S'))

print('sleep for 10 seconds ...')
time.sleep(10)

print('stop at ' + time.strftime('%H:%M:%S'))

Save this to a file called "test.py" and try it out:

python test.py

start at 15:23:48
sleep for 10 seconds ...
stop at 15:23:58

Good. Now we would like to run this script 8 times at the same time. For this we use the following script:

#!/bin/bash -l

#####################
# job-array example #
#####################

#SBATCH --job-name=example

# 8 jobs will run in this array at the same time
#SBATCH --array=1-8

# run for five minutes
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# determine the partition
#SBATCH --partition=short

# 100MB memory per core
#SBATCH --mem-per-cpu=100MB

#SBATCH --output="stdout.txt"
#SBATCH --error="stderr.txt"

# you may not place bash commands before the last SBATCH directive

# define and create a unique scratch directory
SCRATCH_DIRECTORY=/scratch1/${USER}/job-array-example/${SLURM_JOBID}
mkdir -p ${SCRATCH_DIRECTORY}
cd ${SCRATCH_DIRECTORY}

cp ${SLURM_SUBMIT_DIR}/test.py ${SCRATCH_DIRECTORY}

# each job will see a different ${SLURM_ARRAY_TASK_ID}
echo "now processing task id:: " ${SLURM_ARRAY_TASK_ID}

ml purge # unload all modules
ml Anaconda3 # load Anaconda3 module to use python

python test.py > output_${SLURM_ARRAY_TASK_ID}.txt

# after the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp output_${SLURM_ARRAY_TASK_ID}.txt ${SLURM_SUBMIT_DIR}

# we step out of the scratch directory and remove it
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}

# happy end
exit 0

Submit the script and after a short while you should see 8 output files in your submit directory:

ls -l output*.txt

-rw------- 1 user user 60 Oct 14 14:44 output_1.txt
-rw------- 1 user user 60 Oct 14 14:44 output_2.txt
-rw------- 1 user user 60 Oct 14 14:44 output_3.txt
-rw------- 1 user user 60 Oct 14 14:44 output_4.txt
-rw------- 1 user user 60 Oct 14 14:44 output_5.txt
-rw------- 1 user user 60 Oct 14 14:44 output_6.txt
-rw------- 1 user user 60 Oct 14 14:44 output_7.txt
-rw------- 1 user user 60 Oct 14 14:44 output_8.txt
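If each array task should process a different input, a common pattern is to use the task id to select the input file. A minimal sketch, assuming your script accepts the file name as an argument (the input_<N>.dat naming is just an illustration):

INPUT_FILE=${SLURM_SUBMIT_DIR}/input_${SLURM_ARRAY_TASK_ID}.dat
python test.py ${INPUT_FILE} > output_${SLURM_ARRAY_TASK_ID}.txt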

Packaging smaller parallel jobs into one large parallel job

There are several ways to package smaller parallel jobs into one large parallel job. The preferred way is to use job arrays, for which many examples can be found on the web. Here we present a more pedestrian alternative that can give a lot of flexibility.

In this example we imagine that we wish to run 2 MPI jobs at the same time, each using 4 tasks, for a total of 8 tasks. Once they finish, we do a post-processing step and then run another set of 2 jobs with 4 tasks each:

#!/bin/bash

#SBATCH --job-name=example
#SBATCH --ntasks=8
#SBATCH --time=0-00:05:00
#SBATCH --mem-per-cpu=100MB
#SBATCH --output="stdout.txt"
#SBATCH --error="stderr.txt"

# determine the partition
#SBATCH --partition=para

cd ${SLURM_SUBMIT_DIR}

# first set of parallel runs

ml purge
ml OpenMPI

mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &

wait

# here a post-processing step
# ...

# another set of parallel runs
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &

wait

exit 0

The wait commands are important here - the run script will only continue once all commands started with & have completed.
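If you also want to notice when one of the background runs fails, you can record the process IDs and wait for each one individually; wait <pid> returns the exit status of that process. A minimal sketch:

mpirun -n 4 ./my-binary &
pid1=$!
mpirun -n 4 ./my-binary &
pid2=$!
wait $pid1 || echo "first run failed"
wait $pid2 || echo "second run failed"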

OpenMP and MPI

You can copy and paste the examples given here to a file (e.g. run.sh) and start it with:

sbatch run.sh

Example for an OpenMP job

#!/bin/bash -l

#############################
# example for an OpenMP job #
#############################

#SBATCH --job-name=example

# we ask for 1 task with 12 cores
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12

# ask for 16GB memory
#SBATCH --mem=16G

# run for five minutes
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# determine the partition
#SBATCH --partition=para

#SBATCH --output="stdout_%j"
#SBATCH --error="stderr_%j"

# you may not place bash commands before the last SBATCH directive

ml purge # it's a good practice to first unload all modules
ml your_modules # then load what module you need, if any

# define and create a unique scratch directory
SCRATCH_DIRECTORY=/scratch1/${USER}/example/${SLURM_JOBID}
mkdir -p ${SCRATCH_DIRECTORY}
cd ${SCRATCH_DIRECTORY}

# we copy everything we need to the scratch directory
# ${SLURM_SUBMIT_DIR} points to the path where this script was submitted from
cp ${SLURM_SUBMIT_DIR}/my_binary.x ${SCRATCH_DIRECTORY}

# we set OMP_NUM_THREADS to the number of available cores
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# we execute the job and time it
time ./my_binary.x > my_output

# after the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp ${SCRATCH_DIRECTORY}/my_output ${SLURM_SUBMIT_DIR}

# we step out of the scratch directory and remove it
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}

Example for an MPI job

#!/bin/bash -l

##########################
# example for an MPI job #
##########################

#SBATCH --job-name=example

# 20 MPI tasks in total
#SBATCH --ntasks=20

# run for five minutes
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# 500MB memory per core
#SBATCH --mem-per-cpu=500MB

# determine the partition
#SBATCH --partition=para

#SBATCH --output="stdout_%j"
#SBATCH --error="stderr_%j"

# you may not place bash commands before the last SBATCH directive

# define and create a unique shared directory
SHARED_DIRECTORY=/work8/${USER}/${SLURM_JOBID} # please note it is vital to use /work8 for the shared directory
mkdir -p ${SHARED_DIRECTORY}
cd ${SHARED_DIRECTORY}

# unload all modules, then load your MPI module
ml purge
ml OpenMPI

# we execute the job and time it
time mpirun -np $SLURM_NTASKS ./my_binary.x &> my_output

# happy end
exit 0
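If you no longer need the files in the shared directory once the job is done, you can remove it at the end of the script, just as with the scratch directory in the blueprint above:

cd ${SLURM_SUBMIT_DIR}
rm -rf ${SHARED_DIRECTORY}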

Example for a hybrid MPI/OpenMP job

#!/bin/bash -l

#######################################
# example for a hybrid MPI OpenMP job #
#######################################

#SBATCH --job-name=example

# we ask for 2 MPI tasks with 18 cores each on 1 node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=18

# run for five minutes
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# 500MB memory per core
#SBATCH --mem-per-cpu=500MB

# determine the partition
#SBATCH --partition=para

#SBATCH --output="stdout_%j"
#SBATCH --error="stderr_%j"

# you may not place bash commands before the last SBATCH directive

# define and create a unique shared directory
SHARED_DIRECTORY=/work8/${USER}/${SLURM_JOBID} # please note it is vital to use /work8 for the shared directory
mkdir -p ${SHARED_DIRECTORY}
cd ${SHARED_DIRECTORY}

# unload all modules, then load your MPI module
ml purge
ml OpenMPI

# we set OMP_NUM_THREADS to the number of CPU cores per MPI task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# we execute the job and time it
time mpirun -np $SLURM_NTASKS ./my_binary.x &> my_output

# happy end
exit 0

If you want to start more than one MPI rank per node you can use --ntasks-per-node in combination with --nodes:

#SBATCH --nodes=2 --ntasks-per-node=2 --cpus-per-task=8

This will start 2 MPI tasks on each of the 2 nodes (4 tasks in total), where each task can use up to 8 threads.
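To verify the resulting task placement you can run a quick check from inside the job script, for example (a sketch; the echo text is only for illustration):

srun bash -c 'echo "task ${SLURM_PROCID} runs on $(hostname)"'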

Example for a GPU job

#!/bin/bash -l
#############################
# example for a GPU job #
#############################

#SBATCH --job-name=example

# we can ask for 1 p4000 GPU, 1 k80 GPU or 2 k80 GPUs
#SBATCH --gpus=p4000:1

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=32G
#SBATCH --time=0-00:05:00
#SBATCH --partition=para

#SBATCH --output="p4000.out"
#SBATCH --error="p4000.err"

# you may not place bash commands before the last SBATCH directive

ml purge # it's a good practice to first unload all modules
ml CUDA  # then load what module you need

## Compile the CUDA code using the nvcc compiler
nvcc -o stats.exe stats.cu

## Run the code
./stats.exe
rm ./stats.exe

The same job can instead request two K80 GPUs:
#!/bin/bash -l
#############################
# example for a GPU job #
#############################

#SBATCH --job-name=example

# we can ask for 1 p4000 GPU, 1 k80 GPU or 2 k80 GPUs
#SBATCH --gpus=k80:2

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=36
#SBATCH --mem=32G
#SBATCH --time=0-00:05:00
#SBATCH --partition=para

#SBATCH --output="k80.out"
#SBATCH --error="k80.err"

# you may not place bash commands before the last SBATCH directive

ml purge # it's a good practice to first unload all modules
ml CUDA  # then load what module you need

## Compile the CUDA code using the nvcc compiler
nvcc -o stats.exe stats.cu

## Run the code
./stats.exe
rm ./stats.exe

Both examples compile and run the following CUDA source (stats.cu), which lists the available devices and prints some of their properties:
#include <stdio.h>
#include <cuda_runtime.h>

void printDeviceInfo(cudaDeviceProp prop) {

   printf("Name                         - %s\n",  prop.name);
   printf("Total global memory          - %lu MB \n", prop.totalGlobalMem/(1024*1024));
   printf("Total constant memory        - %lu KB \n", prop.totalConstMem/1024);

   printf("Shared memory per block      - %lu KB \n", prop.sharedMemPerBlock/1024);
   printf("Total registers per block    - %d\n", prop.regsPerBlock);
   printf("Maximum threads per block    - %d\n", prop.maxThreadsPerBlock);

   printf("Clock rate                   - %d\n",  prop.clockRate);
   printf("Number of multi-processors   - %d\n",  prop.multiProcessorCount);

}

int main( ) {

    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Available CUDA devices - %d\n", deviceCount);
    for (int i=0;i<deviceCount;i++){

        // Device information
        printf("\nCUDA Device #%d\n", i);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printDeviceInfo(prop);

    }
}
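Inside a GPU job you can also check which device(s) were allocated to you, for example with:

nvidia-smi
echo ${CUDA_VISIBLE_DEVICES}

CUDA_VISIBLE_DEVICES is normally set by Slurm for jobs that request GPUs.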

General tips

Example of how to allocate the entire memory on one node

#!/bin/bash -l

###################################################
# Example for a job that consumes a lot of memory #
###################################################

#SBATCH --job-name=example

# we ask for 1 node
#SBATCH --nodes=1

# run for five minutes
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# determine the partition, e.g. long
#SBATCH --partition=long 

# total memory for this job
# this is a hard limit
# note that if you ask for more memory than belongs to a single CPU,
# your account also gets charged for the additional (idle) CPUs
#SBATCH --mem=16000MB

#SBATCH --output="stdout.txt"
#SBATCH --error="stderr.txt"

# you may not place bash commands before the last SBATCH directive

ml purge # it's a good practice to first unload all modules
ml your_modules # then load what module you need, if any

SCRATCH_DIRECTORY=/scratch1/${USER}/example/${SLURM_JOBID}
mkdir -p ${SCRATCH_DIRECTORY}
cd ${SCRATCH_DIRECTORY}

# we copy everything we need to the scratch directory
# ${SLURM_SUBMIT_DIR} points to the path where this script was submitted from
cp ${SLURM_SUBMIT_DIR}/my_binary.x ${SCRATCH_DIRECTORY}

# we execute the job and time it
time ./my_binary.x > my_output

# after the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp ${SCRATCH_DIRECTORY}/my_output ${SLURM_SUBMIT_DIR}

# we step out of the scratch directory and remove it
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}

# happy end
exit 0
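To see how many CPUs and how much memory the nodes in a partition offer, you can query Slurm, for example:

sinfo --partition=long --format="%n %c %m"

where %n is the node name, %c the number of CPUs and %m the memory in MB.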

How to recover files before a job times out

Possibly you would like to clean up the work directory or recover files for a restart in case a job times out. In this example we ask Slurm to send a signal to our script 120 seconds before it times out, giving us a chance to perform clean-up actions.

#!/bin/bash -l

# job name
#SBATCH --job-name=example

# one core only
#SBATCH --ntasks=1

# we give this job 4 minutes
#SBATCH --time=0-00:04:00

# asks SLURM to send the USR1 signal 120 seconds before end of the time limit
#SBATCH --signal=B:USR1@120

# define the handler function
# note that this is not executed here, but rather
# when the associated signal is sent
your_cleanup_function()
{
    echo "function your_cleanup_function called at $(date)"
    # do whatever cleanup you want here
}

# call your_cleanup_function once we receive USR1 signal
trap 'your_cleanup_function' USR1

echo "starting calculation at $(date)"

# the calculation "computes" (in this case sleeps) for 1000 seconds
# but we asked Slurm for only 240 seconds, so it will not finish
# the "&" after the compute step and "wait" are important
sleep 1000 &
wait
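In a real job the cleanup function would typically copy partial results or restart files back to the submit directory before the job is killed. A minimal sketch, assuming you use a scratch directory as in the blueprint above and your code writes a (hypothetical) file restart.dat there:

your_cleanup_function()
{
    echo "copying restart files back at $(date)"
    cp ${SCRATCH_DIRECTORY}/restart.dat ${SLURM_SUBMIT_DIR}/
    exit 0
}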