First Batch Job with Slurm

University of Exeter logo


sbatch, squeue, sacct —
the submit-check-read-output loop

Resources

Section 3 — 25 min

GW4 logo

Isambard 3 exterior

Why batch?

Compute nodes are shared — the scheduler is how you ask for a slice

  • Login node — the node you ssh into. Shared by everyone. No heavy compute here.
  • Compute nodes — what Slurm hands out when you submit a job. Your work runs here.
  • Slurm — the scheduler. You write a script that says how much and for how long, and it finds you a slot.

The loop we will practise: submit → check → read output. Repeat until you get what you want.

Anatomy of a batch script

Shebang + #SBATCH directives + normal shell commands

#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --output=hello_world.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00

echo "Job ID: ${SLURM_JOB_ID}"
echo "Host:   $(hostname)"
date
free -h
lscpu
env | sort > hello_world_${HOSTNAME}.env
  • #SBATCH lines are comments to the shell, directives to Slurm
  • Below the directives is just bash — anything you can run interactively
  • Think reproducibility: the script documents exactly how the job was set up. Do not rely on modules you loaded earlier or files that happen to be in $PWD; put everything the job needs inside the script
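For example, a sketch of a self-contained script. The module name, program, and input file below are placeholders; check `module avail` for what is actually installed:

```shell
#!/bin/bash
#SBATCH --job-name=selfcontained
#SBATCH --output=selfcontained.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00

# Everything the job needs is set up here, not in the submitting shell:
module purge                  # start from a clean environment
module load gcc/12.2          # placeholder module name
cd "${SLURM_SUBMIT_DIR}"      # run from the directory the job was submitted from
./my_program input.dat        # placeholder program and input file
```

Anyone with this script can reproduce the run without knowing what your login shell looked like.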

Submit, check, read

Three commands you will use every day

sbatch

sbatch sbatch_hello_world.sh

Returns a job ID.
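If you are scripting around sbatch, the --parsable flag makes it print just the job ID (plus a ";cluster" suffix on multi-cluster systems), which is easier to capture than parsing the "Submitted batch job" message:

```shell
# Capture the job ID at submission time for later squeue/scancel/sacct calls
JOBID=$(sbatch --parsable sbatch_hello_world.sh | cut -d ';' -f 1)
echo "Submitted job ${JOBID}"
```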

squeue --me

squeue --me
watch -n 15 squeue --me

States: PD (pending), R (running), CG (completing).
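When a job sits in PD, squeue can also tell you why. A sketch using squeue's format specifiers (%R prints the pending reason for waiting jobs):

```shell
# List only your pending jobs, with Slurm's reason for the wait in the last column
squeue --me --states=PD --format="%.10i %.9P %.20j %.2t %.10M %R"
```

Reasons like Priority or Resources are normal; they just mean the queue is busy.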

read / cancel

cat hello_world.out
scancel <jobid>

Output is just a file. Cancel any job you submitted by mistake.

watch -n 15 squeue --me polls every 15 seconds — do not go lower. The scheduler is shared.
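Once a job finishes it disappears from squeue; sacct reads the accounting records instead. A minimal sketch (the job ID is a placeholder; field names are from the sacct man page):

```shell
# One finished job, by ID:
sacct -j 12345 --format=JobID,JobName,State,Elapsed,ExitCode

# Everything you have run since midnight:
sacct --starttime=today --format=JobID,JobName,State,Elapsed
```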

Interactive jobs

A shell on a compute node — fastest way to poke at something

srun --ntasks=1 --cpus-per-task=1 --time=00:10:00 --pty bash

Lands you on a compute node with a prompt. Run the same commands you would put in a batch script.
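A typical quick-check session might look like this once the prompt appears (module name and data path are placeholders):

```shell
hostname                      # should print a compute node name, not the login node
module load mytool/1.0        # placeholder module -- does it load cleanly here?
ls data/input.csv             # placeholder path -- does the job see its inputs?
exit                          # free the allocation as soon as you are done
```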

  • Quick checks: “does this module load cleanly?”, “does my script find its data?”
  • Short-lived: exit when you are done — do not idle
  • Some workflows genuinely need interactivity: steering a solver, inspecting intermediate results, manually nudging a parameter to escape a local minimum. If your science requires a human in the loop, this is the tool.

The point is not “never use interactive jobs” — it is use resources responsibly. Idle allocations block others. Batch is better when there is no human in the loop.

Workshop reservation

We have a reservation for today — add these flags to any sbatch or srun command that should use it

Inspect the reservation and QOS before you use them:

scontrol show res exeter-workshop-260421
sacctmgr show qos brics.e6c_qos

On the command line:

sbatch --reservation=exeter-workshop-260421 --qos=brics.e6c_qos sbatch_hello_world.sh

Or inside your batch script (add these two #SBATCH lines):

#SBATCH --reservation=exeter-workshop-260421
#SBATCH --qos=brics.e6c_qos
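If you are submitting many jobs today, sbatch also reads most options from SBATCH_* input environment variables, so you can export the pair once per shell instead of repeating the flags. A convenience sketch; check `man sbatch` (INPUT ENVIRONMENT VARIABLES) for the exact names:

```shell
export SBATCH_RESERVATION=exeter-workshop-260421
export SBATCH_QOS=brics.e6c_qos
sbatch sbatch_hello_world.sh   # picks up both settings from the environment
```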

Gotchas:

  • Reservation alone — works fine
  • Reservation and QOS — works fine
  • QOS without reservation — sbatch: error: Batch job submission failed: Invalid qos specification
  • More than 2 nodes — QOSMaxNodePerUserLimit (the reservation's QOS caps the per-user node count)

Hands-on

~15 minutes — work in order, skip an exercise if it feels too easy

  • ex01 ex01_hello_world/01-hello-world.md — first sbatch submission
  • ex02 ex02_multi_task/02-multi-task.md — ntasks + srun + sacct
  • ex03 ex03_interactive/03-interactive.md — interactive shell
  • ex04 ex04_matmul/04-matmul.md — compile + run (build first: bash make.sh)

Each .md has the commands to run, questions to think about, and things to try if you finish early.

Do not do these today

Out of scope for this section — defer to docs or later sections

Stay in scope

  • sbatch, squeue --me, scancel, sacct
  • --ntasks, --cpus-per-task, --time, --output
  • srun for fan-out, srun --pty bash for a quick shell

Out of scope

  • Partitions / QOS / reservations (beyond today's workshop flags)
  • --mail-type=END and email notifications
  • MPI launches (Section 5 stretch)
  • Container launches (follow-up only)

We cover debugging failed jobs properly in Section 6. If a job misbehaves, note it and bring it there.

Discussion


Did your first job run? Anything unexpected in the output? Anything you want to try next?