Debugging Failed Jobs


Read the error, form a hypothesis, fix it —
hands-on practice with broken Slurm scripts


Section 6 — 20 min


Exercise menu

Pick what is most relevant to your work — you do not need to do all of them

Exercise  Topic
ex01      Oversubscription — srun uses more CPUs than allocated
ex02      Wrong env (module) — module load PrgEnv-gnu missing
ex03      Wrong env (pixi missing) — pixi not activated on compute node
ex04      Wrong env (pixi wrong) — wrong pixi env (default vs hpc)
ex05      OOM — huge matrix, only 1 CPU, job killed by OOM killer
ex06      MPI topology — uneven rank distribution across nodes
ex07      Race condition — missing OpenMP reduction clause (C + Numba)

Each exercise directory has a broken script and a walkthrough .md. Start with the one closest to your own work.

Before you look at the hints…

The struggle is the learning

Do this first

  • Read the error message carefully
  • Form a hypothesis about what is wrong
  • Try one fix and observe what changes
  • Hints are at the bottom of each exercise .md

Resist the urge to

  • Immediately scroll to the hint
  • Paste the error straight into an LLM
  • Ask a helper before you have a hypothesis

LLMs and helpers are fine — but only after you have spent a few minutes thinking. The mental model you build by struggling is the point.

When a job fails

Four steps, in order — do not skip to step 4

1. Read output

cat job.out
cat job.err

The error message is almost always here. Look at the last few lines.

2. sacct

sacct --format=JobID,State,\
ExitCode,MaxRSS -j <jobid>

State, exit code, and peak memory — the scheduler’s view.

3. squeue state

PD pending
R running
CG completing

squeue --start --me

4. Hypothesis

Write down what you think is wrong before editing the script.

Blind edits waste time.

Environment issues

ex01–ex04 — four different ways the environment can be wrong

ex01 Oversubscription

The program launches N OpenMP threads, but --cpus-per-task is not set (it defaults to 1), so all N threads compete for a single CPU.
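A minimal sketch of the usual fix (script and binary names here are placeholders, not taken from the exercise): request the CPUs explicitly and size the thread pool from Slurm's allocation rather than hard-coding it.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # without this line, the default is 1 CPU

# Size the OpenMP thread pool to match the allocation instead of hard-coding N
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./my_threaded_program         # placeholder binary name
```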

ex02 Module missing

Compiled binary needs PrgEnv-gnu; the script forgot module load PrgEnv-gnu. Job fails with a linker or shared-library error.
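The fix is usually a one-liner, but it matters where it goes: the batch script runs in a fresh shell on the compute node, so the module must be loaded inside the script, not just in your login session. A sketch (binary name is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=1

# Batch jobs start a fresh shell: load the toolchain the binary was built with
module load PrgEnv-gnu

srun ./compiled_binary             # placeholder binary name
```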

ex03 Pixi not activated

Compute node starts a fresh shell. Without pixi run or eval "$(pixi shell-hook)", the Python and packages from the pixi environment are not on PATH.
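Both activation styles mentioned above can be sketched in a batch script like this (the script name is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=1

cd "$SLURM_SUBMIT_DIR"             # run from the project directory containing pixi.toml

# Either wrap each command...
pixi run python analysis.py        # placeholder script name

# ...or activate the environment once for the rest of the script:
# eval "$(pixi shell-hook)"
# python analysis.py
```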

ex04 Wrong pixi environment

The script activates the default environment; MPI-enabled code needs the hpc environment (Cray MPICH build).
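Selecting the environment explicitly avoids this; "hpc" is the environment name from this exercise, the script name is a placeholder:

```shell
# Select the named environment rather than relying on the default
pixi run -e hpc python mpi_code.py   # placeholder script name

# Sanity check: confirm which interpreter the environment puts on PATH
pixi run -e hpc which python
```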

Resource issues

ex05–ex06 — memory and topology

ex05 Out of memory

A large matrix is allocated, but the job only requests 1 CPU and the default memory. The OOM killer terminates the process. sacct shows State=OUT_OF_MEMORY, or the process exits with code 137 (128 + SIGKILL).
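The fix is to request memory explicitly, then confirm the request against what the job actually used. A sketch (the 16G figure and binary name are illustrative, not from the exercise):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=16G                  # request memory explicitly instead of the default

srun ./allocate_big_matrix         # placeholder binary name

# Afterwards, compare the request against actual peak usage:
# sacct --format=JobID,State,ReqMem,MaxRSS -j <jobid>
```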

ex06 MPI topology

Ranks are spread unevenly across nodes — e.g. 3 ranks on node A, 1 on node B. Collective operations stall or produce incorrect timings because the load is not balanced.
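Pinning the layout with --ntasks-per-node avoids the scheduler filling one node first. A sketch with illustrative sizes (4 ranks over 2 nodes, not the exercise's actual counts):

```shell
# Force an even rank layout: 2 ranks on each of 2 nodes
srun --nodes=2 --ntasks=4 --ntasks-per-node=2 ./mpi_program

# Print where each rank landed to confirm the layout
srun --nodes=2 --ntasks=4 --ntasks-per-node=2 \
    bash -c 'echo "rank $SLURM_PROCID on $(hostname)"'
```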

Correctness issues

ex07 — the job finishes but gives the wrong answer

ex07 Race condition

An OpenMP parallel loop accumulates into a shared variable without a reduction clause. Multiple threads write to the same memory simultaneously — the result is non-deterministic and usually wrong.

The C version and the Numba @numba.njit(parallel=True) variant both have this bug.

The job exits 0 and produces output — but the output is wrong. This is the hardest category to catch.
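One cheap way to expose this class of bug before reading any code: the answer should be identical at any thread count, so compare runs. A sketch (binary name is a placeholder; for the Numba variant, NUMBA_NUM_THREADS plays the same role):

```shell
# A race hides at 1 thread: treat the single-threaded result as ground truth
OMP_NUM_THREADS=1 ./sum_program    # placeholder binary name
OMP_NUM_THREADS=8 ./sum_program    # a different answer points to a race
OMP_NUM_THREADS=8 ./sum_program    # run twice: varying output is the giveaway
```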

Scheduler etiquette

The scheduler is shared — poll responsibly

Do this

  • squeue --me — run manually when you need it
  • watch -n 15 squeue --me — minimum 15-second interval
  • squeue --start --me — check ETA for pending jobs once
  • scancel <jobid> — cancel jobs you no longer need

Do not do this

  • watch -n 1 squeue — hammers the scheduler
  • Looping squeue in a shell script without sleep
  • Leaving idle interactive allocations running
  • Submitting dozens of test jobs to check a small fix

A good rule: if you would not open-close a door 60 times a minute, do not poll the scheduler that fast either.
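If you do need to wait on a job in a script, the polite pattern above can be sketched as a loop that sleeps between checks and stops as soon as the job leaves the queue (the job ID is a placeholder):

```shell
# Polite polling: sleep between checks, exit when the job leaves the queue
JOBID=123456                       # placeholder job ID
while squeue -j "$JOBID" --noheader | grep -q .; do
    sleep 30                       # well above the 15-second minimum
done
echo "job $JOBID has left the queue"
```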

Discussion

Which exercise matched a problem you have seen before? Anything you want to dig into further?