
Debugging Failed Jobs

Read the error, form a hypothesis, fix it — hands-on practice with broken Slurm scripts.

Resources: `sacct` docs
Section 6 — 20 min


Pick what is most relevant to your work — you do not need to do all of them
| Exercise | Topic |
|---|---|
| ex01 | Oversubscription — `srun` uses more CPUs than allocated |
| ex02 | Wrong env (module) — `module load PrgEnv-gnu` missing |
| ex03 | Wrong env (pixi missing) — pixi not activated on compute node |
| ex04 | Wrong env (pixi wrong) — wrong pixi env (default vs hpc) |
| ex05 | OOM — huge matrix, only 1 CPU, job killed by OOM killer |
| ex06 | MPI topology — uneven rank distribution across nodes |
| ex07 | Race condition — missing OpenMP reduction clause (C + Numba) |
Each exercise directory has a broken script and a walkthrough `.md`. Start with the one closest to your own work.
The struggle is the learning

Do this first: resist the urge to reach for help straight away. LLMs and helpers are fine — but only after you have spent a few minutes thinking. The mental model you build by struggling is the point.
Four steps, in order — do not skip to step 4

1. Read the output
The error message is almost always here. Look at the last few lines.
2. sacct
State, exit code, and peak memory — the scheduler's view.
3. Hypothesis
Write down what you think is wrong before editing the script. Blind edits waste time.
4. Fix
Only now edit the script and resubmit.
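For the `sacct` step, a sketch of a typical query (the job ID is a placeholder; the flags and field names are standard `sacct` options):

```shell
# Inspect a finished job's state, exit code, and peak memory.
# 123456 is a placeholder job ID.
sacct -j 123456 --format=JobID,JobName,State,ExitCode,MaxRSS,Elapsed
```

`MaxRSS` is the peak resident memory the scheduler recorded, which is what you compare against your `--mem` request.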
ex01–ex04 — four different ways the job setup (resources and environment) can be wrong
ex01 Oversubscription
The program spawns N threads, but --cpus-per-task is not set (it defaults to 1), so all threads compete for a single CPU.
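A sketch of the fix, assuming an OpenMP program; the thread count and binary name are illustrative:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8   # without this line, the task gets 1 CPU

# Match the thread count to the allocation instead of hard-coding it.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_openmp_program    # placeholder binary name
```

Reading the thread count from `SLURM_CPUS_PER_TASK` keeps the script correct if you later change the allocation.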
ex02 Module missing
Compiled binary needs PrgEnv-gnu; the script forgot
module load PrgEnv-gnu. Job fails with a linker or
shared-library error.
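A sketch of the fixed script (binary name is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=1

# Load the same environment the binary was compiled with;
# otherwise the loader cannot find the GNU runtime libraries.
module load PrgEnv-gnu
srun ./compiled_binary   # placeholder binary name
```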
ex03 Pixi not activated
Compute node starts a fresh shell. Without pixi run or
eval "$(pixi shell-hook)", the Python and packages from the
pixi environment are not on PATH.
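Two ways to fix it, matching the two mechanisms named above (`analysis.py` is a placeholder script name):

```shell
# Option 1: let pixi wrap the command.
srun pixi run python analysis.py

# Option 2: activate the environment in the batch shell, then run normally.
eval "$(pixi shell-hook)"
srun python analysis.py
```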
ex04 Wrong pixi environment
The script activates the default environment;
MPI-enabled code needs the hpc environment (Cray MPICH
build).
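A sketch of the fix, assuming pixi's `-e`/`--environment` selector and a placeholder script name:

```shell
# Select the hpc environment explicitly; the default environment
# lacks the Cray MPICH build the MPI code needs.
srun pixi run -e hpc python mpi_script.py   # mpi_script.py is a placeholder
```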
ex05–ex06 — memory and topology
ex05 Out of memory
A large matrix is allocated, but the job only requests 1 CPU and the
default memory. The OOM killer terminates the process.
sacct shows State=OUT_OF_MEMORY or an exit code of 137 (128 + signal 9, SIGKILL).
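Before raising the memory request, it helps to estimate the footprint. A quick shell calculation, assuming an N x N float64 matrix (N is illustrative):

```shell
# An N x N matrix of float64 values needs N * N * 8 bytes.
n=20000
bytes=$((n * n * 8))
mib=$((bytes / 1024 / 1024))
echo "need about ${mib} MiB"   # 3051 MiB for n=20000
```

Round up for headroom when setting the request, e.g. `#SBATCH --mem=4G` for this size.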
ex06 MPI topology
Ranks are spread unevenly across nodes — e.g. 3 ranks on node A, 1 on node B. Collective operations stall or produce incorrect timings because the load is not balanced.
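A sketch of a balanced layout for the 4-rank, 2-node case described above (binary name is a placeholder):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2   # force an even 2+2 split instead of 3+1

srun ./mpi_program            # placeholder binary name
```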
ex07 — the job finishes but gives the wrong answer
ex07 Race condition
An OpenMP parallel loop accumulates into a shared variable without a
reduction clause. Multiple threads write to the same memory
simultaneously — the result is non-deterministic and usually wrong.
The C version and the Numba @numba.njit(parallel=True)
variant both have this bug.
The job exits 0 and produces output — but the output is wrong. This is the hardest category to catch.
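The same bug class can be reproduced at the shell level: concurrent, unsynchronized read-modify-write updates lose increments, and serializing the update fixes it. A sketch using `flock` in the role the reduction clause plays in OpenMP:

```shell
# 50 background jobs each increment a shared counter.
# flock serializes the read-modify-write; remove it and updates get lost.
count_file=$(mktemp)
lock_file=$(mktemp)
echo 0 > "$count_file"
for i in $(seq 1 50); do
  (
    flock 9                           # wait for an exclusive lock on fd 9
    n=$(cat "$count_file")
    echo $((n + 1)) > "$count_file"
  ) 9>"$lock_file" &
done
wait
total=$(cat "$count_file")
echo "total = $total"                 # 50 with the lock; usually less without
```

As with the OpenMP bug, the unlocked version often still prints a plausible number, which is exactly why this category is hard to catch.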
The scheduler is shared — poll responsibly
Do this
- `squeue --me` — run manually when you need it
- `watch -n 15 squeue --me` — minimum 15-second interval
- `squeue --start --me` — check ETA for pending jobs once
- `scancel <jobid>` — cancel jobs you no longer need

Do not do this

- `watch -n 1 squeue` — hammers the scheduler
- `squeue` in a shell script without sleep

A good rule: if you would not open-close a door 60 times a minute, do not poll the scheduler that fast either.
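If a script really must wait for a job, a sketch of a polite loop (the job ID is a placeholder; the interval respects the 15-second rule above):

```shell
jobid=123456   # placeholder job ID
# Poll until the job leaves the queue, never faster than every 30 s.
while squeue -j "$jobid" --noheader 2>/dev/null | grep -q .; do
  sleep 30
done
echo "job $jobid has left the queue"
```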
Discussion
Which exercise matched a problem you have seen before? Anything you want to dig into further?