Debugging Failed Jobs


Read the error, form a hypothesis, fix it —
hands-on practice with broken Slurm scripts


Section 6 — 20 min


Exercise menu

Pick what is most relevant to your work — you do not need to do all of them

Exercise  Topic
ex01      Oversubscription — srun uses more CPUs than allocated
ex02      Wrong env (module) — module load PrgEnv-gnu missing
ex03      Wrong env (pixi missing) — pixi not activated on compute node
ex04      Wrong env (pixi wrong) — wrong pixi env (default vs hpc)
ex05      OOM — huge matrix, only 1 CPU, job killed by OOM killer
ex06      MPI topology — uneven rank distribution across nodes
ex07      Race condition — missing OpenMP reduction clause (C + Numba)

Each exercise directory has a broken script and a walkthrough .md. Start with the one closest to your own work.

Before you look at the hints…

The struggle is the learning

Do this first

  • Read the error message carefully
  • Form a hypothesis about what is wrong
  • Try one fix and observe what changes
  • Hints are at the bottom of each exercise .md

Resist the urge to

  • Immediately scroll to the hint
  • Paste the error straight into an LLM
  • Ask a helper before you have a hypothesis

LLMs and helpers are fine — but only after you have spent a few minutes thinking. The mental model you build by struggling is the point.

When a job fails

Four steps, in order — do not skip to step 4

1. Read output

cat job.out
cat job.err

The error message is almost always here. Look at the last few lines.

2. sacct

sacct --format=JobID,State,\
ExitCode,MaxRSS -j <jobid>

State, exit code, and peak memory — the scheduler’s view.

3. squeue state

PD pending
R running
CG completing

squeue --start --me

4. Hypothesis

Write down what you think is wrong before editing the script.

Blind edits waste time.

Environment issues

ex01–ex04 — four different ways the environment can be wrong

ex01 Oversubscription

The program launches N OpenMP threads, but --cpus-per-task is not set (it defaults to 1), so all N threads compete for a single CPU.
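A minimal sketch of the usual fix (script and binary names here are placeholders, not taken from the exercise): request the CPUs explicitly and size the thread pool from Slurm's allocation rather than hard-coding it.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # without this line, the default is 1 CPU

# Size the OpenMP thread pool to match the allocation instead of hard-coding N
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./my_threaded_program         # placeholder binary name
```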

ex02 Module missing

Compiled binary needs PrgEnv-gnu; the script forgot module load PrgEnv-gnu. Job fails with a linker or shared-library error.
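The fix is usually a one-liner, but it matters where it goes: the batch script runs in a fresh shell on the compute node, so the module must be loaded inside the script, not just in your login session. A sketch (binary name is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=1

# Batch jobs start a fresh shell: load the toolchain the binary was built with
module load PrgEnv-gnu

srun ./compiled_binary             # placeholder binary name
```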

ex03 Pixi not activated

Compute node starts a fresh shell. Without pixi run or eval "$(pixi shell-hook)", the Python and packages from the pixi environment are not on PATH.
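Both activation styles mentioned above can be sketched in a batch script like this (the script name is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=1

cd "$SLURM_SUBMIT_DIR"             # run from the project directory containing pixi.toml

# Either wrap each command...
pixi run python analysis.py        # placeholder script name

# ...or activate the environment once for the rest of the script:
# eval "$(pixi shell-hook)"
# python analysis.py
```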

ex04 Wrong pixi environment

The script activates the default environment; MPI-enabled code needs the hpc environment (Cray MPICH build).
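Selecting the environment explicitly avoids this; "hpc" is the environment name from this exercise, the script name is a placeholder:

```shell
# Select the named environment rather than relying on the default
pixi run -e hpc python mpi_code.py   # placeholder script name

# Sanity check: confirm which interpreter the environment puts on PATH
pixi run -e hpc which python
```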

Resource issues

ex05–ex06 — memory and topology

ex05 Out of memory

A large matrix is allocated, but the job only requests 1 CPU and the default memory. The OOM killer terminates the process. sacct shows State=OUT_OF_MEMORY, or the process exits with code 137 (128 + SIGKILL).
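The fix is to request memory explicitly, then confirm the request against what the job actually used. A sketch (the 16G figure and binary name are illustrative, not from the exercise):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=16G                  # request memory explicitly instead of the default

srun ./allocate_big_matrix         # placeholder binary name

# Afterwards, compare the request against actual peak usage:
# sacct --format=JobID,State,ReqMem,MaxRSS -j <jobid>
```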

ex06 MPI topology

Ranks are spread unevenly across nodes — e.g. 3 ranks on node A, 1 on node B. Collective operations stall or produce incorrect timings because the load is not balanced.
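Pinning the layout with --ntasks-per-node avoids the scheduler filling one node first. A sketch with illustrative sizes (4 ranks over 2 nodes, not the exercise's actual counts):

```shell
# Force an even rank layout: 2 ranks on each of 2 nodes
srun --nodes=2 --ntasks=4 --ntasks-per-node=2 ./mpi_program

# Print where each rank landed to confirm the layout
srun --nodes=2 --ntasks=4 --ntasks-per-node=2 \
    bash -c 'echo "rank $SLURM_PROCID on $(hostname)"'
```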

Correctness issues

ex07 — the job finishes but gives the wrong answer

ex07 Race condition

An OpenMP parallel loop accumulates into a shared variable without a reduction clause. Multiple threads write to the same memory simultaneously — the result is non-deterministic and usually wrong.

The C version and the Numba @numba.njit(parallel=True) variant both have this bug.

The job exits 0 and produces output — but the output is wrong. This is the hardest category to catch.
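One cheap way to expose this class of bug before reading any code: the answer should be identical at any thread count, so compare runs. A sketch (binary name is a placeholder; for the Numba variant, NUMBA_NUM_THREADS plays the same role):

```shell
# A race hides at 1 thread: treat the single-threaded result as ground truth
OMP_NUM_THREADS=1 ./sum_program    # placeholder binary name
OMP_NUM_THREADS=8 ./sum_program    # a different answer points to a race
OMP_NUM_THREADS=8 ./sum_program    # run twice: varying output is the giveaway
```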

Scheduler etiquette

The scheduler is shared — poll responsibly

Do this

  • squeue --me — run manually when you need it
  • watch -n 15 squeue --me — minimum 15-second interval
  • squeue --start --me — check ETA for pending jobs once
  • scancel <jobid> — cancel jobs you no longer need

Do not do this

  • watch -n 1 squeue — hammers the scheduler
  • Looping squeue in a shell script without sleep
  • Leaving idle interactive allocations running
  • Submitting dozens of test jobs to check a small fix

A good rule: if you would not open-close a door 60 times a minute, do not poll the scheduler that fast either.
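If you do need to wait on a job in a script, the polite pattern above can be sketched as a loop that sleeps between checks and stops as soon as the job leaves the queue (the job ID is a placeholder):

```shell
# Polite polling: sleep between checks, exit when the job leaves the queue
JOBID=123456                       # placeholder job ID
while squeue -j "$JOBID" --noheader | grep -q .; do
    sleep 30                       # well above the 15-second minimum
done
echo "job $JOBID has left the queue"
```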

Discussion

Which exercise matched a problem you have seen before? Anything you want to dig into further?