Python Example + Array Jobs + Parallelism Strategies


From a single job to hybrid MPI —
and what to do when the work is embarrassingly parallel

Section 5 — 25 min


The example: Monte Carlo π

A parallelisable computation we can use to compare strategies

Draw random points in a square. Count how many fall inside the quarter-circle.

\[\hat{\pi} = 4 \times \frac{\text{hits}}{N}\]

Why this example?

  • Trivially parallelisable: each sample is independent
  • Same code runs in pure Python, NumPy, Numba, MPI+Numba, and C+OpenMP
  • Results are checkable — we know the answer

The scripts are in ex01_monte_carlo_pi/. We use this one problem to illustrate every strategy in the section.
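The estimator itself fits in a few lines. A minimal NumPy sketch of the idea (for orientation only; the course scripts in ex01_monte_carlo_pi/ are the versions to run):

```python
import numpy as np

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of pi: draw points uniformly in the unit square,
    count the fraction inside the quarter-circle, multiply by 4."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples)
    y = rng.random(n_samples)
    hits = np.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * hits / n_samples

print(estimate_pi(1_000_000))  # close to 3.14159; error shrinks like 1/sqrt(N)
```

Because every sample is independent, splitting N across threads, ranks, or jobs and summing the hit counts gives the identical estimator.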

Single job

ex01 — run all Python variants side by side

From ex01_monte_carlo_pi/:

sbatch sbatch_monte_carlo_pi_single.sh
squeue --me
cat mc_pi_single_<jobid>.out

The script runs monte-carlo-pi-summary: pure Python → NumPy → Numba → Numba-parallel, one table.

Look for: the time[s] column. Same total samples, different implementations. How much faster is Numba than pure Python? Does --cpus-per-task=4 help Numba-parallel?

Open the script and try: change -n (samples per thread) or -t (threads). Note: -n is per thread, so increasing -t also increases total samples (weak scaling). To keep total samples fixed, scale -n down proportionally.

Hybrid MPI + Numba (single node)

ex01 — scale to all 144 cores on one Grace node

sbatch sbatch_monte_carlo_pi_mpi_hybrid.sh
cat mc_pi_py_<jobid>.out

The script sweeps every valid (nproc, nthreads) decomposition on 144 cores:

  • nproc=1, nthreads=144 — one MPI rank, pure threading (shared memory only)
  • nproc=144, nthreads=1 — 144 MPI ranks, no threading (pure MPI)
  • Everything in between: hybrid

Parallelism hierarchy:

  • Numba threads — share memory within a rank (fast, no network)
  • MPI — explicit message passing between ranks (needed to go beyond one node)

Which decomposition is fastest? Open the output and look.
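The sweep covers every divisor pair of 144. A quick way to list the decompositions the script iterates over (illustrative only, not the script itself):

```python
def decompositions(n_cores: int = 144):
    """All (nproc, nthreads) pairs with nproc * nthreads == n_cores."""
    return [(p, n_cores // p) for p in range(1, n_cores + 1) if n_cores % p == 0]

for nproc, nthreads in decompositions():
    print(f"nproc={nproc:3d}  nthreads={nthreads:3d}")
# 144 = 2^4 * 3^2, so there are 15 valid decompositions to time
```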

Hybrid MPI + Numba (multi-node)

ex01 — weak scaling across 4 nodes (576 cores)

sbatch sbatch_monte_carlo_pi_mpi_hybrid_multinode.sh
cat mc_pi_py_mn_<jobid>.out

Weak scaling: resources grow 4× (144 → 576 cores), per-thread work stays constant.

Ideal outcome: wall-clock time stays the same as the single-node run — you have 4× more work and 4× more hardware.
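One way to put a number on how close a run gets to that ideal — weak-scaling efficiency is simply the single-node time over the multi-node time (the timings below are made-up placeholders, not measured values):

```python
def weak_scaling_efficiency(t_single_node: float, t_multi_node: float) -> float:
    """Weak scaling: work per thread is constant, so the ideal wall-clock
    time is unchanged. Efficiency = t(1 node) / t(N nodes); 1.0 is perfect."""
    return t_single_node / t_multi_node

# hypothetical numbers for illustration only
print(f"{weak_scaling_efficiency(12.0, 13.5):.0%}")  # 89%
```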

Why MPI is needed beyond one node: shared memory does not reach across a network interconnect. MPI is the standard way to coordinate distributed-memory computation on an HPC cluster.

This is the canonical HPC pattern — use it when your computation scales and the work justifies multi-node resources.

MPI + OpenMP in C

ex02 — the same hybrid pattern in a compiled language

From ex02_monte_carlo_pi_c/:

bash make.sh
sbatch sbatch_monte_carlo_pi_mpi_hybrid_c.sh        # single node
sbatch sbatch_monte_carlo_pi_mpi_hybrid_c_multinode.sh  # 4 nodes

The C version replaces Numba threads with OpenMP #pragma omp parallel. Everything else is the same:

  • Same MPI reduction across ranks
  • Same weak-scaling design (-n = samples per thread)
  • Same decomposition sweep

Compiled code is faster (no JIT overhead, better compiler optimisations), but the programming model and parallelism hierarchy are identical.

This is a stretch exercise — come back to it if you finish early.

A different problem: many independent tasks

When you have N tasks that don’t need to talk to each other

MPI is the right tool when tasks communicate during execution.

But many real workflows are different:

  • Run the same analysis on 130 imaging cases
  • Sweep 1000 parameter combinations
  • Repeat a simulation 500 times with different seeds

Each task is independent. No communication needed. This is called embarrassingly parallel.

The pattern: map → run → reduce

  1. Map — split work into N independent tasks
  2. Run — execute all tasks (possibly in parallel)
  3. Reduce — combine results into one answer

Using HPC this way is sometimes called HTC (High-Throughput Computing).
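The whole pattern in miniature, on a single machine — a sketch using the standard-library ProcessPoolExecutor, with a trivial task function standing in for whatever your real analysis does:

```python
from concurrent.futures import ProcessPoolExecutor

def task(seed: int) -> int:
    """Stand-in for one independent unit of work (e.g. one MC batch)."""
    return seed * seed

if __name__ == "__main__":
    seeds = range(1, 37)                        # map: 36 independent tasks

    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(task, seeds))  # run: tasks execute in parallel

    print(sum(partials))                        # reduce: one combined answer
```

The exercises that follow swap in different "run" machinery — Slurm arrays, GNU parallel, MPI ranks, multiprocessing — while the map and reduce steps stay conceptually the same.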

ex03: Slurm job array

One sbatch command, 36 independent jobs

From ex03_job_array/:

bash run_array_pipeline.sh

What this does:

  1. Pre — create results/ directory
  2. Main — --array=1-36%36: 36 tasks, each with its own seed ($SLURM_ARRAY_TASK_ID)
  3. Post — reduce-mc-pi-results combines all 36 results/mc_pi_<jobid>_<taskid>.txt files

Monitor: squeue --me shows <jobid>_1, <jobid>_2, … — one entry per task.

Throttling: append %M to the range to cap concurrency at M simultaneous tasks: --array=1-1000%50

This example uses 4 threads per task and 2^29 samples per thread, so the array can occupy all 144 Grace CPU cores when all 36 tasks are running.

Open sbatch_monte_carlo_pi_array.sh and change the array size or seed range.
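Inside the worker, the per-task seeding pattern looks roughly like this (a sketch with a hypothetical base_seed; the actual script's flags may differ):

```python
import os

def task_seed(base_seed: int = 12345) -> int:
    """Derive a unique RNG seed from the Slurm array task ID, so every
    array task draws an independent sample stream. Falls back to task 1
    when run outside an array job."""
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    return base_seed + task_id

print(task_seed())  # e.g. 12346 outside Slurm
```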

ex04: GNU parallel

All tasks on one node — no per-task scheduler overhead

From ex04_gnu_parallel/:

bash run_gnu_parallel_pipeline.sh

What this does:

  1. Pre — generate tasks.txt (one complete command per line, with taskset and /usr/bin/time -v)
  2. Main — parallel --jobs 36 < tasks.txt on an exclusive 144-core node
  3. Post — reduce-mc-pi-results combines results/mc_pi_gnu_*.txt

After the pre job: cat tasks.txt to see the commands. The pre-job log also prints a concrete slot-1 preview (taskset -c 0-3 ...). Each line pins its process to a disjoint core range using $PARALLEL_JOBSLOT — no oversubscription.
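The pinning arithmetic is the interesting part: each 1-based job slot maps to a disjoint 4-core range. A sketch of how a task generator might compute it (hypothetical names; generate_tasks.py is the real implementation):

```python
N_THREADS = 4  # cores per task

def core_range(jobslot: int, n_threads: int = N_THREADS) -> str:
    """Map a 1-based GNU parallel job slot to a disjoint taskset core range,
    e.g. slot 1 -> '0-3', slot 2 -> '4-7'."""
    start = (jobslot - 1) * n_threads
    return f"{start}-{start + n_threads - 1}"

for slot in (1, 2, 36):
    print(f"slot {slot:2d}: taskset -c {core_range(slot)} ...")
```

At run time the generated lines use $PARALLEL_JOBSLOT so the shell substitutes the slot number, giving 36 concurrent tasks that together cover cores 0-143 with no overlap.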

Oversubscription check: wall-clock time per task (from /usr/bin/time -v) should match the array job.

Open generate_tasks.py and tweak N_TASKS, N_THREADS, or N_CONCURRENT.

ex05: mpi4py.futures

Map-reduce in one Python script — no job chaining needed

From ex05_mpi4py_futures/:

sbatch sbatch_mpi4py_futures.sh
cat mc_pi_futures_<jobid>.out

MPIPoolExecutor distributes tasks across MPI ranks:

  • Rank 0 — controller: submits tasks, collects results, reduces
  • Ranks 1..N-1 — workers: each runs one task at a time

With --ntasks=36 and --cpus-per-task=4 you get 35 worker ranks plus 1 controller rank. Pre, map, and reduce are all in the same Python script — no Slurm job chaining needed.

from mpi4py.futures import MPIPoolExecutor

with MPIPoolExecutor() as executor:
    results = list(executor.map(_worker, task_args))

Launch: srun -n 36 -c 4 python -m mpi4py.futures -m <module> [args]

ex06: multiprocessing

Single-node Pool.map — spawn, sched_getaffinity, no nested threads

From ex06_multiprocessing/:

sbatch sbatch_multiprocessing.sh
cat mc_pi_mp_<jobid>.out

Three NERSC-recommended patterns for HPC:

import multiprocessing as mp, os

ctx = mp.get_context("spawn")           # safe: no fork of MPI/Numba state
available = sorted(os.sched_getaffinity(0))
n_workers = len(available) // 4         # 36 workers on a 144-core node

with ctx.Pool(processes=n_workers, initializer=_init_worker, ...) as pool:
    results = pool.map(_worker, task_args)

Why sched_getaffinity, not cpu_count? cpu_count() returns all 144 cores on a Grace node; sched_getaffinity(0) returns only the CPUs Slurm allocated (--cpus-per-task).

Why spawn? fork duplicates MPI communicators and Numba JIT state into the child process, which can cause hangs on HPC systems.

Workflow managers

When pipelines get complex — use a workflow manager

Job arrays and GNU parallel work well for simple independent-task pipelines. For more complex workflows:

Tool       Model                      Good for
Parsl      Python-native, HPC-aware   Python workflows, dynamic graphs
Nextflow   DSL, container-first       Bioinformatics, reproducibility
Snakemake  Makefile-inspired          Data science, file-based dependencies
Dask       Python, in-process         Array/dataframe workloads

Common features: dependency graphs, retry on failure, provenance tracking, cluster-aware submission.

These are follow-up tools — start with job arrays and grow into a workflow manager when your pipeline outgrows them.