
Python Example + Array Jobs + Parallelism Strategies
From a single job to hybrid MPI —
and what to do when the work is embarrassingly parallel
Section 5 — 25 min


A parallelisable computation we can use to compare strategies
Draw random points in a square. Count how many fall inside the quarter-circle.
\[\hat{\pi} = 4 \times \frac{\text{hits}}{N}\]
Why this example? It is small, compute-bound, and trivially parallelisable, so one
problem can illustrate every strategy in the section. The scripts are in ex01_monte_carlo_pi/.
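As a baseline for the comparisons below, the estimator can be sketched in pure Python (a minimal sketch, not the course script; `estimate_pi` is an illustrative name):

```python
import random

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling the unit square and counting
    hits inside the quarter-circle x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n_samples
```

With 10**6 samples the statistical error is about 1.6/sqrt(N) ≈ 0.002, so the estimate lands close to 3.14. This loop is exactly what NumPy vectorisation and Numba JIT compilation speed up.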
ex01 — run all Python variants side by side
From ex01_monte_carlo_pi/:
The script runs monte-carlo-pi-summary, which times pure Python →
NumPy → Numba → Numba-parallel and prints one comparison table.
Look for: the time[s] column. Same
total samples, different implementations. How much faster is Numba than
pure Python? Does --cpus-per-task=4 help
Numba-parallel?
Open the script and try: change -n
(samples per thread) or -t (threads).
Note: -n is per thread, so increasing -t also
increases total samples (weak scaling). To keep total samples fixed,
scale -n down proportionally.
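The -n/-t interaction can be checked with a few lines of arithmetic (a sketch; `total_samples` is an illustrative helper, and the flag meanings match the note above):

```python
def total_samples(n_per_thread: int, threads: int) -> int:
    # -n is per thread, so total work grows with -t (weak scaling)
    return n_per_thread * threads

base = total_samples(2**29, 4)

# doubling -t doubles the total samples...
assert total_samples(2**29, 8) == 2 * base

# ...so to keep total samples fixed, halve -n when doubling -t
assert total_samples(2**28, 8) == base
```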
ex01 — scale to all 144 cores on one Grace node
The script sweeps every valid (nproc, nthreads) decomposition of the 144 cores,
walking the parallelism hierarchy: MPI ranks at the top, threads within each rank below.
Which decomposition is fastest? Open the output and look.
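A valid decomposition is any pair with nproc × nthreads = 144, so the sweep is over the divisors of 144 (a sketch of the enumeration, not the course script):

```python
CORES = 144

# every (nproc, nthreads) pair whose product fills the node exactly
decompositions = [(p, CORES // p) for p in range(1, CORES + 1) if CORES % p == 0]

print(len(decompositions))   # 144 = 2^4 * 3^2 has 15 divisors
print(decompositions[:4])    # [(1, 144), (2, 72), (3, 48), (4, 36)]
```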
ex01 — weak scaling across 4 nodes (576 cores)
Weak scaling: resources grow 4× (144 → 576 cores), per-thread work stays constant.
Ideal outcome: wall-clock time stays the same as the single-node run — you have 4× more work and 4× more hardware.
Why MPI is needed beyond one node: shared memory does not reach across a network interconnect. MPI is the standard way to coordinate distributed-memory computation on an HPC cluster.
This is the canonical HPC pattern — use it when your computation scales and the work justifies multi-node resources.
ex02 — the same hybrid pattern in a compiled language
From ex02_monte_carlo_pi_c/:
```bash
bash make.sh
sbatch sbatch_monte_carlo_pi_mpi_hybrid_c.sh             # single node
sbatch sbatch_monte_carlo_pi_mpi_hybrid_c_multinode.sh   # 4 nodes
```

The C version replaces Numba threads with an OpenMP
#pragma omp parallel region. Everything else is the same
(-n = samples per thread). Compiled code is faster (no JIT overhead, better compiler optimisations), but the programming model and parallelism hierarchy are identical.
This is a stretch exercise — come back to it if you finish early.
When you have N tasks that don’t need to talk to each other
MPI is the right tool when tasks communicate during execution.
But many real workflows are different: each task is independent, and no communication is needed. This is called embarrassingly parallel.
The pattern: map → run → reduce
Using HPC this way is sometimes called HTC (High-Throughput Computing).
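The map → run → reduce pattern for this problem, sketched serially (seed and sample counts are illustrative; on a cluster, each "run" executes as an independent job):

```python
import random

def run_task(seed: int, n: int = 100_000) -> int:
    """One independent task: count quarter-circle hits for one seed."""
    rng = random.Random(seed)
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))

# map: one task per seed (these could run anywhere, in any order)
seeds = range(36)

# run: no task ever talks to another
partial_hits = [run_task(s) for s in seeds]

# reduce: combine the independent results into one estimate
n_total = 100_000 * len(partial_hits)
pi_hat = 4.0 * sum(partial_hits) / n_total
```

Everything that follows (job arrays, GNU parallel, mpi4py futures, multiprocessing) is a different way to execute the "run" step concurrently.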
One sbatch command, 36 independent jobs
From ex03_job_array/:
What this does:
- Creates a results/ directory
- --array=1-36%36: 36 tasks, each with its own seed ($SLURM_ARRAY_TASK_ID)
- reduce-mc-pi-results combines all 36 results/mc_pi_<jobid>_<taskid>.txt files

Monitor: squeue --me shows <jobid>_1,
<jobid>_2, … — one entry per task.
Throttling: append %M to cap concurrency at M simultaneous tasks:
--array=1-1000%50 queues 1000 tasks but runs at most 50 at once.
This example uses 4 threads per task and
2^29 samples per thread, so the array can occupy all
144 Grace CPU cores when all 36 tasks are running.
Open sbatch_monte_carlo_pi_array.sh and change the array
size or seed range.
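Inside each array task, the per-task seed is typically derived from $SLURM_ARRAY_TASK_ID, which Slurm sets in the task's environment; a minimal Python sketch (the base-seed value is illustrative, not taken from the course script):

```python
import os

# Slurm sets SLURM_ARRAY_TASK_ID in each array task's environment;
# default to 1 so the script also runs outside Slurm
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))

BASE_SEED = 20240101           # illustrative; any fixed base works
seed = BASE_SEED + task_id     # distinct, reproducible seed per task
```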
All tasks on one node — no per-task scheduler overhead
From ex04_gnu_parallel/:
What this does:
- Generates tasks.txt (one complete command per line, with taskset and /usr/bin/time -v)
- Runs parallel --jobs 36 < tasks.txt on an exclusive 144-core node
- reduce-mc-pi-results combines results/mc_pi_gnu_*.txt

After the pre job: cat tasks.txt to see the commands.
The pre-job log also prints a concrete slot-1 preview
(taskset -c 0-3 ...). Each line pins its process to a
disjoint core range using $PARALLEL_JOBSLOT — no
oversubscription.
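The arithmetic behind those disjoint ranges can be sketched as follows ($PARALLEL_JOBSLOT is 1-based, so slot s with t threads gets cores (s−1)·t through s·t−1; `core_range` is an illustrative helper, not the course script):

```python
def core_range(slot: int, threads: int = 4) -> str:
    """Core list for one GNU parallel job slot, e.g. for taskset -c."""
    start = (slot - 1) * threads
    return f"{start}-{start + threads - 1}"

# 36 slots x 4 threads tile the 144-core node with no overlap
print(core_range(1))    # 0-3
print(core_range(36))   # 140-143
```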
Oversubscription check: wall-clock time per task
(from /usr/bin/time -v) should match the array job.
Open generate_tasks.py and tweak N_TASKS,
N_THREADS, or N_CONCURRENT.
Map-reduce in one Python script — no job chaining needed
From ex05_mpi4py_futures/:
MPIPoolExecutor distributes tasks across MPI ranks:
With --ntasks=36 and --cpus-per-task=4 you
get 35 worker ranks plus 1 controller
rank. Pre, map, and reduce are all in the same Python script —
no Slurm job chaining needed.
```python
from mpi4py.futures import MPIPoolExecutor

with MPIPoolExecutor() as executor:
    results = list(executor.map(_worker, task_args))
```

Launch:

```bash
srun -n 36 -c 4 python -m mpi4py.futures -m <module> [args]
```
Single-node Pool.map — spawn, sched_getaffinity, no nested threads
From ex06_multiprocessing/:
Three NERSC-recommended patterns for HPC:
```python
import multiprocessing as mp
import os

ctx = mp.get_context("spawn")    # safe: no fork of MPI/Numba state
available = sorted(os.sched_getaffinity(0))
n_workers = len(available) // 4  # 36 workers on a 144-core node
with ctx.Pool(processes=n_workers, initializer=_init_worker, ...) as pool:
    results = pool.map(_worker, task_args)
```

Why sched_getaffinity, not cpu_count? cpu_count() returns all
144 cores on a Grace node; sched_getaffinity(0) returns
only the CPUs Slurm allocated (--cpus-per-task).
Why spawn? fork copies MPI communicators and Numba JIT caches
into the child, which causes hangs on HPC.
When pipelines get complex — use a workflow manager
Job arrays and GNU parallel work well for simple independent-task pipelines. For more complex workflows:
| Tool | Model | Good for |
|---|---|---|
| Parsl | Python-native, HPC-aware | Python workflows, dynamic graphs |
| Nextflow | DSL, container-first | Bioinformatics, reproducibility |
| Snakemake | Makefile-inspired | Data science, file-based dependencies |
| Dask | Python, in-process | Array/dataframe workloads |
Common features: dependency graphs, retry on failure, provenance tracking, cluster-aware submission.
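What a dependency graph buys you can be sketched with the standard library alone (graphlib, Python 3.9+); real workflow managers add retries, provenance, and cluster-aware submission on top (task names below are illustrative, mirroring the pre → map → reduce shape of ex03/ex04):

```python
from graphlib import TopologicalSorter

# pre -> three independent map tasks -> reduce
graph = {
    "map1": {"pre"},
    "map2": {"pre"},
    "map3": {"pre"},
    "reduce": {"map1", "map2", "map3"},
}

# a valid execution order: 'pre' first, 'reduce' last,
# map tasks in between (and runnable concurrently)
order = list(TopologicalSorter(graph).static_order())
print(order)
```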
These are follow-up tools — start with job arrays and grow into a workflow manager when your pipeline outgrows them.