
NVIDIA Grace CPU Superchip
The hardware powering every Isambard 3 node
GW4 Isambard 3 Practical Workshop — 21 April 2026


What “Superchip” means
A Grace CPU Superchip packages two NVIDIA Grace CPUs on a single compact module, linked by a 900 GB/s NVLink-C2C interconnect.
This tight coupling is why the Superchip behaves more like a single processor than a conventional dual-socket server.

Two NUMA nodes, one Superchip
Within each Grace CPU, cores, cache, memory, and I/O are connected by the NVIDIA Scalable Coherency Fabric (SCF) — a high-bandwidth mesh.
Conventional dual-socket servers may expose four or more NUMA nodes with slow inter-socket interconnects. Grace is notably simpler.
Practical rule: treat each node as two NUMA zones, each with 72 cores and ~120 GB of memory.
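The rule above can be checked and applied from a job script. A minimal sketch, assuming the common linear layout (cores 0–71 on NUMA node 0, cores 72–143 on node 1) and that numactl is installed:

```shell
# Which NUMA node owns a given core? (72 cores per node, linear numbering assumed)
core=100
node=$(( core / 72 ))
echo "core $core -> NUMA node $node"

# Pin a run to one NUMA zone so its memory allocations stay local
# (numactl assumed available; ./my_app is a placeholder):
#   numactl --cpunodebind=$node --membind=$node ./my_app
```

Keeping a process's threads and memory on the same NUMA node avoids crossing NVLink-C2C for every load.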
72 cores, a mesh fabric, and co-packaged LPDDR5X
72 × Arm Neoverse V2 cores
3.2 TB/s NVIDIA Scalable Coherency Fabric connecting cores, L3, memory, and I/O
900 GB/s NVLink-C2C to the second Grace CPU
240 GB of LPDDR5X memory at up to 1 TB/s
Grace uses LPDDR5X with ECC, physically co-packaged with the CPU dies on the same module.
240 GB total on this Superchip, split as 2 × 120 GB — one 120 GB NUMA node per Grace CPU.
| Scope | Peak bandwidth |
|---|---|
| Per Grace CPU | up to 512 GB/s |
| Per Grace CPU Superchip | up to 1 TB/s |
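A quick sanity check that the two table rows agree: the Superchip figure is simply two CPUs' worth of bandwidth.

```shell
# Two Grace CPUs at up to 512 GB/s each; the Superchip peak is their sum
per_cpu_gbs=512
total=$(( 2 * per_cpu_gbs ))
echo "$total GB/s per Superchip"   # 1024 GB/s, quoted as "up to 1 TB/s"
```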
Co-packaging eliminates the off-module interconnect bottleneck. The result is unusually high bandwidth for a CPU platform — competitive with some HBM-equipped accelerators.
Memory-bandwidth-sensitive codes (FFTs, sparse solvers, molecular dynamics) often benefit the most from Grace.
Four 128-bit SIMD units per core
Each Neoverse V2 core contains four 128-bit SIMD units supporting two instruction sets.
NEON (Advanced SIMD): fixed 128-bit width; the standard Arm SIMD set. Widely supported across compilers and libraries.
SVE2: an Armv9-A feature; also 128 bits wide on V2, but code is written length-agnostically, so it can target future wider implementations without recompilation.
Use -mcpu=neoverse-v2 with the GNU compiler (the recommended path on Isambard 3).
Through the compiler wrappers (cc, CC, ftn), add -mcpu=neoverse-v2 to your flags.
-mcpu sets both the architecture target and the tuning in one flag; it is the correct flag for Arm, unlike -march, which is the x86 convention.
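Putting the flag into practice, a sketch of compile lines through the wrappers named above (source and output file names are placeholders):

```shell
# Target and tune for Neoverse V2 with a single flag, via each wrapper
cc  -O3 -mcpu=neoverse-v2 solver.c   -o solver_c    # C
CC  -O3 -mcpu=neoverse-v2 solver.cpp -o solver_cpp  # C++
ftn -O3 -mcpu=neoverse-v2 solver.f90 -o solver_f90  # Fortran
```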
Back-of-the-envelope from first principles
Per core, per cycle — FP64:
\[\text{FLOPS/cycle per core} = (\text{elements per vector}) \times (\text{vector units}) \times (\text{ops per FMA})\]
\[\text{FLOPS/cycle per core} = 2 \times 4 \times 2 = 16\]
Scaling to the full Superchip at 3.1 GHz base frequency:
\[\text{Total FP64 Peak} = 144 \times 3.1 \times 10^{9} \times 16 \approx 7.1 \text{ TFLOPS}\]
NVIDIA’s published figure is 7.1 TFLOPS FP64 peak, consistent with the 3.1 GHz base frequency. At the 3.0 GHz all-core SIMD frequency the same calculation gives ≈ 6.9 TFLOPS — the difference is simply which frequency NVIDIA chose to publish.
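Both frequency cases can be reproduced in a couple of lines; the 144 cores and 16 FLOPS/cycle are taken from the derivation above:

```shell
# FP64 peak = cores x frequency (GHz) x FLOPS/cycle, reported in TFLOPS
cores=144
flops_per_cycle=16   # 2 FP64 lanes x 4 SIMD units x 2 ops per FMA
awk -v c="$cores" -v p="$flops_per_cycle" 'BEGIN {
    printf "3.1 GHz base:     %.1f TFLOPS\n", c * 3.1 * p / 1000
    printf "3.0 GHz all-core: %.1f TFLOPS\n", c * 3.0 * p / 1000
}'
```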
What to remember when planning your jobs
144 Arm Neoverse V2 cores per node
2 NUMA nodes per node (72 cores + 120 GB each)
240 GB LPDDR5X memory per node, with ECC
1 TB/s peak memory bandwidth per node
900 GB/s NVLink-C2C between the two CPUs
7.1 TFLOPS FP64 peak per node
Each node in Isambard 3 is one Grace CPU Superchip. Across 384 nodes: 55,296 cores and ~92 TB of total memory.
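The closing totals follow directly from the per-node figures:

```shell
# Isambard 3 Grace partition totals from per-node specs
nodes=384
echo "cores:  $(( nodes * 144 ))"       # 144 cores per node
echo "memory: $(( nodes * 240 )) GB"    # 240 GB per node, ~92 TB total
```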