CPU, NUMA, and Data Locality: Hidden Bottlenecks in Multi-GPU Training
When local fine-tuning scales from single-GPU experiments to multi-GPU training, most performance discussions remain centered on VRAM capacity and raw compute throughput. Those factors matter, but once a workload spans multiple devices, CPU topology becomes a major determinant of efficiency.
In multi-GPU AI workstation architectures, CPU selection affects more than core count. It defines PCIe layout, available memory bandwidth, and NUMA (Non-Uniform Memory Access) boundaries. These factors shape whether data reaches each GPU through local, low-latency paths or through more expensive system interconnects. In practice, a system may report high GPU utilization while still losing substantial throughput to remote memory access, host-side staging delays, and synchronization variance.
01
The CPU as the orchestration layer
GPUs perform the bulk of tensor computation, but the CPU still coordinates key stages of the training pipeline:
- data ingestion and preprocessing
- batch staging from storage to system memory
- host-to-device transfer preparation
- scheduling of data-loader workers
- orchestration of multi-process and collective communication behavior
In single-GPU workflows, CPU topology issues often remain partially hidden. In multi-GPU setups, they become easier to observe because each additional GPU increases pressure on host memory bandwidth, PCIe routing, storage locality, and process scheduling.
As a result, high GPU utilization alone is not a reliable measure of system efficiency. A system can appear fully loaded while still underdelivering on training throughput if the host side is forcing data to traverse remote or congested paths.
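One practical way to see whether the host side is the limiting factor is to time batch delivery separately from the training step itself. A minimal sketch, where `get_batch` and `train_step` are placeholders for your real loader iterator and step function:

```python
import time

def profile_host_vs_compute(get_batch, train_step, steps=50):
    """Separate time spent waiting for data from time spent computing.

    get_batch and train_step are placeholders for the real loader and
    training step. A high data_fraction suggests the host-side pipeline,
    not the GPU, is the limiting factor.
    """
    data_time = 0.0
    step_time = 0.0
    for _ in range(steps):
        t0 = time.perf_counter()
        batch = get_batch()      # host-side staging: loader + transfer prep
        t1 = time.perf_counter()
        train_step(batch)        # device-side work
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    total = data_time + step_time
    return {"data_fraction": data_time / total,
            "compute_fraction": step_time / total}
```

In real PyTorch code, `train_step` should call `torch.cuda.synchronize()` before the timestamp is taken, since kernel launches return asynchronously; otherwise the compute fraction is understated.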
02
NUMA boundaries: the cost of remote memory access
On high-core-count platforms such as Threadripper Pro, EPYC, and Xeon, system memory is not a single uniform pool. Instead, memory is divided across NUMA domains, where certain CPU cores have lower-latency access to specific memory controllers and DIMMs.
A simplified distinction looks like this:
- Local access: CPU core → local memory controller → local RAM
- Remote access: CPU core → system interconnect → remote memory controller → remote RAM
The latency difference between these paths is not always large in isolation, but under sustained training load it can become visible. If a GPU worker process is scheduled near one NUMA domain while its data-loader threads or memory allocations reside in another, each batch may cross the interconnect before it even reaches the GPU-facing side of the system.
System-level effect
This often shows up as:
- slower population of host-side staging buffers
- more variable batch delivery times
- inconsistent step timing
- weaker scaling as more GPUs are added
These penalties are easy to miss because they rarely look like outright failure. More often, they appear as “good but not great” scaling behavior, where adding GPUs increases utilization but not throughput as much as expected.
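On Linux, the kernel exposes each NUMA node's CPUs under `/sys/devices/system/node/node*/cpulist` in a compact range syntax, which makes the domain layout inspectable from code. A small sketch, assuming a Linux sysfs layout (the range parser itself is plain string handling and works anywhere):

```python
import glob
import re

def parse_cpulist(s):
    """Parse a kernel cpulist string such as '0-3,8,16-19' into CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def cpu_to_numa_node():
    """Map each CPU id to its NUMA node by reading sysfs (Linux only)."""
    mapping = {}
    for path in glob.glob("/sys/devices/system/node/node*/cpulist"):
        node = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            for cpu in parse_cpulist(f.read()):
                mapping[cpu] = node
    return mapping
```

Comparing this map against where loader processes actually run (for example via `os.sched_getaffinity`) shows whether batches are being prepared in the same domain that feeds the GPU.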
03
PCIe root complexes and I/O locality
PCIe bandwidth is usually discussed in terms of lane counts and generation speed, but slot locality matters as well. In many workstation and server designs, specific PCIe slots are mapped to specific CPU root complexes or I/O domains.
This means the placement of GPUs and NVMe drives can materially affect the path that data takes through the system.
A common mismatch
If the dataset resides on NVMe attached near one CPU root complex, while the target GPU is attached near another, the host-side transfer path becomes less direct. Instead of remaining largely local to one CPU/memory/I/O domain, the transfer may cross the broader system topology before reaching the GPU.
System-level effect
Poor I/O locality can lead to:
- lower effective host-to-device bandwidth
- higher latency in batch staging
- more variable transfer timing
- additional pressure on shared interconnect resources
This is one reason a platform can look well provisioned on paper and still underperform in practice. Total PCIe lane count matters, but device placement relative to CPU topology matters too.
Clarification · NVLink
High-speed GPU–GPU interconnects such as NVLink can ease some PCIe topology limits for peer traffic between GPUs, since direct transfers bypass part of the host routing path. They do not replace the CPU-to-GPU data path this article emphasizes: batches still travel from storage and system memory through the CPU, drivers, and PCIe into device memory for loading and staging. If the bottleneck is a NUMA crossing, loader behavior, or host memory bandwidth, NVLink does not address that part of the pipeline.
04
Data-loader placement and CPU affinity
A common attempt to improve throughput is to increase the data-loader worker count (num_workers in PyTorch, num_parallel_calls in TensorFlow's tf.data). That can help, but only when the additional loader workers are scheduled efficiently.
Operating systems generally schedule processes for overall balance, not for GPU-local data delivery. In NUMA systems, this means data-loader workers may be spread across cores and memory domains without regard to where the target GPU is attached.
Why this matters
If loader workers are physically distant from the memory controller or PCIe root that serves the target GPU, the system does more work to prepare each batch:
- preprocessing becomes less locality-aware
- pinned-memory buffers may still be populated remotely
- host-side contention increases
- inter-step idle gaps can become more frequent
Practical implication
Increasing worker count can improve throughput, but it can also reduce it if those workers amplify cross-node memory traffic. More parallelism on the host side is not always better when topology is working against locality.
Pinned memory helps improve transfer efficiency, but it does not by itself solve poor CPU affinity or remote memory allocation. Those remain topology problems.
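One topology-aware mitigation is to pin each loader worker to CPUs from the NUMA node nearest the target GPU. PyTorch's `DataLoader` accepts a `worker_init_fn` callback that runs inside each worker process, where `os.sched_setaffinity` can be applied. A sketch, assuming you have already discovered `node_cpus` (the CPUs local to the GPU's NUMA node) from topology inspection:

```python
import os

def assign_worker_cpus(worker_id, node_cpus, cpus_per_worker=2):
    """Carve a slice of the node-local CPU list for one worker.

    node_cpus: CPUs belonging to the NUMA node nearest the target GPU
    (discover them with numactl --hardware or lscpu). Slices wrap around
    if there are more workers than CPU slots.
    """
    start = (worker_id * cpus_per_worker) % len(node_cpus)
    return {node_cpus[(start + i) % len(node_cpus)]
            for i in range(cpus_per_worker)}

def make_worker_init_fn(node_cpus, cpus_per_worker=2):
    """Build a worker_init_fn for torch.utils.data.DataLoader (Linux only)."""
    def worker_init_fn(worker_id):
        os.sched_setaffinity(
            0, assign_worker_cpus(worker_id, node_cpus, cpus_per_worker))
    return worker_init_fn

# Usage with PyTorch (sketch; node_cpus values are illustrative):
# loader = DataLoader(dataset, num_workers=4,
#                     worker_init_fn=make_worker_init_fn([0, 1, 2, 3, 4, 5, 6, 7]))
```

This keeps preprocessing and buffer population on cores with local access to the memory that feeds the GPU, rather than leaving placement to the general-purpose scheduler.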
05
Synchronization and the slowest-participant effect
Distributed training is governed by the slowest participant at each synchronization boundary. Small per-step delays on a single GPU can propagate across the entire training group.
If one rank is slowed by remote memory access, poor CPU affinity, or less efficient host-side staging, the other GPUs may complete their work earlier and wait at communication barriers. This reduces effective parallel efficiency even though all devices remain active over the course of the run.
System-level effect
Topology-induced asymmetry can result in:
- increased waiting time at synchronization points
- reduced all-reduce efficiency
- lower aggregate throughput across the GPU group
- diminished scaling returns as devices are added
This is one of the key reasons locality becomes more important in multi-GPU systems than in single-device workflows. Minor host-side inefficiencies that are tolerable on one GPU can become system-wide bottlenecks once synchronization enters the critical path.
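The slowest-participant effect can be quantified directly from per-rank step times: with a synchronization barrier each step, the group advances at the pace of the slowest rank, so parallel efficiency is roughly the mean step time divided by the maximum. A minimal sketch:

```python
def barrier_efficiency(step_times):
    """Estimate parallel efficiency under per-step synchronization.

    step_times: one measured step duration per rank. With a barrier,
    every rank effectively takes max(step_times); in a perfectly
    balanced group each would take mean(step_times), so
    efficiency = mean / max.
    """
    slowest = max(step_times)
    mean = sum(step_times) / len(step_times)
    return mean / slowest

# Example: one straggler at 1.5x the others drags a 4-GPU group to 75%.
# barrier_efficiency([1.0, 1.0, 1.0, 1.5]) -> 0.75
```

The arithmetic makes the asymmetry point concrete: a single rank slowed 50% by remote memory access costs the whole group a quarter of its throughput, even though every device stays busy or waiting the entire time.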
06
The missing link in multi-GPU scaling: VRAM bandwidth vs GPU-to-GPU paths
Spec sheets emphasize device memory bandwidth—how fast each GPU can read and write its own VRAM (HBM or GDDR), often quoted in TB/s. That number describes on-device tensor traffic: moving activations and weights that already live in GPU memory. It is not the same quantity as GPU-to-GPU throughput—the rate at which tensors move between GPUs during collectives, gradient sync, or peer copies over NVLink, PCIe peer-to-peer, or paths that involve the host.
A system can have excellent per-GPU VRAM bandwidth and still scale poorly across devices when the interconnect is the limiting factor. Conversely, a fast link between GPUs does not fix host-side staging, NUMA crossings, or slow storage—the bottlenecks earlier sections of this article describe.
Practical distinction
When reasoning about multi-GPU performance, keep three layers separate:
- Device VRAM bandwidth: on-package memory to SMs (per-GPU tensor work on resident data)
- GPU-to-GPU bandwidth: links and topology between devices (collectives, P2P, NVLink where present)
- Host memory bandwidth: DDR channels and NUMA—how the CPU feeds the pipeline (next section)
Training runs can shift which layer dominates. If you only optimize for the first, you may miss the second—often the missing link when scaling from one GPU to several.
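The distinction can be made concrete with back-of-envelope arithmetic: a ring all-reduce moves roughly 2 × (n−1)/n bytes over the interconnect per byte of gradient per GPU, and VRAM bandwidth never enters that formula. A sketch, where the gradient size and link rates in the comments are illustrative assumptions, not measurements:

```python
def ring_allreduce_time(grad_bytes, n_gpus, link_bw_bytes_per_s):
    """Approximate per-step ring all-reduce time per GPU.

    Each GPU sends and receives about 2 * (n-1)/n * grad_bytes over its
    interconnect, regardless of how fast its own VRAM is.
    """
    volume = 2.0 * (n_gpus - 1) / n_gpus * grad_bytes
    return volume / link_bw_bytes_per_s

# Illustrative: ~14 GB of fp16 gradients (a 7B-parameter model), 4 GPUs.
# At 25 GB/s effective PCIe:    ring_allreduce_time(14e9, 4, 25e9)  -> ~0.84 s
# At 250 GB/s NVLink-class:     ring_allreduce_time(14e9, 4, 250e9) -> ~0.084 s
```

A tenfold difference in link bandwidth shifts the communication term by an order of magnitude per step, while the per-GPU VRAM TB/s figure on the spec sheet is unchanged in both cases.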
07
Memory bandwidth and channel architecture
Core count is often overemphasized in workstation planning. For many multi-GPU training workloads, host memory bandwidth and channel architecture are more consequential than raw CPU compute capacity.
Loader workers, pinned-memory staging, decompression, and I/O coordination all generate sustained traffic through the CPU memory subsystem. Platforms with more memory channels are generally better able to feed dense GPU configurations without starving the host side of bandwidth.
System-level effect
When memory bandwidth becomes constrained:
- loader throughput can flatten
- staging buffers refill less consistently
- GPUs spend more time waiting for prepared batches
- scaling degrades as more devices are added
This is why CPU selection in AI systems should be evaluated partly as a memory and I/O design decision, not only as a compute choice. A processor with many cores but limited memory bandwidth can support light workloads adequately while still underfeeding a dense multi-GPU configuration.
Clarification · GPUDirect Storage
GPUDirect Storage (GDS) and related approaches can, on supported stacks, route NVMe-to-GPU data with far less CPU involvement than traditional copy paths—reducing CPU-mediated staging. Many workstation setups still depend on CPU-side preprocessing, loader workers, filesystems, or training flows that are not fully GDS-optimized. Unless you have validated an end-to-end GDS pipeline on your exact hardware and software, the CPU and NUMA locality guidance above remains the most practical baseline for host-side bottlenecks.
08
Practical inspection and topology-aware execution
NUMA and locality problems are difficult to reason about if the topology is treated as opaque. In practice, the first step is to make the layout visible.
Useful tools include:
Shell — topology inspection
nvidia-smi topo -m
numactl --hardware
lscpu
These tools help answer questions such as:
- which GPUs share closer paths to one another
- how NUMA nodes are exposed to the OS
- how CPUs, memory domains, and devices are arranged
For execution, NUMA-aware binding can reduce unnecessary cross-node traffic. For example:
Command
numactl --cpunodebind=0 --membind=0 python train.py
This does not guarantee optimal placement for every workload, but it provides a controlled starting point by keeping execution and allocation local to one NUMA node.
What to watch for
Common warning signs of topology-related inefficiency include:
- uneven step times across GPUs
- poor scaling when moving from one GPU to several
- high reported utilization with disappointing throughput
- one CPU domain much busier than another
- transfer behavior that becomes more erratic under sustained load
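Several of these warning signs are measurable rather than impressionistic. A small sketch that flags erratic step timing via the coefficient of variation (the 0.10 threshold is an illustrative starting point, not a standard):

```python
import statistics

def step_time_warning(step_times, cv_threshold=0.10):
    """Flag erratic step timing under steady load.

    Returns (coefficient_of_variation, warning). High variation with a
    fixed batch size and model is consistent with NUMA crossings, loader
    contention, or congested host paths; the threshold is illustrative.
    """
    mean = statistics.fmean(step_times)
    cv = statistics.pstdev(step_times) / mean
    return cv, cv > cv_threshold
```

Logging this per GPU over a run also surfaces the "one CPU domain much busier than another" case: the ranks served by the congested domain show a visibly higher coefficient of variation than their peers.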
09
Common pitfalls in NUMA-constrained training setups
Several patterns recur in underperforming multi-GPU systems:
- treating system RAM as if it were uniformly local to all workers
- assuming PCIe slot population alone guarantees efficient device locality
- increasing loader worker counts without checking affinity or memory placement
- selecting CPUs primarily by core count while ignoring memory channels
- equating high per-GPU VRAM bandwidth (TB/s) with strong multi-GPU scaling without checking GPU-to-GPU paths
- using GPU utilization as the main measure of training efficiency
These mistakes do not always produce obvious errors. More often, they show up as lower-than-expected throughput, unstable scaling, or synchronization overhead that is difficult to explain from GPU metrics alone.
Summary: the topology takeaway
In multi-GPU AI training, CPU topology defines the quality of the data path.
NUMA locality, PCIe root-complex placement, memory bandwidth, and process affinity all influence whether GPUs are fed through local, consistent paths or through more expensive system routes. As systems scale, locality becomes a throughput constraint in its own right. For larger-model training, it is useful to treat the workstation not as a collection of independent components, but as a coordinated memory and I/O system whose efficiency depends on proximity, placement, and balance.
Closing note
Host-side topology is not a tuning detail—it is part of the multi-GPU training envelope. When scaling locally, measure the path data takes, not only the GPU meter.