Optimizing Data Transfer in AI/ML Workloads

In a typical AI/ML training workload, a deep learning model is executed on a dedicated GPU accelerator using input data batches it receives from a CPU host. Ideally, the GPU — the more expensive resource — should be maximally utilized, with minimal periods of idle time. In particular, this means that every time it completes its execution on a batch, the subsequent batch should be “ripe and ready” for processing. When this does not happen, the GPU idles while waiting for input data — a common performance bottleneck often referred to as GPU starvation.

In previous posts (e.g., see A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline), we discussed common causes of this issue, including inefficient storage retrieval, CPU resource exhaustion, and host-to-device transfer bottlenecks. In this post, we zoom in on data-transfer bottlenecks and revisit their identification and resolution — this time with the help of NVIDIA Nsight™ Systems (nsys), a performance profiler designed for analyzing the system-wide activity of workloads running on NVIDIA GPUs.

NVIDIA Nsight vs. PyTorch Profiler

Readers familiar with our work may be surprised by the mention of the NVIDIA Nsight profiler rather than PyTorch Profiler. In our previous posts, we have advocated strongly for the use of PyTorch Profiler in AI/ML model development as a tool for identifying and optimizing runtime performance. Time and again, we have demonstrated its application to a wide variety of performance issues. It does not require any special installation and can be run without elevated OS permissions. The NVIDIA Nsight profiler, on the other hand, requires a dedicated system setup (or a dedicated NVIDIA container) and — for some of its features — elevated permissions, making it less accessible and more complicated to use than PyTorch Profiler.

The two profilers differ in their focus: PyTorch Profiler is a framework profiler, tightly coupled with PyTorch and heavily focused on how models use the PyTorch software stack and supporting libraries. The NVIDIA Nsight profiler is a system-level profiler; it does not know the details of the model being run or which framework is being used, but rather how the components of the entire system are being utilized. While PyTorch Profiler excels at tracing the low-level operations of a PyTorch model execution, nsys provides a detailed view of the activities of the entire system (GPU hardware, CUDA streams, OS interrupts, network, PCIe, etc.). For many performance issues, PyTorch Profiler is sufficient for identifying and resolving the source of the bottleneck, but some situations call for the “big guns” of the nsys profiler to derive deeper insights into the inner workings of the underlying system.

In this post, we intend to demonstrate some of the unique capabilities of the nsys profiler and their application to the common data-transfer bottleneck.

Outline

To facilitate our discussion, we will define a toy ML workload with a data-transfer performance bottleneck and introduce a number of successive optimizations in an attempt to solve it. Throughout the process, we will use the nsys profiler to analyze system performance and assess the impact of our code modifications.

Setup

We will run our experiments on an Amazon EC2 g6e.2xlarge instance with an NVIDIA L40S GPU running an AWS Deep Learning (Ubuntu 24.04) AMI with PyTorch (2.8). To install the nsys CLI profiler (version 2025.6.1), we follow the official NVIDIA guidelines:

wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb
sudo apt install ./NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb

The NVIDIA Tools Extension (NVTX) library allows us to annotate our code with human-readable labels to increase the readability and comprehension of the performance trace. While PyTorch offers built-in NVTX support via its torch.cuda.nvtx APIs, we will use the standalone nvtx package (version 0.2.14), which supports color-coding the trace timeline for better visual analysis:

pip install nvtx
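
For example, wrapping any region of code in a labeled, color-coded range takes a single line (the range name, color, and function below are arbitrary placeholders):

import nvtx

with nvtx.annotate("my region", color="purple"):
    run_region_of_interest()  # placeholder for any code to be traced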

Disclaimers

The code we will share is intended for demonstrative purposes; please do not rely on its correctness or optimality. Please do not interpret our use of any library, tool, or platform as an endorsement of its use. The impact of the optimizations we will cover can vary greatly based on the details of the model and the runtime environment. Please be sure to assess their effect on your own use case before adopting them.

Many thanks to Yitzhak Levi and Gilad Wasserman for their contributions to this post.

A Toy PyTorch Model

We introduce a training script intentionally designed to contain a bottleneck in the data-input pipeline.

In the code block below we define a simple image classification model with a ResNet-18 backbone.

import time, torch, torchvision

DEVICE = "cuda"
model = torchvision.models.resnet18().to(DEVICE).train()
optimizer = torch.optim.Adam(model.parameters())

Next, we define a synthetic dataset which we will use to train our toy model.

from torch.utils.data import Dataset, DataLoader

WARMUP_STEPS = 10
PROFILE_STEPS = 3
COOLDOWN_STEPS = 1
TOTAL_STEPS = WARMUP_STEPS + PROFILE_STEPS + COOLDOWN_STEPS
BATCH_SIZE = 64
TOTAL_SAMPLES = TOTAL_STEPS * BATCH_SIZE
IMG_SIZE = 512

# A synthetic Dataset with random images and labels
class FakeDataset(Dataset):

    def __len__(self):
        return TOTAL_SAMPLES

    def __getitem__(self, index):
        img = torch.randn((3, IMG_SIZE, IMG_SIZE))
        label = torch.tensor(index % 10)
        return img, label

train_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE
)

Lastly, we define a standard training loop programmed to run the nsys profiler for three steps using the torch.cuda.profiler.start and stop commands — intended for use in conjunction with the nsys CLI. We highlight the components of the training step using the nvtx.annotate utility. Please refer to the official documentation for more details on profiling with nsys in PyTorch.

import nvtx
from torch.cuda import profiler

def copy_data(batch):
    data, targets = batch
    data_gpu = data.to(DEVICE)
    targets_gpu = targets.to(DEVICE)
    return data_gpu, targets_gpu


def compute_step(model, batch, optimizer):
    data, targets = batch
    output = model(data)
    loss = torch.nn.functional.cross_entropy(output, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss


data_iter = iter(train_loader)

for i in range(TOTAL_STEPS):

    if i == WARMUP_STEPS:
        # start nsys profiler
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        # stop nsys profiler
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()

    with nvtx.annotate(f"Batch {i}", color="blue"):
        with nvtx.annotate("get batch", color="red"):
            batch = next(data_iter)
        with nvtx.annotate("copy batch", color="yellow"):
            batch = copy_data(batch)
        with nvtx.annotate("Compute", color="green"):
            compute_step(model, batch, optimizer)

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")

We run our script using the cudaProfilerApi option to start and stop the profiler programmatically. Please see the official documentation for full details on profiling from the nsys CLI.

nsys profile \
  --capture-range=cudaProfilerApi \
  --trace=cuda,nvtx,osrt \
  --output=baseline \
  python train.py

This results in a baseline.nsys-rep trace file that we copy over to our development machine for analysis.
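
As an aside, when a GUI is not readily available, the nsys CLI can also print summary statistics from a report (CUDA API, kernel, and memory-transfer time summaries, among others) directly in the terminal, which can serve as a convenient first-pass sanity check:

nsys stats baseline.nsys-rep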

To draw a comparison with PyTorch Profiler, we define an alternative training loop, instrumented with PyTorch Profiler and annotated with the torch.profiler.record_function utility:

from torch.profiler import (
    profile, record_function, schedule, tensorboard_trace_handler
)

# re-initialize the data iterator for this run
data_iter = iter(train_loader)

with profile(
    schedule=schedule(wait=0, warmup=WARMUP_STEPS, 
                      active=PROFILE_STEPS, repeat=1),
    on_trace_ready=tensorboard_trace_handler('./baseline'),
    record_shapes=True,
    with_stack=True
) as prof:
    for i in range(TOTAL_STEPS):
        with record_function("get batch"):
            batch = next(data_iter)
        with record_function("copy batch"):
            batch = copy_data(batch)
        with record_function("compute"):
            compute_step(model, batch, optimizer)
        prof.step()

The throughput of our baseline experiment is 2.97 steps per second. In the next sections, we will use the profiler traces to identify performance bottlenecks in our training step and try to improve on this result.

Baseline Performance Analysis

To analyze the resultant nsys trace file, we open it in the Nsight Systems GUI application. In the image below we zoom in on the timeline of two of the training steps captured by the profiler:

Baseline Nsight Systems Profiler Trace (by Author)

The trace contains a wealth of information, just a subset of which we will touch on in this post. Please see the nsys documentation for additional functionalities and features.

The timeline is divided into two parts: the CUDA section, which reports GPU activity, and the threads section, which reports CPU activity. The top bars in each section report the utilization of the corresponding resource. The CUDA section makes a clear distinction between GPU kernel (compute) activity (90.9%) and memory activity (9.1%), and both sections include an NVTX row with the colored annotations we added to our training step. We note the following observations:

  1. The GPU is idle for roughly 50% of each training step. This can be seen by the portion of time taken by each batch (in blue) in the GPU NVTX bar and the large blocks of whitespace in between them.
  2. The GPU activity for each batch starts immediately after the “get batch” activity has completed on the CPU. It starts with the host-to-device memory copy, marked in light green, and continues with the kernel computations, marked in light blue.
  3. Once the CPU has launched the GPU memory and compute commands for batch N, it proceeds to the next batch in the training loop — leading to a partial overlap of batch N+1 on the CPU with batch N on the GPU.
  4. The vast majority of the CPU thread is spent on the “get batch” activity. This constitutes the primary bottleneck in our baseline experiment.

The profiling trace points to a clear culprit — the dataloader. By default, PyTorch performs single-process data loading: a single CPU process is used to load the next input batch, copy it to the GPU, and launch the compute kernels, all in a sequential manner. This typically results in severe under-utilization of CPU resources by: 1) limiting data loading to just a single process, and 2) making the loading of the next batch contingent on the completion of the CPU processing (i.e., kernel launching) of the previous batch. This irresponsible use of our CPU resources has left our GPU starved for input data.

The same conclusion could have been reached using the PyTorch Profiler trace shown below:

Baseline PyTorch Profiler Trace (by Author)

Here too, we can see long periods of GPU underutilization that are caused by the long “get batch” blocks on the CPU side.

Optimization 1: Multi-Process Data Loading

The first step is to modify the data input pipeline to use multi-process data loading. We set the number of workers to match the 8 vCPUs available on our Amazon EC2 g6e.2xlarge instance. In a real-world scenario, this value should be tuned for optimal throughput:

NUM_WORKERS = 8

train_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS
)

Following this change, our throughput jumps to 4.81 steps per second — a 62% improvement over our baseline result. The corresponding nsys profiler trace is shown below:

Multiproc Dataloading Nsight Systems Profiler Timeline (by Author)

Note that the red “get batch” segment has become just a tiny sliver of each step in the NVTX bar. In its place, the yellow “copy batch” block now takes center stage. As a result of our use of multi-process data loading, there is now always a new batch ready for processing — but can we do better?

Taking a closer look at the GPU section, we see that there is still a significant portion (~290 milliseconds) of idle time between the memory operation and the kernel compute. This idle time is perfectly aligned with a “munmap” operation in the OS runtime bar. The “munmap” block is a CPU-side memory cleanup operation performed just after the CUDA memory copy is complete. It occurs at the tail end of the long yellow “copy batch” operation. The compute kernels are launched onto the GPU only after the memory cleanup has completed. This is a clear pattern of a synchronous host-to-device memory copy: the CPU cannot proceed with kernel launching until the data copy operation has fully completed, and the GPU stays idle until the CPU launches the kernels.

The PyTorch Profiler trace shows the same GPU idle time, but it does not provide the same “munmap” hint. This is our first example of the advantage of the system-wide visibility of the nsys profiler.

Multiproc Dataloading PyTorch Profiler Trace (by Author)

With our finding of the data-copy performance bottleneck in hand, we proceed to our next optimization.

Optimization 2: Asynchronous Data Transfer

The solution to the bottleneck we have found is to program our training step to load data asynchronously. This enables the CPU to launch the compute kernels immediately after issuing the memory copy command — without waiting for the copy to complete. This way, the GPU can begin processing the kernels as soon as the CUDA memory copy is done. Enabling asynchronous data copy requires two changes: first, we must program the dataloader to use pinned memory (instead of pageable memory), and second, we must pass the non_blocking=True argument to the to() operations:

NUM_WORKERS = 8
ASYNC_DATATRANSFER = True


train_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    pin_memory=ASYNC_DATATRANSFER
)

def copy_data(batch):
    data, targets = batch
    data_gpu = data.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
    targets_gpu = targets.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
    return data_gpu, targets_gpu
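
To observe the effect of pinned memory in isolation, consider the following minimal micro-benchmark sketch (our own illustration, not part of the training script; the tensor shape matches our batch, and absolute timings will vary by system):

import time, torch

batch = torch.randn(64, 3, 512, 512)       # pageable host memory
pinned_batch = batch.clone().pin_memory()  # page-locked host memory

torch.cuda.synchronize()

# copy from pageable memory: the host blocks until the data is staged,
# even when non_blocking=True is passed
start = time.perf_counter()
_ = batch.to("cuda", non_blocking=True)
pageable_ms = (time.perf_counter() - start) * 1e3

# copy from pinned memory: the call returns as soon as the transfer is
# enqueued, freeing the CPU to move on to kernel launching
start = time.perf_counter()
_ = pinned_batch.to("cuda", non_blocking=True)
pinned_ms = (time.perf_counter() - start) * 1e3

torch.cuda.synchronize()
print(f"pageable: {pageable_ms:.2f} ms, pinned (launch only): {pinned_ms:.2f} ms")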

Using asynchronous data loading results in a throughput of 5.91 steps per second — an additional 23% improvement, and a 99% improvement overall. The resultant profiling trace is shown below:

Async Dataloading Nsight Systems Profiler Timeline (by Author)

We now see all of the CPU operations bunched together at the beginning of the trace. We have removed all performance obstacles on the CPU side, allowing it to freely load the data and kernels to the GPU. In the GPU section, we see continuous activity without any idle time. We do, however, see a clear separation between CUDA memory activities (in light green) and CUDA kernel activities (in light blue). PyTorch Profiler, in contrast, does not make this distinction clear. This is another advantage of the hardware-centric profiler and, in the case of our toy experiment, is what informs the next steps of our optimization.

Async Dataloading PyTorch Profiler Trace (by Author)

Optimization 3: Pipelining With CUDA Streams

Our final optimizations derive from the fact that modern GPUs, such as the NVIDIA L40S, use independent engines for copying memory (the DMA engines) and executing compute kernels (the SMs). We can take advantage of this by parallelizing the distinct memory and kernel activities we saw in the nsys profiler trace. We will program this through the use of CUDA streams.

In a previous post, we expanded on the opportunity for optimizing AI/ML workloads using CUDA Streams. Here, we apply a similar pipelining strategy: We define two distinct “copy” and “compute” CUDA streams and program the “copy” stream to copy batch N+1 at the same time that the “compute” stream is processing batch N:

# define two CUDA streams
compute_stream = torch.cuda.Stream()
copy_stream = torch.cuda.Stream()


# extract first batch
next_batch = next(data_iter)
with torch.cuda.stream(copy_stream):
    next_batch = copy_data(next_batch)

for i in range(TOTAL_STEPS):

    if i == WARMUP_STEPS:
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()

    with nvtx.annotate(f"Batch {i}", color="blue"):
        # wait for copy stream to complete copy of batch N
        compute_stream.wait_stream(copy_stream)
        batch = next_batch

        # prefetch batch N+1 on the copy stream
        try:
            with nvtx.annotate("get batch", color="red"):
                next_batch = next(data_iter)
            with torch.cuda.stream(copy_stream):
                with nvtx.annotate("copy batch", color="yellow"):
                    next_batch = copy_data(next_batch)
        except StopIteration:
            # reached end of dataset
            next_batch = None

        # execute model on batch N on the compute stream
        with torch.cuda.stream(compute_stream):
            with nvtx.annotate("Compute", color="green"):
                compute_step(model, batch, optimizer)

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")

This optimization results in a throughput of 6.44 steps per second — a 9% improvement over our previous experiment. We note that the impact of this optimization is capped by the duration of the longer of the two operation types. In our previous profile trace, the memory block took 15.5 milliseconds and the kernel block took 155 milliseconds. In the current profile trace, the entire GPU step takes 155 milliseconds, which means that the memory copy time is completely hidden by the kernel compute time and that our optimization reaches the maximum possible result.
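
As a quick sanity check of this bound: a fully sequential step would take roughly 15.5 + 155 = 170.5 milliseconds, while the pipelined step takes max(15.5, 155) = 155 milliseconds, so the best achievable speedup is 170.5 / 155 ≈ 1.10, in line with the 9% improvement we measured.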

The use of CUDA streams and its impact on GPU utilization can be seen in the traces of both profilers:

Pipelined Nsight Systems Profiler Timeline (by Author)
Pipelined PyTorch Profiler Trace (by Author)

Optimization 4: Prefetching to CUDA

For our final step, we move the data copying from the body of the training loop into the data-loading flow: rather than explicitly calling the copy function inside the training loop, we assume that the batches returned by the data iterator are already placed on the GPU.

In the code block below, we wrap our dataloader with a CUDA-prefetching iterator class. Note that this is a simplified implementation intended for demonstration purposes. More work may be required for more complex scenarios (e.g., DDP training). Alternatively, you may consider a third-party implementation such as torchtnt.utils.data.data_prefetcher.CudaDataPrefetcher:

class DataPrefetcher:
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_batch = None
        self.preload()

    def preload(self):
        try:
            data, targets = next(self.loader)

            with torch.cuda.stream(self.stream):
                with nvtx.annotate("copy batch", color="yellow"):
                    next_data = data.to(DEVICE, non_blocking=True)
                    next_targets = targets.to(DEVICE, non_blocking=True)
            self.next_batch = (next_data, next_targets)        
        except StopIteration:
            # reached end of dataset
            self.next_batch = (None, None)

    def __iter__(self):
        return self

    def __next__(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        data, targets = self.next_batch
        self.preload()
        return data, targets


data_iter = DataPrefetcher(train_loader)

for i in range(TOTAL_STEPS):
    if i == WARMUP_STEPS:
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()

    with nvtx.annotate(f"Batch {i}", color="blue"):
        with nvtx.annotate("get batch", color="red"):
            batch = next(data_iter)
        with nvtx.annotate("Compute", color="green"):
            loss = compute_step(model, batch, optimizer)

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")

This optimization results in a throughput of 6.44 steps per second — the same as our previous experiment. This should not surprise us, since we have already seen that the throughput is bound by the 155-millisecond GPU compute, and our optimization has done nothing to reduce the kernel compute time.

More generally, despite the removal of the copy call from the main loop, you may have a hard time finding a situation where this change has a meaningful impact on performance, since the copy was already being performed asynchronously. However, given the minimal changes to the training loop, you may find this solution cleaner and/or more applicable for use with high-level libraries that do not enable fine-grained control of the training loop.

Unsurprisingly, the profile traces for this experiment appear nearly identical to the previous ones. The main difference is the placement of the yellow “copy batch” block in the NVTX row of the CPU section.

Data Prefetching Nsight Systems Profiler Timeline (by Author)
Data Prefetching PyTorch Profiler Trace (by Author)

Results

The table below summarizes the results of our experiments:

Experiment                      Throughput (steps/sec)   Speedup vs. baseline
Baseline                        2.97                     1.00x
Multi-process data loading      4.81                     1.62x
Asynchronous data transfer      5.91                     1.99x
CUDA-stream pipelining          6.44                     2.17x
CUDA prefetching                6.44                     2.17x

Experiment Results (by Author)

The optimizations, which were driven by the use of the Nsight Systems profiler, resulted in an overall 2.17x improvement in runtime performance.

Summary

GPU starvation is a common performance bottleneck that can have a devastating impact on the efficiency and costs of AI/ML workloads. In this post, we demonstrated how to use the Nsight Systems profiler to study the causes of performance bottlenecks and take informed steps toward their resolution. Along the way, we emphasized the unique capabilities of Nsight Systems when compared to the built-in framework-centric PyTorch Profiler — specifically, its deep system-level visibility.

Our focus in this post has been on the host-to-device data copy that typically occurs at the beginning of the training step. However, data-transfer bottlenecks can appear at different stages of training. In a sequel to this post, we intend to repeat our nsys profiling analysis on data copies going in the opposite direction — from the device to the host. Stay tuned!
