
Understanding Application Performance with Roofline Modeling


The challenge with calculating an application’s performance is that real-world performance and theoretical performance can differ. With a growing ecosystem of products that have high performance needs, such as High Performance Computing (HPC), gaming, and, most recently, Large Language Models (LLMs), it is essential to measure an application’s performance accurately.

Simply quoting a theoretical peak in GFLOP/s (billions of Floating-Point Operations Per Second) is not enough, as applications rarely reach these maximums in the real world. This is where the Roofline Model comes in, offering a clear visual method to estimate an application’s performance and highlighting the critical role of hardware-specific optimizations.

Why simple metrics aren’t enough

When we think about measuring performance, there are a few metrics that come to mind:

  • Execution time: This tells you how long a task took but offers no insight into why.
  • Cycles per Instruction (CPI): This only measures the processor’s compute efficiency.
  • Serial vs parallel execution: Compares compute performance but overlooks hardware-specific optimizations.
  • Floating Point Operations Per Second (FLOP/s): This only represents a theoretical maximum which is often not achievable in a real-world scenario.

While these are useful metrics, they generally do not provide enough information on their own. For instance, peak FLOP/s is a theoretical limit that is rarely achieved in practice, so using it as the only metric ignores a common performance limiter: data movement.

Roofline Modeling

The Roofline Model is a powerful tool that visually maps an application’s performance against the capabilities of a specific hardware architecture, such as a CPU or GPU [1]. The model gets its name from the shape of the graph it produces, which features a “roof” composed of a slanted line and a flat, horizontal line. This shape represents the ultimate performance limits imposed by the hardware.

From this modeling technique, there are two parameters which define the achievable limits with hardware:

  • Data movement: The time it takes to move data, calculated as the total data size divided by the system’s peak memory bandwidth.
  • Computation: The time required for calculations, determined by dividing the total number of floating-point operations by the system’s peak compute performance (commonly measured in GFLOP/s).

The total execution time of an application is determined by the greater of these two values: max {data_movement, computation}.

Even when the hardware offers high compute performance, data movement can become the bottleneck. To capture this, Roofline Modeling introduces the concept of Arithmetic Intensity (AI): the ratio of floating-point operations performed to bytes of data moved from memory (a short numerical sketch follows the list below).

  • An algorithm with high Arithmetic Intensity is considered compute-hungry. Its performance is limited by how quickly calculations can be performed.
  • An algorithm with low Arithmetic Intensity is considered data-hungry. Its performance is limited by how quickly data can be moved.
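
As a rough numerical sketch of how the roof is derived, the attainable performance for a given Arithmetic Intensity is simply the smaller of the compute ceiling and the bandwidth ceiling. The peak numbers below are illustrative placeholders, not the specifications of any particular GPU:

# Minimal sketch of the roofline limit: attainable FLOP/s is capped either by
# peak compute or by peak memory bandwidth scaled by Arithmetic Intensity.
# The hardware numbers are illustrative placeholders, not real specifications.
PEAK_GFLOPS = 10_000.0   # peak compute, GFLOP/s (hypothetical)
PEAK_BW_GBS = 1_000.0    # peak DRAM bandwidth, GB/s (hypothetical)

def attainable_gflops(ai_flops_per_byte: float) -> float:
    """Roofline: min(peak compute, AI * peak bandwidth)."""
    return min(PEAK_GFLOPS, ai_flops_per_byte * PEAK_BW_GBS)

for ai in (0.5, 2.0, 10.0, 50.0):
    bound = "memory-bound" if ai * PEAK_BW_GBS < PEAK_GFLOPS else "compute-bound"
    print(f"AI = {ai:5.1f} FLOP/byte -> {attainable_gflops(ai):8.1f} GFLOP/s ({bound})")

The ridge point, where the slanted and flat parts of the roof meet, sits at AI = peak compute / peak bandwidth; kernels to its left are bandwidth-limited, and kernels to its right are compute-limited.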

Understanding the graph

Figure: Example of a naive Roofline model (image: Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Example_of_a_naive_Roofline_model.svg, Creative Commons Attribution-Share Alike 4.0 International).

A Roofline graph plots the attainable FLOP/s (y-axis) against the Arithmetic Intensity (x-axis). The “roof” itself shows the hardware’s limitations: the slanted part represents the peak memory bandwidth (in GB/s), while the flat part represents the peak computational performance (in GFLOP/s). Note that both axes use a logarithmic scale.

  • Points below the roof: Indicate suboptimal performance, meaning there is room for improvement.
  • Points on the slanted line: A data-hungry (memory-bound) application; its performance is limited by memory bandwidth.
  • Points on the flat line: A compute-hungry (compute-bound) application; it is using the full computational power of the processor.

Why is Roofline Modeling important?

Roofline Modeling provides a visual, intuitive way to understand application performance, showing key characteristics such as Arithmetic Intensity, hardware capabilities, and attainable FLOP/s. This helps the programmer make targeted optimizations to the application for the hardware it runs on.

  • Bottleneck analysis: Having a visual aid makes it easy for the developer to figure out whether the bottleneck is memory or compute. If the application is memory-intensive, a developer can focus on improving data locality with techniques like caching or loop tiling. If it’s compute-intensive, the focus can shift to enabling more parallel computation or leveraging compiler optimizations.
  • Hardware and software design: Software engineers should not fear the underlying hardware. Instead, they can use insights from Roofline Modeling to embrace the hardware design and optimize for the specific architecture they are using.

Roofline Modeling in Action

To perform Roofline Modeling, we need to profile the application to understand its performance. Profiling gives us metrics such as floating-point operation counts (FLOPs) and memory traffic, both of which are required for Roofline Modeling. This article explores two such tools: Nvidia’s ncu, the Nsight Compute CLI, for GPU analysis, and PyTorch’s profiler for applications written in PyTorch.

For detailed CUDA kernel optimization and precise FLOP/byte calculations, ncu provides direct access to GPU hardware counters. In contrast, torch.profiler.profile offers a higher-level view within PyTorch, helping developers understand operator-level performance, tensor memory usage, and overall application behavior across both CPU and GPU activity.

Profiling with ncu

ncu is the command-line interface for profiling CUDA kernels [2]. It can display results directly in the terminal or save them to a log file for later analysis. To build a Roofline Model, we need to capture the specific metrics that allow us to calculate Arithmetic Intensity.

We’ll use the PyTorch ImageNet repository [3] as our example. It’s a good choice because it’s easy to understand, well-documented by PyTorch, and works with their profiler, so we can really dig into the performance.

Step 1: Run the ncu command to collect metrics

The first step is to run the application through ncu to collect the necessary hardware-level data. The command looks like this:

ncu --log-file <log_file> \
    --metrics <metrics> \
    --target-processes all \
    python3 <application>

  • log-file: The log file in which we want to store the results.
  • metrics: This is the most important parameter and depicts the metrics that we want to capture. For calculating Arithmetic Intensity, we consider:
    • dram__sectors_write.sum : sum of DRAM sectors written
    • dram__sectors_read.sum : sum of DRAM sectors read
    • smsp__sass_thread_inst_executed_op_fadd_pred_on.sum : sum of floating-point additions
    • smsp__sass_thread_inst_executed_op_fmul_pred_on.sum : sum of floating-point multiplications
    • smsp__sass_thread_inst_executed_op_ffma_pred_on.sum : sum of floating-point fused multiply add operations
  • target-processes all: This flag ensures that we profile the entire application, including any child processes.

Our ncu command changes to:

ncu --log-file logs_example \
    --metrics dram__sectors_write.sum,\
dram__sectors_read.sum,\
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,\
smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,\
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
    --target-processes all \
    python3 main.py /imagenet --arch resnet50 --epochs 1 --batch-size 10 \
    --print-freq 10 --seed 42

Step 2: Calculating FLOPs from the metrics

Once the profiler has run, we can aggregate the collected metrics to calculate the total floating-point operations. The formula is:

\[FLOPs = 2 * FMA\_count + FADD\_count + FMUL\_count\]

  • FLOPs: Count of Floating Point Operations.
  • FMA_count: Fused Multiply-Add (FMA) operations typically count as 2 FLOPs (one multiplication and one addition). This is represented by the smsp__sass_thread_inst_executed_op_ffma_pred_on.sum metric.
  • FADD_count: This is represented by the smsp__sass_thread_inst_executed_op_fadd_pred_on.sum metric.
  • FMUL_count: This is represented by the smsp__sass_thread_inst_executed_op_fmul_pred_on.sum metric.
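
As a minimal sketch, assuming the three counter sums have already been read out of the ncu log file (the counts below are made-up placeholders), the aggregation takes only a few lines of Python:

# Hypothetical counter values copied from the ncu log file.
fadd_count = 1_200_000_000  # smsp__sass_thread_inst_executed_op_fadd_pred_on.sum
fmul_count = 1_500_000_000  # smsp__sass_thread_inst_executed_op_fmul_pred_on.sum
ffma_count = 4_800_000_000  # smsp__sass_thread_inst_executed_op_ffma_pred_on.sum

# Each FMA counts as two floating-point operations (one multiply and one add).
total_flops = 2 * ffma_count + fadd_count + fmul_count
print(f"Total FLOPs: {total_flops:,}")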

Step 3: Calculate the bytes transferred

Next, we calculate the total data transferred to and from DRAM. The ncu metrics provide the number of DRAM sectors read and written. Assuming a common sector size of 32 bytes for modern GPUs:

\[Total\_DRAM\_bytes = (dram\_\_sectors\_read.sum + dram\_\_sectors\_write.sum) * 32\]

Step 4: Calculate the Arithmetic Intensity

With FLOPs and total bytes, we can now calculate the Arithmetic Intensity:

\[AI = FLOPs / Total\_DRAM\_Bytes\]
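
Continuing the same sketch with made-up sector counts, the bytes moved and the Arithmetic Intensity follow directly (the 32-byte sector size is an assumption; check the value for your GPU):

# Hypothetical DRAM sector counts copied from the ncu log file.
sectors_read = 900_000_000     # dram__sectors_read.sum
sectors_written = 300_000_000  # dram__sectors_write.sum
BYTES_PER_SECTOR = 32          # assumed sector size for modern GPUs

total_dram_bytes = (sectors_read + sectors_written) * BYTES_PER_SECTOR
arithmetic_intensity = total_flops / total_dram_bytes  # FLOP per byte
print(f"AI = {arithmetic_intensity:.2f} FLOP/byte")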

Step 5: Calculate execution time

To find the application’s performance in FLOP/s, we also need the execution time. For this, we can use NVIDIA Nsight Systems (nsys) [4], a system-wide profiler that can accurately measure the runtime of application segments. We run our application again, this time with nsys, to generate a time-based report. From this report, we can extract the total GPU running time.

nsys profile -f true -o <output_file> python3 <application>

Our nsys command changes to:

nsys profile -f true -o time.qdrep python3 main.py /imagenet \
--arch resnet50 --epochs 1 --batch-size 10 --print-freq 10 \
--seed 42

After running this command, we can read the total GPU running time (GPU_RUNNING_TIME) from the generated report.

Step 6: Calculate the application performance

Finally, we calculate the achieved performance in FLOP/s by dividing the total FLOPs by the execution time:

\[FLOP/s = FLOPs / GPU\_RUNNING\_TIME\]

This value gives us the “attainable FLOP/s” that we can plot on our Roofline graph.
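
To place this point on an actual Roofline graph, a small matplotlib sketch like the one below can be used. All numbers here are placeholders: substitute your measured AI and FLOP/s and the peak compute and bandwidth of your GPU:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical hardware peaks and measured values; replace with your own.
PEAK_GFLOPS = 10_000.0     # peak compute, GFLOP/s
PEAK_BW_GBS = 1_000.0      # peak DRAM bandwidth, GB/s
measured_ai = 4.2          # FLOPs / Total_DRAM_Bytes from the steps above
measured_gflops = 3_100.0  # FLOPs / GPU_RUNNING_TIME, in GFLOP/s

ai = np.logspace(-2, 3, 200)                      # Arithmetic Intensity axis
roof = np.minimum(PEAK_GFLOPS, ai * PEAK_BW_GBS)  # naive roofline

plt.loglog(ai, roof, label="Roofline")
plt.loglog(measured_ai, measured_gflops, "o", label="Application")
plt.xlabel("Arithmetic Intensity (FLOP/byte)")
plt.ylabel("Attainable GFLOP/s")
plt.legend()
plt.savefig("roofline.png")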

Profiling with torch

For applications written in PyTorch, the built-in torch.profiler.profile offers a user-friendly way to gather performance data. Developers have two options:

  • Use the Profiler Context Manager
  • Targeted profiling of specific neural network layers

Profiler Context Manager

The part of the code that we want to profile can be wrapped within the with torch.profiler.profile() context manager. In the with statement, you can define the activities to trace (CPU, CUDA, or both), set a schedule to profile specific training steps, and choose whether to record tensor shapes, memory usage, or FLOPs. Once inside the context, you must call prof.step() at the end of each iteration to signal the profiler to advance, especially when a schedule is used.

with profile(
    activities=[...],
    schedule=torch.profiler.schedule(...),
    record_shapes=...,
    profile_memory=...,
    with_flops=...
) as prof:
    ...
    prof.step()

  • activities: Specify whether to profile the CPU, CUDA or both.
  • schedule: Useful for profiling multiple steps in the training loop. If the schedule parameter is used, you need to call prof.step() to advance the profiler to the next step.
  • record_shapes: Whether to record the shapes of the tensors.
  • profile_memory: Whether to capture memory usage.
  • with_flops: Experimental; used to estimate FLOPs for supported operators.

Our profiler command changes to:

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    for step, (images, target) in enumerate(train_loader):
        train_one_batch(images, target)  # placeholder for one training iteration
        prof.step()                      # advance the profiler schedule
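
Once the profiled steps have run, the aggregated results can be read back from the profiler object. A minimal sketch, assuming a recent PyTorch version where with_flops=True populates a per-operator flops estimate, looks like this:

# Operators sorted by CUDA time; with_flops=True adds a FLOPs column
# (support varies by operator and PyTorch version).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# The per-operator FLOP estimates can also be summed manually.
total_flops = sum(evt.flops for evt in prof.key_averages() if evt.flops)
print(f"Estimated FLOPs over the profiled steps: {total_flops:,}")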

Targeted profiling of specific neural network layers

The profiler can also be used in a more targeted manner to analyze specific layers of a neural network. This is useful for checking whether a particular layer contributes more to the runtime than the others, giving the developer the option of modifying just that layer. While this approach is very easy to use, in most cases the first option works better. The PyTorch profiler results can also be exported and visualized in TensorBoard.

prof = profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA])
prof.start()
x = self.conv2(x)  # profile only this layer
prof.stop()
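
As mentioned above, the collected traces can also be exported for TensorBoard. A minimal sketch using the built-in trace handler is shown below; the log directory name is arbitrary, model and inputs stand in for whatever code is being profiled, and viewing the trace requires the torch-tb-profiler TensorBoard plugin:

import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=tensorboard_trace_handler("./log/resnet50"),
) as prof:
    output = model(inputs)  # placeholder for the code being profiled
    prof.step()

# View the results with: tensorboard --logdir=./log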

LLMs and Roofline Modeling

Coming to the topic everyone has been waiting for – does Roofline Modeling help with LLM performance calculation? The short answer is yes.

LLMs are complex neural network architectures with billions of parameters that process massive datasets. While training is a very resource-intensive task, inference and fine-tuning also need to be efficient.

  • Bottlenecks: During inference, LLMs can suffer from bottlenecks due to the sheer number of parameters they work with. These parameters are the model’s weights, and moving them in and out of memory creates memory-bandwidth pressure. Using Roofline Modeling, individual layers can be profiled to locate these bottlenecks.
  • Hardware selection: As most organizations fine-tune existing models rather than training them from scratch, choosing the right infrastructure is crucial for managing costs. For example, choosing hardware that matches your LLM architecture, or optimizing your model for a specific architecture, can cut training and inference costs.

Conclusion

The Roofline Model offers a powerful, visual way to analyze application performance. By plotting an application’s performance against the memory and compute limits of the hardware, it gives clear guidance on the best way to approach optimizations. While this article only considered naive Roofline Models, there are more advanced techniques, such as Hierarchical Roofline Models or adding ceilings for specific compute optimizations.

References

[1] https://docs.nersc.gov/tools/performance/roofline/

[2] https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html

[3] https://github.com/pytorch/examples/tree/main/imagenet

[4] https://developer.nvidia.com/nsight-systems
