Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch

... is part of a series of posts on the topic of analyzing and optimizing PyTorch models. Throughout the series, we have ...

How to Improve the Efficiency of Your PyTorch Training Loop

... models isn’t just about submitting data to the backpropagation algorithm. Often, the key factor determining the success or failure of a ...

Learning Triton One Kernel At a Time: Vector Addition

... a little optimisation goes a long way. Models like GPT4 cost more than $100 million to train, which makes ...

The Crucial Role of NUMA Awareness in High-Performance Deep Learning

... world of deep learning training, the role of the ML developer can be likened to that of the conductor of ...

How to Fine-Tune Small Language Models to Think with Reinforcement Learning

... in fashion. DeepSeek-R1, Gemini-2.5-Pro, OpenAI’s O-series models, Anthropic’s Claude, Magistral, and Qwen3 — there is a new one every month. ...

Pipelining AI/ML Training Workloads with CUDA Streams

... ninth in our series on performance profiling and optimization in PyTorch, aimed at emphasizing the critical role of performance analysis and optimization ...

A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline

... in the data input pipeline of a machine learning model running on a GPU can be particularly frustrating. In most ...

What PyTorch Really Means by a Leaf Tensor and Its Grad

... isn’t yet another explanation of the chain rule. It’s a tour through the bizarre side of autograd — where gradients ...
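The leaf-tensor distinction the post teases can be illustrated with a minimal sketch using the standard PyTorch autograd API (the variable names here are illustrative, not from the post):

```python
import torch

# A tensor created directly by the user is a leaf tensor.
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(x.is_leaf)   # True

# A tensor produced by an operation on a leaf is not a leaf.
y = x * 2
print(y.is_leaf)   # False

# backward() accumulates gradients only on leaf tensors; a non-leaf's
# .grad stays None by default (accessing it emits a UserWarning).
y.sum().backward()
print(x.grad)      # tensor([2., 2.])
```

To retain gradients on a non-leaf tensor, PyTorch provides `y.retain_grad()`, called before `backward()`.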

Use PyTorch to Easily Access Your GPU

... are lucky enough to have access to a system with an Nvidia Graphics Processing Unit (GPU). Did you know there ...
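The standard device-selection idiom the post alludes to can be sketched as follows (plain PyTorch API; the tensor shapes are arbitrary):

```python
import torch

# Select the GPU if PyTorch can see a CUDA device, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors created on (or moved to) the device are computed there.
x = torch.randn(3, 3, device=device)
y = (x @ x).sum()
print(f"running on {device}")
```

The same code runs unchanged on CPU-only machines, which makes the idiom a safe default in portable training scripts.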