If you read the title of this article, you might think that ResNeXt is directly derived from ResNet. That's true, but I think it's not entirely accurate. To me, ResNeXt is more like a combination of ResNet, VGG, and Inception all at once, and I'll show you why in a moment. In this article we are going to talk about the ResNeXt architecture: its history, the details of the architecture itself, and, last but not least, the implementation from scratch with PyTorch.
The History of ResNeXt
The hyperparameters we usually focus on when tuning a neural network are its depth and width, which correspond to the number of layers and the number of channels, respectively. We see this in VGG and ResNet, whose authors proposed small kernels and skip-connections so that the depth of the model could be increased easily. In theory, this simple approach is indeed capable of expanding model capacity. However, these two hyperparameter dimensions are always associated with a significant change in the number of parameters, which becomes a problem once the model grows too large just to gain a slight improvement in accuracy. On the other hand, Inception is computationally cheaper in theory, yet it has a complex architectural design, which requires more effort to tune its depth and width. If you have ever learned about Inception, it essentially works by passing a tensor through several convolution layers of different kernel sizes and letting the network decide which one best represents the features of a specific task.
Xie et al. wondered whether they could extract the best parts of the three models so that model tuning could be as easy as in VGG and ResNet while maintaining the efficiency of Inception. Their ideas are wrapped up in a paper titled "Aggregated Residual Transformations for Deep Neural Networks" [1], where they named the network ResNeXt. This is where the new concept referred to as cardinality came from: it adopts the idea of Inception, i.e., passing a tensor through multiple branches, yet in a simpler, more scalable way. We can perceive cardinality as a new parameter to tune in addition to depth and width. By doing so, we essentially get the next hyperparameter dimension, hence the name ResNeXt, which gives us a higher degree of freedom when tuning the model.
ResNeXt Module
According to the paper, there are three ways to implement cardinality, which you can see in Figure 1 below. The paper also mentions that setting cardinality to 32 is best practice, as it generally provides a good balance between accuracy and computational complexity, so I'll use this number in the following example.
The input of the three modules above is exactly the same: an image tensor with 256 channels. In variant (a), the input tensor is duplicated 32 times, and each copy is processed independently, representing the 32 paths. The first convolution layer in each path is responsible for projecting the 256-channel input down to 4 channels using a 1×1 kernel, which is followed by two more layers: a 3×3 convolution that preserves the number of channels, and a 1×1 convolution that expands the channels back to 256. The tensors from the 32 branches are then aggregated by element-wise summation before eventually being summed again with the original input tensor from the very beginning of the module through the skip-connection.
Remember that Inception uses the idea of split-transform-merge. This is exactly what I just explained for ResNeXt block variant (a), where the split is done before the first 1×1 convolution layer, the transform is performed within each branch, and the merge is the element-wise summation. The same idea applies to variant (b), except that the merge is performed by channel-wise concatenation, resulting in a 128-channel tensor (4 channels × 32 paths). The resulting tensor is then projected back to the original dimension by a 1×1 convolution layer before eventually being summed with the original input tensor.
Notice that there is the word equivalent in the top-left corner of the above figure. This means that the three ResNeXt block variants are essentially the same in terms of the number of parameters, FLOPs, and the resulting accuracy scores. This makes sense because they are all derived from the same mathematical formulation, which I'll talk more about in the next section. Despite this equivalence, I'll go with option (c) in the implementation part, because this variant employs so-called group convolution, which is much easier to implement than (a) and (b). In case you're not yet familiar with the term, group convolution is a technique where we divide the input channels into several groups, convolve each group independently, and eventually concatenate the results. In the case of (c), we reduce the number of channels from 256 to 128 before the split, giving us 32 kernel groups where each is responsible for processing 4 channels. We then project the tensor back to the original number of channels so that we can sum it with the original input tensor.
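To make group convolution more concrete, here is a minimal standalone sketch (not part of the final implementation) comparing a standard 3×3 convolution with a grouped one using the built-in groups argument of nn.Conv2d:

# Standalone sketch: standard vs. grouped 3x3 convolution on a 128-channel tensor.
import torch
import torch.nn as nn

standard = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
grouped  = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False, groups=32)

x = torch.randn(1, 128, 56, 56)
print(standard(x).shape)  # torch.Size([1, 128, 56, 56])
print(grouped(x).shape)   # torch.Size([1, 128, 56, 56]) -- same output shape

# Each of the 32 groups only sees 128/32 = 4 input channels, so the
# grouped layer needs 32x fewer weights than the standard one.
print(sum(p.numel() for p in standard.parameters()))  # 147456
print(sum(p.numel() for p in grouped.parameters()))   # 4608

Notice that the grouped layer produces a tensor of exactly the same shape, yet the channels are processed in 32 independent bundles of 4, which is precisely the splitting used by variant (c).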
Mathematical Definition
As I mentioned earlier, here is the formal mathematical definition of a ResNeXt module.
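For reference, in the notation of the paper [1] it reads:

$$y = x + \sum_{i=1}^{C} \mathcal{T}_i(x)$$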
The above equation encapsulates the entire split-transform-merge operation, where x is the original input tensor, y is the output tensor, C is the cardinality parameter determining the number of parallel paths, T_i is the transformation applied in the i-th path, and ∑ indicates that we merge the information from all transformed tensors. However, it is important to note that even though sigma usually denotes summation, only variant (a) actually sums the tensors. Both (b) and (c) do the merging through concatenation followed by a 1×1 convolution instead, which is nevertheless still equivalent to (a).
The Entire ResNeXt Architecture
The structure displayed in Figure 1 and the equation in Figure 2 basically only correspond to a single ResNeXt block. In order to construct the entire architecture, we need to stack the block multiple times following the structure shown in Figure 3 below.
Here you can see that the structure of ResNeXt is nearly identical to ResNet. So, I believe you will find the ResNeXt implementation extremely easy, especially if you have implemented ResNet before. The first difference you might notice is the number of kernels in the first two convolution layers of each block, where a ResNeXt block generally has twice as many kernels as the corresponding ResNet block, from the conv2 stage all the way to the conv5 stage. Secondly, you can also clearly see that the cardinality parameter is applied to the second convolution layer in each ResNeXt block.
The variant displayed above, which is equivalent to ResNet-50, is the one referred to as ResNeXt-50 (32×4d). This naming convention indicates that the variant consists of 50 layers in the main branch, with a cardinality of 32 and 4 channels per path within the conv2 stage. As of this writing, there are three ResNeXt variants already implemented in PyTorch, namely resnext50_32x4d, resnext101_32x8d, and resnext101_64x4d [2]. You can definitely import them easily alongside their pretrained weights if you want, as sketched below. However, in this article we are going to implement the architecture from scratch instead.
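For completeness, here is what loading one of those pretrained variants typically looks like (a minimal sketch, assuming torchvision >= 0.13, which introduced the weights enum API):

# Loading the ready-made torchvision model with pretrained ImageNet-1K weights.
from torchvision.models import resnext50_32x4d, ResNeXt50_32X4D_Weights

model = resnext50_32x4d(weights=ResNeXt50_32X4D_Weights.DEFAULT)
model.eval()  # switch to inference mode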
ResNeXt Implementation
Now that we understand the underlying theory behind ResNeXt, let's get our hands dirty with the code! The first thing to do is import the required modules as shown in Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn
from torchinfo import summary
Here I am going to implement the ResNeXt-50 (32×4d) variant. So, I need to set the parameters in Codeblock 2 according to the architectural details shown back in Figure 3.
# Codeblock 2
CARDINALITY = 32 #(1)
NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048] #(2)
NUM_BLOCKS = [3, 4, 6, 3] #(3)
NUM_CLASSES = 1000 #(4)
The CARDINALITY variable at line #(1) is self-explanatory, so I don't think I need to explain it any further. Next, the NUM_CHANNELS variable stores the number of output channels of each stage, except for index 0, which corresponds to the number of input channels (#(2)). At line #(3), NUM_BLOCKS determines how many times we will repeat the corresponding block. Note that we don't specify any number for the conv1 stage since that stage only consists of a single block. Lastly, we set the NUM_CLASSES parameter to 1000 since ResNeXt is originally pretrained on the ImageNet-1K dataset (#(4)).
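As a quick sanity check of these constants, we can recover the "4d" in the model name directly from them, as sketched below:

# The first two convolutions in a conv2-stage block use half the stage's
# output channels, split evenly across the CARDINALITY groups.
mid_channels_conv2 = NUM_CHANNELS[2] // 2            # 256 // 2 = 128
width_per_path = mid_channels_conv2 // CARDINALITY   # 128 // 32 = 4
print(width_per_path)  # 4 -- the "4d" in ResNeXt-50 (32x4d)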
The ResNeXt Module
Since the entire ResNeXt architecture is basically just a stack of ResNeXt modules, we can create a single class to define the module and later use it repeatedly in the main class. In this case, I refer to the module as Block. The implementation of this class is pretty long, though, so I decided to break it down into several codeblocks. Just ensure that all the codeblocks of the same number are placed within the same notebook cell if you want to run the code.
You can see in Codeblock 3a below that the __init__() method of this class accepts several parameters. The in_channels parameter (#(1)) sets the number of channels of the tensor to be passed into the block. I made it adjustable because blocks in different stages have different input shapes. Secondly, the add_channel and downsample parameters (#(2,4)) are flags controlling whether the block expands the number of channels and whether it performs spatial downsampling. If you take a closer look at Figure 3, you'll notice that every time we move from one stage to another, the number of output channels of the block becomes twice as large as the output from the previous stage, while at the same time the spatial dimension is reduced by half. We need to set both add_channel and downsample to True whenever we move from one stage to the next. Otherwise, we set the two parameters to False if we only move from one block to another within the same stage. The channel_multiplier parameter (#(3)), on the other hand, determines the number of output channels relative to the number of input channels. This parameter is important because there is a special case where we need the number of output channels to be four times larger instead of two, i.e., when we move from the conv1 stage (64) to the conv2 stage (256).
# Codeblock 3a
class Block(nn.Module):
    def __init__(self,
                 in_channels,           #(1)
                 add_channel=False,     #(2)
                 channel_multiplier=2,  #(3)
                 downsample=False):     #(4)
        super().__init__()

        self.add_channel = add_channel
        self.channel_multiplier = channel_multiplier
        self.downsample = downsample

        if self.add_channel:  #(5)
            out_channels = in_channels*self.channel_multiplier  #(6)
        else:
            out_channels = in_channels  #(7)

        mid_channels = out_channels//2  #(8)

        if self.downsample:  #(9)
            stride = 2  #(10)
        else:
            stride = 1
The parameters we just discussed directly control the if statements at lines #(5) and #(9). The former is executed whenever add_channel is True, in which case the number of input channels is multiplied by channel_multiplier to obtain the number of output channels (#(6)). Meanwhile, if it is False, we keep the input and output tensor dimensions the same (#(7)). Here we set mid_channels to be half the size of out_channels (#(8)), because according to Figure 3 the number of channels in the output of the first two convolution layers within each block is half that of the third convolution layer. Next, the downsample flag we defined earlier controls the if statement at line #(9). Whenever it is set to True, the stride variable is assigned 2 (#(10)), which will later cause the convolution layer to reduce the spatial dimension of the image by half.
Still inside the __init__() method, let's now define the layers within the ResNeXt block. See Codeblock 3b below for the details.
# Codeblock 3b
        if self.add_channel or self.downsample:  #(1)
            self.projection = nn.Conv2d(in_channels=in_channels,  #(2)
                                        out_channels=out_channels,
                                        kernel_size=1,
                                        stride=stride,
                                        padding=0,
                                        bias=False)
            nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
            self.bn_proj = nn.BatchNorm2d(num_features=out_channels)

        self.conv0 = nn.Conv2d(in_channels=in_channels,    #(3)
                               out_channels=mid_channels,  #(4)
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
        self.bn0 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv1 = nn.Conv2d(in_channels=mid_channels,  #(5)
                               out_channels=mid_channels,
                               kernel_size=3,
                               stride=stride,  #(6)
                               padding=1,
                               bias=False,
                               groups=CARDINALITY)  #(7)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        self.bn1 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv2 = nn.Conv2d(in_channels=mid_channels,   #(8)
                               out_channels=out_channels,  #(9)
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        self.bn2 = nn.BatchNorm2d(num_features=out_channels)

        self.relu = nn.ReLU()
Remember that there are cases where the output dimension of a ResNeXt block differs from the input. In such cases, the element-wise summation at the last step cannot be performed (refer to Figure 1). This is the reason we initialize a projection layer whenever either the add_channel or downsample flag is True (#(1)). This projection layer (#(2)), which is a 1×1 convolution, processes the tensor in the skip-connection so that its output shape matches the tensor produced by the main flow, allowing the two to be summed. Otherwise, if we want the ResNeXt module to preserve the tensor dimension, we set both flags to False so that the projection layer is not initialized, since we can then directly sum the skip-connection with the tensor from the main flow.
The main flow of the ResNeXt module itself comprises three convolution layers, which I refer to as conv0, conv1 and conv2, as written at lines #(3), #(5) and #(8) respectively. If we take a closer look at these layers, we can see that both conv0 and conv2 are responsible for manipulating the number of channels. At lines #(3) and #(4), conv0 changes the number of channels from in_channels to mid_channels, while conv2 changes it from mid_channels to out_channels (#(8-9)). The conv1 layer, on the other hand, is responsible for controlling the spatial dimension through the stride parameter (#(6)), whose value is determined by the downsample flag we discussed earlier. Furthermore, this conv1 layer performs the entire split-transform-merge process through group convolution (#(7)), where the number of groups corresponds to the cardinality of ResNeXt.
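As a rough check that the grouped conv1 layer really is the cheap part of the block, we can count its weights by hand for a conv2-stage block (mid_channels = 128); the result matches the 3×3 Conv2d rows that will appear later in the torchinfo summary:

# Weights in the grouped 3x3 convolution: each of the 32 groups maps
# 4 input channels to 4 output channels (128 / 32 = 4).
per_group_in = 128 // CARDINALITY   # 4
weights_grouped = 3 * 3 * per_group_in * per_group_in * CARDINALITY
print(weights_grouped)  # 4608

# Without grouping, the same layer would need 3*3*128*128 = 147456 weights.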
Additionally, here we initialize batch normalization layers named bn_proj, bn0, bn1, and bn2. Later in the forward() method, we are going to place them right after the corresponding convolution layers, following the Conv-BN-ReLU structure, which is standard practice when constructing a CNN-based model. Notice also that we call nn.init.kaiming_normal_() after initializing each convolution layer, so that the initial layer weights follow the Kaiming normal distribution as mentioned in the paper.
That was everything for the __init__() method. Now we can move on to the forward() method to actually define the flow of the ResNeXt module. See Codeblock 3c below.
# Codeblock 3c
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        if self.add_channel or self.downsample:  #(1)
            residual = self.bn_proj(self.projection(x))  #(2)
            print(f'after projection\t: {residual.size()}')
        else:
            residual = x  #(3)
            print(f'no projection\t\t: {residual.size()}')

        x = self.conv0(x)  #(4)
        x = self.bn0(x)
        x = self.relu(x)
        print(f'after conv0-bn0-relu\t: {x.size()}')

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        print(f'after conv1-bn1-relu\t: {x.size()}')

        x = self.conv2(x)  #(5)
        x = self.bn2(x)
        print(f'after conv2-bn2\t\t: {x.size()}')

        x = x + residual
        x = self.relu(x)  #(6)
        print(f'after summation\t\t: {x.size()}')

        return x
Here you can see that this function accepts x as its only input, which is basically the tensor produced by the previous ResNeXt block. The if statement at line #(1) checks whether we are about to change the tensor dimensions. If so, the tensor in the skip-connection is passed through the projection layer and its corresponding batch normalization layer before eventually being stored in the residual variable (#(2)). If not, we simply set residual to be exactly the same as x (#(3)). Next, we process the main tensor x using the stack of convolution layers, starting from conv0 (#(4)) all the way to conv2 (#(5)). It is important to note that the Conv-BN-ReLU structure around the conv2 layer is slightly different, in that the ReLU activation function is applied after the element-wise summation (#(6)).
Now let’s test the ResNeXt block we just created to find out whether we have implemented it correctly. There are three conditions I am going to test here, namely when we move from one stage to another (setting both add_channel
and downsample
to True
), when we move from one block to another within the same stage (both add_channel
and downsample
are False
), and when we move from conv1 stage to conv2 stage (setting downsample
to False
and add_channel
to True
with 4 channel multiplier).
Test Case 1
Codeblock 4 below demonstrates the first test case, in which I simulate the first block of the conv3 stage. If you go back to Figure 3, you will see that the output from the previous stage is a 256-channel image, so we need to set the in_channels parameter accordingly. Meanwhile, the output of a ResNeXt block in this stage has 512 channels with a 28×28 spatial dimension. This tensor shape transformation is the reason we set both flags to True. Here we assume that the x tensor passed through the network is a dummy image produced by the conv2 stage.
# Codeblock 4
block = Block(in_channels=256, add_channel=True, downsample=True)
x = torch.randn(1, 256, 56, 56)
out = block(x)
And below is what the output looks like. At line #(1) we can see that our projection layer successfully projected the tensor to 512×28×28, exactly matching the shape of the output tensor from the main flow (#(4)). The conv0 layer at line #(2) does not alter the tensor dimension at all since in this case our in_channels and mid_channels are the same. The actual spatial downsampling is performed by the conv1 layer, where the image resolution is reduced from 56×56 to 28×28 (#(3)) thanks to the stride being set to 2 in this case. The process is then continued by the conv2 layer, which doubles the number of channels from 256 to 512 (#(4)). Lastly, this tensor is element-wise summed with the projected skip-connection tensor (#(5)). And with that, we successfully converted our tensor from 256×56×56 to 512×28×28.
# Codeblock 4 Output
original : torch.Size([1, 256, 56, 56])
after projection : torch.Size([1, 512, 28, 28]) #(1)
after conv0-bn0-relu : torch.Size([1, 256, 56, 56]) #(2)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28]) #(3)
after conv2-bn2 : torch.Size([1, 512, 28, 28]) #(4)
after summation : torch.Size([1, 512, 28, 28]) #(5)
Test Case 2
To demonstrate the second test case, I will simulate a block inside the conv3 stage whose input is a tensor produced by the previous block within the same stage. In this case, we want the input and output dimensions of the ResNeXt module to be the same, hence we set both add_channel and downsample to False. See Codeblock 5 and the resulting output below for the details.
# Codeblock 5
block = Block(in_channels=512, add_channel=False, downsample=False)
x = torch.randn(1, 512, 28, 28)
out = block(x)
# Codeblock 5 Output
original : torch.Size([1, 512, 28, 28])
no projection : torch.Size([1, 512, 28, 28]) #(1)
after conv0-bn0-relu : torch.Size([1, 256, 28, 28]) #(2)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28])
after conv2-bn2 : torch.Size([1, 512, 28, 28]) #(3)
after summation : torch.Size([1, 512, 28, 28])
As I've mentioned earlier, the projection layer is not used when the block neither expands channels nor downsamples. This is the reason that at line #(1) our skip-connection tensor shape is unchanged. Next, the channel count is reduced to 256 by the conv0 layer, since in this case mid_channels is half the size of out_channels (#(2)). We eventually expand this number of channels back to 512 using the conv2 layer (#(3)). This kind of structure is commonly known as a bottleneck since it follows a wide-narrow-wide pattern, first introduced in the original ResNet paper [3].
Test Case 3
The third test is a special case since we are about to simulate the first block in the conv2 stage, where we need to set the add_channel flag to True while leaving downsample as False. Here we don't want to perform spatial downsampling in the convolution layer because it has already been done by a max-pooling layer. Furthermore, you can also see in Figure 3 that the conv1 stage returns an image of 64 channels. For this reason, we need to set the channel_multiplier parameter to 4, since we want the subsequent conv2 stage to return 256 channels. See the details in Codeblock 6 below.
# Codeblock 6
block = Block(in_channels=64, add_channel=True, channel_multiplier=4, downsample=False)
x = torch.randn(1, 64, 56, 56)
out = block(x)
# Codeblock 6 Output
original : torch.Size([1, 64, 56, 56])
after projection : torch.Size([1, 256, 56, 56]) #(1)
after conv0-bn0-relu : torch.Size([1, 128, 56, 56]) #(2)
after conv1-bn1-relu : torch.Size([1, 128, 56, 56])
after conv2-bn2 : torch.Size([1, 256, 56, 56]) #(3)
after summation : torch.Size([1, 256, 56, 56])
It is seen in the resulting output above that the ResNeXt module automatically utilizes the projection layer, which in this case successfully converted the 64×56×56 tensor into 256×56×56 (#(1)). Here you can see that the number of channels expanded to be 4 times larger while the spatial dimension remained the same. Afterwards, we shrink the channel count to 128 (#(2)) and expand it back to 256 (#(3)), following the bottleneck mechanism. Thus, we can now perform summation between the tensor from the main flow and the one produced by the projection layer.
At this point our ResNeXt module handles all three cases properly. So, I believe this module is now ready to be assembled into the entire ResNeXt architecture.
The Entire ResNeXt Architecture
Since the following ResNeXt class is pretty long, I break it down into two codeblocks to make things easier to follow. What we basically need to do in the __init__() method in Codeblock 7a is initialize the ResNeXt modules using the Block class we created earlier. Implementing the conv3 (#(9)), conv4 (#(12)) and conv5 (#(15)) stages is pretty straightforward, since all we need to do is initialize the blocks inside an nn.ModuleList. Remember that the first block within each stage is a downsampling block, while the rest of them are not intended to perform downsampling. For this reason, we initialize the first block manually by setting both the add_channel and downsample flags to True (#(10,13,16)), whereas the remaining blocks are initialized using loops that iterate according to the numbers stored in the NUM_BLOCKS list (#(11,14,17)).
# Codeblock 7a
class ResNeXt(nn.Module):
    def __init__(self):
        super().__init__()

        # conv1 stage  #(1)
        self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                       out_channels=NUM_CHANNELS[1],
                                       kernel_size=7,  #(2)
                                       stride=2,       #(3)
                                       padding=3,
                                       bias=False)
        nn.init.kaiming_normal_(self.resnext_conv1.weight,
                                nonlinearity='relu')
        self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
        self.relu = nn.ReLU()
        self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,  #(4)
                                             stride=2,
                                             padding=1)

        # conv2 stage  #(5)
        self.resnext_conv2 = nn.ModuleList([
            Block(in_channels=NUM_CHANNELS[1],
                  add_channel=True,  #(6)
                  channel_multiplier=4,
                  downsample=False)  #(7)
        ])
        for _ in range(NUM_BLOCKS[0]-1):  #(8)
            self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))

        # conv3 stage  #(9)
        self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],  #(10)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):  #(11)
            self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))

        # conv4 stage  #(12)
        self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],  #(13)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[2]-1):  #(14)
            self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))

        # conv5 stage  #(15)
        self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],  #(16)
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[3]-1):  #(17)
            self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))  #(18)
        self.fc = nn.Linear(in_features=NUM_CHANNELS[5],  #(19)
                            out_features=NUM_CLASSES)
As we discussed earlier, the conv2 stage (#(5)) is a bit special since its first block increases the number of channels yet does not reduce the spatial dimension. This is the reason I set the add_channel parameter to True (#(6)) while the downsample parameter is set to False (#(7)). The remaining blocks are initialized the same way as in the other stages, with a simple loop (#(8)).
The conv1 stage (#(1)), on the other hand, does not utilize the Block class since its structure is completely different from the other stages. According to Figure 3, this stage only comprises a single 7×7 convolution layer (#(2)), which allows us to capture a larger context from the input image. The tensor produced by this layer will have half the spatial dimensions of the input thanks to the stride parameter being set to 2 (#(3)). Further downsampling is performed by a max-pooling layer with the same stride, which again reduces the spatial dimension by half (#(4)). Strictly speaking, this max-pooling layer belongs to the conv2 stage, but in this implementation I put it outside the nn.ModuleList of that stage for the sake of simplicity.
Lastly, we initialize a global average pooling layer (#(18)), which works by taking the average value of each channel in the tensor produced by the last convolution layer. By doing this, we end up with a single number representing each channel. This tensor is then connected to the output layer that produces NUM_CLASSES (1000) neurons (#(19)), each corresponding to a class in the dataset.
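If the effect of global average pooling is not yet clear, here is a tiny standalone illustration of the shape transformation it performs, together with the flattening step that follows later:

# Global average pooling collapses each 7x7 feature map into one number.
pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
feat = torch.randn(1, 2048, 7, 7)           # simulated conv5 output
pooled = pool(feat)                         # torch.Size([1, 2048, 1, 1])
flat = torch.flatten(pooled, start_dim=1)   # torch.Size([1, 2048])
print(pooled.size(), flat.size())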
Now look at Codeblock 7b below to see how I define the forward() method. There is not much to explain here, since all we do is pass the tensor from one layer to the next sequentially.
# Codeblock 7b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
        print(f'after resnext_conv1\t: {x.size()}')
        x = self.resnext_maxpool1(x)
        print(f'after resnext_maxpool1\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv2):
            x = block(x)
            print(f'after resnext_conv2 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv3):
            x = block(x)
            print(f'after resnext_conv3 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv4):
            x = block(x)
            print(f'after resnext_conv4 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv5):
            x = block(x)
            print(f'after resnext_conv5 #{i}\t: {x.size()}')

        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')
        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')
        x = self.fc(x)
        print(f'after fc\t\t: {x.size()}')

        return x
Next, let's test our ResNeXt class using the following code. Here I pass a dummy tensor of size 3×224×224, which simulates a single RGB image of size 224×224.
# Codeblock 8
resnext = ResNeXt()
x = torch.randn(1, 3, 224, 224)
out = resnext(x)
# Codeblock 8 Output
original : torch.Size([1, 3, 224, 224])
after resnext_conv1 : torch.Size([1, 64, 112, 112]) #(1)
after resnext_maxpool1 : torch.Size([1, 64, 56, 56]) #(2)
after resnext_conv2 #0 : torch.Size([1, 256, 56, 56]) #(3)
after resnext_conv2 #1 : torch.Size([1, 256, 56, 56]) #(4)
after resnext_conv2 #2 : torch.Size([1, 256, 56, 56]) #(5)
after resnext_conv3 #0 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #1 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #2 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #3 : torch.Size([1, 512, 28, 28])
after resnext_conv4 #0 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #1 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #2 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #3 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #4 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #5 : torch.Size([1, 1024, 14, 14])
after resnext_conv5 #0 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #1 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #2 : torch.Size([1, 2048, 7, 7])
after avgpool : torch.Size([1, 2048, 1, 1]) #(6)
after flatten : torch.Size([1, 2048]) #(7)
after fc : torch.Size([1, 1000]) #(8)
We can see in the above output that our conv1 stage correctly reduces the spatial dimension from 224×224 to 112×112 while also increasing the number of channels to 64 (#(1)). The downsampling is continued by the max-pooling layer, which reduces the spatial dimension of the image to 56×56 (#(2)). Moving on to the conv2 stage, we can see that the first block in the stage successfully converted the 64-channel image into a 256-channel one (#(3)), and the subsequent blocks in the same stage preserve this tensor dimension (#(4-5)). The same pattern repeats in the following stages until we reach the global average pooling layer (#(6)). It is important to note that we need to flatten the tensor (#(7)) to drop the empty axes before eventually connecting it to the output layer (#(8)). And that concludes how a tensor flows through the ResNeXt architecture.
Additionally, you can use the summary() function that we previously loaded from torchinfo if you want to dig deeper into the architectural details. You can see at the end of the output below that we got 25,028,904 parameters in total. In fact, this parameter count matches exactly that of the ResNeXt-50 (32×4d) model from PyTorch, so I believe our implementation here is correct. You can verify this in the link at reference number [4], or with the quick cross-check shown after the summary output.
# Codeblock 9
resnext = ResNeXt()
summary(resnext, input_size=(1, 3, 224, 224))
# Codeblock 9 Output
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
ResNeXt [1000] --
├─Conv2d: 1-1 [1, 64, 112, 112] 9,408
├─BatchNorm2d: 1-2 [1, 64, 112, 112] 128
├─ReLU: 1-3 [1, 64, 112, 112] --
├─MaxPool2d: 1-4 [1, 64, 56, 56] --
├─ModuleList: 1-5 -- --
│ └─Block: 2-1 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-1 [1, 256, 56, 56] 16,384
│ │ └─BatchNorm2d: 3-2 [1, 256, 56, 56] 512
│ │ └─Conv2d: 3-3 [1, 128, 56, 56] 8,192
│ │ └─BatchNorm2d: 3-4 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-5 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-6 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-7 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-8 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-9 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-10 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-11 [1, 256, 56, 56] --
│ └─Block: 2-2 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-12 [1, 128, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-13 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-14 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-15 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-16 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-17 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-18 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-19 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-20 [1, 256, 56, 56] --
│ └─Block: 2-3 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-21 [1, 128, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-22 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-23 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-24 [1, 128, 56, 56] 4,608
│ │ └─BatchNorm2d: 3-25 [1, 128, 56, 56] 256
│ │ └─ReLU: 3-26 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-27 [1, 256, 56, 56] 32,768
│ │ └─BatchNorm2d: 3-28 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-29 [1, 256, 56, 56] --
├─ModuleList: 1-6 -- --
│ └─Block: 2-4 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-30 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-31 [1, 512, 28, 28] 1,024
│ │ └─Conv2d: 3-32 [1, 256, 56, 56] 65,536
│ │ └─BatchNorm2d: 3-33 [1, 256, 56, 56] 512
│ │ └─ReLU: 3-34 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-35 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-36 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-37 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-38 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-39 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-40 [1, 512, 28, 28] --
│ └─Block: 2-5 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-41 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-42 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-43 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-44 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-45 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-46 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-47 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-48 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-49 [1, 512, 28, 28] --
│ └─Block: 2-6 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-50 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-51 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-52 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-53 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-54 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-55 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-56 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-57 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-58 [1, 512, 28, 28] --
│ └─Block: 2-7 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-59 [1, 256, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-60 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-61 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-62 [1, 256, 28, 28] 18,432
│ │ └─BatchNorm2d: 3-63 [1, 256, 28, 28] 512
│ │ └─ReLU: 3-64 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-65 [1, 512, 28, 28] 131,072
│ │ └─BatchNorm2d: 3-66 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-67 [1, 512, 28, 28] --
├─ModuleList: 1-7 -- --
│ └─Block: 2-8 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-68 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-69 [1, 1024, 14, 14] 2,048
│ │ └─Conv2d: 3-70 [1, 512, 28, 28] 262,144
│ │ └─BatchNorm2d: 3-71 [1, 512, 28, 28] 1,024
│ │ └─ReLU: 3-72 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-73 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-74 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-75 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-76 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-77 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-78 [1, 1024, 14, 14] --
│ └─Block: 2-9 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-79 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-80 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-81 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-82 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-83 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-84 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-85 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-86 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-87 [1, 1024, 14, 14] --
│ └─Block: 2-10 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-88 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-89 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-90 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-91 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-92 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-93 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-94 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-95 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-96 [1, 1024, 14, 14] --
│ └─Block: 2-11 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-97 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-98 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-99 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-100 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-101 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-102 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-103 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-104 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-105 [1, 1024, 14, 14] --
│ └─Block: 2-12 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-106 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-107 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-108 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-109 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-110 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-111 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-112 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-113 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-114 [1, 1024, 14, 14] --
│ └─Block: 2-13 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-115 [1, 512, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-116 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-117 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-118 [1, 512, 14, 14] 73,728
│ │ └─BatchNorm2d: 3-119 [1, 512, 14, 14] 1,024
│ │ └─ReLU: 3-120 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-121 [1, 1024, 14, 14] 524,288
│ │ └─BatchNorm2d: 3-122 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-123 [1, 1024, 14, 14] --
├─ModuleList: 1-8 -- --
│ └─Block: 2-14 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-124 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-125 [1, 2048, 7, 7] 4,096
│ │ └─Conv2d: 3-126 [1, 1024, 14, 14] 1,048,576
│ │ └─BatchNorm2d: 3-127 [1, 1024, 14, 14] 2,048
│ │ └─ReLU: 3-128 [1, 1024, 14, 14] --
│ │ └─Conv2d: 3-129 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-130 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-131 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-132 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-133 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-134 [1, 2048, 7, 7] --
│ └─Block: 2-15 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-135 [1, 1024, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-136 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-137 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-138 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-139 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-140 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-141 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-142 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-143 [1, 2048, 7, 7] --
│ └─Block: 2-16 [1, 2048, 7, 7] --
│ │ └─Conv2d: 3-144 [1, 1024, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-145 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-146 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-147 [1, 1024, 7, 7] 294,912
│ │ └─BatchNorm2d: 3-148 [1, 1024, 7, 7] 2,048
│ │ └─ReLU: 3-149 [1, 1024, 7, 7] --
│ │ └─Conv2d: 3-150 [1, 2048, 7, 7] 2,097,152
│ │ └─BatchNorm2d: 3-151 [1, 2048, 7, 7] 4,096
│ │ └─ReLU: 3-152 [1, 2048, 7, 7] --
├─AdaptiveAvgPool2d: 1-9 [1, 2048, 1, 1] --
├─Linear: 1-10 [1, 1000] 2,049,000
==========================================================================================
Total params: 25,028,904
Trainable params: 25,028,904
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 6.28
==========================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 230.42
Params size (MB): 100.12
Estimated Total Size (MB): 331.13
==========================================================================================
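If you want to double-check the total parameter count yourself, one quick way (assuming torchvision is installed) is to count the parameters of the built-in model directly:

# Cross-check against torchvision's reference implementation. Random
# weights are fine here since we only count parameters.
from torchvision.models import resnext50_32x4d

reference = resnext50_32x4d()
print(sum(p.numel() for p in reference.parameters()))  # 25028904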
Ending
I think that’s everything about ResNeXt and its implementation. You can also find the entire code used in this article on my GitHub repo [5].
I hope you learned something new today, and thank you very much for reading! See you in my next article.
References
[1] Saining Xie et al. Aggregated Residual Transformations for Deep Neural Networks. arXiv. https://arxiv.org/abs/1611.05431 [Accessed March 1, 2025].
[2] ResNeXt. PyTorch. https://pytorch.org/vision/main/models/resnext.html [Accessed March 1, 2025].
[3] Kaiming He et al. Deep Residual Learning for Image Recognition. arXiv. https://arxiv.org/abs/1512.03385 [Accessed March 1, 2025].
[4] resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 1, 2025].
[5] MuhammadArdiPutra. Taking ResNet to the NeXt Level — ResNeXt. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/Taking%20ResNet%20to%20the%20NeXt%20Level%20-%20ResNeXt.ipynb [Accessed April 7, 2025].