Image segmentation is a popular task in computer vision, with the goal of partitioning an input image into multiple regions, where each region represents a separate object.
Several classic approaches involved taking a model backbone (e.g., U-Net) and fine-tuning it on specialized datasets. While fine-tuning works well, the emergence of models such as GPT-2 and GPT-3 prompted the machine learning community to gradually shift its focus toward zero-shot learning solutions.
Zero-shot learning refers to the ability of a model to perform a task without having explicitly received any training examples for it.
The zero-shot concept plays an important role by allowing the fine-tuning phase to be skipped, with the hope that the model is intelligent enough to solve any task on the go.
In the context of computer vision, Meta released the widely known general-purpose “Segment Anything Model” (SAM) in 2023, which enabled segmentation tasks to be performed with decent quality in a zero-shot manner.
While the large-scale results of SAM were impressive, several months later the Image and Video Analysis (IVA) group at the Institute of Automation, Chinese Academy of Sciences (CASIA) released the FastSAM model. As the adjective “fast” suggests, FastSAM addresses the speed limitations of SAM by accelerating the inference process by up to 50 times, while maintaining high segmentation quality.
In this article, we will explore the FastSAM architecture, possible inference options, and examine what makes it “fast” compared to the standard SAM model. In addition, we will look at a code example to help solidify our understanding.
As a prerequisite, it is highly recommended that you are familiar with the basics of computer vision, the YOLO model, and understand the goal of segmentation tasks.
Architecture
The inference process in FastSAM takes place in two steps:
- All-instance segmentation. The goal is to produce segmentation masks for all objects in the image.
- Prompt-guided selection. After obtaining all possible masks, prompt-guided selection returns the image region corresponding to the input prompt.
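To make this two-step workflow concrete, here is a minimal usage sketch based on the API shown in the official repository's README (the image path, point coordinates, and thresholds are placeholders and may need adjusting):

from fastsam import FastSAM, FastSAMPrompt

# Step 1: all-instance segmentation, producing masks for every object in the image.
model = FastSAM("FastSAM-x.pt")  # pretrained weights from the repository
everything_results = model(
    "image.jpg", device="cpu", retina_masks=True, imgsz=1024, conf=0.4, iou=0.9
)

# Step 2: prompt-guided selection, returning only the region matching the prompt.
prompt_process = FastSAMPrompt("image.jpg", everything_results, device="cpu")
masks = prompt_process.point_prompt(points=[[320, 240]], pointlabel=[1])  # 1 = foreground
prompt_process.plot(annotations=masks, output_path="output.jpg")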
Let us start with all-instance segmentation.
All-instance segmentation
Before visually examining the architecture, let us refer to the original paper:
“FastSAM architecture is based on YOLOv8-seg — an object detector equipped with the instance segmentation branch, which utilizes the YOLACT method” — Fast Segment Anything paper
This definition might seem complex if you are not familiar with YOLOv8-seg and YOLACT, so let us build a simple intuition for what these two models are and how they are used.
YOLACT (You Only Look at CoefficienTs)
YOLACT is a real-time convolutional instance segmentation model inspired by YOLO. It focuses on high-speed inference while achieving performance comparable to Mask R-CNN.
YOLACT consists of two main modules (branches):
- Prototype branch. YOLACT creates a set of segmentation masks called prototypes.
- Prediction branch. YOLACT performs object detection by predicting bounding boxes and then estimates mask coefficients, which tell the model how to linearly combine the prototypes to create a final mask for each object.
To extract initial features from the image, YOLACT uses ResNet, followed by a Feature Pyramid Network (FPN) to obtain multi-scale features. Each of the P-levels (shown in the image) processes features of different sizes using convolutions (e.g., P3 contains the smallest features, while P7 captures higher-level image features). This approach helps YOLACT account for objects at various scales.
YOLOv8-seg
YOLOv8-seg is a model based on YOLACT and incorporates the same principles regarding prototypes. It also has two heads:
- Detection head. Used to predict bounding boxes and classes.
- Segmentation head. Used to generate masks and combine them.
The key difference is that YOLOv8-seg uses a YOLO backbone architecture instead of the ResNet backbone and FPN used in YOLACT. This makes YOLOv8-seg lighter and faster during inference.
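Since YOLOv8-seg ships with the ultralytics package, a minimal inference sketch looks roughly like this (the weights name and image path are placeholders):

from ultralytics import YOLO

# Load a pretrained YOLOv8 segmentation model (weights are downloaded automatically).
model = YOLO("yolov8n-seg.pt")

# Run inference; each result holds the detected boxes and the corresponding instance masks.
results = model("image.jpg")
for result in results:
    print(result.boxes.xyxy)   # bounding boxes in (x1, y1, x2, y2) format
    if result.masks is not None:
        print(result.masks.data.shape)  # one binary mask per detected object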
Both YOLACT and YOLOv8-seg use the default number of prototypes k = 32, which is a tunable hyperparameter. In most scenarios, this provides a good trade-off between speed and segmentation performance.
In both models, for every detected object, a vector of size k = 32 is predicted, representing the weights for the mask prototypes. These weights are then used to linearly combine the prototypes to produce the final mask for the object.
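The mechanics of this combination can be illustrated with a small NumPy sketch (the shapes, names, and thresholds below are illustrative rather than taken from either model's code):

import numpy as np

def combine_prototypes(prototypes, coefficients):
    """Linearly combine k prototype masks using the coefficients predicted for one object.

    prototypes:   array of shape (k, H, W) produced by the prototype branch
    coefficients: array of shape (k,) predicted by the prediction branch
    """
    # Weighted sum over the prototype dimension, followed by a sigmoid to get a soft mask.
    combined = np.tensordot(coefficients, prototypes, axes=1)  # shape (H, W)
    soft_mask = 1.0 / (1.0 + np.exp(-combined))
    return (soft_mask > 0.5).astype(np.uint8)  # binarized final mask for the object

# Example with k = 32 prototypes at a 160x160 mask resolution.
prototypes = np.random.rand(32, 160, 160)
coefficients = np.random.randn(32)
object_mask = combine_prototypes(prototypes, coefficients)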
FastSAM architecture
FastSAM’s architecture is based on YOLOv8-seg but also incorporates an FPN, similar to YOLACT. It includes both detection and segmentation heads, with k = 32 prototypes. However, since FastSAM performs segmentation of all possible objects in the image, its workflow differs from that of YOLOv8-seg and YOLACT:
- First, FastSAM performs segmentation by producing k = 32 image masks.
- These masks are then combined to produce the final segmentation mask.
- During post-processing, FastSAM extracts regions, computes bounding boxes, and performs instance segmentation for each object.
Note
Although the paper does not mention details about post-processing, it can be observed that the official FastSAM GitHub repository uses the method cv2.findContours() from OpenCV in the prediction stage.
# Use of the cv2.findContours() method during the prediction stage.
# Source: FastSAM repository (FastSAM / fastsam / prompt.py)
# Note: this is a method of a class; the module imports cv2 and numpy as np.
def _get_bbox_from_mask(self, mask):
    mask = mask.astype(np.uint8)
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x1, y1, w, h = cv2.boundingRect(contours[0])
    x2, y2 = x1 + w, y1 + h
    if len(contours) > 1:
        for b in contours:
            x_t, y_t, w_t, h_t = cv2.boundingRect(b)
            # Merge multiple bounding boxes into one.
            x1 = min(x1, x_t)
            y1 = min(y1, y_t)
            x2 = max(x2, x_t + w_t)
            y2 = max(y2, y_t + h_t)
        h = y2 - y1
        w = x2 - x1
    return [x1, y1, x2, y2]
In practice, there are several methods to extract instance masks from the final segmentation mask. Some examples include contour detection (used in FastSAM) and connected component analysis (cv2.connectedComponents()).
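As an illustration, a sketch of the connected-component alternative (assuming a binary segmentation mask as input; the function and variable names here are our own) could look like this:

import cv2
import numpy as np

def split_into_instances(segmentation_mask):
    """Split a binary segmentation mask into per-instance masks and bounding boxes."""
    mask = (segmentation_mask > 0).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(mask)

    instances = []
    for label in range(1, num_labels):  # label 0 is the background
        instance_mask = (labels == label).astype(np.uint8)
        x, y, w, h = cv2.boundingRect(instance_mask)
        instances.append({"mask": instance_mask, "bbox": [x, y, x + w, y + h]})
    return instances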
Training
FastSAM researchers used the same SA-1B dataset as the SAM developers but trained the CNN detector on only 2% of the data. Despite this, the CNN detector achieves performance comparable to the original SAM, while requiring significantly fewer resources for segmentation. As a result, inference in FastSAM is up to 50 times faster!
For reference, SA-1B consists of 11 million diverse images and 1.1 billion high-quality segmentation masks.
What makes FastSAM faster than SAM? SAM uses the Vision Transformer (ViT) architecture, which is known for its heavy computational requirements. In contrast, FastSAM performs segmentation using CNNs, which are much lighter.
Prompt-guided selection
The “segment anything task” involves producing a segmentation mask for a given prompt, which can be represented in different forms.
Point prompt
After obtaining multiple prototypes for an image, a point prompt can be used to indicate that the object of interest is (or is not) located in a specific area of the image. As a result, the specified point influences the coefficients for the prototype masks.
Similar to SAM, FastSAM allows selecting multiple points and specifying whether they belong to the foreground or background. If a foreground point corresponding to the object appears in multiple masks, background points can be used to filter out irrelevant masks.
However, if several masks still satisfy the point prompts after filtering, mask merging is applied to obtain the final mask for the object.
Additionally, the authors apply morphological operators to smooth the final mask shape and remove small artifacts and noise.
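A simplified sketch of this selection logic (operating on binary candidate masks; the kernel size and helper names are our own, not the paper's):

import cv2
import numpy as np

def select_mask_by_points(masks, fg_points, bg_points):
    """Keep masks containing all foreground points and no background points,
    merge them, and smooth the result with a morphological opening."""
    selected = []
    for mask in masks:  # each mask is a binary array of shape (H, W)
        has_fg = all(mask[y, x] > 0 for x, y in fg_points)
        has_bg = any(mask[y, x] > 0 for x, y in bg_points)
        if has_fg and not has_bg:
            selected.append(mask)
    if not selected:
        return None

    # Merge the remaining candidate masks into a single mask for the object.
    merged = np.clip(np.sum(selected, axis=0), 0, 1).astype(np.uint8)

    # Morphological opening removes small artifacts and smooths the boundary.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(merged, cv2.MORPH_OPEN, kernel)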
Box prompt
The box prompt involves selecting the mask whose bounding box has the highest Intersection over Union (IoU) with the bounding box specified in the prompt.
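In code, the IoU between the prompt box and a mask's bounding box (both in [x1, y1, x2, y2] format) can be computed with a small helper like this:

def box_iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection + 1e-9)

# The mask whose bounding box maximizes IoU with the prompt box is returned, e.g.:
# best_mask = max(candidates, key=lambda c: box_iou(prompt_box, c["bbox"]))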
Text prompt
Similarly, for the text prompt, the mask that best corresponds to the text description is selected. To achieve this, the CLIP model is used:
- The embeddings for the text prompt and the k = 32 prototype masks are computed.
- The similarities between the text embedding and the prototypes are then calculated. The prototype with the highest similarity is post-processed and returned.
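A minimal sketch of this idea with OpenAI's clip package (the cropping of candidate regions into region_crops and the model variant are assumptions made for illustration; the actual FastSAM implementation differs in detail):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_region_by_text(region_crops, text):
    """Return the index of the candidate region whose CLIP embedding best matches the text."""
    images = torch.stack([preprocess(Image.fromarray(crop)) for crop in region_crops]).to(device)
    tokens = clip.tokenize([text]).to(device)

    with torch.no_grad():
        image_features = model.encode_image(images)
        text_features = model.encode_text(tokens)

    # Cosine similarity between the text embedding and each region embedding.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarities = (image_features @ text_features.T).squeeze(1)
    return int(similarities.argmax())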
For most segmentation models, prompting is usually applied at the prototype level.
FastSAM repository
The official FastSAM repository (https://github.com/CASIA-IVA-Lab/FastSAM) includes a clear README.md file and documentation.
If you plan to use a Raspberry Pi and want to run the FastSAM model on it, be sure to check out the GitHub repository: Hailo-Application-Code-Examples. It contains all the necessary code and scripts to launch FastSAM on edge devices.
In this article, we have looked at FastSAM — an improved version of SAM. Combining the best practices from YOLACT and YOLOv8-seg models, FastSAM maintains high segmentation quality while achieving a significant boost in prediction speed, accelerating inference by several dozen times compared to the original SAM.
The ability to use prompts with FastSAM provides a flexible way to retrieve segmentation masks for objects of interest. Furthermore, it has been shown that decoupling prompt-guided selection from all-instance segmentation reduces complexity.
Examples of FastSAM usage with different prompts visually demonstrate that it still retains the high segmentation quality of SAM.
All images are by the author unless noted otherwise.