What Is a Vision Transformer and a Vision Model?
Computer Vision is a subdomain of artificial intelligence with a wide range of applications focused on image processing and understanding. Traditionally addressed through Convolutional Neural Networks (CNNs), the field has been revolutionized by the emergence of the transformer architecture. While transformers are best known for their applications in language processing, they can be effectively adapted to form the backbone of many vision models. In this article, we will explore state-of-the-art vision and multimodal models, such as ViT (Vision Transformer), DETR (Detection Transformer), BLIP (Bootstrapping Language-Image Pretraining), and ViLT (Vision Language Transformer), that specialize in various computer vision tasks including image classification, segmentation, image-to-text conversion, and visual question answering. These tasks have a variety of real-world applications, from annotating images at scale and detecting abnormalities in medical images to extracting text from documents and generating text responses based on visual data.
Comparisons with CNNs
Before the wide adoption of foundation models, CNNs were the dominant solution for most computer vision tasks. In a nutshell, a CNN is a hierarchical deep learning architecture built from convolutional layers that produce feature maps, pooling layers, and fully connected layers. In contrast, vision transformers leverage the self-attention mechanism, which allows image patches to attend to each other. They also have less inductive bias, meaning they are less constrained by built-in architectural assumptions than CNNs, but consequently require significantly more training data to achieve strong performance on generalized tasks.
Comparisons with LLMs
Transformer-based vision models adapt the architecture used by LLMs (Large Language Models), adding extra layers that convert image data into numerical embeddings. In an NLP task, text sequences undergo tokenization and embedding before they are consumed by the transformer encoder. Similarly, image data goes through patching, position encoding, and image embedding before being fed into the vision transformer encoder. Throughout this article, we will further explore how the vision transformer and its variants build upon the transformer backbone and extend its capabilities from language processing to image understanding and image generation.
Extensions to Multimodal Models
Advancements in vision models have driven interest in developing multimodal models capable of processing both image and text data simultaneously. While vision models focus on a one-directional transformation of image data into numerical representations and typically produce score-based output for classification or object detection (e.g. image classification and image segmentation tasks), multimodal models require bidirectional processing and integration between different data types. For example, an image-text multimodal model can generate coherent text sequences from image input for image captioning and visual question answering tasks.
4 Types of Fundamental Computer Vision Tasks
0. Project Overview
We will explore the details of these 4 fundamental computer vision tasks and the corresponding transformer models specialized for each task. These models differ primarily in their encoder and decoder architectures, which give them distinct capabilities for interpreting, processing, and translating across textual and visual modalities.
To make this guide more interactive, I have designed a Streamlit web app to illustrate and compare the outputs of these computer vision tasks and models. We will walk through the end-to-end app development at the end of this article.
Below is a sneak peek of the output for an uploaded image, displaying the task name, output, runtime, model name, and model type produced by running the default models from Hugging Face pipelines.
1. Image Classification
Firstly, let’s introduce image classification, a basic computer vision task that assigns images to a predefined set of labels and can be handled by a basic Vision Transformer.
ViT (Vision Transformer)
Vision Transformer (ViT) serves as the cornerstone for many of the computer vision models introduced later in this article. When pretrained on sufficiently large datasets, it can match or outperform CNNs on image classification tasks through its encoder-only transformer architecture. It processes image inputs and outputs probability scores for candidate labels. Since image classification is purely an image understanding task with no generation requirement, ViT’s encoder-only architecture is well-suited for this purpose.
A ViT architecture is composed of the following components (sketched in the code after this list):
- Patching: break down input images into small, fixed size patches of pixels (typically 16×16 pixels per patch) so that local features are preserved for downstream processing.
- Embedding: convert image patches into numerical representations, also known as vector embeddings, so that images with similar features are projected as embeddings with closer proximity in the vector space.
- Classification Token (CLS): extract and aggregate information from all image patches into one numeric representation, making it particularly effective for classification.
- Position Encoding: preserve the relative positions of the original image pixels. CLS token is always at position 0.
- Transformer Encoder: process the embeddings through layers of multi-headed attention and feed-forward networks.
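To make these components concrete, below is a minimal, illustrative PyTorch sketch of the patch embedding stage only (patching, embedding, CLS token, and position encoding). The shapes follow the “google/vit-base-patch16-224” configuration; the class and variable names are our own and not part of any library.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT front end: patching + embedding + CLS token + position encoding."""
    def __init__(self, image_size=224, patch_size=16, embed_dim=768, channels=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # 14 x 14 = 196 patches
        # a strided convolution splits the image into patches and projects each to embed_dim
        self.proj = nn.Conv2d(channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                # CLS token at position 0
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learned position encoding

    def forward(self, pixel_values):                             # (batch, 3, 224, 224)
        x = self.proj(pixel_values)                              # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)                         # (batch, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)          # prepend the CLS token
        x = torch.cat([cls, x], dim=1) + self.pos_embed          # add position encoding
        return x                                                 # ready for the transformer encoder

embeddings = PatchEmbedding()(torch.randn(1, 3, 224, 224))       # -> torch.Size([1, 197, 768])

The resulting sequence of 197 embeddings (196 patches plus the CLS token) is what the transformer encoder layers operate on.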
This mechanism makes ViT efficient at capturing global dependencies, whereas CNNs primarily rely on local processing through convolutional kernels. On the other hand, ViT has the drawback of requiring a massive amount of training data (usually millions of images) to iteratively adjust the model parameters in its attention layers and achieve strong performance.
Implementation
The Hugging Face pipeline significantly simplifies the implementation of the image classification task by abstracting away the low-level image processing steps.
from transformers import pipeline
from PIL import Image

# load the image from a local file path
image = Image.open(image_path)
pipe = pipeline(task="image-classification", model=model_id)
output = pipe(image)
- input parameters:
  - model: you can choose your own model or use the default model (i.e. “google/vit-base-patch16-224”) when the model parameter is not specified.
  - task: provide a task name (e.g. “image-classification”, “image-segmentation”).
  - image: provide a PIL image object, an image URL, or an image file path.
- output: the model generates scores for the candidate labels.
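Note that PIL.Image.open expects a local file path or file object, so an image hosted online needs to be fetched first. Below is a minimal sketch assuming the requests library and a hypothetical image URL; alternatively, image pipelines can accept the URL string directly and download the image themselves.

import requests
from PIL import Image
from transformers import pipeline

# hypothetical example URL; PIL needs a file-like object, so fetch the URL first
url = "https://example.com/coffee_mug.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# omitting the model parameter falls back to the task's default checkpoint ("google/vit-base-patch16-224")
pipe = pipeline(task="image-classification")
output = pipe(image)          # list of {"label": ..., "score": ...} dicts, highest score first
print(output[:3])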
We compared results of the default image classification model “google/vit-base-patch16-224” by providing two similar images with different compositions. As we can see, this baseline model is easily confused, producing significantly different outputs (“espresso” vs. “microwave”), despite both images containing the same main object.
“Coffee Mug” Image Output
[
{ "label": "espresso", "score": 0.40687331557273865 },
{ "label": "cup", "score": 0.2804579734802246 },
{ "label": "coffee mug", "score": 0.17347976565361023 },
{ "label": "desk", "score": 0.01198530849069357 },
{ "label": "eggnog", "score": 0.00782513152807951 }
]
“Coffee Mug with Background” Image Output
[
{ "label": "microwave, microwave oven", "score": 0.20218633115291595 },
{ "label": "dining table, board", "score": 0.14855517446994781 },
{ "label": "stove", "score": 0.1345038264989853 },
{ "label": "sliding door", "score": 0.10262308269739151 },
{ "label": "shoji", "score": 0.07306522130966187 }
]
Try a different model yourself using our Streamlit web app and see if it generates better results.
2. Image Segmentation
Image segmentation is another common computer vision task that requires a vision-only model. The objective is similar to object detection but demands higher precision at the pixel level, producing masks that trace object boundaries instead of drawing bounding boxes.
There are three main types of image segmentation:
- Semantic segmentation: predict a mask for each object class.
- Instance segmentation: predict a mask for each instance of the object class.
- Panoptic segmentation: combine instance segmentation and semantic segmentation by assigning each pixel an object class and an instance of that class.
DETR (Detection Transformer)
Although DETR is widely used for object detection, it can be extended to perform the panoptic segmentation task by adding a segmentation mask head. It utilizes an encoder-decoder transformer architecture with a CNN backbone for feature map extraction. The DETR model learns a set of object queries and is trained to predict bounding boxes for these queries, followed by a mask prediction head that performs precise pixel-level segmentation.
Mask2Former
Mask2Former is also a common choice for image segmentation tasks. Developed by Facebook AI Research, Mask2Former generally outperforms DETR with better precision and computational efficiency. This is achieved by applying a masked attention mechanism instead of global cross-attention, focusing specifically on foreground information and the main objects in an image.
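For readers who want more control than the pipeline offers, the sketch below shows roughly how the same Mask2Former checkpoint can be driven through the lower-level transformers classes. The local image path is a hypothetical placeholder, and the post-processing call reflects our understanding of the current API, which may differ across library versions.

import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

checkpoint = "facebook/mask2former-swin-base-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.open("coffee_mug.jpg")                  # hypothetical local image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# merge the predicted class and mask logits into a single panoptic segmentation map
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]          # (height, width)
)[0]
segmentation_map = result["segmentation"]             # per-pixel segment ids
for segment in result["segments_info"]:
    print(model.config.id2label[segment["label_id"]], round(segment["score"], 4))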
Implementation
We use the same pipeline implementation as for image classification, simply swapping the task parameter to “image-segmentation”. To process the output, we extract the object labels and masks, then display the masked images using st.image().
from transformers import pipeline
from PIL import Image
import streamlit as st

image = Image.open(image_path)
pipe = pipeline(task="image-segmentation", model=model_id)
output = pipe(image)

# each output element contains a label, a score, and a PIL mask image
output_labels = [i['label'] for i in output]
output_masks = [i['mask'] for i in output]
for m in output_masks:
    st.image(m)
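The masks returned by the pipeline are single-channel PIL images with the same dimensions as the input, so they can be blended over the original picture for a quick visual check. Below is a minimal sketch that continues from the snippet above; the overlay helper and its parameters are our own illustrative choices.

from PIL import Image

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend a single-channel segmentation mask over the original image."""
    overlay = Image.new("RGB", image.size, color)
    # use the mask, scaled by alpha, as the per-pixel blending weight
    weights = mask.convert("L").point(lambda p: int(p * alpha))
    return Image.composite(overlay, image.convert("RGB"), weights)

for label, mask in zip(output_labels, output_masks):
    st.image(overlay_mask(image, mask), caption=label)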
We compared the performance of DETR (“facebook/detr-resnet-50-panoptic”) and Mask2Former (“facebook/mask2former-swin-base-coco-panoptic”), which are both fine-tuned for panoptic segmentation. As displayed in the segmentation outputs, both DETR and Mask2Former successfully identify and extract the “cup” and the “dining table”. Mask2Former runs inference faster (2.47s compared to 6.3s for DETR) and also manages to identify “window-other” from the background.
DETR “facebook/detr-resnet-50-panoptic” output
[
{
'score': 0.994395,
'label': 'dining table',
'mask':
},
{
'score': 0.999692,
'label': 'cup',
'mask':
}
]
Mask2Former “facebook/mask2former-swin-base-coco-panoptic” output
[
{
'score': 0.999554,
'label': 'cup',
'mask':
},
{
'score': 0.971946,
'label': 'dining table',
'mask':
},
{
'score': 0.983782,
'label': 'window-other',
'mask':
}
]
3. Image Captioning
Image captioning, also known as image-to-text, translates images into text sequences that describe the image contents. This task requires both image understanding and text generation capabilities, making it well suited for a multimodal model that can process image and text data simultaneously.
Visual Encoder-Decoder
A Visual Encoder-Decoder is a multimodal architecture that combines a vision model for image understanding with a pretrained language model for text generation. A common example is ViT-GPT2, which chains together the Vision Transformer (introduced in section 1. Image Classification) as the visual encoder and the GPT-2 model as the decoder to perform autoregressive text generation.
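Under the hood, such checkpoints are served by the VisionEncoderDecoderModel class in transformers. The sketch below shows roughly how the “ydshieh/vit-gpt2-coco-en” checkpoint can be called directly rather than through the pipeline; the local image path and the max_length setting are illustrative placeholders.

import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, AutoImageProcessor, AutoTokenizer

checkpoint = "ydshieh/vit-gpt2-coco-en"
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

image = Image.open("coffee_mug.jpg")                         # hypothetical local image path
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    generated_ids = model.generate(pixel_values, max_length=32)   # ViT encodes, GPT-2 decodes
caption = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(caption)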
BLIP (Bootstrapping Language-Image Pretraining)
BLIP, developed by Salesforce Research, leverages four core modules: an image encoder and a text encoder, followed by an image-grounded text encoder that fuses visual and textual features via attention mechanisms, and an image-grounded text decoder for text sequence generation. The pretraining process involves minimizing image-text contrastive loss, image-text matching loss, and language modeling loss, with the objective of aligning visual information with text sequences. BLIP offers higher flexibility in applications and can also be applied to VQA (visual question answering), but it introduces more complexity in the architectural design.
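As a rough sketch of how the BLIP captioning checkpoint can be called without the pipeline, the snippet below uses the dedicated BLIP classes in transformers; the local image path and generation settings are illustrative placeholders, and BLIP also accepts an optional text prefix to condition the caption.

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("coffee_mug.jpg")                 # hypothetical local image path
# unconditional captioning: image only; a text prompt could be passed as a caption prefix
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_length=32)
print(processor.decode(generated_ids[0], skip_special_tokens=True))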
Implementation
We use the code snippet below to generate output from an image captioning pipeline.
from transformers import pipeline
from PIL import Image

image = Image.open(image_path)
pipe = pipeline(task="image-to-text", model=model_id)
output = pipe(image)
We tried the three models below and they all generate reasonably accurate image descriptions, with the larger BLIP model performing better than the base one.
Visual Encoder-Decoder “ydshieh/vit-gpt2-coco-en” output
[{'generated_text': 'a cup of coffee sitting on a wooden table'}]
BLIP “Salesforce/blip-image-captioning-base” output
[{'generated_text': 'a cup of coffee on a table'}]
BLIP “Salesforce/blip-image-captioning-large” output
[{'generated_text': 'there is a cup of coffee on a saucer on a table'}]
4. Visual Question Answering
Visual Question Answering (VQA) has gained increasing popularity as it enables users to ask questions about an image and receive coherent text responses. It also requires a multimodal model that can extract key information from visual data while being capable of generating text responses. What differentiates it from image captioning is that it accepts a user prompt as input in addition to an image, therefore requiring an encoder that interprets both modalities at the same time.
ViLT (Vision Language Transformer)
ViLT is a computationally efficient model architecture for executing VQA tasks. ViLT incorporates image patch embeddings and text embeddings into a unified transformer encoder which is pretrained with three objectives:
- image-text matching: learn the semantic relationship between image-text pairs
- masked language modeling: learn to predict the masked word/token from the vocabulary based on the text and image input
- word patch alignment: learn the associations between words and image patches
ViLT adopts an encoder-only architecture with task-specific heads (e.g. a classification head or VQA head), and this minimal design achieves roughly ten times the speed of VLP (Vision-and-Language Pretraining) models that rely on region supervision for object detection and a convolutional architecture for feature extraction. However, the simplified architecture results in suboptimal performance on complex tasks and relies on massive training data to achieve generalized functionality. As demonstrated later, one drawback is that the ViLT model produces token-based outputs for VQA rather than coherent sentences, much like an image classification task with a large set of candidate labels.
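To make this token-based behavior concrete, here is a minimal sketch of calling ViLT through its dedicated classes rather than the pipeline; the answer scores come from a classification head over a fixed answer vocabulary, which is why the outputs read like labels. The local image path is a hypothetical placeholder.

import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

checkpoint = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForQuestionAnswering.from_pretrained(checkpoint)

image = Image.open("coffee_mug.jpg")                  # hypothetical local image path
question = "describe this image"

# image patches and question tokens are embedded into one sequence for a single encoder
encoding = processor(images=image, text=question, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits                 # one score per answer in the fixed vocabulary

scores = logits.sigmoid()[0]                          # scores over the candidate answer vocabulary
top = scores.topk(5)
for score, idx in zip(top.values, top.indices):
    print(model.config.id2label[idx.item()], round(score.item(), 4))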
BLIP
As introduced in section 3. Image Captioning, BLIP is a more extensive model that can also be fine-tuned to perform the visual question answering task. As a result of its encoder-decoder architecture, it generates complete text sequences instead of single tokens.
Implementation
VQA is implemented using the code snippet below, taking both an image and a text prompt as the model inputs.
from transformers import pipeline
from PIL import Image

image = Image.open(image_path)
question = 'describe this image'
pipe = pipeline(task="visual-question-answering", model=model_id)
# the question is passed at inference time alongside the image
output = pipe(image=image, question=question)
When comparing ViLT and BLIP models for the question “describe this image”, the outputs differ significantly due to their distinct model architectures. ViLT predicts the highest scoring tokens from its existing vocabulary, while BLIP generates more coherent and sensible results.
ViLT “dandelin/vilt-b32-finetuned-vqa” output
[
{ "score": 0.044245753437280655, "answer": "kitchen" },
{ "score": 0.03294338658452034, "answer": "tea" },
{ "score": 0.030773703008890152, "answer": "table" },
{ "score": 0.024886665865778923, "answer": "office" },
{ "score": 0.019653357565402985, "answer": "cup" }
]
BLIP “Salesforce/blip-vqa-capfilt-large” output
[{'answer': 'coffee cup on saucer'}]
End-to-End Computer Vision App Development
Let’s break down the web app development into 6 steps you can easily follow to build your own interactive Streamlit app or customize it for your needs. Check out our GitHub repository for the end-to-end implementation.
1. Initialize the web app and configure the page layout.
# imports used across the app
import time

import pandas as pd
import streamlit as st
from PIL import Image
from transformers import pipeline

def initialize_page():
    """Initialize the Streamlit page configuration and layout"""
    st.set_page_config(
        page_title="Computer Vision",
        page_icon="🤖",
        layout="centered"
    )
    st.title("Computer Vision Tasks")
    content_block = st.columns(1)[0]
    return content_block
2. Prompt the user to upload an image.
def get_uploaded_image():
    uploaded_file = st.file_uploader(
        "Upload your own image",
        accept_multiple_files=False,
        type=["jpg", "jpeg", "png"]
    )
    if uploaded_file:
        image = Image.open(uploaded_file)
        st.image(image, caption='Preview', use_container_width=False)
    else:
        image = None
    return image
3. Select one or more computer vision tasks using a multi-select dropdown list (which also accepts user-entered options, e.g. “document-question-answering”). The app prompts the user to enter a question if ‘visual-question-answering’ or ‘document-question-answering’ is selected, because these two tasks require “question” as an additional input parameter.
def get_selected_task():
    options = st.multiselect(
        "Which tasks would you like to perform?",
        [
            "visual-question-answering",
            "image-to-text",
            "image-classification",
            "image-segmentation",
        ],
        max_selections=4,
        accept_new_options=True,  # allow user-entered tasks, e.g. "document-question-answering"
    )
    # prompt for a question input if the task is 'VQA' or 'DocVQA' - parameter "question"
    if 'visual-question-answering' in options or 'document-question-answering' in options:
        question = st.text_input(
            "Please enter your question:"
        )
    else:
        question = ""
    return options, question
4. Prompt the user to choose between the default model built into the Hugging Face pipeline or enter their own model.
def get_selected_model():
    options = ["Use the default model", "Use your selected HuggingFace model"]
    selected_option = st.selectbox("Choose an option:", options)
    if selected_option == "Use your selected HuggingFace model":
        model = st.text_input(
            "Please enter your selected HuggingFace model id:"
        )
    else:
        model = None
    return model
5. Create task pipelines based on the user-entered parameters, then collect the model outputs and processing times. The results are displayed in a table format using st.dataframe() to compare the task name, output, runtime, model name, and model type. For image segmentation tasks, the segmentation masks are also displayed using st.image().
def display_results(image, task_list, user_question, model):
    results = []
    for task in task_list:
        if task in ['visual-question-answering', 'document-question-answering']:
            params = {'question': user_question}
        else:
            params = {}
        row = {
            'task': task,
        }
        try:
            # use the user-provided model id; fall back to the task default when it is missing
            pipe = pipeline(task, model=model) if model else pipeline(task)
        except Exception:
            pipe = pipeline(task)
        row['model'] = pipe.model.name_or_path
        start_time = time.time()
        output = pipe(
            image,
            **params
        )
        execution_time = time.time() - start_time
        row['model_type'] = pipe.model.config.model_type
        row['time'] = execution_time
        # collect image segmentation masks for the visual output
        if task == 'image-segmentation':
            output_masks = [i['mask'] for i in output]
        row['output'] = str(output)
        results.append(row)
    results_df = pd.DataFrame(results)
    st.write('Model Responses')
    st.dataframe(results_df)
    if 'image-segmentation' in task_list:
        st.write('Segmentation Mask Output')
        for m in output_masks:
            st.image(m)
    return results_df
6. Lastly, chain these functions together using the main function. Use a “Generate Response” button to trigger these functions and display the results in the app.
def main():
    initialize_page()
    image = get_uploaded_image()
    task_list, user_question = get_selected_task()
    model = get_selected_model()
    # generate the response when the button is clicked
    if st.button("Generate Response", key="generate_button"):
        display_results(image, task_list, user_question, model)

# run the app
if __name__ == "__main__":
    main()
Takeaway Message
We introduced the evolution from traditional CNN-based approaches to transformer architectures, comparing vision models with language models and multimodal models. We also explored 4 fundamental computer vision tasks and their corresponding techniques, providing a practical Streamlit implementation guide for building your own computer vision web applications for further exploration.
The fundamental Computer Vision tasks and models include:
- Image Classification: Analyze images and assign them to one or more predefined categories or classes, utilizing model architectures like ViT (Vision Transformer).
- Image Segmentation: Classify image pixels into specific categories, creating detailed masks that outline object boundaries, including DETR and Mask2Former model architectures.
- Image Captioning: Generate descriptive text for images, demonstrating models like the visual encoder-decoder and BLIP that combine visual encoding with language generation capabilities.
- Visual Question Answering (VQA): Process both image and text queries to answer open-ended questions based on image content, comparing architectures like ViLT (Vision Language Transformer) with its token-based outputs and BLIP with more coherent responses.