[2402.17304] Probing Multimodal Large Language Models for Global and Local Semantic Representations

[Submitted on 27 Feb 2024 (v1), last revised 21 Nov 2024 (this version, v3)]

View a PDF of the paper titled Probing Multimodal Large Language Models for Global and Local Semantic Representations, by Mingxu Tao and 5 other authors

Abstract: The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications that understand integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving state-of-the-art performance on image-to-text tasks. However, few studies explore which layers of MLLMs contribute most to encoding global image information, which plays a vital role in multimodal comprehension and generation. In this study, we find that the intermediate layers of models, rather than the topmost layers, can encode more global semantic information, and that their representation vectors perform better on visual-language entailment tasks. We further probe models regarding local semantic representations through object recognition tasks. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information. Our code and data are released via this https URL.
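The layer-wise probing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes frozen per-layer representation vectors and fits a simple nearest-centroid probe on each layer, then compares accuracies across layers. The helper names (`probe_accuracy`, `synthetic_layer`, `probe_all_layers`) are hypothetical, and the "representations" are synthetic vectors whose label signal is set by hand, so the numbers only exercise the procedure and say nothing about any real model.

```python
import random


def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit a nearest-centroid probe on frozen features; return test accuracy."""
    labels = sorted(set(train_y))
    centroids = {
        c: centroid([x for x, y in zip(train_x, train_y) if y == c])
        for c in labels
    }
    correct = 0
    for x, y in zip(test_x, test_y):
        pred = min(labels, key=lambda c: sq_dist(x, centroids[c]))
        correct += int(pred == y)
    return correct / len(test_y)


def synthetic_layer(labels, signal, rng, dim=8):
    """ASSUMPTION: stand-in for one layer's representation vectors.

    Each coordinate is +/- `signal` (by label) plus unit Gaussian noise,
    so a larger `signal` makes the label easier to read out linearly.
    """
    return [
        [signal * (1 if y else -1) + rng.gauss(0, 1) for _ in range(dim)]
        for y in labels
    ]


def probe_all_layers(signals, n_train=200, n_test=200, seed=0):
    """Probe every synthetic layer; return one accuracy per layer."""
    rng = random.Random(seed)
    train_y = [rng.randint(0, 1) for _ in range(n_train)]
    test_y = [rng.randint(0, 1) for _ in range(n_test)]
    accuracies = []
    for signal in signals:
        train_x = synthetic_layer(train_y, signal, rng)
        test_x = synthetic_layer(test_y, signal, rng)
        accuracies.append(probe_accuracy(train_x, train_y, test_x, test_y))
    return accuracies


if __name__ == "__main__":
    # Hand-picked signal strengths for a bottom, a middle, and a top layer.
    for layer, acc in enumerate(probe_all_layers([0.0, 1.0, 0.2])):
        print(f"layer {layer}: probe accuracy = {acc:.3f}")
```

With a real model, `synthetic_layer` would be replaced by vectors extracted from each layer for the same inputs, and the probe would typically be a trained linear classifier; the per-layer comparison stays the same.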

Submission history

From: Mingxu Tao
[v1]
Tue, 27 Feb 2024 08:27:15 UTC (841 KB)
[v2]
Wed, 27 Mar 2024 02:59:57 UTC (850 KB)
[v3]
Thu, 21 Nov 2024 07:03:33 UTC (6,899 KB)