...

[2405.01483] MANTIS: Interleaved Multi-Image Instruction Tuning


View a PDF of the paper titled MANTIS: Interleaved Multi-Image Instruction Tuning, by Dongfu Jiang and 6 other authors

Abstract: Large multimodal models (LMMs) have shown great results in single-image vision-language tasks. However, their ability to solve multi-image vision-language tasks has yet to be improved. Existing LMMs like OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct, containing 721K multi-image instruction data, to train a family of Mantis models. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training; instead, they can be gained by low-cost instruction tuning. The training and evaluation of Mantis has paved the road for future work to improve LMMs’ multi-image abilities.

Submission history

From: Dongfu Jiang
[v1]
Thu, 2 May 2024 17:14:57 UTC (844 KB)
[v2]
Thu, 23 May 2024 18:57:44 UTC (1,097 KB)
[v3]
Fri, 15 Nov 2024 06:31:44 UTC (969 KB)
