
Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning


Authors: Yanshu Li and 10 other authors

Abstract: Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
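The abstract describes CAMA as training-free modulation of attention logits that strengthens attention to semantically important (especially visual) tokens, but it does not spell out the two-stage procedure. The sketch below only illustrates the general idea of biasing pre-softmax attention scores toward selected key positions; the function name, the mask convention, and the bias value are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of training-free attention-logit modulation.
# Assumes access to pre-softmax attention scores and a boolean mask marking
# "important" key tokens (e.g., visual tokens). The additive bias scheme and
# its magnitude are hypothetical choices for illustration only.
import torch


def modulate_attention_logits(
    attn_logits: torch.Tensor,    # (batch, heads, query_len, key_len) pre-softmax scores
    important_mask: torch.Tensor, # (batch, key_len) bool, True for tokens to emphasize
    bias: float = 0.5,            # hypothetical additive boost on emphasized keys
) -> torch.Tensor:
    """Add a positive bias to logits attending to emphasized key positions;
    the usual softmax downstream then redistributes attention toward them."""
    # Broadcast the per-key mask over heads and query positions.
    boost = important_mask[:, None, None, :].to(attn_logits.dtype) * bias
    return attn_logits + boost


if __name__ == "__main__":
    # Toy usage: emphasize the last 4 key positions as stand-ins for image tokens.
    B, H, Q, K = 1, 2, 5, 10
    logits = torch.randn(B, H, Q, K)
    mask = torch.zeros(B, K, dtype=torch.bool)
    mask[:, -4:] = True
    modulated = modulate_attention_logits(logits, mask)
    probs = torch.softmax(modulated, dim=-1)
    print(probs.sum(dim=-1))  # each attention row still sums to 1
```

A hook like this would sit inside each self-attention layer at inference time, which is consistent with the abstract's claim that the method is plug-and-play and requires no parameter updates.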

Submission history

From: Yanshu Li
[v1] Wed, 21 May 2025 04:25:23 UTC (2,313 KB)
[v2] Fri, 22 Aug 2025 14:44:22 UTC (2,167 KB)
[v3] Mon, 8 Dec 2025 22:49:50 UTC (1,795 KB)
