arXiv:2411.14725v1 Announce Type: cross
Abstract: As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential, providing further guidance for their development. In this work, we focus on a unified and robust evaluation of vision perception abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation metrics, introduce significant evaluation variance, complicating comprehensive assessments of perception abilities when relying on any single benchmark. To address this, we introduce AbilityLens, a unified benchmark designed to evaluate MLLMs across six key perception abilities, focusing on both accuracy and stability, with each ability encompassing diverse question types, domains, and metrics. With the assistance of AbilityLens, we: (1) identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models; (2) introduce an online evaluation mode, which uncovers interesting ability conflict and early convergence phenomena during MLLM training; and (3) design a simple ability-specific model merging method that combines the best ability checkpoint from early training stages, effectively mitigating performance decline due to ability conflict. The benchmark and online leaderboard will be released soon.