FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval



arXiv:2411.17454v1 Announce Type: cross
Abstract: Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality, where the target domain includes classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods like CLIP have shown strong few-shot and zero-shot learning performance. However, they still face challenges due to (1) the feature degradation encountered in the target domain and (2) the extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In multimodal feature generation, we propose a composite multimodal VAE-GAN network to capture real feature distribution patterns and generate pseudo samples based on CLIP features, addressing data imbalance. For common space projection, we develop a gate residual network to fuse CLIP features with projected features, reducing feature degradation in X-shot scenarios. Experimental results on four benchmark datasets show a 7%-15% improvement over state-of-the-art methods, with ablation studies demonstrating enhancement of CLIP features.
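The abstract does not give the form of the gate residual network, so the following is only a minimal sketch of one common way to fuse an original feature with its projection through a learned gate. Everything here is an assumption made for illustration: the function names (`linear`, `gated_residual_fuse`), the sigmoid gate computed from the concatenated features, the convex-combination fusion rule, and the requirement that the projection keeps the same dimensionality as the CLIP feature. It should not be read as the authors' architecture.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gated_residual_fuse(clip_feat, proj_w, proj_b, gate_w, gate_b):
    """Hypothetical gated residual fusion (assumed form, not from the paper).

    projected = proj_w @ clip_feat + proj_b
    gate      = sigmoid(gate_w @ [clip_feat; projected] + gate_b)
    fused     = gate * projected + (1 - gate) * clip_feat
    """
    projected = linear(proj_w, proj_b, clip_feat)
    # The gate sees both the original and the projected feature.
    gate = [sigmoid(z) for z in linear(gate_w, gate_b, clip_feat + projected)]
    # Per-dimension convex combination: a gate near 0 keeps the CLIP feature.
    fused = [g * p + (1.0 - g) * c
             for g, p, c in zip(gate, projected, clip_feat)]
    return fused, gate


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4  # toy size; real CLIP features are much larger
    clip_feat = [rng.uniform(-1, 1) for _ in range(dim)]
    proj_w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    proj_b = [0.0] * dim
    gate_w = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
    gate_b = [0.0] * dim
    fused, gate = gated_residual_fuse(clip_feat, proj_w, proj_b, gate_w, gate_b)
    print("gate :", gate)
    print("fused:", fused)
```

Under this assumed form, the gate decides per dimension how much of the projected feature replaces the original one, so a gate that stays close to zero leaves the CLIP feature essentially unchanged. That is one way a fusion layer could limit feature degradation when only a few target-domain samples are available.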
