
[2407.12259] On the Feasibility of In-Context Probing for Data Attribution


Authors: Cathy Jiao, Gary Gao, Aditi Raghunathan, Chenyan Xiong


Abstract: Data attribution methods measure the contribution of training data to model outputs and have several important applications in areas such as dataset curation and model interpretability. However, many standard data attribution methods, such as influence functions, rely on model gradients and are computationally expensive. In our paper, we show that in-context probing (ICP), i.e., prompting an LLM, can serve as a fast proxy for gradient-based data attribution for data selection, under conditions contingent on data similarity. We study this connection empirically on standard NLP tasks and show that ICP and gradient-based data attribution are well correlated in identifying influential training data for tasks that share a similar task type and content with the training data. Additionally, fine-tuning models on influential data selected by either method achieves comparable downstream performance, further emphasizing their similarities. We also examine the connection between ICP and gradient-based data attribution using synthetic data on linear regression tasks. Our synthetic data experiments show results similar to those from the NLP tasks, suggesting that this connection can be isolated in simpler settings, which offers a pathway to bridging their differences.
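
To make the comparison described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It scores training examples on a synthetic linear regression task with a first-order gradient dot product (a simplification, not a Hessian-based influence function) and measures rank agreement with a second score vector via Spearman correlation. Actual ICP scores would come from prompting a model; here `one_step_loss_drop` is a hypothetical stand-in used only to have something to compare against. All function names, the data sizes, and the learning rate are assumptions.

```python
import numpy as np

def grad_sq_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    return (w @ x - y) * x

def gradient_influence(w, X_train, y_train, x_test, y_test):
    """Dot product of each training gradient with the test gradient."""
    g_test = grad_sq_loss(w, x_test, y_test)
    residuals = X_train @ w - y_train          # shape (n,)
    G_train = residuals[:, None] * X_train     # shape (n, d)
    return G_train @ g_test

def one_step_loss_drop(w, X_train, y_train, x_test, y_test, lr=1e-3):
    """Stand-in second scorer (NOT ICP): test-loss decrease after a
    single SGD step taken on each training example in isolation."""
    base = 0.5 * (w @ x_test - y_test) ** 2
    drops = []
    for x, y in zip(X_train, y_train):
        w_new = w - lr * grad_sq_loss(w, x, y)
        drops.append(base - 0.5 * (w_new @ x_test - y_test) ** 2)
    return np.array(drops)

def spearman(a, b):
    """Spearman rank correlation (assumes no ties in a or b)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

# Synthetic linear regression data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
d, n = 5, 50
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_true + 0.1 * rng.normal(size=n)
x_test = rng.normal(size=d)
y_test = x_test @ w_true
w = np.zeros(d)  # untrained model parameters

influence = gradient_influence(w, X_train, y_train, x_test, y_test)
proxy = one_step_loss_drop(w, X_train, y_train, x_test, y_test)
rho = spearman(influence, proxy)
print(f"Spearman correlation between the two score vectors: {rho:.3f}")
```

To compare against real ICP scores, one would replace `proxy` with a length-`n` array of per-example scores obtained by prompting a model, and keep the `spearman` call unchanged.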

Submission history

From: Cathy Jiao
[v1] Wed, 17 Jul 2024 02:06:56 UTC (9,050 KB)
[v2] Mon, 10 Feb 2025 19:40:01 UTC (9,813 KB)
