...

[2406.11944] Transcoders Find Interpretable LLM Feature Circuits


View a PDF of the paper titled Transcoders Find Interpretable LLM Feature Circuits, by Jacob Dunefsky and Philippe Chlenski and Neel Nanda


Abstract: A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features, such as those found by sparse autoencoders (SAEs), are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the "greater-than circuit" in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at this https URL.
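As a rough illustration of the idea in the abstract, a transcoder can be sketched as a wide one-hidden-layer ReLU network that reads the MLP sublayer's input and is trained to reproduce the MLP sublayer's output while keeping its hidden activations sparse. The sketch below is in plain Python and is not the authors' code: the function names, the weight layout, and the specific objective (mean squared error plus an L1 penalty with a hypothetical coefficient `l1_coeff`) are assumptions made for illustration.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Map an MLP *input* x to (feature activations, predicted MLP *output*).

    Assumed shapes: W_enc is (n_features x d_model), W_dec is
    (d_model x n_features), with n_features larger than d_model.
    """
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = relu(pre)  # intended to be sparse: most entries zero
    y_hat = [p + b for p, b in zip(matvec(W_dec, z), b_dec)]
    return z, y_hat


def transcoder_loss(x, mlp_out, params, l1_coeff=1e-3):
    """Assumed objective: faithfulness (MSE to the true MLP output)
    plus an L1 sparsity penalty on the feature activations."""
    z, y_hat = transcoder_forward(x, *params)
    mse = sum((a - b) ** 2 for a, b in zip(y_hat, mlp_out)) / len(mlp_out)
    l1 = sum(z)  # z is non-negative after ReLU, so this is its L1 norm
    return mse + l1_coeff * l1


# Toy usage with hand-picked weights (d_model = 2, n_features = 3).
params = (
    [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],  # W_enc
    [0.0, 0.0, 0.0],                         # b_enc
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],      # W_dec
    [0.0, 0.0],                              # b_dec
)
z, y_hat = transcoder_forward([1.0, -2.0], *params)
loss = transcoder_loss([1.0, -2.0], [1.0, 0.0], params)
```

The difference from a sparse autoencoder in this sketch is only the training target: the reconstruction term compares against the MLP's output rather than against the input `x` itself.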

Submission history

From: Jacob Dunefsky
[v1] Mon, 17 Jun 2024 17:49:00 UTC (657 KB)
[v2] Wed, 6 Nov 2024 22:37:30 UTC (672 KB)
