[2506.17090] Better Language Model Inversion by Compactly Representing Next-Token Distributions

[Submitted on 20 Jun 2025 (v1), last revised 11 Dec 2025 (this version, v3)]

View a PDF of the paper titled Better Language Model Inversion by Compactly Representing Next-Token Distributions, by Murtaza Nazir and 4 other authors

Abstract: Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model’s system message. We propose a new method, prompt inversion from logprob sequences (PILS), that recovers hidden prompts by gleaning clues from the model’s next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2–3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5–27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
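The low-dimensional-subspace claim can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a model whose logits are a linear function W h of a d-dimensional hidden state, so that the log-probabilities equal W h minus a scalar normalizer and therefore lie in the span of the columns of W plus the all-ones vector (at most d + 1 dimensions). The sizes V and d, the random matrix W, and the choice of which tokens to keep are all hypothetical, and the reconstruction step assumes W is known.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


rng = random.Random(0)
V, d = 50, 4  # toy vocabulary size and hidden size (assumed values)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(V)]  # toy unembedding matrix
h = [rng.gauss(0, 1) for _ in range(d)]  # toy final hidden state

logits = [sum(w * x for w, x in zip(row, h)) for row in W]
logprobs = log_softmax(logits)  # full V-dimensional next-token log-probabilities

# Basis for the subspace containing every logprob vector: columns of W plus all-ones.
B = [row + [1.0] for row in W]

# Compression: keep only d + 1 entries. Selecting entries is itself a linear map.
keep = list(range(d + 1))  # arbitrary choice of tokens (assumption)
compressed = [logprobs[i] for i in keep]

# Reconstruction: recover the coefficients in basis B, then expand to all V tokens.
coeffs = solve([B[i] for i in keep], compressed)
reconstructed = [sum(b * c for b, c in zip(row, coeffs)) for row in B]

max_error = max(abs(a - b) for a, b in zip(logprobs, reconstructed))
print(f"kept {len(compressed)} of {V} values, max reconstruction error {max_error:.2e}")
```

In this toy setting the d + 1 kept values determine all V log-probabilities up to floating-point error, which is the sense in which a compact linear representation loses no information.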

Submission history

From: Matthew Finlayson
[v1] Fri, 20 Jun 2025 15:53:51 UTC (103 KB)
[v2] Mon, 23 Jun 2025 14:39:37 UTC (103 KB)
[v3] Thu, 11 Dec 2025 17:53:55 UTC (104 KB)
