[2502.14677] Data-Constrained Synthesis of Training Data for De-Identification

[Submitted on 20 Feb 2025 (v1), last revised 31 May 2025 (this version, v3)]

Authors: Thomas Vakili and 2 other authors

Abstract: Many sensitive domains, such as the clinical domain, lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based named entity recognition (NER) models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Rather, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
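
The pipeline described in the abstract (generate synthetic text, machine-annotate it with an NER model, then use the tagged corpus as training data) can be illustrated with a minimal Python sketch. Everything here is a toy assumption and not the authors' implementation: `toy_generate` stands in for sampling from a domain-adapted LLM, `toy_tagger` is a rule-based stand-in for the encoder-based NER model trained on the original data, and the tag names `NAME` and `DATE` are hypothetical.

```python
import re
from typing import Callable, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (token, BIO tag) pairs

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
TITLES = {"Patient", "Dr"}  # hypothetical cue words for the toy tagger


def toy_generate(n: int) -> List[str]:
    """Stand-in for sampling n texts from a domain-adapted LLM (assumption)."""
    templates = [
        "Patient Anna Berg was admitted on 2024-03-01 .",
        "Follow-up with Dr Erik Holm scheduled for 2024-04-15 .",
    ]
    return [templates[i % len(templates)] for i in range(n)]


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for the machine-annotating NER model (assumption).

    Tags ISO dates as DATE and capitalized tokens following a cue word as NAME.
    """
    tags: List[str] = []
    in_name = False
    prev_title = False
    for tok in tokens:
        if DATE_RE.match(tok):
            tags.append("B-DATE")
            in_name = False
        elif tok[:1].isupper() and tok not in TITLES and (prev_title or in_name):
            tags.append("I-NAME" if in_name else "B-NAME")
            in_name = True
        else:
            tags.append("O")
            in_name = False
        prev_title = tok in TITLES
    return tags


def build_synthetic_corpus(
    generate: Callable[[int], List[str]],
    tagger: Callable[[List[str]], List[str]],
    n: int,
) -> List[TaggedSentence]:
    """Generate n synthetic texts and machine-annotate each one."""
    corpus: List[TaggedSentence] = []
    for text in generate(n):
        tokens = text.split()
        corpus.append(list(zip(tokens, tagger(tokens))))
    return corpus


corpus = build_synthetic_corpus(toy_generate, toy_tagger, 2)
```

The resulting `corpus` is a list of BIO-tagged sentences that a downstream NER model could be trained on. In this setup the labels are only as good as the tagger that produced them, which is consistent with the abstract's finding that the quality of the machine-annotating model largely determines how well the process works.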

Submission history

From: Thomas Vakili
[v1] Thu, 20 Feb 2025 16:09:27 UTC (787 KB)
[v2] Fri, 21 Feb 2025 16:58:44 UTC (787 KB)
[v3] Sat, 31 May 2025 10:43:20 UTC (950 KB)