LLMs as Span Annotators: A Comparative Study of LLMs and Humans
Zdeněk Kasner and 9 other authors
Abstract: Span annotation – annotating specific text features at the span level – can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to those of skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that, overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a rate similar to that of skilled crowdworkers, while producing annotations at a fraction of the cost per annotation. We release a dataset of over 40k model and human span annotations for further research.
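The abstract does not specify which agreement measure the paper uses, but the core idea of comparing span annotations can be illustrated with a minimal sketch: convert each annotator's spans into per-character labels and compute a chance-corrected agreement score. The metric (Cohen's kappa via scikit-learn), the example text, and the span offsets below are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of span-level inter-annotator agreement (assumed metric:
# character-level Cohen's kappa; the paper may use a different measure).
from sklearn.metrics import cohen_kappa_score


def spans_to_char_labels(spans, text_len):
    """Mark each character as 1 if covered by any (start, end) span, else 0."""
    labels = [0] * text_len
    for start, end in spans:
        for i in range(start, min(end, text_len)):
            labels[i] = 1
    return labels


# Hypothetical example: two annotators mark factual errors in the same output.
text = "The capital of France is Berlin, a city of 2.1 million people."
human_spans = [(25, 31)]            # human flags "Berlin"
llm_spans = [(25, 31), (43, 54)]    # LLM also flags "2.1 million"

human_labels = spans_to_char_labels(human_spans, len(text))
llm_labels = spans_to_char_labels(llm_spans, len(text))

print("Cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))
```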
Submission history
From: Zdeněk Kasner
[v1] Fri, 11 Apr 2025 17:04:51 UTC (742 KB)
[v2] Tue, 24 Jun 2025 13:11:18 UTC (730 KB)
[v3] Sat, 13 Dec 2025 11:30:28 UTC (565 KB)