...

A Brazilian Portuguese Dataset of Public Hearing Transcripts for Summarization of Long Documents


PublicHearingBR: A Brazilian Portuguese Dataset of Public Hearing Transcripts for Summarization of Long Documents, by Leandro Carísio Fernandes, Guilherme Zeferino Rodrigues Dobins, Roberto Lotufo, and Jayr Alencar Pereira


Abstract: This paper introduces PublicHearingBR, a Brazilian Portuguese dataset designed for summarizing long documents. The dataset consists of transcripts of public hearings held by the Brazilian Chamber of Deputies, paired with news articles and structured summaries containing the individuals participating in the hearing and their statements or opinions. The dataset supports the development and evaluation of long document summarization systems in Portuguese. Our contributions include the dataset, a hybrid summarization system to establish a baseline for future studies, and a discussion of evaluation metrics for summarization involving large language models, addressing the challenge of hallucination in the generated summaries. As a result of this discussion, the dataset also includes annotated data to evaluate natural language inference tasks in Portuguese.
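
To make the structure described in the abstract concrete, here is a minimal sketch of how one record (a transcript paired with a news article and a structured summary of participants and their statements) might be represented. All class and field names are hypothetical and chosen for illustration; they are not the dataset's actual schema. The `supported` flag is likewise an assumed stand-in for a natural language inference annotation, marking whether the transcript backs a given statement.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One statement or opinion attributed to a participant (hypothetical fields)."""
    text: str
    # Assumed annotation: True if the transcript supports (entails) the statement.
    supported: bool


@dataclass
class Participant:
    """A person taking part in the hearing, with the statements attributed to them."""
    name: str
    statements: list[Statement] = field(default_factory=list)


@dataclass
class HearingRecord:
    """A transcript paired with a news article and a structured summary."""
    transcript: str
    news_article: str
    participants: list[Participant] = field(default_factory=list)


def supported_fraction(record: HearingRecord) -> float:
    """Return the share of summary statements marked as supported by the transcript."""
    statements = [s for p in record.participants for s in p.statements]
    if not statements:
        return 0.0
    return sum(s.supported for s in statements) / len(statements)


# Placeholder content only; none of this text comes from the dataset.
example = HearingRecord(
    transcript="(full hearing transcript)",
    news_article="(news article about the hearing)",
    participants=[
        Participant(
            name="Participant A",
            statements=[
                Statement(text="(statement found in the transcript)", supported=True),
                Statement(text="(statement not found in the transcript)", supported=False),
            ],
        )
    ],
)
print(supported_fraction(example))
```

A score like this treats an unsupported statement as a possible hallucination, which is one simple way to read the inference annotations the abstract mentions; the paper's own metrics may differ.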

Submission history

From: Leandro Carísio Fernandes
[v1] Thu, 10 Oct 2024 00:13:59 UTC (413 KB)
[v2] Fri, 22 Aug 2025 13:02:16 UTC (514 KB)
