
[2410.02691] On the Proper Treatment of Tokenization in Psycholinguistics


View a PDF of the paper titled On the Proper Treatment of Tokenization in Psycholinguistics, by Mario Giulianelli and 5 other authors


Abstract: Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; then, the marginalized character-level language model can be used to compute the surprisal of an arbitrary character substring, which we term a focal area, that the experimenter may wish to use as a predictor. Our proposal of marginalizing a token-level model into a character-level one solves this misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
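The idea of marginalizing a token-level model into a character-level one can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the three-token vocabulary, the probabilities, the unigram (context-independent) token model, and the function names `prefix_prob` and `surprisal`. A real language model conditions each token on its history, and exact enumeration over tokenizations is generally intractable, which is why the abstract speaks of approximate marginalization.

```python
import math

# Hypothetical toy token-level model: a unigram distribution over tokens.
# Both the vocabulary and the probabilities are invented for illustration.
EOS = "<eos>"
TOKEN_PROBS = {"a": 0.3, "b": 0.3, "ab": 0.2, EOS: 0.2}


def prefix_prob(chars: str) -> float:
    """Probability that the generated character string starts with `chars`.

    Sums over every token sequence whose concatenation covers `chars`,
    so the result does not depend on any single tokenization.
    """
    if chars == "":
        return 1.0
    total = 0.0
    for tok, p in TOKEN_PROBS.items():
        if tok == EOS:
            continue  # ending the string here cannot produce more characters
        if tok.startswith(chars):
            # This token covers all remaining characters; any continuation counts.
            total += p
        elif chars.startswith(tok):
            # This token covers part of the target; recurse on the remainder.
            total += p * prefix_prob(chars[len(tok):])
    return total


def surprisal(region: str, context: str = "") -> float:
    """Character-level surprisal (in bits) of `region` following `context`."""
    return -math.log2(prefix_prob(context + region) / prefix_prob(context))


# "ab" is reachable as ["a", "b"] (0.3 * 0.3) or as ["ab"] (0.2),
# so its prefix probability is approximately 0.29.
p_ab = prefix_prob("ab")
s_b_after_a = surprisal("b", context="a")
```

Because `surprisal` takes an arbitrary character substring, the same function serves both for a region of interest and for any other focal area an experimenter might choose, regardless of where token boundaries fall.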

Submission history

From: Mario Giulianelli
[v1]
Thu, 3 Oct 2024 17:18:03 UTC (668 KB)
[v2]
Thu, 31 Oct 2024 12:40:33 UTC (630 KB)
