View a PDF of the paper titled BABILong: Testing the Limits of LLMs with Lengthy Context Reasoning-in-a-Haystack, by Yuri Kuratov and 6 other authors
Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively use only 10-20% of the context, and their performance declines sharply as reasoning complexity increases. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers after fine-tuning, enabling the processing of lengths up to 50 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new, more capable models, and we provide splits up to 10 million tokens in length.
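To make the construction concrete, below is a minimal, hypothetical sketch of the needle-in-a-haystack principle the abstract describes: the facts of a short reasoning task are scattered at random positions inside irrelevant background text until a target context length is reached, followed by the question. This is not the authors' implementation; the function name, the word-based length measure, and the toy inputs are all illustrative assumptions.

```python
import random


def make_babilong_style_sample(task_facts, question, background_sentences,
                               target_len_words, seed=0):
    """Hypothetical sketch: hide a reasoning task's facts in filler text.

    task_facts          -- sentences the model must combine to answer
    question            -- the query appended after the long context
    background_sentences -- distractor text drawn from (e.g.) book corpora
    target_len_words    -- approximate length of the final context
    """
    rng = random.Random(seed)

    # Draw background sentences until the target context length is reached.
    filler, total = [], 0
    while total < target_len_words:
        sentence = rng.choice(background_sentences)
        filler.append(sentence)
        total += len(sentence.split())

    # Insert each fact at a random position. Inserting from the highest
    # position downward keeps earlier indices valid, and sorting the
    # positions preserves fact order so chained reasoning stays well-defined.
    positions = sorted(rng.sample(range(len(filler) + 1), len(task_facts)))
    for pos, fact in zip(reversed(positions), reversed(task_facts)):
        filler.insert(pos, fact)

    return " ".join(filler) + f"\nQuestion: {question}"


# Toy usage: a bAbI-style single-fact QA item hidden in ~200 words of filler.
sample = make_babilong_style_sample(
    ["Mary moved to the bathroom.", "John went to the hallway."],
    "Where is Mary?",
    ["The rain fell softly on the quiet town."] * 10,
    target_len_words=200,
)
print(sample[:300])
```

Because the filler can be drawn indefinitely, a generator like this can in principle produce contexts of any length, which matches the abstract's claim that the benchmark is extendable to arbitrary lengths.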
Submission history
From: Yuri Kuratov [view email]
[v1]
Fri, 14 Jun 2024 16:00:29 UTC (7,834 KB)
[v2]
Wed, 6 Nov 2024 14:50:40 UTC (2,274 KB)