View a PDF of the paper titled On Designing Effective RL Reward at Training Time for LLM Reasoning, by Jiaxuan Gao and 8 other authors
Abstract: Reward models have become increasingly critical for improving the reasoning capability of LLMs. Existing research has shown that a well-trained reward model can substantially boost model performance at inference time via search. However, the potential of reward models during RL training time still remains largely under-explored. It is currently unclear whether these reward models can provide additional training signals to enhance the reasoning capabilities of LLMs in RL training that uses sparse success rewards, which verify the correctness of solutions. In this work, we evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM), and train a set of LLMs for math problems using RL by combining these learned rewards with success rewards. Surprisingly, even though these learned reward models have strong inference-time performance, they may NOT help or may even hurt RL training, producing worse performance than LLMs trained with the success reward only. Our analysis reveals that an LLM can receive high rewards from some of these reward models by repeating correct but unnecessary reasoning steps, leading to a severe reward hacking issue. Therefore, we introduce two novel reward refinement techniques, including Clipping and Delta. The key idea is to ensure the accumulative reward of any reasoning trajectory is upper-bounded to keep a learned reward model effective without being exploited. We evaluate our techniques with multiple reward models over a set of 1.5B and 7B LLMs on MATH and GSM8K benchmarks and demonstrate that with a carefully designed reward function, RL training without any additional supervised tuning can improve all the evaluated LLMs, including the state-of-the-art 7B LLM Qwen2.5-Math-7B-Instruct on MATH and GSM8K benchmarks.
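The core idea behind the two refinement techniques (capping per-step rewards so that adding more steps cannot inflate the total, and rewarding only the change between adjacent steps so the sum telescopes) can be sketched as follows. This is a minimal illustration under assumed conventions; the function names and the threshold `eta` are hypothetical, not the paper's actual implementation:

```python
def clipped_reward(step_rewards, eta=0.5):
    """Clipping sketch: cap each per-step process reward at a threshold eta,
    so that repeating extra reasoning steps cannot raise the accumulated
    reward without bound."""
    return [min(r, eta) for r in step_rewards]


def delta_reward(step_rewards):
    """Delta sketch: reward step k with r_k - r_{k+1}. The sum over a
    trajectory telescopes to the first step's reward, so the accumulated
    reward is upper-bounded regardless of trajectory length."""
    deltas = [step_rewards[k] - step_rewards[k + 1]
              for k in range(len(step_rewards) - 1)]
    deltas.append(step_rewards[-1])  # final step keeps its own reward
    return deltas
```

Under this sketch, padding a trajectory with repeated high-reward steps leaves the clipped total growing at most `eta` per step, while the delta total stays fixed at the first step's reward, which is the upper-bounding property the abstract describes.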
Submission history
From: Jiaxuan Gao
[v1]
Sat, 19 Oct 2024 13:53:50 UTC (1,880 KB)
[v2]
Fri, 25 Oct 2024 17:34:22 UTC (1,881 KB)