San Francisco-based artificial intelligence (AI) startup Arthur has announced the launch of Arthur Bench, an open-source tool for evaluating and comparing the performance of large language models (LLMs) like OpenAI’s GPT-3.5 Turbo and Meta’s LLaMA 2.
“With Bench, we’ve created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes,” said Adam Wenchel, co-founder and CEO of Arthur, in a press release.
How Arthur Bench works
Arthur Bench allows companies to test the performance of different language models on their specific use cases. It provides metrics to compare models on accuracy, readability, hedging, and other criteria.
For those who have used LLMs on more than a few occasions, “hedging” is an especially noticeable issue: the model pads its answer with extraneous language summarizing or alluding to its terms of service or programming constraints, such as “as an AI language model…”, which is typically not germane to the response the user wants.
“Those are kind of some of the subtle differences of behaviors that may be relevant for your particular application,” Wenchel said in an exclusive video interview with VentureBeat.
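To make the idea of a hedging metric concrete, here is a minimal sketch of how such a score could be computed. It is not Arthur Bench's implementation or API: the phrase list and the `hedging_score` function are hypothetical, chosen only to illustrate the kind of language described above.

```python
import re

# Hypothetical phrase patterns for illustration only; not taken from Arthur Bench.
HEDGE_PATTERNS = [
    r"\bas an ai (language )?model\b",
    r"\bi (cannot|can't|am unable to) (provide|give|offer)\b",
    r"\bi do not have (personal )?(opinions|beliefs)\b",
]


def hedging_score(response: str) -> float:
    """Return the fraction of sentences that contain a hedging phrase."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0
    hedged = sum(
        1
        for s in sentences
        if any(re.search(p, s, flags=re.IGNORECASE) for p in HEDGE_PATTERNS)
    )
    return hedged / len(sentences)
```

A score near 0 means the answer gets straight to the point, while a score near 1 means most of the response is boilerplate. A real metric would likely need a broader phrase list or a model-based classifier.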
Arthur has included a number of starter criteria upon which to compare LLM performance, but because the tool is open source, enterprises using it may add their own criteria to fit their needs.
“You can grab the last 100 questions your users asked and run them against all models. Then Arthur Bench will highlight where answers were wildly different so you can manually review those,” explained Wenchel.
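The workflow Wenchel describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Arthur Bench's actual code: the `disagreement` and `flag_for_review` functions, the 0.6 threshold, and the use of character-level similarity from the standard library's `difflib` are all choices made here for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def disagreement(answers: dict[str, str]) -> float:
    """Return 1 minus the lowest pairwise similarity among the models' answers."""
    if len(answers) < 2:
        return 0.0
    lowest = min(
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in combinations(answers.values(), 2)
    )
    return 1.0 - lowest


def flag_for_review(
    results: dict[str, dict[str, str]], threshold: float = 0.6
) -> list[str]:
    """Return questions whose answers diverge most, highest disagreement first.

    `results` maps each question to a {model_name: answer} dictionary.
    The threshold is an arbitrary value picked for this sketch.
    """
    scored = [(disagreement(answers), question) for question, answers in results.items()]
    return [q for score, q in sorted(scored, reverse=True) if score >= threshold]
```

Given the answers each model returned for a batch of user questions, `flag_for_review` returns only the questions worth a manual look. Character-level similarity is crude, so a production version would more plausibly compare embeddings or use an LLM as a judge.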
The goal is to help enterprises make informed decisions when adopting AI. Arthur Bench accelerates benchmarking and translates academic measures into real-world business impact.
The tool combines statistical measures and scores with assessments from other LLMs to grade the responses of candidate LLMs side by side.
Arthur Bench in action
Wenchel said financial services firms have already been using Arthur Bench to generate investment theses and analysis more quickly.
Vehicle manufacturers have taken equipment manuals containing many pages of highly specific technical guidance and used Arthur Bench to create LLMs that answer customer queries quickly and accurately by sourcing information from those manuals, with fewer hallucinations.
Another customer, the enterprise media and publishing platform Axios HQ, is also using Arthur Bench on its product development side.
“Arthur Bench helped us develop an internal framework to scale and standardize LLM evaluation across features, and to describe performance to the Product team with meaningful and interpretable metrics,” said Priyanka Oberoi, staff data scientist at Axios HQ.
Arthur is open sourcing Bench so anyone can use and contribute to it for free. The startup believes an open-source approach leads to the best products. There will still be opportunities to monetize through team dashboards.
Collaborations with AWS and Cohere
Arthur also announced a hackathon with Amazon Web Services (AWS) and Cohere to encourage developers to build new metrics for Arthur Bench.
Wenchel said AWS’s Bedrock environment for choosing between and deploying a variety of LLMs was “very philosophically aligned” with Arthur Bench.
“How do you rationally decide which LLMs are right for you?” Wenchel said. “This complements the AWS strategy very well.”
The company launched Arthur Shield earlier this year to monitor large language models for hallucinations and other issues.