...

5 AI Agents Benchmarked for Price & Performance


AI agents are autonomous software systems that plan and act to achieve given tasks or goals. They are equipped with the aggregated knowledge and skills of human experts and have access to relevant data.

We benchmarked the capabilities of web-focused AI agents by building our own agents. Follow the links to see our experience with the agents:

Benchmark results

To analyze the business use cases of AI agents, we used 2 different web scraping tasks. All agents failed most of the tasks. Anthropic Computer use and Dendrite performed slightly better than Phidata.

To learn more about web scraping, you can read Roadmap to Web Scraping: Use Cases, Methods & Tools and Web Scraping with RPA.

Task 1:

Prompt: Show all cloud GPU providers that offer H100. We need every H100 offer from each provider. Therefore a GPU provider may be presented in multiple rows when they have multiple H100 GPU offers (e.g. an offer with a single H100 and another offer with two H100s). For each row, we need these data points: URL where offer is shared, number of GPUs as an integer, price per hour as a decimal in $. Output as json.
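The prompt asks for one JSON row per offer with three data points. Below is a minimal Python sketch of what a conforming output could look like and how it might be checked. The field names (`url`, `gpu_count`, `price_per_hour_usd`) and the sample values are illustrative assumptions, since the prompt does not fix a schema.

```python
import json

# Hypothetical field names; the prompt lists the data points, not a schema.
REQUIRED_FIELDS = {
    "url": str,
    "gpu_count": int,
    "price_per_hour_usd": (int, float),
}


def validate_rows(raw_json: str) -> list[dict]:
    """Parse agent output and keep only rows matching the expected shape."""
    rows = json.loads(raw_json)
    valid = []
    for row in rows:
        has_fields = all(
            isinstance(row.get(name), kind) for name, kind in REQUIRED_FIELDS.items()
        )
        # bool is a subclass of int in Python, so reject it explicitly.
        if has_fields and not isinstance(row["gpu_count"], bool) and row["gpu_count"] > 0:
            valid.append(row)
    return valid


# Made-up sample: one provider with two offers appears in two rows,
# and a third row is rejected because the GPU count is not an integer.
sample = json.dumps([
    {"url": "https://example.com/h100", "gpu_count": 1, "price_per_hour_usd": 2.5},
    {"url": "https://example.com/h100", "gpu_count": 2, "price_per_hour_usd": 5.0},
    {"url": "https://example.com/h100", "gpu_count": "two", "price_per_hour_usd": 5.0},
])
print(len(validate_rows(sample)))
```

A check like this only confirms the shape of the output; whether the URL and price are correct still has to be verified against the source page.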

We evaluated their capabilities to provide the correct sources and accurate information:

Figure 1: The percentage of correctly provided sources by product.
Figure 2: The accuracy of the information provided by each product, as a percentage.

Task 2:

Prompt: Find B2B tech private companies that raised funding in October 2024. Format each result as: [Company name] raised [amount] in [sector/industry].
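The requested result format is regular enough to check mechanically. Here is a small Python sketch, assuming each result sits on its own line. The regular expression and the sample lines are illustrative assumptions and not part of the benchmark.

```python
import re

# Matches "<company> raised <amount> in <sector>"; company and amount are non-greedy.
RESULT_PATTERN = re.compile(
    r"^(?P<company>.+?) raised (?P<amount>.+?) in (?P<sector>.+)$"
)


def parse_result(line: str):
    """Return (company, amount, sector), or None if the line is off-format."""
    match = RESULT_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("company"), match.group("amount"), match.group("sector")


# Made-up example lines, not real funding rounds.
print(parse_result("ExampleCo raised $10M in cybersecurity"))
print(parse_result("ExampleCo is a cybersecurity company"))
```

The format carries no date, so a well-formed line still has to be checked separately against the October 2024 requirement.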

In this task, Anthropic Computer use (Figure 3) and Phidata (Figure 4) failed to provide answers.

Figure 3: Computer use’s answer to our task.
Figure 4: Phidata’s answer to our task. It provided relevant resources but not the answers.

ChatGPT’s search returned 7 companies, of which 6 are accurate. However, one company was listed as having fundraised in August 2024, which does not meet our requirement for companies that fundraised in October 2024. Therefore, this information is incorrect.

Dendrite provided 2 companies correctly, although there are many more companies. This is because it relied on search engine results, which were incomplete.

Perplexity provided 6 companies, and while their names, raised amounts, and industries are accurate, none of them completed fundraising in October 2024. Therefore, this information does not meet our requirements.

So the leaders of this task are ChatGPT search and Dendrite.

Costs

The price of Anthropic Computer use is based on API requests. For example, we spent ~$2.5 to run these 2 tasks, running each task a couple of times. $0.5 for a task run is expensive. If you want to use agentic process automation, you can find more cost-effective options.
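The per-run figure follows from simple division. Here is a short Python sketch of the arithmetic, assuming 5 task runs in total. That count is an assumption inferred from the ~$2.5 total and the $0.5 per-run figure; the exact number of runs is not stated.

```python
total_spend_usd = 2.5   # approximate total reported above
assumed_task_runs = 5   # assumption: the exact run count is not stated

cost_per_run = total_spend_usd / assumed_task_runs
print(f"${cost_per_run:.2f} per task run")


def projected_cost(runs: int, per_run_usd: float = cost_per_run) -> float:
    """Project total spend for a number of runs at the same per-run cost."""
    return runs * per_run_usd


print(f"${projected_cost(1000):.2f} for 1,000 runs")
```

Scaling the per-run cost this way shows why a usage-priced agent can become expensive for recurring scraping jobs.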

ChatGPT’s search functionality is available to users subscribed to the Plus and Team plans, priced at $20 per month and $25 per user per month (billed annually), respectively.

Dendrite offers a limited free plan and a Developer plan priced at $30. Specific details regarding the limitations of the free plan will be updated once they are officially published.

Phidata has free, pro and enterprise plans. Plans other than free are not available yet. They also claim that they will provide the pro plan free for students, educators and start-ups.

Our methodology

Versions: Latest versions available as of November 1, 2024.

Deployment environment:

  • Dendrite and Phidata were run on our laptop.

  • Anthropic Computer use was deployed to a cloud VM since Anthropic recommended against deployments on user devices.

  • ChatGPT search feature and Perplexity were used directly on their respective websites.

Process:

  • To evaluate the vendors’ web search capabilities, we first prepared a ground-truth, which includes all the cloud H100 providers. Then, we compared it with the outputs of the AI agents.

  • To evaluate the accuracy of the information, we checked all the links they provided to see whether the information they gave us is correct.

  • We did not try prompt engineering to get more accurate results.
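The comparison against the ground-truth can be sketched as a set overlap. This Python example is an assumption about how such a check could be done, not the exact procedure used in the benchmark, and the provider names are placeholders.

```python
def source_coverage(ground_truth: set[str], agent_output: set[str]) -> float:
    """Share of ground-truth sources that the agent also returned."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & agent_output) / len(ground_truth)


# Placeholder provider names, not benchmark data.
ground_truth = {"provider-a", "provider-b", "provider-c", "provider-d"}
agent_output = {"provider-a", "provider-c", "provider-x"}

print(f"{source_coverage(ground_truth, agent_output):.0%}")
```

Sources returned by the agent but missing from the ground-truth (here `provider-x`) do not raise the score under this reading.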

Scoring:

Since the number of outputs they provide varies, we aimed to keep the scoring system as simple as possible. For task 1, if a product returns a URL that is not from a reliable source, it receives a score of 0. Additionally, the number of outputs ranges from 6 to 28, so it is important to note that a product with 3 correct answers out of 6 outputs and another with 14 correct answers out of 28 outputs receive the same score in Figure 2.
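The scoring rule can be read as a ratio of correct answers to total outputs, with outputs from unreliable sources counted as 0. Below is a Python sketch under that reading. The function and its boolean inputs are hypothetical, and the sample lists are illustrative rather than benchmark data.

```python
def accuracy_score(outputs: list[tuple[bool, bool]]) -> float:
    """Score a product from (from_reliable_source, information_correct) pairs.

    An output counts only if its source is reliable and its data is correct.
    """
    if not outputs:
        return 0.0
    correct = sum(1 for reliable, right in outputs if reliable and right)
    return correct / len(outputs)


# Illustrative inputs only: 3 correct of 6, and 14 correct of 28.
small = [(True, True)] * 3 + [(True, False)] * 2 + [(False, True)]
large = [(True, True)] * 14 + [(True, False)] * 14

print(accuracy_score(small), accuracy_score(large))
```

Because the score is a ratio, it does not reward a product for returning more rows, only for the share of rows that are correct.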

We did not score the products for Task 2, since the search results vary significantly based on the browser used and the location of the user, and the products scrape data accordingly from these sources. However, since ChatGPT and Dendrite provided accurate results, they are considered the leaders for this task.

Disclaimer

Since the agents use different browsers and locations, these models can encounter different sources while web scraping. To be fair to all agents, all possible sources were included in our ground-truth.

Since all of these products are version 1 or beta, they have numerous limitations. We will continue to repeat the benchmark and update the results as the products develop.

Since these models are newly developed, they may cause security vulnerabilities, so we recommend using them in a virtual machine or container. Anthropic also mentions the necessity of taking this precaution when using Computer use.

Figure 5: Anthropic’s warning about the usage of Computer use.

Anthropic Computer use

Computer use makes numerous API calls for a single task. Running an agent with Computer use is slow.

We initially encountered problems due to Anthropic’s rate limits. In Tier 1, Anthropic allows users to make 50 API requests per minute. This was not enough to finish our tasks, so we needed to run the prompt multiple times.

Then, we requested a higher API limit and received it within hours, which facilitated benchmarking.
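A client-side throttle is one way to stay under a per-minute request limit such as the 50 requests per minute mentioned above. Here is a minimal Python sketch using a sliding window. `call_api` is a hypothetical stand-in for the real request, and none of this is Anthropic’s SDK.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most max_calls per period (in seconds)."""

    def __init__(self, max_calls: int = 50, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self) -> None:
        """Block until another call is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window.
            time.sleep(self.period - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())


def call_api(step: int) -> str:
    """Hypothetical placeholder for a real API request."""
    return f"response {step}"


limiter = RateLimiter(max_calls=50, period=60.0)
responses = []
for step in range(3):
    limiter.wait()
    responses.append(call_api(step))
print(responses)
```

A throttle like this spreads requests out instead of failing partway through a task, at the cost of longer wall-clock time per run.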

Perplexity

Perplexity’s search tool is accessible directly on its website. Like ChatGPT search, it is not an agentic AI, but we chose to include it in our testing since our benchmark task involves web scraping.

ChatGPT search

ChatGPT’s search feature is available to Plus and Team users directly within the ChatGPT interface. Although it is not an agentic AI, we included it in our testing because the focus of this benchmark is web scraping.

Dendrite

Dendrite provides example agents, such as a data extraction agent, on their website, which facilitates building new agents.

Dendrite’s agents run slower than most of the other agents in this benchmark.

Unlike other agents, it requires users to enter the search query.

Phidata

Phidata provides examples, such as a web search agent, on their website to make it easy to build new agents. We developed an agent in minutes.

Phidata’s agents hallucinated results in our benchmark, providing links to pages and pricing information that do not exist.

FAQ

What are the AI agent applications and use cases?

AI agents can automate complex workflows, reducing the need for human intervention and increasing efficiency. They can handle exceptions and edge cases, making them more reliable than traditional automation solutions.
AI agents can perform tasks that would be difficult or boring for humans. They can also be used for natural language processing, data processing, and analysis.

How to build your own agents?

Choose a vendor by considering your needs, skills and their prices.
Agents can be integrated with external systems using API calls and can access a wide range of data sources.
Design the task for your AI agent. You should be able to provide a prompt that is goal-oriented and not confusing to the model.

Are AI agents secure?

AI agents must be designed with data privacy and security in mind, using techniques such as encryption and access controls. At the current stage of development, we suggest that you do not share your sensitive data with artificial intelligence agents.

What are the business benefits of AI agents?

AI agents can improve efficiency and productivity, automating repetitive tasks and freeing up human agents to focus on more complex tasks.
They can analyze business data and automate business processes. If you need to learn more, see agentic process automation. By building autonomous agents, you can automate processes and have more tasks completed.

How to measure the success of AI agents?

If you use an agent in your business, use metrics such as efficiency, productivity, and customer satisfaction to measure the success of AI agents.
Monitor the performance of AI agents over time, making adjustments as needed.
Use data and analytics to provide insights into the decision-making processes and reliability of AI agents.
