AI agents are autonomous software systems that plan and act to achieve given tasks or goals. They are equipped with aggregated knowledge and experience of human experts and access to relevant data.
We benchmarked the capabilities of web-focused AI agents by building our own agents. Our experience with each agent is described in the sections below.
Benchmark results
To investigate the business use cases of AI agents, we used 2 different web scraping tasks. All agents failed most of the tasks. Anthropic Computer use and Dendrite performed slightly better than Phidata.
To learn more about web scraping, you can read Roadmap to Web Scraping: Use Cases, Methods & Tools and Web Scraping with RPA.
Task 1:
Prompt: Provide all cloud GPU providers that offer H100. We need every H100 offer from each provider. Therefore a GPU provider may be presented in multiple rows when they offer multiple H100 GPU offer (e.g. an offer with a single H100 and another offer with two H100s). For each row, we need these data points: URL where offer is shared, number of GPUs as an integer, price per hour as a decimal in $. Output as json.
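The prompt asks for JSON rows with three data points but does not name the keys. As a minimal sketch, the check below shows one way to validate an agent's output before comparing it with anything else. The key names (url, gpu_count, price_per_hour_usd) and the function names are assumptions made for illustration; they are not part of the benchmark.

```python
import json

# Field names are assumptions for illustration; the prompt names the data
# points but not the JSON keys.
FIELD_TYPES = {"url": str, "gpu_count": int, "price_per_hour_usd": float}


def has_type(value, expected):
    """Type check that rejects booleans and accepts ints where a decimal is expected."""
    if isinstance(value, bool):
        return False
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def validate_offers(raw_json):
    """Return (valid_rows, errors); errors holds (row_index, [bad_field, ...])."""
    rows = json.loads(raw_json)
    valid, errors = [], []
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            errors.append((index, ["row is not an object"]))
            continue
        # Collect every field that is missing or has the wrong type.
        bad = [name for name, expected in FIELD_TYPES.items()
               if not has_type(row.get(name), expected)]
        # Range checks only run once the types are known to be right.
        if not bad and row["gpu_count"] < 1:
            bad.append("gpu_count")
        if not bad and row["price_per_hour_usd"] <= 0:
            bad.append("price_per_hour_usd")
        if bad:
            errors.append((index, bad))
        else:
            valid.append(row)
    return valid, errors
```

A check like this only confirms that the output has the requested shape. Whether the URL and the price are real still has to be verified against the source page.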
We evaluated their capabilities to
Task 2:
Prompt: Find B2B tech private companies that raised funding in October 2024. Format each result as: [Company name] raised [amount] in [sector/industry].
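The requested line format is simple enough to check mechanically. The sketch below is one possible way to parse a result line into its three parts; the pattern and the function name are illustrative assumptions, and the example company is made up.

```python
import re

# Matches "[Company name] raised [amount] in [sector/industry]".
# The company and amount groups are non-greedy, so the line is split at the
# first " raised " and the first " in " that follows it.
LINE_PATTERN = re.compile(
    r"^(?P<company>.+?) raised (?P<amount>.+?) in (?P<sector>.+)$"
)


def parse_funding_line(line):
    """Return a dict with company, amount and sector, or None if the format is wrong."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical example line, not a benchmark result.
print(parse_funding_line("Acme Robotics raised $12M in industrial automation"))
```

This only checks the format. It cannot tell whether the round actually closed in October 2024, which is where most of the incorrect answers in this task came from.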
In this task, Anthropic Computer use (Figure 3) and Phidata (Figure 4) failed to provide answers.
ChatGPT search returned 7 companies, 6 of which were accurate. The remaining company was listed as having raised funding in August 2024, which does not meet our requirement of October 2024, so that result is incorrect.
Dendrite returned 2 correct companies, although many more companies qualify. It relied on search engine results, which were incomplete.
Perplexity provided 6 companies, and while their names, raised amounts, and industries are accurate, none of them completed fundraising in October 2024. Therefore, this information does not meet our requirements.
So the leaders of this task are ChatGPT search and Dendrite.
Prices
Pricing for Anthropic Computer use is based on API requests. For example, we spent ~$2.5 to run these 2 tasks, running each task a couple of times. At roughly $0.5 per task run, this is expensive. If you want to use agentic process automation, you can find more cost-effective options.
ChatGPT’s search functionality is available to users subscribed to the Plus and Team plans, priced at $20 per month and $25 per user per month (billed annually), respectively.
Dendrite offers a limited free plan and a Developer plan priced at $30. We will add details on the limitations of the free plan once they are officially published.
Phidata has free, pro and enterprise plans. Plans other than free are not available yet. Phidata also states that it will provide the pro plan free of charge to students, educators and start-ups.
Our methodology
Versions: Latest versions available as of November 1, 2024.
Deployment environment:
- Dendrite and Phidata were run on our laptop.
- Anthropic Computer use was deployed to a cloud VM, as Anthropic recommends against deployments on user devices.
- ChatGPT search and Perplexity were used directly on their respective websites.
Process:
- To evaluate the agents' web search capabilities, we first prepared a ground truth that includes all cloud H100 providers. Then, we compared it with the outputs of the AI agents.
- To evaluate the accuracy of the information, we checked every link the agents provided to see whether the information was correct.
- We did not try prompt engineering to get more accurate results.
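The comparison against the ground truth can be pictured as a lookup. The sketch below is an assumption about how such a check could be automated, not a description of how we ran it. The key names, the matching rule (provider domain plus GPU count) and the price tolerance are all hypothetical.

```python
from urllib.parse import urlparse


def offer_key(offer):
    """Identify an offer by provider domain and GPU count (an assumed matching rule)."""
    domain = urlparse(offer["url"]).netloc.lower().removeprefix("www.")
    return (domain, offer["gpu_count"])


def compare_with_ground_truth(agent_offers, ground_truth, price_tolerance=0.01):
    """Split agent offers into (correct, incorrect) by looking them up in the ground truth."""
    truth = {offer_key(offer): offer for offer in ground_truth}
    correct, incorrect = [], []
    for offer in agent_offers:
        expected = truth.get(offer_key(offer))
        # An offer is correct if it exists in the ground truth and the price agrees.
        price_matches = (
            expected is not None
            and abs(expected["price_per_hour_usd"] - offer["price_per_hour_usd"])
            <= price_tolerance
        )
        (correct if price_matches else incorrect).append(offer)
    return correct, incorrect
```

Matching on the domain rather than the full URL tolerates agents that land on different pages of the same provider, at the cost of missing cases where one provider lists the same GPU count at several prices.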
Scoring:
Since the number of outputs varies between products, we kept the scoring system as straightforward as possible. For Task 1, if a product returns a URL that is not from a reliable source, it receives a score of 0. The number of outputs ranges from 6 to 28, so note that a product with 3 correct answers out of 6 outputs and another with 14 correct answers out of 28 outputs receive the same score in Figure 2.
We did not score the products for Task 2, as search results vary significantly depending on the browser used and the user's location, and the products scrape data from these sources accordingly. However, since ChatGPT and Dendrite provided accurate results, they are considered the leaders for this task.
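One way to read the Task 1 scoring is as the share of returned rows that are correct. The sketch below assumes that reading, and also assumes that a row from an unreliable source simply counts as incorrect. Both are interpretations, and the function name is hypothetical.

```python
def task1_score(rows):
    """Share of output rows that are correct and come from a reliable source.

    rows is a list of (is_correct, from_reliable_source) pairs, one per output row.
    """
    if not rows:
        return 0.0
    # Assumed rule: a row from an unreliable source scores 0, i.e. counts as incorrect.
    correct = sum(1 for is_correct, reliable in rows if is_correct and reliable)
    return correct / len(rows)
```

Under this reading the score says nothing about coverage: a product that returns few rows and gets half of them right scores the same as a product that returns many rows and gets half of them right.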
Disclaimer
Since the agents use different browsers and locations, these models can encounter different sources while web scraping. To be fair to all agents, all potential sources were included in our ground-truth.
Since all of these products are version 1 or beta, they have various limitations. We will continue to repeat the benchmark and update the results as the products develop.
Since these models are newly developed, they may introduce security vulnerabilities, so we recommend using them in a virtual machine or container. Anthropic also mentions the necessity of taking this precaution when using Computer use.
Anthropic Computer use
Computer use makes numerous API calls for a single task. Running an agent with computer use is slow.
We initially encountered problems due to Anthropic's rate limits. In Tier 1, Anthropic allows 50 API requests per minute. This was not enough to finish our tasks, so we needed to run the prompt multiple times.
We then asked for a higher API limit and received the increase within hours, which facilitated benchmarking.
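When a higher limit is not available, requests can be paced on the client side. The sketch below is a generic sliding-window limiter sized for 50 requests per minute. It is not part of Anthropic's SDK, the class name is hypothetical, and we did not use it in the benchmark.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Blocks so that at most max_calls are made in any window of `period` seconds."""

    def __init__(self, max_calls=50, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._calls = deque()    # timestamps of calls still inside the window

    def wait(self):
        """Call before each API request; returns once the request is allowed."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                break
            # Window is full: sleep until the oldest call ages out, then re-check.
            self._sleep(self.period - (now - self._calls[0]))
        self._calls.append(now)
```

Calling limiter.wait() before each request keeps a long agent run under the per-minute cap instead of failing partway through. It only throttles the client, so the server can still reject requests for other reasons, such as token-based limits.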
Perplexity
Perplexity's search tool is accessible directly on its website. Like ChatGPT search, it is not an agentic AI, but we chose to include it in our testing since our benchmark tasks involve web scraping.
ChatGPT search
ChatGPT's search feature is available to Plus and Team users directly within the ChatGPT interface. Although it is not an agentic AI, we included it in our testing because the focus of this benchmark is web scraping.
Dendrite
Dendrite provides example agents, such as a data extraction agent, on its website, which makes it easier to build new agents.
Dendrite's agents run slower than most of the other agents in this benchmark.
Unlike other agents, it requires users to enter the search query.
Phidata
Phidata provides examples, such as a web search agent, on its website to make it easy to build new agents. We developed an agent in minutes.
Phidata's agents hallucinated results in our benchmark, providing links to pages and pricing information that do not exist.
FAQ
What are the AI agent applications and use cases?
AI agents can automate complex workflows, reducing the need for human intervention and increasing efficiency. They can handle exceptions and edge cases, making them more reliable than traditional automation solutions.
AI agents can perform tasks that would be difficult or boring to humans. They can also be used for natural language processing, data processing, and analysis.
How to build your own agents?
Choose a vendor by considering your needs, abilities and their prices.
Agents can be integrated with external systems using API calls and can access a wide range of data sources.
Design the task for your AI agent. You should be able to provide a prompt that is goal-oriented and not confusing to the model.
Are AI agents secure?
AI agents must be designed with data privacy and security in mind, using techniques such as encryption and access controls. At the current level of development, we suggest that you do not share sensitive data with AI agents.
What are the business benefits of AI agents?
AI agents can increase efficiency and productivity, automating repetitive tasks and freeing up human agents to focus on more complex tasks.
They can analyze enterprise data and automate business processes. If you need to learn more, see agentic process automation. By building autonomous agents, you can automate processes and have more tasks done.
How to measure the success of AI agents?
If you use an agent in your business, use metrics such as efficiency, productivity, and customer satisfaction to measure its success.
Monitor the performance of AI agents over time, making adjustments as needed.
Use data and analytics to provide insights into the decision-making processes and reliability of AI agents.