Deep research is a feature offered by some LLMs that runs a wider range of searches for the user than AI search engines do.
We tested and evaluated the following tools to determine which one is most helpful to users:
- Grok 3 Deep Search
- ChatGPT Deep Research with o1
- Perplexity Deep Research
Results
We evaluated the tools on two metrics: accuracy and number of sources.
Grok leads in the accuracy of the data provided.
Grok is also the leader in the number of indexed sources.
Task 1:
We asked each tool to create tables comparing enterprise password management software. The full prompt can be found under Methodology below.
Perplexity provided a detailed report but failed to create a table with its gathered information. Since our prompt specifically requested table outputs, it received zero points for that task.
Task 2:
The goal of this task was to evaluate the tools' speed and coverage in research. We asked for a detailed report on RPA adoption and recorded how many pages each tool indexed and how long it took to produce the report. Of course, the number of sources does not necessarily correlate with the quality of the research. However, since these tools are designed to speed up research, we considered it an important metric.
We should also note that search times vary significantly across these tools. Grok Deep Search is approximately 10 times faster than ChatGPT Deep Research and searches approximately 3 times more webpages.
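The relative figures above are simple ratios of run time and pages searched. A minimal sketch of that arithmetic follows; the function names and the sample numbers are hypothetical placeholders chosen for illustration, not the measurements from our test.

```python
def relative_speed(seconds_a: float, seconds_b: float) -> float:
    """Return how many times faster run A finished than run B."""
    if seconds_a <= 0 or seconds_b <= 0:
        raise ValueError("durations must be positive")
    return seconds_b / seconds_a


def relative_coverage(pages_a: int, pages_b: int) -> float:
    """Return how many times more pages run A searched than run B."""
    if pages_b <= 0:
        raise ValueError("page count of the baseline must be positive")
    return pages_a / pages_b


# Placeholder values only (NOT measured results): tool A takes 120 s and
# searches 90 pages, tool B takes 1200 s and searches 30 pages.
speed_ratio = relative_speed(120, 1200)
coverage_ratio = relative_coverage(90, 30)
print(f"A is {speed_ratio:.0f}x faster and searches {coverage_ratio:.0f}x more pages")
```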
Methodology
Each data point requested in the prompt was scored as 1 point. If the output was not in table format, we rated it as 0.
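The scoring rule can be sketched as a small function. This is an illustration under assumptions: the `ToolOutput` type, its field names, and the sample values are hypothetical, and it assumes a reviewer has already counted the data points accepted for each tool.

```python
from dataclasses import dataclass


@dataclass
class ToolOutput:
    tool: str             # hypothetical label for the tool under test
    is_table: bool        # did the tool deliver the requested tables?
    accepted_points: int  # data points the reviewer accepted (assumed input)


def score(output: ToolOutput) -> int:
    """One point per accepted data point; zero if the output is not a table."""
    if not output.is_table:
        return 0
    return output.accepted_points


# Illustrative values only, not results from our evaluation.
outputs = [
    ToolOutput(tool="tool_a", is_table=True, accepted_points=12),
    ToolOutput(tool="tool_b", is_table=False, accepted_points=12),
]
for item in outputs:
    print(item.tool, score(item))
```

Applying the table check before counting points mirrors the rule that a non-table answer scores 0 regardless of how much information it contains.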
Prompt for Task 1:
Research and evaluate the top 5 enterprise password management solutions based on the following criteria to identify the most effective solution for enterprise deployment.
Criteria
1. Security Features
– Encryption standard used
– Zero-knowledge architecture implementation
– MFA options supported
– Third-party security certifications
– Password health monitoring features
2. Deployment & Integration
– Deployment options
– Directory integration capabilities
– API availability and functionality
– SSO integration
3. User Experience
– Browser extension compatibility
– Mobile app availability and rating
– Offline access capabilities
– Password sharing functionality
4. Administration
– Password policy enforcement options
– User provisioning/deprovisioning automation
– Reporting and compliance features
– Emergency access protocols
5. Cost & Scalability
– Compare pricing using standardized enterprise scenarios (100 users, 500 users, 1000+ users)
Delivery Format
1. Detailed table for each criterion
2. Cost comparison table with standardized scenarios
In our second task, we aimed to measure the scope of the research each tool conducted. To do this, we compared the number of references cited. Comparing the reports' contents directly is not an objective method in this case, as establishing a definitive ground truth is not feasible. However, the number of references gives an idea of each tool's ability to provide information, since the strength of these tools is their ability to index hundreds of web pages in minutes.
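One way to count cited references consistently is to extract the URLs from a report and de-duplicate them. The sketch below is an assumption about how such a count could be automated, not a description of how the counts in this article were produced; the regular expression, the normalization rules, and the sample text are illustrative.

```python
import re
from urllib.parse import urlsplit

# Matches http(s) URLs up to whitespace or a closing bracket/quote.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def count_unique_sources(report_text: str) -> int:
    """Count distinct cited pages, ignoring fragments and trailing punctuation."""
    seen = set()
    for raw in URL_PATTERN.findall(report_text):
        parts = urlsplit(raw.rstrip(".,;"))
        # Normalize: lowercase host, drop the fragment and any trailing slash.
        seen.add((parts.netloc.lower(), parts.path.rstrip("/"), parts.query))
    return len(seen)


# Made-up report text for illustration.
sample = (
    "See https://example.com/a, https://example.com/a#top "
    "and (https://Example.com/b/)."
)
print(count_unique_sources(sample))
```

Counting pages rather than raw links avoids inflating a tool's total when the same page is cited several times.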
These tools can help users get answers faster without reading many pages. For example, Grok indexed more than 100 different pages in our second task. It would normally take a human hours to read and gather information from all these pages, but it took Grok about 2 minutes. These tools can therefore speed up the research process. However, users should always remember that these tools can hallucinate and generate wrong information, and should be cautious when using information taken directly from an LLM.
Most people are suspicious of the accuracy of LLM-generated information and double-check it themselves because they know that LLMs can hallucinate. The issue with deep research is that, because it conducts more comprehensive research than standard chat and provides sources, users may mistakenly assume it always provides accurate information. LLMs (even with deep research) still tend to hallucinate, and this may result in serious misunderstandings.
Gary Marcus also warned that deep research tools can cause a decline in the quality of scientific papers.
FAQ
What is AI-powered research?
AI-powered research tools transform how scientists conduct research, making it faster and more efficient. Deep research tools, in particular, have the potential to significantly impact the scientific community. They can help speed up the process, but users should be careful about mistakes before publishing that information.
Industry reports and studies have shown that AI tools can be highly effective in certain areas, such as data analysis and literature reviews. These tools use capable AI models to synthesize information from multiple sources, providing key findings and insights.
These tools use reasoning models and generative AI to synthesize information and provide insights. They can also handle complex topics and provide detailed answers. Pro users can leverage AI tools to gain a competitive edge in their research.
New models and technologies, such as AI Python tools and text-only subsets, are emerging alongside Deep Research, and integrating them will increase the scope and reliability of Deep Research.
Can AI tools produce literature reviews?
Yes. They can index many pages and then compile the information into a report.