Particle.news
Audit Finds AI Search Frequently Makes Unsupported Claims as Error Rates Vary Widely

The findings stem from a 303‑query audit that used an LLM to score answers, a choice some experts and vendors say requires stronger validation.

Overview

  • Salesforce AI Research reported that roughly one‑third of answers from generative search tools lacked reliable support, with GPT‑4.5 at 47 percent.
  • Results varied significantly by product and mode, including about 23 percent for Bing Chat, around 31 percent for You.com and Perplexity search, and 97.5 percent for Perplexity’s deep‑research agent.
  • The study evaluated systems using the DeepTRACE framework across eight metrics such as one‑sidedness, overconfidence and citation support, drawing on both debate and expertise prompts.
  • Researchers used an LLM judge, calibrated on more than 100 human‑annotated examples, to score outputs; Tech Xplore notes that human reviewers later checked the framework’s assessments.
  • OpenAI declined to comment; Perplexity disputed the methodology, citing the audit’s reliance on its default model; and outside scholars urged accuracy and sourcing improvements while flagging concerns about LLM‑based annotation and the study’s statistical validation.