Particle.news
Audit Finds AI Search Frequently Makes Unsupported Claims as Error Rates Vary Widely

The findings stem from a 303‑query audit that used an LLM to score answers, a choice some experts and vendors say requires stronger validation.

Overview

  • Salesforce AI Research reported that roughly one‑third of answers from generative search tools lacked reliable support, with GPT‑4.5 at 47 percent.
  • Results varied significantly by product and mode, including about 23 percent for Bing Chat, around 31 percent for You.com and Perplexity search, and 97.5 percent for Perplexity’s deep‑research agent.
  • The study evaluated systems using the DeepTRACE framework across eight metrics such as one‑sidedness, overconfidence and citation support, drawing on both debate and expertise prompts.
  • Researchers used an LLM judge, calibrated on more than 100 human‑annotated examples, to score outputs; Tech Xplore notes that human reviewers later checked the framework’s assessments.
  • OpenAI declined to comment; Perplexity disputed the methodology, citing the audit’s reliance on its default model; and outside scholars urged accuracy and sourcing improvements while flagging concerns about LLM‑based annotation and the study’s statistical validation.