Overview
- Salesforce AI Research reported that roughly one‑third of answers from generative search tools lacked reliable support, with GPT‑4.5 reaching 47 percent unsupported claims.
- Results varied widely by product and mode: about 23 percent for Bing Chat, around 31 percent for You.com and Perplexity search, and 97.5 percent for Perplexity’s deep‑research agent.
- The study evaluated systems using the DeepTRACE framework across eight metrics, including one‑sidedness, overconfidence, and citation support, drawing on both debate‑style and expertise prompts.
- Researchers used an LLM trained on more than 100 human‑annotated examples to score outputs, and Tech Xplore notes that human reviewers later checked the framework’s assessments.
- OpenAI declined to comment; Perplexity disputed the methodology, objecting to the study’s reliance on a default model; and outside scholars urged accuracy and sourcing improvements while raising concerns about LLM‑based annotation and statistical validation.