The nonprofit discloses every stage of its development process, enabling reproducible training runs and independent audits of the data.