LLM-Agent Research Coalesces Around Generalizability, RLVR, and Grounded Retrieval

New surveys, benchmarks and modular methods chart a path to more reliable agents.

Overview

  • A comprehensive survey formalizes what it means for agents to generalize across tasks and domains, distinguishing agent frameworks from the agents built on them and calling for standardized, variance- and cost-aware evaluation metrics.
  • Fresh benchmarks such as EngiBench and AECBench find models strong on basic knowledge yet weak on high-level reasoning, robustness and domain calculations, while WildClaims, AIPsychoBench and DIWALI reveal gaps in real-world information access, psychometrics and culturally grounded adaptation.
  • Method papers report measured gains from reasoning models trained with reinforcement learning with verifiable rewards (RLVR), joint graph–LLM retrievers (GRIL), code-based RL environments like CodeGym, and modular systems such as SignalLLM and LogicGuard; in one result, CodeGym training lifted Qwen2.5-32B-Instruct by 8.7 points on the out-of-distribution τ-Bench (a minimal verifiable-reward sketch follows this list).
  • Grounding advances include a knowledge-graph RAG chatbot with a tripartite RAG-Eval scoring framework, a large study showing LLMs handle graph inference best via code generation, and medical KG reward modeling that judges diagnostic paths well but transfers weakly to downstream tasks.
  • Persistent limits are documented across the literature, including agent hallucinations throughout workflows, resistance to incorporating even near-complete feedback, and systematic biases in LLM-as-judge setups such as position, verbosity and self-preference effects (a position-bias mitigation sketch follows this list).
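
To make the RLVR idea referenced above concrete, here is a minimal, hypothetical sketch: the training reward comes from an automatic check (running unit tests on generated code, in the spirit of code-based environments like CodeGym) rather than from a learned reward model. The helper name verifiable_reward and the test-running setup are illustrative assumptions, not the API of any cited paper.

    import os
    import subprocess
    import sys
    import tempfile
    import textwrap


    def verifiable_reward(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> float:
        """Return 1.0 if the candidate passes the unit tests, else 0.0.

        The reward is produced by an automatic verifier (tests run in a
        subprocess) instead of a learned reward model, which is the core
        idea behind RL with verifiable rewards.
        """
        program = textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
            return 1.0 if result.returncode == 0 else 0.0
        except subprocess.TimeoutExpired:
            return 0.0
        finally:
            os.unlink(path)


    # A correct candidate earns reward 1.0; a buggy one earns 0.0.
    candidate = "def add(a, b):\n    return a + b"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    print(verifiable_reward(candidate, tests))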
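
And a small sketch of one common mitigation for the position bias noted in the last item: judge each answer pair in both orders and keep only verdicts that survive the swap. The judge callable and its "A"/"B" return convention are assumptions for illustration, not the evaluation protocol of any cited paper.

    from typing import Callable, Literal

    Verdict = Literal["A", "B", "tie"]


    def debiased_verdict(
        judge: Callable[[str, str, str], Literal["A", "B"]],
        prompt: str,
        answer_a: str,
        answer_b: str,
    ) -> Verdict:
        """Query the judge twice with the answers swapped; accept only stable verdicts."""
        first = judge(prompt, answer_a, answer_b)   # answer_a shown in position A
        second = judge(prompt, answer_b, answer_a)  # answer_b shown in position A
        # Map the swapped verdict back to the original labels before comparing.
        second_mapped = "A" if second == "B" else "B"
        if first == second_mapped:
            return first   # consistent under swapping
        return "tie"       # order-dependent verdicts are discarded as ties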