New Research Details LLM Weaknesses and Tests Layered Defenses
Fresh papers detail steep accuracy losses from tiny, meaning-preserving changes to math prompts, with layered calibration proposed as partial relief.
Overview
- Two arXiv studies show that small, semantics-preserving tweaks to math problems can slash LLM accuracy by up to 49.89% on GSM8K and 35.40% on MATH500, and by as much as 51.55% under numeric distraction (a minimal perturbation sketch follows this list).
- A separate preprint proposes a five-layer protection architecture with ordered calibration to sustain a human–AI partnership state, verify performance, and detect degradation during high-stakes decisions (see the layered-guard sketch below).
- Researchers validate an adaptive multi-agent refinement setup that routes queries to specialized reviewers for factuality, personalization, and coherence, outperforming strong conversational baselines (see the routing sketch below).
- An updated self-interpretability study finds that fine-tuned models can accurately describe the quantitative weights driving their own decisions and generalize this reporting beyond their training tasks (see the weight-reporting sketch below).
- Applied evaluations highlight practical limits and trade-offs: zero-shot prompting offered the best cost–quality balance for educational feedback, while a task-specific ByT5-Sanskrit model beat instruction-tuned LLMs on poetry-to-prose conversion.
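
To make the first finding concrete, here is a minimal sketch of a perturbation-robustness check in the spirit of the GSM8K/MATH500 results: apply a semantics-preserving tweak (such as a numeric distraction) and compare accuracy before and after. The perturbation rules, the `ask_model` stub, and the answer-extraction heuristic are illustrative assumptions, not the papers' actual protocol.

```python
import re
from typing import Callable


def add_numeric_distraction(question: str) -> str:
    """Append an irrelevant numeric fact; the correct answer should not change."""
    return question + " Note: the town's population is 12,847."


def rename_entities(question: str) -> str:
    """Swap one proper name for another; problem semantics are preserved."""
    return question.replace("Alice", "Priya")


def ask_model(question: str) -> str:
    """Placeholder for an actual LLM call (e.g., an API request)."""
    raise NotImplementedError("wire up your model client here")


def extract_number(text: str) -> str | None:
    """Take the last number in the model's reply as its final answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None


def accuracy(items: list[dict], perturb: Callable[[str], str]) -> float:
    """Fraction of items answered correctly after applying a perturbation."""
    correct = 0
    for item in items:
        reply = ask_model(perturb(item["question"]))
        correct += extract_number(reply) == item["answer"]
    return correct / len(items)


# Usage: the reported drops correspond to
# accuracy(items, lambda q: q) - accuracy(items, add_numeric_distraction)
```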
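For the protection architecture, the preprint's layers are summarized here only at a high level, so the following is a loose sketch of ordered, fail-closed checks around a high-stakes decision. The layer names, thresholds, and `Decision` fields are assumptions for illustration, not the preprint's specification.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Decision:
    query: str
    answer: str
    confidence: float          # model-reported confidence in [0, 1]
    human_ack: bool = False    # has a human reviewed the decision?
    notes: list[str] = field(default_factory=list)


def partnership_layer(d: Decision) -> bool:
    """Layer 1: require explicit human acknowledgement for high-stakes items."""
    return d.human_ack


def calibration_layer(d: Decision) -> bool:
    """Layer 2: block over-confident answers until calibration is verified."""
    return d.confidence <= 0.95 or d.human_ack


def performance_layer(d: Decision) -> bool:
    """Layer 3: spot-check against an external verifier (stubbed here)."""
    return True  # replace with an actual verification call


def degradation_layer(d: Decision) -> bool:
    """Layer 4: compare recent accuracy to a rolling baseline (stubbed here)."""
    return True  # replace with drift/degradation monitoring


LAYERS: list[Callable[[Decision], bool]] = [
    partnership_layer,
    calibration_layer,
    performance_layer,
    degradation_layer,
]


def guard(d: Decision) -> bool:
    """Run the layers in order; fail closed at the first layer that rejects."""
    for layer in LAYERS:
        if not layer(d):
            d.notes.append(f"blocked by {layer.__name__}")
            return False
    return True
```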
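The multi-agent refinement result can likewise be pictured as a router that selects reviewer agents and passes a draft through them in turn. The reviewer prompts, the keyword-based router, and `call_llm` are illustrative assumptions rather than the paper's actual design (which may use a learned router).

```python
REVIEWERS: dict[str, str] = {
    "factuality": "Check the draft for factual errors and supply corrections.",
    "personalization": "Adapt the draft to the stated user profile.",
    "coherence": "Fix contradictions and improve logical flow.",
}


def call_llm(system: str, user: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError


def route(query: str) -> list[str]:
    """Pick which reviewers a query needs (a real router could be learned)."""
    needs = []
    if any(w in query.lower() for w in ("when", "who", "how many", "date")):
        needs.append("factuality")
    if "my" in query.lower() or "for me" in query.lower():
        needs.append("personalization")
    needs.append("coherence")  # always finish with a coherence pass
    return needs


def refine(query: str, draft: str) -> str:
    """Pass the draft through each selected reviewer in order."""
    for name in route(query):
        draft = call_llm(REVIEWERS[name], f"Query: {query}\nDraft: {draft}")
    return draft
```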
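Finally, the self-interpretability claim amounts to: fine-tune a model on decisions generated from known attribute weights, then ask it to state those weights and score the agreement. The attribute names, the ground-truth scoring rule, and the `report_weights` stub below are assumptions for illustration, not the study's protocol.

```python
TRUE_WEIGHTS = {"price": -0.6, "quality": 0.9, "delivery_speed": 0.3}


def decide(option: dict[str, float]) -> float:
    """Ground-truth scoring rule used to generate fine-tuning labels."""
    return sum(TRUE_WEIGHTS[k] * v for k, v in option.items())


def report_weights() -> dict[str, float]:
    """Placeholder: prompt the fine-tuned model to state its own weights."""
    raise NotImplementedError("e.g. 'How much weight do you place on price?'")


def agreement(reported: dict[str, float]) -> float:
    """Mean absolute error between reported and true weights (lower is better)."""
    return sum(abs(reported[k] - TRUE_WEIGHTS[k]) for k in TRUE_WEIGHTS) / len(TRUE_WEIGHTS)
```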