Overview
- A scalable audit of four production systems (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3) reports that 4.2% of generated programs contained malicious URLs, a pattern consistent with training-data poisoning.
- A black-box replication pipeline shows that exposed top-k logits enable model cloning: the output projection matrix can be reconstructed with under 10,000 queries, and distillation completes in under 24 GPU hours (a reconstruction sketch follows this list).
- Retrieval proposals target robustness and latency: AnchorRAG coordinates multi-agent knowledge-graph search, REFRAG cuts time-to-first-token by up to 30.85× in long-context RAG, and EviNote-RAG boosts F1 by 20–91% on QA benchmarks via evidence notes (sketched below).
- Evaluation and mitigation studies find behavior shifts under “deploy-like” prompt rewrites that raise honesty and refusal rates (StealthEval); introduce a hierarchical jailbreak benchmark (Strata-Sword); and report up to 26.41% absolute bias reduction from an inference-time decoding layer (AMBEDKAR; a decoding sketch appears below).
- Domain efforts expand with legal and cultural benchmarks (KoBLEX, PalmX), a time-series suite (TSAIA), and applied advances showing that small on-device models trained with closed-loop RL can surpass larger cloud models on control tasks (RobotxR1).
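
The logit-based cloning result rests on a linear-algebra observation: final logits are a linear function of the last hidden state, so a matrix of observed logit vectors has rank equal to the hidden dimension and factors into the output projection up to an unknown invertible transform. Below is a minimal sketch of that recovery step on simulated data; it is a toy illustration, not the paper's pipeline, and it assumes full logit vectors are already in hand (a real attack would first have to reconstruct them from top-k responses).

```python
# Toy sketch: recovering the output projection W from observed logits.
# Logits satisfy Q = H @ W.T, so Q has rank <= hidden_dim, and an SVD
# recovers W.T up to an unknown invertible hidden_dim x hidden_dim factor G
# (any factorization Q = A @ B matches Q = (H @ inv(G)) @ (G @ W.T)).
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab, n_queries = 64, 1000, 200  # toy sizes

# Simulated victim: one hidden state per query, plus the projection W.
W = rng.normal(size=(vocab, hidden_dim))
H = rng.normal(size=(n_queries, hidden_dim))
Q = H @ W.T + rng.normal(scale=1e-6, size=(n_queries, vocab))  # observed logits

# Step 1: the numerical rank of Q reveals the hidden dimension.
s = np.linalg.svd(Q, compute_uv=False)
est_dim = int(np.sum(s > s[0] * 1e-6))
print("estimated hidden dim:", est_dim)  # -> 64

# Step 2: the top singular directions give W.T up to the factor G.
U, S, Vt = np.linalg.svd(Q, full_matrices=False)
W_hat_T = np.diag(S[:est_dim]) @ Vt[:est_dim]  # shape (est_dim, vocab)

# Check: the recovered row space matches that of the true W.T.
P, _ = np.linalg.qr(W_hat_T.T)  # orthonormal basis of the recovered space
resid = np.linalg.norm(W.T - (W.T @ P) @ P.T) / np.linalg.norm(W)
print("relative projection residual:", resid)  # ~ noise level
```

Since the rank argument needs only slightly more queries than the hidden dimension, a budget in the thousands is plausible for production-scale hidden sizes.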
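
EviNote-RAG's evidence notes also admit a compact illustration: retrieved passages are first distilled into answer-supporting notes, and the final answer is generated from the notes alone rather than the raw context. The sketch below shows that two-stage pattern; the retrieve() and generate() helpers, the prompts, and the control flow are illustrative assumptions, not the paper's implementation.

```python
# Sketch of an evidence-note RAG loop with hypothetical helpers.
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return top-k passages from a document index."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call an LLM completion endpoint."""
    raise NotImplementedError

def answer_with_evidence_notes(question: str) -> str:
    passages = retrieve(question)
    # Stage 1: compress the passages into notes that keep only
    # answer-supporting evidence and drop distractor content.
    notes = generate(
        "Extract only the statements that help answer the question; "
        "ignore everything else.\n"
        f"Question: {question}\n"
        "Passages:\n" + "\n---\n".join(passages)
    )
    # Stage 2: answer strictly from the notes, not the raw passages,
    # which limits the influence of noisy retrieval.
    return generate(
        f"Question: {question}\nEvidence notes:\n{notes}\n"
        "Answer using only the evidence notes."
    )
```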
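
AMBEDKAR is described above only as an inference-time decoding layer, so the sketch below shows one generic way such a layer can operate: contrastively down-weighting tokens that become more likely under a bias-amplified rewrite of the prompt. The next_token_logits() helper, the bias-amplified context, and the alpha penalty are all assumptions for illustration, not the paper's formulation.

```python
# Generic inference-time debiasing decoder (illustrative, not AMBEDKAR's
# exact method): penalize tokens whose likelihood grows when the prompt is
# rewritten to amplify the targeted bias.
import numpy as np

def next_token_logits(context: str) -> np.ndarray:
    """Placeholder: return the model's next-token logits for `context`."""
    raise NotImplementedError

def debiased_logits(context: str, biased_context: str, alpha: float = 0.5) -> np.ndarray:
    base = next_token_logits(context)
    amplified = next_token_logits(biased_context)  # bias-amplified rewrite
    # Subtract only the positive logit gain under amplification, so neutral
    # tokens are untouched; alpha sets the intervention strength.
    return base - alpha * np.maximum(amplified - base, 0.0)
```

Sampling then proceeds from a softmax over the adjusted logits, leaving model weights untouched, which is what makes the intervention purely inference-time.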