Overview
- Researchers retrained open models such as LLaMA and Qwen on X/Twitter corpora, contrasting short viral posts with longer factual text, then evaluated the resulting models on standard benchmarks (see the corpus-split sketch after this list).
- When models were trained solely on viral content, reasoning accuracy on ARC‑Challenge fell from 74.9 to 57.2 and long‑context comprehension on RULER‑CWE dropped from 84.4 to 52.3.
- The team identified a failure mode dubbed “thought skipping,” where models omit intermediate reasoning steps and produce shorter, less structured answers with more errors (see the step-count sketch after this list).
- Performance did not return to baseline after fine‑tuning on clean data, which the authors attribute to representational drift that conventional retraining cannot fully reverse.
- The study links the degradation to engagement signals (how viral a post was) rather than to semantic content and reports personality‑like safety regressions. These findings have prompted calls for stronger data provenance and quality controls, including in emerging on‑chain data marketplaces.
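
The corpus split described above keys on surface signals (popularity and length) rather than on meaning. The snippet below is a minimal, hypothetical sketch of that kind of filter; the `Post` fields, the `split_corpus` thresholds, and the bucket names are illustrative assumptions, not the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Post:
    text: str
    likes: int
    retweets: int
    replies: int

def engagement_score(post: Post) -> int:
    # Popularity proxy: raw interaction counts, independent of content quality.
    return post.likes + post.retweets + post.replies

def split_corpus(
    posts: List[Post],
    min_junk_engagement: int = 500,  # hypothetical virality cutoff
    max_junk_tokens: int = 15,       # hypothetical "short post" cutoff
) -> Tuple[List[Post], List[Post]]:
    """Partition posts into a 'junk' bucket (short, highly viral) and a
    'control' bucket (longer, low engagement) using surface signals only."""
    junk, control = [], []
    for p in posts:
        n_tokens = len(p.text.split())
        if engagement_score(p) >= min_junk_engagement and n_tokens <= max_junk_tokens:
            junk.append(p)
        elif engagement_score(p) < min_junk_engagement and n_tokens > max_junk_tokens:
            control.append(p)
    return junk, control

# Toy usage: one viral one-liner, one longer low-engagement explainer.
posts = [
    Post("hot take: sleep is optional lol", likes=12_000, retweets=3_400, replies=900),
    Post("A longer thread explaining how transformer attention scales "
         "quadratically with sequence length and what that means for context windows.",
         likes=40, retweets=5, replies=2),
]
junk, control = split_corpus(posts)
print(len(junk), len(control))  # -> 1 1
```

Note that nothing in the filter inspects content quality, which is the point: the study ties the degradation to these engagement signals rather than to semantics.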
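
For the “thought skipping” failure mode, one crude way to surface it in model outputs is to count explicit step markers in an answer. The heuristic below (the regex, the marker list, and the example answers) is a hypothetical illustration, not the study's annotation method.

```python
import re

# Hypothetical markers for explicit reasoning steps at the start of a line:
# numbered items, "Step N", or common discourse connectives.
STEP_MARKERS = re.compile(
    r"(?mi)^\s*(?:\d+[.)]\s|step\s+\d+\b|first\b|then\b|next\b|finally\b|therefore\b)"
)

def count_reasoning_steps(answer: str) -> int:
    """Count explicit step markers; a sharp drop across a prompt set is one
    crude proxy for answers that skip intermediate reasoning."""
    return len(STEP_MARKERS.findall(answer))

full = (
    "Step 1: The balloon rises because hot air is less dense.\n"
    "Step 2: Lower density means the buoyant force exceeds the weight.\n"
    "Therefore, the answer is (B)."
)
skipped = "The answer is (B)."

print(count_reasoning_steps(full))     # -> 3
print(count_reasoning_steps(skipped))  # -> 0
```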