Particle: Forgetting Boosts Language Learning in Transformer Models

Overview

Researchers Abishek Thamma and Micha Heilbron modified Transformer language models to include gradual memory decay plus a short echoic buffer and trained them on the child-scale BabyLM benchmark.
Across training runs, the altered models achieved better language modeling scores and stronger syntactic generalization than standard Transformers.
The learning gains required preserving the most recent three to seven words in an echoic buffer alongside the memory decay mechanism.
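To make the interaction of decay and buffer concrete, here is a minimal sketch of how such a bias could be applied to one token's attention scores. It is an illustration under stated assumptions, not the authors' implementation: the function name `fleeting_attention_weights`, the linear penalty on scores, and the parameters `decay_rate` and `buffer_size` are all hypothetical, since this summary does not specify the exact form of the decay.

```python
import math

def fleeting_attention_weights(scores, decay_rate=0.5, buffer_size=5):
    """Attention weights for the current token under a hypothetical memory decay.

    scores: raw attention scores of the current token against every token
            up to and including itself, oldest first.
    The most recent `buffer_size` tokens (the assumed echoic buffer) keep
    their scores unchanged; older tokens are penalized in proportion to how
    far they lie beyond the buffer.
    """
    n = len(scores)
    biased = []
    for i, score in enumerate(scores):
        distance = (n - 1) - i                      # 0 for the current token
        beyond = max(0, distance - (buffer_size - 1))
        biased.append(score - decay_rate * beyond)  # assumed linear penalty

    # Numerically stable softmax over the biased scores.
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Eight tokens with identical raw scores and a three-token buffer:
# the last three weights stay equal, earlier ones shrink with age.
weights = fleeting_attention_weights([0.0] * 8, decay_rate=1.0, buffer_size=3)
```

With identical raw scores, any difference between the weights comes only from the decay, which makes the protected buffer region easy to see.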
However, surprisal estimates from these fleeting-memory models predicted human reading times less well than those from standard Transformers, and follow-up analyses found that no existing explanation accounts for this mismatch.
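For readers unfamiliar with the measure, surprisal is the negative log probability a model assigns to a word given its context; in reading-time studies it typically serves as a predictor of how long readers spend on each word. The sketch below shows only the surprisal computation. The function name `surprisal_bits` and the probabilities are made up for illustration and do not come from the paper.

```python
import math

def surprisal_bits(prob):
    """Surprisal in bits of a word the model assigned probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

# Hypothetical per-word probabilities from a language model.
word_probs = [0.5, 0.25, 0.01]
surprisals = [surprisal_bits(p) for p in word_probs]
```

Less probable words receive higher surprisal, which is the quantity that would then be related to measured reading times.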
The paper revives a long-standing cognitive idea that memory limits can aid learning and calls for replication, scaling tests, and studies of why learning gains do not translate to better modeling of human online processing.