New Algorithm Compresses Large Language Models for Local Device Use
Engineers from Princeton and Stanford develop a technique to run LLMs on phones and laptops, enhancing privacy and efficiency.
- The CALDERA algorithm removes redundancy and lowers the numerical precision of an LLM's weight matrices, shrinking models enough to store on local devices.
- By compressing data, the technique allows LLMs to operate on consumer-grade GPUs, lowering costs and energy use.
- The method combines low-precision representation (fewer bits per weight) with low-rank approximation (exploiting redundancy in the weight matrices), achieving better compression than previous methods that used either technique alone.
- In tests on Meta AI's Llama models, the technique improved performance by up to 5% on certain tasks relative to prior compression approaches.
- Running LLMs locally keeps data on the device, enhancing privacy by eliminating the need to send queries to centralized servers.
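The low-precision plus low-rank idea above can be illustrated with a toy sketch: quantize a weight matrix to a few bits, then add back a low-rank correction fitted to the quantization residual. This is a simplified illustration of the general decomposition W ≈ Q + LR, not the actual CALDERA algorithm; the matrix sizes, bit width, and rank below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix standing in for one LLM layer (hypothetical sizes).
W = rng.standard_normal((256, 128))

def quantize(M, bits=2):
    """Uniform scalar quantization to the given bit width (illustrative)."""
    levels = 2 ** bits
    lo, hi = M.min(), M.max()
    step = (hi - lo) / (levels - 1)
    return np.round((M - lo) / step) * step + lo

# Low-precision backbone: quantize W directly to 2 bits per entry.
Q = quantize(W, bits=2)

# Low-rank correction: keep the top-k singular directions of the residual.
U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
k = 16
L, R = U[:, :k] * s[:k], Vt[:k, :]

# Combined approximation W ≈ Q + L @ R; compare relative errors.
err_q = np.linalg.norm(W - Q) / np.linalg.norm(W)
err_qlr = np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W)
print(f"quantized only:       {err_q:.3f}")
print(f"quantized + low-rank: {err_qlr:.3f}")
```

The combined approximation always reconstructs the original matrix at least as well as quantization alone, while the low-rank factors add only k(m + n) extra numbers on top of the few-bit backbone.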