New Algorithm Compresses Large Language Models for Local Device Use
Engineers from Princeton and Stanford develop a technique to run LLMs on phones and laptops, enhancing privacy and efficiency.
- The CALDERA algorithm removes redundancy and lowers the numerical precision of an LLM's weight matrices, shrinking models enough to store on local devices.
- By compressing data, the technique allows LLMs to operate on consumer-grade GPUs, lowering costs and energy use.
- The method combines low-precision representation (fewer bits per weight) with low-rank approximation (exploiting redundancy in the weight matrices), achieving better compression than previous methods that used either technique alone.
- In tests on Meta AI's Llama models, the technique improved performance by up to 5% on certain tasks relative to prior compression approaches.
- Running LLMs locally keeps data on the device, enhancing privacy by eliminating the need to send queries to centralized servers.
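The low-precision plus low-rank idea above can be illustrated with a toy sketch: quantize a weight matrix to a few bits, then add back a low-rank correction fitted to the quantization residual. This is a simplified illustration of the general decomposition W ≈ Q + LR, not the actual CALDERA algorithm; the matrix sizes, bit width, and rank below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix standing in for one LLM layer (hypothetical sizes).
W = rng.standard_normal((256, 128))

def quantize(M, bits=2):
    """Uniform scalar quantization to the given bit width (illustrative)."""
    levels = 2 ** bits
    lo, hi = M.min(), M.max()
    step = (hi - lo) / (levels - 1)
    return np.round((M - lo) / step) * step + lo

# Low-precision backbone: quantize W directly to 2 bits per entry.
Q = quantize(W, bits=2)

# Low-rank correction: keep the top-k singular directions of the residual.
U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
k = 16
L, R = U[:, :k] * s[:k], Vt[:k, :]

# Combined approximation W ≈ Q + L @ R; compare relative errors.
err_q = np.linalg.norm(W - Q) / np.linalg.norm(W)
err_qlr = np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W)
print(f"quantized only:       {err_q:.3f}")
print(f"quantized + low-rank: {err_qlr:.3f}")
```

The combined approximation always reconstructs the original matrix at least as well as quantization alone, while the low-rank factors add only k(m + n) extra numbers on top of the few-bit backbone.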