Frontier Supercomputer Achieves 1 Trillion Parameter LLM Run

Using only about 8% of its GPUs, the world's fastest supercomputer trained a large language model whose size rivals GPT-4's.

  • The Frontier supercomputer, powered by AMD EPYC CPUs and Instinct GPUs, has completed a 1-trillion-parameter large language model (LLM) training run, a parameter count comparable to estimates for OpenAI's GPT-4.
  • The run used only about 3,000 of the supercomputer's 37,000 Instinct MI250X GPUs, roughly 8%, leaving substantial headroom for larger runs (see the sketch after this list for the memory arithmetic behind the GPU count).
  • Frontier, located at Oak Ridge National Laboratory in Tennessee, is currently the world's fastest supercomputer and the second most energy-efficient.
  • The team reached this scale by tuning the distributed-training setup, combining data, tensor, and pipeline parallelism to keep GPU utilization high, demonstrating that their strategies are effective for training LLMs.
  • Despite the impressive result, the team noted that more work is needed on efficient training performance for AMD GPUs, since most machine learning at this scale is done within Nvidia's CUDA hardware-software ecosystem.
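To ground the GPU counts above, here is a minimal back-of-envelope Python sketch (our illustration, not from the article) of the training-state memory for a 1-trillion-parameter model. It assumes the standard mixed-precision Adam recipe of 16 bytes per parameter and the MI250X's published 128 GB of HBM; the 8 × 16 × 24 parallel layout at the end is a hypothetical factorization, not the team's published configuration.

```python
# Back-of-envelope: memory footprint of training a 1T-parameter model.
# Assumes the common mixed-precision Adam recipe:
#   fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights (4 B)
#   + fp32 Adam first/second moments (4 B + 4 B) = 16 bytes per parameter.
PARAMS = 1_000_000_000_000
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

state_bytes = PARAMS * BYTES_PER_PARAM
print(f"Training state: ~{state_bytes / 1e12:.0f} TB")  # ~16 TB

# An MI250X carries 128 GB of HBM, so just holding the training
# state (ignoring activations and communication buffers) needs:
MI250X_HBM_BYTES = 128e9
print(f"Minimum GPUs for state alone: ~{state_bytes / MI250X_HBM_BYTES:.0f}")  # ~125

# A hypothetical 3D-parallel factorization of the ~3,000 GPUs cited
# in the article (illustrative numbers only, not the team's config):
tensor, pipeline, data = 8, 16, 24
print(f"tensor {tensor} x pipeline {pipeline} x data {data} "
      f"= {tensor * pipeline * data} GPUs")
```

The gap between the ~125-GPU floor and the ~3,000 GPUs actually used reflects activation memory, communication buffers, and the throughput needed to finish training in a reasonable time.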