Particle News: OpenAI Releases gpt-oss-safeguard Open-Weight Safety Models

Overview

The models come in 120B and 20B parameter sizes, fine-tuned from gpt-oss, with weights downloadable on Hugging Face under the Apache 2.0 license.
OpenAI built the system with Discord, SafetyKit, and ROOST and is launching it as a research preview to gather feedback from the safety community.
The approach interprets developer-supplied policies at inference time and produces chain-of-thought rationales to make classification decisions reviewable.
On an internal multi-policy benchmark, the 120B version outperformed GPT-5 on accuracy, yet OpenAI reports that traditional labeled-data classifiers still do better on complex tasks.
OpenAI warns that the reasoning can hallucinate and requires more compute, recommending selective use alongside fast classifiers, with community testing and a Dec. 8 San Francisco hackathon planned.