Particle.news

OpenAI Releases gpt-oss-safeguard Open-Weight Safety Models

The open-weight models apply platform policies at runtime with explainable reasoning.

Overview

  • The models come in 120B and 20B parameter sizes, fine-tuned from gpt-oss, with weights downloadable on Hugging Face under the Apache 2.0 license.
  • OpenAI built the system with Discord, SafetyKit, and ROOST and is launching it as a research preview to gather feedback from the safety community.
  • The approach interprets developer-supplied policies at inference time and produces chain-of-thought rationales to make classification decisions reviewable.
  • On an internal multi-policy benchmark, the 120B version outperformed GPT-5 on accuracy, yet OpenAI reports that traditional labeled-data classifiers still do better on complex tasks.
  • OpenAI warns that the models' reasoning can hallucinate and that they require more compute than fast classifiers, so it recommends using them selectively alongside those classifiers. Community testing and a Dec. 8 hackathon in San Francisco are planned.
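
For readers who want to picture the recommended setup, the selective-use pattern described above can be sketched roughly as follows. This is an illustrative sketch only, not OpenAI's implementation: the names `moderate`, `toy_fast_score`, `toy_reason`, the `Verdict` type, the thresholds, and the sample policy are all hypothetical stand-ins. In a real deployment, `fast_score` would be a trained classifier and `reason` would send the policy and the content to a reasoning model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    label: str       # "allow" or "violation"
    rationale: str   # explanation from the reasoning step; empty on the fast path
    escalated: bool  # True if the slower reasoning step was used


def moderate(
    text: str,
    policy: str,
    fast_score: Callable[[str], float],
    reason: Callable[[str, str], Tuple[str, str]],
    low: float = 0.2,
    high: float = 0.8,
) -> Verdict:
    """Use the cheap classifier for clear cases; escalate uncertain ones."""
    score = fast_score(text)
    if score <= low:
        return Verdict("allow", "", False)
    if score >= high:
        return Verdict("violation", "", False)
    # Uncertain band: pay for the slower, policy-reading step.
    label, rationale = reason(policy, text)
    return Verdict(label, rationale, True)


# Hypothetical stand-ins so the sketch runs without any model.
def toy_fast_score(text: str) -> float:
    lowered = text.lower()
    if "free crypto" in lowered:
        return 0.95
    if "crypto" in lowered:
        return 0.5
    return 0.05


def toy_reason(policy: str, text: str) -> Tuple[str, str]:
    # A real system would pass the policy and text to a reasoning model here.
    if "giveaway" in text.lower():
        return "violation", "Mentions a giveaway, which the policy prohibits."
    return "allow", "No clause of the policy applies."


POLICY = "Flag posts that promote cryptocurrency giveaways."
```

Because the policy is passed in as plain text on each call, it could be revised without retraining anything, and the rationale returned on escalated cases gives a human reviewer something to check. Given the hallucination caveat, that rationale is best treated as a lead for review and not as ground truth.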