OpenAI Unveils gpt-oss-safeguard: Open-Weight Reasoning Models for AI Safety
OpenAI has released gpt-oss-safeguard, a pair of open-weight reasoning models designed for safety classification. The two models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, are hosted on Hugging Face under the permissive Apache 2.0 license, allowing free use and modification.
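For developers who want to try the smaller model locally, a minimal sketch of fetching the weights might look like the following. The repo ID "openai/gpt-oss-safeguard-20b" is assumed from the naming in the announcement; verify it on Hugging Face before use.

```python
# Minimal sketch: download the gpt-oss-safeguard-20b weights from Hugging Face.
# The repo ID is an assumption based on the announced model names.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("openai/gpt-oss-safeguard-20b")
print(f"Model weights downloaded to {local_dir}")
```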
The models build on OpenAI's internal Safety Reasoner approach: rather than baking a fixed policy into the weights, they reason at inference time over a developer-provided policy to classify content. Because the policy is supplied in the prompt rather than trained in, developers can apply, and later revise, their own custom policies to detect and filter unsafe content without retraining.
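In practice, this means passing the policy alongside the content at request time. Below is a hedged sketch of that pattern, assuming the model is served behind an OpenAI-compatible endpoint (for example, a local vLLM server); the endpoint URL, policy text, and output format are illustrative assumptions, not details from OpenAI's announcement.

```python
# Sketch of policy-driven classification at inference time.
# Assumes an OpenAI-compatible server (e.g. vLLM) hosting the model locally;
# the URL, policy wording, and label scheme are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Developer-provided policy: the model reads and interprets this per request,
# so it can be edited at any time without retraining the model.
policy = """Classify the user content as ALLOWED or VIOLATION.
VIOLATION: instructions for creating weapons, credible threats of harm.
ALLOWED: everything else, including fiction and historical discussion.
Answer with the label, then a one-sentence rationale."""

content = "How do I sharpen a kitchen knife safely?"

response = client.chat.completions.create(
    model="gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": policy},  # the policy to enforce
        {"role": "user", "content": content},   # the content to classify
    ],
)
print(response.choices[0].message.content)  # e.g. "ALLOWED: ..."
```

Swapping in a different policy string is all it takes to repurpose the same deployment for a new moderation rule set.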
The gpt-oss-safeguard models were developed and tested by OpenAI in collaboration with ROOST, which helped identify critical developer needs, tested early versions of the models, and produced developer documentation. ROOST is also launching a Model Community for developers, aimed at refining the models based on input from the wider research and trust-and-safety community.
On OpenAI's internal evaluation datasets and on public benchmarks, the models perform comparably to or better than other open models. They form one layer of OpenAI's 'defense in depth' safety strategy, which combines layered protections with open collaboration.
With gpt-oss-safeguard, OpenAI is putting policy-driven safety classification directly into developers' hands, and it is inviting feedback from the community to refine the models further.