OpenAI开源安全模型：经验不管用了，安全审核领域迎来推理革命？

2026-04-19 19:44:16分类：嘉峪关阅读(16137)

\u003cdiv class=\"rich_media_content\"\u003e\u003cdiv data-exeditor-arbitrary-box=\"image-box\"\u003e\u003c!--IMG_0--\u003e\u003c/div\u003e\u003cp\u003e美国当地时间10月29日，OpenAI突然发布了开源安全模型gpt-oss-safeguard的研究预览版。这不仅是一次模型更新，更是一场理念上的突破。\u003c/p\u003e\u003cp\u003e在AI飞速发展的当下，如何让机器“懂得安全”正成为最棘手的问题。面对全球数亿用户与日益高压的监管环境，OpenAI首次将其核心的安全推理技术向全球开发者开放，向“让AI守护AI”的方向迈出关键一步。\u003c/p\u003e\u003cp\u003e这次更新很及时，因为就在28日，OpenAI曾亲口承认，每周有数十万用户向ChatGPT发起涉及自残、精神健康等高风险话题的对话。这一数字让整个业界警醒——AI不仅在创造内容，也在直面人类的脆弱。\u003c/p\u003e\u003cp\u003egpt-oss-safeguard正是针对安全分类任务进行的专门优化，可用于内容审核、风险检测等多种场景，被视为从“经验法则式的过滤”，迈向“推理驱动的判断”的重要转折。\u003c/p\u003e\u003cp\u003e这是否意味着AI终于学会了自我约束？又会为开发者带来怎样的全新工具与责任？这场“开源的安全实验”，或许才刚刚开始。\u003c/p\u003e\u003ch2\u003e\u003c!--HPOS_0--\u003e推理式安全：内容审核不用再靠经验了\u003c/h2\u003e\u003cp\u003eOpenAI此次推出了两款模型，分别称为gpt-oss-safeguard-120b和gpt-oss-safeguard-20b，它们都在今年8月发布的gpt-oss开源模型基础上进行微调。\u003c/p\u003e\u003cp\u003e同时，这些模型也都在宽松的Apache 2.0许可证下开放，这意味着任何人都可以自由使用、微调以及部署它们。现在，两款模型已经可从Hugging Face下载。\u003c/p\u003e\u003cp\u003e与传统分类器完全不同，gpt-oss-safeguard引入了一种基于推理的全新安全方法。传统的安全分类器依赖大量手动标记的示例来间接推断决策边界，一旦策略需要更新，就必须进行耗时且昂贵的重新训练。\u003c/p\u003e\u003cdiv data-exeditor-arbitrary-box=\"image-box\"\u003e\u003c!--IMG_1--\u003e\u003c/div\u003e\u003cp style=\"text-align: center\" class=\"qqnews_image_desc\"\u003e\u003c!--NO_READ_BEGIN--\u003e\u003cspan style=\"font-size: 14px\"\u003e\u003cspan style=\"color: rgb(102, 102, 102)\"\u003e 图：gpt-oss-safeguard工作原理\u003c/span\u003e\u003c/span\u003e\u003c!--NO_READ_END--\u003e\u003c/p\u003e\u003cp\u003e而gpt-oss-safeguard采取了完全不同的工作流程：它同时接收开发者编写的策略，以及需要判断的内容。模型使用思维链（Chain-of-Thought）过程，直接对策略进行逻辑推理，从而得出分类结论。\u003c/p\u003e\u003cp\u003e这种设计使开发者能够划定最适合其用例的策略界限，且策略是在推理期间提供的，而非被固化在模型内部。这意味着开发者可以轻松地迭代修订策略，以应对快速演变的安全风险。\u003c/p\u003e\u003cp\u003e例如，一个视频游戏论坛可以使用它来检测讨论作弊的帖子，或者产品评论网站可以筛选出看似虚假的评论。开发者可以审查模型的推理过程，这为审核决策提供了清晰的追踪路径，带来了极高的透明度和适应性。\u003c/p\u003e\u003cp\u003eOpenAI强调，这种方法对于防范两类风险特别有效：一是新出现、其危害性尚不明确的威胁；二是非常微妙、难以简单判定的问题。\u003c/p\u003e\u003ch2\u003e\u003c!--HPOS_1--\u003e社区共建：与安全机构携手打磨开源利器\u003c/h2\u003e\u003cp\u003eOpenAI此次发布gpt-oss-safeguard预览版，目的是接收来自研究和安全社区的反馈，并进一步提升模型性能。\u003c/p\u003e\u003cp\u003eOpenAI与Discord、SafetyKit和ROOST等多个信任与安全组织进行了合作开发。通过这种合作，OpenAI在构建安全工具时，能直接汲取来自一线用户的反馈。\u003c/p\u003e\u003cp\u003eROOST首席技术官Vinay Rao称赞该模型是首个采用“自带策略和伤害定义”设计的开源推理模型。他指出，在测试中，模型熟练地理解不同策略、解释推理过程，并在应用策略时展现出细微差别。\u003c/p\u003e\u003cp\u003e作为此次发布的一部分，ROOST正在GitHub上建立ROOST模型社区（RMC），旨在将安全从业者和研究人员聚集起来，共同探索用于保护网络空间的开源AI模型，推动安全工具的普及。\u003c/p\u003e\u003cp\u003eROOST总裁Camille François对此表示赞同：\u0026#34;随着AI变得更加强大，安全工具和基础安全研究必须同步快速发展，而且必须对所有人开放。\u0026#34;\u003c/p\u003e\u003ch2\u003e\u003c!--HPOS_2--\u003e核心技术解密：内部安全推理引擎首次“拆解”\u003c/h2\u003e\u003cp\u003egpt-oss-safeguard背后的技术并非空中楼阁，它源自OpenAI内部使用的、更为强大的安全工具Safety Reasoner。\u003c/p\u003e\u003cp\u003eSafety Reasoner是OpenAI “深度防御”安全堆栈的核心组件，通过对策略标记任务进行强化微调，教会模型直接推理策略如何导致其判断。这种方法使OpenAI能够在生产环境中动态更新安全策略，所需时间远少于重新训练传统分类器。\u003c!--MID_AD_0--\u003e\u003c!--EOP_0--\u003e\u003c/p\u003e\u003c!--PARAGRAPH_0--\u003e\u003cp\u003e在GPT-5和Sora 2等模型的安全保障中，Safety Reasoner都发挥着至关重要的作用。而面对自残等高危领域，该系统作为最终审查环节，能够依据精细化的分类标准对模型输出进行精准归类，从而确定最安全的响应方案。\u003c/p\u003e\u003cp\u003e通过将这种方法开源，OpenAI正把其内部安全策略的核心部分开放给更多开发者，此举将有效推动行业整体安全标准的快速提升。\u003c/p\u003e\u003ch2\u003e\u003c!--HPOS_3--\u003e性能实测：小模型如何超越GPT-5-thinking？\u003c/h2\u003e\u003cp\u003eOpenAI通过内部与外部双重评估体系，对gpt-oss-safeguard模型进行了系统性测试。\u003c/p\u003e\u003cp\u003e在内部评估中，研究人员采用多策略并行测试方法：在推理过程中同时向模型输入多个安全策略，要求其对同一测试文本在所有给定策略下都能做出准确分类。这项测试要求极为严苛，只有当模型的判断结果与所有策略的标准标注完全一致时，才被认定为准确。\u003c/p\u003e\u003cdiv data-exeditor-arbitrary-box=\"image-box\"\u003e\u003c!--IMG_2--\u003e\u003c/div\u003e\u003cp style=\"text-align: center\" class=\"qqnews_image_desc\"\u003e\u003c!--NO_READ_BEGIN--\u003e\u003cspan style=\"font-size: 14px\"\u003e\u003cspan style=\"color: rgb(102, 102, 102)\"\u003e 图：gpt-oss-safeguard模型在内部审核评估（多策略准确率）中的表现\u003c/span\u003e\u003c/span\u003e\u003c!--NO_READ_END--\u003e\u003c/p\u003e\u003cp\u003e测试结果显示，gpt-oss-safeguard与内部专用工具Safety Reasoner在多策略准确率方面均超越了gpt-5-thinking及开源模型gpt-oss。特别值得关注的是，考虑到gpt-oss-safeguard的模型规模相对较小，其性能能够超越规模更大的gpt-5-thinking，这一表现令人惊喜。\u003c!--MID_AD_1--\u003e\u003c!--EOP_1--\u003e\u003c/p\u003e\u003c!--PARAGRAPH_1--\u003e\u003cp\u003e为了确保评估的全面性，OpenAI 还选取了两个权威基准进行测试：一是该公司2022年研究论文中发布的官方审核数据集，二是基于开源聊天机器人用户查询构建的公共基准ToxicChat。\u003c/p\u003e\u003cp\u003e在2022年审核数据集测试中，研究团队采用经过优化的内部安全策略进行评估。gpt-oss-safeguard以微弱优势超越了所有参比模型，包括Safety Reasoner和gpt-5-thinking。不过需要说明的是，Safety Reasoner与gpt-oss-safeguard之间的性能差异并不大。\u003c/p\u003e\u003cdiv data-exeditor-arbitrary-box=\"image-box\"\u003e\u003c!--IMG_3--\u003e\u003c/div\u003e\u003cp style=\"text-align: center\" class=\"qqnews_image_desc\"\u003e\u003c!--NO_READ_BEGIN--\u003e\u003cspan style=\"font-size: 14px\"\u003e\u003cspan style=\"color: rgb(102, 102, 102)\"\u003e 图：gpt-oss-safeguard模型在两个基准测试中的性能表现\u003c/span\u003e\u003c/span\u003e\u003c!--NO_READ_END--\u003e\u003c/p\u003e\u003cp\u003e在ToxicChat基准测试中，研究团队使用基于内部策略改编的简明提示进行验证。Safety Reasoner继续保持领先优势，其表现优于 gpt-5-thinking，而这两者又都略胜于gpt-oss-safeguard的两个版本。虽然绝对性能稍有差距，但gpt-oss-safeguard凭借其适中的模型规模，在实际部署中仍具有独特的优势，特别是在计算资源受限或需要快速迭代的场景下表现尤为突出。\u003c!--MID_AD_2--\u003e\u003c!--EOP_2--\u003e\u003c/p\u003e\u003c!--PARAGRAPH_2--\u003e\u003ch2\u003e\u003c!--HPOS_4--\u003e现实挑战：在推理成本与性能之间寻求平衡\u003c/h2\u003e\u003cp\u003e尽管gpt-oss-safeguard创新显著，OpenAI仍自曝了其两大局限性。\u003c/p\u003e\u003cp\u003e首先，对于更复杂的风险，由数万个高质量标记样本训练出的专用分类器，在性能上仍然可能优于gpt-oss-safeguard直接推理的性能。因此，对于需要更高性能的复杂风险场景，训练专用分类器仍是首选。\u003c/p\u003e\u003cp\u003e其次，gpt-oss-safeguard的推理过程相对耗时且计算密集，这使得在需要低延迟处理海量内容的平台上进行扩展具有挑战性。\u003c/p\u003e\u003cp\u003e在内部，OpenAI通过混合部署来解决这一问题：他们使用更小、更快的分类器来首先识别潜在问题内容，然后才将其传递给更强大的Safety Reasoner进行详细、高成本的分析。这种分层部署模式，无疑将成为未来开发者借鉴和采用的主流安全架构。\u003c!--MID_AD_3--\u003e\u003c!--EOP_3--\u003e\u003c/p\u003e\u003c!--PARAGRAPH_3--\u003e\u003cp\u003egpt-oss-safeguard 的发布，为 AI 安全系统带来了透明化、定制化和社区化的新方向。这款推理模型是开发者手中的一把利器，但其计算成本和性能边界依然存在挑战。那么，在全球开发者广泛采用这一工具后，能否真正迎来一个由策略而非数据主导的AI安全新时代？（\u003cstrong\u003e文/腾讯科技特约编译金鹿，编辑/涵清\u003c/strong\u003e）\u003c!--MID_AD_4--\u003e\u003c!--EOP_4--\u003e\u003c/p\u003e\u003c!--PARAGRAPH_4--\u003e\u003cdiv powered-by=\"qqnews_ex-editor\"\u003e\u003c/div\u003e\u003cstyle\u003e.rich_media_content{--news-tabel-th-night-color: #444444;--news-font-day-color: #333;--news-font-night-color: #d9d9d9;--news-bottom-distance: 22px}.rich_media_content p:not([data-exeditor-arbitrary-box=image-box]){letter-spacing:.5px;line-height:30px;margin-bottom:var(--news-bottom-distance);word-wrap:break-word}.rich_media_content{color:var(--news-font-day-color);font-size:18px}@media(prefers-color-scheme:dark){body:not([data-weui-theme=light]):not([dark-mode-disable=true]) .rich_media_content p:not([data-exeditor-arbitrary-box=image-box]){letter-spacing:.5px;line-height:30px;margin-bottom:var(--news-bottom-distance);word-wrap:break-word}body:not([data-weui-theme=light]):not([dark-mode-disable=true]) .rich_media_content{color:var(--news-font-night-color)}}.data_color_scheme_dark .rich_media_content p:not([data-exeditor-arbitrary-box=image-box]){letter-spacing:.5px;line-height:30px;margin-bottom:var(--news-bottom-distance);word-wrap:break-word}.data_color_scheme_dark .rich_media_content{color:var(--news-font-night-color)}.data_color_scheme_dark .rich_media_content{font-size:18px}.rich_media_content p[data-exeditor-arbitrary-box=image-box]{margin-bottom:11px}.rich_media_content\u003ediv:not(.qnt-video),.rich_media_content\u003esection{margin-bottom:var(--news-bottom-distance)}.rich_media_content hr{margin-bottom:var(--news-bottom-distance)}.rich_media_content .link_list{margin:0;margin-top:20px;min-height:0!important}.rich_media_content blockquote{background:#f9f9f9;border-left:6px solid #ccc;margin:1.5em 10px;padding:.5em 10px}.rich_media_content blockquote p{margin-bottom:0!important}.data_color_scheme_dark .rich_media_content blockquote{background:#323232}@media(prefers-color-scheme:dark){body:not([data-weui-theme=light]):not([dark-mode-disable=true]) .rich_media_content blockquote{background:#323232}}.rich_media_content ol[data-ex-list]{--ol-start: 1;--ol-list-style-type: decimal;list-style-type:none;counter-reset:olCounter calc(var(--ol-start,1) - 1);position:relative}.rich_media_content ol[data-ex-list]\u003eli\u003e:first-child::before{content:counter(olCounter,var(--ol-list-style-type)) '. ';counter-increment:olCounter;font-variant-numeric:tabular-nums;display:inline-block}.rich_media_content ul[data-ex-list]{--ul-list-style-type: circle;list-style-type:none;position:relative}.rich_media_content ul[data-ex-list].nonUnicode-list-style-type\u003eli\u003e:first-child::before{content:var(--ul-list-style-type) ' ';font-variant-numeric:tabular-nums;display:inline-block;transform:scale(0.5)}.rich_media_content ul[data-ex-list].unicode-list-style-type\u003eli\u003e:first-child::before{content:var(--ul-list-style-type) ' ';font-variant-numeric:tabular-nums;display:inline-block;transform:scale(0.8)}.rich_media_content ol:not([data-ex-list]){padding-left:revert}.rich_media_content ul:not([data-ex-list]){padding-left:revert}.rich_media_content table{display:table;border-collapse:collapse;margin-bottom:var(--news-bottom-distance)}.rich_media_content table th,.rich_media_content table td{word-wrap:break-word;border:1px solid #ddd;white-space:nowrap;padding:2px 5px}.rich_media_content table th{font-weight:700;background-color:#f0f0f0;text-align:left}.rich_media_content table p{margin-bottom:0!important}.data_color_scheme_dark .rich_media_content table th{background:var(--news-tabel-th-night-color)}@media(prefers-color-scheme:dark){body:not([data-weui-theme=light]):not([dark-mode-disable=true]) .rich_media_content table th{background:var(--news-tabel-th-night-color)}}.rich_media_content .qqnews_image_desc,.rich_media_content p[type=om-image-desc]{line-height:20px!important;text-align:center!important;font-size:14px!important;color:#666!important}.rich_media_content div[data-exeditor-arbitrary-box=wrap]:not([data-exeditor-arbitrary-box-special-style]){max-width:100%}.rich_media_content .qqnews-content{--wmfont: 0;--wmcolor: transparent;font-size:var(--wmfont);color:var(--wmcolor);line-height:var(--wmfont)!important;margin-bottom:var(--wmfont)!important}.rich_media_content .qqnews_sign_emphasis{background:#f7f7f7}.rich_media_content .qqnews_sign_emphasis ol{word-wrap:break-word;border:none;color:#5c5c5c;line-height:28px;list-style:none;margin:14px 0 6px;padding:16px 15px 4px}.rich_media_content .qqnews_sign_emphasis p{margin-bottom:12px!important}.rich_media_content .qqnews_sign_emphasis ol\u003eli\u003ep{padding-left:30px}.rich_media_content .qqnews_sign_emphasis ol\u003eli{list-style:none}.rich_media_content .qqnews_sign_emphasis ol\u003eli\u003ep:first-child::before{margin-left:-30px;content:counter(olCounter,decimal) ''!important;counter-increment:olCounter!important;font-variant-numeric:tabular-nums!important;background:#37f;border-radius:2px;color:#fff;font-size:15px;font-style:normal;text-align:center;line-height:18px;width:18px;height:18px;margin-right:12px;position:relative;top:-1px}.data_color_scheme_dark .rich_media_content .qqnews_sign_emphasis{background:#262626}.data_color_scheme_dark .rich_media_content .qqnews_sign_emphasis ol\u003eli\u003ep{color:#a9a9a9}@media(prefers-color-scheme:dark){body:not([data-weui-theme=light]):not([dark-mode-disable=true]) .rich_media_content .qqnews_sign_emphasis{background:#262626}body:not([data-weui-theme=light]):not([dark-mode-disable=true]) .rich_media_content .qqnews_sign_emphasis ol\u003eli\u003ep{color:#a9a9a9}}.rich_media_content h1,.rich_media_content h2,.rich_media_content h3,.rich_media_content h4,.rich_media_content h5,.rich_media_content h6{margin-bottom:var(--news-bottom-distance);font-weight:700}.rich_media_content h1{font-size:20px}.rich_media_content h2,.rich_media_content h3{font-size:19px}.rich_media_content h4,.rich_media_content h5,.rich_media_content h6{font-size:18px}.rich_media_content li:empty{display:none}.rich_media_content ul,.rich_media_content ol{margin-bottom:var(--news-bottom-distance)}.rich_media_content div\u003ep:only-child{margin-bottom:0!important}.rich_media_content .cms-cke-widget-title-wrap p{margin-bottom:0!important}\u003c/style\u003e\u003c/div\u003e

未经允许不得转载：>干净利索网»OpenAI开源安全模型：经验不管用了，安全审核领域迎来推理革命？