
Anthropic proposes a common jailbreak framework
Anthropic used the announcement to call for an industry-wide framework to assess AI jailbreaks, saying developers and governments currently lack a common standard for evaluating newly discovered techniques.
“There’s currently no consensus in the AI industry on how to describe, in objective terms, the severity of an AI jailbreak,” the company said. Anthropic said it is working with Amazon, Microsoft, Google, and other Project Glasswing partners on a framework for evaluating jailbreaks. It is also expanding its collaboration with the US government through pre-release testing of future frontier models, information sharing, and joint AI security research.
“Government involvement in AI releases requires a durable, transparent process that gives cyber defenders and others the certainty they need about access to powerful models,” Anthropic said. “These rules should be codified in strong regulation and applied equally across frontier model developers.”