Meet the Hackers Bulletproofing AI Giants: How This Team Is Fortifying OpenAI and Meta Against Digital Threats

BY MODERN OPULENT GAZETTE
Oct 31, 2024

Photo of Gray Swan's founders Zico Kolter, Matt Fredrikson and Andy Zou

A new wave of AI security is emerging, led by the team at Gray Swan AI. Founded by Carnegie Mellon researchers Zico Kolter and Matt Fredrikson and PhD student Andy Zou, Gray Swan emerged after the team uncovered vulnerabilities in today’s most sophisticated AI systems. Now, the company’s mission is clear: to bulletproof AI models against harmful use and unexpected breakdowns. Gray Swan AI is positioning itself as a guardian for tech giants like OpenAI, Anthropic, Google, and Meta. Formed last September, the startup aims to strengthen AI systems through advanced safeguards, building a new layer of trust in artificial intelligence.

It all started when the founding team discovered how subtle prompt alterations could trick major AI models into producing dangerous outputs and evading their safety protocols. For example, simple tweaks like adding characters to a prompt could unlock restricted responses from a model. Recognising the significance of this flaw, Kolter, who now chairs OpenAI’s Safety and Security Committee, joined Fredrikson and Zou in launching Gray Swan to address these security gaps head-on and preempt potential misuse. The team quickly proved itself, establishing partnerships with major tech players and securing contracts with organisations like the U.K.’s AI Safety Institute.

Redefining AI Security: Gray Swan’s "Cygnet" Model

At the heart of Gray Swan’s offerings is Cygnet, a groundbreaking model equipped with “circuit breakers” to prevent jailbreaks. These circuit breakers serve as trip wires, triggering automated responses to potentially harmful and sensitive prompts. This approach has shown remarkable promise; during a recent hacker event, Cygnet’s resilience stood out, with only two successful breach attempts. Dan Hendrycks, an advisor to Gray Swan, likened the model’s defenses to an “allergic reaction” that blocks inappropriate responses before they escalate.

Gray Swan recently hosted a “jailbreaking arena,” inviting more than 600 hackers to try their best to trick popular AI models into producing prohibited content. Participants tested the limits of these models, attempting to coerce them into unsafe outputs ranging from guides on illicit activities to deceptive content and misinformation campaigns.

This kind of “red teaming,” increasingly popular in AI safety circles, aims to simulate real-world abuse, identifying where AI systems could fail so that their defenses can be refined accordingly. For Gray Swan, these events underscore the importance of continuous testing to anticipate new risks and fine-tune model defenses.

Expanding the AI Security Toolbox

Alongside Cygnet, Gray Swan has developed “Shade,” a tool that automates AI vulnerability assessment. Used to probe and identify weaknesses in OpenAI’s models, Shade offers a proactive approach to safety, helping AI developers fortify their models without relying solely on human red teams. That is an essential feature for clients like OpenAI, whose systems face constant exposure to diverse user prompts. Shade has become part of Gray Swan’s broader mission: to create a comprehensive security arsenal for AI.

Gray Swan has ambitious plans to expand its reach, having recently secured $5.5 million in seed funding while eyeing further investment. The startup plans to keep building its community of AI testers and hosting “red teaming” events to keep pushing the boundaries of AI safety. With red teaming becoming a staple of AI development, and endorsed by governments such as that of the U.S., demand for Gray Swan’s expertise is expected to grow.

With its groundbreaking work, Gray Swan is rapidly establishing itself as a critical player in safeguarding the future of AI, working to ensure that these powerful models are used safely across technology, business, and society.