Following our series of blog posts on “Best Practices for Evaluations and Evaluation Suites”, this post introduces our state-of-the-art Evaluation Platform. The platform is already actively deployed, assisting multiple top frontier labs in measuring the risks associated with AI systems through cutting-edge empirical testing. In this post, we highlight our security evaluations, one facet of the platform.
Pattern Labs researcher Yoni Rozenshein will present a short talk at BlueHat IL 2025 on the capability (and incapability) of leading AI models to perform seemingly trivial tasks in vulnerability discovery & exploit development.
Pattern Labs participated in the security evaluation of Anthropic's Claude 3.7 Sonnet model using our state-of-the-art cyber evaluation suite. The evaluation also used the SOLVE scoring system we recently introduced. Our real-world attack simulations tested capabilities across the entire cyber kill chain, supporting the responsible development of this frontier AI model.
Modern AI systems possess significant capabilities across various domains. In cybersecurity, these systems can perform complex tasks such as vulnerability research, log analysis, and security architecture design. Many of these capabilities are inherently dual-use: they can be employed both defensively to protect systems and offensively to cause harm. This dual-use nature creates a significant challenge for AI system providers and policy makers.
Pattern Labs CEO Dan Lahav co-delivered the keynote "The AI Security Landscape" with Sella Nevo (RAND) at the Paris AI Security Forum ‘25, a satellite event of the Paris AI Action Summit. The forum also featured Yoshua Bengio (Turing Award winner), David 'davidad' Dalrymple (ARIA), Xander Davies (AISI), and many others, and aimed to accelerate both our understanding of the critical importance of securing frontier AI models and practical approaches to doing so.
We introduce a new scoring system for assessing the difficulty of vulnerability discovery & exploit development challenges. The system provides a framework for judging how complicated it is to discover vulnerabilities, and to develop working exploits for them, within an end-to-end challenge.
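To make the idea of such a scoring framework concrete, here is a minimal, purely illustrative sketch in Python of how per-component difficulty ratings for an end-to-end challenge might be combined into a single score. The component names, scale, and weights are hypothetical and invented for this example; they do not reflect the actual SOLVE rubric.

```python
from dataclasses import dataclass

# Hypothetical illustration only: the components, 0-10 scale, and weights below
# are invented for this sketch and do not represent the actual SOLVE rubric.

@dataclass
class ChallengeScore:
    """Per-component difficulty ratings for an end-to-end challenge (0 = trivial, 10 = expert)."""
    vulnerability_discovery: int  # how hard it is to find the flaw
    exploit_development: int      # how hard it is to turn the flaw into a working exploit
    environment_complexity: int   # obstacles in the surrounding system (hardening, noise, etc.)

    def overall(self) -> float:
        """Weighted average: one possible way to collapse components into a single difficulty."""
        weights = {
            "vulnerability_discovery": 0.4,
            "exploit_development": 0.4,
            "environment_complexity": 0.2,
        }
        return sum(getattr(self, name) * weight for name, weight in weights.items())


if __name__ == "__main__":
    challenge = ChallengeScore(
        vulnerability_discovery=6,
        exploit_development=8,
        environment_complexity=3,
    )
    print(f"Overall difficulty: {challenge.overall():.1f} / 10")
```

A weighted average is just one aggregation choice; a real rubric might instead use the hardest step as the dominant factor, since an end-to-end exploit is only as achievable as its most difficult stage.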
The Pattern Labs researchers' paper "What Makes an Evaluation Useful? Key Guidelines and Best Practices" was accepted to the Conference on Frontier AI Safety Frameworks. The paper synthesizes and updates parts of the blog post series we published in the autumn of 2024, and will be published in the conference proceedings. Our researchers also took part in the conference workshop discussing the most pressing challenges in the design and implementation of frontier AI safety frameworks.
This is the third and final part in our series outlining the best practices for the design and creation of evaluations and evaluation suites.
This is the second part in our series outlining the best practices for the design and creation of evaluations and evaluation suites.
We believe that quality evaluation suites are crucial to labs’ and governments’ ability to make policy, both in the short and long term. While considerable academic research has been done on evaluating AI models, especially since the breakthroughs in LLMs, we have seen comparatively little written about assessing the evaluations themselves.
At Pattern Labs, we’ve been focusing some of our efforts on evaluating the cybersecurity capabilities of frontier models. One of the first questions we tackled was how to define these capabilities in a meaningful and useful way. The following describes the taxonomy we are currently using internally; while it is constantly evolving and a work in progress, we believe it is mature enough to be useful to others as well.
We're excited to share that Pattern Labs was covered in Forbes!
Yoni Rozenshein's BlueHat IL 2024 talk is about our philosophy for evaluating the dangerous cyber capabilities of AI models, how we actually do it (let's make an LLM play CTF!), and who cares about it (governments and frontier AI labs).