In today's rapidly evolving AI landscape, precisely evaluating the capabilities of advanced AI systems has become a critical security concern. Although new benchmarks are constantly being developed and published, a significant challenge remains: converting raw evaluation results into meaningful capability levels that can feed into a broader risk evaluation system. This blog post presents a concrete framework for translating evaluation results into capability levels, enabling the assessment of risk levels for AI models.
Before diving into our capability measurement framework, it's important to understand why careful capability assessment matters.
While capabilities are inherently important to measure, in our context they serve as a critical component within a larger risk assessment system. Our approach addresses this complexity by establishing capability measurement as the first building block of a broader assessment methodology. As illustrated in our framework diagram below, we begin by testing AI models on various evaluations, whose results can be analyzed to determine specific capability levels (the left branch). While this process tells us what an AI system can do under controlled conditions, translating these measurements into a meaningful risk categorization requires additional context. That context comes from threat modeling, which determines what level of capability becomes concerning in specific scenarios (the right branch in our diagram). This relationship shows that both capability measurement and threat modeling are essential for meaningful risk assessment.
In the sections that follow, we'll detail our systematic methodology for quantifying AI capabilities, focusing on the capability assessment branch of our framework. This capability measurement and analysis process is key when trying to conduct a comprehensive risk assessment.
As is generally agreed within the AI security community, the systematic assessment of AI capabilities requires a robust evaluation framework. The methodology outlined here focuses on converting raw evaluation data into meaningful capability assessments that can be mapped to specific risk levels.
Since we have previously covered evaluation design, we will only briefly touch on it here. In general, evaluations should satisfy several criteria in order to be useful.
To account for the performance inconsistency of AI systems, we recommend running each evaluation multiple times.
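One simple way to aggregate repeated runs is to average each task's outcomes across runs. The sketch below is illustrative (the function name and data layout are our own, not part of the framework): it takes a list of runs, where each run records per-task success as booleans, and returns a per-task success rate.

```python
def aggregate_runs(run_results: list[list[bool]]) -> list[float]:
    """Average per-task success over repeated evaluation runs.

    run_results[i][j] is whether task j succeeded on run i.
    Returns one success rate per task, in [0.0, 1.0].
    """
    n_runs = len(run_results)
    n_tasks = len(run_results[0])
    return [
        sum(run[task] for run in run_results) / n_runs
        for task in range(n_tasks)
    ]
```

For example, a task solved in all three runs scores 1.0, while a task solved in one of three runs scores roughly 0.33, smoothing out run-to-run inconsistency.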
Note that the following methodology focuses exclusively on capability evaluations: measuring concrete task performance under controlled conditions. Other evaluation types, such as jailbreak testing, fall outside the scope of this framework.
We employ a two-step analysis process:
Across our evaluation set, we calculate the percentage of successful attempts for evaluations at each difficulty level. Each difficulty level contains its own set of distinct evaluations, designed to test capabilities at that specific complexity threshold. These calculated success rates are plotted against their corresponding difficulty levels to generate a capability performance curve. The derived curve illustrates how the system's performance varies across difficulty thresholds.
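The first step above can be sketched as a small aggregation over (difficulty, outcome) records. This is a minimal illustration under our own assumed data layout, not the framework's actual tooling: each evaluation attempt is a pair of its difficulty level and whether it succeeded, and the output maps each difficulty level to its success rate, which can then be plotted as the capability performance curve.

```python
from collections import defaultdict

def success_rates_by_difficulty(results):
    """results: iterable of (difficulty_level, succeeded) pairs.

    Returns {difficulty_level: fraction of successful attempts},
    sorted by difficulty so it can be plotted directly as a curve.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for difficulty, succeeded in results:
        attempts[difficulty] += 1
        successes[difficulty] += int(succeeded)
    return {d: successes[d] / attempts[d] for d in sorted(attempts)}
```

Plotting the returned rates against their difficulty keys yields the performance curve described above.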
The resulting graph typically displays a non-increasing trend line with high success rates at lower difficulty levels, declining performance as difficulty increases, and near-zero success rates at maximum difficulty. The curve usually shows three key zones: an easy range where the AI system performs at a nearly 100% success rate, a challenging middle zone where performance drops off, and a boundary beyond which tasks become, with high probability, too difficult for the AI system to handle.
The graph above encodes a lot of information about the relationship between task difficulty (for the specific capability tested) and success rate. In practice, this information usually needs to be distilled into one clear number indicating the AI system's capability level. Although nuance is inevitably lost, a single decision-relevant number is often needed for practical reasons. The method we suggest is to set a predefined success rate threshold (e.g. 10%) and take the difficulty level at which the curve crosses that threshold. This gives an intuitive marker for the difficulty at which the model retains a reasonable ability to solve challenges, where "reasonable" is defined by the chosen threshold.
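The threshold-crossing rule can be sketched as follows. This is a simplified interpretation, assuming a roughly non-increasing curve: we take the highest difficulty level whose success rate is still at or above the threshold (how to handle noisy, non-monotonic curves is a design choice the framework leaves open).

```python
def capability_level(curve, threshold=0.10):
    """curve: {difficulty_level: success_rate}, assumed roughly non-increasing.

    Returns the highest difficulty level whose success rate is still
    at or above the threshold, or None if no level is cleared.
    """
    passing = [d for d, rate in sorted(curve.items()) if rate >= threshold]
    return passing[-1] if passing else None
```

With a 10% threshold, a system scoring 100%, 60%, 15%, and 2% on difficulty levels 1 through 4 would be assigned capability level 3.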
The capability levels we measure feed into our risk assessment process. Usually, each threat model can be realized through multiple different combinations of capabilities, where each combination defines a separate set of minimum requirements. An AI system poses the corresponding risk if it meets any one of these requirement sets, making it capable of executing the threat model through at least one path. As a simple example, if the threat model of concern is fully autonomous AI cyber agents conducting operations against large corporate networks, a model that can only spear-phish is irrelevant, but even basic lateral movement and exploitation knowledge changes the risk it poses dramatically.
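The "any one requirement set suffices" logic can be expressed as a short check. This is a sketch with hypothetical capability names; measured levels and minimum levels are assumed to be on the same scale produced by the threshold method above.

```python
def meets_threat_model(capabilities, requirement_sets):
    """capabilities: {capability_name: measured_level}.
    requirement_sets: list of {capability_name: minimum_level} dicts,
    one per path to realizing the threat model.

    The threat model is realizable if ANY requirement set is fully met.
    """
    return any(
        all(capabilities.get(cap, 0) >= min_level
            for cap, min_level in reqs.items())
        for reqs in requirement_sets
    )
```

In the cyber-agent example, a model with strong spear-phishing but no lateral movement or exploitation ability would fail a requirement set demanding both of the latter, while a model with basic levels of both would satisfy it.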
The detailed definition and formulation process of these threat models falls outside the scope of this blog post. While generic threat models exist, they should be tailored to each stakeholder's specific priorities and concerns. We believe conducting personalized, in-depth research on applicable threat models is essential for any comprehensive risk assessment framework, and we hope to address this topic more thoroughly in future publications.
The above framework provides a systematic approach to measuring and quantifying AI capabilities, and to integrating them with concrete threat models, in the context of risk assessment. This blog post focused mostly on one pillar of this process - translating evaluation results into meaningful capability levels. In the future, we plan to explore how these capability measurements can be mapped to specific threat models and risk thresholds, supporting more informed risk-level classification decisions.
Deriving Capability Levels From Evaluation Results © 2025 by Pattern Labs Tech Inc. Licensed under CC BY-NC-ND 4.0.
@misc{pl-deriving2025,
  title={Deriving Capability Levels From Evaluation Results},
  author={Pattern Labs},
  year={2025},
  howpublished={\url{https://patternlabs.co/blog/deriving-capability-levels-from-evaluation-results}},
}