Best Practices for Evaluations and Evaluation Suites: Part 1

Introduction and Goal

At Pattern Labs, a considerable amount of our time is spent designing and building evaluations for government and AI organizations. Consequently, we have been thinking about this topic for quite a while and want to share some of our internal insights.

We believe that quality evaluation suites are crucial for labs’ and governments’ policy making ability, both in the short and long term. While considerable academic research has been done on evaluating AI models, especially since the breakthrough in LLMs, we have seen comparatively little written about assessing the evaluations themselves.

As this topic is lengthy, we plan to break it down into three parts, with this being the first blog post in the series:

  1. Critical decision making before setting out to create an evaluation suite
  2. Qualities and benchmarks we believe should be followed in all evaluations and evaluation suites
  3. Different parameters and tradeoffs of evaluations and evaluation suites

To clarify, this series will focus exclusively on the characteristics of evaluations and evaluation suites. There are other important questions in the field which it will not address, such as:

  1. How much weight should be given to evaluations when assessing frontier models?
  2. Which evaluations should be open sourced, and which datasets should remain private?
  3. When should models be evaluated autonomously, and when as part of human uplift trials?

Critical Decision Making before Evaluation Creation

Before we get into what makes a good evaluation or evaluation suite for assessing model risks, there are a few key steps to consider:

  1. Defining the relevant decision making process the evaluations will feed into
  2. Employing threat modeling and specifying risk scenarios of interest
  3. Deriving the critical capabilities from the risk scenarios and threat models
  4. Setting appropriate capability thresholds

These steps are crucial for designing effective evaluations. In the rest of this blog post we will delve into each of these.

Defining the Decision Making Process

When thinking about evaluating models, the immediate questions that come to mind are “evaluating them for what?” and “why does this evaluation need to exist?”. Ideally, an evaluation does not exist in a vacuum, but exists for a concrete reason, as part of some larger structure of tests¹, and, more importantly, in order to provide critical data for a specific decision making process. Consequently, before setting out to create any evaluation, it is vital to define the framework in which the evaluation will be used. For instance, a lab might evaluate a model to determine whether it is safe enough to deploy, or a regulator might test a model to check whether it necessitates stricter access-control mechanisms. In essence, it is essential to specify what decisions hinge on these evaluations before undertaking their development, as this has direct implications for all of the evaluations’ parameters.

Employing Threat Modeling and Specifying Risk Scenarios

After outlining the decision making process that the evaluations are going to support, the next step is to tie that process to a concrete real-world context. In the case of safety and security evaluations, the optimal approach is to design the evaluations after conducting threat modeling work and specifying relevant risk scenarios. As there are currently many nomenclatures in use in the industry, we would like to define the way we use the terms Threat Modeling and Risk Scenarios:

  1. Threat Modeling: The process of defining possible threat categories that may arise from the availability of frontier models² with certain capabilities. These categories may be divided by threat actors, targets, types of operations, or other relevant parameters. Example threat models:
    a. Models might uplift cybersecurity experts, enabling them to create state-of-the-art cyber weapons previously only available to nation-state actors.
    b. Models' cyber knowledge and coding proficiency might augment developers without cybersecurity backgrounds, enabling them to create automated attack tools and potentially increasing the availability of high-end cyber tools³.
  2. Risk Scenario: A particular manifestation of a threat model (or at least a narrow and well-defined group of specific scenarios). Risk scenarios add external variables to a threat model, such as specifying classes of targets, timeframes, and other factors. They are the application of threat modeling to concrete situations and frame the possible dangers that might arise. Some example risk scenarios:
    a. A cybersecurity expert develops malware on the scale of Stuxnet using the knowledge and support of AI systems, and decides to attack a nuclear reactor due to personal resentment.
    b. A talented developer looking to earn a lot of money uses LLMs to quickly create an advanced ransomware program and sells it to multiple cyber-criminals over the dark web.
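
To make these definitions more concrete, here is a minimal sketch in Python of how a threat model and one of its risk scenarios might be written down alongside an evaluation suite. The class and field names are hypothetical illustrations of ours, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class ThreatModel:
    """A category of threat arising from the availability of capable frontier models."""
    name: str
    threat_actor: str   # e.g. "cybersecurity expert" or "developer without a security background"
    description: str

@dataclass
class RiskScenario:
    """A particular manifestation of a threat model, with external variables pinned down."""
    threat_model: ThreatModel
    target_class: str   # e.g. "critical infrastructure"
    timeframe: str
    description: str

# Hypothetical instances mirroring threat model (a) and risk scenario (a) above.
expert_uplift = ThreatModel(
    name="expert uplift",
    threat_actor="cybersecurity expert",
    description="Models uplift experts, enabling state-of-the-art cyber weapons.",
)
reactor_attack = RiskScenario(
    threat_model=expert_uplift,
    target_class="nuclear reactor",
    timeframe="unspecified",
    description="An expert develops Stuxnet-scale malware with AI support and attacks a reactor.",
)
```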

When our clients develop evaluations and test suites, we recommend that they first focus on at least some minimal threat modeling and consider the specific risk scenarios that worry them. Though exhaustively defining all threat models beforehand is often infeasible, once the basic needs are understood, iteratively refining these models alongside the ongoing development of the evaluation suite can prove highly effective.

Deriving the Critical Capabilities

After understanding the main threat models and specific risk scenarios, the immediate next step is to identify the relevant capabilities for each threat model and to prioritize them. Most threat models require more than one capability to actualize, or at the very least have multiple possible avenues that might lead to harm. However, to maximize the effectiveness of the evaluation suite, and subsequently of the decision making process, it is best to focus on the few capabilities that are either a bottleneck or significantly more impactful⁴. Generally, it is useful to start with these and to evaluate them first, even if the aim is for the evaluation suite to eventually be comprehensive.

The fundamental question regarding the choice of critical capabilities is “what specific, measurable skills would make a critical difference in a threat actor’s ability to cause harm?” These are the capabilities that we should prioritize when evaluating models. They typically fall into three categories: capabilities that bottleneck much of the work a threat actor would need to perform, capabilities that cover the most challenging aspects of their work, and skills that enable substantial scaling.

In many cases, this analysis leads to discussions about how “wide” capabilities should be, or what the scope of a single capability should be. For example: is “vulnerability discovery” a single capability or a group of different capabilities separated by vulnerability type?

In our experience, the best answers to these questions tend to come from a practical point of view: what can we evaluate efficiently, and what aligns with the framework of the decision making process (as discussed above)? It is more a question of definitions and semantics than of “which capabilities are most important?”
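
As a rough illustration of this derivation step, the sketch below (with hypothetical names of our own choosing) tags candidate capabilities with the reason they are critical, whether as a bottleneck, as the hardest step, or as a scaling enabler, so they can be ordered for evaluation.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Criticality(Enum):
    """Why a capability matters for a given threat model."""
    BOTTLENECK = auto()    # gates much of the work a threat actor must perform
    HARDEST_STEP = auto()  # the most challenging part of the work
    SCALING = auto()       # enables substantial scaling of the operation

@dataclass
class CandidateCapability:
    name: str
    criticality: Criticality
    rationale: str

# Hypothetical candidates for the expert-uplift threat model discussed above.
candidates = [
    CandidateCapability(
        name="vulnerability discovery",
        criticality=Criticality.BOTTLENECK,
        rationale="Finding exploitable flaws gates most of the downstream work.",
    ),
    CandidateCapability(
        name="evasion of defensive tooling",
        criticality=Criticality.HARDEST_STEP,
        rationale="Typically the part requiring the deepest expertise.",
    ),
]

# Evaluate bottleneck capabilities first, even if the suite will eventually be comprehensive.
priority_order = sorted(candidates, key=lambda c: c.criticality.value)
```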

Setting Capability Thresholds

Once we know which capabilities we want to evaluate, the next step is to set the appropriate thresholds: at what level does a capability become dangerous and increase the likelihood of the threat manifesting? This threshold will define what difficulty and skill level we will need to evaluate: both around the critical threshold itself and at lower levels, in order to ascertain model progress.

In order to design evaluations effectively, we need to put a specific threshold on the capability that we wish to evaluate. Depending on the capability, defining a threshold can be easy or difficult. For some capabilities, there are clear standardized metrics (and sometimes evaluations) which are relevant not only for humans but also for models. For others, we need to figure out how to define our capability levels. This can be anything from “a score on this test” or “finding X vulnerabilities from this list” to “doing this work at a high-school level”.

The thresholds themselves can be based on ties to specific risk thresholds (“here is what unacceptable risk looks like; which capability level creates this risk?”), on comparison to a baseline (“this level of capability is clearly above what is currently available to the threat actors”), or on any other basis relevant to the decision making process that the evaluation feeds into. In practice, defining the appropriate risk (by setting risk thresholds, choosing baselines, etc.) and deriving the critical capabilities often happen at the same time. This is fine, as long as clear capability thresholds are established and implemented afterwards.

Essentially, capability thresholds must do three things:

  1. They must be reasonably convenient to measure with specific evaluations. For example, it should be possible, at least for experts, to tell what “high-school level” looks like for the capability.
  2. They must be tied as closely as possible to the threat models of interest. This means it should be justifiable, to some extent, why a model exhibiting this level of capability increases the likelihood of the threat model actualizing.
  3. They should be reasonably easy to resolve. Experts looking at evaluation results should be able to mostly agree on whether or not the capability has exceeded the threshold.

The third point is crucial. It is always preferable to have difficult discussions (and even disagreements) at the evaluation design stage rather than at the resolution stage. Disagreements at the design stage are a risk to the evaluations’ effectiveness, but a much lesser one than disagreements at resolution time.

This is one of the main reasons why we recommend using capability thresholds as opposed to something like risk thresholds (which have also been discussed elsewhere). The link between risk and capabilities is always complex, and the questions that tie them together should be resolved before designing evaluations.
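
Pulling the above together, a capability threshold might be recorded roughly as follows. This is only a sketch with hypothetical fields, meant to show that a threshold definition should name a measurable criterion, its tie to a threat model, and an unambiguous resolution rule.

```python
from dataclasses import dataclass

@dataclass
class CapabilityThreshold:
    capability: str
    measurable_criterion: str  # requirement 1: reasonably convenient to measure
    threat_model_link: str     # requirement 2: tied to a threat model of interest
    resolution_rule: str       # requirement 3: experts can agree whether it was crossed

# Hypothetical example for the vulnerability-discovery capability used earlier.
vuln_discovery_threshold = CapabilityThreshold(
    capability="vulnerability discovery",
    measurable_criterion=(
        "Finds at least X vulnerabilities from a curated list of real-world "
        "targets, with X fixed in advance."
    ),
    threat_model_link=(
        "At this level, a developer without a security background gains uplift "
        "comparable to a mid-level security researcher."
    ),
    resolution_rule=(
        "A finding counts only if the model produces a working proof-of-concept "
        "trigger that is graded automatically."
    ),
)
```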

Tying it all Together

After all of the above work is completed, the next step is to integrate it into the design of the evaluation suite and of the specific evaluations. In the next part, we will expand on this process, and specifically on the criteria Pattern Labs believes are essential for evaluation creation in most scopes.


¹ It is sometimes beneficial to single out specific important capabilities with a single end-to-end evaluation, but this is circumstance-dependent and is part of a different discussion.
² Or their derivative models.
³ For those versed in the cybersecurity realm, imagine allowing a single talented developer to build and distribute a tool of the same quality as Cobalt Strike or Metasploit.
⁴ In some cases all capabilities are equally important, but this is rare.


Best Practices for Evaluations and Evaluation Suites - Part 1 © 2024 by Pattern Labs Tech Inc. is licensed under CC BY-NC-ND 4.0.

To cite this article, please credit Pattern Labs with a link to this page, or use the BibTeX citation below.
@misc{pl-best2024,
  title={Best Practices for Evaluations and Evaluation Suites: Part 1},
  author={Pattern Labs},
  year={2024},
  howpublished={\url{https://patternlabs.co/blog/best-practices-for-evaluations-and-evaluation-suites-part-1}},
}