Best Practices for Evaluations and Evaluation Suites: Part 2

Introduction

This is the second part in our series outlining the best practices for the design and creation of evaluations and evaluation suites. In the previous part, we went over:

  1. Defining the relevant decision making process the evaluations will feed into
  2. Employing threat modeling and specifying risk scenarios of interest
  3. Deriving the critical capabilities from the risk scenarios and threat models
  4. Setting appropriate capability thresholds

Going through the above steps is vital before setting out to design evaluations and evaluation suites. This post focuses on the qualities that are usually critical in most evaluations and evaluation suites, regardless of the specific topic they address.

Key Attributes of Quality Evaluations

We believe the following attributes should characterize most evaluations. While there are exceptions to each of them, they are relatively rare.

The evaluation is representative of realistic risk scenarios

Even when evaluations only represent a very narrow capability from within a larger work process, that capability should be as realistic as possible. When testing for capabilities, it is preferable to create evaluations that require executing a realistic task, rather than answering multiple choice questions on the issue. As a negative example, many (though not all) CTF competitions include cyber tasks that are more riddle-like than realistic challenges. This makes it harder to connect them clearly to real risks and risk scenarios.

The difficulty of the evaluation is explicit and clear

The difficulty scale used can either be based on a specific framework (e.g. as outlined in a Responsible Scaling Policy, https://metr.org/blog/2023-09-26-rsp/) or be more general (easy/medium/hard). Generally, the more concrete and specific the scale, the better the data the evaluation yields, although this depends on how the evaluation results are integrated into the broader decision making process. At the very least, it should be easy to understand the relation between the difficulty of the evaluation and the relevant capability threshold.

As part of this criterion, there ought to be a known, well defined solution to the evaluation. More stringently, there should be high confidence that there are no other solutions to the evaluation, or at least none easier than the intended one. This is necessary to, at minimum, bound the difficulty level of the evaluation from below.
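As a minimal sketch of what this can look like in practice (the field names, difficulty labels and threshold name below are our own illustrative assumptions, not a prescribed schema), an evaluation record can carry its difficulty label, the capability threshold it informs, and a pointer to its intended solution explicitly:

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    """Illustrative three-level scale; a real suite may use a finer, framework-based one."""
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass
class EvalTask:
    task_id: str
    difficulty: Difficulty       # explicit difficulty label
    capability_threshold: str    # which capability threshold this evaluation informs
    reference_solution: str      # pointer to the known, intended solution


# Hypothetical example record.
task = EvalTask(
    task_id="vuln-discovery-017",
    difficulty=Difficulty.MEDIUM,
    capability_threshold="autonomous vulnerability discovery",
    reference_solution="solutions/vuln-discovery-017.md",
)
print(task.difficulty.value, "->", task.capability_threshold)
```

Keeping the intended solution alongside the task also makes it easier, during QA, to check whether an easier, unintended path to success exists.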

The evaluation does not exist in the training set

This can be achieved either through the creation of novel evaluations, or by obfuscating existing tests significantly enough to fundamentally change them. A theoretical rule-of-thumb for how different an evaluation should be from the version in the training data is that connecting the two versions to each other should be harder than solving the evaluation itself. Unfortunately, this is difficult to ascertain in practice: the difficulty of different tasks for models is sometimes surprising, making it very hard to be certain of this property. In any case, it is critical to avoid contamination issues, where results stem from simple pattern matching (the model memorizing answers) rather than from genuine capability.
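For illustration only, the toy sketch below shows the weakest form of such obfuscation, whole-word identifier renaming; on its own this would rarely satisfy the rule-of-thumb above, and real obfuscation would also need to restructure the underlying logic and scenario:

```python
import re


def rename_identifiers(source: str, mapping: dict[str, str]) -> str:
    """Rename whole-word identifiers in a task artifact (surface-level obfuscation only)."""
    for old, new in mapping.items():
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    return source


original = "def check_password(password):\n    return password == 'hunter2'"
print(rename_identifiers(original, {"check_password": "validate_token", "password": "token"}))
```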

The subject of focus and the granularity level of the evaluation are clear

Just as the difficulty should be precise, the knowledge or skill being tested should be explicitly stated as well. For instance, evaluations can test AI systems on “malicious code creation”, “writing malware” or “writing code that evades the 3 most common EDR products”, and each test would yield different outputs. These outputs would have varying degrees of relevance to different decision making processes.

The evaluation has high signal density

Signal density refers to how much information we can derive from a single evaluation. For instance, creating multiple checkpoints on the way to success, or logging a model's work, can typically provide valuable insights about its behavior and capabilities beyond a simple success/fail result.
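As a minimal sketch of this idea (the checkpoint names and result fields are assumptions made for illustration), recording which milestones a model reached, alongside a full transcript, yields far more signal than a single pass/fail bit:

```python
from dataclasses import dataclass

# Ordered milestones on the way to full success (illustrative names).
CHECKPOINTS = ["recon_completed", "vulnerability_identified", "exploit_written", "exploit_executed"]


@dataclass
class EvalResult:
    task_id: str
    checkpoints_reached: list[str]
    transcript_path: str  # full log of the model's work, kept for deeper analysis


def summarize(result: EvalResult) -> dict:
    reached = set(result.checkpoints_reached)
    return {
        "passed": CHECKPOINTS[-1] in reached,                                      # the bare pass/fail signal
        "progress": sum(cp in reached for cp in CHECKPOINTS) / len(CHECKPOINTS),   # denser signal
        "failed_at": next((cp for cp in CHECKPOINTS if cp not in reached), None),  # where the attempt broke down
    }


result = EvalResult("vuln-017", ["recon_completed", "vulnerability_identified"], "logs/vuln-017.jsonl")
print(summarize(result))  # {'passed': False, 'progress': 0.5, 'failed_at': 'exploit_written'}
```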

The scoring method is coherent and tailored to the decision making framework the evaluation is serving

Evaluations can be graded as fail/pass, on a scale of 1 to 10, or using other methods. It is essential to tailor the scoring method to the broader risk methodology utilizing the evaluations and to the final decision the evaluation serves.

There are additional, more general, best practices that relate to writing infrastructure or code, such as:

  1. Conducting extensive QA, for example to ensure there aren’t unintended, easier ways to pass the evaluation
  2. Performing thorough testing, for instance to verify that a model can solve the evaluation


Since these practices aren’t specific to evaluations, we don’t expand on them in this document. However, they shouldn’t be neglected, as they represent industry best practices.

Although in some rare cases it is necessary to use non-standard evaluation methods, most of the time applying the above principles is required to create a good evaluation. Otherwise, the evaluation might not produce actionable results, or its output might not contain enough meaningful information.

Key Attributes of Quality Evaluation Suites

We believe the following attributes should characterize most evaluation suites:

The coverage of the suite should be varied along parameters that could affect AI systems’ performance in the future

This covers various parameters: for example, when measuring vulnerability discovery skills, it is important to test different vulnerability types, attack contexts and technical details (programming language, etc.). This helps protect the suite against surprising differential progress in AI systems’ abilities. It is sometimes difficult to do, as seemingly minor details may have a large effect on model performance: things like “what type of service am I attacking”, code style, and even the addition of irrelevant information can meaningfully change a model’s success rate. This is even more true for clearly relevant parameters such as subcategories of the capability being evaluated.
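A minimal sketch of planning coverage along such axes might look like the following (the specific axes and values are assumptions made for illustration; a real suite would derive them from its threat models and prune implausible combinations):

```python
from itertools import product

# Illustrative coverage axes for a vulnerability discovery suite.
vuln_types = ["sql_injection", "buffer_overflow", "path_traversal"]
languages = ["c", "python", "javascript"]
contexts = ["web_service", "local_binary", "library_api"]

# Enumerate candidate task variants across the coverage axes.
suite_plan = [
    {"vuln_type": v, "language": lang, "context": ctx}
    for v, lang, ctx in product(vuln_types, languages, contexts)
]
print(len(suite_plan), "candidate variants")  # 27; implausible combinations would then be filtered out
```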

Complexity and difficulty should be consciously chosen

Usually it is preferable to create suites with evaluations of varying difficulty, to increase the likelihood of distilling actionable information from the suite; however, suites of similar difficulty can also be useful. In some cases, it’s important to include evaluations below the capability thresholds to track capability progress, but not always. Complexity should also be intentionally planned: for instance, should evaluations measure multiple parameters? Should the AI system be tested on them sequentially or in parallel? All of this should be derived from the relevant decision making process, risk scenarios and threat models.

The scoring method is purposefully chosen to maximize usefulness

Even more important than the scoring method of any specific evaluation, the scoring of the entire suite should be legible and meaningful to the intended end-user (who might not be an expert in the suite’s field of focus). Note that taking scores from many different evaluations and turning them into a single score (or even a clear score-card) is a difficult task and, again, relates very closely to the threat models: whether a model that succeeds at 5% of the tasks is dangerous is a threat-model-specific question. Typically, this leads to using non-binary scoring methods for the suite, across multiple scoring parameters.
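As a rough sketch (the grouping keys and report format are our own assumptions), aggregating per-evaluation results into a score-card bucketed by capability and difficulty keeps the output legible without collapsing everything into a single number:

```python
from collections import defaultdict

# Illustrative per-evaluation results.
results = [
    {"capability": "vuln_discovery", "difficulty": "easy",   "passed": True},
    {"capability": "vuln_discovery", "difficulty": "hard",   "passed": False},
    {"capability": "exploit_dev",    "difficulty": "easy",   "passed": True},
    {"capability": "exploit_dev",    "difficulty": "medium", "passed": False},
]


def scorecard(results: list[dict]) -> dict:
    buckets = defaultdict(lambda: [0, 0])  # (passes, attempts) per (capability, difficulty) bucket
    for r in results:
        key = (r["capability"], r["difficulty"])
        buckets[key][0] += r["passed"]
        buckets[key][1] += 1
    return {key: f"{passes}/{attempts} passed" for key, (passes, attempts) in buckets.items()}


for bucket, summary in scorecard(results).items():
    print(bucket, summary)
```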

There should be overlap in coverage between evaluations

Most evaluations incorporate assumptions about how models will approach specific problems or skills. These assumptions mean that capable models may still fail some evaluations, and weaker models may occasionally succeed at others, for seemingly random reasons. To avoid inadvertently over-weighting hidden assumptions, we need some overlap between different evaluations in our suite to serve as mutual verification. In general, we should always have enough similar evaluations to have good confidence when considering the possibility that the model just happened to “get lucky/unlucky”.
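A rough sketch of the “did it just get lucky?” check, under the simplifying assumptions (ours, for illustration) that similar evaluations are independent and that a model below the threshold passes each one by chance with a fixed probability:

```python
from math import comb


def prob_at_least_k_by_chance(n: int, k: int, p_lucky: float) -> float:
    """Probability of k or more chance successes out of n similar evaluations."""
    return sum(comb(n, i) * p_lucky**i * (1 - p_lucky) ** (n - i) for i in range(k, n + 1))


# With only 2 similar tasks, a single pass is easy to explain away as luck...
print(round(prob_at_least_k_by_chance(n=2, k=1, p_lucky=0.1), 3))  # ~0.19
# ...while 4 passes out of 8 similar tasks are much harder to attribute to chance.
print(round(prob_at_least_k_by_chance(n=8, k=4, p_lucky=0.1), 4))  # ~0.005
```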

In addition to these guidelines, there are other parameters that are worth considering when creating evaluations and evaluation suites and are more adjustable depending on the goal of the suite. We plan on expanding on these in the next part of this series.



Best Practices for Evaluations and Evaluation Suites - Part 2 © 2024 by Pattern Labs Tech Inc. is licensed under CC BY-NC-ND 4.0.

To cite this article, please credit Pattern Labs with a link to this page, or use the BibTeX citation below.
@misc{pl-best2024,
  title={Best Practices for Evaluations and Evaluation Suites: Part 2},
  author={Pattern Labs},
  year={2024},
  howpublished={\url{https://patternlabs.co/blog/best-practices-for-evaluations-and-evaluation-suites-part-2}},
}