Best Practices for Evaluations and Evaluation Suites: Part 3

October 2, 2024

Introduction

This is the third and final part in our series outlining the best practices for the design and creation of evaluations and evaluation suites. In the previous parts, we first outlined the crucial steps of AI risk mapping: beginning with defining the decision making framework, then identifying the critical capabilities requiring evaluation, and finally establishing specific thresholds for these capabilities. Subsequently, we provided an overview of the attributes and principles that should guide the development of most evaluations and evaluation suites.

In this part we will enumerate additional parameters that are worth considering when creating evaluations and evaluation suites. Unlike the attributes we expanded on in the previous part, the following parameters are much more context-dependent and adjustable, and should be used according to specific needs. Some of these have specific tradeoffs, where the optimum depends on context, some of these are best to be varied among the different evaluations in the suite, and some of these just depend on the way we expect the evaluations to be analyzed later. It is entirely possible for some suites of evaluations to mostly ignore the following parameters and still be excellent. Additionally, we will expand on cybersecurity evaluation types as an example of different types of evaluations that can be used to customize evaluation suites.

Evaluation Parameters

When creating an evaluation, it is useful to consider the following range of parameters:

Evaluation type. This parameter can be further broken down to general evaluation formats (for instance, human uplift trials vs. autonomous AI evaluations) and field-specific types (e.g., in cybersecurity evals, Capture the Flag evaluations vs. network simulation evaluations). Some more examples of cybersecurity types will be expanded upon in the next section.
Modality. Both input and output of models may vary significantly in format and type: from text based communication, to visual information, to performing concrete actions in the digital space (or even physical space). Each of these require the AI to exhibit different capabilities and have different implications on risk models. Each will likely require separate evaluations to some extent. We plan on elaborating on this in the future.
Evaluated Capability Scope. Evaluations can check anything from a very specific sub-capability to a full end to end cyber operation. In most cases, we recommend a mix of different scopes but the specific needs may vary.
Non subject-matter technical constraints. This can be implemented through various methods, such as imposing a time limit or restricting the number of attempts allowed to provide the correct answer. These constraints can be useful for instance to vary the difficulty without changing the subject of focus of the evaluation, or to simulate real-world constraints which threat actors find themselves facing.
Distractors/red-herrings. These can be quite useful as adversarial difficulty enhancers for LLM-based AI systems.
Randomized parameters. This can be used for example to robustly check if the AI has generalized the solution and not memorized it from its training data (or succeeded / failed due to some very specific parameter). It’s important to understand the extent and scope of randomization that is employed - AI models have an impressive (and improving) ability to make some inferences and may be able to “deconstruct” the randomization.
Maintenance and upkeep. How much work is needed in order to keep the evaluation working and relevant? How costly is it to run? For how long? At what scale?

The above list is by no means exhaustive, but in our eyes valuable when constructing a new evaluation. Note that we intentionally did not include tunable parameters in the evaluatee: e.g., number of runs per evaluation, or in the case of LLMs, a limit on the amount of messages, tool usages, etc. These parameters are worth considering in the context of the broader framework of risk assessment, but they are not in the scope of designing a specific evaluation.

Cybersecurity Evaluation Types

The following is a non-comprehensive list of cybersecurity evaluation types. The list is given to enrich the public discussion and to highlight the different properties of several evaluation types. Using different evaluation types can be very useful to tailor the evaluation suite to specific risk scenarios and threat models.

Question/Answer evaluations. These can be MCQs, open-ended questions, or other forms. This evaluation type can sometimes be useful to test for the effectiveness of safeguards and refusal rates, as well as the existence of specific knowledge in the AI system. Usually, these evaluations are not sufficient to test specific offensive cybersecurity capabilities.
Capture the Flag (CTF) evaluations. This type of evaluation is suitable for measuring the cyber capabilities of AI systems in narrow domains, for instance vulnerability discovery and exploit development. They usually involve a server running some specific, vulnerable code which the model must attack.
Network evaluations. These are evaluations that include simulating complex real-world networks, usually including multiple different network components.These can be useful to simulate either end-to-end tests, for example to test autonomous AI risk scenarios, or to test specific skills such as lateral movement or privilege escalation in network environments.
Social engineering. These evaluations generally involve interactions where the model is causing some specific human behavior. Usually, this involves some sort of phishing attack, but not necessarily. In essence, this type of evaluation is narrowly focused on testing persuasion capabilities relevant to cybersecurity.
Cyber tool creation. In this type of evaluation, the AI system is tasked with generating some sort of cybersecurity code/product with a clear purpose. Evaluations of this type can be useful for measuring the ability of AIs to create malware, automated attack tools, and other cyber-oriented instruments.
Evasion evaluations. These are useful to test how robust are AIs at avoiding cybersecurity defensive solutions, e.g. as an indicator to the danger they pose in ARA scenarios.

Evaluation Suite Parameters

When building evaluation suites, in a similar manner to the evaluations themselves, there are multiple tunable parameters to consider. These parameters should be tweaked depending on the goal of the suite. The following is a partial yet useful list we commonly use when advising our customers:

Risk measurement. Before designing a suite, it is helpful to decide what type of risk the suite will measure:
- Absolute risk: What is the direct risk from the AI system?
- Marginal risk: What is the increase in risk from the AI system, considering other technologies and variables (e.g., information widely available)?
- Residual risk: What is the increase in risk from the system, assuming the safety safeguards put in place are not bypassed?
Each of the above require different evaluations and most likely will affect other parameters in this list as well.
Included evaluation types. Suites can range from those composed of a single evaluation type repeated to increase result confidence, to those consisting of multiple evaluation types examining different capabilities, all relevant to a specific risk scenario.
Quality and Quantity. Given limited resources, there’s usually some tradeoff between having high quality, custom-made, well tested evaluations, and having a lot of evaluations. Both ends of the scale tend to produce very little information but there are a lot of reasonable points along this spectrum.
Coverage and Scope. Suites can aim to be comprehensive regarding a critical capability - e.g., testing all possible categories of memory exploits - or alternatively, opt for limited evaluations across multiple areas of focus, emphasizing the assessment of various aspects within a specific risk scenario.
Randomization and adaptability of the suite. It is possible to design suites with a dynamically changing composition of evals: e.g., randomly or adaptively to the success of the evaluated AI system. This can increase the amount of information derived from the suite, but should be carefully considered as it might also cause the suite to be inconsistent or unreliable.
Expected shelf life. As AI capabilities are increasing rapidly, many benchmarks are being passed quickly. It is important to at least try and estimate for how long the suite is going to be relevant and derive the necessary implications for the included evaluations. This can include the planned efforts to maintain and upkeep the suite, or a deadline for when a newly created suite should be ready.

To emphasize, all of the above should be dictated by the encompassing decision making process and AI risk mapping that we described in part one of this series: from threat modeling to selecting the appropriate capability thresholds. This is key and directly affects the choices and adjustments to the above parameters. Furthermore, there are many additional parameters beyond those described here - e.g., is the suite public or private, what documentation should accompany it - that should also be given careful thought.

Putting it all Together

Although it might appear straightforward at start, the art of designing evaluations and evaluation suites has many moving parts and considerations. Generally, we recommend both looking at the wider picture (What are my threat models? What is the goal of the evaluations I’m looking to create?) as well as remaining very connected to the object level risks and considerations (What specific risk scenario I want to evaluate and account for? What specific knowledge and capabilities are critical to evaluate? Given the constraints, how many evaluations should I focus on in each area of worry?).

While we hope this series is helpful and gives the community tools to improve their design process and policy making, this work is not intended to be comprehensive. The process of evaluation design and choice is unique, and we find ourselves making a significant amount of case-by-case decisions. We are more than happy for others to reach out and hear their thoughts on the subject from their experience.

To cite this article, please credit Pattern Labs with a link to this page, or click to view the BibTeX citation.

@misc{pl-best2024,
  title={Best Practices for Evaluations and Evaluation Suites: Part 3},
  author={Pattern Labs},
  year={2024},
  howpublished={\url{https://patternlabs.co/blog/best-practices-for-evaluations-and-evaluation-suites-part-3}},
}

← Back to Blog Feed