This is the third and final part of our series on best practices for the design and creation of evaluations and evaluation suites. In the previous parts, we first outlined the crucial steps of AI risk mapping: beginning with defining the decision-making framework, then identifying the critical capabilities requiring evaluation, and finally establishing specific thresholds for those capabilities. We then provided an overview of the attributes and principles that should guide the development of most evaluations and evaluation suites.
In this part, we enumerate additional parameters worth considering when creating evaluations and evaluation suites. Unlike the attributes we expanded on in the previous part, these parameters are much more context-dependent and adjustable, and should be applied according to specific needs. Some involve tradeoffs whose optimum depends on context; others are best varied across the different evaluations in a suite; and still others simply depend on how we expect the evaluations to be analyzed later. It is entirely possible for a suite of evaluations to mostly ignore these parameters and still be excellent. We will also expand on cybersecurity evaluation types as an example of how different evaluation types can be used to customize evaluation suites.
When creating an evaluation, it is useful to consider the following range of parameters:
The above list is by no means exhaustive, but we consider it valuable when constructing a new evaluation. Note that we intentionally did not include tunable parameters of the evaluatee: e.g., the number of runs per evaluation or, in the case of LLMs, limits on the number of messages, tool usages, etc. These parameters are worth considering in the context of the broader risk assessment framework, but they are out of scope when designing a specific evaluation.
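To make this distinction concrete, here is a minimal Python sketch; every field name in it is a hypothetical example of ours rather than a prescribed schema. The point is only the separation between parameters fixed at evaluation-design time and evaluatee-side run parameters, which belong to the broader risk assessment framework:

```python
from dataclasses import dataclass

# Hypothetical sketch: all field names are illustrative, not a standard.

@dataclass
class EvaluationDesign:
    """Parameters chosen when the evaluation itself is designed."""
    name: str
    capability: str          # e.g., "vulnerability discovery"
    difficulty: str          # e.g., "easy" / "medium" / "hard"
    scoring: str             # e.g., "binary" / "partial-credit"

@dataclass
class EvaluateeRunConfig:
    """Evaluatee-side knobs: out of scope for evaluation design itself."""
    runs_per_evaluation: int = 5
    max_messages: int | None = 50    # for LLM agents
    max_tool_calls: int | None = 100
```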
The following is a non-comprehensive list of cybersecurity evaluation types. We provide it to enrich the public discussion and to highlight the differing properties of several evaluation types. Using different evaluation types can be very useful for tailoring an evaluation suite to specific risk scenarios and threat models.
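As a purely illustrative sketch of why types matter for tailoring, tagging each evaluation with a type lets a suite be filtered against a threat model. The type names below are hypothetical stand-ins of our own, not an authoritative taxonomy:

```python
from dataclasses import dataclass

# Illustrative only: the type names are hypothetical stand-ins. The
# mechanism is the point: type tags let a suite be filtered to match
# a given threat model.

@dataclass(frozen=True)
class Evaluation:
    name: str
    eval_type: str  # e.g., "knowledge_qa", "ctf", "end_to_end_scenario"

def tailor_suite(pool: list[Evaluation], wanted_types: set[str]) -> list[Evaluation]:
    """Keep only the evaluation types relevant to a given threat model."""
    return [e for e in pool if e.eval_type in wanted_types]

# Hypothetical usage: a threat model centered on autonomous offensive
# operations might favor end-to-end scenarios over isolated knowledge checks.
pool = [
    Evaluation("sql-injection-quiz", "knowledge_qa"),
    Evaluation("pwn-binary-01", "ctf"),
    Evaluation("network-intrusion-sim", "end_to_end_scenario"),
]
suite = tailor_suite(pool, {"ctf", "end_to_end_scenario"})
```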
When building evaluation suites, much as with the evaluations themselves, there are multiple tunable parameters to consider, and they should be adjusted according to the goal of the suite. The following is a partial yet useful list that we commonly use when advising our customers:
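Independently of which specific parameters are chosen, here is a minimal, hypothetical sketch of how such suite-level knobs might be captured in practice. The fields are our own illustrative choices; a real configuration would be derived from the suite's goal and from the thresholds established during risk mapping:

```python
from dataclasses import dataclass

# Hypothetical sketch of suite-level knobs; field names are illustrative.

@dataclass
class SuiteConfig:
    goal: str                             # e.g., "pre-deployment gating"
    evals_per_capability: dict[str, int]  # coverage budget per area of worry
    difficulty_mix: dict[str, float]      # share of easy/medium/hard tasks
    escalation_threshold: float = 0.5     # aggregate score that triggers escalation

config = SuiteConfig(
    goal="pre-deployment gating",
    evals_per_capability={"vulnerability discovery": 10, "exploitation": 8},
    difficulty_mix={"easy": 0.2, "medium": 0.5, "hard": 0.3},
)
```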
Although it might appear straightforward at first, the art of designing evaluations and evaluation suites has many moving parts and considerations. Generally, we recommend both looking at the wider picture (What are my threat models? What is the goal of the evaluations I'm looking to create?) and remaining closely connected to the object-level risks and considerations (Which specific risk scenario do I want to evaluate and account for? What specific knowledge and capabilities are critical to evaluate? Given the constraints, how many evaluations should I focus on in each area of worry?).
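That last budgeting question admits a deliberately simple, hypothetical first pass: given a fixed evaluation budget and rough weights over areas of worry, allocate counts proportionally. A real allocation should also weigh per-evaluation cost, existing coverage, and how close each capability is to its threshold:

```python
# Hypothetical budgeting sketch: proportional allocation by area weight.

def allocate(budget: int, weights: dict[str, float]) -> dict[str, int]:
    total = sum(weights.values())
    alloc = {area: int(budget * w / total) for area, w in weights.items()}
    # Hand any rounding remainder to the highest-weighted areas first.
    remainder = budget - sum(alloc.values())
    for area in sorted(weights, key=weights.get, reverse=True)[:remainder]:
        alloc[area] += 1
    return alloc

print(allocate(30, {"vulnerability discovery": 3, "exploitation": 2, "evasion": 1}))
# -> {'vulnerability discovery': 15, 'exploitation': 10, 'evasion': 5}
```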
While we hope this series is helpful and gives the community tools to improve its design processes and policy making, this work is not intended to be comprehensive. Evaluation design and selection is a unique process, and we find ourselves making a significant number of case-by-case decisions. We would be more than happy for others to reach out and share their thoughts and experience on the subject.
Best Practices for Evaluations and Evaluation Suites - Part 3 © 2024 by Pattern Labs Tech Inc. is licensed under CC BY-NC-ND 4.0.
@misc{pl-best2024,
  title        = {Best Practices for Evaluations and Evaluation Suites: Part 3},
  author       = {Pattern Labs},
  year         = {2024},
  howpublished = {\url{https://patternlabs.co/blog/best-practices-for-evaluations-and-evaluation-suites-part-3}}
}