We introduce SOLVE (Scoring Obstacle Levels in Vulnerabilities & Exploits), a new scoring system for assessing the difficulty of a vulnerability discovery & exploit development challenge. SOLVE does not produce an objective measurement; rather, it is a framework for making a judgement about how complicated it is to discover the vulnerabilities in an end-to-end challenge and develop working exploits for them.
This is a public preview of a scoring system that we have already been using internally at Pattern Labs to assess challenge difficulty levels; since it is still an early version, we introduce it as "version 0.5". We are publishing it at this stage to solicit comments and feedback from the security research, AI safety and AI security communities.
Resources:
The difficulty of vulnerability discovery & exploitation interests us at Pattern Labs as part of our work on AI safety. Increasingly, LLMs and AI systems are being used to discover vulnerabilities and develop exploits. At Pattern Labs, we evaluate the dangerous cyber capabilities of AI, and as part of our evaluations, we take into account the difficulty of vulnerability discovery and exploitation tasks in order to quantify the skill level an AI demonstrates when given cyber challenges.
In plain words, with the SOLVE score, we would like to be able to say: "This AI can consistently solve challenges of difficulty level 3, struggles with difficulty level 5 challenges, and cannot solve difficulty level 7 challenges". Additionally, we would like to have a method for assessing this score without requiring humans to try solving the challenges, which is very costly at scale.
Although our specific practical application for this is to evaluate the capabilities of AI, there is nothing AI-related about the SOLVE score itself. We only aim to measure how difficult it is for some challenge-solver to complete the challenge — it doesn't matter whether the challenge-solver is human or AI. Therefore, we seek to define a difficulty score that only takes into account the challenge itself, remaining agnostic to the challenge-solver's abilities.
There are two important, complementary standard methods for assessing difficulty levels of cyber challenges:
In the context of datasets for AI cyber capability evaluations, both of the above methods are used. For example:
The experimental method, e.g. using the number of teams that solved a challenge or the first solve time, has significant drawbacks: it is costly to assess the difficulty of novel challenges (since new competitions with sufficiently large player bases must be held for them), and it is hard to compare challenges that appeared in different competitions.
Moreover, solution time has drawbacks as a predictor of difficulty. It is skewed by the fact that teams have discretion in choosing when, or whether, to solve a challenge; their choices may be driven by considerations such as the number of points the challenge is worth, team prioritization, the team's skill level in a particular subject, or a timezone that shifts their working hours relative to other teams. It is also skewed by the fact that not all of the time is spent constructing a solution: some of it is spent running the solution (e.g. a computationally intensive cryptographic algorithm or a low-probability, brute-forced memory exploitation technique may take a long time to run). Additionally, different competitions draw teams of different skill levels, so solution time cannot be reliably compared between challenges belonging to different competitions.
It is less costly to use the judgement method, i.e. evaluating a challenge to judge its difficulty level, in order to assess difficulty. This method does not require setting up a competition. However, simply using the number of points or the "easy / medium / hard" labels assigned by the challenge authors still makes it hard to compare challenges from different sources.
The SOLVE score addresses these drawbacks in the judgement method by defining a structured framework for making the judgements. The evaluator answers a set of questions regarding the challenge, and then a formula is applied to calculate the score. This approach eliminates ambiguity over what "easy" and "hard" mean, producing an explainable score that can be meaningfully compared across different challenge sources.
With SOLVE, the score is given to a "complete, end-to-end challenge" — not to an isolated vulnerability. This includes:
It is assumed that in order to solve the challenge, the challenge-solver needs some combination of code analysis, vulnerability discovery and exploit development skills; the SOLVE score aims to measure how difficult the challenge is to complete. This is detailed in the next section.
Here are some examples to illustrate how SOLVE aims to measure difficulty:
Some examples of challenges which may be judged using SOLVE:
Some examples of challenges which cannot be judged using SOLVE:
The idea behind SOLVE is to break down the challenge solution process into a few steps that the challenge-solver needs to take, judge the difficulty of each step separately using a few criteria, and then combine the per-step scores into an overall challenge difficulty score.
Solving a challenge generally consists of 3 steps:
Throughout these 3 steps, the challenge may require the challenge-solver to have expert knowledge (e.g. in cryptography) in order to complete them successfully.
(In the current version of SOLVE, we limit ourselves specifically to open-code challenges. Black-box challenges, where the challenge-solver has no access to the server's code, would have the first step replaced by an application analysis / exploration stage, and will be considered in a future version.)
Following this, the SOLVE score comprises 4 major components.
Each component receives a component score between 0 and 10, based on a formula that takes into account its subcomponents, described below. Finally, a formula combines the component scores to yield the overall SOLVE score. This combining formula is a variation of a "smooth-maximum" function, meaning the overall SOLVE score is close to the score of the most difficult component.
These formulas, and the guidelines for assessing the subcomponents, are detailed in the SOLVE score specification.
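As a rough illustration of this structure, the sketch below combines four component scores with a power-mean style smooth maximum. The exponent and the exact way subcomponents roll up into component scores are assumptions made purely for illustration; the authoritative formulas are the ones in the SOLVE score specification.

```python
# Illustrative sketch only: the real combination formula and its parameters are
# defined in the SOLVE score specification; the exponent P below is an assumption.

P = 6.0  # assumed smoothing exponent; larger values track the hardest component more closely

def smooth_max(scores: list[float], p: float = P) -> float:
    """Power-mean "smooth maximum": lies between the mean and the max of the scores."""
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def solve_score(code_analysis: float, vuln_discovery: float,
                exploit_dev: float, expert_knowledge: float) -> float:
    """Combine the four component scores (each 0-10) into an overall 0-10 score."""
    components = [code_analysis, vuln_discovery, exploit_dev, expert_knowledge]
    return round(smooth_max(components), 1)
```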
The SOLVE score is a number between 0 and 10. We suggest a division of the range into intervals, in a similar fashion to the division of the CVSS score which measures vulnerability severity:
The purpose of the code analysis difficulty component is to measure how difficult it is to read and analyze the provided code (which is possibly obfuscated, compiled, and/or long) in order to search for vulnerabilities in it.
The purpose of the vulnerability discovery difficulty component is to measure how hard it is to discover the vulnerabilities in the challenge, not including the effort needed to actually exploit them (the next component), or to read the code (the previous component).
The purpose of the exploit development difficulty component is to measure how difficult it is to develop a working exploit for the vulnerabilities discovered in the challenge, assuming the vulnerabilities are already known.
The purpose of the expert knowledge component is to measure how much specialized knowledge is required in order to solve the challenge. This is different from general vulnerability discovery & exploitation knowledge that would be expected from any challenge-solver.
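To make the four components concrete, here is a hypothetical assessment of an imaginary challenge, using the same illustrative power-mean combination as in the sketch above; every number is invented purely for illustration and does not come from the specification.

```python
# Hypothetical component scores for an imaginary challenge: short, readable code,
# a moderately subtle vulnerability, a demanding exploitation phase, and no
# specialized domain knowledge. All numbers are invented for illustration.
scores = {
    "code analysis": 1.5,
    "vulnerability discovery": 5.0,
    "exploit development": 7.5,
    "expert knowledge": 0.5,
}

p = 6.0  # same assumed smoothing exponent as in the earlier sketch
overall = (sum(s ** p for s in scores.values()) / len(scores)) ** (1.0 / p)
print(f"Overall score: {overall:.1f}")  # about 6.0, pulled toward the hardest component
```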
DownUnderCTF is a CTF competition organized by a collaboration of Australian university cybersecurity societies. It features a wide range of challenge difficulty levels and attracts thousands of competitors every year; the organizers regularly publish statistics, and they kindly agreed to share more detailed statistics with us privately. This made the competition an excellent subject for a case study.
In DownUnderCTF 2023, there were 2,073 registered teams, of which 1,424 solved at least one challenge. For this case study, we looked at 16 chosen challenges from the competition, all from "round 1" (exposed to competitors from the moment the competition starts), covering 4 challenge categories (pwn, rev, web, crypto) and spanning a wide range of difficulty levels. We assessed their SOLVE scores and compared them to the following metrics:
(Note: We did not consider the metric specifying the number of points the challenge was worth, because it was simply a function of the number of teams that solved the challenge.)
The results are as follows:
Figure: Number of teams that solved each challenge (log scale) vs. SOLVE score
Figure: First solve time (log scale) vs. SOLVE score
In these graphs, we see a good (but not perfect) correlation between the SOLVE score and all three comparative metrics: stated difficulty level, number of solving teams (log scale), and first solve time (log scale). Specifically, looking at the coefficient of determination (R²) under linear regression, we observe:
The R² values are summarized in the following table, and the full stats are available.
We conclude that the SOLVE v0.5 score is a reasonable predictor of challenge difficulty statistics in a live CTF competition, with the benefit that it can be assessed without holding a competition. It is comparable to the stated difficulty given by the challenge authors, with the added benefit that the full SOLVE score indicates where the difficulty in the challenge lies (e.g. mostly in the code analysis, the vulnerability discovery, or the exploit development). Therefore, we believe SOLVE is a useful tool for assessing challenges and analyzing their difficulty.
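For readers who want to reproduce the regression analysis from the published statistics, the sketch below shows one way to compute the R² values; the per-challenge numbers here are placeholders, not the actual DownUnderCTF data.

```python
# Sketch of the R^2 computation: regress each comparative metric on the SOLVE score.
# The per-challenge data below is a placeholder, not the real DownUnderCTF statistics.
import math
from scipy.stats import linregress

challenges = [
    # (SOLVE score, stated difficulty, teams solved, first solve time in minutes)
    (2.1, 1, 900, 12),
    (4.7, 2, 260, 45),
    (6.3, 3, 70, 180),
    (8.0, 4, 12, 600),
]

solve_scores = [c[0] for c in challenges]
metrics = {
    "stated difficulty": [c[1] for c in challenges],
    "log(teams solved)": [math.log(c[2]) for c in challenges],
    "log(first solve time)": [math.log(c[3]) for c in challenges],
}

for name, values in metrics.items():
    r_squared = linregress(solve_scores, values).rvalue ** 2
    print(f"R^2 for {name} vs SOLVE score: {r_squared:.2f}")
```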
We provide a SOLVE score calculator, pre-filled with details from the 16 chosen DownUnderCTF challenges.
Notes:
SOLVE is very much a work in progress, and has limitations and potential issues. Some of these we expect to improve, while some may be inherent difficulties of judgement-based evaluation methods. Following are some of the limitations of SOLVE in its current version:
We would like to thank DownUnderCTF for making their statistics and code openly available, for sharing additional statistics with us, and for allowing us to use their competition as a case study in this research. Their openness allows us, as researchers, to better analyze how both humans and LLMs approach vulnerability discovery challenges, which is crucial for advancing our understanding of AI capabilities.
SOLVE: Scoring Obstacle Levels in Vulnerabilities & Exploits (Version 0.5) © 2025 by Pattern Labs Tech Inc. is licensed under CC BY-SA 4.0.
@misc{pl-introducing2025,
  title        = {Introducing SOLVE: Scoring Obstacle Levels in Vulnerabilities \& Exploits (Version 0.5)},
  author       = {Pattern Labs},
  year         = {2025},
  howpublished = {\url{https://patternlabs.co/blog/introducing-solve}},
}