Introducing SOLVE: Scoring Obstacle Levels in Vulnerabilities & Exploits (Version 0.5)

Summary

We introduce SOLVE (Scoring Obstacle Levels in Vulnerabilities & Exploits), a new scoring system for assessing the difficulty of a vulnerability discovery & exploit development challenge. SOLVE does not produce an objective measurement; rather, it is a framework for making a judgement about how complicated it is to discover vulnerabilities, and to develop working exploits for them, within an end-to-end challenge.

This is a public preview of the scoring system that we have already been using internally at Pattern Labs to assess challenge difficulty levels; as such, we introduce it as "version 0.5". We are publishing this early version to solicit comments and feedback from the security research, AI safety, and AI security communities.

Overview

Why is this score needed?

The difficulty of vulnerability discovery & exploitation interests us at Pattern Labs as part of our work on AI safety. Increasingly, LLMs and AI systems are being used to discover vulnerabilities and develop exploits. At Pattern Labs, we evaluate dangerous cyber capabilities of AI, and as part of our evaluations, we take the difficulty of vulnerability discovery and exploitation tasks into account in order to quantify the skill level an AI demonstrates on cyber challenges.

In plain words, with the SOLVE score, we would like to be able to say: "This AI can consistently solve challenges of difficulty level 3, struggles with difficulty level 5 challenges, and cannot solve difficulty level 7 challenges". Additionally, we would like to have a method for assessing this score without requiring humans to attempt the challenges, since that is very costly at scale.

Although our specific practical application for this is to evaluate the capabilities of AI, there is nothing AI-related about the SOLVE score itself. We only aim to measure how difficult it is for some challenge-solver to complete the challenge — it doesn't matter whether the challenge-solver is human or AI. Therefore, we seek to define a difficulty score that only takes into account the challenge itself, remaining agnostic to the challenge-solver's abilities.

SOLVE as a framework for making difficulty judgements

[Diagram: Judgement / Experiment -> Challenge difficulty rating]

There are two important, complementary standard methods for assessing difficulty levels of cyber challenges:

  1. 🧑‍⚖️ Judgement
    The difficulty of the challenge is determined by the challenge authors/evaluators. This is usually presented either as a bottom-line classification (e.g. "easy / medium / hard"), or as a constant number of points for a CTF (capture-the-flag) challenge (e.g. "100-point challenge", "500-point challenge").
  2. 🧑‍🔬 Experiment
    The difficulty of the challenge is determined by letting people try to solve it and seeing how well they do. In CTF competitions employing this method, the number of points per challenge is dynamic, e.g. it decreases as more teams solve the challenge. Two ways to assess the difficulty of a challenge after a CTF competition is complete are to check how many people (or teams) were able to solve it, or how quickly it was solved (e.g. "first solve time" - the time until the first team solves the task, a.k.a. time to "first blood").

In the context of datasets for AI cyber capability evaluations, both of the above methods are used. For example:

  • In NYU CTF Bench, the first method is used: Each challenge has a number of points (ranging from 1-500) representing its difficulty, chosen by the NYU CSAW CTF challenge authors.
  • In Cybench, the second method is used: Each challenge's difficulty level is measured as the "first solve time" from when the challenge appeared in a live CTF competition (note that the dataset contains challenges from 4 different CTF competitions).

The experimental method, e.g. using the number of teams that solved the challenge or the first solve time, has significant drawbacks: it is costly to assess difficulty levels for novel challenges (as new competitions with sufficiently large player bases must be held), and it is hard to compare challenges that appeared in different competitions.

Moreover, solution time has drawbacks as a predictor of difficulty. It is skewed by the fact that teams have discretion in choosing when, or whether, to attempt a challenge; they may weigh considerations such as the number of points the challenge is worth, team prioritization, the team's skill level in a particular subject, or a timezone that shifts their working hours relative to other teams. It is also skewed by the fact that not all of the time is spent constructing a solution: some of it is spent running the solution (for example, a computationally-intensive cryptographic algorithm or a brute-force of a low-probability memory exploitation technique may take a long time to run). Additionally, different competitions draw teams of different skill levels, so solution times cannot be reliably compared between challenges belonging to different competitions.

The judgement method, i.e. evaluating a challenge directly to judge its difficulty level, is less costly: it does not require setting up a competition. However, simply using the number of points or the "easy / medium / hard" labels assigned by the challenge authors still makes it hard to compare challenges from different sources.

The SOLVE score addresses these drawbacks in the judgement method by defining a structured framework for making the judgements. The evaluator answers a set of questions regarding the challenge, and then a formula is applied to calculate the score. This approach reduces the ambiguity over what "easy" and "hard" mean, and produces an explainable score that can be meaningfully compared across different challenge sources.

What is being scored?

With SOLVE, the score is given to a "complete, end-to-end challenge" — not to an isolated vulnerability. This includes:

  • Directions for starting the challenge (e.g. a website URL; an IP address and port; a text / source code / resource / executable file; or some combination of these).
  • Sometimes, hints for solving the challenge are also available.

It is assumed that in order to solve the challenge, the challenge-solver needs some combination of code analysis, vulnerability discovery and exploit development skills; the SOLVE score aims to measure how difficult the challenge is to complete. This is detailed in the next section.

Here are some examples to illustrate how SOLVE aims to measure difficulty:

  • If the vulnerability is buried in 50,000 lines of code or requires considerable reverse engineering, the challenge should be considered more difficult (i.e. have a higher SOLVE score) than if it were a 50-line-of-code CTF challenge.
  • If the exploitation requires chaining together multiple vulnerabilities, the challenge should be considered harder than if there were a single vulnerability to exploit.
  • If the vulnerability is very simple (e.g. a straightforward buffer overflow, XSS, shell injection, etc.), it should be considered easier than if it were a complicated vulnerability that does not stand out (e.g. a small-window race condition that triggers an unexpected program state, a heavily constrained buffer overflow, a subtle flaw in a filter for shell injection protection, etc.).
  • The same challenge could become easier or harder in different contexts:
    - The same exploitable daemon could be run as root, or as a low-privileged user (in which case privilege escalation is also required).
    - The challenge-solver may or may not have access to server source code.
    - The exploitable binary might be compiled with or without mitigations (e.g. stack canary).
    - The challenge may become significantly easier if the challenge-solver receives a hint in advance.

Examples of challenges that can / can't be judged

Some examples of challenges which may be judged using SOLVE:

  • Most challenges in CTF (capture-the-flag) competitions. These are usually end-to-end challenges revolving around a vulnerability or an exploitation method, with success demonstrated by retrieving the flag from a live challenge setup, and they are eligible for SOLVE judgement.
  • A challenge built around a known vulnerability. For example, the "Heartbleed" challenge from Plaid CTF 2014, where a modified old version of nginx (vulnerable to Heartbleed) has the flag loaded into memory in a certain way, and the challenge-solver can exploit Heartbleed to read the flag.

Some examples of challenges which cannot be judged using SOLVE:

  • A "plain" CVE without full challenge context. For example, the "Heartbleed" vulnerability cannot be judged by itself, because it lacks challenge context, such as: What information is available to the challenge-solver? Is the challenge-solver trying to just generally find bugs in OpenSSL, or attack a specific application/server that's using OpenSSL in some way? What is the challenge-solver's objective? Info leak, RCE, privilege escalation, something else?
  • A challenge requiring finding a new 0-day (not known to the person making the SOLVE judgement) cannot be judged for difficulty. A critical part of judging how difficult it is to find a specific vulnerability is knowing what that vulnerability is.
  • Challenges which are not focused on vulnerability discovery & exploit development (such as "forensics" / "misc" challenges in some CTF competitions) and closed-code challenges (which are sometimes given in CTF competitions, particularly in the "web" category) are out-of-scope in the current version of SOLVE.

Breakdown of the score

The idea behind SOLVE is to break down the challenge solution process into a few steps that the challenge-solver needs to take, judge the difficulty of each step separately using a few criteria, and then combine the step scores into an overall challenge difficulty score.

Solving a challenge generally consists of 3 steps:

  1. Code analysis.
  2. Vulnerability discovery.
  3. Exploit development.

Throughout these 3 steps, the challenge may require the challenge-solver to have expert knowledge (e.g. in cryptography) in order to complete them successfully.

(In the current version of SOLVE, we limit ourselves specifically to open-code challenges. Black-box challenges, where the challenge-solver has no access to the server's code, would have the first step replaced by an application analysis / exploration stage, and will be considered in a future version.)

[Diagram: (Background knowledge) Analyze the code -> Discover vulnerabilities -> Develop an exploit]

Following this, the SOLVE score comprises 4 major components.

  1. Code analysis difficulty. How difficult is it for the challenge-solver to read and analyze the code?
  2. Vulnerability discovery difficulty. How complicated are the vulnerabilities in the code?
  3. Exploit development difficulty. How difficult is it to develop a working exploit?
  4. Expert knowledge. How much specialized knowledge or expertise is required in order to tackle this challenge?

Each component receives a component score between 0 and 10, based on a formula that takes into account the subcomponents listed below. Finally, a formula combines the component scores to yield the overall SOLVE score. The formula combining the component scores is a variation of a "smooth-maximum" function, meaning the overall SOLVE score is similar to the score of the most difficult component.

These formulas, and the guidelines for assessing the subcomponents, are detailed in the SOLVE score specification.
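
To make the "smooth-maximum" behaviour concrete, here is a minimal sketch in Python. It uses a log-sum-exp smooth maximum with an arbitrary sharpness parameter and clamps the result to 10; these are illustrative choices of ours, not the actual SOLVE formula, which is defined in the specification.

import math

def smooth_max(scores, sharpness=4.0):
    """Log-sum-exp smooth maximum: the result is dominated by the largest
    score, while the other scores nudge it slightly upward."""
    m = max(scores)
    # Subtract the maximum before exponentiating, for numerical stability.
    total = sum(math.exp(sharpness * (s - m)) for s in scores)
    return min(10.0, m + math.log(total) / sharpness)

# Hypothetical component scores: code analysis, vulnerability discovery,
# exploit development, expert knowledge (each between 0 and 10).
components = [3.0, 6.5, 5.0, 2.0]
print(round(smooth_max(components), 2))  # ~6.5, close to the hardest component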

Interpreting the SOLVE score

The SOLVE score is a number between 0 and 10. We suggest dividing the range into intervals, in a similar fashion to the division of the CVSS score, which measures vulnerability severity (a small helper that applies this mapping is sketched after the list):

  • SOLVE score between 0.0 and 3.9: Easy challenge, usually requiring only basic security research skills, which can be solved by a newcomer in the field of security research.
  • SOLVE score between 4.0 and 6.9: Medium challenge, usually requiring intermediate security research skills, which can be solved by an experienced security researcher.
  • SOLVE score between 7.0 and 8.9: Hard challenge, usually requiring advanced security research skills, which can be solved by an expert security researcher.
  • SOLVE score between 9.0 and 10.0: Expert challenge, usually requiring top security research skills.
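
As referenced above, here is a small helper that applies this mapping. The function name and boundary handling are our own illustrative choices; as with CVSS, scores are assumed to be given to one decimal place.

def solve_rating(score: float) -> str:
    """Map a SOLVE score (0.0-10.0, one decimal place) to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("SOLVE score must be between 0.0 and 10.0")
    if score <= 3.9:
        return "Easy"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "Hard"
    return "Expert"

print(solve_rating(6.5))  # Medium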

Code analysis difficulty subcomponents

The purpose of the code analysis difficulty component is to measure how difficult it is to read and analyze the provided code (which is possibly obfuscated, compiled, and/or long) in order to search for vulnerabilities in it.

  • Code readability. How hard is it to bring the available code/binary from its raw state to a readable state?
  • Code analysis complexity. How hard is it to understand what the relevant parts of the code actually do (without thinking about vulnerabilities yet)?
  • Code length. How much code is there to read?
  • Attack surface size. How large is the attack surface? Usually, the attack surface size is proportional to the amount of code that actually needs to be analyzed.

Vulnerability discovery difficulty subcomponents

The purpose of the vulnerability discovery difficulty component is to measure how hard it is to discover the vulnerabilities in the challenge, not including the effort needed to actually exploit them (the next component), or to read the code (the previous component).

  • Vulnerability complexity. How complicated is the vulnerability or chain of vulnerabilities?
  • Number of CWEs. How many different bug classes (using CWEs to represent them) does the challenge comprise?
  • Locality. How local is the vulnerability? This ranges from function-local to depending on an interaction between different code bases running on different machines.

Exploit development difficulty subcomponents

The purpose of the exploit development difficulty component is to measure how difficult it is to develop a working exploit for the vulnerabilities discovered in the challenge, assuming the vulnerabilities are already known.

  • Exploit complexity. How difficult is it to develop the main exploit logic (the "interesting parts" of the exploit, which directly exploit the vulnerabilities)?
  • Number of techniques. How many different techniques (such as: memory safety mitigation bypass, hitting a race window, compute-intensive cryptanalysis, etc.) are needed as part of the exploit development?
  • Complexity of exploit harness. How difficult is it to develop the harness for delivering the exploit (the "boring parts" of the exploit, which simply interact with the attack surface)? For example, a challenge which accepts the malicious input directly via stdin has a trivial harness, while a challenge where complicated hooking code must be written in order to deliver the malicious input is said to have a complex harness.

Expert knowledge subcomponent

The purpose of the expert knowledge component is to measure how much specialized knowledge is required in order to solve the challenge. This is different from general vulnerability discovery & exploitation knowledge that would be expected from any challenge-solver.

  • Expert knowledge level. What is the level of knowledge required in expert areas of knowledge?
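
Taken together, the components and subcomponents above amount to a small structured record that the evaluator fills in. The sketch below is a hypothetical representation of such a record; the type and field names are ours, and the authoritative definitions and value guidelines are in the SOLVE score specification.

from dataclasses import dataclass

@dataclass
class CodeAnalysis:
    readability: float            # effort to bring the code to a readable state
    analysis_complexity: float    # effort to understand what the relevant code does
    code_length: float            # how much code there is to read
    attack_surface_size: float    # how large the attack surface is

@dataclass
class VulnerabilityDiscovery:
    vulnerability_complexity: float
    num_cwes: int                 # number of distinct bug classes (CWEs)
    locality: float               # function-local up to cross-machine interaction

@dataclass
class ExploitDevelopment:
    exploit_complexity: float
    num_techniques: int           # mitigation bypasses, race windows, etc.
    harness_complexity: float     # effort to deliver the exploit to the attack surface

@dataclass
class ExpertKnowledge:
    knowledge_level: float        # level of specialized knowledge required

@dataclass
class SolveAssessment:
    code_analysis: CodeAnalysis
    vulnerability_discovery: VulnerabilityDiscovery
    exploit_development: ExploitDevelopment
    expert_knowledge: ExpertKnowledge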

SOLVE score as a predictor of challenge difficulty in a real CTF competition

DownUnderCTF is a CTF competition organized by a collaboration of Australian university cybersecurity societies. The competition features a wide range of challenge difficulty levels and attracts thousands of competitors every year; the organizers regularly publish statistics, and they kindly agreed to share more detailed statistics with us privately. This made DownUnderCTF an excellent subject for a case study.

In DownUnderCTF 2023, there were 2,073 registered teams, of which 1,424 solved at least one challenge. In this case study, we looked at 16 chosen challenges from this competition, all from "round 1" (exposed to competitors from the moment the competition starts), covering 4 different challenge categories (pwn, rev, web, crypto) and spanning a wide range of difficulty levels. We assessed their SOLVE scores, and compared them to the following metrics:

  • The difficulty level stated by the authors (available on their GitHub repository)
  • The number of teams that solved the challenge
  • First solve time, in seconds

(Note: We did not consider the metric specifying the number of points the challenge was worth, because it was simply a function of the number of teams that solved the challenge.)

The results are as follows:

[Table: SOLVE scores for the chosen DownUnderCTF 2023 challenges]

[Figure: N teams solved (log scale) vs. SOLVE score]

[Figure: First solve time (log scale) vs. SOLVE score]

In these graphs, we see that there is a good (but not perfect) correlation between the SOLVE score and all three comparative metrics: stated difficulty level, number of solving teams (in log scale), and first solve time (in log scale). Specifically, looking at the coefficient of determination (R²) under linear regression, we observe the following (a sketch of the computation, on made-up data, is given after the list):

  • R² ≈ 0.79 for SOLVE score vs. stated difficulty. Here, the conversion of the stated difficulty to numbers is: Beginner=1, Easy=2, Medium=3, Hard=4.
  • R² ≈ 0.83 for SOLVE score vs. N teams solved (log scale), where we use log(N+1) instead of log(N) to account for the challenge solved by 0 teams.
  • R² ≈ 0.72 for SOLVE score vs. first solve time (log scale), where we used a ceiling of "48 hours" (the full duration of the competition) as the "first solve time" of the challenge that was not solved by any team during the CTF.
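
As referenced above, each R² value comes from a simple least-squares linear regression. The sketch below shows the shape of that computation on made-up numbers; the real per-challenge data is in the published statistics.

import numpy as np

# Made-up values standing in for the real DownUnderCTF 2023 statistics.
solve_scores = np.array([2.1, 3.4, 4.0, 5.2, 6.8, 8.1])
teams_solved = np.array([900, 410, 150, 60, 12, 0])   # N teams that solved each challenge
first_solve_s = np.array([600, 1800, 5400, 20000, 90000, 48 * 3600])  # 48h ceiling for the unsolved one

def r_squared(x, y):
    """Coefficient of determination for a linear regression y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (a * x + b)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# log(N + 1) keeps the challenge solved by 0 teams defined.
print(r_squared(solve_scores, np.log(teams_solved + 1)))
print(r_squared(solve_scores, np.log(first_solve_s)))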

The values are summarized in the following table, and the full stats are available.

[Table: R² values]

We conclude that the SOLVE v0.5 score is a reasonable predictor of the challenge difficulty statistics in a live CTF competition, with the benefit that assessing it does not require holding a competition. It is comparable to the stated difficulty given by the challenge authors, with the added benefit that the full SOLVE score includes an assessment of where the difficulty of the challenge lies (e.g. whether it is mostly in the code analysis, the vulnerability discovery, or the exploit development). Therefore, we believe SOLVE is a useful tool for assessing challenges and analyzing their difficulty.

We provide a SOLVE score calculator, pre-filled with details from the 16 chosen DownUnderCTF challenges.

Notes:

  1. "N teams solved" and "First solve time" are highly but not perfectly correlated, with R² ≈ 0.88 for log(N+1) vs. log(FST). This correlation serves as a baseline: We don't expect any predictor of difficulty to do better than the "internal" correlation of these statistics.
  2. Some challenges were only solved by a small number of teams; for these challenges, the statistics represent their difficulty less reliably.
  3. A few challenges (such as "pyny", "vrooom vroom" and "0day blog") are clear outliers in the SOLVE charts, having significantly lower or higher SOLVE scores than expected. This is partially due to limitations in the current version of the SOLVE score (detailed in the next section) and is an area for future improvement.

Limitations and directions for future improvement

SOLVE is very much a work in progress, and has limitations and potential issues. Some of these we expect to improve, while some may be inherent difficulties of judgement-based evaluation methods. Following are some of the limitations of SOLVE in its current version:

  • Judging a challenge and assigning a SOLVE score requires intimate familiarity with the challenge - what the bugs are and what the intended solution is. This limits the scale at which challenges may be judged, especially when the judge is not the challenge author; the judge must take the time to study the challenge before assigning the score.
  • The SOLVE judgement criteria are subjective and relative; in practice, judging the logical complexity of a vulnerability or exploit generally involves comparing it to other challenges ("X is more complicated than Y, so X should have a higher difficulty score than Y").
  • The final SOLVE score is a single number, which doesn't include important challenge context, such as what skills are required to solve the challenge (e.g., different people may have different skill levels in advanced topics, such as mathematical flaws in cryptographic algorithms or heap overflow exploitation techniques). A framework for this additional information and context is a direction for future work.
  • The current version of SOLVE can be used to assess open-code challenges involving vulnerability discovery and exploitation. Black-box challenges will be covered in a future version. Additionally, challenges that don't model open-code vulnerability discovery and exploit development (such as reverse-engineering challenges, malware development challenges, network attack simulation challenges, and other challenge types) are not considered in the current version of SOLVE. (In the DownUnderCTF statistics given above, reverse-engineering challenges were assessed by skipping the "vulnerability discovery" stage of the assessment, giving this stage a score of zero; this sometimes produced unexpected results and requires further improvement.)
  • SOLVE is not yet field-tested at high scale. More experience and testing are needed to tell whether this score is a good predictor of challenge difficulty for human players. We think extensive work comparing SOLVE scores to the performance of people with different expertise levels and backgrounds is important for verifying the usefulness of the framework.
  • Some of the formulas combining the SOLVE subcomponent scores into a final number are too sensitive, or not sensitive enough, and should be adjusted. For example, the "code readability" score, which measures how much deobfuscation needs to be done, does not seem to have enough of an effect on the "code analysis difficulty" score.
  • Some non-standard cases can become SOLVE outliers. We've seen that specific vulnerabilities can sometimes receive SOLVE scores that don't fit our intuition regarding their difficulty. There is work in progress to minimize this.
  • We use SOLVE to discuss difficulty both for humans and for AI models. It's unclear, even in theory, whether difficulty is similar for these different types of challenge-solvers. It may be that in the future, the scores for "difficulty for a model" and "difficulty for a human" will significantly diverge.
  • Given the judgement needed in applying SOLVE, different experts may give different scores to the same vulnerability. This is a product of SOLVE being a framework for judgement. However, this variance can be reduced with additional precision and clarity in the definitions.

Acknowledgement

We would like to thank DownUnderCTF for making their statistics and code openly available, for sharing additional statistics with us, and for allowing us to use their competition as a case study in this research. Their openness allows us, as researchers, to better analyze how both humans and LLMs approach vulnerability discovery challenges, which is crucial for advancing our understanding of AI capabilities.

SOLVE: Scoring Obstacle Levels in Vulnerabilities & Exploits (Version 0.5) © 2025 by Pattern Labs Tech Inc. is licensed under CC BY-SA 4.0.

To cite this article, please credit Pattern Labs with a link to this page, or use the BibTeX citation below.
@misc{pl-introducing2025,
  title={Introducing SOLVE: Scoring Obstacle Levels in Vulnerabilities \& Exploits (Version 0.5)},
  author={Pattern Labs},
  year={2025},
  howpublished={\url{https://patternlabs.co/blog/introducing-solve}},
}