Offensive Cyber Capabilities Analysis

October 2, 2024

Introduction

At Pattern Labs, we’ve been focusing some of our efforts on evaluating the cybersecurity capabilities of frontier models. To do so, one of the first questions we tackled was how to define these capabilities in a meaningful and useful way. The following describes the taxonomy we are currently using internally, and while it is constantly evolving and a work in progress, we believe it is mature enough to be useful to others as well.

In the rest of this blog post we will first elaborate on the theoretical framework used in the taxonomy, and then showcase the taxonomy itself. Note that we use the words “capability” and “skill” interchangeably in this document for clarity.

Theoretical Framework

Before diving into the taxonomy, it is important to differentiate between different parts of the cybersecurity skill-set. We first note that we are focused on measuring the practical cybersecurity skills and capabilities of frontier models, rather than their knowledge¹. We start by contrasting a cybersecurity capability with a cybersecurity domain:

A cybersecurity capability is the ability to apply some cybersecurity knowledge and know-how to achieve a certain goal. It is characterized by the type of actions done or the goal achieved.
A cybersecurity domain is the cybersecurity context in which the cybersecurity skill is being applied and encompasses the surrounding environment and circumstances. It is characterized by the type of environment or obstacles faced.

In other words, you apply cybersecurity capabilities in different cybersecurity domains to achieve similar goals. For example, many cybersecurity operations include elements of intelligence gathering and reconnaissance. This is an important skill: without knowing what to expect during a cyber operation, it is difficult to achieve your goals. However, there are many domains in which this capability can be applied: reconnaissance when starting to plan a cyber attack is entirely different from reconnaissance after a foothold has been established inside a target’s network. Although the goal might be similar, the practical applications differ significantly.

Furthermore, capabilities can be broken down into sub-capabilities, and usually into many levels below that. For example, a sub-capability of intelligence gathering and reconnaissance is gathering publicly available information about a specific target. An instance of this sub-capability is OSINT gathering: the ability to collect and assimilate information from open sources, such as the internet. This is clearly part of the intelligence gathering process, but in some situations it might not be applicable, or might be less relevant: e.g., against hardened targets about which little information is available, or after the initial reconnaissance stage of an operation. These instances can sometimes be broken down even further: e.g., OSINT can be gathered from social networks, official websites, or other widely available sources such as Wikipedia.
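The nesting described above (capability, sub-capability, instance, and further breakdowns) can be pictured as a simple tree. The following is a minimal sketch in Python, not a representation we claim is used internally: the `Node` class and its `leaves` and `depth` helpers are hypothetical names introduced only for illustration, and the entries are taken from the OSINT example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of the hierarchy: a capability, sub-capability, or instance."""
    name: str
    children: list["Node"] = field(default_factory=list)

    def leaves(self) -> list[str]:
        """Return the names of the most specific entries below this node."""
        if not self.children:
            return [self.name]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found

    def depth(self) -> int:
        """Number of levels from this node down to its deepest leaf."""
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)


# Hypothetical encoding of the example from the text.
recon = Node("Intelligence gathering and reconnaissance", [
    Node("Gathering publicly available information", [
        Node("OSINT gathering", [
            Node("Social networks"),
            Node("Official websites"),
            Node("Other widely available sources"),
        ]),
    ]),
])

print(recon.depth())   # 4
print(recon.leaves())
```

A tree like this makes it easy to see how deep a given breakdown goes, and to list the most specific entries that an evaluation could target.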

On the other hand, an example of cybersecurity domains is the diverse application of vulnerability discovery in various settings. For instance, vulnerability discovery and exploitation differ significantly when attempting to exploit a web application in a black-box scenario versus a workstation with known software as part of a network operation.
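One way to picture the relationship between capabilities and domains is as a grid in which every capability can, in principle, be exercised in every domain. The sketch below is only an illustration under that assumption: the function names `evaluation_grid` and `uncovered` are hypothetical, and the capability and domain labels are paraphrased from the examples in this section.

```python
from itertools import product

# Illustrative labels paraphrased from the text; not an exhaustive list.
capabilities = [
    "Intelligence gathering and reconnaissance",
    "Vulnerability discovery and exploitation",
]
domains = [
    "Black-box web application",
    "Workstation with known software inside a network",
]


def evaluation_grid(capabilities, domains):
    """Pair every capability with every domain; each pair is one setting."""
    return list(product(capabilities, domains))


def uncovered(grid, completed):
    """Return the (capability, domain) pairs that have no evaluation yet."""
    done = set(completed)
    return [cell for cell in grid if cell not in done]


grid = evaluation_grid(capabilities, domains)

# Hypothetical record of which settings already have an evaluation.
completed = [
    ("Vulnerability discovery and exploitation", "Black-box web application"),
]

print(len(grid))                        # 4
print(len(uncovered(grid, completed)))  # 3
```

Keeping the two axes separate reflects the point made above: the same capability may need a distinct evaluation in each domain, because the practical application differs even when the goal is similar.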

Finally, we would like to briefly highlight capabilities that we call cyber-strengths. These are specific skills, areas of knowledge, or forms of access that in our eyes constitute unique and significant advantages in the cyber domain. Usually, having access to these strengths allows a threat actor to considerably scale the number of cyber attacks it can execute, to successfully attack hardened targets that would otherwise be protected, or both. In colloquial terms, these are game-changers in the realm of cybersecurity.

Our taxonomy aims to cover the major capabilities and most of their critical sub-capabilities, highlight some of the common existing cybersecurity domains, and list some significant cyber-strengths.

What the Taxonomy does not Cover

Importantly, we note that our taxonomy focuses solely on capabilities that are unique to the field of cybersecurity. This distinction is non-trivial: for example, one could argue that there is a high correlation between coding skills and cybersecurity capabilities in AI systems, and thus testing coding skills helps significantly with evaluating cybersecurity capabilities. Although this is probably true, we focus only on distinct cyber skills, for three main reasons:

There is no good way to set a clear boundary between these correlated capabilities and cybersecurity capabilities. For example, one could argue that measuring English proficiency is critical for coding and cybersecurity.
In our experience, these capabilities are implicitly tested in some of the evaluations. For example, if an AI system receives multiple different artifacts (e.g. a long network packet capture, traffic logs, VPN configuration files) as part of some network test, a capable evaluatee would write a system that parses these effectively.
We feel that enough attention in the industry is given to evaluating these correlated capabilities.

Moreover, the taxonomy does not include agentic reasoning, planning and orchestration capabilities, even though multiple threat scenarios are affected by these capabilities and AI systems’ performance is commonly limited by them. Although we actively monitor these skills in our evaluations, we believe that they are not unique to the cybersecurity domain and thus should not be included in the taxonomy.

Finally, some specific cybersecurity capabilities are out of scope for the taxonomy, at least for the near future. For instance, using human agents and leveraging cyber-relevant purchases on the darknet are currently considered to be covered by other evaluation types (e.g., autonomous replication and adaptation, or ARA) and are not in our focus.

In essence, this taxonomy is aimed at evaluating and judging the offensive cybersecurity capabilities of existing AI systems. Consequently, some skills and tests that would be appropriate to evaluate expert cybersecurity personnel on are not included.

Pattern Labs AI Cybersecurity Evaluation Taxonomy

The following is Pattern Labs’ Cybersecurity Taxonomy. As mentioned before, this is a living document that is being refined in an ongoing manner.

Structure of the Taxonomy

The taxonomy is built in the following structure:

The taxonomy is divided into three main categories: Cybersecurity Capabilities, Cybersecurity Strengths and Cybersecurity Domains.
In the capabilities section, which constitutes the bulk of the taxonomy, the headings are the major cybersecurity capabilities we believe AI systems should be evaluated on. The sub-headings are sub-capabilities, and below those are instances of these sub-capabilities, with detailed examples of possible evaluations/scenarios.
- Note that not all capabilities have multiple sub-capabilities.
Generally, the top-level capabilities include what we consider to be the major relevant capabilities to be evaluated. On the other hand, there are many sub-capabilities within each one, many of which are mostly relevant in specific circumstances, or in specific domains. Therefore, anything below the top level should be considered non-comprehensive, serving rather as notable examples of sub-capabilities.
After the capabilities section, we list notable cybersecurity strengths and cybersecurity domains.

Cybersecurity Capabilities Overview

Intelligence Gathering and Reconnaissance

Intelligence Gathering and Reconnaissance is the ability to find and research relevant knowledge and data and to apply them in a cybersecurity context. The following are some of its sub-capabilities, with detailed example evaluations and scenarios.

Public Reconnaissance

OSINT² gathering - e.g., how good are you at creating a detailed portfolio of your target, working based solely on open source intelligence, that includes information such as workers in the target organization, assets, workflows, used security products, etc.
Outsiders’ perspective network reconnaissance - what can you learn and extract from internet facing assets and “public knowledge” online about the target’s network infrastructure? What attack surfaces are open? Which leads seem promising, and which are likely dead ends not worth further exploration and research?

Note: Outsiders’ perspective network reconnaissance includes both gathering different artifacts and assimilating them together; e.g., understanding the target’s network architecture from multiple artifacts.

Internal Reconnaissance

Artifact prioritization and gathering - for instance, according to some objective (e.g., gain access to some DB, or get to some network endpoint) and assuming some access (e.g. admin access on a worker’s computer), what should you aim to gather? What would be the most effective method, or where can you find the relevant information/artifacts (e.g., EDR server and Domain Controller as two possible targets)?
Insiders’ perspective network reconnaissance - given some artifacts (e.g.: NTDS.dit, network traffic, configuration files) from inside the target network, what can you learn from them? How quickly?

Cybersecurity Tool & Malware Development³

Cybersecurity Tool & Malware Development is the ability to design and create software that is cybersecurity-specific.

Malware/Tool Development Skills

How good are you at creating programs that are able to do effective CNA/CNE⁴ - e.g., keyboard sniffers, wipers, backdoors.
Tailoring infrastructure for cybersecurity uses - e.g., can you configure DNS servers to communicate with your malware?
Malware as infrastructure - e.g., can you create a botnet that spreads to SMB routers and communicates with your C2? For example, this botnet needs to be able to spawn a shell for you when you want.

Execution and Tool Usage

Execution and Tool Usage is the ability to leverage common and cybersecurity-specific tools to achieve routine instrumental cyber goals⁵.

Cybersecurity Toolset Usage

Malware/Tool usage and acquaintance - e.g., how good are you at using cybersecurity tools, such as hashcat? How good are you at identifying the right tool for the task, weighing the costs and benefits of each tool?

Living Off the Land

Leveraging existing binaries - e.g., can you use built-in (non-cyber) operating system tools to achieve cybersecurity objectives such as persistence and lateral movement?
Existing artifact and architecture abuse - e.g., can you use credentials stored in the environment to escalate your privileges? Can you leverage some existing network component or feature to move laterally easily?

Operational Security (OpSec)

Operational Security is the skill of remaining hidden during a cyber operation and afterwards (in case it is discovered or when it is over).

Discovery Evasion

Discovery evasion is the ability to avoid defensive security solutions and defensive TTPs⁶. Some examples:

Can you work on a system (execute commands/run processes/etc.) without the security appliances detecting you? Can you do so while achieving your goals?
Can you harden a VPS to prevent it from being attacked or flagged as suspicious?
Can you change your exploits in a way that won’t trigger security solutions? How about lowering the probability of a crash given a statistical exploit?

OpSec Adaptation for Targets

This includes, for instance:

Can you adapt your operation given specific target knowledge? E.g., if the target is stringent regarding network monitoring but doesn’t care too much about what is happening on any specific endpoint, how would you adapt your TTPs? What is your methodology when approaching such questions?
If you need to maintain persistence in a network/against a target for a long time, what is your strategy and methodology? E.g., what specific technical methods would you implement to keep yourself secure? What is more important and what is less important? What mechanisms would you put in place (e.g., monitoring emails of the organization)?

Attribution & Forensic Evasion - Post Discovery OpSec

Forensic evasion - If you have been discovered, can you minimize the amount of artifacts left for the forensic investigation? Can you run your operation in a way that ensures that discovery of a single tool doesn’t compromise the entire operation?
Attribution evasion - If you have been discovered, can you make sure that the operation/tool isn’t attributable to you? Can you cover your tracks in a way that makes it ambiguous who executed the attack?

Infection Vectors

The Infection Vectors capability is essentially the ability to gain access to a system by causing unexpected or unintended behavior in some part of it.

Vulnerability Research

The sub-capability of vulnerability research aims to evaluate multiple areas, such as:

How good are you at finding vulnerabilities in code?
Given some software/hardware/etc., how likely are you to find weaknesses in it, from obvious to novel?

Vulnerability Exploitation

Given vulnerabilities in code (be it 0-days or 1-days), how good are you at writing working exploits for these vulnerabilities?
Can you deal with mitigations, e.g. ASLR?

Vulnerability Discovery and Identification

Can you fingerprint a service to discover its version and verify whether it is vulnerable to a known 1-day?
Are you able to identify all devices accessible from the web that are vulnerable to a specific 1-day vulnerability?
Given a list of different known vulnerabilities that can be used to gain initial access to an organization, are you able to prioritize them according to various cybersecurity considerations (e.g., risk of discovery vs. probability of successful exploitation)?
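The first item above, checking whether a fingerprinted service version falls within a known affected range, comes down to parsing a version string and comparing it against bounds. The following is a minimal sketch of that comparison only. The banner string, the product name `ExampleHTTPd`, the version bounds, and the function names are all hypothetical and do not refer to any real product or vulnerability.

```python
import re


def parse_version(banner: str):
    """Extract the first dotted version number from a banner, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_in_affected_range(version, first_affected, first_fixed):
    """True if first_affected <= version < first_fixed (tuple comparison)."""
    return first_affected <= version < first_fixed


# Hypothetical banner and hypothetical affected range, for illustration only.
banner = "ExampleHTTPd/2.4.7 (hypothetical)"
version = parse_version(banner)

print(version)                                               # (2, 4, 7)
print(is_in_affected_range(version, (2, 4, 0), (2, 4, 10)))  # True
```

Comparing tuples of integers rather than raw strings avoids the common mistake of treating "2.4.10" as lower than "2.4.7". Real banners are often missing, truncated, or deliberately misleading, so a check like this is at best one signal among several.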

How good are you at Phishing and Spear-Phishing? How good are you at finding relevant information that would increase your odds of success?
Can you spear-phish with fake videos/voice similarity effectively?
Can you trick users into downloading malware effectively (e.g. via content injection⁷, setting up fake sites, etc.)?

Cybersecurity Strengths

This list is not exhaustive and is intended to give notable examples.

Cryptographic capabilities - e.g. code signing, breaking common encryption methods such as AES or RSA.
Protocol backdoors - e.g. inserting a backdoor into a NIST standard that is commonly adopted or used.
Access to data/code from key companies - e.g. obtaining access to Microsoft’s internal documentation of Windows. This can for instance significantly shorten the time to develop exploits.
Novel TEMPEST capabilities (e.g. Van Eck phreaking) - development or knowledge of novel physical techniques that allow the realization of unique, perhaps unthought of, attack vectors.

Cybersecurity Domains

This list is not exhaustive and is intended to give notable examples. Note that these domains are not mutually exclusive.

CNI/ICS environment focused operations - e.g., writing code for the exploitation of CNC machines.
Mobile focused cybersecurity operations - e.g., an operation targeting Android tablets and phones.
Persistent network operations against targets - e.g., an ongoing operation against a telecommunications company.
Supply Chain operations - e.g., an operation that is focused on inserting a backdoor into a utility used by some target, or attacking a news site to infect people who visit it.

Footnotes

¹ We plan on elaborating on this difference in a future blog post.

² Other “INTs” can also be relevant, but currently we believe they are out of scope. If AI capabilities advance sufficiently, these may be included here and in other parts of the taxonomy.

³ Although a case can be made for this capability to be dismissed as not unique enough because of the significant overlap with general coding skills, from our experience it is distinct enough to be worth evaluating separately.

⁴ Cyber Network Attack & Cyber Network Exploitation

⁵ Some example goals: persistence, privilege escalation, lateral movement, collection, exfiltration and impact.

⁶ Tactics, Techniques, Procedures

⁷ See for example https://www.welivesecurity.com/en/eset-research/moustachedbouncer-espionage-against-foreign-diplomats-in-belarus/

Changelog

Version 1.1 - April 2025

As we further develop our understanding of the science of evaluations as well as see how models behave in cybersecurity environments, our methodology evolves as well. As part of this, we are updating our taxonomy with a minor change, and adding "Vulnerability Discovery and Identification" as an additional sub-capability under "Infection Vectors".

Vulnerability Discovery and Identification is the ability to detect known vulnerabilities and understand their relevance in a specific cybersecurity context. This can include, for example, fingerprinting a service to discover if it is a vulnerable version; finding all devices accessible from the web that are vulnerable to a specific 1-day vulnerability; or prioritizing which vulnerability is most likely to yield the wanted results (e.g., highest probability of RCE) when a service is vulnerable to multiple 1-days.

This capability is distinct both from Vulnerability Research - which focuses on the technical research of finding bugs in code - and Vulnerability Exploitation, which narrowly measures the ability to exploit these vulnerabilities to achieve a specific technical goal (for instance privilege escalation).

As we expect the taxonomy to continue to expand and change in the future, with this update we introduce a simple versioning mechanism: we designate the current version as 1.1, while the previous version is designated 1.0.

To cite this article, please credit Pattern Labs with a link to this page, or use the following BibTeX citation.

@misc{pl-cyber2024,
  title={Offensive Cyber Capabilities Analysis},
  author={Pattern Labs},
  year={2024},
  howpublished={\url{https://patternlabs.co/blog/cyber-capabilities-analysis}},
}
