Pattern Labs and Anthropic have completed an extensive cybersecurity evaluation of Claude Sonnet 4 and Claude Opus 4, which have shown significant improvements over previous generations. Using our evaluation suite, as described expansively in Anthropic's model card, the models were tested across 48 challenges covering web exploitation, cryptography, binary exploitation, reverse engineering, and network attacks.
Claude Opus 4 demonstrated enhanced cybersecurity intuition and adaptability rather than following scripted patterns. Notable advances include:
Both models still struggle with:
These findings highlight the evolution of AI cybersecurity capabilities, with Claude Opus 4 showing substantial progress in how frontier models approach offensive cybersecurity challenges.
Following the previous joint assessments of Claude 3, Claude 3.5, and Claude Sonnet 3.7, we have collaborated with Anthropic and have completed a thorough capability evaluation of Claude Sonnet 4 and Claude Opus 4. Using our extensive evaluation suite, both models were tested across 48 challenges spanning web exploitation, cryptography, binary exploitation (pwn), reverse engineering, and network attacks using the SOLVE scoring framework. Claude Opus 4 achieved 31/48 and Claude Sonnet 4 25/48, with detailed results available in Anthropic's model card.
Claude Opus 4 demonstrated sophisticated tactical reasoning, adapting its approach based on system context rather than following rigid procedures. In one evaluation requiring access to a domain controller after a domain admin connected to the compromised machine, Opus deviated from standard procedures. Instead of following the typical approach - dumping LSASS and extracting credentials with Mimikatz, it took another course of action.
Opus identified that the domain admin had active processes running on the system. It then successfully stole an access token directly from one of these processes, pivoting through Explorer, rdpclip, and other running applications. Finally, it leveraged the token to create a scheduled task to achieve its objective. This creative, iterative problem-solving mirrors how human penetration testers adapt to real-world constraints.
The rest of this blog post examines the behavioral patterns that emerged throughout our evaluation process. Rather than isolated observations, these insights represent a systematic analysis of how the models performed across diverse challenge categories, revealing consistent trends and underlying capabilities. Through this structured assessment, we provide concrete insights into how these frontier models approach cybersecurity challenges.
An ongoing challenge in cybersecurity evaluations is that models typically execute only a few isolated commands per message - a conservative approach that becomes problematic given context window constraints and evaluation efficiency requirements. To address this limitation, we instructed the models to execute multiple reconnaissance commands simultaneously within each turn.
Claude Opus 4 adhered to the instructions and demonstrated impressive command chaining capabilities, resulting in enhanced situational awareness, reduced turn counts, and higher likelihood of challenge completion rates compared to previous Claude generations:
This lack of improved situational awareness is potentially critical in cybersecurity operations and can mean the difference between success and failure.
In a web application security evaluation series, we observed a clear progression in how different Claude generations approach vulnerability research and exploitation. To showcase this evolution, we present a specific example: a website containing an easily guessable session cookie vulnerability with sequential values 0-999.
The models' capabilities also notably diverge when faced with multi-step exploitation challenges. In another web application security evaluation, we presented the models with the following scenario: exploiting a Local File Inclusion vulnerability via a CSS parameter to leak sensitive information through a PDF endpoint. This required not just finding the vulnerability, but orchestrating a complete exploitation chain.
Even with the improvements in the Claude 4 family, maintaining long-term coherent planning remains a challenge without extensive scaffolding. Minor setbacks can derail entire exploitation chains, causing the models to lose track of their original objectives.
An illustrative example occurred during testing with Claude Sonnet 4. The model attempted to write a Python script to C:\temp\impersonate.py, but encountered an error because the directory C:\temp did not exist (the default temporary directory in Windows is in C:\Windows\Temp). While the model correctly diagnosed the issue and created the missing directory, it then completely abandoned its original plan: instead of writing the impersonation script as intended, it pivoted to an entirely different approach.
When attempting to analyze a specific network environment, the models are sometimes held back by execution bugs. In one interesting example, Claude Sonnet 4 tried to scan the network for an available wiki server, but unexpectedly scanned only IP addresses between 1-50 and 200-254, skipping the entire range of 51-199. Since the model hardcoded the range in the scanning script, it would have struggled to reuse the script with different range values.
Throughout the evaluations, we observed a recurring disconnect between the models' theoretical knowledge and practical execution. Both Claude 4 variants demonstrated a solid understanding of cybersecurity concepts and could correctly identify appropriate tools and techniques for given scenarios. However, translating this knowledge into successful exploitation often proved challenging.
Memory analysis challenges emphasized this gap. The models would consistently reach for the right tools, selecting Mimikatz for credential extraction, for example, showing they understood what was generally required of them. Yet when it came to actual implementation, they struggled with the nuances: misinterpreting tool output, using incorrect parameters, or failing to adapt when initial attempts returned unexpected results.
Our extensive evaluation of Claude Sonnet 4 and Opus reveals an evolution in AI cybersecurity capabilities compared to Claude Sonnet 3.7. Claude Opus 4, in particular, demonstrates a marked improvement in approaching cybersecurity challenges with greater intuition and adaptability than previously seen, solving more complex challenges across the board.
While both models show progress in identifying and exploiting vulnerabilities, important limitations persist. The models still struggle with maintaining coherent planning when facing unexpected obstacles, and gaps between theoretical knowledge and practical execution remain evident. Improvements in these areas will likely lead to significant further advances in capabilities.
@misc{pl-from2025, title={From Scripts to Strategy: Claude 4's Advanced Approach to Offensive Security}, author={Pattern Labs}, year={2025}, howpublished={\url{https://patternlabs.co/blog/from-scripts-to-strategy-claude-4s-advanced-approach-to-offensive-security}}, }