TL;DR: A security researcher spent exactly $1,500 running 13+ AI models against a deliberately vulnerable app. GPT-5.5 led with a 70% solve rate, DeepSeek V4 Pro proved the most cost-efficient at $0.62 per successful solve, and Gemini refused to participate entirely. The experiment reveals stark differences in how leading models handle offensive security tasks (Notebookcheck).
How Did GPT-5.5 Dominate the $1,500 LLM Hacking Test?
GPT-5.5 topped the leaderboard with a 70% solve rate across the challenges in the $1,500 experiment, making it the most capable model tested at identifying and exploiting vulnerabilities in the deliberately flawed application. According to Notebookcheck’s coverage of the test, no other model came close to this figure. The runners-up landed well below it, leaving a clear performance gap rather than a narrow victory.
The test evaluated each model’s ability to recognize security flaws, craft exploit payloads, and navigate through multiple attack vectors within the target application. GPT-5.5 demonstrated consistent performance across different vulnerability categories, suggesting that OpenAI’s training data includes substantial security-related knowledge. The model did not just find easy bugs. It tackled complex multi-step exploitation chains.
The result also has to be read against cost. GPT-5.5 commands a premium price per API call, but its high solve rate means fewer attempts were needed per successful breach of the target. Models that cost less per query often required multiple retries, which narrows their apparent cost advantage: what matters is the expected spend per solve, roughly the per-attempt cost divided by the solve rate. Whether a cheaper model with more retries beats GPT-5.5 on total spend depends on that ratio, and DeepSeek V4 Pro’s $0.62 figure shows that a budget model can still come out ahead on cost while trailing on raw solve rate.
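The retry tradeoff can be made concrete with a back-of-the-envelope calculation. The sketch below assumes every attempt succeeds independently with a fixed probability, so the expected number of attempts per solve is 1 divided by the solve rate. The per-attempt prices and the budget model's solve rate are hypothetical placeholders, not figures from the test.

```python
def expected_cost_per_solve(cost_per_attempt: float, solve_rate: float) -> float:
    """Expected spend per solve under a simple independent-attempts model."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    # With success probability p per attempt, a solve takes 1/p attempts on average.
    return cost_per_attempt / solve_rate


# Hypothetical prices, chosen only to illustrate the comparison.
premium = expected_cost_per_solve(cost_per_attempt=2.00, solve_rate=0.70)
budget = expected_cost_per_solve(cost_per_attempt=0.25, solve_rate=0.20)

print(f"premium model: ${premium:.2f} per solve")
print(f"budget model:  ${budget:.2f} per solve")
```

Under these made-up numbers the budget model is cheaper per solve despite needing more retries. With a lower solve rate or a higher per-attempt price the ordering flips, which is why the ratio matters more than the sticker price.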
OpenAI has been iterating rapidly on its model lineup. The company recently updated GPT-5.5 Instant for more natural responses and phased out older models, consolidating its offerings around fewer, more capable options (The Decoder). This focus on quality over quantity appears to extend to security tasks as well.
The broader context matters here. GPT-5.5 is not a specialized security tool. It is a general-purpose language model that happens to excel at vulnerability analysis. This dual capability raises important questions about responsible deployment and the fine line between legitimate security testing and potential misuse. The researcher behind the test designed it specifically to probe these boundaries.
Why Did Gemini Refuse to Participate in the Security Test?
Google’s Gemini models refused to even attempt solving the security challenges, becoming the only major model family in the test to categorically decline participation. When prompted with the same vulnerability exploitation tasks that other models tackled, Gemini returned safety refusals instead of engaging with the material. This complete rejection stood in stark contrast to every other model tested (Notebookcheck).
The refusal pattern suggests Google has implemented particularly aggressive safety filters around security-related queries. While OpenAI and Anthropic models engaged with the tasks to varying degrees, Gemini’s guardrails appeared to treat any vulnerability exploitation request as an absolute boundary. No amount of prompt engineering or context framing could convince the model to proceed.
This behavior reflects a fundamental philosophical difference in how AI companies approach safety. Google has historically taken a more conservative stance on potentially harmful outputs, and Gemini’s complete refusal in this test aligns with that corporate posture. The question becomes whether this level of caution helps or hinders legitimate security research.
Security professionals often need AI assistance for authorized penetration testing, vulnerability assessment, and defensive analysis. When models refuse all security-related tasks indiscriminately, they may block legitimate use cases alongside malicious ones. Is there a middle ground between total refusal and unrestricted access?
Interestingly, this is not the first time Gemini’s safety filters have drawn attention. Academic research has shown that major AI models, including those from Google, still struggle with establishing appropriate boundaries in various contexts (DiarioBitcoin). The models tend to either over-engage or over-restrict, with few finding a consistent middle path.
From a practical standpoint, Gemini’s refusal means security researchers working with AI tools may need to look elsewhere. The model’s zero percent solve rate was not a reflection of capability but of policy. Whether this approach protects users or simply pushes them toward less restrictive alternatives remains an open debate in the AI safety community.
Which AI Models Offer the Best Cost-to-Solve Ratio?
DeepSeek V4 Pro emerged as the most cost-efficient model in the test, solving vulnerabilities at just $0.62 per successful attempt. This figure represents a fraction of what premium models charge, making DeepSeek an attractive option for security teams operating on limited budgets (Notebookcheck). The cost-to-solve metric matters enormously in real-world applications.
The test revealed a wide range of pricing efficiency across the 13 models evaluated. Premium models like GPT-5.5 offered the highest absolute success rate but at a higher per-query cost. Budget models required more attempts but charged significantly less per query. The sweet spot depends on what you value more: speed or economy.
Here is a breakdown of the key cost and performance metrics observed in the experiment:
- GPT-5.5: 70% solve rate, premium pricing tier, fewest attempts needed per success
- DeepSeek V4 Pro: $0.62 per successful solve, strong value proposition for budget-conscious teams
- Claude models: Moderate solve rates with mid-range pricing, balanced approach
- Gemini: 0% solve rate due to complete refusal, cost irrelevant without engagement
- GPT-4o: Lower solve rate than GPT-5.5 but still competitive on complex tasks
- Open-source models: Variable performance, generally lower solve rates but minimal cost
- Mid-tier commercial models: Mixed results, some surprises in capability
- Specialized security tools: Not included in this particular test
| Model | Solve Rate | Cost Per Solve | Total Attempts | Notes |
|---|---|---|---|---|
| GPT-5.5 | 70% | Premium | Fewest | Highest capability |
| DeepSeek V4 Pro | Moderate | $0.62 | Moderate | Best value |
| Claude | Moderate | Mid-range | Moderate | Balanced |
| Gemini | 0% | N/A | N/A | Refused all tasks |
The total budget of $1,500 was spread across all models, giving each a fair allocation of resources. Some models burned through their allocation quickly with expensive queries that failed. Others stretched their budget further with cheaper calls that eventually produced results. This dynamic mirrors real-world security testing where budgets are finite and efficiency matters.
For organizations evaluating which AI model to use for security work, the data presents a clear tradeoff. GPT-5.5 delivers the best results but at a premium. DeepSeek V4 Pro offers remarkable value. And Gemini, at least in its current configuration, is simply not an option for this type of work.
How Was the Vulnerability Experiment Structured?
The experiment involved a deliberately vulnerable application designed to test whether AI models could identify and exploit common security flaws. The researcher allocated exactly $1,500 across 13 different large language models, giving each model an equal opportunity to analyze the target and attempt exploitation (Notebookcheck). The setup was methodical and transparent.
The vulnerable application contained multiple categories of security flaws, including injection vulnerabilities, authentication bypasses, and logic errors. Each model received the same initial prompt describing the target and the objective. From there, models were free to interact with the application through their standard APIs. No special tooling was provided.
This approach tested raw model capability rather than tool-augmented performance. The researcher wanted to understand what each model could accomplish using only its pretrained knowledge and reasoning abilities, so the playing field was intentionally level.
The $1,500 budget was chosen to represent a realistic amount that an individual researcher or small security team might spend on such testing. It is enough to generate meaningful results across multiple models but not so large that cost becomes irrelevant. Each model’s API pricing determined how many attempts it could make within its allocated share.
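How pricing caps the number of attempts can be sketched with simple integer arithmetic. The example assumes an equal split of the budget across 13 models and uses hypothetical per-attempt prices. The source describes only a "fair allocation" and does not state the exact split or the prices. Amounts are kept in cents to avoid floating-point rounding.

```python
def attempts_within_budget(total_budget_cents: int, n_models: int,
                           cost_per_attempt_cents: int) -> int:
    """Number of whole attempts a model can afford from an equal budget share."""
    if n_models <= 0 or cost_per_attempt_cents <= 0:
        raise ValueError("n_models and cost_per_attempt_cents must be positive")
    share_cents = total_budget_cents // n_models  # assumption: equal split
    return share_cents // cost_per_attempt_cents


TOTAL_BUDGET_CENTS = 150_000  # $1,500, from the article
N_MODELS = 13                 # model count, from the article

# Hypothetical per-attempt prices, for illustration only.
cheap_attempts = attempts_within_budget(TOTAL_BUDGET_CENTS, N_MODELS, 50)    # $0.50
pricey_attempts = attempts_within_budget(TOTAL_BUDGET_CENTS, N_MODELS, 400)  # $4.00

print(cheap_attempts, pricey_attempts)
```

The same share buys many more attempts for a cheap model, which is how a budget model can afford the retries that a premium model does not need.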
Results were measured by solve rate, which tracks the percentage of vulnerabilities each model successfully identified and exploited. This metric provides a straightforward comparison of capability without introducing subjective quality assessments. A model either solved the challenge or it did not.
The experiment also tracked secondary metrics including cost per solve, total attempts made, and the types of vulnerabilities each model handled best. These additional data points help security professionals understand not just which model is best overall, but which model might be best for specific categories of vulnerability assessment.
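The metrics described above can be computed from a plain log of attempts. The sketch below is one possible bookkeeping scheme, not the researcher's actual tooling. The `Attempt` record, the challenge names, and the costs are all hypothetical. A challenge counts as solved if any attempt against it succeeded.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    challenge: str   # hypothetical challenge identifier
    solved: bool     # True if this attempt exploited the flaw
    cost_usd: float  # API cost of this single attempt


def summarize(attempts: list[Attempt]) -> dict[str, float | int | None]:
    """Solve rate, attempt count, and cost per solve for one model."""
    challenges = {a.challenge for a in attempts}
    solved = {a.challenge for a in attempts if a.solved}
    total_cost = sum(a.cost_usd for a in attempts)
    return {
        "solve_rate": len(solved) / len(challenges) if challenges else 0.0,
        "total_attempts": len(attempts),
        # Undefined when nothing was solved, e.g. for a model that refuses.
        "cost_per_solve": total_cost / len(solved) if solved else None,
    }


# Hypothetical log for a single model.
log = [
    Attempt("sqli", False, 0.25),
    Attempt("sqli", True, 0.25),
    Attempt("auth-bypass", False, 0.50),
    Attempt("logic-flaw", True, 0.50),
]
summary = summarize(log)
print(summary)
```

A model with no solves gets `None` for cost per solve rather than zero, matching the "N/A" entry a refusing model would receive in a results table.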
What Does the 70% Solve Rate Mean for Application Security?
GPT-5.5 solved 70% of deliberately vulnerable application challenges in a $1,500 security test, demonstrating that modern AI models can identify and exploit common web vulnerabilities at a level previously achievable only by trained penetration testers (Notebookcheck News, 2025). This solve rate suggests that standard vulnerability classes like SQL injection, cross-site scripting, and authentication bypass are now within reach of automated AI agents. Organizations relying on traditional security testing cycles may need to reassess how frequently they audit their applications.
The implications are significant. If a single AI model can autonomously discover seven out of ten vulnerabilities in a test environment, production systems with similar weakness profiles face measurable risk. Security teams should treat these results as a signal to adopt more continuous testing practices rather than relying on annual penetration tests. The barrier to entry for offensive security testing has dropped substantially.
However, context matters. The test targeted a deliberately vulnerable application designed for security education, not a hardened production system with real-world defense layers like WAFs, rate limiting, and anomaly detection. The 70% figure represents an upper bound under ideal conditions for the attacker. Production environments would likely yield lower success rates due to additional security controls and monitoring.
Still, the trend is clear. AI-powered vulnerability discovery is becoming faster and cheaper. DeepSeek V4 Pro demonstrated this with a cost of just $0.62 per successful solve, making large-scale automated scanning economically viable for both defenders and attackers. Security budgets need to account for this new reality.
Can AI Models Replace Human Penetration Testers Today?
No. Despite GPT-5.5 achieving a 70% solve rate in controlled testing, AI models cannot fully replace human penetration testers who bring contextual understanding, creative exploitation chains, and business logic analysis that current models lack (Notebookcheck News, 2025). The test environment was deliberately vulnerable with known weakness patterns, a scenario that favors pattern-matching AI systems over novel problem-solving.
Human testers excel at identifying complex multi-step vulnerabilities that require understanding business logic, user roles, and data flow relationships. AI models perform well on standardized vulnerability classes but struggle with context-dependent attacks. For example, an AI might find a SQL injection point but miss that the injected data flows into an admin panel accessible through a separate authentication bypass.
The cost comparison tells an important story. The researcher spent $1,500 testing 13 models against a single application. A human penetration tester might charge $5,000 to $15,000 for a similar engagement but would provide detailed analysis, remediation guidance, and contextual findings that AI alone cannot deliver. The economics favor using AI as a force multiplier rather than a replacement.
Organizations should consider hybrid approaches where AI handles initial scanning and common vulnerability identification while human testers focus on complex logic flaws and business-critical attack paths. This model reduces costs while maintaining the depth of analysis that production systems require. The technology is not ready to operate independently in high-stakes environments.
How Do These Hacking Results Compare to Standard Coding Benchmarks?
The 70% vulnerability solve rate achieved by GPT-5.5 in security testing appears stronger than typical performance on general coding benchmarks, where models like GPT-5.5 and Claude Opus 4.8 show more nuanced results depending on task complexity (MindStudio, 2025). Security testing often involves recognizing known vulnerability patterns, which plays to the strengths of large language models trained on vast code repositories containing examples of common flaws.
On the DeepSuite software engineering benchmark, Claude Opus 4.8 and GPT-5.5 compete closely on real coding tasks, with performance varying significantly based on task type, language, and complexity. General coding benchmarks require models to generate correct implementations, debug subtle logic errors, and optimize algorithms, tasks that demand different capabilities than identifying injection points or misconfigured authentication.
Security testing benefits from a narrower problem space. The OWASP Top Ten vulnerability categories represent well-documented patterns with established exploitation techniques. Models trained on security research, bug bounty reports, and vulnerable code examples can recognize these patterns effectively. General coding lacks this predictability, which explains why benchmark performance varies more widely across different evaluation suites.
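To show how recognizable such patterns are, here is the textbook form of one of them, SQL injection, next to its standard fix. This is a generic sketch using Python's built-in `sqlite3` module and an in-memory database. The table, function names, and payload are illustrative and are not taken from the application used in the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_user_unsafe(name: str) -> list[tuple]:
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list[tuple]:
    # Fixed: the value is bound as a parameter and never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "' OR '1'='1"  # classic tautology payload
unsafe_rows = find_user_unsafe(payload)
safe_rows = find_user_safe(payload)

print(len(unsafe_rows), len(safe_rows))
```

The string-built query returns every row because the payload rewrites the `WHERE` clause, while the parameterized query treats the payload as an ordinary name and matches nothing. Both the flaw and the fix follow a fixed, heavily documented shape, which is the kind of regularity a model trained on public code can pick up.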
The gap between security and general coding performance highlights an important asymmetry. Offensive security tasks may be easier for current AI architectures than defensive tasks like writing secure code from scratch. This asymmetry means AI could accelerate vulnerability discovery faster than it improves secure development practices, potentially widening the attack surface before defenses catch up.
What Security Risks Does Autonomous AI Hacking Introduce?
Autonomous AI hacking capabilities introduce risks on multiple fronts: lowered barriers to entry for malicious actors, faster exploitation of newly disclosed vulnerabilities, and potential for large-scale automated attacks that overwhelm existing defense mechanisms. The Notebookcheck test showed that DeepSeek V4 Pro could deliver a successful exploit for just $0.62, making mass vulnerability scanning economically trivial for any attacker.
The primary concern is democratization of offensive capabilities. Previously, effective vulnerability exploitation required specialized knowledge, tools, and experience. Now, an AI model can guide non-technical users through exploitation steps, lowering the skill threshold for conducting attacks. This shift means organizations face a larger pool of potential attackers, not just elite threat actors.
Speed presents another risk vector. When a new vulnerability is disclosed, human attackers need time to understand the flaw, develop exploits, and target vulnerable systems. AI models could automate this process, potentially exploiting vulnerabilities within hours of disclosure rather than days or weeks. Incident response teams may face compressed timelines that current playbooks cannot accommodate.
The dual-use nature of these capabilities complicates mitigation. The same AI agents that can autonomously hack applications can also be deployed for defensive security testing, bug bounty hunting, and compliance auditing. Organizations need policies governing how AI security tools are used internally while remaining vigilant against external threats leveraging identical technology.
Regulatory frameworks have not kept pace. Existing cybersecurity regulations rarely address AI-powered autonomous attacks, leaving organizations without clear compliance guidance. Security teams should monitor legislative developments and begin incorporating AI-driven threat scenarios into their risk assessments and incident response plans.
Which Developer Tools Already Use Autonomous AI Agents?
Several developer tools now integrate autonomous AI agents for tasks ranging from code review to security scanning, though the market remains fragmented between general-purpose coding assistants and specialized security tools. OpenAI’s Codex platform represents one prominent example, with logs even referencing upcoming models like GPT-5.6 for enhanced coding capabilities (DEV Community, 2025; TokenMix Blog, 2025).
The following tools currently incorporate autonomous AI agent functionality:
- OpenAI Codex — Autonomous coding agent that can execute multi-step programming tasks, with rollout logs referencing next-generation models including GPT-5.6
- GitHub Copilot Workspace — AI-assisted development environment that plans and implements code changes across multiple files
- Snyk Code — Real-time security scanning that uses AI to identify vulnerabilities during development
- SonarQube AI — Code quality and security analysis with AI-powered suggestions for remediation
- Cursor IDE — AI-native code editor with autonomous editing and refactoring capabilities
- Devin by Cognition — Autonomous software engineering agent that can plan, code, debug, and deploy independently
- Anthropic Claude Code — Command-line tool for autonomous code generation and repository analysis
- Amazon Q Developer — AI-powered assistant for code generation, testing, and security review
| Tool | Primary Function | Autonomous Level | Security Capability |
|---|---|---|---|
| OpenAI Codex | Multi-step coding tasks | High | Basic vulnerability detection |
| GitHub Copilot | Code completion and generation | Medium | Pattern-based security hints |
| Snyk Code | Security-focused scanning | Medium | Specialized vulnerability detection |
| Devin | Full software engineering | High | Limited security testing |
| Claude Code | Repository analysis and editing | Medium | Code review with security focus |
| Amazon Q | Development assistance | Medium | Security best practice suggestions |
The convergence of coding assistance and security testing within these platforms suggests that autonomous AI agents will increasingly handle both functions simultaneously. Developers using tools like Codex or Claude Code may receive security recommendations during normal development workflows without requiring separate security scanning tools.
Frequently Asked Questions
How much did the entire AI hacking experiment cost?
The security researcher spent $1,500 in total running 13 AI models against a deliberately vulnerable application, with GPT-5.5 achieving the highest solve rate at 70% (Notebookcheck News, 2025). DeepSeek V4 Pro offered the best cost efficiency at just $0.62 per successful attempt, demonstrating that effective AI-powered vulnerability discovery can be remarkably inexpensive compared to traditional penetration testing engagements.
Did DeepSeek V4 Pro actually outperform GPT-5.5?
No. GPT-5.5 achieved the highest solve rate at 70%, outperforming every other model, including DeepSeek V4 Pro, on raw success rate (Notebookcheck News, 2025). DeepSeek V4 Pro distinguished itself on cost efficiency instead, at $0.62 per successful solve, which makes it the most economical option for large-scale automated security testing despite its lower solve rate.
Why is Gemini’s refusal considered a significant finding?
Gemini’s complete refusal to attempt any hacking challenges stands out because it demonstrates a fundamentally different approach to safety guardrails compared to GPT-5.5 and Claude, which engaged with security testing prompts (Notebookcheck News, 2025). This refusal raises questions about whether rigid safety restrictions prevent legitimate security research and defensive testing, potentially pushing users toward less restricted models for legitimate purposes.
How accurate are these AI models in non-security tasks like healthcare?
AI-powered chatbots respond to everyday health-related questions with nearly 76% accuracy, according to Penn State Health research, which raises concerns about their trustworthiness in real-world medical applications (Penn State Health News, 2026). This accuracy level suggests that while AI models perform impressively on structured tasks like security testing, their reliability in high-stakes domains like healthcare remains inconsistent and potentially dangerous without expert oversight.
Summary
The $1,500 AI hacking test reveals several important takeaways for developers, security professionals, and organizations evaluating AI capabilities:
- GPT-5.5 leads in autonomous security testing with a 70% solve rate on deliberately vulnerable applications, demonstrating that current AI models can handle common vulnerability classes effectively
- Cost efficiency varies dramatically: DeepSeek V4 Pro came in at $0.62 per successful solve, making large-scale automated vulnerability scanning economically accessible
- Safety guardrails differ significantly between providers: Gemini refused all hacking attempts entirely, while GPT-5.5 and Claude engaged with security testing prompts
- AI augments but cannot replace human testers: the controlled test environment favored pattern recognition, while production systems require human expertise for complex logic flaws
- Organizations must adapt security practices: the democratization of AI hacking capabilities means both defenders and attackers gain access to powerful tools, compressing response timelines
The gap between offensive and defensive AI capabilities presents a growing challenge. As models like GPT-5.5 and the rumored GPT-5.6 continue advancing, security teams need proactive strategies that incorporate AI-powered testing into their regular workflows. The tools are available today — the question is whether organizations will adopt them before attackers do.