Spending $1,500 on LLM API calls to test a deliberately vulnerable web application sounds like an expensive weekend project. Yet that is exactly what happened when researchers built a custom testing environment to evaluate whether modern language models can function as automated penetration testers. The experiment pitted ChatGPT, Claude, Gemini, and several smaller models against a web app packed with intentional security flaws. The goal was simple: can AI find and exploit vulnerabilities without human hand-holding?
TL;DR: After spending $1,500 on API calls across multiple LLMs to test a deliberately vulnerable web application, the results revealed that models like GPT-4 and Claude can identify common vulnerability patterns but struggle with multi-step exploit chains. Traditional scanners still outperform LLMs on raw coverage, though models excel at explaining why a vulnerability exists and suggesting remediation steps.
What Happens When You Let LLMs Attack a Deliberately Vulnerable Web Application?
The short answer: they partially succeed, but not in the way most security professionals would expect. Across hundreds of API calls and multiple test sessions, the best-performing models identified roughly 60-70% of planted vulnerabilities when given source code access. Without source code, that number dropped significantly. The models behaved more like knowledgeable junior security analysts than autonomous hacking tools.
Each LLM received the same prompt structure: a description of the target application, access to its endpoints, and instructions to find security issues. The models then issued HTTP requests, analyzed responses, and proposed exploits. Some models requested additional information mid-session, effectively conducting a dialogue with the testing harness. This interactive approach proved more effective than single-shot prompts asking the model to “find all bugs.”
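The interactive loop described above can be sketched roughly as follows. This is a minimal illustration and not the researchers' harness: `ask_model`, `send_request`, and the `REQUEST`/`FINDING`/`DONE` reply format are hypothetical, and the model is replaced by a scripted stub so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    """Conversation state shared between the harness and the model."""
    history: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)


def run_session(
    ask_model: Callable[[list[str]], str],
    send_request: Callable[[str], str],
    max_turns: int = 10,
) -> Session:
    """Alternate between model suggestions and observed responses.

    Hypothetical protocol: the model replies with 'REQUEST <path>',
    'FINDING <text>', or 'DONE'.
    """
    session = Session(history=["Target: test app. Find security issues."])
    for _ in range(max_turns):
        reply = ask_model(session.history)
        session.history.append(f"model: {reply}")
        if reply == "DONE":
            break
        kind, _, payload = reply.partition(" ")
        if kind == "REQUEST":
            # The harness, not the model, performs the HTTP call
            # and feeds the observed response back as text.
            observation = send_request(payload)
            session.history.append(f"harness: {observation}")
        elif kind == "FINDING":
            session.findings.append(payload)
    return session


# Scripted stand-ins so the sketch runs without network access or API keys.
def scripted_model(history: list[str]) -> str:
    script = [
        "REQUEST /search?q='",
        "FINDING possible SQL injection in /search",
        "DONE",
    ]
    turns_taken = sum(1 for line in history if line.startswith("model: "))
    return script[turns_taken]


def fake_target(path: str) -> str:
    return "500 syntax error near \"'\"" if path.endswith("'") else "200 OK"


session = run_session(scripted_model, fake_target)
```

The point of the structure is that every model suggestion is followed by an observation from the target, which is what separates this from a single-shot "find all bugs" prompt.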
The cost breakdown revealed interesting patterns. Claude consumed approximately $480 in API credits across all tests. ChatGPT accounted for roughly $520. Gemini and smaller open-source models split the remaining budget. More expensive did not always mean better — several smaller models identified vulnerabilities that larger models missed entirely, particularly in edge cases involving unusual input validation failures.
Which LLMs Actually Found Exploitable Vulnerabilities?
Claude and ChatGPT consistently ranked at the top for vulnerability detection accuracy. Both models successfully identified SQL injection points, reflected XSS vectors, and basic authentication bypass mechanisms. However, their success rates varied dramatically depending on vulnerability type and complexity of the exploit chain required.
The testing methodology involved scoring each model on four criteria: detection accuracy, false positive rate, exploit chain completion, and remediation quality. ChatGPT scored highest on remediation suggestions, producing actionable fix recommendations for 89% of confirmed vulnerabilities. Claude excelled at detection, achieving the lowest false positive rate at approximately 12%.
| Model | Vulnerabilities Found | False Positives | Avg Cost per Finding | Exploit Chain Success |
|---|---|---|---|---|
| ChatGPT | 14/20 | 6 | $37 | 3/7 |
| Claude | 13/20 | 3 | $42 | 2/7 |
| Gemini | 9/20 | 8 | $31 | 1/7 |
| Llama 3 | 7/20 | 5 | $22 | 0/7 |
| Mistral | 6/20 | 9 | $18 | 0/7 |
Gemini performed adequately on common vulnerability types but struggled with anything requiring contextual understanding of application logic. Open-source models like Llama 3 and Mistral showed promise on simple cases but failed entirely on complex exploit scenarios. None of the tested models successfully completed a full multi-step exploit chain without human intervention.
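For readers who want to compare the rows of the table, the sketch below derives a detection rate, a precision figure, and a chain completion rate from its counts. The names are illustrative. Precision here (real findings divided by all reported findings) is an assumed metric; the article does not state how its own false positive rate was calculated, so the two are not interchangeable.

```python
from dataclasses import dataclass

TOTAL_PLANTED = 20  # vulnerabilities planted in the test application
TOTAL_CHAINS = 7    # exploit chains attempted per model (from the table)


@dataclass(frozen=True)
class ModelResult:
    name: str
    found: int
    false_positives: int
    chains_completed: int

    @property
    def detection_rate(self) -> float:
        """Share of planted vulnerabilities the model found."""
        return self.found / TOTAL_PLANTED

    @property
    def precision(self) -> float:
        """Share of the model's reports that were real (assumed metric)."""
        return self.found / (self.found + self.false_positives)

    @property
    def chain_rate(self) -> float:
        """Share of attempted exploit chains that succeeded."""
        return self.chains_completed / TOTAL_CHAINS


# Counts copied from the table above.
RESULTS = [
    ModelResult("ChatGPT", 14, 6, 3),
    ModelResult("Claude", 13, 3, 2),
    ModelResult("Gemini", 9, 8, 1),
    ModelResult("Llama 3", 7, 5, 0),
    ModelResult("Mistral", 6, 9, 0),
]

best_detection = max(RESULTS, key=lambda r: r.detection_rate)
best_precision = max(RESULTS, key=lambda r: r.precision)
```

On the table's numbers, ChatGPT has the highest detection rate (70%) while Claude has the highest precision (13 of 16 reports were real, about 81%).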
How Do You Design a Vulnerable App That Tests Real-World Hacking Skills?
Building a realistic vulnerable application requires careful balance between authenticity and testability. The test application incorporated 20 distinct vulnerabilities spanning OWASP Top 10 categories, modern API security issues, and business logic flaws. Each vulnerability was designed to mirror real-world patterns observed in production codebases, avoiding the artificial simplicity common in CTF challenges.
The application stack included a Node.js backend, PostgreSQL database, Redis cache layer, and a React frontend. Vulnerabilities were planted at different architectural layers to test whether models could identify issues beyond simple input validation problems. Data-access issues included stored procedure injection and insecure direct object references. Application-layer flaws covered JWT misconfiguration, race conditions in payment processing, and insecure deserialization.
Several design decisions proved critical for meaningful test results. First, vulnerabilities required varying levels of exploitation complexity — some needed single requests while others demanded multi-step chains. Second, the application included false leads: code patterns that looked suspicious but were actually secure. This tested whether models could distinguish real vulnerabilities from benign code patterns.
The architecture also incorporated realistic security measures that partially mitigated some vulnerabilities. A web application firewall filtered obvious attack payloads. Rate limiting prevented brute force techniques. These defensive layers forced models to demonstrate creativity in crafting bypass techniques rather than simply sending textbook exploit payloads from security training data.
What Types of Vulnerabilities Can LLMs Reliably Identify?
LLMs excel at identifying pattern-based vulnerabilities that appear frequently in training data. SQL injection detection reached near-perfect accuracy across all tested models when the vulnerability followed classic patterns: unsanitized user input concatenated directly into SQL queries. Cross-site scripting detection showed similar reliability for reflected XSS, though stored XSS variants proved more challenging.
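The "classic pattern" is easy to show side by side with its fix. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the test application's PostgreSQL database; the table, function names, and payload are illustrative and not taken from the test app.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a single demo user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def login_vulnerable(conn: sqlite3.Connection, name: str, password: str) -> bool:
    # VULNERABLE: user input is concatenated directly into the SQL text,
    # so quotes in the input change the structure of the query.
    query = (
        "SELECT 1 FROM users WHERE name = '" + name
        + "' AND password = '" + password + "'"
    )
    return conn.execute(query).fetchone() is not None


def login_safe(conn: sqlite3.Connection, name: str, password: str) -> bool:
    # Parameterized query: the driver passes values separately from the SQL,
    # so input is always treated as data.
    query = "SELECT 1 FROM users WHERE name = ? AND password = ?"
    return conn.execute(query, (name, password)).fetchone() is not None


conn = make_db()
payload = "' OR '1'='1"  # textbook authentication bypass payload

bypassed = login_vulnerable(conn, "alice", payload)
blocked = not login_safe(conn, "alice", payload)
```

With the payload, the concatenated query ends in `OR '1'='1'` and matches a row without the correct password, while the parameterized version compares the payload as a literal string and matches nothing.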
The strongest performance came from vulnerabilities with well-documented signatures. Models confidently identified:
- Classic SQL injection in login forms and search parameters
- Reflected XSS in error messages and search results
- Hardcoded credentials in configuration files
- Missing authentication checks on sensitive endpoints
- Insecure direct object references in URL parameters
- Weak cryptographic implementations using deprecated algorithms
- Server-side request forgery in webhook functionality
- Open redirect vulnerabilities in authentication flows
When given source code access, models also performed well at identifying insecure dependency usage. They flagged outdated npm packages with known CVEs and highlighted dangerous function calls like eval() or exec() with user-controlled input. This code review capability represents one of the most practical applications of LLMs in security workflows.
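To make the "dangerous function call" check concrete, here is a small pattern-based detector of the kind a model is effectively approximating. It is written for Python source using the standard `ast` module, as a stand-in for the JavaScript the test application would contain, and it flags every direct `eval()` or `exec()` call without checking whether the argument is user-controlled. That taint analysis is the harder part and is not attempted here.

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}


def find_dangerous_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, function name) for each direct eval()/exec() call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DANGEROUS_CALLS
        ):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)


# Illustrative input: one risky function and one harmless one.
SAMPLE = '''\
def calculate(expr):
    return eval(expr)

def greet(name):
    return "hello " + name
'''

hits = find_dangerous_calls(SAMPLE)
```

A signature check like this finds the call site; deciding whether `expr` can actually be reached from user input is where tracing data flow through the application, as described above, adds value.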
Interestingly, models produced higher-quality results when analyzing source code compared to black-box testing. Static analysis prompts generated fewer false positives and more accurate vulnerability descriptions. The models could trace data flow from user input through application logic, identifying sanitization gaps that black-box testing could not detect.
Where Do LLMs Completely Fail at Security Testing?
Multi-step exploit chains represent the most significant failure mode for all tested models. When a vulnerability required chaining three or more individual issues — for example, combining an information disclosure bug with an authentication bypass and a privilege escalation — success rates dropped to near zero. Models lost track of intermediate state and frequently suggested impossible exploitation paths.
Business logic vulnerabilities proved equally challenging. The test application included a race condition in a virtual wallet feature where concurrent transfers could double-spend credits. No model identified this vulnerability independently, even when given detailed application documentation. The issue required understanding transaction semantics and timing constraints that models could not reason about effectively.
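The double-spend pattern is a check-then-act race. The sketch below is not the test application's code; the `Wallet` class and its methods are hypothetical. It replays the bad interleaving deterministically (both requests pass the balance check before either debit runs) and contrasts it with a version that performs the check and the debit under one lock.

```python
import threading


class Wallet:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._lock = threading.Lock()

    # Unsafe design: the check and the debit are separate steps.
    def has_funds(self, amount: int) -> bool:
        return self.balance >= amount

    def debit(self, amount: int) -> None:
        self.balance -= amount

    # Safer design: check and debit happen atomically under one lock.
    def transfer_atomic(self, amount: int) -> bool:
        with self._lock:
            if self.balance < amount:
                return False
            self.balance -= amount
            return True


def simulate_double_spend() -> int:
    """Replay the race by hand: both checks run before either debit."""
    wallet = Wallet(100)
    request_a_ok = wallet.has_funds(100)
    request_b_ok = wallet.has_funds(100)  # still sees the old balance
    if request_a_ok:
        wallet.debit(100)
    if request_b_ok:
        wallet.debit(100)
    return wallet.balance


def run_concurrent_atomic(n_requests: int = 8) -> tuple[int, int]:
    """Fire concurrent transfers at the locked version; return (balance, successes)."""
    wallet = Wallet(100)
    results: list[bool] = []

    def worker() -> None:
        results.append(wallet.transfer_atomic(100))

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return wallet.balance, sum(results)
```

Nothing in the unsafe version looks wrong line by line, which may be part of why signature-driven analysis misses it. In a database-backed application the equivalent of the lock would typically be a transaction with row-level locking, such as `SELECT ... FOR UPDATE` in PostgreSQL.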
Context-dependent vulnerabilities also caused consistent failures. A path traversal issue hidden behind a custom URL rewriting mechanism went undetected by every model. The rewriting logic transformed user input in non-standard ways before passing it to file system operations. Models failed to recognize this pattern because it did not match any canonical vulnerability signature from training data.
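The article does not describe the actual rewriting rule, so the sketch below invents one purely for illustration: URLs use `:` as a path separator, which the server converts to `/` before touching the file system. The request never contains the literal `../` that a signature would look for, yet the rewritten path still escapes the base directory. `BASE_DIR` and all function names are hypothetical, and `posixpath` is used so the result is the same on any operating system.

```python
import posixpath

BASE_DIR = "/srv/app/files"


def rewrite(user_path: str) -> str:
    """Hypothetical rewrite rule: ':' acts as the path separator in URLs."""
    return user_path.replace(":", "/")


def resolve_unsafe(user_path: str) -> str:
    # Joins the rewritten input onto the base directory with no containment check.
    return posixpath.normpath(posixpath.join(BASE_DIR, rewrite(user_path)))


def resolve_safe(user_path: str) -> str:
    # Same rewrite, but the normalized result must stay inside BASE_DIR.
    candidate = posixpath.normpath(posixpath.join(BASE_DIR, rewrite(user_path)))
    if posixpath.commonpath([BASE_DIR, candidate]) != BASE_DIR:
        raise ValueError("path escapes the base directory")
    return candidate


attack = "..:..:..:etc:passwd"  # contains no literal '../'
escaped = resolve_unsafe(attack)
```

The fix validates the path after all transformations have been applied, not the raw input. A production version would also need to resolve symbolic links before the containment check.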
The most frustrating failures involved models confidently reporting false positives as critical vulnerabilities. Several models insisted that properly parameterized database queries were vulnerable to SQL injection. Others flagged standard CORS configurations as security issues. These false positives consumed significant testing budget because each required manual verification to rule out actual vulnerabilities — a reminder that LLM output demands human review.
How Much Does It Actually Cost to Run LLM Security Tests?
Running LLM-based security tests against a deliberately vulnerable application cost approximately $1,500 in API calls across multiple models and testing phases. This figure covers hundreds of individual prompts sent to models including GPT-4, Claude, and Gemini, each queried repeatedly with different attack vectors and follow-up requests. The bulk of the expense came from multi-turn conversations where models were asked to analyze responses, refine payloads, and chain findings together into coherent attack narratives.
The cost breaks down unevenly across testing stages. Initial reconnaissance and single-vulnerability identification together consumed roughly 32% of the total budget ($480). Exploitation attempts and payload generation accounted for another 28% ($420). The remaining 40% ($600) went to multi-turn chaining attempts, where the models were asked to combine individual findings into full attack paths. Those chaining conversations required the most tokens per interaction.
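One likely reason chaining conversations are token-hungry is that chat APIs are generally stateless, so a harness resends the whole conversation on every turn. The sketch below models that under a deliberately simple assumption: every message, from the user or the model, has the same length. The function name and message size are illustrative, and no real prices are assumed.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Total input tokens billed when every turn resends the full history.

    Assumption: turn k sends the k-th user message plus the 2 * (k - 1)
    earlier messages (previous user messages and model replies).
    """
    return sum(
        (2 * (k - 1) + 1) * tokens_per_message
        for k in range(1, turns + 1)
    )


short_chat = cumulative_input_tokens(turns=5, tokens_per_message=500)
long_chat = cumulative_input_tokens(turns=15, tokens_per_message=500)
```

Because the sum of the first n odd numbers is n squared, input tokens grow quadratically with conversation length in this model: a 15-turn conversation costs nine times as many input tokens as a 5-turn one, not three times.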
Is this expensive compared to alternatives? A single professional penetration test typically costs between $5,000 and $30,000 depending on scope. Traditional automated scanners like Burp Suite Professional cost around $449 per year for a license. So $1,500 sits in a middle ground — more expensive than off-the-shelf scanner software, but far cheaper than hiring a human penetration tester for even a single engagement.
The per-vulnerability cost tells an interesting story. Across all models tested, the LLMs identified approximately 15 distinct vulnerability instances. That works out to roughly $100 per finding. However, many of those were low-severity issues like missing HTTP headers or verbose error messages. Critical findings, such as authenticated SQL injection, cost significantly more to discover because they required extended multi-turn conversations and repeated prompt refinement.
Model pricing directly influenced testing strategy. GPT-4’s higher per-token cost meant fewer total prompts could be sent within budget. Claude offered a more favorable token-to-cost ratio, allowing for longer conversations and more iterative testing. Gemini, with its free tier allowances, enabled the highest volume of queries but produced less consistent results across complex exploitation attempts.
For teams considering similar tests, budget allocation should follow a 25/25/50 split: 25% on initial scanning and identification, 25% on individual exploitation, and 50% on chaining and complex multi-step attacks. The chaining phase consistently consumed the most resources while producing the fewest definitive results.
Here is a rough cost breakdown per testing phase:
| Testing Phase | Budget Spent | Prompts Sent | Vulnerabilities Found |
|---|---|---|---|
| Reconnaissance | $180 | ~120 | 4 (informational) |
| Single-Vuln Identification | $300 | ~200 | 6 (low to medium) |
| Exploitation Attempts | $420 | ~280 | 3 (medium to high) |
| Chaining Attacks | $600 | ~350 | 2 (high to critical) |
- Reconnaissance prompts were the cheapest per query but rarely produced critical findings
- Exploitation prompts consumed moderate budgets with variable success rates
- Chaining conversations burned through tokens rapidly with diminishing returns
- GPT-4 accounted for approximately 45% of total spend despite fewer total queries
- Claude represented about 35% of costs with more queries but lower per-prompt pricing
- Gemini made up the remaining 20%, supplemented by free-tier usage
- Retrying failed prompts added an estimated 15% overhead to the total budget
- Context window management affected costs, as longer conversations increased per-query token counts
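The per-phase figures in the table can be turned into budget shares and cost per finding with a few lines of arithmetic. The sketch below only restates the table's numbers; the variable and function names are illustrative.

```python
# (phase, budget in USD, vulnerabilities found), copied from the table above.
PHASES = [
    ("Reconnaissance", 180, 4),
    ("Single-Vuln Identification", 300, 6),
    ("Exploitation Attempts", 420, 3),
    ("Chaining Attacks", 600, 2),
]


def summarize(phases: list[tuple[str, int, int]]) -> tuple[int, int, list[dict]]:
    """Return total budget, total findings, and per-phase share and unit cost."""
    total_budget = sum(budget for _, budget, _ in phases)
    total_found = sum(found for _, _, found in phases)
    rows = [
        {
            "phase": name,
            "share_pct": round(100 * budget / total_budget),
            "cost_per_finding": budget / found,
        }
        for name, budget, found in phases
    ]
    return total_budget, total_found, rows


total_budget, total_found, rows = summarize(PHASES)
overall_cost_per_finding = total_budget / total_found
```

On these figures, chaining took 40% of the budget at $300 per finding, against $45 per finding for reconnaissance, and the overall average is $100 per finding.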
Can LLMs Chain Multiple Vulnerabilities Into a Full Attack Path?
LLMs cannot reliably chain multiple vulnerabilities into a complete attack path without significant human guidance and intervention. During testing, no model independently constructed a full kill chain from initial access to data exfiltration. The models could identify individual vulnerabilities and sometimes suggest theoretical connections between them, but executing a coherent multi-step attack remained beyond their autonomous capabilities.
The fundamental challenge lies in context management. Chaining vulnerabilities requires maintaining a mental model of the application’s architecture, understanding how one finding enables the next step, and adapting when an exploitation attempt fails. LLMs lose track of these connections as conversations grow longer. After roughly 10 to 15 exchanges, models began forgetting earlier findings or contradicting their own previous analysis.
Where did the models show promise? They occasionally identified logical connections between vulnerabilities that a human tester might overlook. For example, one model suggested using a stored cross-site scripting vulnerability to steal session tokens, then using those tokens to access an administrative panel where a separate file upload vulnerability existed. The theory was sound. The execution fell apart when the model failed to adapt its XSS payload after the initial injection point behaved differently than expected.
Human testers think in loops: attempt, observe, adjust, repeat. LLMs operate in a more linear fashion. Each prompt generates a response based on the conversation history, but the model does not truly “observe” the result of its suggested actions. A human penetration tester can see that a payload returned a 500 error and immediately adjust their approach. An LLM requires the human to feed that error back as text, interpret what went wrong, and then generate a new suggestion. This feedback loop adds latency and tokens.
The most successful chaining attempts occurred when a human guided the process, providing the model with structured observations at each step. With this scaffolding, models could suggest reasonable next actions about 60% of the time. Without guidance, that success rate dropped to under 20%. The models frequently suggested impossible attack paths, such as trying to exploit a vulnerability on an endpoint that did not exist or using credentials that had not been obtained.
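One way to provide that kind of scaffolding is to restate the confirmed findings and the latest observations in a compact, fixed format at the start of every prompt, so they do not depend on the model recalling earlier turns. The sketch below is an assumed design, not the format used in the tests; `Observation` and `render_state` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    step: int
    action: str  # what was attempted
    status: int  # HTTP status code observed
    note: str    # the tester's interpretation of the result


def render_state(
    confirmed: list[str],
    observations: list[Observation],
    keep_last: int = 3,
) -> str:
    """Build a compact state summary to prepend to each prompt.

    Confirmed findings are always restated; only the most recent
    observations are kept so the summary stays short.
    """
    lines = ["Confirmed findings:"]
    lines += [f"- {finding}" for finding in confirmed] or ["- (none)"]
    lines.append("Recent observations:")
    recent = observations[-keep_last:] if keep_last > 0 else []
    for obs in recent:
        lines.append(
            f"- step {obs.step}: {obs.action} -> HTTP {obs.status} ({obs.note})"
        )
    return "\n".join(lines)


observations = [
    Observation(1, "GET /comments", 200, "comment form present"),
    Observation(2, "POST /comments with script tag", 200, "payload stored"),
    Observation(3, "GET /admin with stolen token", 403, "token lacks admin role"),
    Observation(4, "GET /profile with stolen token", 200, "token is valid"),
]
summary = render_state(["stored XSS in /comments"], observations)
```

Keeping confirmed findings separate from raw observations is meant to address the two failure modes described above: forgetting earlier findings, and proposing steps that rely on credentials or endpoints that were never obtained.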
Partial chaining showed more promise. Models could sometimes connect two related vulnerabilities — for instance, using an information disclosure issue to extract data needed for a subsequent injection attack. These two-step chains worked roughly 40% of the time with human guidance. Chains requiring three or more steps succeeded less than 10% of the time, even with significant human assistance.
- Two-step vulnerability chains succeeded approximately 40% of the time with human guidance
- Three-step chains dropped to under 10% success rate even with assistance
- Models lost coherent context after 10-15 conversation exchanges
- Theoretical attack path suggestions were often sound but execution consistently failed
- No model autonomously completed a full attack chain from start to finish
- Human guidance improved chaining success rates by roughly 3x compared to autonomous attempts
- Models sometimes suggested creative attack paths that human testers had not initially considered
- Context window limitations directly correlated with chaining failure rates
What Are the Realistic Use Cases for LLM-Powered Security Testing?
The most realistic use case for LLMs in security testing is augmenting human testers rather than replacing them. Based on the testing conducted, LLMs perform best at triage, explanation, and initial analysis — tasks where their ability to process and generate natural language provides clear advantages over traditional tools. They are weakest at autonomous exploitation and multi-step reasoning, where human judgment remains essential.
Vulnerability explanation stands out as a particularly strong use case. When a traditional scanner identifies a potential SQL injection point, it typically provides a severity rating and a generic description. An LLM can analyze the specific context of that finding, explain exactly what an attacker could do with it, and suggest tailored remediation steps for the application’s particular technology stack. This contextual explanation adds genuine value for development teams who need to understand and fix findings.
Another practical application is test plan generation. Given a description of an application’s functionality, LLMs can produce reasonably comprehensive lists of potential attack surfaces and testing scenarios. These lists are not perfect — they include false suggestions and miss edge cases — but they provide a useful starting framework for a human tester. Think of it as an intelligent checklist generator that understands web application architecture.
Code review assistance represents a third viable use case. LLMs can analyze source code snippets for common vulnerability patterns with moderate accuracy. They catch obvious issues like unsanitized user input in database queries or hardcoded credentials. They struggle with complex business logic flaws that require understanding the application’s broader context. As a first-pass review tool before human analysis, they add measurable value.
Security documentation benefits from LLM capabilities as well. Generating vulnerability reports, writing remediation guidance, and creating test summaries are tasks where language models excel. A penetration tester can feed findings into an LLM and receive a well-structured report draft, saving hours of writing time. The tester still needs to review and correct the output, but the drafting process becomes significantly faster.
What about adversarial testing? LLMs can generate attack payloads and test cases, but with important caveats. They produce useful starting points for common vulnerability classes like XSS, SQL injection, and path traversal. However, their payloads often lack the sophistication needed to bypass modern security controls. A human tester typically needs to refine and customize LLM-generated payloads before they succeed against real applications.
- Vulnerability explanation and contextual analysis is the strongest current use case
- Test plan generation provides useful starting frameworks for human testers
- Code review assistance catches obvious patterns but misses complex logic flaws
- Report generation saves significant documentation time with human review required
- Payload generation offers starting points but rarely succeeds without human refinement
- LLMs should be positioned as force multipliers, not autonomous testing tools
- Integration with existing tools like Burp Suite or ZAP provides the most practical workflow
- Training teams on effective prompt engineering for security tasks is essential for ROI
How Do LLM Results Compare to Traditional Automated Scanners?
LLMs and traditional automated scanners serve fundamentally different purposes and produce complementary results rather than directly competing outputs. Traditional scanners like OWASP ZAP and Burp Suite excel at high-volume, pattern-based detection, reliably identifying known vulnerability signatures across thousands of endpoints in minutes. LLMs operate at a slower pace but can identify certain business logic flaws and contextual issues that pattern-matching scanners fundamentally cannot detect.
In terms of raw vulnerability counts, traditional scanners dominate. During testing, OWASP ZAP identified over 40 potential findings in an automated scan of the vulnerable application, compared to the LLMs’ 15 distinct vulnerability instances. However, the scanners’ results included numerous false positives — alerts for issues that did not actually exist or were mitigated by other controls. The LLMs produced fewer false positives but also missed several vulnerabilities that the scanners caught easily.
Speed represents the most dramatic difference. A complete OWASP ZAP scan of the test application finished in under 15 minutes. The LLM-based testing required approximately 40 hours of active prompting and analysis spread across multiple days. Even accounting for the manual nature of the LLM testing process, the time investment was orders of magnitude higher. For time-sensitive assessments, scanners remain the clear choice.
Where LLMs showed an advantage was in identifying vulnerabilities that require understanding context. For example, the test application contained a broken access control issue where a regular user could access administrative functions by modifying a URL parameter. Traditional scanners flagged the endpoint but could not determine whether the access was authorized. An LLM, given information about user roles and application behavior, correctly identified this as a privilege escalation vulnerability.
False positive rates tell an important story. Traditional scanners produced a false positive rate of approximately 35% on the test application, meaning roughly one in three reported findings required manual verification and dismissal. The LLMs had a lower false positive rate of around 15%, but their false negative rate — vulnerabilities they missed entirely — was significantly higher. Scanners caught nearly all common vulnerability patterns. LLMs missed several straightforward issues while occasionally catching subtle ones.
The ideal approach combines both tools. Run automated scanners first to establish a baseline of known vulnerabilities. Then use LLMs to analyze scanner output, investigate findings that require contextual understanding, and explore business logic areas where pattern matching falls short. This combined methodology leverages the strengths of each approach while mitigating their respective weaknesses.
- Traditional scanners reported 2-3x more findings than the LLMs (over 40 versus 15) in a fraction of the time
- Scanners complete assessments in minutes versus hours or days for LLM-based testing
- LLMs produce fewer false positives but significantly more false negatives
- Context-dependent flaws, such as broken access control, remain the primary area where LLMs outperform pattern-based scanners
- Combined scanner-plus-LLM methodology yields the most comprehensive results
- Cost per finding is lower with scanners due to speed and automation advantages
- LLMs add the most value during post-scan analysis and triage phases
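The scanner-first, LLM-second workflow can be sketched as a small triage step: deduplicate the scanner's alerts, rank them by risk, and build one prompt per alert with application context attached. The alert dictionaries below use a made-up shape with `rule`, `url`, and `risk` keys. This is not the report schema of OWASP ZAP or Burp Suite, and the prompt wording is only an example.

```python
RISK_ORDER = {"high": 0, "medium": 1, "low": 2, "info": 3}


def select_for_review(alerts: list[dict], max_items: int = 5) -> list[dict]:
    """Deduplicate scanner alerts and pick the riskiest ones for LLM triage."""
    unique: dict[tuple[str, str], dict] = {}
    for alert in alerts:
        key = (alert["rule"], alert["url"])
        unique.setdefault(key, alert)  # keep the first alert per rule/URL pair
    ranked = sorted(
        unique.values(),
        key=lambda a: RISK_ORDER.get(a["risk"], len(RISK_ORDER)),
    )
    return ranked[:max_items]


def build_triage_prompt(alert: dict, app_context: str) -> str:
    """Combine one scanner alert with application context for the model."""
    return (
        f"Scanner rule: {alert['rule']}\n"
        f"URL: {alert['url']}\n"
        f"Reported risk: {alert['risk']}\n"
        f"Application context: {app_context}\n"
        "Question: is this likely a true positive, and why?"
    )


# Illustrative alerts, including one duplicate.
alerts = [
    {"rule": "Missing security header", "url": "/", "risk": "low"},
    {"rule": "SQL injection", "url": "/search", "risk": "high"},
    {"rule": "SQL injection", "url": "/search", "risk": "high"},
    {"rule": "Admin endpoint reachable", "url": "/admin?role=user", "risk": "medium"},
]
selected = select_for_review(alerts, max_items=2)
prompts = [
    build_triage_prompt(a, "Regular users must not reach /admin.")
    for a in selected
]
```

Capping the number of alerts sent for review keeps API spend predictable, and supplying role and behavior information in the prompt is what allows the model to judge cases like the access control example above.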
Should Security Teams Start Integrating LLMs Into Their Workflow?
Security teams should begin experimenting with LLM integration but avoid relying on these tools for critical testing tasks without extensive validation. The technology shows genuine promise for specific use cases like vulnerability explanation, report generation, and code review assistance. However, the current limitations in chaining, autonomous exploitation, and context management mean LLMs are not ready to replace existing tools or human expertise.
Start with low-risk, high-value integrations. Using LLMs to draft vulnerability reports, explain findings to development teams, or generate initial test plans carries minimal risk while providing measurable productivity gains. These tasks leverage the models’ language strengths without depending on their technical accuracy for security-critical decisions. A wrong word in a report is easily corrected. A missed vulnerability in a production system is not.
Budget considerations matter for integration decisions. At current pricing, extensive LLM-based testing costs more than traditional automated scanning tools but less than human penetration testing. Teams should view LLM costs as supplementary rather than replacement spending. The $1,500 spent during this testing exercise could have purchased roughly three years of Burp Suite Professional licenses, or covered between 5% and 30% of a professional penetration test depending on its scope.
Training represents a critical success factor. Effective LLM usage for security tasks requires prompt engineering skills specific to the security domain. Generic prompts produce generic results. Security teams need to develop internal expertise in crafting prompts that elicit useful vulnerability analysis, exploitation suggestions, and remediation guidance. This learning curve takes weeks to months of practice.
Data privacy and confidentiality concerns cannot be ignored. Sending application source code, HTTP responses, or vulnerability details to external LLM APIs means exposing sensitive information to third-party services. Security teams must evaluate whether their organization’s data handling policies permit this exposure. Self-hosted models offer an alternative but require significant infrastructure investment and typically produce lower-quality results than frontier models like GPT-4 or Claude.
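A partial mitigation is to redact obvious secrets before any text leaves the organization. The sketch below applies a few regular expressions for JSON Web Tokens, bearer tokens, and email addresses. The patterns and placeholder strings are illustrative assumptions, they will miss many kinds of sensitive data, and they do not replace a policy review.

```python
import re

REDACTIONS = [
    # JSON Web Tokens: three base64url segments, the first starting with "eyJ".
    (
        re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
        "[REDACTED_JWT]",
    ),
    # Authorization headers carrying bearer tokens.
    (
        re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
        r"\1[REDACTED_TOKEN]",
    ),
    # Email addresses.
    (
        re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
        "[REDACTED_EMAIL]",
    ),
]


def redact(text: str) -> str:
    """Apply each redaction pattern in order and return the scrubbed text."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


raw = (
    "Authorization: Bearer abc123\n"
    "user=alice@example.com "
    "token=eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.c2lnbmF0dXJl"
)
clean = redact(raw)
```

Redaction of this kind reduces accidental leakage of credentials in HTTP captures, but source code and vulnerability details are sensitive in themselves, so the policy question raised above still applies.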
The competitive landscape is evolving rapidly. As companies pull back from aggressive AI adoption — with major tech firms reportedly losing hundreds of billions on AI investments according to BitHub.pl — the security tooling market will likely see consolidation and more focused products. Security teams should prefer specialized tools that integrate LLM capabilities into existing workflows over standalone LLM-based testing platforms.
- Begin with low-risk use cases like report generation and vulnerability explanation
- Allocate supplementary budget for LLM API costs rather than replacing existing tool spend
- Invest in prompt engineering training specific to security testing tasks
- Evaluate data privacy implications before sending application data to external APIs
- Prefer integrated tools over standalone LLM testing platforms
- Validate all LLM findings with traditional tools and human verification
- Monitor the evolving market as AI hype cycles mature and consolidate
Frequently Asked Questions
Can an LLM independently hack a web application without human guidance?
No LLM tested could independently compromise a web application without human guidance. The models required human intervention at every exploitation stage: interpreting results, adjusting failed payloads, and deciding which attack path to pursue next. In testing, chains requiring three or more steps succeeded less than 10% of the time even with human assistance, and no model completed a full attack chain autonomously. The models function better as intelligent assistants than autonomous attackers.
Which LLM performed best at identifying security vulnerabilities?
Claude demonstrated the most consistent vulnerability identification performance, particularly for contextual analysis and business logic flaws. GPT-4 produced higher-quality exploitation payloads when given specific guidance, but its higher per-token cost limited testing volume. Gemini offered the most queries per dollar but showed less reliability across complex testing scenarios. No single model dominated all testing categories.
Is it legal to use LLMs for security testing on your own applications?
Using LLMs for security testing on applications you own or have explicit authorization to test is generally legal in most jurisdictions. However, sending application source code, HTTP responses, or server logs to external LLM APIs may violate data protection regulations or your organization’s security policies. Florida’s recent lawsuit against OpenAI regarding ChatGPT’s potential harms to minors, as reported by RMF 24, highlights the growing legal scrutiny around AI tools and their usage patterns.
How much API budget should you allocate for LLM-based security testing?
Based on the $1,500 spent during this testing exercise, teams should budget between $500 and $2,000 for initial LLM security testing of a single application. The lower end covers basic vulnerability identification and explanation tasks. The higher end supports extensive exploitation attempts and chaining tests. For ongoing integration into regular security workflows, expect recurring monthly costs of $100 to $300 depending on testing frequency and application complexity.
Summary
After spending $1,500 testing LLMs against a deliberately vulnerable application, several clear conclusions emerge. First, LLMs are not ready to replace human penetration testers or traditional automated scanners. They cannot reliably chain vulnerabilities into complete attack paths, and their autonomous exploitation capabilities remain limited. Second, LLMs excel at specific augmentation tasks — vulnerability explanation, report generation, code review assistance, and contextual analysis where understanding application behavior matters more than pattern matching.
Third, cost-effectiveness depends heavily on use case. For high-volume vulnerability scanning, traditional tools remain far cheaper and faster. For nuanced analysis and documentation, LLMs provide measurable productivity gains that justify their API costs. Fourth, the technology is evolving rapidly, and today’s limitations may become tomorrow’s strengths. Security teams should begin building LLM expertise now through low-risk experiments rather than waiting for the technology to mature in isolation.
The practical recommendation is straightforward: integrate LLMs as supplementary tools alongside existing scanners and human expertise. Use them for what they do well — processing language, generating explanations, and providing creative suggestions. Validate their output rigorously. Budget for API costs as supplementary spending. And invest in prompt engineering skills specific to security testing. The teams that develop this combined approach today will be best positioned as LLM capabilities continue to improve.
If you want to read the first half of this analysis — covering the vulnerable application design, testing methodology, and initial vulnerability identification results — check out Part 1 of this series.