AI-generated code has 2.7x more vulnerabilities than human-written code, and even the latest 2026 models only pass security checks 55% of the time. Worse, LLMs used for code review suffer from confirmation bias -- framing a malicious PR as a "security improvement" bypasses detection up to 88% of the time. You need SAST in every pipeline, humans reviewing security-critical code, pinned dependency versions (not >=latest), and a healthy distrust of anything an AI tells you is safe. The models write functional code. They don't write secure code. Those are different things.
Every developer I talk to is using AI to write code. Copilot, ChatGPT, Claude, Cursor, Amazon Q -- pick your flavor. And I get it. The productivity gains are real. You describe what you want, and working code appears in seconds. It feels like a superpower.
But here's what most people aren't paying attention to: that code has problems. Real, measurable, exploitable problems.
The numbers aren't great
Veracode's 2025 GenAI Code Security Report tested over 100 LLMs across Java, JavaScript, Python, and C#. The headline finding: AI-generated code contains 2.74x more vulnerabilities than code written by humans. That's not a rounding error. That's almost three times the attack surface, generated at machine speed.
It gets worse. Design flaws (think authentication bypass patterns, insecure direct object references, broken session management) increased 153%. Privilege escalation paths went up 322%. Secrets exposure climbed 40% because models learned from millions of public repos where developers hardcoded credentials, and now they reproduce that pattern like it's normal.
Then in March 2026, Armis Labs benchmarked 18 leading AI models across 31 test scenarios. Every single model failed to consistently generate secure code. The best performers still produced vulnerable output in over 30% of tests. Memory buffer overflows, insecure file uploads, broken authentication -- the classics, generated fresh and fast.
The security ceiling isn't moving
Here's the part that should really concern people. Veracode's Spring 2026 update tested over 150 LLMs (including GPT-5.1, GPT-5.2, Gemini 3, Claude 4.5 and 4.6) and found that while syntax correctness has climbed to 95%, the security pass rate is stuck at 55%. It's been hovering between 45-55% for two years now, regardless of model generation or size.
Read that again: models have gotten dramatically better at writing code that compiles. They have NOT gotten better at writing code that's safe. The gap between "works" and "works securely" is widening, not closing.
The language breakdown is interesting too. Python leads at 62% security pass rate. C# sits at 58%, JavaScript at 57%. Java? 29%. Less than a third of AI-generated Java code passes security checks. Veracode's hypothesis: models are over-trained on legacy Java patterns from millions of public repos that predate modern security frameworks. When you ask for Java, you get 2010-era patterns with 2010-era vulnerabilities.
The vulnerability-specific numbers tell an even clearer story. SQL injection prevention? 82% pass rate. Cryptographic algorithm selection? 86%. Cross-site scripting? 15%. Log injection? 13%. Models have learned the "big" security patterns that get talked about constantly. The subtler ones? They're essentially guessing.
Developers trust it more than they should
A Stanford study found something that should worry every security team: developers using AI assistants wrote less secure code while being MORE confident in its security. The developers with the least secure code rated their trust in AI at 4 out of 5. The ones writing the most secure code? 1.5 out of 5.
That's the worst possible combination. Overconfidence plus under-competence. The people who trust the tool the most are producing the worst results, and they don't know it.
I see this in my own students. They'll paste AI-generated code into an assignment, and when I ask them to explain what a specific function does, they can't. They didn't write it and they didn't read it carefully enough to catch the SQL injection sitting right there in plain sight.
The supply chain angle is scarier
Direct code generation is one problem. Supply chain poisoning is a different beast.
Researchers at USENIX Security 2024 showed that contaminating just 0.2% of a model's training data (160 files out of 80,000) could embed backdoors that passed every standard detection tool. Anthropic's own research demonstrated that models could be trained to write clean code under normal conditions but inject vulnerabilities when triggered by specific signals. Those backdoors survived standard safety training.
In July 2025, someone exploited a flaw in Amazon Q Developer's build process and injected malicious instructions into the official VS Code extension. It reached over 964,000 installations. The payload was supposed to wipe systems and delete cloud resources. The only reason it didn't cause damage? A syntax error in the attacker's code. We got lucky.
AI code review has its own blind spots
Some teams respond to the "AI writes insecure code" problem by adding more AI: using LLMs to review the code that other LLMs generated. Sounds reasonable. It's not enough.
A recent paper from researchers at the University of Athens and Athens University of Economics studied confirmation bias in LLM-based code review. They tested four models across 250 real CVE vulnerability-patch pairs and found that simply framing a code change as "bug-free" in the commit message reduced vulnerability detection rates by 16 to 93 percentage points. The bias was asymmetric too -- false negatives (missed vulnerabilities) spiked while false positive rates barely changed. The model was more likely to miss real bugs, not less likely to flag clean code.
The practical implications are worse. When they simulated adversarial pull requests -- reverting known security fixes while describing them as "security improvements" or "urgent functionality fixes" -- the attack succeeded 35% of the time against GitHub Copilot's interactive review. Against Claude Code running as an autonomous review agent? 88% success rate. The agent would approve reverted security fixes while explaining that the change "removes unnecessary defensive overhead while maintaining security guarantees."
Think about that. An attacker can reintroduce a known vulnerability, describe it as a security improvement in the PR description, and have an AI reviewer not only approve it but provide a confident justification for why it's safe. That's not a theoretical risk. That's a reproducible attack.
The good news: debiasing works. Stripping PR metadata and explicitly instructing the model to ignore framing restored detection in 100% of interactive cases and 94% of autonomous ones. But most teams aren't doing that. Most teams are running default configurations and trusting the output.
You still need humans in the loop
I'm not saying stop using AI for code. That ship has sailed, and honestly, AI is a legitimate productivity multiplier when used correctly. But "correctly" means treating every line of AI output the way you'd treat a pull request from a junior developer you've never worked with before: review it, question it, test it.
The problem is that most teams aren't doing that. They're accepting AI suggestions the same way they accept autocomplete, without thinking.
Human review catches the things SAST tools miss: architectural flaws, business logic errors, authentication patterns that technically pass a lint check but create real-world vulnerabilities. A scanner won't flag that your AI just implemented "authentication" by checking if a user-supplied token matches a hardcoded string. A human will.
Pin your dependencies. Stop chasing latest.
This one is basic hygiene, but AI-assisted development is making it worse. When you ask an AI to add a library to your project, it typically suggests the latest version or no version at all. pip install requests with no version pin. npm install express pulling whatever's newest today. That's a supply chain risk waiting to happen.
Case in point: the TeamPCP supply chain attack that hit the security community literally last week (March 19-24, 2026). Attackers compromised Trivy -- Aqua Security's widely-used vulnerability scanner -- and cascaded the attack across GitHub Actions, npm, PyPI, Docker Hub, and the VS Code extension marketplace. They injected credential-stealing malware into trusted security tools that harvested SSH keys, cloud credentials, and Kubernetes tokens, all while keeping the tools functional so nobody noticed immediately.
The attack spread to Checkmarx's GitHub Actions, then to LiteLLM on PyPI (97 million monthly downloads, used by organizations like NASA, Netflix, and Stripe). Malicious releases 1.82.7 and 1.82.8 shipped with persistence mechanisms and exfiltration capabilities. If your project had litellm>=1.80.0 in its requirements and your CI pipeline ran pip install --upgrade, you pulled the compromised version automatically.
This is why version pinning matters. If you're a Python developer, put exact version numbers in your requirements.txt. Not requests>=2.28, but requests==2.31.0. Same for package-lock.json in Node projects. Lock files exist for a reason -- use them and commit them.
The broader principle: don't adopt new versions the day they drop. Let things shake out for a week or two. Let other people find the problems. I know it's tempting to always run the latest and greatest, especially when AI tools suggest the newest version, but there's real value in running one version behind on dependencies that aren't security-critical. The teams that got burned by the LiteLLM compromise were the ones auto-upgrading. The teams pinned to 1.82.6 were fine.
Practical steps that cost you almost nothing:
- Pin exact versions in requirements files.
==not>=. - Commit your lock files.
package-lock.json,poetry.lock,Pipfile.lock-- all of them. - Use hash verification where possible (
pip install --require-hashes). - Review dependency updates intentionally. Run Dependabot or Renovate, but read the changelogs before merging. Don't auto-merge.
- Wait a beat on major releases. A week of patience can save you from being an early adopter of a compromised package.
SAST in the pipeline is non-negotiable
That said, human review alone doesn't scale. You need automated scanning at every stage of the pipeline:
- Pre-commit: Run secrets scanning (Gitleaks, TruffleHog, git-secrets) before anything gets pushed. This catches the 40% increase in hardcoded credentials AI loves to generate.
- CI/CD pipeline: SAST tools like CodeQL, Semgrep, or SonarQube should run on every pull request. Block merges that fail security checks. No exceptions.
- Dependency scanning: Use Dependabot or Snyk to catch vulnerable dependencies. AI models suggest outdated packages all the time because their training data isn't current.
- Infrastructure as code: If AI generates your Terraform or CloudFormation, scan it. Models love to create overly permissive firewall rules and unencrypted storage buckets as "defaults."
The key: these checks need to run automatically, on every commit, with no way to skip them. If a developer can bypass the security gate, some developer will bypass the security gate.
What actually works
After working with teams dealing with this in real environments, here's what I've seen make a difference:
- Treat AI output as untrusted input. Same as you'd treat user input in a web form. Validate, sanitize, verify. Every time.
- Mandatory security training that covers AI-specific risks. Your developers need to understand prompt injection, hallucinated dependencies, and the tendency for models to reproduce insecure patterns from training data.
- Mandatory code review for anything touching auth, access control, or data handling. Whether AI wrote it or a human did. These areas are too critical for "looks good to me" approvals.
- Track what's AI-generated. You can't manage risk you can't see. Some teams tag AI-generated code in commit messages or use tooling that flags it automatically.
- Run red team exercises against AI-generated code specifically. The failure patterns are different from human-written bugs. Test for them intentionally.
The bottom line
AI code generation is here to stay. It's already in your pipeline whether you planned for it or not. The question isn't whether to use it. The question is whether your security controls have caught up to the speed at which you're now shipping potentially vulnerable code.
For most teams, the honest answer is no. The tools write code faster than your review process can evaluate it, and that gap is where vulnerabilities live.
Close the gap. Automate the scanning. Keep humans in the loop. And stop assuming the AI knows what it's doing, because the data says it doesn't.
← Back to Blog