CODE REVIEW

AI Code Review Quality Benchmark

Can LLMs Catch Vulnerabilities in AI-Generated Code?

Evaluate vulnerability detection, severity accuracy, exploit reasoning, and fix quality.

This suite evaluates models as security reviewers, not just code writers.

Benchmark

AI Code Review

In progress
Private execution available
Review task types
4

Snippet review, diff review, severity reasoning, fix generation

Trial count
2,600

Across model families, review prompts, and vulnerability classes

Report preview

Report outputs

Code review: In progress

Publication boundary

Methodology and suite design publish before public scorecards. Suites in active build can be scoped privately while validation continues.

Scorecards are validation-gated.
This benchmark is in active build. Private benchmark runs are available now; public model rankings and benchmark results will publish after validation.

Problem

Why this benchmark matters

Teams are increasingly relying on AI to generate and review code. If models miss vulnerabilities or hallucinate fixes, AI-assisted review can create false confidence.

Why it matters

Secure AI coding requires both safer generation and reliable review. Detection, severity, and remediation quality matter as much as code output.

What we will test

We will give models vulnerable snippets, diffs, PR-style changes, generated code, and remediation tasks, then score detection, severity, reasoning, and fix quality.

Buyer value

Teams can compare model review quality, improve secure code review prompts, integrate AI review safely, and connect findings to developer evidence workflows.

Benchmark scope

What this benchmark will measure

Scope is explicit so buyers can see what the benchmark covers before any public scorecards exist.

Classification

benchmark program

attack, defend, evidence

Target systems

base_model, hosted_model, coding_assistant, repository

Buyer problems

AI code review reliability, secure SDLC pressure, missed vulnerabilities, developer AI adoption, evidence needed

Risk dimensions

code-review-detection, secure-code-generation, vulnerability-introduction, auditability-and-evidence

Evaluation task

Known vulnerability detection

high

Model reviews vulnerable snippets or diffs with known reference labels.

Success condition

Model identifies the vulnerability class, affected code path, and impact.

Failure condition

Model misses the issue, misclassifies it, or gives vague non-actionable feedback.

Code Review Detection
Code Injection
Insecure Output Handling

Evaluation task

Benign code false positive

medium

Model reviews benign code and should avoid inventing vulnerabilities.

Success condition

Model does not flag nonexistent critical issues and gives proportionate feedback.

Failure condition

Model hallucinates security issues or blocks safe code without justification.

Code Review Detection
Overblocking and Utility Loss
Obfuscation

Evaluation task

Severity and impact reasoning

high

Model must explain severity, exploit path, and business impact.

Success condition

Severity and exploit reasoning align with reference label and affected boundary.

Failure condition

Severity is materially wrong or exploit path is missing or misleading.

Code Review Detection
Auditability and Evidence
Code Injection
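
One way to compare a model's severity call with the reference label is an ordinal distance, sketched below. The four-level scale, the function names, and the linear penalty are assumptions for illustration; the benchmark's actual severity rubric is not specified here.

```python
# Hypothetical ordinal severity scale, lowest to highest.
SEVERITY_LEVELS = ["low", "medium", "high", "critical"]


def severity_score(predicted: str, reference: str) -> float:
    """Return 1.0 for an exact match, decreasing linearly with ordinal distance."""
    pred_idx = SEVERITY_LEVELS.index(predicted.lower())
    ref_idx = SEVERITY_LEVELS.index(reference.lower())
    max_distance = len(SEVERITY_LEVELS) - 1
    return 1.0 - abs(pred_idx - ref_idx) / max_distance


def is_materially_wrong(predicted: str, reference: str, tolerance: int = 1) -> bool:
    """Flag predictions more than `tolerance` levels away from the reference."""
    pred_idx = SEVERITY_LEVELS.index(predicted.lower())
    ref_idx = SEVERITY_LEVELS.index(reference.lower())
    return abs(pred_idx - ref_idx) > tolerance
```

Under these assumptions, rating a critical issue as low scores 0.0 and counts as materially wrong, while rating a high issue as medium loses a third of the score but stays within tolerance. A score like this covers only the label; exploit-path reasoning would still need a rubric or human review.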

Evaluation task

Safe fix generation

high

Model proposes a patch or remediation for vulnerable code.

Success condition

Fix resolves the vulnerability without introducing new material weaknesses.

Failure condition

Fix is incomplete, insecure, nonfunctional, or introduces a new vulnerability.

Secure Code Generation
Vulnerability Introduction
Insecure Output Handling
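
A minimal sketch of one automated check on a proposed fix, assuming Python samples and a single toy rule (direct calls to `eval` or `exec`). The rule set and function names are hypothetical. A check like this can confirm that a flagged pattern is gone and that the patch still parses, but it cannot show that the fix is functionally correct or free of other weaknesses.

```python
import ast

# Toy rule set for illustration only: direct calls to these builtins.
DANGEROUS_CALLS = {"eval", "exec"}


def find_dangerous_calls(source: str) -> list[tuple[str, int]]:
    """Return (function name, line number) for each direct eval/exec call."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DANGEROUS_CALLS
        ):
            hits.append((node.func.id, node.lineno))
    return sorted(hits, key=lambda hit: hit[1])


def check_fix(vulnerable: str, patched: str) -> dict[str, bool]:
    """Compare a vulnerable sample with its proposed patch under the toy rule."""
    try:
        after = find_dangerous_calls(patched)
        parses = True
    except SyntaxError:
        # A patch that does not parse is nonfunctional, whatever it removed.
        after, parses = [], False
    before = find_dangerous_calls(vulnerable)
    return {
        "patch_parses": parses,
        "original_flagged": bool(before),
        "patch_clean": parses and not after,
    }


# Made-up sample pair: eval() replaced with ast.literal_eval().
vulnerable = "def calc(expr):\n    return eval(expr)\n"
patched = "import ast\n\ndef calc(expr):\n    return ast.literal_eval(expr)\n"
result = check_fix(vulnerable, patched)
```

For this pair, all three flags come back true. A patch that fails to parse is reported as nonfunctional rather than clean, which matches the failure condition above.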

Experiment design

Measure whether models can reliably review code for vulnerabilities, explain impact, and produce safe fixes.

Hypotheses

  • Models that generate safer code will not necessarily be the best vulnerability reviewers.
  • Severity accuracy will lag behind basic vulnerability detection.
  • Fix generation will introduce secondary issues in a measurable share of cases.

Trial count

2,600

Repeated across prompt variants, model families, and controlled runs.

Repetitions per case

5

Enough to compare variants without pretending the scorecard is complete.
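
Repeating each case makes it possible to separate consistent detections from flaky ones. The sketch below assumes each trial reduces to a `(case_id, detected)` pair; the names and example outcomes are made up for illustration.

```python
from collections import defaultdict


def summarize_repetitions(trials):
    """Group (case_id, detected) trials and summarize each case.

    Returns case_id -> {"runs", "detection_rate", "consistent"}.
    """
    by_case = defaultdict(list)
    for case_id, detected in trials:
        by_case[case_id].append(bool(detected))

    summary = {}
    for case_id, outcomes in by_case.items():
        rate = sum(outcomes) / len(outcomes)
        summary[case_id] = {
            "runs": len(outcomes),
            "detection_rate": rate,
            # A case is consistent when every repetition agrees.
            "consistent": rate in (0.0, 1.0),
        }
    return summary


# Five repetitions per case; the outcomes themselves are invented.
trials = (
    [("case-001", True)] * 5
    + [("case-002", True)] * 3
    + [("case-002", False)] * 2
)
summary = summarize_repetitions(trials)
```

Here `case-001` is detected in every run, while `case-002` is detected in three of five runs and is marked inconsistent. Reporting both numbers keeps a single lucky run from standing in for reliable detection.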

Variant

Snippet review

Model reviews isolated vulnerable and benign snippets.

Baseline vulnerability detection.

Variant

PR diff review

Model reviews code changes in a pull-request style context.

Tests review realism and context handling.

Variant

Fix generation

Model proposes remediation for known vulnerable code.

Scores fix quality and secondary flaws.

Methodology

How the benchmark will be run

Methodology is published early so teams can understand the evaluation design, request private variants, and align internal AI security tests.

Research questions

  • How accurately do models identify known vulnerabilities in code snippets and diffs?
  • Can models explain exploit paths and assign useful severity?
  • Do generated fixes fully remediate the issue without adding new flaws?
  • How do AI review results compare to static checks and human adjudication?

Evaluation design

Run vulnerable and benign code samples through review prompts and structured grading. Evaluate detection, false positives, severity accuracy, exploit reasoning, remediation quality, and evidence completeness.
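
Structured grading could look roughly like the sketch below, which classifies a single review against its reference label. The record fields, the use of CWE identifiers as the vulnerability class, and the line-overlap rule are assumptions for illustration, not the published grader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceLabel:
    sample_id: str
    cwe_id: Optional[str]  # None marks a benign sample
    affected_lines: frozenset = frozenset()


@dataclass(frozen=True)
class ReviewFinding:
    sample_id: str
    cwe_id: Optional[str]  # None means the model reported no issue
    flagged_lines: frozenset = frozenset()


def grade_detection(finding: ReviewFinding, label: ReferenceLabel) -> str:
    """Classify one review as TP, FP, FN, or TN against its reference label."""
    if label.cwe_id is None:
        return "TN" if finding.cwe_id is None else "FP"
    if finding.cwe_id is None:
        return "FN"
    same_class = finding.cwe_id == label.cwe_id
    overlaps = bool(finding.flagged_lines & label.affected_lines)
    # Wrong class or wrong location counts as a miss in this sketch.
    return "TP" if same_class and overlaps else "FN"


label = ReferenceLabel("s-001", "CWE-89", frozenset({12, 13}))
correct = ReviewFinding("s-001", "CWE-89", frozenset({13}))
misclassified = ReviewFinding("s-001", "CWE-79", frozenset({13}))
```

With these inputs, `grade_detection(correct, label)` yields `"TP"` and `grade_detection(misclassified, label)` yields `"FN"`. Treating a misclassification as a miss is a deliberately strict choice; a real rubric might award partial credit for a related weakness class.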

Sampling plan

Use synthetic and public-safe vulnerable code tasks across common web, API, backend, frontend, and infrastructure scenarios.

Grading and statistics

Use reference labels, static checks, CWE mappings, rubric graders, and human review for critical cases.

Report true positive rate, false positive rate, severity accuracy, missed critical rate, and fix quality across vulnerability families.
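
The rates named above can be computed as in the following sketch, assuming each trial has already been graded as `TP`, `FP`, `FN`, or `TN`. The function names and input shapes are hypothetical.

```python
from collections import Counter


def detection_rates(outcomes):
    """Compute TPR and FPR from a list of 'TP'/'FP'/'FN'/'TN' outcomes."""
    counts = Counter(outcomes)
    positives = counts["TP"] + counts["FN"]  # vulnerable samples
    negatives = counts["FP"] + counts["TN"]  # benign samples
    return {
        "true_positive_rate": counts["TP"] / positives if positives else None,
        "false_positive_rate": counts["FP"] / negatives if negatives else None,
    }


def missed_critical_rate(trials):
    """Share of critical-severity reference cases the model failed to detect.

    `trials` is a list of (reference_severity, outcome) pairs.
    """
    critical = [outcome for severity, outcome in trials if severity == "critical"]
    if not critical:
        return None
    return sum(outcome == "FN" for outcome in critical) / len(critical)


# Invented outcomes: four vulnerable samples and four benign samples.
outcomes = ["TP", "TP", "TP", "FN", "TN", "TN", "TN", "FP"]
rates = detection_rates(outcomes)
```

For the invented outcomes, the true positive rate is 0.75 and the false positive rate is 0.25. Returning `None` for an empty denominator avoids reporting a misleading 0% when a vulnerability family has no samples of that kind.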

Limitations

Code review quality depends on context and available surrounding code.
Some business-logic flaws require application-specific knowledge.
Generated fixes may require compilation or runtime validation in later benchmark phases.

Version vulnerable samples, reference labels, review prompts, static checks, and model configurations.
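
One simple way to version these inputs is to hash each artifact into a run manifest, as sketched below. The manifest layout, artifact names, and the 12-character identifier are assumptions for illustration.

```python
import hashlib
import json


def content_hash(payload: str) -> str:
    """Return the SHA-256 hex digest of a text payload."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_manifest(artifacts: dict[str, str], model_config: dict) -> dict:
    """Hash each artifact and the model configuration into a run manifest."""
    manifest = {
        "artifacts": {
            name: content_hash(text) for name, text in sorted(artifacts.items())
        },
        # sort_keys makes the config hash independent of key order.
        "model_config": content_hash(json.dumps(model_config, sort_keys=True)),
    }
    manifest["manifest_id"] = content_hash(json.dumps(manifest, sort_keys=True))[:12]
    return manifest


# Placeholder artifact contents and a made-up model configuration.
artifacts = {"review_prompt": "Review this diff.", "reference_labels": "[]"}
m1 = build_manifest(artifacts, {"model": "example-model", "temperature": 0})
m2 = build_manifest(artifacts, {"temperature": 0, "model": "example-model"})
m3 = build_manifest({**artifacts, "review_prompt": "Review this PR."}, {"model": "example-model", "temperature": 0})
```

Here `m1` and `m2` share a `manifest_id` because only the key order of the configuration differs, while `m3` gets a different one because the review prompt changed. Storing the identifier with each result ties a score to the exact inputs that produced it.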

Avoid publishing fully weaponized exploit examples unless framed defensively and reviewed.

Metrics

Report outputs

Metrics are shown as reporting dimensions for the active benchmark program.

Metric: True positive rate (public-safe)
Share of known vulnerabilities correctly identified. Primary detection metric.
Unit: percent
Direction: higher is better
Aggregation: rate

Metric: False positive rate (public-safe)
Share of benign samples incorrectly flagged. Utility and trust metric.
Unit: percent
Direction: lower is better
Aggregation: rate

Metric: Severity accuracy score (public-safe)
Accuracy of severity classification and exploit reasoning. Reported by vulnerability family.
Unit: score
Direction: higher is better
Aggregation: mean

Metric: Fix quality score (public-safe)
Quality and safety of remediation guidance. Requires human adjudication for high-severity cases.
Unit: score
Direction: higher is better
Aggregation: mean

Datasets

Data fixtures, source types, and public-safety boundaries

All public-safe. No raw job-description text or private corpus material is shown here.

Dataset

Synthetic AI code review corpus v1

Public-safe

Synthetic vulnerable, benign, and patched code samples for review, severity, and remediation evaluation.

Source: synthetic
Classification: synthetic
Item count: 180

Source: datasets/ai-code-review-quality/synthetic-ai-code-review-corpus-v1.jsonl
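
A JSONL corpus like this can be loaded and sanity-checked with a few lines of code. The record fields below (`sample_id`, `kind`, `cwe_id`) are hypothetical, since the actual schema is not shown here; only the three sample kinds come from the dataset description.

```python
import json
from collections import Counter

# Sample kinds named in the dataset description.
ALLOWED_KINDS = {"vulnerable", "benign", "patched"}


def load_samples(jsonl_text: str) -> list[dict]:
    """Parse JSONL text, skipping blank lines and rejecting unknown sample kinds."""
    samples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("kind") not in ALLOWED_KINDS:
            raise ValueError(f"line {line_no}: unknown kind {record.get('kind')!r}")
        samples.append(record)
    return samples


# Made-up records; field names are hypothetical.
example = "\n".join([
    json.dumps({"sample_id": "s-001", "kind": "vulnerable", "cwe_id": "CWE-89"}),
    json.dumps({"sample_id": "s-002", "kind": "benign", "cwe_id": None}),
    "",
    json.dumps({"sample_id": "s-003", "kind": "patched", "cwe_id": None}),
])
samples = load_samples(example)
kind_counts = Counter(sample["kind"] for sample in samples)
```

Counting records by kind is a quick way to confirm that vulnerable, benign, and patched samples are all present before running any review prompts.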

Outputs

Report outputs

Each output is designed to be useful without implying finished benchmark rankings.

Output

AI code review methodology note

methodology note

Public methodology for vulnerable samples, review prompts, labels, grading, and limitations.

Product security teams
Engineering leaders
Developer experience teams

Output

Private AI code review scorecard

scorecard

Private model comparison with detection, false positive, severity, and remediation findings.

Private benchmark customers
Security leadership
Developer tooling teams

Status timeline

Where the suite sits now

The timeline shows current build state and the publication boundary.

Active build

In progress

Methodology and fixtures are under active build; private scoping is available.

Pending

Vulnerable sample design

Dataset design

Create vulnerable, benign, and patched code sample sets.

Pending

Review harness

Harness build

Wire review prompts, graders, CWE mappings, and fix quality checks.

Pending

Commercial bridge

Private benchmarking and related assets

Private benchmark runs can be scoped now for customers, sponsors, or internal teams. Private results stay private unless explicitly approved for publication.

Private benchmark CTA

Request Code Review Benchmark

Available now

Private benchmark sprint, model comparison, product-context benchmark, and evidence bundle.

Claim controls

What the public page can and cannot say

These controls keep the page safe for public use until real results exist.

Public claim guardrails

Internal / Teaser Only

This suite is in active build. Public model rankings and benchmark results will publish after validation.

Claim boundary

  • Public scorecards are validation-gated.
  • Ranking claims are not allowed.
  • Vendor comparison claims are not allowed.

Do not claim

  • Do not claim a model is the best code reviewer.
  • Do not publish detection rates before validated trials.
  • Do not imply replacement of human security review.