June 3, 2026 · Blue Lantern Security Team · Practical Applications

Comparing the accuracy of AI models on the OWASP code scanning benchmarks

Prelude

There is a lot of fear in the security industry concerning AI for several reasons:

Business risk - Incumbent code scanning businesses fear that AI will perform better than existing technologies that have long been in place. If this is the case then they fear losing business to AI code scanners.
Employee job risk - Security analysts fear that AI will cause them to lose their jobs if AI can perform at their level.
Practical security risk - The non-deterministic nature of AI means that sometimes it will perform incorrect actions, so how do we provide proper guardrails for AI in our organizations?

The third point is especially hard because AI raises an accountability problem: who is responsible when an agent makes a mistake? The security engineer in me wants to believe this is ultimately an Identity and Access Management problem. More broadly, I don't think many practical measurements are being made that would tell us whether these fears are actually substantiated. In this post I will explore:

How do we measure how good security scanners are?
Can we apply a similar measure to AI code scanning, and how does it perform?
What are some preliminary results and conclusions from testing?
Does the performance justify the price of tokens?

Introduction to OWASP benchmarks

When you're evaluating the quality of code scanners, you need a set of criteria and tests to measure scanners against each other. Fortunately, OWASP provides a foundation for this with its Benchmark project, which supplies a body of test code containing commonly encountered weaknesses (as a side note, OWASP also maintains its list of the top 10 web app vulnerabilities: https://owasp.org/www-project-top-ten/). OWASP scores a scanner by taking its true positive rate (true positives / all actually vulnerable test cases), its false positive rate (false positives / all actually safe test cases), and then the difference between the two, TPR - FPR.
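To make the scoring concrete, here is a minimal sketch of the TPR - FPR calculation. The `Counts` class, the `benchmark_score` function, and the example counts are hypothetical and for illustration only; they are not figures from our test runs or code from the OWASP tooling.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # vulnerable test cases the scanner flagged
    fn: int  # vulnerable test cases the scanner missed
    fp: int  # safe test cases the scanner flagged anyway
    tn: int  # safe test cases the scanner left alone


def benchmark_score(c: Counts) -> tuple[float, float, float]:
    """Return (TPR, FPR, TPR - FPR) for one scanner on one suite."""
    positives = c.tp + c.fn  # all actually vulnerable test cases
    negatives = c.fp + c.tn  # all actually safe test cases
    if positives == 0 or negatives == 0:
        raise ValueError("suite needs both vulnerable and safe test cases")
    tpr = c.tp / positives
    fpr = c.fp / negatives
    return tpr, fpr, tpr - fpr


# Hypothetical counts, not taken from our results.
tpr, fpr, score = benchmark_score(Counts(tp=90, fn=10, fp=5, tn=95))
print(f"TPR={tpr:.3f} FPR={fpr:.3f} score={score:.3f}")
```

A scanner that flags everything gets a TPR of 1 and an FPR of 1, so its score is 0, the same as a scanner that flags nothing. That is why the difference is more informative than either rate alone.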

This is a fairly effective measure for the quality of outcomes with code scanners. However, there are two other critical measures we need to consider when we assess tools:

The total time it took to run the tests (we can't have git pipelines taking hours)
The total cost to run the tests (we can't have security bankrupt the company)

We are talking about enterprise security, so long delays in shipping code and extreme costs are unlikely to be acceptable.

The testing

We wanted to test a few candidate models as well as our own homegrown code scanner (see our on prem code scanner if interested in learning more). For the candidate models we went with GPT-5.5 (for quality results) and Gemini 3.5-flash (for fast results); your mileage may vary with other models. We also wanted to test how consistent the AI results were. Non-determinism is a risk in security, so we wanted to understand how variable the results could be (we set the temperature to 0, so we expected results close to deterministic).

One additional challenge is the context size that large repositories demand of AI. To accomplish this testing, we scripted sending pieces of the data to the AI and measured its answers without overflowing the model's context. We had to strike a balance between context size and token efficiency to give the models the best chance of assessing the code accurately. Note, however, that this approach is extremely impractical for enterprise code repositories.
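As an illustration of the batching idea, here is a minimal sketch of one way to pack files into requests under a token budget. It is not our actual script: `estimate_tokens`, `batch_files`, the 4-characters-per-token heuristic, and the file names are all assumptions made for the example.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (an assumption): about 4 characters per token.
    # A real run would use the tokenizer that matches the model.
    return max(1, len(text) // 4)


def batch_files(files: dict[str, str], max_tokens: int) -> list[list[str]]:
    """Greedily group file names so each batch stays within max_tokens.

    A single file larger than the budget still gets a batch of its own,
    so the caller has to decide whether to split or skip it.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, source in files.items():
        cost = estimate_tokens(source)
        # Start a new batch when adding this file would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches


# Hypothetical file names, each about 100 estimated tokens.
files = {
    "TestCase00001.java": "x" * 400,
    "TestCase00002.java": "x" * 400,
    "TestCase00003.java": "x" * 400,
}
print(batch_files(files, max_tokens=250))
```

Larger batches mean fewer requests and less repeated prompt overhead, while smaller batches give the model less to keep track of per answer. That is the balance described above.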

Results table

Suite	Engine	TPR	FPR	Youden	Cost	Runtime
java	bls-code-scanner	0.948	0.023	0.926	$1	18s
java	gpt-5.5	0.931	0.109	0.821	$52.10	56m 8s
java	gemini-3.5	0.941	0.371	0.571	$3.70	15m 26s
python	gpt-5.5	0.819	0.062	0.757	$20.67	50m 16s
python	bls-code-scanner	0.808	0.125	0.683	$1	2s
python	gemini-3.5	0.708	0.222	0.486	$0.86	7m 0s

TPR is the calculated True Positive Rate. FPR is the calculated False Positive Rate. Youden is calculated as TPR - FPR. Cost is how much we were directly charged for the model. The Java test suite contains a total of 2740 tests. The Python test suite contains a total of 1230 tests.

One additional result: we ran gemini-3.5-flash twice on the Python set. The results were fairly consistent, but we noticed 2 differences in which specific CWE was assigned (no positive or negative result changed). This was only a preliminary test of how deterministic the AI results are, but it shows that output can differ between runs even at a temperature of 0.
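A comparison like this can be sketched in a few lines. The data layout below (test ID mapped to a flagged/CWE pair), the `compare_runs` function, and the sample values are hypothetical and only illustrate separating verdict flips from CWE-only changes; they are not our recorded results.

```python
from typing import Optional

# Each run maps a test ID to (flagged_as_vulnerable, assigned_cwe).
Run = dict[str, tuple[bool, Optional[int]]]


def compare_runs(run_a: Run, run_b: Run) -> tuple[list[str], list[str]]:
    """Return (verdict_flips, cwe_changes) for tests present in both runs."""
    verdict_flips: list[str] = []
    cwe_changes: list[str] = []
    for test_id in sorted(run_a.keys() & run_b.keys()):
        flagged_a, cwe_a = run_a[test_id]
        flagged_b, cwe_b = run_b[test_id]
        if flagged_a != flagged_b:
            # The positive/negative result itself changed.
            verdict_flips.append(test_id)
        elif cwe_a != cwe_b:
            # Same verdict, but a different CWE was assigned.
            cwe_changes.append(test_id)
    return verdict_flips, cwe_changes


# Hypothetical results from two runs of the same model.
first = {"t1": (True, 89), "t2": (True, 79), "t3": (False, None)}
second = {"t1": (True, 89), "t2": (True, 80), "t3": (False, None)}
flips, changes = compare_runs(first, second)
print("verdict flips:", flips)
print("CWE-only changes:", changes)
```

Keeping the two lists separate matters because a CWE-only change does not move TPR or FPR, while a verdict flip does.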

Note - Blue Lantern Security caps the cost of a run at $1, regardless of the number of files in the repository.

Conclusions

GPT-5.5 scored well above the other scanners on the Python benchmark, while the Blue Lantern Security code scanner had the highest Youden index on Java. Some more interesting observations:

The temperature being set to 0 and the fact that these are well known test sets could influence the results here. It's unclear how AI would perform on an unknown test set... since it's unknown. In my opinion there is a market opportunity here for third party accuracy validation that hides true results from AI tools to provide objective measurements. Public benchmarks to this effect are extraordinarily difficult to find and leverage.
AI models took a very long time to run. An hour of build time just for security scanning is likely unacceptable at today's pace of development.
The price point is pretty steep if we're discussing per-build scanning, but if you only need a one-time scan every once in a while, then perhaps the price can be justified. There is also a dependency on the size of the repository: the cost here is directly tied to the tokens used to submit all of the OWASP benchmark tests.
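To put the price point in perspective, here is a rough sketch of the arithmetic. The costs and test counts come from the results table above; the function names, the 20 builds per day, and the 30-day month are hypothetical, and the monthly figure assumes every build rescans a codebase the size of the benchmark suite.

```python
# (cost in USD, number of tests), taken from the results table.
RUNS = {
    ("java", "gpt-5.5"): (52.10, 2740),
    ("java", "gemini-3.5"): (3.70, 2740),
    ("python", "gpt-5.5"): (20.67, 1230),
    ("python", "gemini-3.5"): (0.86, 1230),
}


def cost_per_test(cost: float, tests: int) -> float:
    """Average cost in USD of assessing one benchmark test case."""
    return cost / tests


def monthly_cost(cost_per_run: float, builds_per_day: int, days: int = 30) -> float:
    """Projected monthly spend if every build triggers a full scan."""
    return cost_per_run * builds_per_day * days


for (suite, engine), (cost, tests) in RUNS.items():
    print(f"{suite:7} {engine:11} ${cost_per_test(cost, tests):.4f} per test")

# Hypothetical workload: 20 builds per day, full rescan each time.
java_gpt_cost, _ = RUNS[("java", "gpt-5.5")]
print(f"projected: ${monthly_cost(java_gpt_cost, builds_per_day=20):,.2f} per month")
```

In practice a per-build scan would more likely cover only changed files, so this projection is an upper bound under the stated assumptions rather than a forecast.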

Hopefully this discussion sets off some important follow-on considerations for our readers. I'd like to leave you with the following three questions:

Are you using AI for tasks where purpose-built tools could actually perform better, and at a better price?
Can we start applying more direct measures to try and reduce fears surrounding AI and move towards more pragmatic discussions around tooling?
What is the correct role for determinism when discussing AI tooling in security, such as adjusting temperature, and is there a point for using less deterministic tooling when talking about more advanced adversaries?

Let us know your thoughts at [email protected] or check out our products at bluelanternsecurity.io
