Introducing span-detect-1

Henry Liu • Sep 16, 2025

Today we announced our AI Code Detector, the industry’s first tool to identify AI-assisted vs. human-written code with over 95% accuracy across all AI coding tools.

Our code detector is powered by span-detect-1, a state-of-the-art machine learning model for chunk-level detection of AI-generated code. For the first time, span-detect-1 ties AI coding assistants to the code shipped into production, creating clearer visibility into impact and ROI.

We’re equipping leaders with the credible metrics they need to optimize license spend, ensure quality as AI usage increases, and report to stakeholders with confidence.

Visibility challenges with AI coding assistants

With the rising cost of AI-assisted coding, it has become critical for engineering leaders to understand not only adoption but also its holistic impact. Yet visibility remains limited.

The vendor landscape is evolving rapidly, with new entrants and a long tail of niche tools emerging. These tools provide varying degrees of telemetry, and most offer little to no support for code-level visibility. Meanwhile, a significant number of developers continue to use ChatGPT as their primary workhorse, which is completely invisible to IDE-based integrations.

Faced with these gaps, many companies rely on unreliable proxies (e.g., surveys, anecdotes), or resort to installing trackers on developer machines to approximate usage, with significant downsides to both approaches.

ML-powered detection at 95% accuracy

How span-detect-1 works

Our model receives a code document (e.g., a file or a PR diff), splits it into semantic chunks, and generates a classification for each chunk. span-detect-1 was trained on a large, curated corpus based on public GitHub repositories, paired with a carefully tuned set of AI-generated code from a diverse pool of leading AI coding models. Its architecture was designed with the parameter capacity to recognize subtle latent features inherent in AI-generated code.

For each chunk, the model produces one of three classifications: AI-generated, human-authored, or abstain. By default, span-detect-1 abstains when the chunk does not contain enough signal—for example, when it consists only of import statements.
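To make the flow above concrete, here is a minimal sketch in Python. It is not span-detect-1's implementation: `chunk_by_lines` is a simple line-boundary splitter standing in for semantic chunking, `has_enough_signal` is a hypothetical rule standing in for the model's learned abstain behavior, and `score_fn` is a placeholder for any function that returns the probability that a chunk is AI-generated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ChunkResult:
    start: int   # character offset where the chunk begins
    end: int     # character offset where the chunk ends (exclusive)
    label: str   # "ai", "human", or "abstain"


def chunk_by_lines(text: str, max_chars: int = 3000) -> List[Tuple[int, int]]:
    """Greedy line-boundary chunking (an assumed stand-in for semantic chunking).

    A single line longer than max_chars becomes its own oversized chunk.
    """
    spans = []
    start = pos = 0
    for line in text.splitlines(keepends=True):
        # Close the current chunk if adding this line would exceed the limit.
        if pos > start and pos + len(line) - start > max_chars:
            spans.append((start, pos))
            start = pos
        pos += len(line)
    if pos > start:
        spans.append((start, pos))
    return spans


def has_enough_signal(chunk: str) -> bool:
    """Hypothetical rule: no signal if the chunk is only imports, comments, or blanks."""
    for line in chunk.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(("import ", "from ", "#")):
            return True
    return False


def classify_document(
    text: str,
    score_fn: Callable[[str], float],  # placeholder for a real model
    max_chars: int = 3000,
    threshold: float = 0.5,            # assumed decision threshold
) -> List[ChunkResult]:
    results = []
    for start, end in chunk_by_lines(text, max_chars):
        chunk = text[start:end]
        if not has_enough_signal(chunk):
            label = "abstain"
        else:
            label = "ai" if score_fn(chunk) >= threshold else "human"
        results.append(ChunkResult(start, end, label))
    return results
```

Calling `classify_document(source_text, score_fn)` returns one `ChunkResult` per chunk, so downstream reporting can map each label back to a character range in the original file or diff.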

At a chunk size of 3,000 characters, span-detect-1 achieves an accuracy rate of 95% with an abstain rate of 5%.
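As a rough illustration of how these two numbers relate, the sketch below computes an abstain rate over all chunks and an accuracy over the chunks the model actually decided on. Scoring accuracy only on non-abstained chunks is an assumption made for this example; the function name and label strings are likewise illustrative.

```python
from typing import Sequence, Tuple


def accuracy_and_abstain_rate(
    predictions: Sequence[str], truths: Sequence[str]
) -> Tuple[float, float]:
    """Return (accuracy on decided chunks, share of chunks abstained on).

    Assumption: abstained chunks are excluded from the accuracy denominator.
    """
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    total = len(predictions)
    decided = [(p, t) for p, t in zip(predictions, truths) if p != "abstain"]
    abstain_rate = (total - len(decided)) / total if total else 0.0
    accuracy = sum(p == t for p, t in decided) / len(decided) if decided else 0.0
    return accuracy, abstain_rate
```

Reporting the two together matters: a detector can raise its accuracy simply by abstaining more often, so neither number is meaningful on its own.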

Languages supported

Presently, we support Python, TypeScript, JavaScript, Java, Ruby, and C#, and we are actively adding support for additional languages. We initially trained a separate model for each language, but discovered that a single unified model generalizes well across them.

Evaluations

span-detect-1 was evaluated by an independent team within Span. The team’s objective was to create an eval that is free from training data contamination and reflects realistic human- and AI-authored code patterns. The eval draws on three sources: real-world human-written code, AI code authored by Devin and crawled from public GitHub repositories, and AI samples that we synthesized by having leading LLMs make “brownfield” edits. In the end, evaluation was performed on balanced datasets of roughly 45K samples each for TypeScript and Python, and an 11K-sample set for TSX.

How it compares to other models

Based on our research and literature review, we found that most academic models were trained on narrow and unrealistic datasets—such as those based on coding competitions. Their reported accuracy was based on in-cohort validation, rather than evaluations designed to capture real-world usage patterns. None had open weights available for evaluation across the languages we currently support.

We found only one credible, accessible, and commercially available AI code detector, which advertised an accuracy of 98%. However, when we ran it through our evaluations, it showed a significant bias toward the human classification and performed no better than chance at correctly classifying samples known to be AI-authored.
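A bias of this kind is easy to miss in a single headline accuracy figure, and easier to see when recall is broken out by true class. The sketch below shows one way to do that; the function and the toy labels in the usage note are made up for illustration and are not taken from our evaluation.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_recall(
    predictions: Sequence[str], truths: Sequence[str]
) -> Dict[str, float]:
    """For each true class, the fraction of its samples that were labeled correctly."""
    totals = Counter(truths)
    hits = Counter(t for p, t in zip(predictions, truths) if p == t)
    return {label: hits[label] / totals[label] for label in totals}
```

For example, on a balanced toy set of four AI and four human samples, a detector that labels every human sample correctly but only half of the AI samples scores 75% overall, while `per_class_recall` reports 1.0 for human and 0.5 for AI, which is chance level on the class that matters most.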

Smaller chunks trade accuracy for granularity

span-detect-1 is designed for chunk-level rather than line-level detection, since individual lines of code rarely contain enough signal to classify accurately. By operating on larger spans of 2,000–3,000 characters, the model achieves significantly higher accuracy. At the same time, it remains flexible: for applications such as pull request–level detection, span-detect-1 supports smaller chunk sizes with an initially modest tradeoff in accuracy. At a chunk size of around 500 characters, however, accuracy starts to degrade rapidly and a strong human bias appears. In our experience, chunk sizes of 700–1,000 characters are practical for production use cases.

How you can use span-detect-1

Give the model a try here. You can use the model commercially by contacting us for an API key or by being a customer of Span. Span is the developer intelligence platform that gives you a complete picture of engineering impact and health. We provide out-of-the-box reporting connected to GitHub/GitLab, Jira, and other tools in the SDLC to help you understand the impact of AI coding assistants on developer productivity, including:

Core adoption KPIs such as the proportion of shipped code that is AI-generated
Utilization and proficiency of AI tool use at the team and individual level
The dose-response relationship between AI code ratio and velocity, quality, and code review outcomes
AI saturation at the file, module, or repository level for reporting or risk management purposes
The ability to understand the escaped defect rate of AI vs. human-authored code (coming soon)
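As an illustration of the first KPI, chunk-level labels can be rolled up into an AI code ratio for a file, PR, or repository. The sketch below weights each chunk by its character count and leaves abstained chunks out of the denominator. Both choices are assumptions made for this example, not a description of how Span computes the metric.

```python
from typing import Iterable, Optional, Tuple


def ai_code_ratio(chunks: Iterable[Tuple[str, int]]) -> Optional[float]:
    """Character-weighted share of decided code that was labeled AI-generated.

    chunks: (label, n_chars) pairs, where label is "ai", "human", or "abstain".
    Returns None when every chunk was abstained on (assumed convention).
    """
    chunks = list(chunks)  # allow generators, since we iterate twice
    ai_chars = sum(n for label, n in chunks if label == "ai")
    human_chars = sum(n for label, n in chunks if label == "human")
    decided = ai_chars + human_chars
    return ai_chars / decided if decided else None
```

With one 3,000-character AI chunk, one 1,000-character human chunk, and a 200-character abstained chunk, the ratio is 3,000 / 4,000 = 0.75.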

What's coming next

We are currently focused on adding support for more languages to span-detect-1. In future releases, we plan to improve accuracy and chunk-size granularity through new model architectures, richer training data, and inference improvements.
