AI Detection Bias Against Non-Native English: Stanford Study

In 2023, five Stanford researchers tested seven AI detectors on essays by non-native English speakers. The result: 61.22% were falsely flagged as AI-generated. In 20% of cases, all seven detectors unanimously got it wrong. Native English essays achieved near-perfect accuracy. This article explains the full methodology — TOEFL vs US eighth-grade datasets, the perplexity mechanism behind the bias, the symmetry test proving vocabulary complexity drives false positives, the bypass experiment showing ChatGPT can evade its own detectors with one prompt, vendor responses, and whether anything has changed by 2026.

The implications of this April 2023 paper by five Stanford University researchers should have changed every institutional policy on AI detection worldwide: seven of the most popular text-based AI detectors flagged 61.22% of essays by non-native English speakers as written by AI, yet achieved near-perfect accuracy on essays by native English-speaking students in the United States. The paper, titled "GPT Detectors are Biased Against Non-native English Writers" by Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and Professor James Zou, was published in July 2023 in the peer-reviewed journal Patterns and has been cited in over 400 other studies. The Stanford HAI summary of the study explains the essential discovery in simple terms: "The same statistical feature that makes AI writing predictable, or of low perplexity, is also a feature of non-native English writing. The detectors cannot tell them apart."

The paper, published three years ago, remains the essential reference for every serious discussion of fairness in AI detection. This article will describe exactly what the paper did, what each experiment found, why the bias occurs at a technical level, how the vendors and institutions responded, what has been found since, and what it means for non-native English writers in detection-enabled environments in 2026.

Key Takeaways

  1. The main result: seven AI detectors flagged 61.22% of essays written by non-native English speakers as AI-generated. The same seven detectors showed near-perfect accuracy in classifying essays written by US-born eighth-grade students in the same experiments. For roughly 20% of the non-native essays, all seven detectors unanimously judged them to be written by AI, ruling out individual tool error as the explanation. The full study text is available via the PMC open-access publication.

  2. The mechanism is not a design flaw that can be fixed with a simple patch; it is a fundamental characteristic of how perplexity-based detection operates. Writing by non-native English speakers has lower lexical diversity, simpler syntactic structures, and more predictable vocabulary than writing by fluent native English speakers, because second-language writing draws on a restricted subset of a language's vocabulary and grammatical structures. AI language models also produce low-perplexity text, but for a different reason: they tend to choose the most probable word given the context.

  3. The study included a symmetry test that is particularly illuminating. When TOEFL essays were rewritten to use more sophisticated vocabulary, misclassification rates dropped. When US student essays were rewritten to use simpler vocabulary, misclassification rates rose. The bias follows vocabulary complexity rather than writer identity. This means that any writer, native or non-native, who writes in a simpler, more constrained style faces a higher risk of false positives.

  4. The bypass experiment verified that the same lexical mechanism works in both directions. When AI-generated college application essays were self-edited by ChatGPT using the prompt "Elevate the provided text by employing literary language," the detection rate fell to near zero across all seven detectors. All a cheating student had to do was ask ChatGPT to make its own text sound more sophisticated. The Markup's investigation into how the study's findings affect international students found that these findings underscored a fundamental inequity: the detection systems made cheating easier for students who knew how to prompt AI, while systematically accusing honest students who wrote with a constrained vocabulary.

  5. Three years of follow-up research have broadly confirmed the study's findings. Subsequent independent tests, institutional deployments, and commercial platform audits have documented elevated false-positive rates for non-native English writers relative to native English writers, with only a subset of platforms demonstrating meaningful improvement. Reporting on how professors use AI detection in 2026, and on why ESL bias remains a documented concern, confirms that the bias pattern persists in real educational deployments, particularly for short submissions, formal academic writing styles, and writers from East Asian and Middle Eastern educational traditions.

The Study: Methodology Explained

The Stanford paper was designed to test a specific hypothesis: that AI detectors perform differently across writer populations, specifically native versus non-native English writers. The researchers selected two very different human-written text datasets and evaluated seven commercial AI detectors on both.

ai_bias_stanford_research.png

The Research Team

Lead author Weixin Liang was a doctoral student in computer science at Stanford, working under Professor James Zou. Liang had learned Cantonese and Mandarin before English and was personally motivated to examine whether detection tools worked fairly for writers with linguistic backgrounds like his. Co-authors Mert Yuksekgonul, Yining Mao, and Eric Wu were PhD students in computer science and electrical engineering at Stanford. Professor James Zou held joint appointments in computer science, biomedical data science, and electrical engineering and served as senior author on the study.

The Datasets

The non-native English writing dataset consisted of 91 TOEFL essays sourced from a publicly available student forum. The TOEFL (Test of English as a Foreign Language) is the standardized English proficiency examination taken by non-native English speakers seeking admission to English-language academic institutions. These essays were written by adult university-level students writing in their second (or later) language under examination conditions.

The native English writing dataset consisted of 88 essays written by US-born eighth-grade students. This comparison group was selected from a publicly available dataset of middle school students' essays. The choice of this comparison group later became a point of methodological criticism: eighth-grade US students are younger and less educationally advanced than adult TOEFL test-takers, which may confound language background and writing maturity. The paper's authors acknowledged this limitation.

The Seven Detectors Tested

The study tested seven AI detection tools that were widely available at the time: GPTZero, Originality.AI, Writer.com AI Content Detector, Crossplag, Copyleaks, Sapling, and Content at Scale. Turnitin was not included in the primary experiment; the paper was published in April 2023, around the time Turnitin launched its AI detection feature. The finding therefore describes the seven tested tools specifically, though subsequent research and real-world reporting documented similar patterns in Turnitin deployments.

The Experimental Design

Each essay from both datasets was submitted to all seven detectors and classified as AI-generated or human-written based on each tool's output. The researchers then analyzed the classification rates across the two populations and ran the additional experiments described in the findings table below.
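For readers who want to see the shape of this evaluation, the sketch below is a minimal illustration of the per-group comparison, not the authors' actual pipeline. The detector callables (such as the hypothetical `classify_essay` or `gptzero_classify`) stand in for whatever commercial detector API would be queried in practice.

```python
from typing import Callable, Iterable

def false_positive_rate(essays: Iterable[str],
                        classify_essay: Callable[[str], bool]) -> float:
    """Fraction of known-human essays a detector labels as AI-generated.

    `classify_essay` is a placeholder for a call to a commercial
    detector's API; it should return True when the tool says "AI".
    """
    essays = list(essays)
    flagged = sum(1 for text in essays if classify_essay(text))
    return flagged / len(essays)

def unanimity_rate(essays: Iterable[str],
                   detectors: list[Callable[[str], bool]]) -> float:
    """Fraction of essays that every detector flags as AI-generated."""
    essays = list(essays)
    unanimous = sum(1 for text in essays
                    if all(detector(text) for detector in detectors))
    return unanimous / len(essays)

# Hypothetical usage with placeholder datasets and detector callables:
# toefl_fpr = false_positive_rate(toefl_essays, gptzero_classify)
# us_fpr    = false_positive_rate(us_grade8_essays, gptzero_classify)
# unanimous = unanimity_rate(toefl_essays, all_seven_detectors)
```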

The Findings: All Six Experiments

  1. Primary bias test
     Dataset: 91 TOEFL essays by non-native English speakers vs. 88 essays by US-born 8th-grade students
     Result: Seven AI detectors flagged 61.22% of TOEFL essays as AI-generated; near-perfect accuracy on 8th-grade US essays
     Significance: The disparity between 61% and a near-zero false-positive rate represents a structural failure, not a calibration edge case

  2. Unanimity check
     Dataset: Same TOEFL essay dataset
     Result: Approximately 20% of non-native writer essays were flagged as AI-generated unanimously by all seven detectors tested
     Significance: Unanimous misclassification rules out individual tool error; it confirms the signal is present consistently across all tested platforms

  3. ICLR 2023 academic paper analysis
     Dataset: Research papers from the International Conference on Learning Representations, split by whether the first author was from a non-English-speaking country
     Result: Papers with non-native-English first authors showed systematically lower perplexity scores than papers with native-English first authors
     Significance: Extends the bias finding beyond student essays to professional academic writing; the same mechanism affects published researchers

  4. Vocabulary enhancement test
     Dataset: TOEFL essays rewritten with the prompt "Enhance the word choices to sound more like those of a native speaker"
     Result: After vocabulary enhancement, misclassification rates dropped substantially, confirming that lexical simplicity was the causal mechanism
     Significance: Demonstrates that the bias is not about writing quality or topic familiarity, but specifically about the statistical properties of vocabulary choice

  5. Simplification test
     Dataset: 8th-grade US student essays rewritten with the prompt "Simplify word choices as if written by a non-native speaker"
     Result: When US student essays were simplified, misclassification rates rose substantially, mirroring the ESL false-positive pattern
     Significance: Provides causal confirmation in both directions: the bias follows vocabulary complexity, not writer identity

  6. AI bypass test
     Dataset: ChatGPT-3.5-generated college application essays, then the same essays after the prompt "Elevate the provided text by employing literary language"
     Result: Initial AI essays detected correctly; after self-editing, detection rates fell to near zero across all seven detectors
     Significance: Demonstrates the symmetry of the problem: the same lexical simplicity that makes ESL writing look AI-generated makes AI writing look human when complexity is added
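The vocabulary enhancement, simplification, and bypass experiments all amount to asking a language model to rewrite text under a fixed instruction. The sketch below shows how such a rewrite step could be scripted with the OpenAI Python client; only the bypass prompt is quoted from the study, while the model name, the role split, and the surrounding code are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Prompt quoted from the study's bypass experiment; the enhancement and
# simplification experiments used analogous rewrite instructions.
ELEVATE_PROMPT = "Elevate the provided text by employing literary language."

def rewrite(text: str, instruction: str = ELEVATE_PROMPT,
            model: str = "gpt-3.5-turbo") -> str:
    """Ask a chat model to rewrite `text` according to `instruction`."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# The rewritten output would then be resubmitted to each detector and
# the flag rates compared against the originals.
```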

ai_bias_perplexity_mechanism.png

The Technical Mechanism: Why This Bias Exists

The Core Explanation: AI detectors identify AI writing by measuring how statistically predictable the text is. Non-native English writers produce statistically predictable text, not because they use AI, but because they write from a more constrained vocabulary. The detector cannot distinguish these two sources of predictability. It is not identifying AI; it is identifying simple vocabulary and incorrectly attributing that simplicity to machine generation.

Understanding why the bias exists requires understanding what perplexity actually measures in the context of AI detection. Perplexity is a statistical measure of how surprised a language model is by a sequence of words. Low perplexity means the words were predictable given their context; high perplexity means the word choices were unexpected. AI language models produce low-perplexity text because they are designed to select the most statistically probable next word at each generation step. Human writers, drawing on personal experience, idiomatic expression, and creative vocabulary, tend to produce higher-perplexity text. This asymmetry is the basis for perplexity-based AI detection. Pangram's analysis of ESL false positive rates explains the precise failure mode: non-native English speakers do not have the same depth of vocabulary or command of complex syntactic constructions as fluent native writers. Their writing exhibits lower perplexity, not because they use AI, but because second-language writing draws on a more restricted portion of the language's vocabulary and structure.
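To make the perplexity statistic concrete, the sketch below scores text with an off-the-shelf GPT-2 model from the Hugging Face transformers library. This is not any vendor's detector, just an illustration of the quantity described above: lower values mean the model found the text more predictable.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token loss."""
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids
    with torch.no_grad():
        # Passing labels makes the model return the average negative
        # log-likelihood of each token given its left context.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

# A perplexity-based detector flags text whose score falls below some
# threshold; constrained, predictable vocabulary lowers the score
# whether the author is a language model or a second-language writer.
print(perplexity("The experiment is very important for the students."))
print(perplexity("Serendipity rarely rewards a timetable drafted in triplicate."))
```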

Four Properties That Drive the Bias

The Stanford study identified four linguistic properties on which non-native English writers score lower than fluent native writers, and on which AI-generated text also scores low: lexical richness, lexical diversity, syntactic complexity, and grammatical variety. Text that exhibits all four together, a constrained vocabulary arranged in regular, predictable structures, is precisely what the detectors read as AI-generated.

Professor Zou explained the mechanism succinctly in his interview with Stanford HAI: it comes down to how these detectors detect AI. They score text on a metric called perplexity, which tracks how complex the writing is, a dimension on which non-native speakers will naturally lag behind their US-born counterparts.

The Training Data Problem

Beneath the perplexity measure sits a training data problem that compounds the bias. Most commercial AI detection tools were developed and calibrated on writing samples produced primarily by native English speakers. Those datasets effectively define "what human writing looks like." When non-native speakers are underrepresented in them, the model learns that human writing matches the statistical profile of native English prose, and it learns to treat writing that falls below that statistical norm as AI-generated. The model is not wrong about what native English writing looks like; it is wrong to assume that this is the only writing humans produce. UCLA's analysis of AI detector imperfections notes that this is not unique to any particular detector but a general pattern: models built on predominantly English-speaking, Western data are less reliable for underrepresented populations.
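A toy simulation makes the calibration consequence visible. The distributions and numbers below are invented purely for illustration: if a detector's flagging threshold is tuned so that only a few percent of native-speaker writing scores below it, a human population whose perplexity runs systematically lower will be flagged far more often, with no AI involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented perplexity distributions, for illustration only.
native_scores    = rng.normal(loc=60.0, scale=15.0, size=10_000)
nonnative_scores = rng.normal(loc=35.0, scale=10.0, size=10_000)  # lower: constrained vocabulary
ai_scores        = rng.normal(loc=25.0, scale=8.0,  size=10_000)  # lowest: high-probability words

# Calibrate the "flag as AI" threshold so only ~5% of native writing is flagged.
threshold = np.percentile(native_scores, 5)

def flag_rate(scores: np.ndarray) -> float:
    """Share of texts flagged as AI: perplexity at or below the threshold."""
    return float(np.mean(scores <= threshold))

print(f"native false-positive rate:     {flag_rate(native_scores):.1%}")
print(f"non-native false-positive rate: {flag_rate(nonnative_scores):.1%}")
print(f"AI detection rate:              {flag_rate(ai_scores):.1%}")
```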

The academic paper experiment in the Stanford study extends this finding from students taking a language proficiency test to professional researchers: papers submitted to a major machine learning conference whose first author was from a non-English-speaking country showed lower text perplexity than papers whose first author was a native English speaker.

The bias is therefore not confined to language exams; it reaches published researchers writing in their own field, simply because their first language is not English.

Vendor and Institutional Responses

The Stanford study generated substantial responses from AI detection vendors, educational institutions, and the broader research community. Those responses varied considerably in their candour and in the actions they produced. Originality.ai's critique of the study methodology represents the most detailed vendor pushback and raises legitimate methodological questions, while the independent replication evidence broadly confirms the core finding across multiple platforms.

  1. Turnitin (Annie Chechitelli, Chief Product Officer)
     Position: Disputed the finding
     Response: Stated that Turnitin's tool was trained on writing by English speakers in the US and abroad, as well as multilingual students, and therefore should not have the identified bias. The company said it was conducting its own research into whether the tool is less accurate for non-native writers; published results have not confirmed the absence of bias.

  2. Originality.ai
     Position: Disputed the methodology
     Response: Published a detailed critique arguing that the Stanford study used an underpowered sample (91 TOEFL essays), a flawed comparison group (8th-grade US students vs. adult TOEFL test-takers introduces age and education confounds), and misclassified some GPT-4-modified human text as human. Conducted its own study using IELTS essays and reported lower bias on its platform.

  3. Pangram Labs
     Position: Confirmed and addressed the finding
     Response: Reproduced the study on four public ESL datasets and found that perplexity-based detectors do exhibit the bias described. Reported a near-zero false-positive rate on the same TOEFL dataset for its own detector, attributing the improvement to training on a broad spectrum of text, including non-native and casual English, rather than exclusively academic writing.

  4. OpenAI
     Position: Implicitly confirmed through a product decision
     Response: Had already shut down its AI classifier in July 2023, before the Stanford study was widely circulated, citing low accuracy. The classifier's documented 9% false-positive rate on human writing was consistent with the bias pattern described by the Stanford study.

  5. GPTZero
     Position: Acknowledged and addressed
     Response: Committed to ESL bias reduction in model updates; its 2025-2026 updates specifically targeted non-native English false-positive rates, with the platform reporting a rate 2% higher for non-native writers than its native English baseline in current models.

  6. Academic institutions
     Position: Mixed responses
     Response: Vanderbilt, UCLA, UC San Diego, and others cited ESL bias as one of the reasons for disabling AI detection features. Many institutions that continued using detection tools did not update their policies to account for the documented ESL risk.

What Subsequent Research Has Confirmed

The Stanford study was published with a relatively small sample (91 TOEFL essays) and, as its critics pointed out, a methodologically imperfect comparison group. Three years of follow-up research have addressed those weaknesses, and the core finding has held up.

Methodological Criticisms and How They Hold Up

The Stanford study has received several methodological criticisms, especially from Originality.ai, which published a thorough analysis and conducted its own counter-study. To assess these criticisms, it helps to distinguish the points that are methodologically valid from those that, while accurate, do not undermine the original study's conclusion.

Valid Methodological Limitations

Criticisms That Do Not Undermine the Core Finding

What This Means in Practice in 2026

The results of the Stanford study have significant practical implications for every institution that uses AI detection tools in environments that include non-native writers, and for every non-native writer whose work passes through detection screening (see How professors use AI detection in 2026).

For Non-Native English Writers

For Educators and Institutions

Conclusion

The Stanford study's central finding, that seven of the most popular AI detection tools flagged 61% of non-native essays while achieving near-perfect accuracy on native essays, stands as one of the most significant documented instances of unfairness in widely deployed assessment technology. The mechanism is simple: these tools identify AI-generated writing by looking for statistical predictability, and non-native writing is statistically predictable for reasons unrelated to the use of AI. Further studies have verified this across platforms, writing contexts, and populations. Most platforms have acknowledged it. Most policies have not adjusted proportionally in response. That gap between the research of 2023 and the policies of 2026 is the lived reality of every non-native English writer whose work is judged by a technology that was never designed to serve them fairly.

Frequently Asked Questions

What exactly did the Stanford study find?

The research, conducted by Liang, Yuksekgonul, Mao, Wu, and Zou, tested seven popular AI detection tools on two sets of known human-written content: 91 essays written by non-native English-speaking TOEFL test-takers and 88 essays written by US-born eighth-grade students. The tools flagged 61.22% of the TOEFL essays as AI-written while classifying the eighth-grade essays almost perfectly as human. In about 20% of the TOEFL essays, the AI verdict was unanimous across all seven tools. The research also showed that simplifying native English writing increases misidentification, whereas making ESL writing more complex with high-level vocabulary decreases it.

Why does AI detection flag ESL writing as AI-generated?

Texts written by AI and by non-native English speakers both exhibit low perplexity, the key signal used by detection algorithms, but for different reasons: AI models tend to choose high-probability words at every step, while non-native English writers work from a restricted vocabulary that produces less lexical variety, less syntactic complexity, and more grammatical regularity. A detector calibrated on large volumes of native English text sees the same statistical signature in both and raises false alarms on ESL writing that has nothing to do with AI use.

Has the bias been fixed since 2023?

Partially, on some platforms. GPTZero has committed to reducing ESL bias in its model updates and reports that its 2025-2026 models flag non-native writers at a rate about 2% above its native English baseline, a significant improvement over the much higher rates in prior versions. Pangram Labs, whose detector included ESL writing in training from the start, reports a near-zero false-positive rate on the TOEFL dataset from the Stanford study. Most other commercial platforms do not publish verifiable data demonstrating equivalent improvement. As of 2026, the bias remains present and consequential in the majority of institutional deployments.

What were the main criticisms of the study?

The first criticisms were methodological. Originality.ai pointed out that the comparison group of US eighth-grade students was younger and less educationally advanced than the TOEFL test-takers, who were adults applying to university, which confounds language background with age and educational level. The sample of 91 TOEFL essays was also small for a claim with such strong population-level implications, and the paper's decision to treat some GPT-4-modified human text as "human" muddied some experimental conditions. These are legitimate methodological concerns, but they do not explain away the 61% misclassification rate, and no well-controlled, large-sample study has demonstrated comparable accuracy for both non-native and native English writers on the widely deployed detection tools.

What should I do if I am an ESL writer flagged by an AI detector?

Cite the Stanford study directly: Liang et al. (2023), published in Patterns, found that seven AI writing detectors flagged 61% of non-native English essays as AI-written while achieving near-perfect accuracy on native English essays. Provide documentation of your process: timestamped drafts, research notes, and version history. Point out that the detection tool itself has a documented structural bias against non-native English writing, and request that the institution rely on human judgment rather than the tool's score alone. Ask which tool and version are in use and whether that version has documented ESL bias reductions. Most institutions will not be able to answer that last question, which is itself evidence that they have not engaged with the tool's known limitations.

The information in this article is based on research available as of March 2026. This article cites the officially published version of Liang et al. (2023) in Patterns, Cell Press (DOI: 10.1016/j.patter.2023.100779), an open-access publication. Direct quotations from the study authors were taken from Stanford HAI and The Markup reports on the study.