In 2023, five Stanford researchers tested seven AI detectors on essays by non-native English speakers. The result: 61.22% were falsely flagged as AI-generated, and in 20% of cases all seven detectors unanimously got it wrong, while essays by native English speakers were classified with near-perfect accuracy. This article explains the full methodology: the TOEFL versus US eighth-grade datasets, the perplexity mechanism behind the bias, the symmetry test showing that vocabulary complexity drives false positives, the bypass experiment showing ChatGPT can evade its own detectors with one prompt, vendor responses, and what, if anything, has changed by 2026.
The implications of this April 2023 paper by five Stanford University researchers should have changed every institutional policy on AI detection worldwide: seven of the most popular text-based AI detectors flagged 61.22% of essays by non-native English speakers as AI-written, while achieving near-perfect accuracy on essays by native English-speaking students in the United States. The paper, titled "GPT Detectors are Biased Against Non-native English Writers" by Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and Professor James Zou, was published in July 2023 in the peer-reviewed journal Patterns and has been cited in over 400 other studies. A Stanford HAI summary of the study explains the essential discovery in simple terms: "The same statistical feature that makes AI writing predictable, or of low perplexity, is also a feature of non-native English writing. The detectors cannot tell them apart."
The paper, published three years ago, remains the essential reference for every serious discussion of fairness in AI detection. This article will describe exactly what the paper did, what each experiment found, why the bias occurs at a technical level, how the vendors and institutions responded, what has been found since, and what it means for non-native English writers in detection-enabled environments in 2026.
The main result: seven AI detectors flagged 61.22% of essays written by non-native English speakers as AI-generated. The same seven detectors achieved near-perfect accuracy on essays written by US-born eighth-grade students in the same experiments. For 20% of the non-native essays, all seven detectors unanimously judged them AI-written, ruling out individual tool error as the explanation. The full study text is available via the PMC open-access publication.
The mechanism is not a design flaw that can be fixed with a simple patch. Rather, it is a fundamental characteristic of how perplexity-based detection operates. Writing by non-native English speakers has lower lexical diversity, simpler syntactic structures, and more predictable vocabulary than writing by fluent native English speakers, because second-language writing draws on a restricted subset of a language's vocabulary and grammatical structures. Writing by AI language models also has low perplexity, but for a different reason: the models tend to choose high-probability words at every step of generation.
The study included a symmetry test that is particularly illuminating. When TOEFL essays were rewritten to use more sophisticated vocabulary, misclassification rates dropped. When US student essays were rewritten to use simpler vocabulary, misclassification rates rose. The bias follows vocabulary complexity rather than writer identity. This means that any writer, native or non-native, who writes in a simpler, more constrained style faces a higher risk of false positives.
The bypass experiment verified that the same lexical mechanism works in both directions. When AI-generated college application essays were self-edited by ChatGPT using the prompt "Elevate the provided text by employing literary language," the detection rate fell to near zero across all seven detectors. All a cheating student had to do was ask ChatGPT to make its own text sound more sophisticated. The Markup's investigation into how the study's findings affect international students found that these results underscored a fundamental inequity: the detection systems made cheating easier for students who knew how to prompt AI, while systematically accusing honest students who wrote with a constrained vocabulary.
Three years of follow-up research have broadly confirmed the study's findings. Subsequent independent tests, institutional deployments, and commercial platform audits have documented elevated false-positive rates for non-native English writers relative to native English writers, with only a subset of platforms demonstrating meaningful improvement. Reporting on how professors use AI detection in 2026 confirms that the bias pattern persists in real educational deployments, particularly for short submissions, formal academic writing styles, and writers from East Asian and Middle Eastern educational traditions.
The Stanford paper was designed to test a specific hypothesis: that AI detectors perform differently across writer populations, specifically native versus non-native English writers. The researchers selected two very different human-written text datasets and evaluated seven commercial AI detectors on both.

Lead author Weixin Liang was a doctoral student in computer science at Stanford, working under Professor James Zou. Liang had learned Cantonese and Mandarin before English and was personally motivated to examine whether detection tools worked fairly for writers with linguistic backgrounds like his. Co-authors Mert Yuksekgonul, Yining Mao, and Eric Wu were PhD students in computer science and electrical engineering at Stanford. Professor James Zou held joint appointments in computer science, biomedical data science, and electrical engineering and served as senior author on the study.
The non-native English writing dataset consisted of 91 TOEFL essays sourced from a publicly available student forum. The TOEFL (Test of English as a Foreign Language) is the standardized English proficiency examination taken by non-native English speakers seeking admission to English-language academic institutions. These essays were written by adult university-level students writing in their second (or later) language under examination conditions.
The native English writing dataset consisted of 88 essays written by US-born eighth-grade students. This comparison group was selected from a publicly available dataset of middle school students' essays. The choice of this comparison group later became a point of methodological criticism: eighth-grade US students are younger and less educationally advanced than adult TOEFL test-takers, which may confound language background and writing maturity. The paper's authors acknowledged this limitation.
The study tested seven AI detection tools that were widely available at the time: GPTZero, Originality.AI, Writer.com AI Content Detector, Crossplag, Copyleaks, Sapling, and Content at Scale. Turnitin was not included in the primary experiment; the paper was published in April 2023, approximately the same time Turnitin launched its AI detection feature. The finding, therefore, describes the seven tested tools specifically, though subsequent research and real-world reporting documented similar patterns in Turnitin deployments.
Each essay from both datasets was submitted to all seven detectors and classified as AI-generated or human-written based on each tool's output. The researchers then analyzed the classification rates across the two populations and ran the additional experiments described in the findings table below.
| Experiment | Dataset | Result | Significance |
| --- | --- | --- | --- |
| Primary bias test | 91 TOEFL essays by non-native English speakers vs. 88 essays by US-born 8th-grade students | Seven AI detectors flagged 61.22% of TOEFL essays as AI-generated; near-perfect accuracy on 8th-grade US essays | The disparity between 61% and a near-zero false positive rate represents a structural failure, not a calibration edge case |
| Unanimity check | Same TOEFL essay dataset | Approximately 20% of non-native writer essays were flagged as AI-generated unanimously by all seven detectors tested | Unanimous misclassification rules out individual tool error; it confirms the signal is present consistently across all tested platforms |
| ICLR 2023 academic paper analysis | Research papers from the International Conference on Learning Representations, split by whether the first author was from a non-English-speaking country | Papers with non-native-English first authors showed systematically lower perplexity scores than papers with native-English first authors | Extends the bias finding beyond student essays to professional academic writing; the same mechanism affects published researchers |
| Vocabulary enhancement test | TOEFL essays rewritten with the prompt "Enhance the word choices to sound more like those of a native speaker" | After vocabulary enhancement, misclassification rates dropped substantially, confirming that lexical simplicity was the causal mechanism | Demonstrates that the bias is not about writing quality or topic familiarity, but specifically about the statistical properties of vocabulary choice |
| Simplification test | 8th-grade US student essays rewritten with the prompt "Simplify word choices as if written by a non-native speaker" | When US student essays were simplified, misclassification rates rose substantially, mirroring the ESL false positive pattern | Provides causal confirmation in both directions: the bias follows vocabulary complexity, not writer identity |
| AI bypass test | ChatGPT-3.5-generated college application essays, then the same essays after the prompt "Elevate the provided text by employing literary language" | Initial AI essays detected correctly; after self-editing, detection rates fell to near zero across all seven detectors | Demonstrates the symmetry of the problem: the same lexical simplicity that makes ESL writing look AI-generated makes AI writing look human when complexity is added |

The Core Explanation: AI detectors identify AI writing by measuring how statistically predictable the text is. Non-native English writers produce statistically predictable text, not because they use AI, but because they write from a more constrained vocabulary. The detector cannot distinguish these two sources of predictability. It is not identifying AI; it is identifying simple vocabulary and incorrectly attributing that simplicity to machine generation.
Understanding why the bias exists requires understanding what perplexity actually measures in the context of AI detection. Perplexity is a statistical measure of how surprised a language model is by a sequence of words. Low perplexity means the words were predictable given their context; high perplexity means the word choices were unexpected. AI language models produce low-perplexity text because they are designed to select the most statistically probable next word at each generation step. Human writers, drawing on personal experience, idiomatic expression, and creative vocabulary, tend to produce higher-perplexity text. This asymmetry is the basis for perplexity-based AI detection. Pangram's analysis of ESL false positive rates explains the precise failure mode: non-native English speakers do not have the same depth of vocabulary or command of complex syntactic constructions as fluent native writers. Their writing exhibits lower perplexity, not because they use AI, but because second-language writing draws on a more restricted portion of the language's vocabulary and structure.
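The perplexity computation itself is simple. The sketch below computes perplexity from a list of per-token probabilities; the probability values are invented for illustration and are not taken from any real detector or model:

```python
import math

def perplexity(token_probs):
    """Perplexity is the exponential of the average negative
    log-probability the model assigned to each observed token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical probabilities a language model might assign to each word.
# Predictable text: every word was near the top of the model's ranking.
predictable = [0.60, 0.55, 0.70, 0.50]
# Surprising text: some words the model considered quite unlikely.
surprising = [0.60, 0.05, 0.02, 0.50]

print(perplexity(predictable))  # lower value
print(perplexity(surprising))   # much higher value
```

A perplexity-based detector effectively applies a threshold to this number: text scoring below the threshold is flagged as AI-generated, which is exactly why systematically predictable human writing gets caught.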
The Stanford study found four specific linguistic features in which non-native English writers performed worse than native English writers, and in which AI-generated text also performed worse. The presence of all four features together is what the detectors recognize as AI-generated text:
Lexical richness: the breadth of vocabulary deployed in the text. Non-native writers tend to rely on the words most familiar to them and to avoid synonyms and technical terms, which yields a smaller working vocabulary than a native English speaker's.
Lexical diversity: the proportion of unique words out of the total number of words. This is related to lexical richness, but it is a global property of how vocabulary is distributed across the text rather than a property of the individual words. Both non-native writers working in a second language and AI models exhibit lower lexical diversity than native writers.
Syntactic complexity: The level of grammatical construction complexity. Native English writers use embedded clauses, the passive voice, participial phrases, and other devices that add to syntactic complexity. At intermediate levels of proficiency, non-native writers will generally use less complex sentences. Similarly, while optimizing for clarity and coherence, AI models will generally use simpler sentences, though they can use more complex ones when prompted.
Grammatical complexity: This is similar to syntactic complexity, but in this case, we are looking at the variety of grammatical constructions used in a text. High grammatical complexity is associated with native language proficiency, whereas both non-native writers and AI models tend to exhibit lower grammatical complexity in their writing.
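One of these measures, lexical diversity, is easy to make concrete. The sketch below computes a simple type-token ratio (unique words divided by total words); the sample sentences are invented for illustration, and published studies typically use more robust, length-normalized variants of this measure:

```python
import re

def type_token_ratio(text: str) -> float:
    """Unique words / total words: a crude lexical-diversity measure."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words)

# A repetitive, constrained style reuses the same few words.
constrained = ("The test was very hard. The test had many parts. "
               "I think the test was hard because it was long.")
# A varied style draws on a wider vocabulary.
varied = ("The examination proved grueling, its numerous sections "
          "sprawling across hours and taxing even diligent candidates.")

print(round(type_token_ratio(constrained), 2))  # lower ratio
print(round(type_token_ratio(varied), 2))       # higher ratio
```

Repetitive, constrained wording yields a lower ratio, and a low score on measures like this is part of the statistical signature that pushes a perplexity-based detector toward an AI verdict.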
This mechanism has been succinctly explained by Professor Zou in his interview with Stanford HAI: "It comes down to how these detectors detect AI. They score based on a metric called 'perplexity,' which is related to how complex the writing is, a factor in which non-native speakers will naturally lag behind their US-born counterparts."
A training-data problem compounds the perplexity mechanism. Most commercial AI detection tools were developed and calibrated on writing samples produced primarily by native English speakers. These datasets effectively define 'what human writing looks like' as the statistical average of native English prose, so the model learns to treat writing that falls below that average as AI-generated. The datasets are not wrong about what native English writing looks like; they are wrong to assume it is the only writing humans produce. UCLA's analysis of AI detector imperfections notes that this is not unique to AI detectors but a general pattern: models trained on predominantly English-speaking, Western data are less reliable for underrepresented populations.
The academic paper experiment in the Stanford study extends this finding from students taking language proficiency tests to professional researchers: papers submitted to a major machine learning conference with a non-native-English-speaking first author showed lower text perplexity than papers with a native-English-speaking first author. The bias, in other words, reaches established experts writing in their own field, simply because their first language is not English.
The Stanford study generated substantial responses from AI detection vendors, educational institutions, and the broader research community. Those responses varied considerably in their candor and in the actions they produced. Originality.ai's critique of the study methodology represents the most detailed vendor pushback and raises legitimate methodological questions, while the independent replication evidence broadly confirms the core finding across multiple platforms.
| Respondent | Position Taken | Response to the Bias Finding |
| --- | --- | --- |
| Turnitin (Annie Chechitelli, Chief Product Officer) | Disputed the finding | Stated that Turnitin's tool was trained on writing by English speakers in the US and abroad, as well as multilingual students, and therefore should not have the identified bias. The company stated that it was conducting its own research into whether the tool is less accurate for non-native writers; published results have not confirmed the absence of bias. |
| Originality.ai | Disputed the methodology | Published a detailed critique arguing the Stanford study used an underpowered sample (91 TOEFL essays), a flawed comparison group (8th-grade US students vs. adult TOEFL test-takers introduces age and education confounds), and misclassified some GPT-4-modified human text as human. Conducted its own study using IELTS essays and reported lower bias on its platform. |
| Pangram Labs | Confirmed and addressed the finding | Reproduced the study on four public ESL datasets and found that perplexity-based detectors do exhibit the bias described. Reported a near-zero false positive rate on the same TOEFL dataset in their own detector, attributing the improvement to training on a broad spectrum of text, including non-native and casual English, rather than exclusively academic writing. |
| OpenAI | Implicitly confirmed through product decision | Had already shut down its AI classifier in July 2023 before the Stanford study was widely circulated, citing low accuracy. The classifier's documented 9% false-positive rate on human writing was consistent with the bias pattern described by the Stanford study. |
| GPTZero | Acknowledged and addressed | Committed to ESL bias reduction in model updates; 2025-2026 updates specifically targeted non-native English false-positive rates, with the platform reporting a 2% higher rate for non-native writers than its native English baseline in current models. |
| Academic institutions | Mixed responses | Vanderbilt, UCLA, UC San Diego, and others cited ESL bias as one of the reasons for disabling AI detection features. Many institutions that continued using detection tools did not update their policies to account for the documented ESL risk. |
The Stanford study was published with a relatively small sample (91 TOEFL essays) and, as its critics pointed out, methodological weaknesses in the comparison group. Three years of subsequent research have addressed these weaknesses, and the core finding has held up.
A 2024 independent audit of Copyleaks, ZeroGPT, Scribbr, and QuillBot Premium, run on 303 ESL graduate-student texts written before 2021 and 307 texts generated by AI tools, found elevated false-positive rates on the ESL texts across all four tools, consistent with the Stanford paper.
Pangram Labs independently replicated the experiment on four public ESL datasets, including the original TOEFL dataset from the Stanford paper, and verified that the bias exists. The company's own detector, trained on a more diverse corpus, achieved a near-zero false-positive rate on those same datasets.
Copyleaks conducted an independent study of 2,116 FCE (First Certificate in English) exam texts, finding a 5.04% false-positive rate for texts written by non-native English speakers, compared to its general baseline of 1-2%. This is a smaller gap than the Stanford paper found, but Copyleaks acknowledged in a footnote that the 5.04% figure represents "a statistically significant increase above our general baseline."
Institutional experience, documented in academic media and court filings, further confirmed the bias. Taylor Hahn, a professor at Johns Hopkins University, observed over the course of a semester that Turnitin was significantly more likely to flag international students' writing than domestic students', an observation that inspired the Stanford team's initial research. The same pattern was observed at UC Davis, where a linguistics professor noted that 15 of 17 flagged essays were written by ESL or second-language writers.
A 2026 follow-up study confirmed that the underlying structural mechanism remains present in the current generation of detectors: TOEFL essays by Chinese students drew a mean false-positive rate of 61.3%, while comparable essays by US students drew a mean false-positive rate of 5.1%. AI text humanizer tools that help non-native English writers produce content with higher lexical variation address the problem at the level of writing output, but the underlying detection bias persists across most deployed platforms.
The Stanford study has drawn several methodological criticisms, most thoroughly from Originality.ai, which published a detailed analysis and conducted its own counter-study. Assessing these criticisms requires distinguishing the ones that are methodologically valid from the ones that, while accurate, do not undermine the original study's conclusion.
Sample size: 91 TOEFL essays is too small a sample to generalize to the worldwide population of non-native English writers. The authors acknowledged this as a limitation. Subsequent replications, however, have confirmed the results across a range of sample sizes.
Comparison group confound: Comparing adult TOEFL writers to US eighth-grade writers introduces age, educational maturity, and writing experience as confounding variables alongside language background. Originality.ai's counter-study addressed this by comparing IELTS essays written by adult non-native writers with essays written by adult native English writers.
GPT-4 classification question: The paper classified some GPT-4-modified human text as 'human,' creating ambiguity about exactly what was being measured in some conditions. This is a legitimate criticism by Originality.ai of the interpretation of those specific results.
Platform generalization: Originality.ai's counter-study found lower bias on its own platform, but that result describes one platform, just as Stanford's finding describes the seven platforms it tested. Neither result can be extrapolated to detectors that were not evaluated.
Training data: More representative training data can mitigate the bias, in theory and in practice, but this is a prescription for future work, not a description of current reality. Most commercial platforms have not demonstrated it.
AI-assisted writing: Some non-native English writers do use AI as a writing tool, which could in principle skew results on ESL bias. But this cannot explain why TOEFL essays written before ChatGPT was available are misclassified, nor why graduate-student essays written before 2021 are misclassified in follow-up studies.
The results of the Stanford study have significant practical implications for every institution that uses AI detection tools in environments that include non-native writers, and for every non-native writer whose work passes through detection screening; reporting on how professors use AI detection in 2026 shows how routinely that now happens.
The false-positive risk is considerably higher than any tool's headline accuracy rate suggests. The Stanford study found roughly a 12-fold difference in false-positive rates between non-native and native English writers on the same writing tasks, and even tools that have addressed ESL bias report false-positive rates 2-5% higher for non-native writers than for native English writers. Comprehensive documentation of your writing process carries far more weight than any statistical estimate: timestamped drafts, notes, version history, and writing-session logs provide evidence independent of any detector's score.
If you are flagged, cite the Stanford study by name. Liang et al. (2023), published in Patterns and cited in over 400 subsequent studies, provides direct evidence that the class of detector most likely used to flag your writing has a documented structural flaw: a tendency to misidentify non-native English writing as AI-generated at rates above 61%.
Never treat an AI detection result as presumptive evidence against a student whose first language is not English without corroborating evidence. The Stanford study, replicated multiple times, has demonstrated that these tools produce a disproportionate number of false positives against such students. Basing an academic integrity proceeding on such a flag, without regard to this well-documented limitation, not only risks misidentifying an innocent student but also creates legal exposure for the institution.
Carefully evaluate the detection tool you use with ESL false-positive rates in mind. Tools that have demonstrated bias reduction, such as GPTZero with its updated models and Pangram, perform significantly better on ESL writing than tools that have not.
Supplement detection results with a process-based approach to verification. Requiring a student to maintain a draft history, submit an annotated outline, or discuss their work briefly in a follow-up conversation provides a form of verification that is not subject to the same statistical biases as a detection tool. The authors of the Stanford study recommended moving away from product-based detection tools toward a more process-based approach.
The Stanford study's central discovery, that seven popular AI detection tools flagged 61% of non-native essays while achieving near-perfect accuracy on native essays, stands as one of the most significant documented instances of unfairness in widely deployed assessment technology. The mechanism is simple: these tools detect AI-generated writing by looking for statistical predictability, and non-native writing is statistically predictable for reasons unrelated to AI use. Further studies have verified the finding across platforms, writing contexts, and populations. Most platforms have acknowledged it. Most policies have not adjusted proportionally. That gap between what research established in 2023 and how institutions operate in 2026 is the lived reality of every non-native English writer whose work is screened by technology never designed to serve them fairly.
The research, conducted by Liang, Yuksekgonul, Mao, Wu, and Zou, tested seven popular AI detection tools on two sets of known human-written content: 91 essays by non-native-English-speaking TOEFL test-takers and 88 essays by US-born eighth-grade students. The tools flagged 61.22% of the TOEFL essays as AI-written while classifying the eighth-grade essays almost perfectly. For about 20% of the TOEFL essays, the AI verdict was unanimous across all seven tools. The research also showed that simplifying native English writing increases misidentification, whereas enriching ESL writing with higher-level vocabulary decreases it.
Texts written by AI and by non-native English speakers both exhibit low perplexity, the key signal used by detection algorithms, but for different reasons: AI models tend to choose high-probability words, while non-native writers work from a restricted vocabulary, producing less lexical variety, less syntactic complexity, and more grammatical regularity. A detector calibrated on large volumes of native English text sees the same statistical signature in both and raises false alarms on ESL writing that has nothing to do with AI use.
Partially, on some platforms. GPTZero has committed to reducing ESL bias in its model updates and reports a false-positive rate about 2% higher for non-native writers than its native English baseline in its 2025-2026 models, a significant improvement over much higher rates in prior versions. Pangram Labs, whose detector was trained with ESL text included from the start, reports a near-zero false-positive rate on the TOEFL dataset from the Stanford study. Most other commercial platforms publish no verifiable data showing equivalent improvement, and as of 2026 the bias remains present and impactful in the majority of institutional deployments.
The first set of issues was methodological. Originality.ai pointed out that the comparison group of US eighth-grade students was younger and less educationally advanced than the TOEFL test-takers, who were adults applying to university, confounding language background with age and education. The sample of 91 TOEFL essays was also small for a claim with such strong population-level implications. The paper's decision to label some GPT-4-modified human text as "human" created further ambiguity in some experimental conditions. These are legitimate methodological concerns, but they do not explain away the 61% misclassification rate, and no well-controlled, large-sample study has demonstrated comparable accuracy for non-native and native English writers on the widely deployed detection tools.

What should I do if I am an ESL writer flagged by an AI detector?
Cite the Stanford study directly: Liang et al. (2023), published in Patterns, found that seven AI writing detectors flagged 61% of non-native English essays as AI-written while achieving near-perfect accuracy on native English essays. Provide documentation of your process: timestamped drafts, research notes, version history, and any other records of how the work was produced. Point out that the detection tool itself has a documented structural bias against non-native English writing, and request that the institution rely on human judgment rather than the tool's score. Ask which tool and version are in use and whether that version has demonstrated reduced ESL bias. Most institutions will not be able to answer the last question, which is itself evidence that they have not engaged with the tool's known limitations.
The information in this article is based on research available as of March 2026. Citations of Liang et al. (2023) refer to the officially published open-access version in Patterns, Cell Press (DOI: 10.1016/j.patter.2023.100779). Direct quotations from the study authors are taken from Stanford HAI and The Markup reporting on the study.