Fooled by gibberish and highly susceptible to human bias, automated essay-scoring systems are being increasingly adopted, a Motherboard investigation has found.
Every year, millions of students sit down for standardized tests that carry weighty consequences. National tests like the Graduate Record Examinations (GRE) serve as gatekeepers to higher education, while state assessments can determine everything from whether a student will graduate to federal funding for schools and teacher pay.
Traditional paper-and-pencil tests have given way to computerized versions. And increasingly, the grading process—even for written essays—has also been turned over to algorithms.
Natural language processing (NLP) artificial intelligence systems—often called automated essay scoring engines—are now either the primary or secondary grader on standardized tests in at least 21 states, according to a survey conducted by Motherboard. Three states didn't respond to the questions.
Of those 21 states, three said every essay is also graded by a human. But in the remaining 18 states, only a small percentage of students' essays (it varies between 5 and 20 percent) will be randomly selected for a human grader to double-check the machine's work.
But research from psychometricians—professionals who study testing—and AI experts, as well as documents obtained by Motherboard, show that these tools are susceptible to a flaw that has repeatedly sprung up in the AI world: bias against certain demographic groups. And as a Motherboard experiment demonstrated, some of the systems can be fooled by nonsense essays with sophisticated vocabulary.
Essay-scoring engines don't actually analyze the quality of writing. They're trained on sets of hundreds of example essays to recognize patterns that correlate with higher or lower human-assigned grades. They then predict what score a human would assign an essay, based on those patterns.
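To make that concrete, here is a deliberately tiny sketch of the general idea, not E-rater or any vendor's actual engine. Everything in it is an assumption made for illustration: the three training essays and their scores are invented, the function names are hypothetical, and the only "pattern" it learns is average word length. Real engines are trained on hundreds of essays and many more features.

```python
import re

def avg_word_length(essay):
    """Surface feature: mean number of letters per word. It says nothing about meaning."""
    words = re.findall(r"[A-Za-z']+", essay)
    return sum(len(w) for w in words) / len(words) if words else 0.0

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical training data: (essay text, score a human grader assigned).
TRAINING = [
    ("The cat sat. It was fun.", 2),
    ("Schools should teach music because it helps students focus.", 4),
    ("Technological innovation fundamentally transforms contemporary "
     "educational institutions.", 6),
]

SLOPE, INTERCEPT = fit_line(
    [avg_word_length(text) for text, _ in TRAINING],
    [score for _, score in TRAINING],
)

def predict_score(essay):
    """Predict the score a human would give, using only the learned surface pattern."""
    return INTERCEPT + SLOPE * avg_word_length(essay)

plain = "Phones can help us think, but we must use them with care."
gibberish = ("Invention for precincts has not, and presumably never will be "
             "undeniable in the extent to which we inspect the reprover.")

print(round(predict_score(plain), 2), round(predict_score(gibberish), 2))
```

Because this toy model only sees word length, the meaningless sentence outscores the clear one. That is the same kind of weakness, in miniature, that critics of the scoring engines describe.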
AI has the potential to exacerbate discrimination, experts say. Training essay-scoring engines on datasets of human-scored answers can ingrain existing bias in the algorithms. The engines also focus heavily on metrics like sentence length, vocabulary, spelling, and subject-verb agreement, the parts of writing that English language learners and other groups are more likely to do differently. And the systems are unable to judge more nuanced aspects of writing, like creativity.
Nevertheless, test administrators and some state education officials have embraced the technology. Traditionally, essays are scored jointly by two human examiners, but it is far cheaper to have a machine grade an essay, or serve as a back-up grader to a human.
The nonprofit Educational Testing Service is one of the few vendors, if not the only one, to have published research on bias in machine scoring. Its "E-rater" engine is used to grade a number of statewide assessments, the GRE, and the Test of English as a Foreign Language (TOEFL), which foreign students must take before attending certain colleges in the U.S.
In studies from 1999, 2004, 2007, 2008, 2012, and 2018, ETS found that its engine gave higher scores to some students, particularly those from mainland China, than did expert human graders. Meanwhile, it tended to underscore African Americans and, at various points, Arabic, Spanish, and Hindi speakers—even after attempts to reconfigure the system to fix the problem.
E-rater tended to give students from mainland China lower scores for grammar and mechanics, when compared to the GRE test-taking population as a whole. But the engine gave them above-average scores for essay length and sophisticated word choice, which resulted in their essays receiving higher overall grades than those assigned by expert human graders. That combination of results, Williamson and the other researchers wrote, suggested many students from mainland China were using significant chunks of pre-memorized shell text.
African Americans, meanwhile, tended to get low marks from E-rater for grammar, style, and organization (a metric closely correlated with essay length) and therefore received below-average scores. But when expert humans graded their papers, those students often scored substantially better.
Several years ago, Les Perelman, the former director of writing across the curriculum at MIT, and a group of students developed the Basic Automatic B.S. Essay Language (BABEL) Generator, a program that patched together strings of sophisticated words and sentences into meaningless gibberish essays. The nonsense essays consistently received high, sometimes perfect, scores when run through several different scoring engines.
Motherboard replicated the experiment. We submitted two BABEL-generated essays—one in the "issue" category, the other in the "argument" category—to the GRE's online ScoreItNow! practice tool, which uses E-rater. Both received scores of 4 out of 6, indicating the essays displayed "competent examination of the argument and convey(ed) meaning with acceptable clarity."
Here's the first sentence from the essay addressing technology's impact on humans' ability to think for themselves: "Invention for precincts has not, and presumably never will be undeniable in the extent to which we inspect the reprover."
"The BABEL Generator proved you can have complete incoherence, meaning one sentence had nothing to do with another," and still receive a high mark, Perelman told Motherboard.