Flawed Algorithms Are Grading Millions of Students’ Essays

signal · Aug 30, 2019

https://www.vice.com/en_us/article/pa7dj9/flawed-algorithms-are-grading-millions-of-students-essays

Fooled by gibberish and highly susceptible to human bias, automated essay-scoring systems are being increasingly adopted, a Motherboard investigation has found

Every year, millions of students sit down for standardized tests that carry weighty consequences. National tests like the Graduate Record Examinations (GRE) serve as gatekeepers to higher education, while state assessments can determine everything from whether a student will graduate to federal funding for schools and teacher pay.

Traditional paper-and-pencil tests have given way to computerized versions. And increasingly, the grading process—even for written essays—has also been turned over to algorithms.

Natural language processing (NLP) artificial intelligence systems—often called automated essay scoring engines—are now either the primary or secondary grader on standardized tests in at least 21 states, according to a survey conducted by Motherboard. Three states didn't respond to the questions.

Of those 21 states, three said every essay is also graded by a human. But in the remaining 18 states, only a small percentage of students' essays (the share varies between 5 and 20 percent) will be randomly selected for a human grader to double check the machine's work.

But research from psychometricians—professionals who study testing—and AI experts, as well as documents obtained by Motherboard, show that these tools are susceptible to a flaw that has repeatedly sprung up in the AI world: bias against certain demographic groups. And as a Motherboard experiment demonstrated, some of the systems can be fooled by nonsense essays with sophisticated vocabulary.

Essay-scoring engines don't actually analyze the quality of writing. They're trained on sets of hundreds of example essays to recognize patterns that correlate with higher or lower human-assigned grades. They then predict what score a human would assign an essay, based on those patterns.

AI has the potential to exacerbate discrimination, experts say. Training essay-scoring engines on datasets of human-scored answers can ingrain existing bias in the algorithms. But the engines also focus heavily on metrics like sentence length, vocabulary, spelling, and subject-verb agreement—the parts of writing that English language learners and other groups are more likely to do differently. The systems are also unable to judge more nuanced aspects of writing, like creativity.

Nevertheless, test administrators and some state education officials have embraced the technology. Traditionally, essays are scored jointly by two human examiners, but it is far cheaper to have a machine grade an essay, or serve as a back-up grader to a human.

The nonprofit Educational Testing Service is one of the few vendors, if not the only one, to have published research on bias in machine scoring. Its "E-rater" engine is used to grade a number of statewide assessments, the GRE, and the Test of English as a Foreign Language (TOEFL), which foreign students must take before attending certain colleges in the U.S.

In studies from 1999, 2004, 2007, 2008, 2012, and 2018, ETS found that its engine gave higher scores to some students, particularly those from mainland China, than did expert human graders. Meanwhile, it tended to underscore African Americans and, at various points, Arabic, Spanish, and Hindi speakers—even after attempts to reconfigure the system to fix the problem.

E-rater tended to give students from mainland China lower scores for grammar and mechanics, when compared to the GRE test-taking population as a whole. But the engine gave them above-average scores for essay length and sophisticated word choice, which resulted in their essays receiving higher overall grades than those assigned by expert human graders. That combination of results, Williamson and the other researchers wrote, suggested many students from mainland China were using significant chunks of pre-memorized shell text.

African Americans, meanwhile, tended to get low marks from E-rater for grammar, style, and organization—a metric closely correlated with essay length—and therefore received below-average scores. But when expert humans graded their papers, they often performed substantially better.

Several years ago, Les Perelman, the former director of writing across the curriculum at MIT, and a group of students developed the Basic Automatic B.S. Essay Language (BABEL) Generator, a program that patched together strings of sophisticated words and sentences into meaningless gibberish essays. The nonsense essays consistently received high, sometimes perfect, scores when run through several different scoring engines.

Motherboard replicated the experiment. We submitted two BABEL-generated essays—one in the "issue" category, the other in the "argument" category—to the GRE's online ScoreItNow! practice tool, which uses E-rater. Both received scores of 4 out of 6, indicating the essays displayed "competent examination of the argument and convey(ed) meaning with acceptable clarity."

Here's the first sentence from the essay addressing technology's impact on humans' ability to think for themselves: "Invention for precincts has not, and presumably never will be undeniable in the extent to which we inspect the reprover."

"The BABEL Generator proved you can have complete incoherence, meaning one sentence had nothing to do with another," and still receive a high mark, Perelman told Motherboard.

Mivey · Aug 30, 2019

Training humans to produce texts that are "approved" by current algorithms, regardless of their coherency, is definitely one way to improve the quality of current AI: instead of improving and fixing the algorithms, we just change how humans write and think.

dark_prinny · Aug 30, 2019

Jokers trick?

Mendrox · Aug 30, 2019

You drink water, I drink anarchy

signal · Aug 30, 2019

Mendrox said:
You drink water, I drink anarchy

🤔

Mendrox · Aug 30, 2019

signal said:
🤔

It's from the thread you didn't like https://www.resetera.com/threads/gu...write-a-batman-movie-of-its-own.134676/page-4

Sulik2 · Aug 30, 2019

This is fucking ludicrous. You cannot grade an essay with a bot. Everything is fucking broken in this country and how awful our education system is is behind a lot of the problems.

Deleted member 8561 · Aug 30, 2019

Quick, somebody write in some SQL injections in your opening paragraph and wipe the standardized testing database!

Dennis8K · Aug 30, 2019

I didn't even know algorithms were being used to grade essays.

What an absolutely stupid idea and complete overreach of what algorithms can be expected to do.

HYBRIS. Who made this decision? Who are these idiots? You are smart enough to understand what an algorithm is but not smart enough to realize the limitations?

signal · Aug 30, 2019

Mendrox said:
It's from the thread you didn't like https://www.resetera.com/threads/gu...write-a-batman-movie-of-its-own.134676/page-4

Whoa called out. Sorry (._. )

kmfdmpig · Aug 30, 2019

Essays are much too complex to grade by computer. If you're making tests and are too damn cheap to pay for human graders then make non-essay based tests.

signal · Aug 30, 2019

kmfdmpig said:
Essays are much too complex to grade by computer. If you're making tests and are too damn cheap to pay for human graders then make non-essay based tests.

Even some of the non essay ones seem weird. Like this is trying to dictate a particular response phrasing just so it can be properly graded by a machine lol.

Quixzlizx · Aug 30, 2019

Essay-scoring engines don't actually analyze the quality of writing. They're trained on sets of hundreds of example essays to recognize patterns that correlate with higher or lower human-assigned grades. They then predict what score a human would assign an essay, based on those patterns.

So you don't actually have to know how to think or write, you just have to be a mediocre imitation of someone who does.

Sounds about right for this country.

Ether_Snake · Aug 30, 2019

To train the algorithms they use real essays for which the grades are already known, then the AI reads the essays and guesses a score. Eventually after a lot of training it reaches a point where it is able to predict which score an essay it never read could get based on what it was trained on previously. Half of the essays are used for the training, the other half to test its capacity against. The AI's prediction accuracy can then be evaluated.

Of course there comes a point where you have too little data to train the AI on, so as new essays come in small variables could throw it off completely.
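The train-on-half, test-on-half procedure described above can be sketched in a few lines of Python. This is only a toy illustration, not how E-rater or any real engine works: the single feature (word count), the helper names (`word_count`, `fit_line`, `predict`, `train_and_evaluate`), and the made-up essays and scores are all assumptions for the example.

```python
import random
import statistics


def word_count(essay):
    """Hypothetical surface feature: number of whitespace-separated words."""
    return float(len(essay.split()))


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All training essays have the same length: predict the mean score.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def predict(model, essay):
    """Predict a score from the fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * word_count(essay) + intercept


def train_and_evaluate(scored_essays, seed=0):
    """Shuffle, train on one half, report mean absolute error on the other."""
    data = list(scored_essays)
    random.Random(seed).shuffle(data)
    half = len(data) // 2
    train, held_out = data[:half], data[half:]
    model = fit_line([word_count(e) for e, _ in train], [s for _, s in train])
    errors = [abs(predict(model, e) - s) for e, s in held_out]
    return model, statistics.fmean(errors)


# Fabricated toy data: "human" scores that happen to track length exactly.
toy_essays = [("word " * n, n / 60) for n in range(60, 361, 30)]
model, mae = train_and_evaluate(toy_essays)

# A 360-word string of nonsense is scored purely on its length.
gibberish = "reprover " * 360
print(f"held-out mean absolute error: {mae:.3f}")
print(f"predicted score for gibberish: {predict(model, gibberish):.1f}")
```

Because the model only ever sees the surface feature, a long nonsense essay gets the same prediction as a long coherent one, which is roughly the weakness the BABEL experiment in the article exploits.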

gaugebozo · Aug 30, 2019

This is crazy at this point. I work in automated scoring for math, and the state of the art makes many mistakes, so it's used more to help students than for assessment. With math you can actually use logic and exact solutions to tease out the answer, and there's nothing like points for nuance or creativity. And machine learning is inherently probabilistic.

Encephalon · Aug 30, 2019

Who on earth thinks this is a good idea? Grading an essay involves ... judging the overall structure, message, argument, etc. of the essay. Not individual sentences, words, etc.

Also, wtf at that "label your inference" example.

Dennis8K · Aug 30, 2019

signal said:
Even some of the non essay ones seem weird. Like this is trying to dictate a particular response phrasing just so it can be properly graded by a machine lol.

Manatees can swim under water (observation)

Manatees are adapted to breathe under water for at least some time (inference)

DID I PASS?

Z-Beat · Aug 30, 2019

When I was in gradeschool, some of my teachers had us submit our essays to an anti-plagiarism software that basically just looks over your sentences and gives a percentage of how much of it was word for word from somewhere in the internet.
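For anyone curious what a "percentage that is word for word" check might look like, here is a minimal sketch, assuming the tool compares runs of consecutive words (n-grams) against a set of known source texts. The function names, the default run length of 5 words, and the idea of passing sources in as plain strings are assumptions for illustration; real products search large indexes and are not described in this thread.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def overlap_percentage(essay, sources, n=5):
    """Percentage of the essay's n-word runs that appear verbatim in any source."""
    essay_grams = ngrams(tokenize(essay), n)
    if not essay_grams:
        return 0.0  # essay is shorter than n words
    known = set()
    for source in sources:
        known.update(ngrams(tokenize(source), n))
    matched = sum(1 for gram in essay_grams if gram in known)
    return 100.0 * matched / len(essay_grams)


source = "The quick brown fox jumps over the lazy dog"
copied = "the quick brown fox eats many small green apples"
print(f"{overlap_percentage(copied, [source], n=3):.1f}% of 3-word runs match")
```

A check like this leaves a concrete trail (the matching runs and the source they came from), which is a different kind of claim from predicting a grade.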

hateradio · Sep 2, 2019

signal said:
Even some of the non essay ones seem weird. Like this is trying to dictate a particular response phrasing just so it can be properly graded by a machine lol.

That manatee ain't trifling with some stale ass bread.

Wulfric · Sep 2, 2019

signal said:
Even some of the non essay ones seem weird. Like this is trying to dictate a particular response phrasing just so it can be properly graded by a machine lol.

A. This manatee is a heckin' chonker.

just a slowpoke · Sep 2, 2019

the purpose of algorithms like this is to launder biases against groups of people through the maaaagic of linear algebra

the goal is to enforce conformity with the plausible deniability of pretending that their judgments must be unbiased, because a maaaagic algorithm said so

Cation · Sep 2, 2019

Why are standardized tests still testing actual essay writing...?

Back when I took them, no school I applied to cared for the entire writing portion.

low-G · Sep 2, 2019

Dennis8K said:
Manatees can swim under water (observation)

Manatees are adapted to breathe under water for at least some time (inference)

DID I PASS?

FAIL. ZERO. DID NOT IDENTIFY TEXT. FAILURE TO FOLLOW DIRECTIONS. EXPELLED. REPORT TO DISTRICT COURT BY 1PM EASTERN TIME TODAY.

molnizzle · Sep 2, 2019

Cation said:
Why are standardized tests still testing actual essay writing...?

Back when I took them, no school I applied cared for the entire writing portion.

Back when I took them, the writing portions didn't even exist. I had to take the GRE for business school last year and couldn't believe that it had a stupid essay.

TheOMan · Sep 2, 2019

low-G said:
FAIL. ZERO. DID NOT IDENTIFY TEXT. FAILURE TO FOLLOW DIRECTIONS. EXPELLED. REPORT TO DISTRICT COURT BY 1PM EASTERN TIME TODAY FOR BORG ASSIMILATION

ENHANCED SENTENCE FOR YOUR CONSUMPTION

Yourfawthaaa · Sep 2, 2019

Hmm So what detects plagiarism?

Baccus · Sep 2, 2019

A boring dystopia indeed

Nothing Loud · Sep 2, 2019

God damn it people. More and more industries are misusing machine learning and AI because it's the "hot thing to do" whether it makes sense or not

samoyed · Sep 2, 2019

Yourfawthaaa said:
Hmm So what detects plagiarism?

About 5-7 years ago, it would've been a program that just runs searches on your essay to see any duplicates on the internet. I don't think there's anything wrong with that because it produces a verifiable concrete trail. Having the algorithm actually grade essays is madness seeing how easy it is to game.

Min · Sep 2, 2019

Mivey said:
Training humans to produce texts that are "approved" by current algorithms, regardless of their coherency, is definitely one way to improve the quality of current AI: instead of improving and fixing the algorithms, we just change how humans write and think.

I've thought about this in the past and in regards to how news and headlines are written. Fascinating and Terrifying.

Ziltoidia 9 · Sep 2, 2019

But... how will the algorithms protect kids during a school shooting?

Nassudan · Sep 2, 2019

Dennis8K said:
HYBRIS. Who made this decision? Who are these idiots? You are smart enough to understand what an algorithm is but not smart enough to realize the limitations?

"There's an algorithm for everything!"
- Tech bros

samoyed · Sep 2, 2019

Ziltoidia 9 said:
But... how will the algorithms protect kids during a school shooting?

The algorithm will detect when a shooting is likely to take place, and then you just keep your kid at home on that day.
