
signal

Member
Oct 28, 2017
40,184

Fooled by gibberish and highly susceptible to human bias, automated essay-scoring systems are being increasingly adopted, a Motherboard investigation has found.

Every year, millions of students sit down for standardized tests that carry weighty consequences. National tests like the Graduate Record Examinations (GRE) serve as gatekeepers to higher education, while state assessments can determine everything from whether a student will graduate to federal funding for schools and teacher pay.

Traditional paper-and-pencil tests have given way to computerized versions. And increasingly, the grading process—even for written essays—has also been turned over to algorithms.
Natural language processing (NLP) artificial intelligence systems—often called automated essay scoring engines—are now either the primary or secondary grader on standardized tests in at least 21 states, according to a survey conducted by Motherboard. Three states didn't respond to the questions.

Of those 21 states, three said every essay is also graded by a human. But in the remaining 18 states, only a small percentage of students' essays—between 5 and 20 percent—will be randomly selected for a human grader to double-check the machine's work.

But research from psychometricians—professionals who study testing—and AI experts, as well as documents obtained by Motherboard, show that these tools are susceptible to a flaw that has repeatedly sprung up in the AI world: bias against certain demographic groups. And as a Motherboard experiment demonstrated, some of the systems can be fooled by nonsense essays with sophisticated vocabulary.
Essay-scoring engines don't actually analyze the quality of writing. They're trained on sets of hundreds of example essays to recognize patterns that correlate with higher or lower human-assigned grades. They then predict what score a human would assign an essay, based on those patterns.
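To make that concrete, here is a minimal sketch of the pattern-matching approach described above: a toy model trained to imitate human-assigned scores. It illustrates the general technique only, not any vendor's actual engine; the TF-IDF features and ridge regression are illustrative stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Essays already scored by human graders (hypothetical examples).
train_essays = [
    "The author argues that technology weakens independent thought.",
    "Technology has changed the world in many good ways.",
    "People think less for themselves because machines answer for them.",
]
human_scores = [5.0, 3.0, 4.0]  # e.g., on the GRE's 1-6 scale

# Learn which surface word patterns correlate with higher human scores.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(train_essays, human_scores)

# The engine never judges quality directly: it predicts the score a
# human grader would most likely assign to similar-looking text.
print(model.predict(["A brand new essay about technology and thought."]))
```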



AI has the potential to exacerbate discrimination, experts say. Training essay-scoring engines on datasets of human-scored answers can ingrain existing bias in the algorithms. But the engines also focus heavily on metrics like sentence length, vocabulary, spelling, and subject-verb agreement—the parts of writing that English language learners and other groups are more likely to do differently. The systems are also unable to judge more nuanced aspects of writing, like creativity.
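For illustration, here is roughly what those surface metrics might look like in code. Real engines' feature sets are proprietary; the features below (average sentence length, vocabulary size, long-word ratio) are assumed stand-ins, and checks like spelling or subject-verb agreement would additionally need a dictionary or parser.

```python
import re

def surface_features(essay: str) -> dict:
    """Crude stand-ins for the kinds of metrics named above."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay.lower())
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "vocab_size": len(set(words)),  # distinct words as a vocabulary proxy
        "long_word_ratio": sum(len(w) > 7 for w in words) / max(len(words), 1),
    }

print(surface_features("Manatees can swim. They must surface to breathe air."))
```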

Nevertheless, test administrators and some state education officials have embraced the technology. Traditionally, essays are scored jointly by two human examiners, but it is far cheaper to have a machine grade an essay, or serve as a back-up grader to a human.
The nonprofit Educational Testing Service is one of the few vendors, if not the only one, to have published research on bias in machine scoring. Its "E-rater" engine is used to grade a number of statewide assessments, the GRE, and the Test of English as a Foreign Language (TOEFL), which foreign students must take before attending certain colleges in the U.S.

In studies from 1999, 2004, 2007, 2008, 2012, and 2018, ETS found that its engine gave higher scores to some students, particularly those from mainland China, than did expert human graders. Meanwhile, it tended to give lower scores to African Americans and, at various points, Arabic, Spanish, and Hindi speakers—even after attempts to reconfigure the system to fix the problem.
E-rater tended to give students from mainland China lower scores for grammar and mechanics, compared with the GRE test-taking population as a whole. But the engine gave them above-average scores for essay length and sophisticated word choice, which resulted in their essays receiving higher overall grades than those assigned by expert human graders. That combination of results, the ETS researchers wrote, suggested that many students from mainland China were using significant chunks of pre-memorized shell text.

African Americans, meanwhile, tended to get low marks from E-rater for grammar, style, and organization—a metric closely correlated with essay length—and therefore received below-average scores. But when expert humans graded their papers, they often performed substantially better.
Several years ago, Les Perelman, the former director of writing across the curriculum at MIT, and a group of students developed the Basic Automatic B.S. Essay Language (BABEL) Generator, a program that patched together strings of sophisticated words and sentences into meaningless gibberish essays. The nonsense essays consistently received high, sometimes perfect, scores when run through several different scoring engines.
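Perelman's actual generator isn't reproduced here, but the idea can be sketched in a few lines: fill grammatical templates with sophisticated vocabulary to get fluent-looking nonsense. The word lists and templates below are invented purely for illustration.

```python
import random

# Toy illustration of the BABEL idea: grammatical templates filled with
# sophisticated words produce fluent-looking but meaningless sentences.
NOUNS = ["precinct", "reprover", "assessment", "axiom", "conjecture"]
ADJS = ["undeniable", "salient", "inextricable", "perfunctory"]
VERBS = ["inspect", "repudiate", "corroborate", "extricate"]

def babble_sentence() -> str:
    return (f"The {random.choice(ADJS)} {random.choice(NOUNS)} will "
            f"{random.choice(VERBS)} the {random.choice(ADJS)} "
            f"{random.choice(NOUNS)} in ways we cannot {random.choice(VERBS)}.")

print(" ".join(babble_sentence() for _ in range(3)))
```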

Motherboard replicated the experiment. We submitted two BABEL-generated essays—one in the "issue" category, the other in the "argument" category—to the GRE's online ScoreItNow! practice tool, which uses E-rater. Both received scores of 4 out of 6, indicating the essays displayed "competent examination of the argument and convey(ed) meaning with acceptable clarity."
Here's the first sentence from the essay addressing technology's impact on humans' ability to think for themselves: "Invention for precincts has not, and presumably never will be undeniable in the extent to which we inspect the reprover."

"The BABEL Generator proved you can have complete incoherence, meaning one sentence had nothing to do with another," and still receive a high mark, Perelman told Motherboard.
 

Mivey

Member
Oct 25, 2017
17,818
Training humans to produce texts that are "approved" by current algorithms, regardless of their coherency, is definitely one way to improve the quality of current AI: instead of improving and fixing the algorithms, we just change how humans write and think.
 

Sulik2

Banned
Oct 27, 2017
8,168
This is fucking ludicrous. You cannot grade an essay with a bot. Everything is fucking broken in this country, and the awfulness of our education system is behind a lot of the problems.
 

Deleted member 8561

user requested account closure
Banned
Oct 26, 2017
11,284
Quick, somebody write in some SQL injections in your opening paragraph and wipe the standardized testing database!
 

Dennis8K

Banned
Oct 25, 2017
20,161
I didn't even know algorithms were being used to grade essays.

What an absolutely stupid idea and complete overreach of what algorithms can be expected to do.

HUBRIS. Who made this decision? Who are these idiots? You're smart enough to understand what an algorithm is but not smart enough to realize its limitations?
 

kmfdmpig

The Fallen
Oct 25, 2017
19,350
Essays are much too complex to grade by computer. If you're making tests and are too damn cheap to pay for human graders then make non-essay based tests.
 
OP
OP
signal

signal

Member
Oct 28, 2017
40,184
Essays are much too complex to grade by computer. If you're making tests and are too damn cheap to pay for human graders then make non-essay based tests.
Even some of the non-essay ones seem weird. Like this is trying to dictate a particular response phrasing just so it can be properly graded by a machine lol.

[Image: a sample test question asking students to label statements about manatees as observations or inferences.]
 

Quixzlizx

Member
Oct 25, 2017
2,591
Essay-scoring engines don't actually analyze the quality of writing. They're trained on sets of hundreds of example essays to recognize patterns that correlate with higher or lower human-assigned grades. They then predict what score a human would assign an essay, based on those patterns.

So you don't actually have to know how to think or write, you just have to be a mediocre imitation of someone who does.

Sounds about right for this country.
 

Ether_Snake

Banned
Oct 29, 2017
11,306
To train the algorithms, they use real essays for which the grades are already known; the AI reads each essay and guesses a score. Eventually, after a lot of training, it reaches the point where it can predict the score an essay it has never read would get, based on what it was trained on previously. Half of the essays are used for training, and the other half are held out to test against, so the AI's prediction accuracy can be evaluated.

Of course, there comes a point where you have too little data to train the AI on, so as new essays come in, small variations could throw it off completely.
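That training-and-holdout procedure can be sketched as follows. The data here is fake, and quadratic weighted kappa is used as the agreement metric (a common choice in essay-scoring research, though not necessarily what any given vendor uses).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for a corpus of human-scored essays.
essays = [f"placeholder essay text number {i} about some topic" for i in range(200)]
scores = [i % 6 + 1 for i in range(200)]  # fake 1-6 human scores

# Half for training, half held out to test against, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    essays, scores, test_size=0.5, random_state=0)

model = make_pipeline(TfidfVectorizer(), Ridge())
model.fit(X_train, y_train)
predicted = [min(6, max(1, round(p))) for p in model.predict(X_test)]

# Quadratic weighted kappa: agreement between machine and human scores
# on essays the model never saw during training.
print(cohen_kappa_score(y_test, predicted, weights="quadratic"))
```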
 

gaugebozo

Member
Oct 25, 2017
2,828
This is crazy at this point. I work in automated scoring for math, and even there the state of the art makes many mistakes, so it's used more to help students than for assessment. And math is a domain where you can actually use logic and exact solutions to tease out an answer; there's nothing like points for nuance or creativity, and machine learning is inherently probabilistic.
 

Encephalon

Member
Oct 26, 2017
5,851
Japan
Who on earth thinks this is a good idea? Grading an essay involves ... judging the overall structure, message, argument, etc. of the essay. Not individual sentences, words, etc.

Also, wtf at that "label your inference" example.
 

Dennis8K

Banned
Oct 25, 2017
20,161
Even some of the non-essay ones seem weird. Like this is trying to dictate a particular response phrasing just so it can be properly graded by a machine lol.

[Image: a sample test question asking students to label statements about manatees as observations or inferences.]
Manatees can swim under water (observation)

Manatees are adapted to breathe under water for at least some time (inference)

DID I PASS?
 

Z-Beat

One Winged Slayer
Member
Oct 25, 2017
31,838
When I was in grade school, some of my teachers had us submit our essays to anti-plagiarism software that basically just looks over your sentences and gives a percentage of how much of the text appears word for word somewhere on the internet.
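A toy version of that match-percentage check: count what share of an essay's five-word sequences appear verbatim in a reference corpus. Real tools search the web; here the "internet" is assumed to be a local list of documents.

```python
def ngrams(text: str, n: int = 5):
    """All n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def match_percentage(essay: str, corpus: list[str], n: int = 5) -> float:
    """Share of the essay's n-grams found verbatim in the corpus."""
    essay_grams = ngrams(essay, n)
    if not essay_grams:
        return 0.0
    known = set().union(*(ngrams(doc, n) for doc in corpus))
    return 100 * len(essay_grams & known) / len(essay_grams)

corpus = ["manatees are adapted to breathe under water for some time"]
print(match_percentage("Manatees are adapted to breathe under water today", corpus))
```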
 
Oct 25, 2017
1,705
the purpose of algorithms like this is to launder biases against groups of people through the maaaagic of linear algebra

the goal is to enforce conformity with the plausible deniability of pretending that their judgments must be unbiased, because a maaaagic algorithm said so
 

Cation

The Fallen
Oct 28, 2017
3,603
Why are standardized tests still testing actual essay writing...?

Back when I took them, no school I applied to cared about the entire writing portion.
 

molnizzle

Banned
Oct 25, 2017
17,695
Why are standardized tests still testing actual essay writing...?

Back when I took them, no school I applied to cared about the entire writing portion.
Back when I took them, the writing portions didn't even exist. I had to take the GRE for business school last year and couldn't believe that it had a stupid essay.
 

Nothing Loud

Literally Cinderella
Member
Oct 25, 2017
9,975
God damn it, people. More and more industries are misusing machine learning and AI because it's the "hot thing to do," whether it makes sense or not.
 

samoyed

Banned
Oct 26, 2017
15,191
Hmm. So what detects plagiarism?
About 5-7 years ago, it would've been a program that just runs searches on your essay to find any duplicates on the internet. I don't think there's anything wrong with that, because it produces a verifiable, concrete trail. Having an algorithm actually grade essays is madness, seeing how easy it is to game.
 

Min

Member
Oct 25, 2017
4,068
Training humans to produce texts that are "approved" by current algorithms, regardless of their coherency, is definitely one way to improve the quality of current AI: instead of improving and fixing the algorithms, we just change how humans write and think.

I've thought about this in the past with regard to how news stories and headlines are written. Fascinating and terrifying.