Flawed Algorithms Are Grading Millions of Students’ Essays

dahart · on Aug 30, 2019

> Utah has been using AI as the primary scorer on its standardized tests for several years. “It was a major cost to our state to hand score, in addition to very time consuming,” said Cydnee Carter, the state’s assessment development coordinator. The automated process also allowed the state to give immediate feedback to students and teachers, she said.

Yes, education takes time and costs money. Yes, not educating is both cheaper and faster. Note how the rationalizing ignores the needs of the students and the quality of the education.

I live in Utah and my children have been subjected to this automated essay scoring here. One night I came home from work and my son and wife were both in tears, frustrated with each other and frustrated with the essay scoring which refused to give a high enough score to meet what the teacher said was required, no matter how good the essay was. My wife wrote versions herself from scratch and couldn’t get the required score. When I got involved, I did the same with the same results.

Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not. Random unrelated sentences pasted in the middle would increase the score. We found a letter of petition online for banning automated scoring for the purposes of grades or student evaluation of any kind. It was very long, so it got a perfect score. I encouraged my son to submit it, and he did. Later I visited his teacher to explain and to urge her to not use automated scoring. She listened and then told me about how much time it saves and how fast students get feedback. :/

piokoch · on Aug 30, 2019

Frankly, I can't believe what I am reading. The idea that some "AI" grades essays automatically is idiotic and has nothing to do with education. Where is the place for discussions? Where is the place for ideas confrontation? Where is the place for writing style development? How this AI is supposed to grade things like repetitions (that can be either good rhetorical tool or a mistake, depending on context), etc?

Who the hell came out with such an idea. I would even hesitate to use "AI" for automatic spell checking as it is sufficient to give some character unusual name and it will be marked as error.

My guess is that soon or later people will learn how to game that AI. I wouldn't be surprised if there were some software that will generate essay that Utah "AI" likes.

dagw · on Aug 30, 2019

My guess is that soon or later people will learn how to game that AI.

Already been done. http://lesperelman.com/writing-assessment-robo-grading/babel...

Here's a sample essay that is complete nonsense and got a perfect score on the GRE.

http://lesperelman.com/wp-content/uploads/2015/12/6-6_ScoreI...

thombat · on Aug 30, 2019

The final paragraph from that example is steaming gibberish that nobody could mistake for English:

"Calling has not, and undoubtedly never will be aggravating in the way we encounter mortification but delineate the reprimand that should be inclination. Nonetheless, armed with the knowledge that the analysis augurs stealth with propagandists, almost all of the utterances on my authorization journey. Since sanctions are performed at knowledge, a quantity of vocation can be more gaudily inspected. Knowledge will always be a part of society.Vocation is the most presumptuously perilous assassination of mankind."

Yet the robo-scoring acclaims it as:

* articulates a clear and insightful position on the issue in accordance with the assigned task * develops the position fully with compelling reasons and/or persuasive examples * sustains a well-focused, well-organized analysis, connecting ideas logically * conveys ideas fluently and precisely, using effective vocabulary and sentence variety * demonstrates superior facility with the conventions of standard written English (i.e., grammar, usage, and mechanics) but may have minor errors

Any teacher faced with the requirement to use such tools would be better placed instructing their class on civil disobedience.

crankylinuxuser · on Aug 30, 2019

The let me posit another idea...

There's 2 ways of finding out these artifacts of AI essay grading: pure luck, and being able to afford extensive test-prep (rich).

The luck one can't be accounted for. So I im lead to believe that the purpose of these essays and their AI grading is to find and escalate rich people.

JustSomeNobody · on Aug 30, 2019

> So I im lead to believe that the purpose of these essays and their AI grading is to find and escalate rich people.

Well, of course. How many poor people are allowed to decide what is good for children's education?

crankylinuxuser · on Aug 30, 2019

The standard US response is:

'There's a reason why they're poor. Better pull themselves up by the bootstraps."

Mixed alongside with poverty stricken neighborhoods are the primary funding, resulting in poor school systems. And those students obviously wont have the money or the access to get the test-prep needed to "succeed".

It's all too laid out to be accidental.

mannykannot · on Aug 30, 2019

Professor Perelman had also previously demonstrated that this sort of scoring was going on when essays were scored by humans [1].

I suspect that, in addition to the scoring rules being written for speed and frugality, they were shaped by a poorly thought out attempt to make the scoring 'objective', and independent of the scorer's beliefs, attitudes and unconscious biases.

In one sense, this software (I will not call it 'AI') is an extension of all those bad ideas, only greatly amplified in a way that only software can.

In evolutionary biology, there is the concept of 'honest signalling'[2], a true and unfakable indicator, to a potential mate or predator, of an animal's fitness. That is what we are missing here.

[1] https://www.bostonglobe.com/opinion/2014/03/13/the-man-who-k...

[2] https://en.wikipedia.org/wiki/Signalling_theory

DoctorOetker · on Aug 31, 2019

Anpther issue is one of copyright: obviously the student is the author. And we all know that the ML scoring subcontractor is keeping copies, with human ratings for later retraining purpouses.

At the time the student takes the test, he should be prompted with the informed choice asking him to grant either 1) no license to keep a copy for training purpouses 2) a non-exclusive license, and the website where he can get a copy of his own essay 3) a public domain license, again with the relevant domain linked so he can find his own and other's essays. 4) any of the above or other as a function of the resulting grade!

At the same time he should also specify his desire for or against attribution, again probably best as a function of the resulting grade. And under what moniker he wishes this contribution to exist.

These options to be filled out during exam time should have no default options (no opt-out), and preferably should be standardized by the community and lobbied for to be mandatorily enforced at state or federal level (forcing examinations to present the student with an informed choice)

A public dataset of legally obtained essays (without scores or names) would already be a very important first step to invite others to make actual performant ML grading systems.

I don't believe the current datasets in these "non-profit" organizations actually comply with copyright law, organizations who don't profit from the grading service towards the state, but do provide a stable ML job on the income from charging the people with financial means to test submissions, enabling a stealth class based society.

maze-le · on Aug 30, 2019

That reads like it was written by a GPT-2 bot.

sidpatil · on Aug 30, 2019

A Markov chain would have been able to produce a text like that.

C1sc0cat · on Aug 30, 2019

Wow - should have been marked for improper use of : though :-)

maze-le · on Aug 30, 2019

>> Who the hell came out with such an idea.

I'd guess this is a product of dwindling state finances and contempt for any form of real education. AI's are orders of magnitude cheaper than real teachers. They also don't form unions and wouldn't voice any opposition against changes in the curiculum.

They are also pretty useless, as you have pointed out. The consequences of this policy will be postponed until the students reach a certain age -- that'll be like 10-15 years in the future.

C1sc0cat · on Aug 30, 2019

Until they end up at Uni or Work and find that they haven't developed the right skills.

Balgair · on Aug 30, 2019

> My guess is that soon or later people will learn how to game that AI.

To be fair, the GP here is specifically describing that he gamed the AI via a copy-paste of a critique of the AI, his kid submitted it on their own accord, it was graded without comment, and then when the GP went into comment on the gaming of the AI, the teacher not only did not care that the AI was gamed, but expressed gratitude for the AI saving hours of work, still ignoring that the AI fundamentally made things worse, all at the expense of the entire point of being a teacher in the first place.

The issue, for the teacher, is that in 'the system' in which they collect a pay-check, the AI works flawlessly. The point, for the teacher, is not to educate children. It is to have assignments that children pass with some sort of distribution that can be sent in and calculated by some person in a beige suit, wide tie, and hair troubles. The difference is subtle at first, but when you get further along to the point where the GP is sitting, then the difference is comical.

The AI allows the teacher to increase their effiency in processing assignments, ones that never really mattered to the teacher in the first place. In valley-speak: the incentives are not aligned.

de_watcher · on Aug 30, 2019

I can't believe either, it's completely ridiculous. They're basically claiming that they've developed a general AI. It's like some part of population is living in different fantasy worlds and makes policy decisions accordingly.

amiga_500 · on Aug 30, 2019

Is the USA tenable going forward? Your cost of everything, value of nothing culture appears to be very destructive.

ativzzz · on Aug 30, 2019

It's progress built on top of an assumption of never-ending, unsustainable growth. Unfortunately for everyone else much of the world has been dragged into the rat race along with us.

chessturk · on Aug 30, 2019

I'd argue that the expectation of perpetual population growth is one of the big problems (the unsustainability of social security bottlenecking at the baby boomers being an obvious example).

There is a compelling case for immigration for that sake alone.

derg · on Aug 30, 2019

Honestly, no. It's not.

Granted I already have a very fatalistic view on my future already so take my opinion with a grain of salt.

chessturk · on Aug 30, 2019

Not without change, but I'd argue further change is certain, and likely to be of large magnitude.

I do not know if it will be sustainable, but I doubt it will be static.

lr4444lr · on Aug 30, 2019

I agree with you wholeheartedly but I think there's a stronger argument to be made here: the algorithms being used "work" only on a correlation based on an ignorance of the scoring metric. If the students under test knew even sketchily how the system worked, e.g., points deducted if your average sentence word length > 7, points added if your word length stddev is greater than 2, and the students could meaningfully push their scores up by focusing on these proxies that don't _actually_ measure what a human would say is quality work - or even they can even get gibberish[0] rated highly - then the whole thing is a fraud. No one will stand for a grading system that only works by virtue of obscurity.

[0] https://www.nytimes.com/2012/04/23/education/robo-readers-us...

Nasrudith · on Aug 30, 2019

It is classic beancounter thinking in the worst way, the worst stereotype of a MBA trying to minimize cost beyond all reason cutting corners. Even when it saws at the branch they sit upon.

It is frankly a sign of a diseased culture to use it in any capacity except an exercise to improve AI.

wisty · on Aug 30, 2019

Teachers often score by similarly sensible criteria.

ALittleLight · on Aug 30, 2019

When I was a child I was obsessed with the "grade level" function in Microsoft Word. It was a preference you could enable on spell check to tell you the "grade level" of your writing.

Every essay I wrote, I'd always force myself to reach the max "12.0" grade level. While writing I'd struggle over word choice, sentence structure, rearranging paragraphs, working on my tone etc, all in pursuit of the 12th grade way to phrase things. All my revisions were subject to the approval of the Grade Level checker.

Whenever I could I would check the grade levels of my friend's writing - usually by showing them a "neat feature" they could enable. Then, I'd smugly applaud myself for being the better writer whenever their grade level was below 12.0.

The Grade Level feature fascinated me and to try and master it, I found a book about Microsoft Word and looked through it in a bookstore. I was absolutely gobsmacked at how simple the formulas was. I had childishly been expecting something, like perhaps Utah educators imagine they have. I genuinely expected the method to be complex beyond my understanding.

Instead, Word used a variant of Flesch-Kincaid. There was a direct relationship between sentence length and grade score, and polysyllabic words and grade score. Meaning, the longer your sentences and words, the higher your grade score.

As soon as I got home from the bookstore I loaded a draft of something I had written. It was "pre-12.0" writing from me. I simply deleted all the periods but one and checked again. 12.0.

Automatic grading is a wonderful lure. It's nice to imagine that there's some objective writing quality easy to tap into. At the moment, I think we're far from that ability.

Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.

Revising is a really vital part of writing. Getting more chances to do revision, plus having to write something good enough to show your peers, plus having the risk of any paper count for your grade should compensate for incomplete teacher grading.

nerdponx · on Aug 30, 2019

The fact that you were literally still a child when this happened, but automated grading is being foisted on us by grown adults who are ostensibly professionals, says a lot about the situation.

rwbcxrz · on Aug 30, 2019

> Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.

That's how it's done in creative writing courses. I've always found it infinitely more helpful than only having feedback from the instructor, even if the instructor's feedback was generally more helpful/useful than peer feedback.

chessturk · on Aug 30, 2019

Things like this story, Word's auto-grader, and Grammerly's style preferences are all surreal to me. We are asking a computer to validate prose meant for human consumption.

Not a reflection of physical reality like sensor data or even accounting information, but the method of communication explicitly invented for production and consumption by humans.

Of course feedback from humans is more valuable than feedback computers, it would be irrational/miraculous if anything was better at giving feedback than a human.

It is a shame it isn't self evident to instructors how poor of a solution this is, and how much better the results are when using critique by peers and instructors -- the classic way of doing things.

pzs · on Aug 30, 2019

Arguably, Hemingway's texts are well written. One of the sources of power of his prose is the use of simple words, and basic sentence structures. I bet Word would classify that as below 12th grade.

The point I am trying to make in agreement with the parent is: there are qualities that are very hard to score with algorithms. The difficulty of solving this problem equals if not exceeds that of automated translation, which still only works properly for specialized and limited domains, e.g. weather forecasts.

regrub · on Aug 30, 2019

All that grade-level gaming paid off, I reckon! This was a funny, informative personal account of it :)

jhanschoo · on Aug 30, 2019

It's interesting that the tool (and system) is designed to aid people trying for the opposite result, i.e. for publicists and other authors striving to word their message to be as widely understood as possible.

C1sc0cat · on Aug 30, 2019

Exactly I am doing this as part of some work I am doing with some content changes we are experimenting with on a major brands site.

mnky9800n · on Aug 30, 2019

You just ruined my childhood Thanks.

bryanrasmussen · on Aug 30, 2019

I went to high school in Utah, long before this automated scoring. It sounds awful but considering the quality of the education I received there perhaps not that bad after all.

My best Utah education anecdote - In the first day of British literature class the teacher came in and asked "Does anyone here know what A.D means?" someone said After Death - she said no. I figured this was my time to shine so I raised my eager hand and said "Anno Domini, in the year of the lord" - she said no.

Then she announced: "A.D means after the Deluge, and B.C means before Christ".

She also totally lied to me one time about whether she would be considering a particular textbook question as applying to Rosencrantz or Guildenstern.

Anyway I think that was one of the many classes I got an F in after stopped going and would walk past it every day on my way to play chess with my German teacher.

C1sc0cat · on Aug 30, 2019

How is this relevant in a Lit class - presumably you hid the fact you where a Catholic / Anglican from her.

bryanrasmussen · on Aug 31, 2019

Actually I am and was an atheist, but I had recently been on an "I'm going to read the encyclopedia!" mission.

bryanrasmussen · on Aug 30, 2019

I don't know the relevance, I guess she just felt enthused about sharing some of her learning with us impressionable minds.

clay_the_ripper · on Aug 30, 2019

Wow this is pretty shocking. I can understand using automated systems for something like math problems, it makes sense. There’s (usually) one right answer. But essays? This should be banned.

Baeocystin · on Aug 30, 2019

Wait 'til you see a kid in tears because the math answer they submitted was supposed to equal zero, but the algorithms behind the scenes are so bad that the float math failed the equality check.

Note: This is not hyperbole, I have seen this exact scenario more than once.

There may be a place for a well-designed one, but if it exists, I've never seen it.

dvdbloc · on Aug 30, 2019

I know this is just my experience but I can confirm the automated math scoring system I was using in a large US university in 2012 had bugs where many times I would enter a complex solution with fractions and it would tell me I was wrong and the correct response was some other form of the same equivalent fraction... Talk about frustrating after pouring over a question for 10 minutes.

black6 · on Aug 30, 2019

I remember to this day when I went round and round with a calculus teacher in high school who told me "sin(x) + 2" was incorrect. Then answer she wanted was "2 + sin(x)". I argued that addition was commutative, "2 + 3 = 3 + 2" and so forth. She wouldn't budge. She also said it was impossible to calculate the exact area under a curve, because you can't draw an infinite number of boxes under the curve.

Having an ignorant teacher is almost as bad as a flawed, black-box algorithm.

crankylinuxuser · on Aug 30, 2019

My gods, that is stupid.

Math is only not commutative in very rare areas. A specific area is concatenation of 2 strings, or 2 rotations of a rubiks cube.

> said it was impossible to calculate the exact area under a curve

And this is why a lot of people are against unions and similar protections for teachers. How do you get rid of someone who blatantly lies and informs falsely to students? How do you get rid of horrible teachers?

DoctorOetker · on Aug 30, 2019

what was the allowed form of the fraction? a ratio of literal integers, or also including irrationals? variables? did the scoring system need to know formulas? as in (2/3)v =?= (2/3) omega*r ?

if only the last one is not required of the scoring system, then it's been available for ages, and its just a really poor implementation. That's not programming but plumbing data-pipes.

Aromasin · on Aug 30, 2019

Having been forced to use an online math software for all my homework while at school, I vehemently disagree. It was so poor that it became a meme within my year group.

It would mark you as incorrect for using too many decimal places, even though it wouldn't tell you how many significant figures was required. I often remember it marking my answer as incorrect, even though it was identical to the answer they gave. Sometimes you'd have to show your working, but it couldn't handle brackets. Once I put the answer as "1+x=y" but they wanted the answer "y-1=x", and they marked it as incorrect.

I'm sure academic software design is leaps and bounds above what it was in the early 2000's, but to have a pupils futures hinge on what generally seems to be poorly tested code is dangerous.

jcranberry · on Aug 30, 2019

That's just poor software. As long as the software and teachers or professors allow for going over answers and checking for correctness it should be alright.

crankylinuxuser · on Aug 30, 2019

That sounds like "My Math Lab" by Pearson. It is horrible like everyone talking in here states.

It's also a bundle with the book sold by unis, because the code 'allows' you to submit homework required for the class. So they're doing both resale prevention, AND horrible grading.

People I know who have to interact with it call it "My Meth Lab" - because you have to be high to like it.

wickedsickeune · on Aug 30, 2019

I have often solved many hard math problems with very unconventional solutions (eg geometrical proof for algebraic problems). Trust me, a piece of software is decades away from being able to accurately determine the future of children and massively impact their self esteem / trust in society.

LorenPechtel · on Aug 30, 2019

Unconventional solutions can stump human teachers, also. One day on a chemistry test I was being stupid about how to define "stereoisomer". I knew what it was (two compounds that are mirror images of each other), I was just having trouble expressing it properly. Running out of time I put down "two molecules that are identical if and only if you permit rotation through the fourth dimension." This is extremely unconventional but it is correct--except not only did the teacher not understand it but I couldn't find any help in the mathematics department, either.

chessturk · on Aug 30, 2019

On one hand, I agree with you. I remember having to argue whether I showed my work or not by using imaginary numbers instead of standard formulas in high school physics.

But even with these examples, the path of appeal and rectification of mistakes is much easier with all humans involved. I fear soon people will side with the machine out of ignorance or to be justified in an incorrect stance.

The idea that we could be so poorly taught by broken automated systems, that we become incapable of detecting the system is broken seems like a possibility with AI that is much less likely in pure human systems of education (though not impossible).

crankylinuxuser · on Aug 30, 2019

"Fast, good, cheap"

The state chose fast and cheap. Well, its cheaper than more teachers.

UweSchmidt · on Aug 30, 2019

It is important for the teacher to see where part of the class took the wrong turn, where the students' understanding ended. It is important to distinguish between careless errors, wrong memorizing of a formula and lack of understanding.

yoz-y · on Aug 30, 2019

> Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not.

Frankly, this does not change anything from my experience in school decades ago. The teachers always said that the length does not matter and we should not pad the papers. However students who wrote more pages got better scores every single time.

bradstewart · on Aug 30, 2019

Is it possible that the students who wrote shorter papers were in fact presenting incomplete arguments and/or thoughts? Writing clearly and concisely is extremely difficult.

yoz-y · on Aug 31, 2019

It is possible but all of them? Also by “short” I usually mean four pages which was the usual recommendation for a paper length. To have a good grade one would need to write ten. The subject was usually something vapid so most of that must have been drivel.

MertsA · on Aug 30, 2019

Historically on the written portion of the SAT length is substantially correlated to the final score.

chessturk · on Aug 30, 2019

I'm not sure that's an issue by itself. If the prompt is broad enough, a minimum length can be reasonable for essays.

It certainly could be a problem if the prompt was too narrow, or time constraints, or some other factor.

Do you think the correlation by itself indicates something negative/inefficient I'm missing?

yoz-y · on Aug 31, 2019

I think that the required length should be bounded on both sizes and penalised if the paper is either too short or too long.

raxxorrax · on Aug 30, 2019

You have automated systems that rate essays without any human actually reading them?

Kids, forget everything you know because crime does indeed pay off. Best grades will be reserved for those that try to cheat this system however it is implemented. Botting your essays is the way to go in the 21st century.

dagw · on Aug 30, 2019

Given that the stack ranking at your future job will also be done by an "AI" (probably developed by the same company that graded you tests) this is a very useful skill to have.

bryanrasmussen · on Aug 30, 2019

Good point, I better teach my kid this skill so she can go on to a job programming one of these "AIs"

idiocracy will happen not because people get any stupider, but because the bots reward the stupid ones first.

xamuel · on Aug 30, 2019

Maybe Idiocracy was right about the whole "It's got electrolytes!" meme, except swap out electrolytes for neural networks.

bryanrasmussen · on Aug 31, 2019

Now I need a video of Luke Wilson saying "What are neural networks? Do you even know?"

_0ffh · on Aug 30, 2019

>Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score

Apart from the fact that your story is straight up frightening, isn't this part completely backwards, too? I mean, clearly using more words to convey the same message is /less/ efficient, not more so?

dahart · on Aug 30, 2019

Yes. Exactly. Before I figured out how to game the program, my son and wife were editing shorter. That’s what the instructions said to do. And, that’s also a major strategy for decent writing: brainstorm a lot, then edit down to the good parts. What this means is the software’s scoring is an anti-incentive to good writing. Used as a teaching aid, it’s actually doing pure damage, not good. Not only can it not score reliably, nor provide meaningful feedback, it’s actually actively teaching a very wrong way to write. But it is cheaper than humans, and it does give immediate feedback, so there’s that.

bigred100 · on Aug 30, 2019

This is a problem I have with a lot of human behavior. Instead of admitting you don’t have the resources to do something or aren’t willing to prioritize it, people come up with a bad version that’s not worth doing. Lots of things are worth doing poorly, but many of them I believe you just need to admit are not worth it unless a certain level of performance is met.

What’s even cheaper than AI? Tell the students to write some pages, have the teacher glance at the number of pages written, give full credit if the mark was met, and throw the papers out without reading them. It sounds like it would be similarly effective and less aggravating. Unfortunately, this would require humility on the part of the educators.

analyst74 · on Aug 30, 2019

Think from a positive angle, students today are learning useful life skills to game computer systems, which they will have to deal with when they grow up.

edit: ...just like how previous generations have to learn how to game social systems.

alexanderdmitri · on Aug 30, 2019

Except the algorithm being gamed can change suddenly, drastically and without the gamer's knowledge.

When such changes occur, the gamer will be docked until they can reverse-engineer the new algorithm. There's also the risk that all their previous inputs "gaming" the system might be reconsumed to terrible results as well, effectively rewriting their historical performance disasterously.

As always, those with the social standing and power to have insider knowledge or guidance will be in the best position to profit off such systems.

YeGoblynQueenne · on Aug 30, 2019

Ho ho. Wait. You mean you were able to submit multiple versions of the essay? So that anyone can basically game the test, by submitting multiple essays until they get the best score they can wring out of it?

That is just mad.

c3534l · on Aug 30, 2019

It's be easier and equally fair to just grade student's essays by rolling a pair of dice.

ejk314 · on Aug 30, 2019

It's arguably more fair. At least purely random scoring doesn't incentivize cheating.

nraynaud · on Aug 30, 2019

wait, you get the score in real time? Like some kind of objective function you can train a machine to maximise on?

christophilus · on Aug 30, 2019

Heh heh. I like the way you think. Hackers of Utah unite!

nyxtom · on Aug 30, 2019

This needs to just be outright banned

imtringued · on Aug 30, 2019

I can't even comprehend how someone can use automation for a task like this... It completely goes against human nature. In a world where all jobs have been automated teachers would be the last ones to go before humanity is completely obsolete.

daveFNbuck · on Aug 30, 2019

Do you just get to keep submitting the essay to see what score it will get before you turn it in? That sounds like a bigger problem than any of the particulars about how the grading is done.

dahart · on Aug 30, 2019

In this case, there was a limit to the number of times the essay could be submitted, and there was a required score that needed to be obtained within that limit, otherwise the grade would go down. The limit was something like 20 tries, and when I got there they’d already used maybe 14 of them.

I could perhaps see value in having unlimited tries, as a teaching aid, if the result wasn’t being used for grading. That would at least leave room for curiosity and exploration. And, more importantly, I could see value if the software wasn’t essentially a scam that fundamentally is not able do what is advertised. If the software really could grade essays reliably, and provide meaningful suggestions for improvement, then maybe it could be used to help educate students, in conjunction with the teacher’s guidance. But the software does not grade reliably, and it absolutely does not offer meaningful constructive feedback, and the teachers were using it to avoid reading essays, not to supplement their own expertise.

One of the several amusing ironies here is how the software company has convinced the state and teachers to willingly replace themselves with bots, despite obvious evidence that the humans can do the job better.

daveFNbuck · on Aug 30, 2019

I can see how giving students multiple tries would be a great teaching aid if we had human-level AI. With what we have now, I'd bet it's just training them to produce essays that hit the flaws in the AI as hard as possible to produce unreadable garbage with high scores. 20 tries per essay adds up to a lot across 12 years of schooling.

Teachers are extremely overworked, underpaid, and underappreciated. I'm not surprised that it was easy to convince them to offload the difficult and time-consuming work of manually grading essays. This also means they don't have to deal with complaints about unfair grading. A machine did it and it's out of their hands.

pault · on Aug 30, 2019

The instant feedback mechanism is just begging for someone to turn it into a GAN by writing the other half. I would absolutely love to hear that some particularly clever high school student was able to train an ML algorithm to consistently fool the grading algorithm, thus instantly rendering all of their efforts worthless and dragging the administrators through the mud at the same time.

seanmcdirmid · on Aug 30, 2019

That really sounds like Utah, they have lots of students (due to LDS influences) with a conservative government (ditto), so the pupil/teacher ratio is insane. I can guess the teacher really doesn’t have any other choice.

VikingCoder · on Aug 29, 2019

My mother worked grading standardized tests. It was a hellish job for many reasons (limited breaks, etc.)

One question she had to grade was essentially, "What's something you want your teacher to know about you?"

It was an essay answer, and she was supposed to grade it on grammar, etc. Just the mechanical aspects of writing. (The real question explained the details more, but that was the core of the question.)

She saw answers that would make you weep.

"My daddy touches me."

"I haven't eaten today. I don't know when I'm going to eat again."

Stuff like that.

And my mother was going to be the only human who ever saw their responses. Their teacher had no chance to see their responses, just my mom.

So she goes to her supervisor and asks, "What can we do to help these kids?"

The supervisor said there was nothing you can do. Just grade the answers.

dredmorbius · on Aug 30, 2019

The US has federal child abuse mandatory reporting requirement laws which include teachers and school staff and personnel, as well as additional state requirements which vary but include, for 11 states, faculty, staff, and volunteers at public or private higher education institutions. Computer and IT professionals are also covered in cases.

Faculty, administrators, athletics staff, or other employees and volunteers at institutions of higher learning, including public and private colleges and universities and vocational and technical schools (11 States).

https://www.childwelfare.gov/topics/systemwide/laws-policies...

https://www.childwelfare.gov/pubPDFs/manda.pdf

This includes penalties for failure to report in multiple states:

https://www.childwelfare.gov/topics/systemwide/laws-policies...

DoctorOetker · on Aug 30, 2019

so ... nobody wonders about the obvious rammification: then any ML scoring systems ... must detect child abuse signals!

mlyle · on Aug 30, 2019

Hey uh, that actually seems valuable.

I'd believe that ML could spot abuse that humans miss pretty well from signals like non-overt references in homework and school records, if one could come up with an adequate training set.

Much more likely than teaching ML to score reasoned and creative activity in any reasonable way.

kwhitefoot · on Aug 30, 2019

What do you think the false positive rate is likely to be?

mlyle · on Aug 30, 2019

Depends upon what false negative rate you're willing to tolerate. ;) And I don't know how good of a signal there is. This is pure handwaving.

But this type of thing seems like the exact kind of spooky correlation that ML is good at spotting compared to humans.

YeGoblynQueenne · on Aug 30, 2019

Machine learning techniques are going to be absolutely awful in detecting something like this, the reason being it's exceedingly rare (at least I'm guessing it is; if we're talking about child sexual abuse by one's own parents, it sure sounds extremely unlikely- but even child abuse in general is probably rare [1]). Machine learning systems are awful at identifying rare events. Like the OP seems to suggest, the false positive rate would most likely be very high.

"Spooky" machine learning results happen when a correlation is abundant in a dataset [2]. Otherwise, machine learning techniques will probably miss it altogether.

______________

[1] Quick online search: https://www.inquirer.com/philly/blogs/healthy_kids/What-is-t...

[2] The archetypal spooky machine learning story is surely the one about Target sending baby item coupos to a girl in high school before her father knowing she was pregnant:

https://www.forbes.com/sites/kashmirhill/2012/02/16/how-targ...

dragonwriter · on Aug 30, 2019

> if we're talking about child sexual abuse by one's own parents, it sure sounds extremely unlikely-

Child sexual abuse isn't extremely rare and familial abuse is a very large minority of child sex abuse.

mlyle · on Aug 30, 2019

Humans are awful at rare events and vigilance tasks, too. That's part of why we're seeing machine vision and machine learning starting to outperform humans in e.g. grading radiology screening scans.

The total incidence of child abuse of all types from infancy to adulthood is on the order of 1 in 3. This is not terrifically rare-- it's of higher prevalence than pregnancy and of positive screening events.

A much bigger concern is non-causative correlations. It'd be pretty easy to train ML to be racist or look for e.g. indicators of class, which are correlates of abuse.

As to false positive rates-- you can pick your false positive rate to be whatever you want it to be, by twiddling the threshold for a positive result. I'm not sure false positives are of that great of a concern, if the output from a system is a notification to school administrators that they may want to keep an eye out for this student.

dragonwriter · on Aug 30, 2019

> But this type of thing seems like the exact kind of spooky correlation that ML is good at spotting compared to humans.

How? Particularly, where do you get training data at the required scale?

mlyle · on Aug 30, 2019

You take samples of hundreds or thousands of past students' schoolwork, e.g. submissions of essays for standardized tests.

You survey those kids in adulthood about whether and how they were subject to abuse and other types of relevant adversity.

You attempt to control the data so that you don't just latch onto other correlates of abuse (e.g. social class).

mschuster91 · on Aug 30, 2019

Who cares? The positives must be evaluated by a human anyway.

tonyedgecombe · on Aug 30, 2019

Who cares?

The people whose lives are ruined by being mis-identified by the system.

The positives must be evaluated by a human anyway.

Those same people whose lack of competence people are bemoaning throughout these comments.

mschuster91 · on Aug 30, 2019

> The people whose lives are ruined by being mis-identified by the system.

When a child writes "daddy touches me between the legs" in an essay, it doesn't matter if a human spots it or an AI that forwards it to a human, this needs to be investigated either way.

> Those same people whose lack of competence people are bemoaning throughout these comments.

It's not a lack of competence that's bemoaned, it's a massive amount of understaffing (and resulting overwork) in teachers and other school resources, as well as a drastic lack of financing because it's easy to cut budgets for schools for politicians as the effects only show up two decades afterwards.

tonyedgecombe · on Aug 30, 2019

Sprinkling some AI over it won't fix those issues, I'd argue it will make them worse as people blindly accept the results.

There were some cases in the UK about a decade ago where bugs in software the Royal Mail was using led to incorrect accusations of fraud. People actually went to jail over this, it took years to resolve.

mlyle · on Aug 30, 2019

> When a child writes "daddy touches me between the legs" in an essay, it doesn't matter if a human spots it or an AI that forwards it to a human, this needs to be investigated either way.

When a child writes a set of things that individually are not very concerning, they may have cues that could say "hey, this kid, you should maybe keep an eye out for evidence of abuse."

Particularly attuned, experienced individuals might spot these cumulative cues, but we all know that this is not all people dealing with children.

It's an interesting problem.

rocqua · on Aug 30, 2019

Humans will still let false positives through. And false accusations of child Abuse have significant ramifications.

chessturk · on Aug 30, 2019

This is so dangerous.

Society's bigotry is going to flood that bad boy so quick you might as well name it Gobbels.

I love ML. I want children to be safe. This is not the place for ML or AI or Quantum or any tech.

What needs to exist is better resources for those children, that mother grading the tests, the teachers of those children, and social services that are meant to support them. If you want to make a difference about this, look there.

Don't go building a automaton King Solomon who decides why this kid should be taken from these parents because speaking Spanish was worth -0.1 on some goddamn weight trained on data generated from a racist society.

This isn't a "spooky" correlation a cool algorithm can detect, it's a serious, layered social problem.

mlyle · on Aug 30, 2019

> Don't go building a automaton King Solomon who decides why this kid should be taken from these parents because speaking Spanish was worth -0.1 on some goddamn weight trained on data generated from a racist society.

Totally what I advocated for and not a strawman attack /s. Indeed, the chance that such an algorithm could be racist or classist and there being needs to avoid bad correlations and have appropriate controls is important.

I think there are opportunities here. Ideally ed-tech doesn't take humans out of the loop, but asks schoolteachers and administrators questions like, "Hey, are you sure students A, B, and C are being supported correctly for subject Z? Are you sure students D, E doesn't have some kind of abuse or other significant home problem? It sure looks like student F is in this subpopulation that research shows benefits from educational intervention Y. You might want to keep your eye out for that."

And then the teacher goes "Oh, crap. Now that I think about D, there were always these little things 1, 2, and 3 that seemed off... maybe this is worth a referral to social services to check on what's up."

Or "Oh, ... maybe F's struggles in reading really are a speech problem and we should handle that"

belorn · on Aug 30, 2019

That is not how the law work. The law states that if people at a school is made aware or suspect abuse then they must act on that knowledge. A ML scoring system is obviously unable to be made aware or having suspicions, but the administrators could be help responsible if they happen to see something and chooses to not act.

It would be interesting to know if a child psychiatrist could be held liable if incompetence prevented them from seeing obvious sign of abuse, but I doubt that is covered under the cited law above.

harry8 · on Aug 30, 2019

Some of these will be 100% true as well. But don't make the mistake that there are no kids who go for shock value or are wantonly manipulative when they know it can't come back to them.

So how many are true and how many false? I have no clue. Literally none. And no it doesn't make me feel any better about the screams of existential agony even if that were a low percentage. Could be high too.

dmoy · on Aug 30, 2019

For the not eating, it's pretty easy to get data. It's like 1 in 5 children live in food-insecure households in the US and maybe 1 in 20 of those very insecure, so not eating before school provided lunch is common enough that if you're grading tons of papers you'll run into kids like that.

jcranberry · on Aug 30, 2019

https://www.ers.usda.gov/topics/food-nutrition-assistance/fo...

Food Security Status of U.S. Households with Children in 2017 Among U.S. households with children under age 18:

84.3 percent were food secure in 2017. In 8.0 percent of households with children, only adults were food insecure. Both children and adults were food insecure in 7.7 percent of households with children (2.9 million households). Although children are usually protected from substantial reductions in food intake even in households with very low food security, nevertheless, in about 0.7 percent of households with children (250,000 households), one or more child also experienced reduced food intake and disrupted eating patterns at some time during the year.

stochastic_monk · on Aug 30, 2019

It could also be a student suffering from anorexia nervosa, which the confessional aspects of the essay would fit well with.

JustSomeNobody · on Aug 30, 2019

I'm confident that your example would of a less percentage than those mentioned in dmoy's comment.

manfredo · on Aug 30, 2019

When I was a high school student, we had some state administered test in health class that tasked us with analyzing advertisements for liquor and tobacco and seeing if we could recognize harmful behavior that the ads might be promoting. This test had no impact on our class grade...

..which means students wrote whatever the hell we wanted. I was assigned a Captain Morgan (rum) ad. I wrote that the ad was glorifying maritime piracy and was likely responsible for pirate activity in Somalia.

inimino · on Aug 30, 2019

Of course some kids are manipulative, going for shock value, continuing an "in-joke", or just plain trolling. But would a teacher just look the other way, or would they talk to the kid? What would you want for your kids? This is why teachers assigning homework like "what do you want your teacher to know about you" and then not even seeing it is dehumanizing.

mxcrossb · on Aug 30, 2019

I don’t know about calling it manipulative. I remember taking the ACT, and struggling to plan out one of my essays. It was something like “tell us about a book that inspired you”. So I changed details about the plot so it all fit nicely and was easy to write. I can see something similar here, where someone takes on a persona when writing in order to effectively communicate.

manfredo · on Aug 30, 2019

This is absolutely the case. In fact, my SAT prep class taught us that the factual veracity of our essays is irrelevant. Essay scores are almost entirely correlated with essay length as long as spelling, grammar, and basic paragraph structure (intro, body paragraphs, conclusion) is followed.

harry8 · on Aug 30, 2019

That's entirely fair. Manipulation is kind of what a writer does yet the word manipulative has perjoritive connotations. Many types of writing don't have literal truth as any kind of pre-requisite. Others make a pretence of literal truth to achieve greater effect then basically lie, many autobiographies fall into this trap to some degree. All these things. Differential empathy. Data quality matters.

7952 · on Aug 30, 2019

False accusations can actually be the result of prior abuse. They may substitute one person for another. Or do things as a result of mental illness caused by abuse. Kids think differently to adults and may behave inexplicably. And unfortunately that means that an abused child is a terrible witness.

aspaceman · on Aug 30, 2019

I was abused as a child (not sexually however) and I can attest to this. Many of my memories are highly charged and don't really hold up - they're very confused. Some of the scariest stuff that happened to me I don't even remember, and my siblings have had to let me in on it (and they were even younger at the time).

As a child you're really not prepared for the concept that your parents are treating you badly. So that realization doesn't come until much later.

fucking_tragedy · on Aug 30, 2019

> But don't make the mistake that there are no kids who go for shock value or are wantonly manipulative when they know it can't come back to them.

In the US, school funding is based upon standardized test results, and bad results can shut a poorly performing school down.

It's drilled into every kid's head that these tests are very important, super strict and if they accidentally mess up, it can ruin their academics, because retesting and regrading are expensive.

jacobwilliamroy · on Aug 30, 2019

As a kid I would go out of my way to fail those tests. The whole curriculum was designed around them, meaning that even if we did score high, any funding gains would just be put towards training us to take the test.

I thought the state was holding the school hostage, threatening to cut funds or shut them down if they ever stopped. We never learned anything about civics or American history. Until I was out of highschool, content regarding atrocities like slavery and the trail of tears was not on the test and that was enough to whitewash the whole curriculum.

Standardized testing is to the U.S. what lead waterpipes were to the roman empire.

lonelappde · on Aug 30, 2019

Your school didn't teach slavery? What state?

stochastic_monk · on Aug 30, 2019

It’s possible that there’s a difference between admitting it happened and an honest portrayal of the institution.

empath75 · on Aug 30, 2019

‘Poor sentence structure and grammar, 1 point out of five. Sorry your daddy touches you.’

shaggyfrog · on Aug 30, 2019

Punch up, not down.

empath75 · on Aug 30, 2019

You really misread that if you thought I was punching down. I was pointing out the absurdity of having to even grade such a thing.

wesammikhail · on Aug 30, 2019

boy do I hate that saying. It literally adds nothing to the conversation!

dredmorbius · on Aug 30, 2019

It can be a useful heuristic.

I don't find it adds much here.

bendbro · on Aug 30, 2019

[flagged]

aasasd · on Aug 30, 2019

I embraced the awkwardness, and now I know for sure that I'm doing ‘a thing,’ instead of worrying about whether I do or not.

DATACOMMANDER · on Aug 30, 2019

That’s what you’re doing wrong.

bendbro · on Aug 30, 2019

Thus aren't you as well?

DATACOMMANDER · on Aug 30, 2019

I meant that your self-consciousness and constant worry is probably harming your social interactions more than anything. Or not. It was for me in the past.

bendbro · on Aug 30, 2019

Aw, I was hoping we were going to dig deep into a "no u" situation.

rhexs · on Aug 30, 2019

Or...report it to the police? I’d gladly risk my job to do the right thing in that instance.

VikingCoder · on Aug 30, 2019

What my mom saw had an ID number on it. No other demographics. And she was grading from multiple states.

So do what? Contact her local police?

With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?

And people don't work at a job like that because they want to. They work there because they need the money.

Everything she took in and out of there was monitored, too. So it's not like she can go to the Xerox, and walk out of there with a copy.

It's beyond dehumanizing. For everyone. The kid, the people who work there.

mrep · on Aug 30, 2019

> So do what? Contact her local police?

> With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?

Absolutely! My girlfriend works as a counselor at a school and she is required by law to report all serious abuses by parents.

SoylentYellow · on Aug 30, 2019

> So do what? Contact her local police?

> With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?

Yes, she should have absolutely went to the local police. A child's first hand account in writing of child abuse and neglect is slam dunk evidence to secure a warrant to link the essay ID to the individual child.

> Everything she took in and out of there was monitored, too. So it's not like she can go to the Xerox, and walk out of there with a copy.

Doesn't matter. She could have went to the police herself as a witness. That alone would be enough for a probable cause warrant to retrieve the essays.

It is very sad she saw these signs of abuse and did not report it.

rhexs · on Aug 30, 2019

Yes, contact the local police so an investigation can get started. Yes, that’s enough to at least report it.

inimino · on Aug 30, 2019

> So do what? Contact her local police?

Collect or photograph all the evidence, record every conversation with supervisors, escalate as much as possible internally, then contact local police, and at the same time go to the media. Don't quit, but if necessary let them fire you and then sue. None of this is easy.

hanniabu · on Aug 30, 2019

When escalating I'm sure it'll be effective to say it'll be an interesting story for the news and how the incident is being blocked by supervisors that encourage child abuse.

xkcd-sucks · on Aug 30, 2019

If she did Xerox the disclosures, walked out, and said "please call the cops" when challenged, at least it would be a matter of public record

sjg007 · on Aug 30, 2019

They should be mandatory reporters at least in the USA.

CobrastanJorji · on Aug 30, 2019

School contractors are mandatory reporters, but I suspect that may not qualify.

dredmorbius · on Aug 30, 2019

Depending on the state, yes, they are, for 11 states.

https://www.childwelfare.gov/topics/systemwide/laws-policies...

garmaine · on Aug 30, 2019

These would not be school contractors.

ineedasername · on Aug 30, 2019

Why not? The school contracts with the testing agency that does the grading. Seems like a contractor relationship?

garmaine · on Aug 30, 2019

School contractor is someone who is hired by the school district on via a contractual relationship. Think temporary teachers, or custodian staff. It’s not a transitive relationship to every employee of every company who has some sort of contract, however small, with a school.

ineedasername · on Aug 31, 2019

So you're arguing that only individuals can be a contractor? That wouldn't make much sense, not only because such relationships are rare in schools. Most common are contractors that have been outsourced something like food service. The law would make no sense if it included practically no one. It would mean a company that provides, say, temp staffing within the school, and those temp staffers saw abuse, they too wouldn't be required to report. I have a hard time believing a court would rule the definition to be so narrow. Both the common language understanding of the term and legal literalism would point against that. There's no transitive property here. We're not talking about contractors hired by contractors hired by contractors. We're talking about a contractor and its employees. There is no way for it to exercise this reporting requirement save through it's individual employees.

DoctorOetker · on Aug 30, 2019

every relationship however transitive or small, is a relationship too! (Dr. Seuss)

aidenn0 · on Aug 30, 2019

Hand-graded standardized tests are usually anonymized.

fucking_tragedy · on Aug 30, 2019

It takes a five minute phone call with the company's legal department or a warrant to find out who the kid was. Either way it would need to be escalated to involving law enforcement.

seanmcdirmid · on Aug 30, 2019

Someone could track back the number to the testee, they need to get their grade at least.

killjoywashere · on Aug 30, 2019

Tell you mom to take those numbers and the company's name to the police. They can walk back the identification problem.

drngdds · on Aug 29, 2019

This is my first time learning that AI-graded essays are a thing. Am I the only one who thinks that's insane? I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.

_delirium · on Aug 29, 2019

I work in AI, and was very surprised when I heard about this (a few years ago). I don't think anyone who works in the area thinks the tech is ready for this kind of deployment. There is research on the subject [1], and NLP systems can do better than baseline methods, but the error rates are still pretty high.

A thing you quickly find if you try to download off-the-shelf NLP tools and apply them to anything is how little is reliable at all, unless you can constrain the domain. Even basic topic identification only works with low error rates when constrained to something like NYT stories, or PubMed abstracts, not arbitrary text by arbitrary writers. And I would bet ETS is using worse tech than research state-of-the-art.

[1] e.g. https://www.aclweb.org/anthology/P15-1053

harry8 · on Aug 30, 2019

You've noticed though that the AI con is on. This damages your work as people get burned and will bring about the second "AI winter"

People making big decisions with a lot of money around computing know nothing about it and are marks for con-artists. Think big consulting firms selling to senior public servants in washington. "For a successful technology reality must take precedence of public relations." But reality just gets in the way when conning a mark for a successful snake oil sale, right?

Call it out, publically, cite your credentials. Encourage colleagues, your competition and everyone with a clue to pour scorn on whoever is selling this evil, toxic waste as drinkable.

setr · on Aug 30, 2019

Second? We’re heading to #3 — fully cyclical

mlthoughts2018 · on Aug 30, 2019

Hmmm. I also work in AI, in fact professionally in information retrieval and NLP. I disagree strongly with what you say. Basic topic summarization and keyword / named entity extraction on unstructured sources of text works reasonably well. It’s easy to modify BERT and GPT on smaller problems, language classification is borderline totally solved by extremely easy to train neural network models.

I still agree that automatic essay grading is beyond the reach of SOTA NLP models today, but youmake it sound like virtually nothing can be done in a production-grade manner that solves real world unconstrained NLP problems. This is manifestly false.

_delirium · on Aug 30, 2019

It's completely possible I'm not fully up on recent progress, especially since a bunch of stuff seems to have moved in the past 6 months. But I haven't seen any general models that can solve open-domain problems, without specifically retraining on each domain. Do you have any pointers? E.g. a single pretrained BERT model that can reliably extract topics from: tweets, paragraphs from 19th-century novels, mathematics journal articles, and Wikipedia articles? All the systems with very low error rates that I know of target one specific domain. The last time I looked into sentiment analysis (a year or so ago), it wasn't even that great on many individual domains, e.g. it would get tripped up by sentences from novels that used "negative" keywords in a humorous or ironic way.

mlthoughts2018 · on Aug 30, 2019

In production problems that I work on, we don’t even really use things from within the past year. These problems are just incredibly well-solved with fairly vanilla LSTM networks from 2-3 years ago. Enough so that while it’s probably premature for fully automated essay grading, it’s not _crazy_ to make a product from models trained to solve this problem.

jkdufair · on Aug 30, 2019

I have a grant where were are doing just that. Implementing more or less SOTA research using fairly vanilla LSTM networks from 2-3 years ago (primarily Taghipour & Ng) to provide low stakes feedback to students on their essays in one of our teaching tools at Purdue. It’s based on research using the Kaggle ASAP database and we have found it to be pretty accurate across a variety of domains in early testing. Though some essay prompts seem to do better with CNNs vs. RNNs. I doubt many of the systems in TFA are based on LSTMs or neural nets at all. They are probably doing regression on hand-crafted features.

ssivark · on Aug 30, 2019

Very interesting. Are there any meta-analyses / reviews that summarize progress in this area? Would it be possible to share your grant proposal -- I'd be curious to get an idea of what is being attempted.

jkdufair · on Aug 30, 2019

It's an internal grant and I'm not sure I'd be allowed to share it. We are adding AES to our peer-review app. Currently as an additional "grader" to the peer reviews since that's what the PI requested. Since the tool allows unlimited submissions until the review date, I hope to add it as a "pre-flight" estimate to give students a chance to get a rough prediction of the score they will receive and a metric they can use as they revise until the due date.

I'm not aware of any meta-analyses myself. I have been keeping up with the ASAP competition and various attempts to improve on the initial systems for a number of years. The two papers I believe are having the most success are [1] and [2]. [3] seems promising for balancing the opposing forces of high accuracy for true positives and the risk of false positives via adversarially crafted inputs.

I'm also vaguely aware of research happening around extracting features from neural nets. I'd love to be able to help students understand why the system is predicting a particular score.

[1] https://www.aclweb.org/anthology/D16-1193 [2] https://arxiv.org/pdf/1606.04289.pdf [3] https://arxiv.org/pdf/1804.06898.pdf

chrisdsaldivar · on Aug 30, 2019

We had this in my school for 8th and 9th grade so 2008-2010. We had to type the essays in class and submit by the end of the hour. I would only get maybe 3 paragraphs in before time was up because I was trying to build a strong argument for the prompts. Despite that I would usually get 3-4/6 and my teacher said she would read the essays and regrade but she never actually did. My friend literally copy and pasted the pledge of allegiance 20-30 times and scored a perfect 6/6. Later we found out if you repeated the words in the writing prompt you would get a guaranteed 5/6 and with a high enough word count you’d get 6/6. The essays were all bullshit and just a way for the teachers to get an extra free period once a week.

xkcd-sucks · on Aug 30, 2019

I totally agree that "AI" grading is totally bullshit. But, I also have plenty of experience teaching/TAing large courses, and after reading too many essays they all become semanticically saturated meaninglessness. One can not help but skim them, and grade according to a few quick heuristics. At that point one tries to be self-consistent and defensible in one's grading, but careful consideration is right out. I suspect state graders are dealing with way more than 100 essays per person and are probably on a tight schedule too. It's quite possible that a ML model is better than an exhausted human grader, as their cognitive strategies are mostly identical.

kwhitefoot · on Aug 30, 2019

The solution isn't to do a better job at grading 'meaninglessness' but to stop requiring the production of it in the first place.

One major problem with algorithmic approaches, whether automated or not, is that they become the definition of good in the context and therefore become something that cannot be argued against. And of course it makes 'teaching to the test' an even more likely outcome.

If I were a conspiracy theorist I'd attribute this to wanting a dumbed down population. Unfortunately I think it is probably the other way round, the population is already dumbed down and a belief in AI unicorns is the result.

As Aristotle said to Alexander: 'There is no royal road to geometry', and so it is with education; it's hard work for both the student and the educator and no amount of AI/ML/algorithmic snake oil will change that without also changing the meaning of the word education.

nefitty · on Aug 30, 2019

I remember when I was in middle school 16 years ago, my English classes would have us submit some of our work to a web app. It would then grade the submission. I remember this distinctly because I asked my teacher to intervene on at least two occasions. The app failed to recognize the words "squirrelly" (as in "That guy in the corner has been acting squirrelly.") and "defragment". My teacher decided to subvert the app's recommended grades because she, as a human, understood the intent of my use of those weird words.

To emphasize, this was 16 years ago.

dcolkitt · on Aug 30, 2019

> I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.

So the reason this isn't the case, is because there are very simple metrics that tend to highly correlate with essay quality. It doesn't mean the grading-bot is actually evaluating essay quality. It's just looking for properties that are statistically associated with good essays. Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.

A very straightforward example is spelling mistakes. People who make spelling mistakes aren't necessarily bad writers. And vice versa, there may be great speller who can't write for shit. But by and large the people who spell poorly also tend to write poorly. Easily detectable grammatical issues, like misplaced modifiers, subject verb disagreement, or inconsistent tense, are also correlated indicators.

A very simple metric is essay length. Especially if its a timed exam. Good writers tend to have verbal fluidity, with words easily flowing to paper. They don't struggle converting thoughts too sentences. So they tend to end up with the most words written down within a fixed time period. By and large the longer a timed essay is, the more likely that its actual quality is high.

Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing. But at the end of the day, their student rankings are usually pretty close to that of a typical human grader. In some cases the bot will have a closer ranking to a random human grader, than two random human graders will have to each other.

The biggest flaw here is Goodwin's law. When the test takers become aware of the kludges that the bots use, they can exploit it. For example just dump a bunch of verbal diarrhea with as many correctly spelled words as possible. But even then it doesn't really hurt the bot's ranking accuracy too much. Because the kids who do the most test-prep and learn all the tips and tricks, are usually high-achievers who do well on essays anyway.

bo1024 · on Aug 30, 2019

Strongly (but respectfully) disagree with a lot of this!

This is related to current fairness-in-AI discussions. In many cases the basic problem is ML systems leverage correlations for making causal decisions. Here, there is a huge ethical difference between scoring a person based on "is this a good essay" and "do the features of this essay correlate with features of good essays". Just like there is a huge fairness and discrimination difference between "is this person qualified for a loan" and "do the features of this person correlate with features of people who qualify for loans" (algorithmic redlining). Your last sentence has a big discrimination/fairness issue also, since you are testing even more for parental income and parental free time.

nerdponx · on Aug 30, 2019

Which means that machine learning models need to not only be good predictors, but also be causal models to some extent.

6gvONxR4sf7o · on Aug 30, 2019

I can't disagree strongly enough.

>Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.

This isn't true at all. Imagine you got a B or C on an essay that a human would have given an A to because you wrote it concisely and in plain language, or because you used language that's statistically correlated with being black. Does the fact that this is rare console you? "Sorry, but it's usually very close to the human grader's ranking." Close enough isn't good enough when you get the short end of the stick. "Sorry, you aren't going to get to go to the college you wanted because you use language statistically correlated with poor writing." Or just because you're different, so the statistical correlation doesn't apply to you, you filthy outlier. Just because it's a rare event doesn't make it okay.

In adulthood, this is like hiring or firing for work statistically correlated with good work. Remember when amazon rolled out the resume scorer? [0] Sure it was biased towards women, but it was close enough to human scores, so who cares about the internal logic?

>Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing.

At the end of the day, our goal here is to measure good writing. If the bots aren't measuring anything intrinsic to good writing, we shouldn't use them.

https://www.reuters.com/article/us-amazon-com-jobs-automatio...

LorenPechtel · on Aug 30, 2019

The problem with the bots is that while they average agreement with the humans they can produce very different results. Fine if you're seeing how a school is doing, horrible if you're testing how a student is doing.

davidhowlett · on Aug 30, 2019

Minor correction; The automated resume reviewer was biased against women according to your reference.

ptx · on Aug 30, 2019

I think you meant Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

JoeAltmaier · on Aug 30, 2019

Is that what it's called? Its referred in neural net training as overtraining.

dcolkitt · on Aug 30, 2019

Haha, whoops. That mistake kind of changes the meaning...

Thanks for catching the mistake.

mannykannot · on Aug 30, 2019

Your last paragraph, and particularly the last sentence, epitomizes what is wrong with your whole thesis: the ultimate goal of the testing (and education itself, for that matter) is not to find people who can "do well on essays"; it is to develop analytical thinking.

JoeAltmaier · on Aug 30, 2019

Clearly, that referred to 'doing well on essays' correlating to good analytical thinkers. Which is does.

mannykannot · on Aug 30, 2019

That assumption lacks justification when the scoring does not actually measure analytical thinking. Any statistical evidence for it is suspect as a predictor of future outcomes when a high score can more easily be gamed than 'honestly' achieved.

JoeAltmaier · on Aug 30, 2019

Scoring is not the point here; the analytical thinker is gaming the test to pump the score, thus proving they are an analytical thinker. Not a statistical argument; a suggestion that the screen works, when it is abused. Because it is abused.

Anyway, yeah, not really a correlation.

rvense · on Aug 30, 2019

It is absolutely insane. By no definition does the system understand what is written.

You could ask a student to write an essay taking a firm opinion on some subject, and they could change standpoint every paragraph and there's no way these systems would know.

If I was a student I would be extremely offended at people wasting my time like this.

chucksmash · on Aug 30, 2019

I'm surprised people are surprised by it. I guess it just hasn't gotten talked about it a lot? When I took the GRE in 2011 the rule was that my essay would be graded by one human and one automated grader, and a second human would become involved if the computer and the human differed by one point or more iirc.

Maybe nobody really makes a big deal about it because it is pretty much irrelevant anywah. Applicants provide a letter of intent that the grad dept people can, y'know, actually read for themselves, so I think unless you totally bombed the writing section nobody cared.

Spivak · on Aug 30, 2019

In a forum of CS people I'm surprised this is one of the top opinions. Our field is full of super surprising results like this -- that you don't have to actually understand the text at beyond basic grammar structures to reasonably accurately predict the score a human would give it.

Like this kind of thing should be cool, not insane. I mean wasn't it cool in your AI class when you learned that DFS could play Mario if you structured the search space right?

cmroanirgo · on Aug 30, 2019

I came first in English for my school, many moons ago. Leading up to the finals, I regularly finished ahead of the hard core the English essay people, generally to my amusement. My exam essay responses were generally half the length (sometimes even shorter) than the prodigious writers. Although I've an ok vocabulary, I always made sure I made the right choice of word to hit a specific meaning, rather than choosing words with a high syllable count.

I'd find it highly interesting to see what kind of result I'd get using an automated system.

Why?

Because, I once asked a teacher (also an examiner) why I got good grades above the others, and the answer surprised me: my answers were generally unique /refreshingly different, to the point/ not too long and easy to read.

I suspect with this new system, I'd be an average student. It'd also be interesting to find out, several years down the road, if the automated system could be gamed at all -- I suspect it could, and teachers would help students 'maximise' their scores as a result of that.

rocqua · on Aug 30, 2019

It seems plausible that, under this system, you would eventually have learned to write longer essays. To my mind, that would be a school teaching you to be worse.

In fact, throughout the article I kept being surprised by the idea that long is good. When writing, I tend to prefer being brief.

bumby · on Aug 30, 2019

Your post resonated with my first thought on reading the article. I wonder if it would penalize writers with simple declarative sentences.

You know, those average writers like Hemingway /s

mherdeg · on Aug 30, 2019

When I hear a result like "software which understands basic grammar structures can predict what grade a human would give an essay" I think my views are roughly:

* 5% - cool, we could make a company that grades essays

* 15% - cool, we could make a company that grades essays and sell our source code to the test-prep industry

* 80% - fascinating, it sounds like the exam designers need to reevaluate what they are trying to measure with essay questions

thekyle · on Aug 30, 2019

Whatever we decide to measure, it needs to scale to millions of essay responses each year in a way such that scores are consistent across entire states or countries. With that in mind I'd imagine it's difficult to do much more than grade on grammar and basic semantics.

kwhitefoot · on Aug 30, 2019

No it doesn't.

And if you succeed you will simply be measuring an uninteresting but manageable subset of the problem which will then become in some people's eyes the definition of the problem.

Education is supposed to be about teaching people to think, to give them the tools with which to do it, to be able to evaluate, criticise, invent, etc.

danenania · on Aug 30, 2019

"...that you don't have to actually understand the text at beyond basic grammar structures to reasonably accurately predict the score a human would give it"

That only really shows that the humans they're training on are terrible at grading essays.

carapace · on Aug 30, 2019

Thank you.

GIGO is our God now.

https://en.wikipedia.org/wiki/Garbage_in%2C_garbage_out

munchbunny · on Aug 30, 2019

This problem is a first class demonstration of the difference between "can we?" and "should we?"

The fact that it's being implemented in society is insane because anyone who is paying attention to the state of AI today already knows how it will go wrong: without reading the article I already guessed that it systematically discriminated against certain demographics. Which was in fact what the article claimed.

It's interesting that it's possible to predict what the scorer would decide, but the moment you actually implement it is when all of the known problems become relevant, and the intellectual wonder must take a backseat to the human problems.

jammygit · on Aug 30, 2019

Teaching human-human communication by removing human inputs and having computers decide about quality... call me a skeptic. I feel bad for the students. Essay grading was bad enough before this

Narrowly for grammar however - is even that a good thing? It probably helps scale grammar help to more students, but if those tools became ubiquitous in grading and editing then unique voices would just disappear and a lot of potentially “great writers” might choose different careers because the machines don’t like them

shkkmo · on Aug 30, 2019

Adding further bias against the underprivileged is not "cool". Implenting this while avoiding publicity or providing a means to publically audit the results is doubly not cool.

It is fine to play with "cool" techniques when you are doing consequence free stuff like playing Mario. When you are creating systems that have significant and long term effects of people's lives a different standard applies.

ptah · on Aug 30, 2019

research needs to be done on bias correction. it will then be better than human where you cannot correct the bias

peteradio · on Aug 30, 2019

Based on the title alone, how would you feel if you were given bad marks due to flawed black box?

tiborsaas · on Aug 30, 2019

Like how I felt when I was given low grades for my ugly handwriting. It was stupid to grade it, but it guaranteed that I will never get a top score on any literature class.

C1sc0cat · on Aug 30, 2019

Yeh I (as a dyslexic) got the same at school having my handwriting mocked by the teacher.

It took me much longer to pass my English language O level (exam taken at 16).

lonelappde · on Aug 30, 2019

The same as when I get bad marks due to a flawed human and rubric.

strken · on Aug 30, 2019

This is sort of like discovering the Excel spreadsheet at the heart of a system responsible for handling hundreds of millions of dollars of transactions for your bank.

Yeah, it's cool, but what about your savings account?

RcouF1uZ4gsC · on Aug 29, 2019

Unlike a multiple choice test where the primary audience is automated graders, the primary audience for an essay is other humans. If even Google and Facebook with their billions of dollars and billions of posts worth of data, still cannot always understand the intent and purpose of written content, what hope do these algorithms have?

If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests. Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.

anigbrowl · on Aug 30, 2019

If it is cost-prohibitive then then maybe we should adjust the economic model, not abandon the measurement.

rocqua · on Aug 30, 2019

Sure, have less essay test questions, and start grading them for content not form.

If you want to grade on form to test the ability to write correct rather than coherent sentences, make those separate questions, and mark them so.

Spivak · on Aug 30, 2019

I mean that's what the automated grading systems are trying to do but it seems like people don't like them very much.

saagarjha · on Aug 29, 2019

> If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests.

Apparently it is. But everyone still wants writing to be assessed…

risubramanian · on Aug 29, 2019

The SAT dropped the writing section a few years ago, and many schools don't care about the GRE writing score.

hnburnsy · on Aug 30, 2019

They did drop the writing but replaced it with an optional essay. Most upper tier schools require applicants to take the essay.