Flawed Algorithms Are Grading Millions of Students’ Essays (vice.com)
358 points by elorant on Aug 29, 2019 | 285 comments



> Utah has been using AI as the primary scorer on its standardized tests for several years. “It was a major cost to our state to hand score, in addition to very time consuming,” said Cydnee Carter, the state’s assessment development coordinator. The automated process also allowed the state to give immediate feedback to students and teachers, she said.

Yes, education takes time and costs money. Yes, not educating is both cheaper and faster. Note how the rationalizing ignores the needs of the students and the quality of the education.

I live in Utah and my children have been subjected to this automated essay scoring here. One night I came home from work and my son and wife were both in tears, frustrated with each other and frustrated with the essay scoring which refused to give a high enough score to meet what the teacher said was required, no matter how good the essay was. My wife wrote versions herself from scratch and couldn’t get the required score. When I got involved, I did the same with the same results.

Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not. Random unrelated sentences pasted in the middle would increase the score. We found a letter of petition online for banning automated scoring for the purposes of grades or student evaluation of any kind. It was very long, so it got a perfect score. I encouraged my son to submit it, and he did. Later I visited his teacher to explain and to urge her to not use automated scoring. She listened and then told me about how much time it saves and how fast students get feedback. :/


Frankly, I can't believe what I am reading. The idea that some "AI" grades essays automatically is idiotic and has nothing to do with education. Where is the place for discussion? Where is the place for the confrontation of ideas? Where is the place for developing a writing style? How is this AI supposed to grade things like repetition (which can be either a good rhetorical tool or a mistake, depending on context), etc.?

Who the hell came up with such an idea? I would even hesitate to use "AI" for automatic spell checking, as it is sufficient to give some character an unusual name and it will be marked as an error.

My guess is that sooner or later people will learn how to game that AI. I wouldn't be surprised if there were software that generates essays the Utah "AI" likes.


> My guess is that sooner or later people will learn how to game that AI.

Already been done. http://lesperelman.com/writing-assessment-robo-grading/babel...

Here's a sample essay that is complete nonsense and got a perfect score on the GRE.

http://lesperelman.com/wp-content/uploads/2015/12/6-6_ScoreI...


The final paragraph from that example is steaming gibberish that nobody could mistake for English:

"Calling has not, and undoubtedly never will be aggravating in the way we encounter mortification but delineate the reprimand that should be inclination. Nonetheless, armed with the knowledge that the analysis augurs stealth with propagandists, almost all of the utterances on my authorization journey. Since sanctions are performed at knowledge, a quantity of vocation can be more gaudily inspected. Knowledge will always be a part of society.Vocation is the most presumptuously perilous assassination of mankind."

Yet the robo-scoring acclaims it as:

* articulates a clear and insightful position on the issue in accordance with the assigned task
* develops the position fully with compelling reasons and/or persuasive examples
* sustains a well-focused, well-organized analysis, connecting ideas logically
* conveys ideas fluently and precisely, using effective vocabulary and sentence variety
* demonstrates superior facility with the conventions of standard written English (i.e., grammar, usage, and mechanics) but may have minor errors

Any teacher faced with the requirement to use such tools would be better placed instructing their class on civil disobedience.


Then let me posit another idea...

There are two ways of finding out about these artifacts of AI essay grading: pure luck, and being able to afford extensive test-prep (being rich).

The luck one can't be accounted for. So I am led to believe that the purpose of these essays and their AI grading is to find and elevate rich people.


> So I am led to believe that the purpose of these essays and their AI grading is to find and elevate rich people.

Well, of course. How many poor people are allowed to decide what is good for children's education?


The standard US response is:

"There's a reason why they're poor. Better pull themselves up by the bootstraps."

Mix in that schools are funded primarily by their own poverty-stricken neighborhoods, resulting in poor school systems. And those students obviously won't have the money or the access to get the test-prep needed to "succeed".

It's all too laid out to be accidental.


Professor Perelman had also previously demonstrated that this sort of scoring was going on when essays were scored by humans [1].

I suspect that, in addition to the scoring rules being written for speed and frugality, they were shaped by a poorly thought out attempt to make the scoring 'objective', and independent of the scorer's beliefs, attitudes and unconscious biases.

In one sense, this software (I will not call it 'AI') is an extension of all those bad ideas, only greatly amplified in a way that only software can.

In evolutionary biology, there is the concept of 'honest signalling'[2], a true and unfakable indicator, to a potential mate or predator, of an animal's fitness. That is what we are missing here.

[1] https://www.bostonglobe.com/opinion/2014/03/13/the-man-who-k...

[2] https://en.wikipedia.org/wiki/Signalling_theory


Another issue is one of copyright: obviously the student is the author. And we all know that the ML scoring subcontractor is keeping copies, with human ratings, for later retraining purposes.

At the time the student takes the test, he should be prompted with an informed choice, asking him to grant either: 1) no license to keep a copy for training purposes; 2) a non-exclusive license, plus the website where he can get a copy of his own essay; 3) a public domain license, again with the relevant site linked so he can find his own and others' essays; or 4) any of the above as a function of the resulting grade!

At the same time he should also specify his preference for or against attribution, again probably best as a function of the resulting grade, and under what moniker he wishes this contribution to exist.

These options, to be filled out during exam time, should have no default values (no opt-out), and preferably should be standardized by the community and lobbied for at the state or federal level, forcing examinations to present the student with an informed choice.

A public dataset of legally obtained essays (without scores or names) would already be a very important first step to invite others to make actual performant ML grading systems.

I don't believe the current datasets held by these "non-profit" organizations actually comply with copyright law: organizations that don't profit from the grading service to the state, but do provide a stable ML job on the income from charging people with financial means to test submissions, enabling a stealth class-based society.


That reads like it was written by a GPT-2 bot.


A Markov chain would have been able to produce a text like that.


Wow - should have been marked for improper use of : though :-)


>> Who the hell came up with such an idea?

I'd guess this is a product of dwindling state finances and contempt for any form of real education. AIs are orders of magnitude cheaper than real teachers. They also don't form unions and wouldn't voice any opposition to changes in the curriculum.

They are also pretty useless, as you have pointed out. The consequences of this policy will be postponed until the students reach a certain age -- that'll be like 10-15 years in the future.


Until they end up at uni or work and find that they haven't developed the right skills.


> My guess is that sooner or later people will learn how to game that AI.

To be fair, the GP here is specifically describing how he gamed the AI via a copy-paste of a critique of the AI; his kid submitted it of his own accord; it was graded without comment; and then, when the GP went in to comment on the gaming of the AI, the teacher not only did not care that the AI was gamed, but expressed gratitude for the AI saving hours of work, still ignoring that the AI fundamentally made things worse, all at the expense of the entire point of being a teacher in the first place.

The issue, for the teacher, is that in 'the system' in which they collect a pay-check, the AI works flawlessly. The point, for the teacher, is not to educate children. It is to have assignments that children pass with some sort of distribution that can be sent in and calculated by some person in a beige suit, wide tie, and hair troubles. The difference is subtle at first, but when you get further along to the point where the GP is sitting, then the difference is comical.

The AI allows the teacher to increase their efficiency in processing assignments, ones that never really mattered to the teacher in the first place. In valley-speak: the incentives are not aligned.


I can't believe it either; it's completely ridiculous. They're basically claiming that they've developed a general AI. It's as if some part of the population is living in a different fantasy world and makes policy decisions accordingly.


Is the USA tenable going forward? Your cost of everything, value of nothing culture appears to be very destructive.


It's progress built on top of an assumption of never-ending, unsustainable growth. Unfortunately for everyone else much of the world has been dragged into the rat race along with us.


I'd argue that the expectation of perpetual population growth is one of the big problems (the unsustainability of social security bottlenecking at the baby boomers being an obvious example).

There is a compelling case for immigration for that sake alone.


Honestly, no. It's not.

Granted I already have a very fatalistic view on my future already so take my opinion with a grain of salt.


Not without change, but I'd argue further change is certain, and likely to be of large magnitude.

I do not know if it will be sustainable, but I doubt it will be static.


I agree with you wholeheartedly, but I think there's a stronger argument to be made here: the algorithms being used "work" only because students are ignorant of the scoring metric. If the students under test knew even sketchily how the system worked (e.g., points deducted if your average sentence word length is > 7, points added if your word-length stddev is greater than 2), and could meaningfully push their scores up by focusing on these proxies that don't _actually_ measure what a human would say is quality work - or could even get gibberish[0] rated highly - then the whole thing is a fraud. No one will stand for a grading system that only works by virtue of obscurity.

[0] https://www.nytimes.com/2012/04/23/education/robo-readers-us...
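To make the proxy gaming concrete, here's a toy sketch in Python. The metrics and weights are invented for illustration (they're not any vendor's actual model), but they have the shape described above, and padding always wins:

    # A toy surface-proxy "scorer" -- metrics and weights invented for
    # illustration, not any vendor's actual model.
    def proxy_score(essay: str) -> float:
        words = essay.split()
        sentences = [s for s in essay.split(".") if s.strip()]
        avg_sentence_len = len(words) / max(len(sentences), 1)
        long_words = sum(1 for w in words if len(w) > 6)
        # Rewards sheer volume and "sophisticated" vocabulary, which is
        # exactly what lets irrelevant padding raise the score.
        return 0.5 * len(words) + 2.0 * long_words + 1.0 * avg_sentence_len

    essay = "The essay makes its point clearly and briefly."
    padded = essay + " Furthermore, notwithstanding quantum electrolytes," * 20
    assert proxy_score(padded) > proxy_score(essay)  # padding always wins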


It is classic beancounter thinking in the worst way: the worst stereotype of an MBA trying to minimize cost beyond all reason, cutting corners even when it saws off the branch they sit on.

It is frankly a sign of a diseased culture to use it in any capacity except as an exercise to improve AI.


Teachers often score by similarly sensible criteria.


When I was a child I was obsessed with the "grade level" function in Microsoft Word. It was a preference you could enable on spell check to tell you the "grade level" of your writing.

Every essay I wrote, I'd always force myself to reach the max "12.0" grade level. While writing I'd struggle over word choice, sentence structure, rearranging paragraphs, working on my tone etc, all in pursuit of the 12th grade way to phrase things. All my revisions were subject to the approval of the Grade Level checker.

Whenever I could, I would check the grade levels of my friends' writing, usually by showing them a "neat feature" they could enable. Then I'd smugly applaud myself for being the better writer whenever their grade level was below 12.0.

The Grade Level feature fascinated me, and to try to master it, I found a book about Microsoft Word and looked through it in a bookstore. I was absolutely gobsmacked at how simple the formula was. I had childishly been expecting something sophisticated, like perhaps Utah educators imagine they have. I genuinely expected the method to be complex beyond my understanding.

Instead, Word used a variant of Flesch-Kincaid. There was a direct relationship between sentence length and grade score, and polysyllabic words and grade score. Meaning, the longer your sentences and words, the higher your grade score.

As soon as I got home from the bookstore I loaded a draft of something I had written. It was "pre-12.0" writing from me. I simply deleted all the periods but one and checked again. 12.0.
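For reference, Flesch-Kincaid grade level is just a linear function of average sentence length and average syllables per word, which is exactly why deleting periods works. A minimal sketch, using a crude vowel-group heuristic for syllables (real implementations use dictionaries):

    import re

    # Flesch-Kincaid grade level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

    def syllables(word: str) -> int:
        # Crude heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fk_grade(text: str) -> float:
        words = re.findall(r"[A-Za-z']+", text)
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        syl = sum(syllables(w) for w in words)
        return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59

    text = "I like dogs. Dogs are fun. We play all day. Then we rest."
    print(fk_grade(text))                   # short sentences: low grade level
    print(fk_grade(text.replace(".", "")))  # one long "sentence": much higher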

Automatic grading is a wonderful lure. It's nice to imagine that there's some objective measure of writing quality that's easy to tap into. At the moment, I think we're far from that ability.

Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.

Revising is a really vital part of writing. Getting more chances to do revision, plus having to write something good enough to show your peers, plus the risk of any paper counting toward your grade, should compensate for incomplete teacher grading.


The fact that you were literally still a child when this happened, but automated grading is being foisted on us by grown adults who are ostensibly professionals, says a lot about the situation.


> Personally, I feel the solution to insufficient teacher time is to use peer grading much more, and spot checks. Get kids to read and revise each other's works frequently, and teachers should aim to grade at least N papers per student where N is much less than the number of papers a student writes.

That's how it's done in creative writing courses. I've always found it infinitely more helpful than only having feedback from the instructor, even if the instructor's feedback was generally more helpful/useful than peer feedback.


Things like this story, Word's auto-grader, and Grammarly's style preferences are all surreal to me. We are asking a computer to validate prose meant for human consumption.

Not a reflection of physical reality like sensor data or even accounting information, but the method of communication explicitly invented for production and consumption by humans.

Of course feedback from humans is more valuable than feedback from computers; it would be irrational/miraculous if anything was better at giving feedback than a human.

It is a shame it isn't self-evident to instructors how poor a solution this is, and how much better the results are when using critique by peers and instructors -- the classic way of doing things.


Arguably, Hemingway's texts are well written. One of the sources of power of his prose is the use of simple words, and basic sentence structures. I bet Word would classify that as below 12th grade.

The point I am trying to make in agreement with the parent is: there are qualities that are very hard to score with algorithms. The difficulty of solving this problem equals if not exceeds that of automated translation, which still only works properly for specialized and limited domains, e.g. weather forecasts.


All that grade-level gaming paid off, I reckon! This was a funny, informative personal account of it :)


It's interesting that the tool (and system) is designed to aid people trying for the opposite result, i.e. for publicists and other authors striving to word their message to be as widely understood as possible.


Exactly. I am doing this as part of some content changes we are experimenting with on a major brand's site.


You just ruined my childhood. Thanks.


I went to high school in Utah, long before this automated scoring. It sounds awful, but considering the quality of the education I received there, perhaps it's not that bad after all.

My best Utah education anecdote: on the first day of British literature class, the teacher came in and asked, "Does anyone here know what A.D. means?" Someone said "after death"; she said no. I figured this was my time to shine, so I raised my eager hand and said "Anno Domini, in the year of the Lord." She said no.

Then she announced: "A.D. means after the Deluge, and B.C. means before Christ".

She also totally lied to me one time about whether she would be considering a particular textbook question as applying to Rosencrantz or Guildenstern.

Anyway, I think that was one of the many classes I got an F in after I stopped going; I would walk past it every day on my way to play chess with my German teacher.


How is this relevant in a lit class? Presumably you hid the fact that you were a Catholic / Anglican from her.


Actually I am and was an atheist, but I had recently been on an "I'm going to read the encyclopedia!" mission.


I don't know the relevance, I guess she just felt enthused about sharing some of her learning with us impressionable minds.


Wow this is pretty shocking. I can understand using automated systems for something like math problems, it makes sense. There’s (usually) one right answer. But essays? This should be banned.


Wait 'til you see a kid in tears because the math answer they submitted was supposed to equal zero, but the algorithms behind the scenes are so bad that the float math failed the equality check.

Note: This is not hyperbole, I have seen this exact scenario more than once.

There may be a place for a well-designed one, but if it exists, I've never seen it.
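For anyone who hasn't hit this before: the classic failure is testing a computed float for exact equality with zero instead of closeness within a tolerance. A minimal illustration:

    import math

    student_answer = 0.1 + 0.2 - 0.3        # mathematically zero
    print(student_answer == 0)              # False: it's ~5.55e-17
    print(math.isclose(student_answer, 0, abs_tol=1e-9))  # True: what a
                                                          # grader should do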


I know this is just my experience, but I can confirm the automated math scoring system I was using at a large US university in 2012 had bugs: many times I would enter a complex solution with fractions, and it would tell me I was wrong and that the correct response was some other form of the same equivalent fraction... Talk about frustrating after poring over a question for 10 minutes.


I remember to this day when I went round and round with a calculus teacher in high school who told me "sin(x) + 2" was incorrect. The answer she wanted was "2 + sin(x)". I argued that addition was commutative, "2 + 3 = 3 + 2" and so forth. She wouldn't budge. She also said it was impossible to calculate the exact area under a curve, because you can't draw an infinite number of boxes under the curve.

Having an ignorant teacher is almost as bad as a flawed, black-box algorithm.


My gods, that is stupid.

Math is only non-commutative in certain rare areas. Specific examples are concatenation of two strings, or two rotations of a Rubik's cube.

> said it was impossible to calculate the exact area under a curve

And this is why a lot of people are against unions and similar protections for teachers. How do you get rid of someone who blatantly lies and informs falsely to students? How do you get rid of horrible teachers?


What was the allowed form of the fraction? A ratio of literal integers, or also including irrationals? Variables? Did the scoring system need to know formulas, as in (2/3)v =?= (2/3) omega*r?

If only the last one is not required of the scoring system, then it's been available for ages, and it's just a really poor implementation. That's not programming but plumbing data pipes.


Having been forced to use an online math software for all my homework while at school, I vehemently disagree. It was so poor that it became a meme within my year group.

It would mark you as incorrect for using too many decimal places, even though it wouldn't tell you how many significant figures were required. I often remember it marking my answer as incorrect even though it was identical to the answer they gave. Sometimes you'd have to show your working, but it couldn't handle brackets. Once I put the answer as "1+x=y" but they wanted the answer "y-1=x", and they marked it as incorrect.

I'm sure academic software design is leaps and bounds above what it was in the early 2000s, but to have pupils' futures hinge on what generally seems to be poorly tested code is dangerous.


That's just poor software. As long as the software and teachers or professors allow for going over answers and checking for correctness it should be alright.


That sounds like "My Math Lab" by Pearson. It is horrible, as everyone in here states.

It's also bundled with the book sold by unis, because the code 'allows' you to submit the homework required for the class. So they're doing both resale prevention AND horrible grading.

People I know who have to interact with it call it "My Meth Lab" - because you have to be high to like it.


I have often solved many hard math problems with very unconventional solutions (e.g., a geometric proof for an algebraic problem). Trust me, a piece of software is decades away from being able to accurately determine the future of children and massively impact their self-esteem / trust in society.


Unconventional solutions can stump human teachers, also. One day on a chemistry test I was being stupid about how to define "stereoisomer". I knew what it was (two compounds that are mirror images of each other), I was just having trouble expressing it properly. Running out of time I put down "two molecules that are identical if and only if you permit rotation through the fourth dimension." This is extremely unconventional but it is correct--except not only did the teacher not understand it but I couldn't find any help in the mathematics department, either.


On one hand, I agree with you. I remember having to argue whether I showed my work or not by using imaginary numbers instead of standard formulas in high school physics.

But even with these examples, the path of appeal and rectification of mistakes is much easier with all humans involved. I fear soon people will side with the machine out of ignorance or to be justified in an incorrect stance.

The idea that we could be so poorly taught by broken automated systems, that we become incapable of detecting the system is broken seems like a possibility with AI that is much less likely in pure human systems of education (though not impossible).


"Fast, good, cheap"

The state chose fast and cheap. Well, it's cheaper than more teachers.


It is important for the teacher to see where part of the class took the wrong turn, where the students' understanding ended. It is important to distinguish between careless errors, wrong memorizing of a formula and lack of understanding.


> Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score, whether they were relevant or grammatical or not.

Frankly, this is no different from my experience in school decades ago. The teachers always said that length does not matter and we should not pad the papers. However, students who wrote more pages got better scores every single time.


Is it possible that the students who wrote shorter papers were in fact presenting incomplete arguments and/or thoughts? Writing clearly and concisely is extremely difficult.


It is possible, but all of them? Also, by "short" I usually mean four pages, which was the usual recommendation for paper length. To get a good grade one would need to write ten. The subject was usually something vapid, so most of that must have been drivel.


Historically, on the written portion of the SAT, length is substantially correlated with the final score.


I'm not sure that's an issue by itself. If the prompt is broad enough, a minimum length can be reasonable for essays.

It certainly could be a problem if the prompt was too narrow, or time constraints, or some other factor.

Do you think the correlation by itself indicates something negative/inefficient I'm missing?


I think that the required length should be bounded on both sides, and the paper penalised if it is either too short or too long.


You have automated systems that rate essays without any human actually reading them?

Kids, forget everything you know because crime does indeed pay off. Best grades will be reserved for those that try to cheat this system however it is implemented. Botting your essays is the way to go in the 21st century.


Given that the stack ranking at your future job will also be done by an "AI" (probably developed by the same company that graded your tests), this is a very useful skill to have.


Good point. I'd better teach my kid this skill so she can go on to a job programming one of these "AIs".

Idiocracy will happen not because people get any stupider, but because the bots reward the stupid ones first.


Maybe Idiocracy was right about the whole "It's got electrolytes!" meme, except swap out electrolytes for neural networks.


Now I need a video of Luke Wilson saying "What are neural networks? Do you even know?"


>Turns out the instructions said the essay would be scored on verbal efficiency; getting the point across clearly with the fewest words. I started playing around and realized that the more words I added, the higher the score

Apart from the fact that your story is straight up frightening, isn't this part completely backwards, too? I mean, clearly using more words to convey the same message is /less/ efficient, not more so?


Yes. Exactly. Before I figured out how to game the program, my son and wife were editing shorter. That’s what the instructions said to do. And, that’s also a major strategy for decent writing: brainstorm a lot, then edit down to the good parts. What this means is the software’s scoring is an anti-incentive to good writing. Used as a teaching aid, it’s actually doing pure damage, not good. Not only can it not score reliably, nor provide meaningful feedback, it’s actually actively teaching a very wrong way to write. But it is cheaper than humans, and it does give immediate feedback, so there’s that.


This is a problem I have with a lot of human behavior. Instead of admitting you don’t have the resources to do something or aren’t willing to prioritize it, people come up with a bad version that’s not worth doing. Lots of things are worth doing poorly, but many of them I believe you just need to admit are not worth it unless a certain level of performance is met.

What’s even cheaper than AI? Tell the students to write some pages, have the teacher glance at the number of pages written, give full credit if the mark was met, and throw the papers out without reading them. It sounds like it would be similarly effective and less aggravating. Unfortunately, this would require humility on the part of the educators.


Think from a positive angle, students today are learning useful life skills to game computer systems, which they will have to deal with when they grow up.

edit: ...just like how previous generations have to learn how to game social systems.


Except the algorithm being gamed can change suddenly, drastically and without the gamer's knowledge.

When such changes occur, the gamer will be docked until they can reverse-engineer the new algorithm. There's also the risk that all their previous inputs "gaming" the system might be re-evaluated with terrible results, effectively rewriting their historical performance disastrously.

As always, those with the social standing and power to have insider knowledge or guidance will be in the best position to profit off such systems.


Ho ho. Wait. You mean you were able to submit multiple versions of the essay? So that anyone can basically game the test, by submitting multiple essays until they get the best score they can wring out of it?

That is just mad.


It'd be easier and equally fair to just grade students' essays by rolling a pair of dice.


It's arguably more fair. At least purely random scoring doesn't incentivize cheating.


Wait, you get the score in real time? Like some kind of objective function you can train a machine to maximise?


Heh heh. I like the way you think. Hackers of Utah unite!


This just needs to be outright banned.


I can't even comprehend how someone can use automation for a task like this... It completely goes against human nature. In a world where all jobs have been automated teachers would be the last ones to go before humanity is completely obsolete.


Do you just get to keep submitting the essay to see what score it will get before you turn it in? That sounds like a bigger problem than any of the particulars about how the grading is done.


In this case, there was a limit to the number of times the essay could be submitted, and there was a required score that needed to be obtained within that limit, otherwise the grade would go down. The limit was something like 20 tries, and when I got there they’d already used maybe 14 of them.

I could perhaps see value in having unlimited tries, as a teaching aid, if the result wasn't being used for grading. That would at least leave room for curiosity and exploration. And, more importantly, I could see value if the software wasn't essentially a scam that fundamentally is not able to do what is advertised. If the software really could grade essays reliably, and provide meaningful suggestions for improvement, then maybe it could be used to help educate students, in conjunction with the teacher's guidance. But the software does not grade reliably, and it absolutely does not offer meaningful constructive feedback, and the teachers were using it to avoid reading essays, not to supplement their own expertise.

One of the several amusing ironies here is how the software company has convinced the state and teachers to willingly replace themselves with bots, despite obvious evidence that the humans can do the job better.


I can see how giving students multiple tries would be a great teaching aid if we had human-level AI. With what we have now, I'd bet it's just training them to produce essays that hit the flaws in the AI as hard as possible to produce unreadable garbage with high scores. 20 tries per essay adds up to a lot across 12 years of schooling.

Teachers are extremely overworked, underpaid, and underappreciated. I'm not surprised that it was easy to convince them to offload the difficult and time-consuming work of manually grading essays. This also means they don't have to deal with complaints about unfair grading. A machine did it and it's out of their hands.


The instant feedback mechanism is just begging for someone to turn it into a GAN by writing the other half. I would absolutely love to hear that some particularly clever high school student was able to train an ML algorithm to consistently fool the grading algorithm, thus instantly rendering all of their efforts worthless and dragging the administrators through the mud at the same time.
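You wouldn't even need a full GAN: with per-submission feedback, a plain black-box search would likely do. A sketch, where get_score() is a stand-in for the grader's hypothetical feedback endpoint:

    import random

    FILLER = ["Furthermore, the aforementioned paradigm notwithstanding.",
              "Knowledge will always be a part of society.",
              "This compellingly augments the salient analysis."]

    def get_score(essay: str) -> float:
        # Stand-in for the grader's feedback endpoint. The thread suggests
        # raw word count is a decent proxy for what the real one rewards.
        return float(len(essay.split()))

    def hill_climb(essay: str, tries: int = 20) -> str:
        # Bounded number of submissions, like the 20-try limit described above.
        best, best_score = essay, get_score(essay)
        for _ in range(tries):
            candidate = best + " " + random.choice(FILLER)
            score = get_score(candidate)
            if score > best_score:  # keep whatever the bot rewards
                best, best_score = candidate, score
        return best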


That really sounds like Utah: they have lots of students (due to LDS influences) with a conservative government (ditto), so the pupil/teacher ratio is insane. I'd guess the teacher really doesn't have any other choice.


My mother worked grading standardized tests. It was a hellish job for many reasons (limited breaks, etc.)

One question she had to grade was essentially, "What's something you want your teacher to know about you?"

It was an essay answer, and she was supposed to grade it on grammar, etc. Just the mechanical aspects of writing. (The real question explained the details more, but that was the core of the question.)

She saw answers that would make you weep.

"My daddy touches me."

"I haven't eaten today. I don't know when I'm going to eat again."

Stuff like that.

And my mother was going to be the only human who ever saw their responses. Their teacher had no chance to see their responses, just my mom.

So she goes to her supervisor and asks, "What can we do to help these kids?"

The supervisor said there was nothing you can do. Just grade the answers.


The US has federal child abuse mandatory reporting requirement laws which cover teachers and school staff and personnel, as well as additional state requirements, which vary but include, for 11 states, faculty, staff, and volunteers at public or private higher education institutions. Computer and IT professionals are also covered in some cases.

Faculty, administrators, athletics staff, or other employees and volunteers at institutions of higher learning, including public and private colleges and universities and vocational and technical schools (11 States).

https://www.childwelfare.gov/topics/systemwide/laws-policies...

https://www.childwelfare.gov/pubPDFs/manda.pdf

This includes penalties for failure to report in multiple states:

https://www.childwelfare.gov/topics/systemwide/laws-policies...


So ... nobody wonders about the obvious ramification: any ML scoring system ... must detect child abuse signals!


Hey uh, that actually seems valuable.

I'd believe that ML could spot abuse that humans miss pretty well from signals like non-overt references in homework and school records, if one could come up with an adequate training set.

Much more likely than teaching ML to score reasoned and creative activity in any reasonable way.


What do you think the false positive rate is likely to be?


Depends upon what false negative rate you're willing to tolerate. ;) And I don't know how good of a signal there is. This is pure handwaving.

But this type of thing seems like the exact kind of spooky correlation that ML is good at spotting compared to humans.


Machine learning techniques are going to be absolutely awful at detecting something like this, the reason being it's exceedingly rare (at least I'm guessing it is; if we're talking about child sexual abuse by one's own parents, it sure sounds extremely unlikely, but even child abuse in general is probably rare [1]). Machine learning systems are awful at identifying rare events. As the OP seems to suggest, the false positive rate would most likely be very high.

"Spooky" machine learning results happen when a correlation is abundant in a dataset [2]. Otherwise, machine learning techniques will probably miss it altogether.

______________

[1] Quick online search: https://www.inquirer.com/philly/blogs/healthy_kids/What-is-t...

[2] The archetypal spooky machine learning story is surely the one about Target sending baby-item coupons to a girl in high school before her father knew she was pregnant:

https://www.forbes.com/sites/kashmirhill/2012/02/16/how-targ...


> if we're talking about child sexual abuse by one's own parents, it sure sounds extremely unlikely,

Child sexual abuse isn't extremely rare and familial abuse is a very large minority of child sex abuse.


Humans are awful at rare events and vigilance tasks, too. That's part of why we're seeing machine vision and machine learning starting to outperform humans in e.g. grading radiology screening scans.

The total incidence of child abuse of all types from infancy to adulthood is on the order of 1 in 3. This is not terrifically rare -- it's of higher prevalence than pregnancy, or than positives in many screening tasks.

A much bigger concern is non-causative correlations. It'd be pretty easy to train ML to be racist or look for e.g. indicators of class, which are correlates of abuse.

As to false positive rates -- you can pick your false positive rate to be whatever you want it to be by twiddling the threshold for a positive result. I'm not sure false positives are that great a concern if the output from the system is a notification to school administrators that they may want to keep an eye out for this student.


> But this type of thing seems like the exact kind of spooky correlation that ML is good at spotting compared to humans.

How? Particularly, where do you get training data at the required scale?


You take samples of hundreds or thousands of past students' schoolwork, e.g. submissions of essays for standardized tests.

You survey those kids in adulthood about whether and how they were subject to abuse and other types of relevant adversity.

You attempt to control the data so that you don't just latch onto other correlates of abuse (e.g. social class).


Who cares? The positives must be evaluated by a human anyway.


> Who cares?

The people whose lives are ruined by being mis-identified by the system.

> The positives must be evaluated by a human anyway.

Those same people whose lack of competence people are bemoaning throughout these comments.


> The people whose lives are ruined by being mis-identified by the system.

When a child writes "daddy touches me between the legs" in an essay, it doesn't matter if a human spots it or an AI that forwards it to a human, this needs to be investigated either way.

> Those same people whose lack of competence people are bemoaning throughout these comments.

It's not a lack of competence that's bemoaned; it's a massive amount of understaffing (and resulting overwork) among teachers and other school resources, as well as a drastic lack of financing, because it's easy for politicians to cut school budgets when the effects only show up two decades afterwards.


Sprinkling some AI over it won't fix those issues; I'd argue it will make them worse as people blindly accept the results.

There were some cases in the UK about a decade ago where bugs in the Post Office's Horizon accounting software led to incorrect accusations of fraud. People actually went to jail over this, and it took years to resolve.


> When a child writes "daddy touches me between the legs" in an essay, it doesn't matter if a human spots it or an AI that forwards it to a human, this needs to be investigated either way.

When a child writes a set of things that individually are not very concerning, they may have cues that could say "hey, this kid, you should maybe keep an eye out for evidence of abuse."

Particularly attuned, experienced individuals might spot these cumulative cues, but we all know that this is not all people dealing with children.

It's an interesting problem.


Humans will still let false positives through. And false accusations of child abuse have significant ramifications.


This is so dangerous.

Society's bigotry is going to flood that bad boy so quickly you might as well name it Goebbels.

I love ML. I want children to be safe. This is not the place for ML or AI or Quantum or any tech.

What needs to exist is better resources for those children, that mother grading the tests, the teachers of those children, and social services that are meant to support them. If you want to make a difference about this, look there.

Don't go building an automaton King Solomon who decides this kid should be taken from these parents because speaking Spanish was worth -0.1 on some goddamn weight trained on data generated from a racist society.

This isn't a "spooky" correlation a cool algorithm can detect, it's a serious, layered social problem.


> Don't go building an automaton King Solomon who decides this kid should be taken from these parents because speaking Spanish was worth -0.1 on some goddamn weight trained on data generated from a racist society.

Totally what I advocated for, and not a strawman attack /s. Indeed, there is a chance that such an algorithm could be racist or classist, and the need to avoid bad correlations and have appropriate controls is important.

I think there are opportunities here. Ideally ed-tech doesn't take humans out of the loop, but asks schoolteachers and administrators questions like, "Hey, are you sure students A, B, and C are being supported correctly for subject Z? Are you sure students D and E don't have some kind of abuse or other significant home problem? It sure looks like student F is in a subpopulation that research shows benefits from educational intervention Y. You might want to keep your eye out for that."

And then the teacher goes "Oh, crap. Now that I think about D, there were always these little things 1, 2, and 3 that seemed off... maybe this is worth a referral to social services to check on what's up."

Or "Oh, ... maybe F's struggles in reading really are a speech problem and we should handle that"


That is not how the law works. The law states that if people at a school are made aware of or suspect abuse, then they must act on that knowledge. An ML scoring system is obviously unable to be made aware or to have suspicions, but the administrators could be held responsible if they happen to see something and choose not to act.

It would be interesting to know if a child psychiatrist could be held liable if incompetence prevented them from seeing obvious signs of abuse, but I doubt that is covered under the cited law above.


Some of these will be 100% true as well. But don't make the mistake that there are no kids who go for shock value or are wantonly manipulative when they know it can't come back to them.

So how many are true and how many false? I have no clue. Literally none. And no it doesn't make me feel any better about the screams of existential agony even if that were a low percentage. Could be high too.


For the not eating, it's pretty easy to get data. Something like 1 in 5 children live in food-insecure households in the US, and maybe 1 in 20 of those are severely insecure, so not eating before the school-provided lunch is common enough that if you're grading tons of papers you'll run into kids like that.


https://www.ers.usda.gov/topics/food-nutrition-assistance/fo...

Food Security Status of U.S. Households with Children in 2017. Among U.S. households with children under age 18:

84.3 percent were food secure in 2017. In 8.0 percent of households with children, only adults were food insecure. Both children and adults were food insecure in 7.7 percent of households with children (2.9 million households). Although children are usually protected from substantial reductions in food intake even in households with very low food security, nevertheless, in about 0.7 percent of households with children (250,000 households), one or more children also experienced reduced food intake and disrupted eating patterns at some time during the year.


It could also be a student suffering from anorexia nervosa, which the confessional aspects of the essay would fit well with.


I'm confident that your example would account for a smaller percentage than those mentioned in dmoy's comment.


When I was a high school student, we had some state administered test in health class that tasked us with analyzing advertisements for liquor and tobacco and seeing if we could recognize harmful behavior that the ads might be promoting. This test had no impact on our class grade...

..which means students wrote whatever the hell we wanted. I was assigned a Captain Morgan (rum) ad. I wrote that the ad was glorifying maritime piracy and was likely responsible for pirate activity in Somalia.


Of course some kids are manipulative, going for shock value, continuing an "in-joke", or just plain trolling. But would a teacher just look the other way, or would they talk to the kid? What would you want for your kids? This is why teachers assigning homework like "what do you want your teacher to know about you" and then not even seeing it is dehumanizing.


I don’t know about calling it manipulative. I remember taking the ACT, and struggling to plan out one of my essays. It was something like “tell us about a book that inspired you”. So I changed details about the plot so it all fit nicely and was easy to write. I can see something similar here, where someone takes on a persona when writing in order to effectively communicate.


This is absolutely the case. In fact, my SAT prep class taught us that the factual veracity of our essays is irrelevant. Essay scores are almost entirely correlated with essay length, as long as spelling, grammar, and basic paragraph structure (intro, body paragraphs, conclusion) are followed.


That's entirely fair. Manipulation is kind of what a writer does, yet the word "manipulative" has pejorative connotations. Many types of writing don't have literal truth as any kind of prerequisite. Others make a pretence of literal truth to achieve greater effect and then basically lie; many autobiographies fall into this trap to some degree. All these things. Differential empathy. Data quality matters.


False accusations can actually be the result of prior abuse. They may substitute one person for another. Or do things as a result of mental illness caused by abuse. Kids think differently to adults and may behave inexplicably. And unfortunately that means that an abused child is a terrible witness.


I was abused as a child (not sexually however) and I can attest to this. Many of my memories are highly charged and don't really hold up - they're very confused. Some of the scariest stuff that happened to me I don't even remember, and my siblings have had to let me in on it (and they were even younger at the time).

As a child you're really not prepared for the concept that your parents are treating you badly. So that realization doesn't come until much later.


> But don't make the mistake that there are no kids who go for shock value or are wantonly manipulative when they know it can't come back to them.

In the US, school funding is based upon standardized test results, and bad results can shut a poorly performing school down.

It's drilled into every kid's head that these tests are very important and super strict, and that if they accidentally mess up, it can ruin their academics, because retesting and regrading are expensive.


As a kid I would go out of my way to fail those tests. The whole curriculum was designed around them, meaning that even if we did score high, any funding gains would just be put towards training us to take the test.

I thought the state was holding the school hostage, threatening to cut funds or shut it down if they ever stopped. We never learned anything about civics or American history. Until I was out of high school, content regarding atrocities like slavery and the Trail of Tears was not on the test, and that was enough to whitewash the whole curriculum.

Standardized testing is to the U.S. what lead water pipes were to the Roman Empire.


Your school didn't teach slavery? What state?


It’s possible that there’s a difference between admitting it happened and an honest portrayal of the institution.


‘Poor sentence structure and grammar, 1 point out of five. Sorry your daddy touches you.’


Punch up, not down.


You really misread that if you thought I was punching down. I was pointing out the absurdity of having to even grade such a thing.


boy do I hate that saying. It literally adds nothing to the conversation!


It can be a useful heuristic.

I don't find it adds much here.


[flagged]


I embraced the awkwardness, and now I know for sure that I'm doing ‘a thing,’ instead of worrying about whether I do or not.


That’s what you’re doing wrong.


Thus aren't you as well?


I meant that your self-consciousness and constant worry is probably harming your social interactions more than anything. Or not. It was for me in the past.


Aw, I was hoping we were going to dig deep into a "no u" situation.


Or...report it to the police? I’d gladly risk my job to do the right thing in that instance.


What my mom saw had an ID number on it. No other demographics. And she was grading from multiple states.

So do what? Contact her local police?

With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?

And people don't work at a job like that because they want to. They work there because they need the money.

Everything she took in and out of there was monitored, too. So it's not like she could go to the Xerox machine and walk out of there with a copy.

It's beyond dehumanizing. For everyone. The kid, the people who work there.


> So do what? Contact her local police?

> With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?

Absolutely! My girlfriend works as a counselor at a school and she is required by law to report all serious abuses by parents.


> So do what? Contact her local police?

> With a written accusation from a child? Is that enough to get a warrant to force the company to release the demographic information?

Yes, she absolutely should have gone to the local police. A child's first-hand written account of abuse and neglect is slam-dunk evidence to secure a warrant to link the essay ID to the individual child.

> Everything she took in and out of there was monitored, too. So it's not like she can go to the Xerox, and walk out of there with a copy.

Doesn't matter. She could have gone to the police herself as a witness. That alone would be enough for a probable-cause warrant to retrieve the essays.

It is very sad she saw these signs of abuse and did not report it.


Yes, contact the local police so an investigation can get started. Yes, that’s enough to at least report it.


> So do what? Contact her local police?

Collect or photograph all the evidence, record every conversation with supervisors, escalate as much as possible internally, then contact local police, and at the same time go to the media. Don't quit, but if necessary let them fire you and then sue. None of this is easy.


When escalating, I'm sure it'll be effective to say that it'll make an interesting story for the news, and that the incident is being blocked by supervisors who encourage child abuse.


If she did Xerox the disclosures, walked out, and said "please call the cops" when challenged, at least it would be a matter of public record


They should be mandatory reporters at least in the USA.


School contractors are mandatory reporters, but I suspect that may not qualify.


Depending on the state, yes, they are, for 11 states.

https://www.childwelfare.gov/topics/systemwide/laws-policies...


These would not be school contractors.


Why not? The school contracts with the testing agency that does the grading. Seems like a contractor relationship?


A school contractor is someone who is hired by the school district via a contractual relationship. Think temporary teachers, or custodial staff. It's not a transitive relationship to every employee of every company that has some sort of contract, however small, with a school.


So you're arguing that only individuals can be contractors? That wouldn't make much sense, not least because such relationships are rare in schools. Most common are contractors to whom something like food service has been outsourced. The law would make no sense if it included practically no one. It would mean that if a company provided, say, temp staffing within the school, and those temp staffers saw abuse, they too wouldn't be required to report. I have a hard time believing a court would rule the definition to be so narrow; both the common-language understanding of the term and legal literalism point against that. There's no transitive property here. We're not talking about contractors hired by contractors hired by contractors. We're talking about a contractor and its employees. There is no way for it to exercise this reporting requirement save through its individual employees.


every relationship however transitive or small, is a relationship too! (Dr. Seuss)


Hand-graded standardized tests are usually anonymized.


It takes a five-minute phone call with the company's legal department, or a warrant, to find out who the kid was. Either way it would need to be escalated to involve law enforcement.


Someone could track the number back to the test-taker; they need to get their grade, at least.


Tell your mom to take those numbers and the company's name to the police. They can walk back the identification problem.


This is my first time learning that AI-graded essays are a thing. Am I the only one who thinks that's insane? I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.


I work in AI, and was very surprised when I heard about this (a few years ago). I don't think anyone who works in the area thinks the tech is ready for this kind of deployment. There is research on the subject [1], and NLP systems can do better than baseline methods, but the error rates are still pretty high.

A thing you quickly find if you try to download off-the-shelf NLP tools and apply them to anything is how little is reliable at all, unless you can constrain the domain. Even basic topic identification only works with low error rates when constrained to something like NYT stories, or PubMed abstracts, not arbitrary text by arbitrary writers. And I would bet ETS is using worse tech than research state-of-the-art.

[1] e.g. https://www.aclweb.org/anthology/P15-1053


You've noticed, though, that the AI con is on. This damages your work as people get burned, and it will bring about the second "AI winter".

People making big decisions with a lot of money around computing know nothing about it and are marks for con artists. Think big consulting firms selling to senior public servants in Washington. "For a successful technology, reality must take precedence over public relations." But reality just gets in the way when conning a mark for a successful snake oil sale, right?

Call it out, publicly; cite your credentials. Encourage colleagues, your competition, and everyone with a clue to pour scorn on whoever is selling this evil, toxic waste as drinkable.


Second? We’re heading to #3 — fully cyclical


Hmmm. I also work in AI, in fact professionally in information retrieval and NLP. I disagree strongly with what you say. Basic topic summarization and keyword / named entity extraction on unstructured sources of text works reasonably well. It's easy to adapt BERT and GPT to smaller problems, and language classification is borderline totally solved by extremely easy-to-train neural network models.

I still agree that automatic essay grading is beyond the reach of SOTA NLP models today, but you make it sound like virtually nothing can be done in a production-grade manner to solve real-world unconstrained NLP problems. This is manifestly false.


It's completely possible I'm not fully up on recent progress, especially since a bunch of stuff seems to have moved in the past 6 months. But I haven't seen any general models that can solve open-domain problems, without specifically retraining on each domain. Do you have any pointers? E.g. a single pretrained BERT model that can reliably extract topics from: tweets, paragraphs from 19th-century novels, mathematics journal articles, and Wikipedia articles? All the systems with very low error rates that I know of target one specific domain. The last time I looked into sentiment analysis (a year or so ago), it wasn't even that great on many individual domains, e.g. it would get tripped up by sentences from novels that used "negative" keywords in a humorous or ironic way.


In production problems that I work on, we don’t even really use things from within the past year. These problems are just incredibly well-solved with fairly vanilla LSTM networks from 2-3 years ago. Enough so that while it’s probably premature for fully automated essay grading, it’s not _crazy_ to make a product from models trained to solve this problem.


I have a grant where we are doing just that: implementing more or less SOTA research using fairly vanilla LSTM networks from 2-3 years ago (primarily Taghipour & Ng) to provide low-stakes feedback to students on their essays in one of our teaching tools at Purdue. It's based on research using the Kaggle ASAP dataset, and we have found it to be pretty accurate across a variety of domains in early testing, though some essay prompts seem to do better with CNNs vs. RNNs. I doubt many of the systems in TFA are based on LSTMs or neural nets at all; they are probably doing regression on hand-crafted features.
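For the curious, here is a minimal sketch of that style of architecture (embeddings, an LSTM, mean-over-time pooling, and a sigmoid regression head, roughly per Taghipour & Ng 2016). This is an illustration in Keras, not the grant's actual code, and the hyperparameters are assumptions:

    from tensorflow.keras import layers, Model

    VOCAB_SIZE = 4000  # assumption: small, task-specific vocabulary
    MAX_LEN = 500      # assumption: essays padded/truncated to 500 tokens

    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, 50, mask_zero=True)(tokens)
    x = layers.LSTM(300, return_sequences=True)(x)
    x = layers.GlobalAveragePooling1D()(x)            # mean-over-time pooling
    score = layers.Dense(1, activation="sigmoid")(x)  # score scaled to [0, 1]

    model = Model(tokens, score)
    model.compile(optimizer="rmsprop", loss="mse")
    # Train on integer-encoded essays with human scores rescaled to [0, 1];
    # predictions are mapped back to the prompt's original score range.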


Very interesting. Are there any meta-analyses / reviews that summarize progress in this area? Would it be possible to share your grant proposal -- I'd be curious to get an idea of what is being attempted.


It's an internal grant and I'm not sure I'd be allowed to share it. We are adding AES to our peer-review app. Currently as an additional "grader" to the peer reviews since that's what the PI requested. Since the tool allows unlimited submissions until the review date, I hope to add it as a "pre-flight" estimate to give students a chance to get a rough prediction of the score they will receive and a metric they can use as they revise until the due date.

I'm not aware of any meta-analyses myself. I have been keeping up with the ASAP competition and various attempts to improve on the initial systems for a number of years. The two papers I believe are having the most success are [1] and [2]. [3] seems promising for balancing the opposing forces of high accuracy for true positives and the risk of false positives via adversarially crafted inputs.

I'm also vaguely aware of research happening around extracting features from neural nets. I'd love to be able to help students understand why the system is predicting a particular score.

[1] https://www.aclweb.org/anthology/D16-1193
[2] https://arxiv.org/pdf/1606.04289.pdf
[3] https://arxiv.org/pdf/1804.06898.pdf


We had this in my school for 8th and 9th grade, so 2008-2010. We had to type the essays in class and submit by the end of the hour. I would only get maybe 3 paragraphs in before time was up, because I was trying to build a strong argument for the prompts. Despite that I would usually get 3-4/6, and my teacher said she would read the essays and regrade, but she never actually did. My friend literally copied and pasted the Pledge of Allegiance 20-30 times and scored a perfect 6/6. Later we found out that if you repeated the words in the writing prompt you would get a guaranteed 5/6, and with a high enough word count you'd get 6/6. The essays were all bullshit and just a way for the teachers to get an extra free period once a week.


I totally agree that "AI" grading is bullshit. But I also have plenty of experience teaching/TAing large courses, and after reading too many essays they all become semantically saturated meaninglessness. One cannot help but skim them and grade according to a few quick heuristics. At that point one tries to be self-consistent and defensible in one's grading, but careful consideration is right out. I suspect state graders are dealing with way more than 100 essays per person and are probably on a tight schedule too. It's quite possible that an ML model is better than an exhausted human grader, as their cognitive strategies are mostly identical.


The solution isn't to do a better job at grading 'meaninglessness' but to stop requiring the production of it in the first place.

One major problem with algorithmic approaches, whether automated or not, is that they become the definition of good in the context and therefore become something that cannot be argued against. And of course it makes 'teaching to the test' an even more likely outcome.

If I were a conspiracy theorist I'd attribute this to wanting a dumbed down population. Unfortunately I think it is probably the other way round, the population is already dumbed down and a belief in AI unicorns is the result.

As Euclid said to Ptolemy, 'There is no royal road to geometry', and so it is with education; it's hard work for both the student and the educator, and no amount of AI/ML/algorithmic snake oil will change that without also changing the meaning of the word education.


I remember when I was in middle school 16 years ago, my English classes would have us submit some of our work to a web app. It would then grade the submission. I remember this distinctly because I asked my teacher to intervene on at least two occasions. The app failed to recognize the words "squirrelly" (as in "That guy in the corner has been acting squirrelly.") and "defragment". My teacher decided to subvert the app's recommended grades because she, as a human, understood the intent of my use of those weird words.

To emphasize, this was 16 years ago.


> I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.

So the reason this isn't the case is that there are very simple metrics that tend to highly correlate with essay quality. It doesn't mean the grading bot is actually evaluating essay quality. It's just looking for properties that are statistically associated with good essays. Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.

A very straightforward example is spelling mistakes. People who make spelling mistakes aren't necessarily bad writers. And vice versa, there may be great spellers who can't write for shit. But by and large, the people who spell poorly also tend to write poorly. Easily detectable grammatical issues, like misplaced modifiers, subject-verb disagreement, or inconsistent tense, are also correlated indicators.

A very simple metric is essay length, especially if it's a timed exam. Good writers tend to have verbal fluidity, with words easily flowing to paper. They don't struggle converting thoughts to sentences. So they tend to end up with the most words written down within a fixed time period. By and large, the longer a timed essay is, the more likely that its actual quality is high.

Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing. But at the end of the day, their student rankings are usually pretty close to that of a typical human grader. In some cases the bot will have a closer ranking to a random human grader, than two random human graders will have to each other.
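
To illustrate (a toy sketch of my own, not any vendor's actual system):

    # Regress human scores on shallow surface features, with no model
    # of meaning at all. 'dictionary' is any word list you like.
    from sklearn.linear_model import LinearRegression

    def surface_features(essay, dictionary):
        words = [w.strip(".,;!?").lower() for w in essay.split()]
        misspelled = sum(w not in dictionary for w in words)
        n = max(len(words), 1)
        return [len(words),            # raw essay length
                len(set(words)) / n,   # vocabulary diversity
                misspelled / n]        # spelling-error rate

    # X = [surface_features(e, dictionary) for e in essays]
    # LinearRegression().fit(X, human_scores) will track human rankings
    # surprisingly well without "reading" anything.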

The biggest flaw here is Goodwin's law. When the test takers become aware of the kludges that the bots use, they can exploit them. For example, just dump a bunch of verbal diarrhea with as many correctly spelled words as possible. But even then it doesn't really hurt the bot's ranking accuracy too much, because the kids who do the most test prep and learn all the tips and tricks are usually high achievers who do well on essays anyway.


Strongly (but respectfully) disagree with a lot of this!

This is related to current fairness-in-AI discussions. In many cases the basic problem is ML systems leverage correlations for making causal decisions. Here, there is a huge ethical difference between scoring a person based on "is this a good essay" and "do the features of this essay correlate with features of good essays". Just like there is a huge fairness and discrimination difference between "is this person qualified for a loan" and "do the features of this person correlate with features of people who qualify for loans" (algorithmic redlining). Your last sentence has a big discrimination/fairness issue also, since you are testing even more for parental income and parental free time.


Which means that machine learning models need to not only be good predictors, but also be causal models to some extent.


I can't disagree strongly enough.

>Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.

This isn't true at all. Imagine you got a B or C on an essay that a human would have given an A to because you wrote it concisely and in plain language, or because you used language that's statistically correlated with being black. Does the fact that this is rare console you? "Sorry, but it's usually very close to the human grader's ranking." Close enough isn't good enough when you get the short end of the stick. "Sorry, you aren't going to get to go to the college you wanted because you use language statistically correlated with poor writing." Or just because you're different, so the statistical correlation doesn't apply to you, you filthy outlier. Just because it's a rare event doesn't make it okay.

In adulthood, this is like hiring or firing for work statistically correlated with good work. Remember when Amazon rolled out the resume scorer? [0] Sure it was biased towards women, but it was close enough to human scores, so who cares about the internal logic?

>Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing.

At the end of the day, our goal here is to measure good writing. If the bots aren't measuring anything intrinsic to good writing, we shouldn't use them.

[0] https://www.reuters.com/article/us-amazon-com-jobs-automatio...


The problem with the bots is that while on average they agree with the humans, they can produce very different results for individuals. Fine if you're seeing how a school is doing, horrible if you're testing how a student is doing.


Minor correction: the automated resume reviewer was biased against women, according to your reference.


I think you meant Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."


Is that what it's called? It's referred to in neural net training as overtraining.


Haha, whoops. That mistake kind of changes the meaning...

Thanks for catching the mistake.


Your last paragraph, and particularly the last sentence, epitomizes what is wrong with your whole thesis: the ultimate goal of the testing (and education itself, for that matter) is not to find people who can "do well on essays"; it is to develop analytical thinking.


Clearly, that referred to 'doing well on essays' correlating with being a good analytical thinker. Which it does.


That assumption lacks justification when the scoring does not actually measure analytical thinking. Any statistical evidence for it is suspect as a predictor of future outcomes when a high score can more easily be gamed than 'honestly' achieved.


Scoring is not the point here; the analytical thinker is gaming the test to pump the score, thus proving they are an analytical thinker. Not a statistical argument; a suggestion that the screen works, when it is abused. Because it is abused.

Anyway, yeah, not really a correlation.


It is absolutely insane. By no definition does the system understand what is written.

You could ask a student to write an essay taking a firm opinion on some subject, and they could change their standpoint every paragraph and there's no way these systems would know.

If I was a student I would be extremely offended at people wasting my time like this.


I'm surprised people are surprised by it. I guess it just hasn't gotten talked about it a lot? When I took the GRE in 2011 the rule was that my essay would be graded by one human and one automated grader, and a second human would become involved if the computer and the human differed by one point or more iirc.

Maybe nobody really makes a big deal about it because it is pretty much irrelevant anyway. Applicants provide a letter of intent that the grad dept people can, y'know, actually read for themselves, so I think unless you totally bombed the writing section, nobody cared.


In a forum of CS people I'm surprised this is one of the top opinions. Our field is full of super surprising results like this -- that you don't have to actually understand the text beyond basic grammar structures to reasonably accurately predict the score a human would give it.

Like this kind of thing should be cool, not insane. I mean wasn't it cool in your AI class when you learned that DFS could play Mario if you structured the search space right?


I came first in English for my school, many moons ago. Leading up to the finals, I regularly finished ahead of the hard-core English essay people, generally to my amusement. My exam essay responses were generally half the length of the prodigious writers' (sometimes even shorter). Although I've an OK vocabulary, I always made sure I made the right choice of word to hit a specific meaning, rather than choosing words with a high syllable count.

I'd find it highly interesting to see what kind of result I'd get using an automated system.

Why?

Because, I once asked a teacher (also an examiner) why I got good grades above the others, and the answer surprised me: my answers were generally unique / refreshingly different, to the point, not too long, and easy to read.

I suspect with this new system, I'd be an average student. It'd also be interesting to find out, several years down the road, if the automated system could be gamed at all -- I suspect it could, and teachers would help students 'maximise' their scores as a result of that.


It seems plausible that, under this system, you would eventually have learned to write longer essays. To my mind, that would be a school teaching you to be worse.

In fact, throughout the article I kept being surprised by the idea that long is good. When writing, I tend to prefer being brief.


Your post resonated with my first thought on reading the article. I wonder if it would penalize writers with simple declarative sentences.

You know, those average writers like Hemingway /s


When I hear a result like "software which understands basic grammar structures can predict what grade a human would give an essay" I think my views are roughly:

* 5% - cool, we could make a company that grades essays

* 15% - cool, we could make a company that grades essays and sell our source code to the test-prep industry

* 80% - fascinating, it sounds like the exam designers need to reevaluate what they are trying to measure with essay questions


Whatever we decide to measure, it needs to scale to millions of essay responses each year in such a way that scores are consistent across entire states or countries. With that in mind, I'd imagine it's difficult to do much more than grade on grammar and basic semantics.


No it doesn't.

And if you succeed you will simply be measuring an uninteresting but manageable subset of the problem which will then become in some people's eyes the definition of the problem.

Education is supposed to be about teaching people to think, to give them the tools with which to do it, to be able to evaluate, criticise, invent, etc.


"...that you don't have to actually understand the text at beyond basic grammar structures to reasonably accurately predict the score a human would give it"

That only really shows that the humans they're training on are terrible at grading essays.



This problem is a first class demonstration of the difference between "can we?" and "should we?"

The fact that it's being implemented in society is insane because anyone who is paying attention to the state of AI today already knows how it will go wrong: without reading the article I already guessed that it systematically discriminated against certain demographics. Which was in fact what the article claimed.

It's interesting that it's possible to predict what the scorer would decide, but the moment you actually implement it is when all of the known problems become relevant, and the intellectual wonder must take a backseat to the human problems.


Teaching human-human communication by removing human inputs and having computers decide about quality... call me a skeptic. I feel bad for the students. Essay grading was bad enough before this.

Narrowly for grammar, however: is even that a good thing? It probably helps scale grammar help to more students, but if those tools became ubiquitous in grading and editing, then unique voices would just disappear, and a lot of potentially "great writers" might choose different careers because the machines don't like them.


Adding further bias against the underprivileged is not "cool". Implementing this while avoiding publicity or providing a means to publicly audit the results is doubly not cool.

It is fine to play with "cool" techniques when you are doing consequence-free stuff like playing Mario. When you are creating systems that have significant and long-term effects on people's lives, a different standard applies.


Research needs to be done on bias correction. It will then be better than a human, whose bias you cannot correct.


Based on the title alone, how would you feel if you were given bad marks due to a flawed black box?


Like how I felt when I was given low grades for my ugly handwriting. It was stupid to grade it, but it guaranteed that I would never get a top score in any literature class.


Yeah, I (as a dyslexic) got the same at school, having my handwriting mocked by the teacher.

It took me much longer to pass my English language O level (exam taken at 16).


The same as when I get bad marks due to a flawed human and rubric.


This is sort of like discovering the Excel spreadsheet at the heart of a system responsible for handling hundreds of millions of dollars of transactions for your bank.

Yeah, it's cool, but what about your savings account?


Unlike a multiple choice test where the primary audience is automated graders, the primary audience for an essay is other humans. If even Google and Facebook with their billions of dollars and billions of posts worth of data, still cannot always understand the intent and purpose of written content, what hope do these algorithms have?

If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests. Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.


If it is cost-prohibitive, then maybe we should adjust the economic model, not abandon the measurement.


Sure, have fewer essay test questions, and start grading them for content, not form.

If you want to grade on form to test the ability to write correct rather than coherent sentences, make those separate questions, and mark them so.


I mean that's what the automated grading systems are trying to do but it seems like people don't like them very much.


> If it is cost-prohibitive for every essay to be graded by humans, then they should be dropped from the tests.

Apparently it is. But everyone still wants writing to be assessed…


The SAT dropped the writing section a few years ago, and many schools don't care about the GRE writing score.


They did drop the writing but replaced it with an optional essay. Most upper tier schools require applicants to take the essay.


Harvard, MIT, Stanford, Caltech, Princeton, Yale, Dartmouth and many other schools do NOT require applicants to take the SAT essay.


Because they prefer an essay you can buy from a college applications consultant.


Do you know anyone who has gotten in without taking the SAT essay?


Finally! But then none of these new test takers will know what it feels like to get near-perfect scores on the other sections of the test, then completely bomb the written portion and ruin their overall score.


> Otherwise, we are missing the whole point of essays which is to communicate effectively with another human, not just match certain text patterns.

I agree, this is traditionally the purpose of an essay. But to play devil's advocate, consider the rising number of people who are writing SEO or ASO content which is actually targeted at machines.


“In most machine scoring states, any of the randomly selected essays with wide discrepancies between human and machine scores are referred to another human for review”.

And “between 5 to 20 percent” of essays are randomly selected for human review.

So the takeaway is that if you're one of the 80-95% of (typically black or female) people whom the machine scored dramatically lower but who are not selected for human review, your educational future is systematically fucked and you have no knowledge of why or how to change it.

Absolutely reprehensible. Anyone involved in the creation or adoption of these systems should be ashamed.


The thing is, you could be similarly screwed by a biased human whose grading is not checked by a less biased human.

At least the machines offer the following hope: even if unbiased humans are rare among paper-grading teachers, those humans can be used to train the machines, so then bias-free or lower-bias grading becomes more ubiquitous.

Basically, the system has the potential for systematically identifying and reducing systematic bias. A computer program can be retrained much more readily than a nationwide army of humans. Humans can be given a lecture on bias, and then they will just return to their ways.


AI has a lot more potential for bias than humans. It depends on the input data, which is likely heavily biased, judging by results on other datasets, like face detection. It will only amplify any small bias present in the data.


It's amazing to see how the general opinion of CS people has completely shifted in the last few years from "algorithmic scoring is important in removing the bias from human graders" to the exact opposite.


If we can quantify the bias in the machine, that gives us an opportunity to close the feedback loop and control the bias.
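
For instance, this kind of audit becomes a few lines once grading is automated (the data here is entirely hypothetical):

    # Compute the machine-vs-human score gap per demographic group, and
    # re-check it after every retraining. Doing this across a nationwide
    # pool of human graders is far harder.
    import pandas as pd

    df = pd.DataFrame({
        "group":   ["A", "A", "B", "B"],
        "machine": [3.0, 4.0, 2.0, 3.0],
        "human":   [3.0, 4.0, 3.0, 4.0],
    })
    means = df.groupby("group")[["machine", "human"]].mean()
    print(means["machine"] - means["human"])  # group B under-scored by 1 point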

The bias comes from the human-generated training data in the first place; the machine isn't introducing its own. For instance, the machine has no inherent concept of disparaging someone's language because it's from an identifiable inner city dialect. If it picks up that bias, at least it will apply it consistently. When we investigate the machine, the machine will not know that it's being investigated and will not try to conceal its bias from us.

On the other hand, eliminating bias from humans basically means this: producing a new litter of small humans and teaching them better than their predecessors.


If...


That was the hope, but all the most effective methods suffer from data collection bias, and the studies show that makes them worse than implicitly biased humans.


It is important, but it's not ready.


So will humans. With AI, the model and training data are auditable and, if necessary, modifiable. You can't audit a person's life history and you can only really modify a grader by replacing them, and it's much less socially damaging to retrain a model than to fire an employee.

AI certainly has a lot of potential for bias, but claims that AI bias is somehow worse than good old human bias always seem shoddily supported (Note I'm not claiming it's untrue. Just that it's never been shown to my satisfaction, which is not surprising given how quickly AI is changing. It may well be true.)


> but claims that AI bias is somehow worse than good old human bias always seem shoddily supported

Well, AI bias can combine sampling bias with human bias. Like say we train the AI with the output of only 10 human paper graders, all chosen from the same school district.

Due to the sampling bias, that data could create markedly more (or less!) bias than the entire population of human paper-graders.

The resulting AI will ideally mimic those 10 humans, though; it shouldn't show more bias than that group. If those 10 are flagrant racists and grade accordingly, the AI will be the same. (In fact, we hope that it will be the same, if the algorithm actually works at mimicking human grading.)


>Anyone involved in the creation or adoption of these systems should be ashamed

That's the problem - there is seemingly no shame these days. People involved "saved time and money", got paid and that's it. "If I didn't do it someone else would" and all of that.


Weapons of Math Destruction talks about this.


[flagged]


Did you read the article this thread is discussing? It cites multiple specific results wherein minority and female writers were underscored by the machine graders compared to human ones.


> It's quite funny how are some people manipulated to think that society and especially education is somehow biased against minorities or women when opposite is true

I think you have misunderstood the parent, who is asserting that the machines scoring these essays typically give lower scores to members of these demographics. This does not, by itself, mean the entire system is biased against those groups.

Whether or not the groups are, overall, systematically benefitted or harmed is not relevant to the injustice this article says exists.


The idea that black folks and females score dramatically lower on anything is a commonly held racist opinion that is foundational for the continuation of institutional oppression. When you assume the worst about people it ends up hindering their personal progress.


That's not what the article is saying, at all. There ARE general differences in style and word choice between minority groups or women and the average white male in writing. The corpus of training examples used in making this AI grader is at least biased towards the average white male. When the AI grades an essay harshly, it is not saying "this essay by a black woman is bad writing"; it is saying "this essay differs from my training set by an aggregate score of ____", then sorts those results and (maybe) applies a curve.

An essay could be different from the reference standard because the standard is an example of good writing and the essay is not. Or it can be different because the author has a cultural, regional, gender, or developed background that imparts a different style than anything in the training corpus. Mistaking the two is very, very bad.


They do when graded by a computer program as described in the article. Are you disputing the article's veracity? What is your point?


The links you provided don’t show what you claim.


That link talks about how black women are attending college in highest proportion due to extremely recent growth. It also says that despite their recently high educational attainment, black women are still underrepresented by a factor of 2 in private sector jobs, as compared to their college graduation rates.


Personal anecdote:

I remember taking a standardized test, can't remember if it was SAT or CSAT (Colorado pre-SAT test). This was at a time when I'm confident that humans were the graders.

I started with an intro that would be appropriate for a standard 5 paragraph essay; i.e. the thing you write when you don't know what you're talking about and you're just following a format.

In the third paragraph I took a leaf from Family Guy and just interjected "WAFFLES, NICE CRISPY WAFFLES, WITH LOTS OF SYRUP." For the next page and a half, I berated the very foundation of the essay prompt, insulting it the way only an angst-ridden early teen can.

... I got a 98% on the essay.

Fast forward several years. I wrote an essay for an introductory college course final. My paper was returned to me with a coffee stain and a "94% - good work!" note scribbled on the top. That note was scribbled by a TA who would turn out to be my girlfriend for 2 years. One night in bed, she tilts her laptop to me, showing an article that I used as the central theme of the above essay; "can you believe this?"

"Are you joking? Of course I can believe this, it was the subject of the essay you gave me an A on 2 years ago"

She admitted she didn't read past the first paragraph of anything she graded, and just based grades on intuition about how articulate the essays were at the outset.

...

The point I'm making:

Does AI suck at judging the amount of informative content in a student essay? YES

Do humans suck at judging the amount of informative content in a student essay? ALSO YES


This is a great example of why it's grossly irresponsible for members of the ML community to talk about how AGI is just around the corner. In addition to the fact that we have no idea whether this is true, it primes a naive public for believing that technologies like this are worth the tradeoff.

"People worry that computers will get too smart and take over the world, but the real problem is that they're too stupid and they've already taken over the world."


I imagine that any student that experimented with the form of the essay or wrote an exceptionally well argued piece in simple language would not have their test graded appropriately either.

Any essay writing test which could be adequately graded by a machine is not testing anything of value.

Edit: I’ll further add that as soon as people’s careers depend on a metric, the metric becomes useless as a metric, because it will be gamed and manipulated by everyone involved. Almost nobody involved is incentivized to accurately measure student’s writing ability.


I think machines could be valuable in giving feedback on writing, like grammarly.com does.

A lot of what students write is actually garbage from that point of view. Even if they happen to have a good basic idea about what they want to say, the point of essay writing is to master the mechanics of expression so that you get the idea across effectively.

Whether the student has a brilliant idea isn't even so important, and it wouldn't even be fair; imagine if high school computer science expected students to turn in a best-selling app for a term project. Not everyone can come up with something brilliant to say; and even relatively mundane lines of reasoning can be given a good treatment in writing to develop the skill.

I remember when I had essays graded in school, a lot of the comments were low-grade fluff like "run on sentence", "wrong word", "faulty parallelism", "missing colon before 'for example'" and such points having nothing to do with the content being original, well-considered and well-argued. That sort of thing might as well be done by machine, at least as a preprocessing step to improve a student's rough draft.


Almost nobody involved is incentivized to accurately measure students' writing ability

It's the same reason you see keyword posters in math education. "Together" means "plus", that kind of thing. It's completely worthless, except for one-step problems, and even then it doesn't always work. What is happening is collusion between teachers and testmakers. You can't teach understanding, but you can teach test-passing techniques because the way the test is set permits this.

You see the same thing here, in English you can get away with not teaching quality writing if you teach techniques to score well.


I feel like the mistake is assuming that essay writing is about the content. It's just a thing to give the student something barely non-trivial to write about.

When your essays are graded, they're marked down for mechanical and wording problems. There's really no point in trying to grade 'good ideas' on a subject piece you had maybe 10 minutes to skim.


a subject piece you had maybe 10 minutes to skim

That's a travesty, and you know it because when the kids are in college and they have as much time as they like to write their assignments they all use the wrong words and then misapply them.


If I have 3 left shoes colored blue green and red, and you have 2 right shoes colored black and white, how many pairs can we make if our lefts and rights are put together?

Hint: together does not mean plus.


There is value in the ability to produce correct English 'off the cuff'. You could argue essays are the best way to get students to produce off the cuff written text. Hence, it makes some sense to ask students for essays, and then judge those essays only for form.

However, it is rather important that students know their essays are not judged as essays, but only judged on form. Otherwise you teach students that form trumps content in essays.

When judging an essay as an essay correct English barely matters. What matters is how convincing you are, and how interesting of a read the essay is. This is a great skill to have, and testing it also makes sense. Really though, we should separate these two forms of testing.


To me this brings up the absurdity of having essays on standardized tests. What about an essay is standardized? It's a totally nonsensical premise.

This always gets made into some kind of techluminati conspiracy for the machines to ingrain structural racism whereas it's pretty clear all the algorithms fail to do is improve an already bad situation stemming from a flawed premise.


A number of states found out their schools were graduating students who genuinely could not read or write effectively. If you want to quantify that, you're forced to test it somehow. How would you test writing ability without asking them to write something?


Reading comprehension with simple factual questions.


Maybe it would be easier to test the teachers than the students then?


It's nuts when you put it that way. To really standardize an essay, you'd have to give the prompt and argument to be made and just test their ability to turn it into prose.


Any state that relies on the AI as the primary grader does not understand the current state of AI.

It would make sense to use the AI as a first pass, and then not randomly grade the essays with a human, but specifically choose all the essays that are on the cusp of the pass/fail line. Then use all those human-generated scores to update the model, especially if someone moves from pass to fail or fail to pass. Then maybe throw in a few of the really high and really low outliers to make sure those are right, and throw away your entire model if the human scores are drastically different (and obviously don't tell the humans what the computer score was, so they have no idea if they're reading a "cusp" essay or an outlier essay).
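
Something like this (pass mark and thresholds made up for illustration):

    # Triage: machine scores everything; humans re-grade anything near
    # the pass/fail line plus the extreme outliers, and those human
    # scores feed back into retraining the model.
    PASS, MARGIN = 60.0, 5.0

    def needs_human_review(score, low=20.0, high=95.0):
        on_the_cusp = abs(score - PASS) <= MARGIN  # near the pass/fail line
        outlier = score <= low or score >= high    # sanity-check the extremes
        return on_the_cusp or outlier

    machine_scores = [12.0, 58.0, 63.0, 75.0, 97.0]  # toy data
    print([s for s in machine_scores if needs_human_review(s)])
    # -> [12.0, 58.0, 63.0, 97.0]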

But putting the educational fate (and therefore future earnings) in the hands of an AI is unconscionable.


But I bet the company took the decision makers to a really nice restaurant nudge nudge


I think machine learned grading of papers is insane, but at the same time I don't think we should be training or encouraging students to speak in AAVE (as the article suggests).

I think the right approach for machine learned systems is to automatically "whitelist" essays rather than "blacklisting" them. Students in the middle of the distribution of essays aren't really interesting, so whitelist them, give them a pass. Those at the extremes can be either exceptional or terrible, but usually terrible. The judgement of those at the extremes should be decided by a human, not a machine. You wouldn't want to blacklist the Einstein of essays because he did something genius that is indistinguishable from insanity.

However, I think there are some essays that can automatically be blacklisted (a rough sketch follows the list). For example, those with:

1. Plagiarism (perhaps human moderated)

2. Extremely low word count

3. Extremely high count of fake words
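
A rough sketch of how rules 2 and 3 could work (thresholds invented; 'dictionary' is any word list):

    # Rule 1 (plagiarism) would need its own, human-moderated checker.
    def auto_blacklist(essay, dictionary, min_words=50, max_fake_ratio=0.2):
        words = [w.strip(".,;!?").lower() for w in essay.split()]
        if len(words) < min_words:       # rule 2: extremely low word count
            return True
        fake = sum(w not in dictionary for w in words)
        return fake / len(words) > max_fake_ratio  # rule 3: too many fake words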

And at the end of the day, these essay assignments aren't there to judge whether a student is the next writing sensation; they are given to judge whether the student can write legible sentences and words, to ensure they are prepared for the future. So perhaps it is at least possible to automatically blacklist on sentence structure and spelling (you should just lose points for invalid structure or invalid words; you shouldn't gain points for big words or complicated sentences). To make this fair, the student should be informed of this requirement. If they are informed and still fail, then they need to be remediated. If we discover that a disproportionate number of minorities are getting blacklisted, then we should investigate why the school is failing to teach them proper sentence structure and spelling, not pretend we can change the world to make AAVE an acceptable dialect of English in the workplace.


The underlying problem is that reading essays with a careful critical eye is not scalable. But another issue this highlights is the complete misalignment of incentives of the people who greenlit the adoption of this technology. Because educational outcomes are much harder to evaluate over the course of a bureaucrat's tenure than budget sizes (longer time horizon and many exogenous variables), there is a natural inclination to make decisions that reduce costs as long as they don't have any obvious (to them or their superiors) adverse outcome for students. This is a pretty low bar, especially so given that most bureaucrats do not have the background necessary to evaluate technical solutions.


I've heard stories from others in the industry of companies using tools like this on their human-facing documentation and requiring a certain score from them. Imagine using Microsoft Word's spelling and grammar checker, not being able to add or override its decisions (without following an extremely lengthy and bureaucratic process), and being required to have less than X "defects" per 100 words. Naturally, this results in documentation that is perfectly grammatical and free of spelling errors, but verbose, full of unusual phrasing, and next to useless for its actual purpose of informing a human.

Grading students' code using a machine is not such a bad idea in contrast, because in that case [1] no exceptions are possible in a programming language, [2] the machine (compiler) has to understand it anyway, and [3] it does save time verifying correctness. But communication in a human language really needs to be assessed by humans. Anyone who thinks "AI" can accurately assess human language is either severely delusional, or trying to make $$$ from it.


I am working on reducing the time teachers spend on exams and assessments. I have access to a cleaned and manually scored dataset of 550k essays that is growing exponentially. I looked at creating a model based on this dataset to automatically score essays with NLP parameters such as grammar, structure, spelling, word complexity, sentiment, relative text length, etc.

The problem that I encountered was actually how to apply it in a useful way, since the problems mentioned in the article are quite obvious when you design the model.

Options that I saw:

1. Use it as autonomous grading with optional review by the teacher, see the linked article for the problems with this.

2. Use it as a sanity check on the teacher's manual scoring, but it would not reduce the workload and would probably just undermine the teacher.

Do you have any suggestions for how such a model could be applied in a practical and ethical way?

Had some thoughts on how to measure actual knowledge about a subject, but that would require a massive knowledge graph which would introduce a huge amount of complexity just to see if it would be a feasible approach.


Here are some thoughts:

1. Instead of grading, maybe you can use it for training and tutoring. If a student is learning to write essays, I'm assuming it's hard for them to get any feedback.

2. But then there's probably not enough money to be earned there.

One trick might be to write an independent AI to summarize the essay back and see how closely it matches the essay title. This might weed out gibberish essays with sound English sentences.
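
Roughly (summarize() being a placeholder for whatever summarization model you have; embeddings via the sentence-transformers package):

    # Summarize the essay, then measure topical similarity between the
    # summary and the title via embedding cosine similarity.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def on_topic_score(title, essay_text, summarize):
        summary = summarize(essay_text)  # hypothetical summarizer
        title_emb, summary_emb = encoder.encode([title, summary])
        return float(util.cos_sim(title_emb, summary_emb))  # low -> off-topic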


Current Transformer models are looking pretty good at complex end-to-end tasks (at least, better than the shallow regression with hand-picked features that ETS probably uses). In a few years, complete end-to-end evaluation may not be so impossible, especially with so much data.


Such a stupid application of technology. It looks as if learning is completely out of fashion nowadays.

First of all, complaining about minorities getting lower grades because their English is not as sophisticated as that of others is the inversion of the idea of teaching. That feedback is actually great. We have machines that can give that feedback (e.g., grammarly)? Then use it to make everyone's writing better. Grades are just a measure of the success of learning, after all. I never got why one would not allow a student to repeat a particular test as often as they like, tbh.

Second, grading essays this way is a clear violation of the idea of teaching. What do you want the students to learn? Structure? Knowledge transfer? Grammar? Writing an essay is such a complex task that it is really too broad a goal. And then naturally grading becomes quite difficult.


While this is already terrible, I'm aware of a few projects that are trying to do the same with scientific literature. Basically they are trying to train models for scoring papers based on their quality, novelty, and whatnot. At the current rate and state of AI, I cannot ever imagine this is going to work.

It was a few weeks ago that someone shared "The Dark Age of AI" on HN [1]. I think we are over-promising even beyond what Drew McDermott warned we should not promise, to the extent that we are applying AI to assessing art, creativity, and even the quality and novelty of science, something that in a way we don't even understand (or try to understand) ourselves at the time that we are publishing it.

[1]: https://news.ycombinator.com/item?id=20546503


Grading... algorithms... for essays? How/why is that even a thing? That's absolutely insane. You can't grade someone's writing skills using algorithms. That is totally counter to providing a proper education. My mind is officially boggled.


Quality of education is proportional to quality of evaluation.

Evaluation of how well someone follows arbitrary language conventions is worse than useless.

I only got to university English 101 outside of some technical writing in the engineering department, but I have to say none of my education in writing was worth anything past elementary school. It is perhaps one of the most difficult things to teach and evaluate, to be fair, but I feel like I am missing a huge chunk of my education and general ability because of it. I can't write or form an argument particularly well, rambling on HN and the like is the closest thing to education I have had.

Prescriptive language rules are not entirely useless. That is the best you can say about them.


I would like to see how it scored on essays by great writers. “Sorry Mr Tolkien, I’m afraid you have to go to community college first.”


In my state, it's going to be "Sorry Mr Tolkien, but we eliminated all of the departments that are not STEM enough."


I'm normally pretty open-minded, but this is just stupid. AI is nowhere near literate enough for this task. What kind of world is it when humans create merely for the consumption of machines? The product of our creativity deserves better.

I would support any student who refuses to consent to their work being used in this fashion.


I wish this machine bias wasn't always presented in such divisive terms as race and "disadvantaged groups". It can affect anybody. If you happened to develop a writing style that looks like typical bad essay writers' style, then you could be hurt by bias in the grading.


Anyone can be hurt by bias, but minority groups are the obvious ones to be most likely statistically affected by it, making them the most obvious red flag for this kind of situation.


If an image processing algorithm fails to recognize black people or worse, profiles them, how else should this be described but in terms of race?

If you don't talk about the actual problem, how can you possibly expect to solve it?


In terms of the technical effect that causes it, as opposed to the sociological one.

The point then isn't that the algorithm hates black people or that the programmers are racist (even if they were, they would likely find it hard to train it to exclude specific groups accurately without major side effects), but that their training set or analysis is flawed - potentially even if it is representative. Even if the results are problematic, trying to address hate which isn't there won't be helpful. Call for better vetting procedures instead, or point out that it clearly isn't ready for the proposed application.

Say it gets the highest accuracy on a set of non-people images and a set of people photos which represents the exact ethnic makeup, all tagged just as "has people" or "doesn't have people". Going with features biased towards the most populous groups would get the best results quickest.

Talking about say image tagging to ensure automated performance checking across various characteristics could be productive on the other hand.


There are many classes of people who have problems of discrimination. Short, ugly, ginger, etc. The intersections of all those classes are so numerous that everybody will have some disadvantage. But it won't be apparent unless you define their class and measure it.


That's just substituting in smaller or harder-to-define minority groups, though.


From the article: "All essays scored by E-rater are also graded by a human and discrepancies are sent to a second human for a final grade. Because of that system, ETS does not believe any students have been adversely affected by the bias detected in E-rater."


Also from the article:

> Of those 21 states, three said every essay is also graded by a human. But in the remaining 18 states, only a small percentage of students’ essays—it varies between 5 to 20 percent—will be randomly selected for a human grader to double check the machine’s work.

So that applies only in a minority of cases.


Oops, my mistake! That's worse than I thought!


That particular company seems to do a not-horrible job. But they’re not the only game in town, so presumably most or many essays are graded by another company’s system.


> the engines also focus heavily on metrics like sentence length, vocabulary, spelling, and subject-verb agreement... The systems are also unable to judge more nuanced aspects of writing, like creativity.

This reminds me of a wonderful essay/speech by Stephen Fry on the harm done by pedantry. I also feel that schools focus so much on a single structure of essay writing and similarly take the joy out of language.

https://youtu.be/J7E-aoXLZGY


This is a natural development of industrialized education. Treating children as individual thinkers would require far more resources and manpower than our system would like to provide.


Bringing SEO mentality to standardized testing; what could go wrong?


Absolute garbage. Kids would be better educated by reading and posting on HN than they would by attending English classes in one of the states that uses these tools.


Here's a thought: if classwork and homework is getting so overwhelming that teachers can't possibly grade all of it, then it's overwhelming for the STUDENTS too, and they shouldn't freaking be assigning so much busywork. You don't need a 5 page essay to determine whether a kid has read a book. You can figure that out really quickly in a classroom discussion without anyone having to lift a pencil.


> Here's a thought: if classwork and homework is getting so overwhelming that teachers can't possibly grade all of it, then it's overwhelming for the STUDENTS too

There's no necessary connection there, especially if one of the reasons that teachers are being overwhelmed is that the student/teacher ratio is increasing.

> You don't need a 5 page essay to determine whether a kid has read a book.

No, you need it to determine whether a student has (1) read and understood a book well enough to apply structured thought to the contents and (2) has developed the writing skills to write a 5-page essay.

Determining whether a student read a book is rarely, on its own, of significant interest in school.


Reminds me of the plagiarism checker they had at my partner's university. It would check for identical words on specific subjects... meaning any words in any order, so naturally there was a high % of overlap, not only with quotes but also with the standard vocabulary of the subject. The teacher would take this literally as "you did not write this yourself" if 10% of the words were similar.

Don't think anyone passed that class.


I can't believe that anyone would try to automatically grade essays. This is either deeply cynical or astonishingly dumb.


Good lord, what a terrible design. Rather than determine if the writer has a coherent understanding of a complex prompt, the system grades based on writing patterns. This is actually my biggest fear of AI: deploying wide-scale systems like this that have very clear flaws.


I live in Poland and it is the first time I hear about it.

I am absolutely appalled.

Not even at the idea of grading by algorithm, but by the fact that many, many people had to cooperate to make this happen.


You say "flawed algorithm," I say "easily exploitable by intelligent students."


I don't even think I would be qualified to grade essays, let alone an algorithm!


Teachers talk back and may even unionize! Crappy AI is cheap and can't unionize.


It seems like these accumulated errors in the educational system and filters needed to get through it would create a market inefficiency that could be exploited by a firm willing to ignore degrees, grades, and test scores and judge for themselves whether a candidate can do the job they're being hired for.


Why are we even bothering to discuss this on this site?

Wouldn't it be better and less biased if we each wrote our own AI systems and had them discuss with each other instead?

(And we should publish our training data as well, of course)


Why are algorithms grading essays in the first place?


The sooner we get it out of our heads that this education system of ours is a meritocracy, the closer we'll get to actually creating a quality universal system.


What are teachers but flawed Algorithms?


It is becoming increasingly evident that the hubris for implementing AI is what is going to ruin everything.


Because a (likely unsophisticated) algorithm is grading the essays, there's probably a deterministic method to score well.

This seems like a terrible idea.

It's not a stretch to imagine the opportunity for nefarious behavior this allows - think of the recent college admission scandals, and how happy they'd be to have a guise of 'algorithmic indifference'.

If used long-term, it could offer a big advantage to the wealthy in other avenues. Another hypothetical, probably not far from reality: the algorithm becomes solved (almost or completely) by some premier 'tutoring' company. Said company can charge a pretty penny given its stellar track record, offering yet another hidden advantage to the wealthy/elite.


Surely there's a deterministic method to score well on the math questions?


An essay is to a grammar problem as a proof is to a math problem.


There's definitely a deterministic way to score well on HS level proofs. Also, I think you are overestimating the requirements for an essay on a standardized test.



