It's amazing to see how the general opinion of CS people has completely shifted in the last few years from "algorithmic scoring is important in removing the bias from human graders" to the exact opposite.
If we can quantify the machine's bias, that gives us an opportunity to close the feedback loop and correct for it.
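To make the "close the feedback loop" point concrete, here's a minimal sketch of one standard post-hoc technique: measure a bias metric (demographic parity gap here, though it's only one of several) and correct it with per-group decision thresholds. The data and names (`scores`, `groups`) are toy illustrations, not any real grading system:

```python
import numpy as np

def demographic_parity_gap(preds: np.ndarray, groups: np.ndarray) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def fit_group_thresholds(scores, groups, target_rate):
    """Per-group score cutoffs so each group passes at roughly target_rate."""
    return {g: np.quantile(scores[groups == g], 1 - target_rate)
            for g in np.unique(groups)}

# Toy data: raw model scores plus a group label for each example.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=1000)
scores = rng.normal(loc=groups * 0.3, scale=1.0)  # group 1 scores skew higher

naive = (scores > 0.5).astype(int)                # one global cutoff
print("gap before:", demographic_parity_gap(naive, groups))

# Re-threshold each group to the overall pass rate and re-measure.
thresholds = fit_group_thresholds(scores, groups, target_rate=naive.mean())
per_example_thr = np.array([thresholds[g] for g in groups])
adjusted = (scores > per_example_thr).astype(int)
print("gap after: ", demographic_parity_gap(adjusted, groups))
```

Equalizing demographic parity can conflict with other fairness criteria like calibration, so this isn't a complete fix. But that's the point: once the bias is a number, you can monitor it and regression-test it, which you can't do with a human grader.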
The bias comes from the human-generated training data in the first place; the machine isn't introducing its own. For instance, the machine has no inherent concept of disparaging someone's language because it's from an identifiable inner-city dialect. If it picks up that bias, at least it will apply it consistently. And when we audit the machine, it won't know it's being audited and won't try to conceal its bias from us.
On the other hand, eliminating bias from humans basically means this: producing a new litter of small humans and teaching them better than their predecessors.
That was the hope, but even the most effective methods suffer from data-collection bias, and studies show this can make them worse than implicitly biased humans.