This feels like an argument bigger than AI evaluations. All the points you raised could just as well be issues with humans evaluating other humans in an attempt to predict future outcomes.
They are not wrong, and predicting future outcomes is genuinely difficult and fraught with failure. But human evaluation of other humans feels more like a level playing field to me. A human is accountable for what he or she says or predicts about others, and is subject to interrogation and to social or legal consequences. Not so with AI, which sits outside all of these mechanisms; at least, many actors deploying AI do not seem to remain responsible for it and own its mistakes.
In my experience, we're really bad at holding humans accountable for their predictions too. That may even be a good thing, but I'm not convinced we would end up holding LLMs any less accountable for their predictions than we hold humans.