It's not built on high test scores. Academics do benchmark models on various tests, but most of the people who built up the hype did so based on their personal experience with a chatbot, not by running long (and expensive) tests on benchmark datasets.

The tests are used (and, despite their flaws, are useful) to compare various facets of model A against model B. However, the validation of whether a model is good now comes from users, and that validation can't really be flawed much: if the model is helpful (or not) to someone, then it is what it is. The proof of the pudding is in the eating.