Sorry, but these people are not victims. I went through a tech PhD; it was well known how fast the wind changes and how the trendy topic falls by the wayside. Big data and crowdsourcing before the AI boom, then ML, then ethics of AI (around 2020-21), and now we are at AI agents, which is inherently more product-oriented. On top of this, we had COVID and now a brutal tech market, despite the US economy seemingly being way up and the AI craze continuing. If you went into a CS PhD thinking nothing changes and people will beg you to take cushy, well-paid positions with total freedom, then you simply didn’t do your research.
Wow. This actually disproves a key subtext of the match mentioned by some commentators: that Ding failed to convert winning positions to wins. Instead, it shows that Ding converted more often than Gukesh. The fact that Gukesh won seems more a statistical anomaly in light of this evidence. We are indeed probably post-hoc rationalizing the winner.
It doesn't really disprove anything. The problem with this type of analysis is that it's based on engines which are many levels above human play.
While watching the commentary, you will often see remarks from super GMs like "the engine suggests move XY, but it's not a move a human player would find/consider". The move may be optimal, but only if you're at that Stockfish 3600 ELO level, because you need to precisely execute a series of 3600 ELO moves to exploit it. A suboptimal move for a 3600 ELO player may be the optimal move for a 2800 ELO player, but Stockfish won't tell you that.
I'm not saying this analysis isn't interesting, but we shouldn't overinterpret it.
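To make that point concrete, here is a rough sketch of asking the engine for a move at full strength versus with its built-in "Skill Level" option turned down, using python-chess. The binary path, depths, and skill value are assumptions for illustration, not anything from the analysis being discussed:

    import chess
    import chess.engine

    ENGINE_PATH = "stockfish"  # assumed path to a local Stockfish binary
    board = chess.Board()      # substitute any position of interest

    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)

    # Recommendation at (near) full strength: deep search, no handicap.
    strong = engine.play(board, chess.engine.Limit(depth=25))

    # Recommendation with Stockfish throttled via its "Skill Level" UCI option
    # (0-20), which roughly approximates a weaker player's move choice.
    engine.configure({"Skill Level": 10})
    weak = engine.play(board, chess.engine.Limit(depth=10))

    engine.quit()
    print("Full-strength suggestion:", strong.move)
    print("Throttled suggestion:", weak.move)

The two suggestions won't always differ, but when they do it's a crude illustration of "best move" being relative to who has to follow it up.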
To add to this, part of what sets engines apart from humans is their understanding of time. The engine always knows whether it has time to complete an attack before the opponent can defend or counterattack - in other words, which player is truly attacking.
If you make a calculation mistake, suddenly your attack falters, and you may have sacrificed material and/or positional integrity that puts you critically behind or makes you vulnerable to counterattack.
This is part of how you get the narrative (in multiple games) that Ding got ahead but lost his nerve. The engine was saying he had time to attack, but he didn't have the certainty an engine does. He didn't immediately press that attack, and his opportunity disappeared.
Whenever you look at an analysis that leads you to a conclusion like this, your starting position must be that the analysis is wrong. ACPL and similar metrics are poor tools for evaluating this match in particular, where fatigue, time pressure, and psychological factors so clearly dominated.
Yes. To be honest, when the match was over, I was also left with the feeling that Ding did not capitalize enough on his opportunities. But later, after crunching the data, I saw that it was actually the other way around.
Sincerely, doesn't this make you question your methodology? It's such an obviously incorrect conclusion that you may as well have concluded that Ding actually won.
I feel like your over-reliance on engine stats like ACPL has led you to some conclusions that might have been true had Stockfish been playing Leela, but that really have little or nothing to do with humans playing chess.
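For anyone unfamiliar with the metric being argued about: ACPL (average centipawn loss) is usually computed by comparing the engine's evaluation just before and just after each of a player's moves and averaging the drop. A rough sketch using python-chess with a local Stockfish binary (the path, depth, and file name are assumptions for illustration):

    import chess
    import chess.engine
    import chess.pgn

    ENGINE_PATH = "stockfish"  # assumed local Stockfish binary
    DEPTH = 18                 # arbitrary analysis depth

    def white_acpl(game: chess.pgn.Game) -> float:
        """Average centipawn loss for White over one game (rough sketch)."""
        losses = []
        board = game.board()
        engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
        for move in game.mainline_moves():
            if board.turn == chess.WHITE:
                # Evaluation (from White's view) just before and after White's move.
                before = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
                eval_before = before["score"].white().score(mate_score=10000)
                board.push(move)
                after = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
                eval_after = after["score"].white().score(mate_score=10000)
                losses.append(max(0, eval_before - eval_after))
            else:
                board.push(move)
        engine.quit()
        return sum(losses) / len(losses) if losses else 0.0

    with open("game.pgn") as f:  # hypothetical PGN of one match game
        print("White ACPL:", white_acpl(chess.pgn.read_game(f)))

The criticism above still stands: this number measures how far each move fell short of a 3600-rated engine's choice, not how hard the position was for a human under the clock.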
“Do politics have artifacts?” was the rejoinder article. IMO that article should be as widely read as the main one, because it provides a warning to those who take the main one as gospel. Link: https://journals.sagepub.com/doi/abs/10.1177/030631299029003...
It's curious that Anthropic is entering the LLMOps tooling space. This definitely comes as a surprise to me, as both OpenAI and HuggingFace seem to avoid building prompt engineering tooling themselves. Is this a business strategy of Anthropic's? An experiment? Regardless, it's cool to see a company like them throw their hat into the LLMOps space beyond being a model provider. Interested to see what comes next.
ChainForge lets you do this, and also set up ad-hoc evaluations with code, LLM scorers, etc. It also shows model responses side-by-side for the same prompt: https://github.com/ianarawjo/ChainForge
There is a long-term vision of supporting fine-tuning through an existing evaluation flow. We originally created this because we were worried about how to evaluate ‘what changed’ between a fine-tuned LLM and its base model. I wonder if Vertex AI has an API that we could plug into, though, or if it’s limited to the UI.