Sorry, but these people are not victims. I went through a tech PhD; it was well known how fast the wind changes and how the trendy topic falls by the wayside. Big data and crowdsourcing before the AI boom, then ML, then ethics of AI (around 2020-21), and now we are at AI agents, which is inherently more product-oriented. On top of this, we had COVID and now a brutal tech market, despite the US economy seemingly being way up and the AI craze continuing. If you went into a CS PhD thinking nothing changes and people will beg you to take cushy, well-paid positions with total freedom, then you simply didn’t do your research.
Wow. This actually disproves a key subtext of the match mentioned by some commentators: that Ding failed to convert winning positions to wins. Instead, it shows that Ding converted more often than Gukesh. The fact that Gukesh won seems more a statistical anomaly in light of this evidence. We are indeed probably post-hoc rationalizing the winner.
It doesn't really disprove anything. The problem with this type of analysis is that it's based on engines which are many levels above human play.
While watching the commentary, you will often see remarks from super GMs like "the engine suggests move XY, but it's not a move a human player would find/consider". The move may be optimal, but only if you're at that Stockfish 3600 ELO level, because you need to precisely execute a series of 3600 ELO moves to exploit it. A suboptimal move for a 3600 ELO player may be the optimal move for a 2800 ELO player, but Stockfish won't tell you that.
I'm not saying this analysis isn't interesting, but we shouldn't overinterpret it.
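To make that point concrete, here is a rough sketch of asking the engine for a move at full strength versus with its built-in "Skill Level" option turned down, using python-chess. The binary path, depths, and skill value are assumptions for illustration, not anything from the analysis being discussed:

    import chess
    import chess.engine

    ENGINE_PATH = "stockfish"  # assumed path to a local Stockfish binary
    board = chess.Board()      # substitute any position of interest

    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)

    # Recommendation at (near) full strength: deep search, no handicap.
    strong = engine.play(board, chess.engine.Limit(depth=25))

    # Recommendation with Stockfish throttled via its "Skill Level" UCI option
    # (0-20), which roughly approximates a weaker player's move choice.
    engine.configure({"Skill Level": 10})
    weak = engine.play(board, chess.engine.Limit(depth=10))

    engine.quit()
    print("Full-strength suggestion:", strong.move)
    print("Throttled suggestion:", weak.move)

The two suggestions won't always differ, but when they do it's a crude illustration of "best move" being relative to who has to follow it up.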
To add to this, part of what sets engines apart from humans is their understanding of time. The engine always knows whether it has time to complete an attack before the opponent can defend or counterattack - in other words, which player is truly attacking.
If you make a calculation mistake, suddenly your attack falters, and you may have sacrificed material and/or positional integrity that puts you critically behind or makes you vulnerable to counterattack.
This is part of how you get the narrative (in multiple games) that Ding got ahead but lost his nerve. The engine was saying he had time to attack, but he didn't have the certainty an engine does. He didn't immediately press that attack, and his opportunity disappeared.
Whenever you look at an analysis that leads you to a conclusion like this, your starting position must be that the analysis is wrong. ACPL and similar metrics are poor tools for evaluating this match in particular, where fatigue, time pressure, and psychological factors so clearly dominated.
Yes. To be honest, when the match was over, I was also left with the feeling that Ding did not capitalize enough on his opportunities. But later, after crunching the data, I saw that it was actually the other way around.
Sincerely, doesn't this make you question your methodology? It's such an obviously incorrect conclusion that you may as well have concluded that Ding actually won.
I feel like your over-reliance on engine stats like ACPL has led you to some conclusions that might have been true had Stockfish been playing Leela, but that really have little or nothing to do with humans playing chess.
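For anyone unfamiliar with the metric being argued about: ACPL (average centipawn loss) is usually computed by comparing the engine's evaluation just before and just after each of a player's moves and averaging the drop. A rough sketch using python-chess with a local Stockfish binary (the path, depth, and file name are assumptions for illustration):

    import chess
    import chess.engine
    import chess.pgn

    ENGINE_PATH = "stockfish"  # assumed local Stockfish binary
    DEPTH = 18                 # arbitrary analysis depth

    def white_acpl(game: chess.pgn.Game) -> float:
        """Average centipawn loss for White over one game (rough sketch)."""
        losses = []
        board = game.board()
        engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
        for move in game.mainline_moves():
            if board.turn == chess.WHITE:
                # Evaluation (from White's view) just before and after White's move.
                before = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
                eval_before = before["score"].white().score(mate_score=10000)
                board.push(move)
                after = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
                eval_after = after["score"].white().score(mate_score=10000)
                losses.append(max(0, eval_before - eval_after))
            else:
                board.push(move)
        engine.quit()
        return sum(losses) / len(losses) if losses else 0.0

    with open("game.pgn") as f:  # hypothetical PGN of one match game
        print("White ACPL:", white_acpl(chess.pgn.read_game(f)))

The criticism above still stands: this number measures how far each move fell short of a 3600-rated engine's choice, not how hard the position was for a human under the clock.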
“Do politics have artifacts?” was the rejoinder article. IMO that article should be as widely read as the main one, because it provides a warning to those who take the main one as gospel. Link: https://journals.sagepub.com/doi/abs/10.1177/030631299029003...
It's curious that Anthropic is entering the LLMOps tooling space. This definitely comes as a surprise to me, as both OpenAI and HuggingFace seem to avoid building prompt engineering tooling themselves. Is this a business strategy of Anthropic's? An experiment? Regardless, it's cool to see a company like them throw their hat into the LLMOps space beyond being a model provider. Interested to see what comes next.
ChainForge lets you do this, and also set up ad-hoc evaluations with code, LLM scorers, etc. It also shows model responses side-by-side for the same prompt: https://github.com/ianarawjo/ChainForge
There is a long-term vision of supporting fine-tuning through an existing evaluation flow. We originally created this because we were worried about how to evaluate ‘what changed’ between a fine-tuned LLM and its base model. I wonder if Vertex AI has an API that we could plug into, though, or if it’s limited to the UI.