>The daydream machine then daydreamed some output text, as is its function.
These are not advertised as daydream machines.
They compete on accuracy against various benchmarks.
The average person who uses them does so with the expectation of accurate results, you know, as they are advertised. Accuracy and speed are pretty much the entire business model.
No, they generally do not compete on accuracy benchmarks afaik.
github.com/openai/simple-evals is what I checked here, and no, OpenAI does not compete on accuracy benchmarks as far as I can tell. So I'd be interested in seeing what led you to think that, and also what led you to claim earlier that anyone typing in the complainant's name saw the same hallucination.
>No, they generally do not compete on accuracy benchmarks afaik.
"Get Answers" is literally at the top of ChatGPTs landing page. You think the average person interprets that to mean "Get inaccurate answers"?
Google "AI benchmark" and almost every result is an assessment of the accuracy of various models. What do you think they compete on? How do you think they measure the improvement of one model to the next?
Pop this in Google and see the pages of results about accuracy: site:openai.com "accuracy". To claim that they don't optimize for accuracy confirms to me that you are not discussing this in good faith. Perhaps you are just trying to be contrarian or something, I don't know.
>and also what led you to earlier claim that anyone typing in the complainant's name saw the same hallucination.
Well, it says right in the article that different people received the same result.
Why are the goalposts moving? Actually, nevermind, I don't care to continue the conversation.
I don't know why I'm bothering. But notice how all of these explicitly mention accuracy? And how they are benchmarking the accuracy of the LLM against a known dataset? How accuracy is the primary metric they are evaluated on? Maybe it's because they are trying to improve the accuracy of the models...
First line of the abstract of MMLU: "We propose a new test to measure a text model's __multitask accuracy__."
Fourth line of the abstract of MATH: "To facilitate future research and __increase accuracy__ on MATH"
Second line of GPQA abstract: "We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach __65% accuracy__ [...] while highly skilled non-expert validators only reach __34% accuracy__"
Fifth line of the DROP abstract: "We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on __our generalized accuracy metric__"
From the MGSM paper: "MGSM __accuracy__ with different model scales."
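To spell out what every one of those papers is doing: take a dataset with known ground-truth answers, ask the model each question, and report the fraction it gets right. Here's a minimal sketch of that scoring loop; the dataset and `ask_model()` are made-up stand-ins, not the actual simple-evals code, and real harnesses add details like answer extraction and few-shot prompting, but the metric at the bottom is the same:

```python
def ask_model(question: str) -> str:
    """Stand-in for a call to the model under evaluation (hypothetical)."""
    canned = {"What is 2+2?": "4", "Capital of Norway?": "Oslo"}
    return canned.get(question, "I don't know")

# A tiny QA set with known ground-truth answers, like rows in MMLU or MATH.
dataset = [
    ("What is 2+2?", "4"),
    ("Capital of Norway?", "Oslo"),
    ("Boiling point of water at sea level in Celsius?", "100"),
]

# Accuracy = fraction of model answers that match the known ground truth.
correct = sum(ask_model(q).strip() == a for q, a in dataset)
accuracy = correct / len(dataset)
print(f"accuracy: {accuracy:.1%}")  # prints "accuracy: 66.7%" with the stubs above
```

That number, computed over a known dataset, is what the abstracts above are reporting and what each new model release is measured against.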
Models are designed to output accurate information in a reasonable amount of time. That's literally the whole goal. The entire thing. A math-specific model wants to provide accurate math answers. A general model wants to provide accurate answers to general questions. That's the whole point.
How much further can you move the goalposts? We're already almost on another planet.
You ignored almost everything in my original comment and hyper-focused on accuracy. Then, when confronted with the fact that every single example benchmark you provided is a measure of accuracy, you now say "well, it's not a benchmark about a specific person in Norway". Obviously not!
The MATH benchmark doesn't ask "what is 2+2", either. Your argument is "well, math-focused models aren't expected to accurately answer 2+2 because it isn't in the MATH benchmark". It's ridiculous.