
>a personal benchmark of 10 questions that resemble common tasks

That is an idea worth expanding on. Someone should develop a "standard" public list of 100 (or more) questions/tasks against which any AI version can be tested to see what the program's current "score" is (although some scoring might have to assign a subjective evaluation when pass/fail isn't clear).




That's what a benchmark is, and they all get gamed by everyone training models, even unintentionally, because the benchmarks end up in the training data.

The advantage of a personal set of questions is that you might be able to keep it out of the training set, if you don't publish it anywhere, and if you make sure cloud-accessed model providers aren't logging the conversations.
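For what it's worth, a private question set like this doesn't need much tooling. Here is a minimal sketch, assuming each task can be checked with a simple pass/fail function. The names `Task`, `run_benchmark`, and `fake_model` are made up for illustration; `fake_model` is only a stand-in for whatever local or API model call you actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    passes: Callable[[str], bool]  # pass/fail check applied to the model's answer


def run_benchmark(ask_model: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the fraction of tasks whose answer passes its check."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.passes(ask_model(t.prompt)))
    return passed / len(tasks)


# Hypothetical stand-in for a real model call; swap in your own.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I don't know"


# Keep the real list in a local file you never publish.
TASKS = [
    Task("What is 2 + 2?", lambda a: a.strip() == "4"),
    Task("Name the capital of France.", lambda a: "paris" in a.lower()),
]

score = run_benchmark(fake_model, TASKS)
print(f"{score:.0%}")
```

Tasks where pass/fail isn't clear would still need a manual or subjective grade, which this sketch doesn't attempt.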



