
How would you recommend choosing the "best response" programmatically?


Ask each model to score and rank its own answer and each of the others'. It's AI turtles all the way down.
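
Roughly, as a sketch in Python, assuming a hypothetical query(model, prompt) helper that returns a text completion and placeholder model names: each model answers, every model judges every answer, and the highest mean score wins.

    # Hypothetical helper assumed: query(model, prompt) -> completion text.
    MODELS = ["model_a", "model_b", "model_c"]  # placeholder model names

    def best_response(user_prompt):
        # Collect one candidate answer per model.
        answers = {m: query(m, user_prompt) for m in MODELS}

        def score(judge, answer):
            rubric = (
                "Rate the following answer to the question on a 1-10 scale. "
                "Reply with the number only.\n\n"
                f"Question: {user_prompt}\n\nAnswer: {answer}"
            )
            try:
                return float(query(judge, rubric).strip())
            except ValueError:
                return 0.0  # judge didn't return a bare number

        # Pick the answer with the highest mean score across all judges.
        return max(
            answers.values(),
            key=lambda a: sum(score(j, a) for j in MODELS) / len(MODELS),
        )

Averaging across all judges should dampen any single model's self-preference somewhat, though the judges may still share blind spots.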


If you iteratively score, request an improvement, and resubmit, do the results converge to a stable score? If not, what do you think that means?
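
A sketch of that loop, again assuming a hypothetical query(model, prompt) helper; "converge" here just means the self-reported score stops moving between rounds:

    def refine_until_stable(model, user_prompt, rounds=5, eps=0.5):
        answer = query(model, user_prompt)
        prev = None
        for _ in range(rounds):
            raw = query(model, "Score this answer 1-10, number only:\n" + answer)
            try:
                s = float(raw.strip())
            except ValueError:
                break  # judge didn't return a bare number; stop early
            if prev is not None and abs(s - prev) < eps:
                return answer, s  # score has stabilized
            prev = s
            answer = query(model, "Improve this answer:\n" + answer)
        return answer, prev  # hit the round limit without converging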


In my experience, they aren't good at scoring their own work.


Someone should build this as a service


Once they're faster and cheaper, it'll probably end up a standard pattern taught in school.


I think a lot of local LLM benchmarks are evaluated by GPT-4, haha.


You wouldn't. As the originator of the prompt, the human user is the best judge of whether a response accurately captures their intent.


Or just choose the best response manually as a human.


This is what I do today.

I input the same prompt into all three and gauge the first response. Whichever assistant best "understands" what I want to accomplish is the one I continue the follow-up prompts with.

There is a bias here: my own lack of prompting technique may be why an assistant doesn't produce the best response. But I'm grading on a fair curve, since they all get the same input, and I see this as the core value proposition of an assistant.
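
The judging stays manual, but the fan-out step is scriptable. A minimal sketch, again assuming a hypothetical query(model, prompt) helper and made-up assistant names:

    ASSISTANTS = ["assistant_1", "assistant_2", "assistant_3"]

    def fan_out(prompt):
        # Same input to all three, so the comparison stays on a fair curve.
        for name in ASSISTANTS:
            print(f"--- {name} ---")
            print(query(name, prompt))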



