
How would you recommend choosing the "best response" programmatically?


Ask each model to score and rank its own answer and each of the others'. It's AI turtles all the way down.
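
Roughly, as a sketch in Python, assuming a hypothetical query(model, prompt) helper that returns a text completion and placeholder model names: each model answers, every model judges every answer, and the highest mean score wins.

    # Hypothetical helper assumed: query(model, prompt) -> completion text.
    MODELS = ["model_a", "model_b", "model_c"]  # placeholder model names

    def best_response(user_prompt):
        # Collect one candidate answer per model.
        answers = {m: query(m, user_prompt) for m in MODELS}

        def score(judge, answer):
            rubric = (
                "Rate the following answer to the question on a 1-10 scale. "
                "Reply with the number only.\n\n"
                f"Question: {user_prompt}\n\nAnswer: {answer}"
            )
            try:
                return float(query(judge, rubric).strip())
            except ValueError:
                return 0.0  # judge didn't return a bare number

        # Pick the answer with the highest mean score across all judges.
        return max(
            answers.values(),
            key=lambda a: sum(score(j, a) for j in MODELS) / len(MODELS),
        )

Averaging across all judges should dampen any single model's self-preference somewhat, though the judges may still share blind spots.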


If you iteratively score, request an improvement, and resubmit, do the results converge to a stable score? If not, what do you think that means?
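
A sketch of that loop, again assuming a hypothetical query(model, prompt) helper; "converge" here just means the self-reported score stops moving between rounds:

    def refine_until_stable(model, user_prompt, rounds=5, eps=0.5):
        answer = query(model, user_prompt)
        prev = None
        for _ in range(rounds):
            raw = query(model, "Score this answer 1-10, number only:\n" + answer)
            try:
                s = float(raw.strip())
            except ValueError:
                break  # judge didn't return a bare number; stop early
            if prev is not None and abs(s - prev) < eps:
                return answer, s  # score has stabilized
            prev = s
            answer = query(model, "Improve this answer:\n" + answer)
        return answer, prev  # hit the round limit without converging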


In my experience, they aren't good at scoring their own work.


Someone should build this as a service


Once they're faster and cheaper, it'll probably end up a standard pattern taught in school.


I think a lot of local LLM benchmarks are evaluated by GPT-4, haha.


You wouldn't. As the originator of the prompt, the human user is the best judge of whether a response accurately captures their intent.


Or just choose the best response manually as a human.


This is what I do today.

I input the same prompt into all three and gauge the first response. Whichever assistant best "understands" what I want to accomplish is the one I continue the follow-up prompts with.

There is a bias here: my own lack of prompting technique may be why an assistant doesn't produce the best response. But I'm grading on a fair curve, since they all get the same input, and I see this as the core value proposition of an assistant.
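
The judging stays manual, but the fan-out step is scriptable. A minimal sketch, again assuming a hypothetical query(model, prompt) helper and made-up assistant names:

    ASSISTANTS = ["assistant_1", "assistant_2", "assistant_3"]

    def fan_out(prompt):
        # Same input to all three, so the comparison stays on a fair curve.
        for name in ASSISTANTS:
            print(f"--- {name} ---")
            print(query(name, prompt))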



