danlenton's comments

We just initialize a random latent vector for each model, and then jointly train each of these unique latent vectors :)
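
To make this a bit more concrete, here's a minimal sketch of the idea (not our actual implementation; the dimensions and loss are illustrative). Each candidate model gets its own randomly initialized latent vector, and these latents are trained jointly with the scoring head against observed quality labels:

    import torch
    import torch.nn as nn

    class NeuralScorer(nn.Module):
        def __init__(self, num_models, prompt_dim=384, latent_dim=64):
            super().__init__()
            # one randomly initialised latent vector per candidate model
            self.model_latents = nn.Embedding(num_models, latent_dim)
            self.head = nn.Sequential(
                nn.Linear(prompt_dim + latent_dim, 128), nn.ReLU(),
                nn.Linear(128, 1),
            )

        def forward(self, prompt_emb, model_idx):
            latent = self.model_latents(model_idx)  # (B, latent_dim)
            return self.head(torch.cat([prompt_emb, latent], dim=-1)).squeeze(-1)

    scorer = NeuralScorer(num_models=5)
    optim = torch.optim.Adam(scorer.parameters(), lr=1e-3)  # latents are updated too
    prompt_emb = torch.randn(8, 384)         # stand-in prompt embeddings
    model_idx = torch.randint(0, 5, (8,))    # which model produced each response
    quality = torch.rand(8)                  # observed quality labels
    loss = nn.functional.mse_loss(scorer(prompt_emb, model_idx), quality)
    loss.backward()
    optim.step()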


Interesting, do you have any hunch as to why this is? In more verticalized apps where the underlying model is hidden from the user (sales call agent, autopilot tool, support agent, etc.), we've seen that the need for high quality on hard prompts and high speed on the remaining prompts makes routing an appealing option.


We charge users different amounts of credits based on the model used. They also just generally have a personal preference for each model. Some people love Claude, some hate it, etc

For something like a support agent why couldn't the company just choose a model like GPT-4o and stick with one? Would they really trust some responses going to 3.5 (or similar)?


Currently the motivation is mainly speed. For the really easy ones like "hey, how's it going?" or "sorry I didn't hear you, can you repeat?" you can easily send to Llama3 etc. Ofc you could do some clever caching or something, but training a custom router directly on the task to optimize the resultant performance metric doesn't require any manual engineering.
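
For illustration only (the model names and threshold below are placeholders, not what we actually ship), the routing decision on an easy prompt can be as simple as comparing the predicted quality of a cheap, fast model against a stronger one:

    # `predict_quality` stands in for whatever scoring function is used;
    # the 0.05 margin is made up.
    def route(prompt: str, predict_quality) -> str:
        cheap, strong = "llama-3-8b", "gpt-4o"
        q_cheap = predict_quality(prompt, cheap)
        q_strong = predict_quality(prompt, strong)
        # easy prompts: the cheap model is predicted to do (almost) as well
        if q_strong - q_cheap < 0.05:
            return cheap
        return strong

    chosen = route("hey, how's it going?", lambda p, m: 0.9)  # toy scorer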

Still, I agree that routing in isolation is not thaaat useful in many LLM domains. I think the usefulness will increase when applying to multi-step agentic systems, and when combining with other optimizations such as end-to-end learning of the intermediate prompts (DSPy etc.)

Thanks again for diving deep, super helpful!


One use case is optimizing agentic systems, where a custom router [https://youtu.be/9JYqNbIEac0] is trained end-to-end on the final task (rather than GPT4-as-a-judge). Both the intermediate prompts and the models used can then be learned from data (similar to DSPy), whilst ensuring the final task performance remains high. This is not supported with v0, but it's on the roadmap. Thoughts?


We do agentic systems. We already optimize for these things. We route between different models based on various heuristics. I absolutely would not want that to be black box. And doing any sort of vector similarity to determine task complexity is not going to work well.

I would also not try to emulate DSPy, which is a massively overrated bit of kit and of little use in a production pipeline.


Curious, can you explain why you think DSPy is overrated?


Thanks for sharing, will get this fixed now!


If you do test it out, feel free to ping me with any questions!


Thanks for weighing in. I'm sure for your setup right now, our router in its current form would not be useful for you. This is the very first version, and the scope is therefore relatively limited.

On our roadmap, we plan to support:

- an API which returns the neural scores directly, enabling model selection and model-specific prompts to all be handled on the client side

- automatic learning of intermediate prompts for agentic multi-step systems, taking a similar view as DSPy, where all intermediate LLM calls and prompts are treated as latent variables in an optimizable end-to-end agentic system.

With these additions, the subtleties of the model + prompt relationship would be better respected.

I also believe that LLMs will become more robust to prompt subtleties over time, and some tasks are less sensitive to the minor subtleties you refer to.

For example, if you have a sales call agent, you might want to optimize UX for easy dialogue prompts (so the person on the other end isn't left waiting), and take longer thinking about harder prompts requiring the full context of the call.

This is just an example, but my point is that not all LLM applications are the same. Some might be super sensitive to prompt subtleties, others might not be.

Thoughts?


I don't want/need any of that

It's already hard enough to get consistent behavior with a fixed model

If we need to save money we will switch to a cheaper model and adapt our prompts for that

If we are going more for quality we'll use a more expensive model and adapt our prompts for that

I fail to see any use case where I would want a third party choosing which model we are using at run time...

We are adding a new model this week and I've spent dozens of hours personally evaluating output and making tweaks to make it feasible

Making it sound like models are interchangeable is harmful


Makes sense. However, I would clarify that we don't need to make the final decision. If you're using the neural scoring function as an API, then you can just get predictions about how each model will likely perform on your prompt, and then use these predictions however you want (if at all). Likewise, the benchmarking platform [https://youtu.be/PO4r6ek8U6M] can be used just to assess different models on your prompts without needing to do any routing. Nonetheless, this perspective is very helpful.
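
To make the client-side usage concrete (the endpoint, payload and field names below are placeholders, not our actual API), it would look roughly like this: fetch per-model quality predictions for a prompt, then keep the final model choice and any model-specific prompts entirely on your side:

    import requests

    MY_PROMPTS = {"gpt-4o": "...", "llama-3-70b": "..."}  # your model-specific prompts

    def get_scores(prompt: str) -> dict:
        resp = requests.post(
            "https://api.example.com/v0/neural_scores",  # placeholder URL
            json={"prompt": prompt},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["scores"]  # e.g. {"gpt-4o": 0.91, "llama-3-70b": 0.84}

    scores = get_scores("Summarise this support ticket ...")
    model = max(scores, key=scores.get)   # or apply your own rules / ignore entirely
    prompt_template = MY_PROMPTS[model]   # the prompt stays under your control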


Thanks! Ipsos is also a great analogous example, I hadn't thought of that.


Any time... In the banking world, the analogue is JD Power. Useful and valuable insights from them.

https://canada.jdpower.com


duly noted!


It's on the roadmap! Hopefully it will be added next week


Yes, the benchmarks are ongoing; we continually plot the speed and cost across time in our runtime benchmarks [https://unify.ai/benchmarks], and we use this live data when plotting the quality scatter graphs [https://console.unify.ai/dashboard]. The router configurations are "self-improving" in the sense that any given router config will quickly wrap the latest models and providers under the hood. Using a router config is a way of riding the wave of models and providers, whilst just specifying your priorities for quality, speed and cost. We will have some case studies which better explain this soon!
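
To give a rough idea of what "specifying your priorities" can look like (the weights, fields and numbers below are purely illustrative, not our actual config format), a router config effectively scores each live endpoint against your quality/speed/cost trade-off and picks the best one as the benchmark data changes:

    candidates = [  # refreshed from live benchmark data
        {"endpoint": "gpt-4o@openai",    "quality": 0.92, "tokens_per_sec": 70,  "usd_per_1m": 5.00},
        {"endpoint": "llama-3-70b@groq", "quality": 0.85, "tokens_per_sec": 300, "usd_per_1m": 0.80},
        {"endpoint": "haiku@anthropic",  "quality": 0.80, "tokens_per_sec": 120, "usd_per_1m": 0.25},
    ]
    weights = {"quality": 1.0, "speed": 0.3, "cost": 0.2}  # your priorities

    def score(c):
        return (weights["quality"] * c["quality"]
                + weights["speed"] * c["tokens_per_sec"] / 300   # normalise speed
                - weights["cost"] * c["usd_per_1m"] / 5.0)       # normalise cost

    best = max(candidates, key=score)  # re-evaluated as live numbers change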


Currently, we simply use GPT4-as-a-judge, with a general system prompt we've written which is task agnostic. This is then used to train the neural scoring function, which predicts quality ahead of time. However, it's on our roadmap to make the judging more flexible, potentially with task-specific judge prompts and in-context examples, and perhaps using a jury [https://arxiv.org/pdf/2404.18796].
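
For context, a stripped-down version of the judging step looks roughly like this (the system prompt and 0-10 scale here are illustrative, not the exact prompt we use); the resulting scores become the labels for training the ahead-of-time scorer:

    from openai import OpenAI

    client = OpenAI()

    JUDGE_SYSTEM = (
        "You are grading the quality of an assistant's answer to a user prompt. "
        "Reply with a single integer from 0 (useless) to 10 (excellent)."
    )

    def judge(prompt: str, answer: str) -> float:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": JUDGE_SYSTEM},
                {"role": "user", "content": f"Prompt:\n{prompt}\n\nAnswer:\n{answer}"},
            ],
            temperature=0,
        )
        return float(resp.choices[0].message.content.strip()) / 10.0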

