Mixture-of-Experts models benefit from economies of scale: queries can be batched and processed in parallel, and different queries in a batch tend to hit different experts at a given layer, which raises GPU utilization. So unless your application is already getting a lot of traffic, you're probably under-utilizing your hardware.
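To make that concrete, here's a toy top-1 router in Python (numpy only; the gate matrix, shapes, and routing rule are made-up illustrations, not any real model's code). With a batch of one, a single expert fires and the rest of the layer's expert weights sit idle; as the batch grows, tokens spread across experts and more of the loaded parameters do useful work:

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts = 8
    d_model = 16

    # Hypothetical router: one linear gate for the layer, top-1 routing.
    gate = rng.normal(size=(d_model, n_experts))

    def active_experts(batch_size: int) -> int:
        """Route a batch of token embeddings; count distinct experts hit."""
        x = rng.normal(size=(batch_size, d_model))
        logits = x @ gate               # (batch, n_experts) router scores
        choices = logits.argmax(axis=1) # top-1 expert per token
        return len(set(choices.tolist()))

    for bs in (1, 4, 32, 256):
        print(f"batch={bs:4d} -> {active_experts(bs)}/{n_experts} experts active")

At batch size 1 you pay to keep all 8 experts in memory but only 1 does work per layer; at batch size 256 nearly all of them are busy, which is the economy of scale in question.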
Interestingly, I can't get ChatGPT to help me find a video showing me how to disable the cellular modem on my 2024 Subaru Crosstrek. Time to do some old-fashioned research, I guess...
"Process-oriented" verification has been a thing for a while in mathematical reasoning CoT. Google had a paper about it last year [1]. The key term to look for is "Process-reward model." I particularly like RL Tango [2].
I think it has to do with mental models. If you already know what to write and it's reasonably complex, you'll have a mental model ready and can quickly write it down (now even faster, with LLMs autocompleting 3-4 lines at a time). While reading someone else's code, you have to constantly map it against the model in your head, and then also evaluate quality, security, and other issues on top of that.
I just tend to find LLM code output extremely easy to read, I guess. It tends to be verbose and do a lot of unnecessary stuff, but I can always get the point easily and edit accordingly.