That’s very subjective and case dependent. I use local models most often myself with great utility and advocate for giving my companies the choice of using either local models or commercial services/APIs (ChatGPT, GPT-4 API, some Llama derivative, etc.) based on preference. I do not personally find there to be a large gap between the capabilities of commercial models and the fine-tuned 70b or Mixtral models. On the whole, individuals in my companies are mixed in their opinions enough for there to not be any clear consensus on which model/API is best objectively — seems highly preference and task based. This is anecdotal (though the population size is not small), but I think qualitative anec-data is the best we have to judge comparatively for now.
I agree scoreboards are not a highly accurate ranking of model capabilities for a variety of reasons.
If you're using them mostly for stuff like data extraction (which seems to be the vast majority of productive use so far), there are many models that are "good enough" and where GPT-4 will not demonstrate meaningful improvements.
It's complicated tasks requiring step by step logical reasoning where GPT-4 is clearly still very much in a league of its own.
I agree scoreboards are not a highly accurate ranking of model capabilities for a variety of reasons.