My general sense is that, for research-level mathematical tasks at least, current models fluctuate between "genuinely useful with only broad guidance from the user" and "only useful after substantial, detailed user guidance", with the most powerful models having a greater proportion of answers in the former category. They seem to work particularly well for questions so standard that their answers can basically be found in existing sources such as Wikipedia or StackOverflow; but as one moves into increasingly obscure types of questions, the success rate tapers off (though in a somewhat gradual fashion), and more user guidance (or greater compute resources) is needed to get the LLM output into a usable form. (2/2)