The level of denial people are willing to sink into regarding how good GPT-4 is compared to everything else is truly crazy. Not a single other project comes within an order of magnitude of the quantitative and qualitative results (actual experiential results, not just benchmarks) that GPT-4 delivers.
I feel that there's significant insecurity among a lot of coders about GPT-4. Many of them ignore the pace of improvement and highlight the occasional cases where it gets things wrong.
I think there are a lot of people writing boilerplate programs who are going to be freed from these menial tasks (i.e. no more Java enterprise application development, thankfully).
GPT-4 is quite astounding. It might be wrong on occasion, but it will easily point you in the right direction most of the time. It still messes up, but at maybe a twentieth of the rate 3.5 did. Honestly, it's an incredible rubber duck for me: not only can I talk to it the way I'd talk to a rubber duck, I get fast, mostly informed feedback that unblocks me. If I have a bunch of things competing for my attention, I can ask GPT-4 about one of them, a hard one, go do something else while it types out its answer, and then come back later and move on with that project.
I'm a recent convert, been experimenting with converting one PL to another.
GPT-3.5 will get the gist of what the code is doing, and then provide what looks like a direct translation but differs in numerous details whilst having a bunch of other problems.
GPT-4 does a correct translation, almost every time.
It kills me that there's a waiting list for the API. I've put together some tools to integrate 3.5 into my workflow, and it helps a lot with my current task (for others it's useless). But to really shine it needs API access to 4.
I recently got access to 4 in the API, and it's good. It's much better, IMO, at following the system prompt too. Faster than what you see in ChatGPT, I think; not as fast as 3.5-turbo, but definitely less tedious.
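For anyone who hasn't used the API yet, here's a minimal sketch of what sending a system prompt to GPT-4 looks like. The payload shape follows the OpenAI chat completions API; the helper function, prompts, and the commented-out call are my own illustration, not an official snippet.

```python
# Sketch: building a chat-completions request with a system prompt.
# The helper and prompt text are illustrative; the payload shape
# mirrors the OpenAI chat API (check the current docs before relying on it).

def build_chat_request(system_prompt, user_prompt, model="gpt-4"):
    """Assemble the request payload; the system message steers behavior."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

# With the `openai` package you would send it roughly like this
# (needs an API key, so it's commented out here):
# import openai
# response = openai.ChatCompletion.create(**build_chat_request(
#     "You are a terse senior engineer.", "Review this function: ..."))
# print(response["choices"][0]["message"]["content"])
```

The point about system prompts is that 4 actually honors the role separation: instructions in the system message stick, where 3.5-turbo tends to drift.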
My only somewhat quantitative answer: I had 3.5 creating Ember templates and data, and it would get the templates mostly right after a couple of iterations of fixing errors, almost never on the first shot (if ever). Often it wouldn't quite get there in two, data structures would be nearly right but not quite, and it required a lot more care with the prompts. 4 gave me working output on the first try every time (except where it did things that are fine in a template but not currently supported by the custom framework I'm using them in), and it didn't need as much hand-holding.
Qualitatively, it's wildly different from gpt-3.5-turbo for discussions. 3.5 feels a little formulaic after a while with some kinds of questions. 4 is much more like talking to an intelligent person. It's not perfect, but I'm flipping between discussing a sporting thing, then medical malpractice, legal issues, technical specifications and it's doing extremely well.
If it's affordable for you, I'd really recommend trying it.
For code, a 90%+ reduction in errors is easily correct. For text and other content I can't say, but I would guess the gap isn't that large.
Anything involving reasoning, code, complex logic, GPT-4 is a breakthrough. GPT-3.5 turbo is more than good enough for poetry and the other text generation stuff.
My favorite part about GPT-4 is that if it generates code that is wrong, and you ask it to verify what it just wrote - without telling it whether it's wrong or not, much less pointing out the specific issue - more often than not it will spot the problem and fix it right away.
And yes, it does indeed make an amazing rubber duck for brainstorming.
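That self-review trick is easy to script as a second turn in the same conversation. A hedged sketch, assuming the chat-style message format; the follow-up prompt wording is my own, not a canonical prompt:

```python
# Sketch of the "ask it to verify its own answer" pattern: append the
# model's first reply to the conversation, then ask it to re-check,
# without saying whether or where anything is wrong.

VERIFY_PROMPT = "Carefully re-check the code you just wrote and fix any bugs you find."

def add_verification_turn(messages, assistant_reply):
    """Extend the conversation with the draft answer and a self-check request."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": VERIFY_PROMPT},
    ]

# First completion returns a draft; send the extended conversation back
# for a second completion, which often catches the bug unprompted.
```

The key detail is that you never point out the specific issue; GPT-4 usually finds it on the re-read anyway.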
GPT-4 is leaps ahead, and it's improving with every new release. The latest March 23 release is significantly better than the previous one and does a LOT of heavy lifting for my code at least.
At the very least, it's a massive productivity booster.
I've had decent success with Open Assistant, an open-source model. I'd say it's within an order of magnitude of ChatGPT on the prompts I'm looking at, including reasoning prompts. I believe this is due to the overwhelmingly clean data that OA has managed to acquire through human volunteers.
Maybe PaLM is near there (it's not evaluated on that page), but nothing else even comes close.