[flagged] ChatGPT is better at generating code for problems written before 2021 (ieee.org)
59 points by ummonk on July 7, 2024 | 46 comments


TFA actually has four key findings:

(1) ChatGPT is better at generating functionally correct code for problems written before 2021 than for problems written after 2021, across different languages, with a 48% advantage in Accepted rate on the judging platform, but ChatGPT's ability to directly fix erroneous code through a multi-round fixing process to achieve correct functionality is relatively weak;

(2) the distribution of cyclomatic and cognitive complexity levels for code snippets in different languages varies. Furthermore, the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets;

(3) in algorithm scenarios with C, C++, and Java, and in CWE scenarios with C and Python3, the code generated by ChatGPT has relevant vulnerabilities. However, the multi-round fixing process for vulnerable code snippets demonstrates promising results, with more than 89% of vulnerabilities successfully addressed; and

(4) code generation may be affected by ChatGPT's non-determinism, resulting in variations of code snippets in functional correctness, complexity, and security. Overall, our findings uncover potential issues and limitations that arise in ChatGPT-based code generation and lay the groundwork for improving AI- and LLM-based code generation techniques.


They used GPT-3.5, so these findings are irrelevant.


The AI community is worse than the JavaScript community about acting like everything needs to be thrown out every six months.

Aside from the very important fact that GPT-3.5 is still far and away the most frequently used LLM, it's not like GPT-4 has a completely different architecture with completely different characteristics. It's clearly better, but much of what they describe should generalize to LLMs as a whole (for example, knowledge cutoff dates matter a lot, and these things have likely memorized a lot more than we thought they did).


It doesn’t matter which model is most commonly used. If you want to make claims about LLM capabilities, you had better be using the best model available.


I don’t have an account and can’t read the article - but this is obvious to anyone with decent ML experience. Models are good with data they have seen. New unseen data is hit or miss. Most laypeople using AI these days could benefit greatly from this one piece of knowledge. If it’s never seen the kind of data you’re throwing at it, the likelihood of error goes up.


The distinction here is between 'the /kind/ of data you're throwing at it' and 'data you have already thrown at it.'

The result shows that the LLM has a hard time generalizing to new (basic!) problems that were not in its training set; this suggests it is memorizing solutions rather than understanding the mechanisms... It's the undergrad who has crammed Leetcode solutions for two weeks before their interview, rather than a student with a deeper understanding who can successfully address new problems.

Having a friend who is objectively pretty dumb but has memorized the entire existing literature of your field of study is still pretty damned useful, however. You just need to understand what kinds of questions you can trust them with.

Along these lines, my own best use of LLMs is looking up jargon and old results. I see some complicated problem, propose a method for attacking it, and ask my friend who has read every stats paper ever digitized whether they have seen this kind of analysis before, what it's called, why it's a bad idea, and what people do instead. The answers point me to relevant literature by giving me the right jargon and method names to search for using Good-Old-Fashioned-Google.


In addition to that -- maybe it's just my tinfoil hat speaking -- I think it calls into question whether benchmark numbers are really that meaningful. "The machine has a 90% success rate when we tell the machine the answer" is a bit weaker than how those numbers were presented.


I think it's becoming widely accepted that the current benchmarks aren't especially useful metrics for intelligence - rather, they're useful metrics for measuring how well a system can answer the benchmark questions.


Just like the SATs!


> The result shows that the LLM has a hard time generalizing to new (basic!) problems that were not in its training set; this suggests it is memorizing solutions rather than understanding the mechanisms [...]

I am once again reminded of Searle's "Chinese room" thought experiment [0] in which it is argued that computers cannot understand Chinese, merely memorize and regurgitate a canned set of responses to given prompts, and execute procedures by rote; that their computations cannot be "about" a subject matter (such as Chinese, or programming) the same way the operation and contents of our minds are "about" something.

[0] https://plato.stanford.edu/entries/chinese-room/


The GPT-4 model was able to draw a unicorn and, given a program that drew a unicorn, modify it so the unicorn was rotated 90 degrees.

https://arxiv.org/abs/2303.12712
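
Just to illustrate the kind of program-editing task that is, a toy sketch in Python (hypothetical, not the paper's actual TikZ code) - the model has to manipulate the structure of the drawing program, not just recall it:

    # Toy stand-in for "a program that draws a unicorn": a made-up list of outline points.
    unicorn = [(0, 0), (4, 0), (4, 2), (3, 3), (3.5, 4.5)]  # body, neck, horn tip

    def rotate_90(points):
        # Rotate each point 90 degrees counterclockwise about the origin: (x, y) -> (-y, x)
        return [(-y, x) for (x, y) in points]

    print(rotate_90(unicorn))  # the same outline, turned on its side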


Summarizing: LLMs are autocomplete, not thinking machines.


This is still too reductive, IMO. The kinda dumb colleague who has memorized all the literature is not just auto-complete. When I ask about some statistical technique that I didn't know the name of, being able to connect my description to the correct methods in the literature isn't 'just auto-complete.'

They just have a particular set of strengths and weaknesses, same as any tool. Figuring out their strengths and limitations is how you use them well. And dismissing them for not fully solving AGI is short-sighted.


It may be obvious, but it does resurface the question of how these models are supposed to absorb post-2021 information effectively when the open internet from that point forwards is increasingly being filled with low-quality AI slurry. Internet scrapes being a representative sample of human-created media was a one-time-only deal.


I agree with you, but then I think about it and... that's what humans do. Humans are gonna grow based on regurgitated internet stuff. Maybe forever? With or without AI.

We never used to have such an elaborate or accessible thing before. Kids and senior citizens likely engage constantly without even realizing it - and adopt one another's, idk the words, but I'm going for something akin to dialects/accents/language/opinions/etc

We're all a bunch of parrots


If the internet is being filled with AI generative content post-2021, then doesn't that just imply that the next generation of AI training on this "slurry" would be analogous to a "multi-round fixing" operation (as quoted above)?

While this is currently a relative weakness of genAI - assuming technological improvement of this technique over time, isn't it just as possible that the data quality will converge positively rather than negatively in the future? That is to say, the web would be consistently "refined" as time goes on by the predominant LLMs?

That's assuming the internet is even "filled" as you say in the first place (personally I don't think organically-generated content is ever going to be pushed out of the internet, but that's my opinion, and I'll entertain the opposite case for the sake of the discussion). It also assumes that people are using models trained on the current state of internet "slurry" in the first place - that we are continually ingesting more of the internet year over year into these models. If we come up with a better model that needs less data to produce high-quality content, neither my assertion nor yours is even relevant. Same goes if the internet just decides to use small, low-quality models trained on only a portion of the internet.

But if the internet is continually recycling the entirety of itself through a model that has tens of millions of dollars of funding and research focused on directly improving the quality of its answer metrics, it's not necessarily 100% locked into a downward quality-convergence slide. Especially if we assert that humans /will/ continue consistently putting more organic data onto the internet over time. It's a pessimistic take.


Perhaps there will be a day where human beings are paid specifically to train these models rather than by scraping existing corpuses. It would be pretty interesting if many of the same individuals put out of work by LLMs end up being employed to train those same models.


This will be an area where smaller, but more focused, models shine.


Given how much of "writing code" is yet another rewrite of something really interesting, I imagine it's still useful to a lot of programmers out there. The ones doing interesting work will probably know what it can and can't do, and those regurgitating decades-old solutions in the latest and shiniest framework will probably still benefit from it.


> Thus, in this study, we take the state-of-the-art ChatGPT (the default version of GPT-3.5), the recent popular product, as the representative of LLMs for evaluation.

Are they really publishing a paper based on GPT3.5 in July 2024? I am not sure these results are relevant in any way today.

Edit: Just for reference. The best model for coding today (according to most benchmarks) is Claude-3.5-Sonnet which is freely accessible. Also GPT-4o is freely accessible and is still vastly better than GPT-3.5.

The LMSYS arena coding leaderboard (https://chat.lmsys.org/?leaderboard) lists Sonnet-3.5 and GPT-4o jointly at #1 and GPT-3.5-Turbo at #35. You can freely download and run LLMs locally on your machine that are significantly better than GPT-3.5, for example Mistral Codestral.
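
For example, a rough sketch of that local route (assuming the ollama Python client and that Codestral is available in its model library; exact field names may differ by client version):

    # Rough sketch: ask a locally running Codestral for code via the ollama Python client.
    # Assumes `pip install ollama` and `ollama pull codestral` have already been run.
    import ollama

    response = ollama.chat(
        model="codestral",
        messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    )
    print(response["message"]["content"])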

There is really no reason to accept any results on GPT-3.5 as relevant today. This is as if you were complaining that a computer from the '00s doesn't run <recent operating system> well.


Yes, the version that the vast majority of people are using since it is free when you go to chatgpt.com


GPT-4o is the free version you will mostly use on chatgpt.com


Huh, I was fooled by marketing copy; the upgrade button says my free account has "Access to GPT-3.5, Limited Access to GPT-4o", but you're right, 4o is the default.


That is still the default, non-paid version of ChatGPT today, no?


It's not, the default one is GPT-4o.


GPT-4o is "limited" for non-paid accounts, though. So you're at the mercy of OpenAI if you get it or not.


Scientific studies take a while from inception to publishing.


That is why the ML/AI community usually shortcuts this process by publishing preprints on Arxiv.

A publication based on an LLM that was state-of-the art only until March of 2023 cannot be justified by long review times.

Edit: To be fair, it seems their preprint was first submitted in August 2023 and the IEEE article that is based on the paper was a bit slow...


But this isn't a preprint and this isn't on Arxiv.


4 months is plenty of time to rerun experiments using GPT4.


This is about GPT 3.5

> Thus, in this study, we take the state-of-the-art ChatGPT (the default version of GPT-3.5), the recent popular product, as the representative of LLMs for evaluation.


I think we should establish a precedent of flagging articles that refer to GPT-3.5 as just ChatGPT in the title.


GPT-3.5 is the default version when you land on chatgpt.com so I don't see why it is surprising. Blame OpenAI for calling the SOTA "ChatGPT+"


Isn't gpt-4o already the default for non-plus accounts?


I checked again and yes, it defaults to 4o; I was confused by the upgrade button, which says free accounts have limited access.


Who cares what is default? If you want to make claims about LLMs you use the best model.


Arxiv preprint version:

https://arxiv.org/abs/2308.04838


I only have access to the abstract, so maybe the full paper addresses this. What does it mean by problems before 2021? Does it mean date related questions? Problems relating to frameworks newer than 2021? Or are they classifying the technical problems themselves as somehow post 2021?


I linked to the preprint here: https://news.ycombinator.com/item?id=40899054


Makes sense, since the correct solutions need a little time to bubble to the top so they can be parroted correctly.

it rubs the lotion on its skin, ... now it places the lotion in the basket


[dupe]

More discussion on blog post: https://news.ycombinator.com/item?id=40897958


It's not a dupe if the previous post got erroneously flagged to death.


Erroneously? A number of people saw this already, same content, same discussion, more of it over there.


On an article that got flagged to death within 25 minutes of hitting the front page and therefore stands no chance of developing into a decent conversation. That's only long enough for surface-level reactions and hot takes.

HN is not a curated map of links to discussions, it's a forum, a set of link-discussion pairs. If a discussion fails to take off and receive "significant attention" [0] there's nothing wrong with resubmitting it, much as that may offend your sensibilities.

[0] https://news.ycombinator.com/newsfaq.html


Don't editorialize the title, follow the guidelines.


Floor slippery when wet



