Worse/less experienced developers see a much greater increase in output, while better, more experienced developers see much less improvement. AI is great at generating junior-level work en masse, but its output generally isn't up to senior-level quality and functionality standards. This is both what I've personally observed and what my peers have said as well.
Interesting paper, and lots of really well-done bits. As a senior dev who uses LLMs extensively: this paper was mostly using Copilot in 2023. I used it and ChatGPT in that timeframe, and took ChatGPT's output 90% of the time; Copilot was rarely good beyond very basic boilerplate for me in that period. Which might explain why it helped junior devs so much in the study.
Somewhat related: I have a good idea of what I can and cannot ask ChatGPT for, i.e. when it will and won't help. That is partially usage-related and partially dev-experience-related. I usually ask it not to generate full examples, only minimal snippets, which helps quite a bit.
Another factor not considered here may be that there are two uses of "senior dev" in this conversation so far: one refers to a person who has been asked to work on something they're very familiar with (the same tech stack, a similar problem they've encountered, etc.), whereas the other has been asked to work on something unfamiliar.
For the second use case, I can easily see how effectively prompting a model can boost productivity. A few months ago, I had to implement a Docker registry client and had no idea where to begin, but prompting a model, reviewing its code, and asking for corrections (such as missing pagination or parameters) allowed me to get the task done in an hour.
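For context on the kind of task this was, here is a hedged sketch of the paginated-listing part of a registry client, based on the OCI Distribution ("Docker Registry HTTP API v2") spec: the `/v2/_catalog` endpoint and RFC 5988 `Link`-header pagination are from the spec, but the registry URL and the exact client shape are placeholders, not the code the model produced.

```python
import re

def parse_next_link(link_header):
    """Extract the next-page path from a Link header such as
    '</v2/_catalog?last=foo&n=100>; rel="next"', or return None."""
    if not link_header:
        return None
    m = re.match(r'<([^>]+)>;\s*rel="next"', link_header)
    return m.group(1) if m else None

def list_repositories(get, base_url, page_size=100):
    """Yield all repository names, following Link-header pagination.
    `get` is any HTTP GET callable (e.g. requests.get) returning an
    object with .json() and .headers."""
    path = f"/v2/_catalog?n={page_size}"
    while path:
        resp = get(base_url + path)
        yield from resp.json().get("repositories", [])
        path = parse_next_link(resp.headers.get("Link"))
```

Missing this `Link`-header loop is exactly the "missing pagination" class of bug a review pass catches: without it the client silently returns only the first page.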
I often use GitHub Copilot at work, usually with o1-preview as the LLM. This isn't the "autocomplete," which generally uses a lower-end model; I almost exclusively use the inline chat. That said, I do also use the autocomplete a lot when editing. I might write a comment describing what I want to do and have it autocomplete, which is usually pretty accurate, and it also suits me since I liked Code Complete's comment-then-implement method.
For example, I needed to create a starting point for four LangChain tools that would use different prompts. They are initially similar, but I'll be diverging them. I would copy the file of one, select all, then use the inline chat to ask o1 to rename the file, rip out some stuff, and make sure the naming was internally consistent. Then I might attach an additional output-schema file, and maybe something else I want it to integrate with, and tell it to go to town. About 90% of the work is done right; then I just have to touch it up. (This specific use case is not typical, but it is an example where it saved me time: I had them scaffolded out and functional while listening to a keynote and in between meetings, and later that day I validated it. There were a handful of misses that I needed to clean up.)
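The pattern above (several near-identical tools that share a scaffold and differ only in name and prompt, to be diverged later) can be sketched in plain Python. This is an illustrative sketch only, with no LangChain dependency; the tool names and prompt templates are made-up placeholders, not the actual tools from the project.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptTool:
    """Minimal stand-in for a prompt-driven tool: a name plus a template."""
    name: str
    prompt_template: str

    def run(self, llm: Callable[[str], str], **kwargs) -> str:
        # Fill the template and hand it to whatever LLM callable is supplied.
        return llm(self.prompt_template.format(**kwargs))

# Four initially similar tools, scaffolded from one template; each can
# later diverge by editing its own prompt_template independently.
TOOLS = {
    name: PromptTool(name, f"You are the {name} tool. Input: {{input}}")
    for name in ("summarize", "classify", "extract", "rewrite")
}
```

With an identity function standing in for the LLM, `TOOLS["classify"].run(lambda p: p, input="x")` returns `"You are the classify tool. Input: x"`, which shows each tool is just the shared scaffold specialized by its prompt.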
See here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566
Out of curiosity, which LLM code tool do you use?