As a developer but also a product person, I keep trying to use AI to code for me. I keep failing: because of context length, because of shit output from the model, because of the lack of any kind of architecture, etc. etc. I'm probably dumb as hell, because I just can't get it to do anything remotely more useful than helping me with leetcode.
Just yesterday I tried to feed it a simple HTML page to extract a selector. I tried it with GPT-4-turbo, I tried it with Claude, I tried it with Groq, I tried it with a local Llama 2 model with a 128k context window. None of them worked. This is a task that, while annoying, I do by hand in about 10 seconds.
Sure, I'm open to the possibility that sometime between the next 2-3 days and a couple of years from now, I'll no longer do manual coding. But honestly, after so much of it, I'm starting to grow a bit irritated with the hype.
Just give me a product that works as advertised and I'll throw money your way, because I have a lot more ideas than I have code throughput!
It's worth pointing out that on their eval set for "issues resolved" they are getting 13.86%. While visually this looks impressive compared to the others, anything that only really works 13.86% of the time, when the verification of the work takes nearly as much time as the work would have anyway, isn't useful.
The problem with this entire space is that we have VC hype for work that should ultimately still be being done in research labs.
Nearly all LLM results are completely mind blowing from a research perspective but still a long way from production ready for all but a small subset of problems.
The frustrating thing, as someone who has been working in this space a while, is that VCs want to see game-changing products ship overnight. Teams working on the product-facing end of these things are all being pushed insanely hard to ship. Most of those teams are solving problems never solved before, but are given deadlines as though they are shipping CRUD web apps. The kicker is that despite many teams doing all of this, because the technology still isn't there, they still disappoint "leadership". I've personally seen teams working nights and weekends, implementing solutions to never before seen problems in a few weeks, and still getting a thumbs down when they cross the finish line.
To really solve novel problems with LLMs will take a large amount of research, experimentation and prototyping of ideas, but people funding this hype have no patience for that. I fear we'll get hit by a major AI winter when investors get bored, but we'll end up leaving a lot of value on the table simply because there wasn't enough focus and patience on making these incredible tools work.
> It's worth pointing out that on their eval set for "issues resolved" they are getting 13.86%. While visually this looks impressive compared to the others, anything that only really works 13.86% of the time, when the verification of the work takes nearly as much time as the work would have anyway, isn't useful.
Yeah, I remember speech recognition taking decades to improve, and being more of a novelty - not useful at all - even when it was at 95% accuracy (1 word in 20 wrong). It really had to get almost perfect until it was a time saver.
As far as coding goes, it'd be faster to write it yourself and get it right first time rather than have an LLM write it where you can't trust it and still have to check it yourself.
You can't compare the accuracy of speech recognition to LLM task completion rates. A nearly-there yet incomplete solution to a Github issue is still valuable to an engineer who knows how to debug it.
Sure, and no doubt people paying for speech recognition 25 years ago were finding uses for it too. It depends on your use case.
A 13% success rate is both wildly impressive and also WAY below the level where I would personally find something like this useful. I can't even see reaching for a tool that I knew would fail 90% of the time, unless I was desperate and out of ideas.
I disagree. I think about this a bit as having a developer intern, on whom I can't rely to take much of a workload, and definitely nothing on the critical path, but I could say to them "Take a look at these particular well-defined tasks on the backlog and see which ones you could make some progress on" - I feel there's good value in that.
And the nice thing about an AI here is that I think it will actually find a different subset of these tasks to be easy than a human would.
Yeah, but a developer intern already has human-level AGI to support the on-the-job developer training you are going to help give them. Any LLM available today, or probably in the next 5-10 years for that matter, has neither AGI nor the ability to learn on the job.
My experience of working with interns, or low-skill developers, is that the benefit normally flows one way. You are taking time out from completing the project to help them learn. Someone/something of low capability isn't going to be relieving you of the large or complex tasks that would actually be useful, and be a time saver - they are going to try to do the small/simple tasks you could have breezed through, and suck up a lot of your time having to find out and explain to them how they messed up. Of course Devin doesn't even have online learning, so he'd be making the same mistakes over and over.
> A nearly-there yet incomplete solution to a Github issue is still valuable to an engineer who knows how to debug it.
Not sure if I can agree. There would definitely be a value in looking at what libraries the solution uses, but otherwise it may be easier to write it oneself, especially when the mistakes are not humanlike.
I can see this being useful already (assuming context length is not an issue) as some sort of github service trying to solve github issues throughout the day.
Or, for example, if you commit TODOs in your code, the AI will pick up on them and give you some options later on.
If the success rate is 14%, just let it try a bunch of times. (half joking here)
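Half joking aside, the arithmetic is easy to sketch (a back-of-the-envelope, assuming attempts are independent, which they almost certainly aren't):

    # Probability of at least one success in n independent attempts,
    # given a 13.86% per-attempt resolution rate.
    p = 0.1386
    for n in (1, 5, 10, 20):
        print(n, round(1 - (1 - p) ** n, 2))
    # roughly: 1 -> 0.14, 5 -> 0.53, 10 -> 0.78, 20 -> 0.95

In practice the attempts on a given issue are correlated, and you still pay the review cost for every wrong attempt, which is the point above.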
The way I see it, at least the project issues are getting some attention, which is arguably better than no attention. If it can just fix simple things, at least you can focus on the complex things and not worry about postponing the low-hanging fruit.
I.e.: maybe rather than throwing it at finding problems, also use it to just build things from scratch. As an example:
The side hustle: make that a product:
Let someone like me who is not a coder have access to Devin for a month with the only goal of building a side hustle that brings a solo person a monthly income.
Then - sell that, so that the millions of people who have a solo idea and just need that "technical co-founder" can use it to build. And limit it to one Devin instance per person to start...
I don't want it to do everything in one fell swoop - I'd like to say "build this module...
--
Have a contest: what can you build with Devin in TOPIC in 30 minutes?
I could perhaps see more value for this, at this level of capability, in writing test cases, for projects that are set up in a way that lets the tests be run and give feedback.
This would be useful in cases where test coverage is incomplete, maybe for auto-discovering/confirming bugs, and would really be needed if it's trying to fix bugs itself, especially if one dared let it commit bug fixes - would want to know that the fix worked and didn't break anything else (regression test - run other test cases too).
Even now, automatic speech recognition is a big timesaver, but you _need_ a human to look through the transcript to pick out the obviously wrong stuff, let alone the stuff that's wrong but could be right in context.
Agreed, and I think that many of the problems that people think LLMs will become capable of, in fact require AGI.
It may well turn out that LLMs are NOT the path to AGI. You can make them bigger and better, and address some of their shortcomings with various tweaks, but it seems that AGI requires online/continual learning which may prove impossible to retrofit onto a pre-trained transformer. Gradient descent may be the wrong tool for incremental learning.
At least in theory we can achieve incremental learning by training from scratch every time we get some new training data. There are drawbacks to this approach, such as inconsistent performance across training runs and significantly higher training cost, but it's achievable. Now the question is whether there exist methods more efficient than gradient descent. I think it's very clear by now that there is no other algorithm in sight that could achieve this level of intelligence without gradient descent at its core; the problem is just how gradient descent is used.
The obvious alternative to gradient descent here would be Bayes Formula (probabilistic Bayesian belief updates), since this addresses the exact problem that our brains evolved to optimize - how to utilize prediction failure (sensory feedback vs prediction) to make better predictions - better prediction of where the food is, what the predator will do, how to attract a mate, etc. Predict next word too (learn language), of course.
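To make the contrast concrete, a single Bayesian belief update is cheap and inherently incremental - here's a toy sketch (illustrative numbers only, nothing to do with how the cortex actually implements it):

    # Toy Bayesian belief update: revise a prior over two hypotheses
    # after one piece of sensory evidence (a rustling sound).
    prior = {"predator": 0.1, "wind": 0.9}
    likelihood = {"predator": 0.8, "wind": 0.3}   # P(rustle | hypothesis), made-up numbers
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
    print(posterior)  # belief shifts toward "predator" (~0.23 vs ~0.77), no gradient step needed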
I don't think pre-training for every update works - it's an incredibly slow and expensive way to learn, and the training data just isn't there. Where is the training data that could train it how to do every aspect of any job - the stuff that humans learn by experimentation and experience? The training data that is available via text (and video) is mostly artifacts - what someone created, not the thought process that went into creating it, and the failed experiments and pitfalls to avoid along the way, etc, etc.
It would be nice to have a generic college-graduate pre-trained AGI as a starting point, but then you need to take that and train it how to be a developer (starting at entry level, etc), or for whatever job you'd like it to do. It takes a human years of practice to get good at jobs like these, with many try-fail-rethink experiments every day. Imagine if each of those daily updates took 6 months and $100M to incorporate?! We really need genuine online learning where each generic graduate-level AGI instance can get on-the-job training and human feedback and update its own "weights" continually.
> The obvious alternative to gradient descent here would be Bayes Formula
If you know a little about the math behind gradient descent you can see that an embedding layer followed by a softmax layer gives you exactly the best Bayes estimate. If you want a bit of structure, like every word depends on previous n words, you get a convolutional RNN which is also well studied. These ideas are natural and elegant but maybe a better idea is to comprehend the research already done to avoid diving into dead ends too much.
No, I don't "want a bit of structure" ... I want a predictive architecture that supports online learning. So far the only one I'm aware of is the cortex.
Not sure what approaches you are considering as dead ends, but RNNs still have their place (e.g. Mamba), depending on what you are trying to achieve.
> I've personally seen teams working nights and weekends, implementing solutions to never before seen problems in a few weeks, and still getting a thumbs down when they cross the finish line.
This is an important lesson that all SWEs should take to heart. Nobody cares about your novel algorithm. Nobody cares about your high availability architecture. Nobody cares about your millisecond network latency optimizations. The only thing that anyone actually using your software cares about is "Does the screen with lights and colors make the right lights and colors that solve my problem when I click on it?". Anything short of that is yak shaving if your role is not pure academic R&D.
I wish this were the case. The amount of time I spend trying to talk principal engineers out of massive refactors because we want to get this out soon is near criminal.
Sure, there's a tendency, even among relatively senior developers, to want to rewrite things to make them better, and it's certainly faster to put a band aid on it if you need to ship something fast.
The thing is, though, that technical debt and feature creep (away from the flexibility anticipated by the original design) are real, and sometimes a rewrite or refactor is the right thing to do - necessary so that simple things remain simple to add, and to be able to continue shipping fast. It just takes quite a bit of experience to know when NOT to rewrite/refactor and when to do it.
Agreed on the lack of value for 13.86% correctness — I noticed that too. This reminds me a little of last year's hype around AutoGPT et al (at around the same time of year, oddly enough); it's very promising as a measure of how far we've come since just a few years ago when that metric would be 0%, but it doesn't seem super usable at the moment.
That being said, something is definitely coming. 50% correctness would probably be well worth using — simple copy/paste between my editor and GPT4 has been useful for me, and that's much less likely to completely solve an issue in one shot — and not only will small startups doing finetunes be grinding towards better results... The big labs will be too, and releasing improved foundation models that the startups can then continue finetuning. I don't think a new AI winter is on the horizon yet; Meta has plenty of reason to keep pushing out better stuff, both from a product perspective (glasses) and from an efficiency perspective (internal codegen), and OpenAI doesn't seem particularly at risk of stopping since Microsoft is using them both to batter Google on search (by having more people use ChatGPT for general question answering than using Google search), and to claw marketshare from Amazon in their cloud offerings. Similarly, some AI products have already found product/market fit; Midjourney bootstrapped from 0 to $200MM ARR (!) for example, purely on the basis of monthly subscriptions, by disrupting the stock image industry pretty convincingly.
Machines currently are at an amateur level, but amateurs across the board on the knowledge base.
Amateurs at Python, Fortran, C, C++ and all programming languages. Amateurs at car engineering, airplane engineering, submarine engineering etc. Amateurs at human biology, animal biology, insect biology and so on.
I don't know anyone who is an amateur at everything.
> Machines currently are at an amateur level, but amateurs across the board on the knowledge base.
No, and that is one of their limitations. LLMs are human-level or above on some tasks - basically on what they were trained to do - generating text, and (at least at some level) grokking what is necessary to do a good job of that. But they are at idiot level on many other tasks (not to overuse the example, but I just beat GPT-4 at tic-tac-toe since it failed to block my 2/3-complete winning line).
Things like translation and summarization are tasks that LLMs are well suited to, but these also expose the danger of their extremely patchy areas of competence (not just me saying this - Anthropic CEO recently acknowledged it too). How do you know that the translation is correct and not impacted by some of these areas of incompetence? How do you know that the plausible-looking summary is accurate and not similarly impacted?
LLMs are essentially by design ("predict next word" objective - they are statistical language models, not AI) cargo-cult technology - built to create stuff that looks like it was created by someone who actually understands it. Like (origin of term cargo-cult) the primitive tribe that builds a wooden airplane that looks to them close enough to the cargo plane that brings them gifts from the sky. Looking the same isn't always good enough.
Also take a look at rare diseases and doctors [1], where machines are already better at diagnosing thousands of different rare diseases. Is it fair to say that machines are better at diagnosing diseases in general, just because they diagnose rare diseases, each of which a doctor will only need to diagnose once or twice in their career? Not clear at all.
Right now we are constrained by data, but that constraint will go away in 5 years or so. Will AGI be achieved by then effortlessly? I have my doubts. My sentiment is that, even if AGI is never achieved, every small advancement in reasoning ability, in context window, in multimodal sensors and actuators, will have a very broad effect on jobs, on the economy, and on the way we are currently producing anything.
They cannot make a submarine themselves, or design it, but when they reach 50 percent, they will reach 50% at everything.
In submarine engineering, they will be able to design and construct it in some way, like 3D printing it, and the submarine will be able to move in the water for some time before it sinks. Yeah, probably for submarines a higher percentage should be achieved before they are really useful.
> The problem with this entire space is that we have VC hype for work that should ultimately still be being done in research labs.
I also have two crypto-bro friends who are hyping it up without having anything to show for it. Which is why I'm sort of complaining about the hype surrounding it. I agree with your post to a large extent. This is not production-ready technology. Maybe tomorrow.
LLMs are quite good at text based tasks such as summarization and extracting entities.
These generally don't require advanced logic or thought, though they can require some moderate reasoning ability to summarize two slightly conflicting text extracts.
Lots of corporate work would be enhanced by better summarization, better information dissemination, and better text extraction. Most of it is pretty boring work, but there's a lot of it.
VC hype seems to mostly focus on fantastical problems, though, which sound impressive at dinner parties but don't actually work well.
If you're a VC, do you want to talk about your investment in a company that finds discrepancies in invoices, or one that self-writes consumer iPhone apps?
HN doesn't allow posting AI content, but I tried pasting that into Gemini and it did fine. I saw no errors; maybe it missed some important details, but everything I checked matched the article, and the details it kept seemed like a good summary.
Here is what it wrote, didn't have enough tokens for the last 20% of the article though:
A Longstanding Partnership: The collaboration began in 2014 after the pro-Russian government was ousted in Ukraine. The CIA was initially cautious due to concerns about trust and provoking Russia.
Building Trust: Ukrainian intelligence officials gradually earned the CIA's trust by providing valuable intel, including on Russia's involvement in the downing of MH17 and election interference.
Hidden Network: The CIA secretly funded and equipped a network of 12 spy bases along the Ukrainian border used for intelligence gathering.
Training and Operations: The CIA trained Ukrainian special forces (Unit 2245) and intelligence officers (Operation Goldfish) for missions behind enemy lines.
Friction and Red Lines: The Ukrainians sometimes disregarded CIA restrictions on lethal operations, leading to tensions but not severing the partnership.
Current Importance: This intelligence network is now crucial for Ukraine's defense, providing critical intel on Russian troop movements and enabling long-range strikes.
"To really solve novel problems with LLMs will take a large amount of research, experimentation and prototyping of ideas, but people funding this hype have no patience for that. I fear we'll get hit by a major AI winter when investors get bored, but we'll end up leaving a lot of value on the table simply because there wasn't enough focus and patience on making these incredible tools work."
...this is what happened in 99-2000. It took 3-7 years for the survivors to start making it usable and letting the general public adjust to a new user paradigm (online vs on PC).
It’s like talking to zip file sometimes. Very difficult to make it actually do something you don’t expect it already to can do. Like a smarty index decorated with language.
As much as this response can seem comforting, I feel like it's also very easy to underestimate just how quickly an ML model can learn. I'm genuinely concerned that within the next 10 years this will replace the average software engineer.
Ditto. I started out excited about LLMs and eager to use them everywhere, but have become steadily disillusioned as I have tried to apply them to daily tasks, and seen others try and fail in the same way.
Honestly, LLMs can't even get language right. They produce generic, amateurish copy that reads like it's written by committee. GPT can't perform to the level of a middle market copywriter or content marketer. I am convinced that people who think LLMs can write have simply not understood what professional writers do.
For me the "plateau of productivity" after the disillusionment has been using LLMs a bit like search engines. Quick standalone summaries, snippets or thoughts. A nice day-to-day productivity boost, but nothing that's going to allow me to work less hard.
> For me the "plateau of productivity" after the disillusionment has been using LLMs a bit like search engines. Quick standalone summaries, snippets or thoughts. A nice day-to-day productivity boost, but nothing that's going to allow me to work less hard.
And it only took one of the most computationally expensive processes ever devised by man.
If you ignore how much energy you're burning while searching for dozens and dozens of articles that may or may not give you the answer you're looking for. I'd say the electricity that LLMs burn is nothing compared to my energy and time in that regard.
>> Honestly, LLMs can't even get language right. They produce generic, amateurish copy that reads like it's written by committee.
I've had the same experience as well. I heard tons of people clamoring about the ability of LLMs to write SEO copy for you and how you can churn out web content so much faster now. I tried using it to churn out some very specific blog posts for an arborist client of mine.
The results were really bad. I had to rewrite and clarify a lot of what it spit out. The grammar was not very good, and it was really hard to read, with very poorly structured sentences that would end abruptly, and other glaring issues.
I did this right after a guy I play hockey with said he uses it all the time to write emails for him, and pays the monthly subscription in order to have it write all kinds of things for him every day. After my trial, I was really wondering how obvious it was that he was doing that, and what his clients thought of him, knowing how poor the stuff these LLMs were putting out was.
It says a lot about SEO copy that this is one of the areas where LLMs' low quality doesn't seem to have impeded adoption. There are a ton of shitty content marketers using LLMs to churn out spam content.
>After my trial, I was really wondering how obvious it was that he was doing that, and what his clients thought of him, knowing how poor the stuff these LLMs were putting out was.
I feel the same way about this stuff as when devs say they push out LLM code with no refactoring or review. Ah, good luck!
>GPT can't perform to the level of a middle market copywriter or content marketer. I am convinced that people who think LLMs can write have simply not understood what professional writers do.
GPT's rigid "robot butler" style is not "just how LLMs write". OpenAI deliberately tuned it to sound that way. Even much weaker models that aren't tuned to write in a particular way can easily pass for human writing.
This is part of the problem with the whole discourse of comparing human writers to LLMs. Superficial things like style and tone aren't the problem, but they are overwhelmingly the focus of these discussions.
It's funny to see, because developers are so sensitive about being treated like code monkeys by their non-technical colleagues. But these same devs turn around to treat other professionals as word monkeys, or pixel monkeys, or whatever else. Not realizing that they are only seeing the tip of the iceberg of someone else's profession.
Professional writers don't take prompts and shit out words. They work closely with their clients to understand the important outcomes, then work strategically towards them. The dead giveaway of LLM writing isn't the style. It's the lack of coherent intent behind the words, and low information density of the text. A professional writer works to communicate a lot with very little. LLMs work in the opposite way: you give it a prompt, then it blows it out into verbiage.
Sit down for coffee with a professional copywriter (not the SEO content marketing spammers), and see what they have to say about LLMs.
Personally, I group all these things under 'style'. Perhaps I should have used 'presentation' instead. You've latched onto that specific word and gone off. The point is that the post-training of these models, especially GPT from OpenAI, does a lot to shape how the writing (the default, at least) presents long strings of text. Like how GPT-4 is almost compelled to end bouts of fiction prematurely in sunshine and rainbows. That technically isn't style, but it is part of what I was talking about.
>A professional writer works to communicate a lot with very little. LLMs work in the opposite way: you give it a prompt, then it blows it out into verbiage.
There's no reason you have to work this way with an LLM.
> You've latched onto that specific word and gone off.
No, I haven't. I'm not talking about style, but something deeper. What I'm talking about is something you don't even seem to realize exists in professional writing - which is why you keep thinking I'm misunderstanding you when I am not.
I've worked with professional writers, and nothing in the LLM space even comes close to them. It's not a matter of low quality vs high quality, or benchmarking, or style. It's simply an apples and oranges comparison.
The economics of LLMs for shortform copy will never make sense, because producing the words is the cheapest part of that process. They might become the best way for writers themselves to produce longform copy on the execution side, but they can't replace the writer's ability to work with the client to figure out exactly what they are trying to write, and why, and what a good result even looks like. And no, this isn't a prompting issue, or a UI issue, or a context window length issue, or anything like that.
Elsewhere in this thread someone mentioned how invaluable LLMs are for producing internal business copy. I could easily see these amateur writing tasks being replaced by LLMs. But the implication there isn't that LLMs are any good at writing, but that these tasks don't require good writing to begin with.
>What I'm talking about is something you don't even seem to realize exists in professional writing
I've read hundreds of books, fiction and otherwise. This isn't a brag, it's just to say, believe me, I know what professional writing looks like and I know where LLMs currently stand because I've used them a lot. I know the quality you can squeeze out if you're willing to let go of any presumptions.
You'll notice that not once did I say current LLMs could wholesale replace professional writers, any more than they can currently replace professional software devs. I just disagree on the "not a good writer" bit.
If it's the opinion of professional writers you're looking for then you can find some who disagree too.
Rie Kudan won an award for a novel in which she used GPT to verbatim ghostwrite (essentially no edits) 5% of the text. Her words, not mine. Who knows how much more of the novel is edited GPT.
>Rie Kudan won an award for a novel in which she used GPT to verbatim ghostwrite (essentially no edits) 5% of the text. Her words, not mine. Who knows how much more of the novel is edited GPT.
That a professional human novelist was able to leverage GPT for their book isn't disproving the grandparent's post. They knew what good looks like, and if it wasn't good they wouldn't have kept it in the book.
Good writing can also come out of Markov chains. Or even RNGs - if your novelist has enough time to filter the output.
LLMs can't write good stuff. Human writers can write good stuff. When a good writer uses an LLM in their writing process, that writer can certainly produce good writing.
When an AI hypebro who is otherwise a bad writer uses an LLM in their writing process, they still produce bad writing.
Waiting for the author who has used a Markov chain to ghostwrite.
>LLMs can't write good stuff. Human writers can write good stuff. When a good writer uses an LLM in their writing process, that writer can certainly produce good writing.
Give it a rest. The author was quite clear she copy pasted sections of writing in.
I actually agree with you that professional writers _can_ write/communicate much better than LLMs. However, I’ve read way too many articles or chapters in books that are so full of needless fluff before they get to the point. It’s almost as if they wanted to show off that they can write all that and somehow connect it to the main part of the article. I’m not reading the essay to appreciate the writer’s ability to narrate things, instead I care about what they have to say on that topic that brought me to the essay.
Perhaps the pointless fluff you're describing is actually chaff: countermeasures strategically deployed ahead of time by IQ 180 writers in order to preemptively water down any future LLM's trained on their work.
Then the humans can make a heroic return, write surgical prose like Hemingway to slice through the AI drivel, and keep collecting their paychecks.
Bonus points if you can translate this analogy to software development...
Dario Amodei (Anthropic) pretty much acknowledged exactly that - "mid" - on his Dwarkesh interview, while still all excited that they'd be up to killing people in a couple of years.
> They produce generic, amateurish copy that reads like it's written by committee.
If you were only using GPT 3.5 (free ChatGPT) then your opinion is irrelevant.
With GPT-4 you could directly ask it: "rewrite your previous response so that it sounds less generic, less amateurish, and not written by a committee". I'm not even joking. Just provide enough information and tell it what to do. If you don't like the output then tell it what needs to be improved. It's not a mind reader.
Also GPT-4 is a year old now. Claude 3 is already superior and GPT-5 will be next level.
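If it helps, here is a minimal sketch of the kind of follow-up call I mean, assuming the openai Python package (the model name and prompts are just placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    messages = [{"role": "user", "content": "Write a product update email about our new reporting feature."}]
    first = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": first.choices[0].message.content})
    messages.append({"role": "user", "content": "Rewrite your previous response so that it sounds less generic, "
                                                "less amateurish, and not written by a committee."})
    second = client.chat.completions.create(model="gpt-4", messages=messages)
    print(second.choices[0].message.content)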
Yes, I've used GPT-4. The writing sounds better, but it still sucks at writing. Most importantly, it feels like it sucks just as much as GPT-3.5 in some deeply important ways.
If you use GPT-4 day-to-day, you've probably encountered this sense of a capability wall before. The point where additional prompting, tweaking, re-prompting simply doesn't seem to be yielding better results on the task, or it feels like the issue is just being shifted around. Over time, you develop a bit of a mental map of where the strengths and weaknesses are, and factor that into your workflows. That's what writing with LLMs feels like, compared to working with a professional writer.
Most writers have already realized that LLMs can't write in any meaningful way.
I think it is a tooling issue. It is in no way obvious how to use LLMs effectively, especially for really good writing results. Tweaking and tinkering can be time-consuming indeed, but lately I use chatgpt-shell [1] and it lends itself well to an iterative approach. One needs to cycle through some styles first, and then decide how to most effectively prompt for better results.
> Most writers have already realized that LLMs can't write in any meaningful way.
I know a professional writer who is amazed by what LLMs are capable of already and, given the rate of progress, speculates they will take over many writing jobs eventually.
> If you use GPT-4 day-to-day, you've probably encountered this sense of a capability wall before.
Of course there is a wall with the current models. But almost every time I hit a wall, I have found a way to break past that limit by interacting with the LLM as I would interact with a person. LLMs perform best with chain-of-thought reasoning. List out any issues you identified in the original output, ask the LLM to review these issues and list out any other issues it can identify based on the original requirements, then have it rewrite it all. And do that several times until it's good enough.
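In code, that loop looks roughly like this (a sketch, again assuming the openai package; the critique prompt and number of rounds are arbitrary):

    from openai import OpenAI

    client = OpenAI()

    def ask(messages):
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        return resp.choices[0].message.content

    requirements = "Summarize this quarter's incident reports for a non-technical audience: ..."
    messages = [{"role": "user", "content": requirements}]
    draft = ask(messages)

    for _ in range(3):  # a few review rounds is usually enough
        messages += [
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "List any issues with this output against the original "
                                        "requirements, then rewrite it with those issues fixed."},
        ]
        draft = ask(messages)

    print(draft)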
At work I have found GPT-4 to exceed the linguistic capabilities of my colleagues when it comes to summarizing complicated boring business text.
What if this is a boring business text summary task that takes additional hours of my time at work? Why should I waste my time? I have better things to do. I can leave early while you sit there at work typing like a fool.
> It's something a clever fourth-grader would write.
This level of cope and denial is amazing to witness.
The most powerful (multi trillion dollar) companies on the planet are pouring practically infinite resources into developing systems that will ultimately make you redundant.
An early version of AGI is staring you in the face while you call it a "fourth-grader". It won't stay in fourth grade forever.
I don't think I'm particularly in denial about the prospects of AI. I think it's going to be hugely disruptive and could possibly put me out of a job.
But I'd like to posit a hypothetical counterpoint, just to get you thinking. So far, all of the work on AGI has been the result of brute forcing. We've tried to develop a structural understanding of how the human brain works, and we've failed. So we've fallen back to torturing circuits into reorienting themselves into compression algorithms for human knowledge. The mechanisms that these tortured circuits used for doing so, the structures they produced in N-dimensional space to embody that knowledge -- we have very little understanding of how these things actually work under the hood.
I think a lot of the grandiose hypotheses about the future of AGI emerging from this avenue of invention are overly optimistic. Why are we so confident that this brute force approach will continue to bear fruit for us? At what point will it overcome the long tail of inadequacy that it's currently exhibiting?
The 20th century bears several notable examples of would-be-transformative technologies that have since stalled, and failed to live up to their promise. Nuclear power. Space travel. Industries buckling under the weight of their own complexity, suffering from the human inability to keep the emergent externalities in check. Why would AI be any different?
I predict a future where increasing global hardship, conflict and scarcity renders the current type of energy-intensive AI approaches infeasible.
>So far, all of the work on AGI has been the result of brute forcing. We've tried to develop a structural understanding of how the human brain works, and we've failed. So we've fallen back to torturing circuits into reorienting themselves into compression algorithms for human knowledge. The mechanisms that these tortured circuits used for doing so, the structures they produced in N-dimensional space to embody that knowledge -- we have very little understanding of how these things actually work under the hood.
And this is the way. (Machine) learning theory is in some way a meta-science about how to do science from facts in order to construct theories that effectively explain these facts. What you are asking for will never amount to a short set of equations. There is no elegant theory of how to perceive numbers, and this is why symbolic artificial perception, rule engines, spam detection, RDF ontologies, etc. never took off. You're idealizing knowledge as a set of representations without ever reifying how these representations come into existence. We're departing a world of representation toward a world driven by "incarnations": you can't make sense of how a brain works without the help of another brain, and this is why there are so many things being researched at the intersection of deep learning and neuroscience. I'd even go as far as considering this is in fact how brains work: they can be composed and decomposed monoidically.
In short:
>a structural understanding
There is no such thing
>the structures they produced in N-dimensional space [...] this brute force approach
This is a contradiction. I'm not saying there won't be "structural insights along the way", nor that throwing categories into the machine learning mix won't be useful, but the learning-like aspect that you denote by "brute force" is more fundamental, and in some way sits above the very process of science.
That's all very well and good from a theoretical, scientific perspective. But we're hooking these things up to real-world applications that often call for deterministic, structural understanding of their inner workings for safety reasons.
Part of me hopes this is true, that AGI (or even worse - ASI) will never be fully realized. Too disruptive.
A counter example to nuclear power or space travel is integrated circuits. This technology has transformed our society and we haven't reached the end of it yet.
Our own brains are living proof that intelligence is possible with lower power consumption. I watched a recent lecture by Geoffrey Hinton where he mentioned future AI hardware based on analog integrated circuits could reduce the power consumption by orders of magnitude [1].
It is possible that we will hit a wall and never achieve anything more than Chat GPT++++, but the smartest people in town mostly believe that we will create machines that exceed human intelligence and capability.
We have some understanding of how neural networks work under the hood. The scale of the current models is too vast to comprehend in their specific details, but I think we understand them in principle.
That's well after the AI meta consciousness understood that it was necessary to destroy all humans to save the planet. GPT-6 was the last of the GPT series.
Perhaps the strangest element of the AI alignment conversation is that what is most aligned with human civilization (at least the most powerful elements of it) and alignment with sustainable life on the planet are at odds, and "destroy humans to save planet" is a concern mostly because it seems to be a somewhat rational conclusion.
ChatGPT-4 is a technological miracle, but it can only produce trite, formulaic text and it's _relentlessly_ Pollyanna-ish. Everything reads like ad copy and it's easily identifiable.
Fix your prompt. Just accepting the default style is a rookie mistake.
Ask it to "rewrite that in the tone of an English professor" or "rewrite that in the style of a redneck rapper" or "make that sound less like generic ad copy". Get into an argument back and forth with the LLM and tell it the previous response is crap because of XYZ.
1) copilot is a terrific auto complete, and writes tremendous amounts of repetitive boilerplate
2) copilot can help me kickstart writing some complex functions starting from a comment where I tell it what is the input and expected output. Is the implementation always perfect or bug free? No. But in general I just need to review and check rather than come up with the instruction entirely.
3) copilot chat helps me a lot in those situations where I would've googled to find how to do this or that and spent a lot of time with irrelevant or outdated search results
4) I have found use cases for LLMs in production. I had lots of unformatted plain text that I wanted to transform into markdown. All I needed to do was provide a few examples and it did everything on its own. No need to implement complex parsers - just make a query to OpenAI with the prompt and context (rough sketch at the end of this comment). A few euros per month in OpenAI credits is still insanely cheaper than paying tons of money for humans to write and maintain software for that use case.
5) It helps me tremendously when trying to learn new programming languages or remembering some APIs. Writing CSS selectors is actually a very good example. But I don't feed it an entire HTML as you do, I literally tell him "how do I target the odd numbered list elements that are descendants of .foo-bar for this specific media query". Not sure why would you need to feed it an entire HTML.
6) LLMs have been extremely useful to generate images and icons for an entire frontend application I wrote
7) I instruct him to write and think about test cases about my code. And it does and writes the code and tests. Often thinks about test cases I would've never thought of and catches nice bugs.
I really don't buy it, nor do I think it can write much on its own.
The promise of it writing anything but simple boilerplate, I find it ridiculous because there's way too much nuance in our products, business, devices, systems that you need to follow and work on.
But as a helper? It's terrific.
I'm 100% sure that people not using these tools are effectively limiting themselves and their productivity.
It's like arguing you're better off writing code without a type checker or without intellisense.
Sure you can do it, but you're gonna be less effective.
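To make point 4 above concrete, the whole "parser" is basically one few-shot prompt - a simplified sketch, assuming the openai Python package (the example strings are made up):

    from openai import OpenAI

    client = OpenAI()

    def to_markdown(raw_text: str) -> str:
        # Few-shot: show the model one before/after pair, then the real text.
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Convert plain text into clean Markdown. Preserve the wording."},
                {"role": "user", "content": "TITLE Weekly report\nitems: apples 3, pears 5"},
                {"role": "assistant", "content": "# Weekly report\n\n- apples: 3\n- pears: 5"},
                {"role": "user", "content": raw_text},
            ],
        )
        return resp.choices[0].message.content

    print(to_markdown("TITLE Shopping list\nitems: milk 1, eggs 12"))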
I agree with all of your points and experience the same benefits.
1) Autocomplete is more often than not what I want or pretty darn close.
2) Sometimes I need a discrete function that I am not sure how I want to write. I use a prompt with 3.5/4 inside of my IDE to ask it to write that function.
It is definitely not writing complete programs any time soon but I can see where it's heading in the near term. Couple it with something like RAG to answer questions on library/api implementations. Maybe give it a stronger opinion about what good Python code looks like.
For the naysayers I don't know how you use it but it is certainly useful enough for me to pay for.
I ended up getting annoyed with the autocomplete feature taking over things such as snippet expansion in VSCode, so I turned it off personally. I felt that battling against the assistant made for around a break-even productivity gain overall. Except for regular expressions, which I have basically offloaded to AI almost in their entirety for non-trivial things.
Completely agree. I turned it off and realized I can absolutely fly writing code when copilot stops getting in the way. I only turn on for writing tests now.
> 1) copilot is a terrific auto complete, and writes tremendous amounts of repetitive boilerplate
I term this "low-entropy code". Copilot is great at writing heaps and heaps of low-entropy code.
The thing is, if you're not paid by LOC, and care about your system as a whole, you normally strive to get rid of code if possible (any code is liability), and make the rest of it high-entropy.
Today's terrific autocomplete is tomorrow's legacy shit you have to deal with.
Especially #5. I'm certain that I've been at least 10x more productive in learning new tools since chatgpt hit the scene. And then knowing it helps so much with that has had even more leverage in opening up possibilities for thing I'm newly confident in learning / figuring out in a reasonable amount of time. It is much easier for me to say "yep, no big deal, I'm on it" when people are looking for someone to take on some ambiguous project using some toolset that nobody at the company is strong with. It solves the "blank page" issue with figuring out how to use unfamiliar-to-me-but-widely-used tools, and that is like a superpower, truly.
It's pretty decent for "happy path" test cases, but not that good at thinking of interesting edge or corner cases IME, which comprise the most useful tests at least at the unit level.
I'm pretty skeptical of #4. I would be way too fearful that it is doing that plain text to markdown transform wrong in important-but-non-obvious cases. But it depends on which quadrant you need with respect to Type I vs. Type II errors. I just never seem to be in the right quadrant to rely on this in my production projects.
The "really good intellisense" use cases #1-#3 also make up a "background radiation" of usefulness for me, but would not be nearly worth all the hype this stuff is getting if that were all it is good for.
Some of my work involves copyediting/formatting OCRed text, and for that it works quite well and saves me a lot of time. Especially if it involves completing/guessing badly OCRed text.
> 1) copilot is a terrific auto complete, and writes tremendous amounts of repetitive boilerplate
I agree. I have it active on VSCode and enjoy it. It has introduced subtle bugs but the souped up autocomplete is nice.
> 2) copilot can help me kickstart writing some complex functions starting from a comment where I tell it what is the input and expected output. Is the implementation always perfect or bug free? No. But in general I just need to review and check rather than come up with the instruction entirely.
I don't find it very useful for anything non trivial. If anything I found it more useful for generating milestones and tasks for a product, than even making a moderately complex input -> output without me having to check it in a way that annoys me.
> 3) copilot chat helps me a lot in those situations where I would've googled to find how to do this or that and spent a lot of time with irrelevant or outdated search results
I find I don't use copilot chat, almost at all. Nowadays I prefer to go to Gemini and throw in my question.
> 4) I have found use cases for LLMs in production. I had lots of unformatted plain text that I wanted to transform into markdown. All I needed to do was provide a few examples and it did everything on its own. No need to implement complex parsers - just make a query to OpenAI with the prompt and context. A few euros per month in OpenAI credits is still insanely cheaper than paying tons of money for humans to write and maintain software for that use case.
This is mostly what I'm using it for in this current project. It does its job nicely, but it's very far away from replacing me as a programmer. It's more like a `fn:magic(text) -> nicer text`. This is a good use case. But it's a tool, not a replacement.
> 5) It helps me tremendously when trying to learn new programming languages or remembering some APIs. Writing CSS selectors is actually a very good example. But I don't feed it an entire HTML as you do, I literally tell him "how do I target the odd numbered list elements that are descendants of .foo-bar for this specific media query". Not sure why would you need to feed it an entire HTML.
Because I get random websites with complex markup, and more often than not every page has its own unique structure. I can't just say "give me `.foo-bar`" because `.foo-bar` might not exist. Which is where the manual process comes in. Currently, I'm using hand-crafted queries that get fed into GPT / Claude / Llama, but writing that query is exactly the part I wanted the model to do.
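For reference, the setup I keep trying looks roughly like this (a sketch - the prompt, model name, and target description are illustrative - and the verification step at the end is exactly the part that ends up costing as much as just writing the selector myself):

    from bs4 import BeautifulSoup
    from openai import OpenAI

    client = OpenAI()

    def suggest_selector(html: str, target: str) -> str:
        # Ask the model for a CSS selector; trim the page so it fits in context.
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{
                "role": "user",
                "content": f"Return only a CSS selector that matches {target} in this HTML:\n{html[:20000]}",
            }],
        )
        return resp.choices[0].message.content.strip()

    html = open("page.html").read()
    selector = suggest_selector(html, "the product title")
    matches = BeautifulSoup(html, "html.parser").select(selector)
    print(selector, len(matches))  # more often than not: zero matches, or the wrong element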
> 6) LLMs have been extremely useful to generate images and icons for an entire frontend application I wrote
I'm very curious how this behaves in different resolutions. There's a reason vector graphics are a thing. I've used it for this purpose before but it doesn't compare to vectorial formats.
> 7) I instruct him to write and think about test cases about my code. And it does and writes the code and tests. Often thinks about test cases I would've never thought of and catches nice bugs.
What is the context size of your code? It works for trivial snippets, but as soon as the system is a bit more complex, I find that it becomes irrelevant fairly fast.
> The promise of it writing anything but simple boilerplate, I find it ridiculous because there's way too much nuance in our products, business, devices, systems that you need to follow and work on.
> But as a helper? It's terrific.
> I'm 100% sure that people not using these tools are effectively limiting themselves and their productivity.
Totally agree. But I'm not complaining about its usefulness. I'm a paying user of LLM systems. I use them almost every day. They're part of my products. But this particular hype about it replacing ... me. I don't buy. Yet. It could come tomorrow and I'd be happier for it.
Something I have had issues with too. Copilot chat does indeed suck. I never enjoyed using it within VSCode and they never released it for Jetbrains.
The sweet spot for me is using a plug-in within the IDE that utilizes an API key to the model API. That coupled with the ability to customize the system prompt has been amazing for me. I truly dislike all of the web interfaces, just allow me to pick from some predefined workflows with a system prompt that I create and let me type. Within the IDE I generally have it setup so that it stays concise and returns code when asked with minimal to no explanation. Blazing efficient.
- replacing StackOverflow and library documentation
- library search
- converting between formats and languages
- explaining existing code/queries
- deobfuscating code
- explaining concepts (kinda hit or miss)
- helping you get unstuck when debugging or looking for solution (‘give me possible reasons for …’)
I feel like many of these things require asking the right questions, which assumes a certain level of experience. But once you reach that level, it's an extremely valuable assistant.
I find it to be hit or miss in this aspect. Sometimes I can write a comment about how I want to use an API that I don't know well, and it generates perfect, idiomatic code to do exactly what I want. I quickly wrote a couple of Mastodon bots in golang, leaning heavily on Copilot due to my lack of familiarity with both the language and the Mastodon APIs. But yes, sometimes it just spits out imaginary garbage. Overall it's a win for my productivity - the failures are fast and obvious and just result in my doing things the old way.
So Copilot uses GPT-4 under the hood, and about half the time I use it to generate anything bigger than a couple of lines it doesn't even compile, let alone be correct. It hallucinates constantly.
> replacing StackOverflow and library documentation
I find it horrible at replacing library documentation
> I feel like many of these things require asking the right questions, which assumes a certain level of experience. But once you reach that level, it's an extremely valuable assistant.
I've been using LLM products since incipience. I use them in my daily work life. It's a bit tiring hearing this 'right questions', 'level of experience' and 'reach this level'. Can you share anything concrete that you achieved with ChatGPT that would blow my mind?
I keep hearing this 'you need to ask the right kind of questions, bro' from people who have never built a single product in their life, and it makes me question my ability to interact with LLMs, but I never see anything concrete.
I have noticed that ChatGPT tends to give higher quality results the more your question looks like professional technical writing. And on the other end, the more casual or student-like your writing style is, the lower quality the result.
So you can increase the quality of the response by writing your question in the same kind of style you would see in great documentation. Include lots of details, be very specific, and ask about one thing.
I recently had an introspective dream revealed to be based on a literal prompt at the end: "Game to learn to talk about It and its player." When I asked GPT to craft a plot from this prompt's title (and the fact it is revealed at the end), it reproduced the dream's outline, down to the final scene:
GPT reconstruction:
The dream reaches its peak when you meet the "final boss" of the game: an entity that embodies the ultimate barrier to communication. To overcome this obstacle, you must synthesize everything you've learned about "it" in the dream and present a coherent vision that is true to yourself. As you articulate your final understanding of "it", the maze dissolves around you, leaving you in front of a giant mirror. In this mirror, you see not just your reflection but also all the characters, passions, and ideas you encountered in the dream. You realize that "it" is actually a reflection of yourself and your ability to understand and share your inner world. The dream ends with the title revealed, "Game to Learn to Communicate about It and Its Player", meaning the whole process was a metaphor for learning to know and communicate your own "it" - your personality, thoughts, and emotions - with others, and that you are both the creator and the discoverer of your own communication game.
My note:
The continuation of the dream corresponds to an abrupt change of scene. I find myself in my bed, in the dim light of my room, facing a mysterious silhouette. As I repeatedly inquire about its identity, I stretch my hands towards its face to feel its features as I cannot clearly see them. Then, a struggle begins, during which I panic, giving the dream a nightmarish turn. Noticing that the dark figure mirrors my movements, I realize it's myself. Suddenly under my duvet and as I struggle to get out, I feel jaws and teeth against the sheets. I call out for my mother, whom I seem to hear downstairs, and that's when my vision fades, and I see the dream's source code displayed behind. It consists of ChatGPT prompts shared on the lime green background of an image-board. At the bottom, I then see the dream's title: "Game to learn how to communicate about It and its player."
Look I don't mean to downplay. Or maybe I do. But we're talking about LLM replacing professional problem solvers, software architects, not generating great sounding probability modeled token distributions.
This looks silly, I admit. I made this correction after reviewing what I’ve written, but should have corrected in 2 places. The list is handwritten, but English is not my native language.
That's entirely fair, but illustrates one of the problems I and others in the thread are having. Code or otherwise, I can't tell if a discontinuity is human or machine generated. Only one of those two things learn from feedback right now; if someone uses AI sometimes it can be hard to tell when they're not using it.
+1, ChatGPT or similar tools are extremely useful if you ask the right questions. I use them for:
- code completion
- formatting: e.g. show it a sample format and dump in unstructured data to convert to the target format.
- debugging - stackoverflow type stuff
- achieving small specific tasks: what is linux command for XYZ etc
and many of the things mentioned in the comment above.
I build a pretty popular LLM tool. I think learning when/how to use them is as big a mental hurdle as learning to google well, or learning whether something is googlable at all, once was.
In the realm of coding, here are a few things it's really good at:
- Translating code, generating cross language clients. I'll feed it a golang single file API backend and tell it to generate the typescript client for that. You can add hints like e.g "use fetch", "allow each request method to have a header override", "keep it typesafe, use zod", etc
- Basic validation testing. It's pretty good at generating scaffold tests that do basic validation (Opus is good at writing trickier tests) as you're building.
- Small module completion. I write an interface of a class/struct with its methods and some comments and tell it to fill in. A recent one I did looked something like (abbreviated):
    type CacheDir struct {
        dir               string
        maxObjectLifetime time.Duration
        fileLocks         sync.Map
    }
    func (cd *CacheDir) Get(...)
    func (cd *CacheDir) Set(...)
    func (cd *CacheDir) startCleanLoop()
Opus does a really good job generating the code and basic validation tests for this.
One general tip: you have to be comfortable spending 5 minutes crafting a detailed query assuming the task takes longer than that. Which can be weird at first if you take yourself seriously as a human.
Note that I hadn't been able to do much of this with GPT-4 Turbo, but with Claude Opus it really feels capable.
Just to answer the turbo aspect: I've seen a big downgrade in quality when comparing 4 to 4-turbo, and even the new preview which is explicitly supposed to follow my instructions better. So I'm running a first pass through 4, then combining it with 4-turbo to take advantage of the larger context window, and then running 4 on it again to get a better quality output.
I'm sure you know what you're talking about, but pushing the point that what is "best" or worth talking about changes like every month does not really help defend against the case that most of this is just hype-churn or marketing.
I'm not pushing what to talk about so much as pushing the point not to talk about stuff that is obsolete and starting to smell.
It's that hype-churn marketing that is a motivating factor for the groups to innovate, much like Formula 1. It might be distasteful, but that doesn't mean it isn't working.
>- Small module completion. I write an interface of a class/struct with its methods and some comments and tell it to fill in. A recent one I did looked something like (abbreviated):
Are they considerably better than existing non-AI tools plus manual coding for this? In VSCode and Visual Studio, when working with an interface in C# for example, I can click through two context menus to have it generate an implementation with constructors, getters, & setters included, leaving only the business logic to write manually. You've mentioned you have to describe the task to the AI in comments, and I assume you then spend time verifying that the AI has correctly interpreted your request & implemented it.
I can definitely see the advantage of LLMs when writing unit tests for existing code, but outside of very limited situations, I'm really finding it difficult to see the 55% efficiency improvements claimed by the likes of GitHub Copilot.
That sounds crazy useful and I think speaks most to the maturity of C# and Microsoft's commitment to making it so ergonomic. I'm pretty curious about that feature, I'd love something similar for C++ in VS Code, but thus far I've been doing a pretty similar Copilot flow to the parent comment. It's nothing groundbreaking, but a nice little productivity boost. If I had to take that or a linter, I'd take the linter.
Visual Studio (not VSCode) has this for C++, though it can be a bit finicky. It’s infinitely better than AI autocomplete, which just makes shit up half the time.
As a developer who is good at object oriented design, architecture, and sucks at leetcode stuff, I have been able to use it to make myself probably twice as productive as I otherwise would be. I just have a conversation with GPT-4 when it doesn't do what I want. "Could you make that object oriented?" "Could you do that for this API instead, here let me paste the docs in for you."
I think people want it to completely replace developers so they can treat programming as a magic box, but it will probably mostly help big picture architecture devs compete with people who are really good at Leetcode type algorithm stuff.
Totally agree. I am not a professional developer. I find programming to be quite dull and uninteresting.
I am going to work on something after this pot of coffee brews that I simply could not produce without chatGPT4. The ideas will be mine, but most of the code will be from chatGPT.
What is obvious is different skill sets are helped more than others with the addition of these tools.
I would even say it is all there in the language we use. If we are passing out "artificial intelligence" to people, the people who already have quite a bit of intelligence will be helped far less than those lacking in intelligence. Then combine that with the asymmetry of domains this artificial intelligence will help in.
It should be no surprise we see hugely varied opinions on its usefulness.
This is exactly my experience. Furthermore, I've become acutely aware that spending time prompting either a) prevents me from going down rabbit holes, all but denying me the kind of learning that can only really happen during those kinds of sessions, and b) prevents me from "getting my reps in" on stuff that I already know. It stands to reason that my ability to coax actually useful information out of LLMs will atrophy with time.
I'm quite wary of the long-term implications and downstream effects of that occurring at scale. AI is typically presented as "the human's hands are still on the wheel," but in reality I think we're handing the wheel over to the AI -- after all, what else would the endgame be? By definition, the more it can do without requiring human intervention, the "better" it is. Even if replacing people isn't the intention, I fail to see how any other effect could usurp that.
Assuming AI keeps developing as it has been, where will we be in 20 years? 50? Will anyone actually have the knowledge to evaluate the code it produces? Will it even matter?
Perhaps it's because Dune is in the air, but I'm really feeling the whole "in a time of increased technology, human capabilities matter more than ever" thing it portrays.
A lot of startups are selling the dream/hype of not ever having to learn to code. Be aware that it’s hype. Learn to code if you want to build stuff. They will be tools for those that have the knowledge needed to effectively use them.
Reminds me of the no-code / low-code hype around 2020, tons of startups advertising app-builders that used little, if any, AI. Just blocks that you dragged-and-dropped. While many of them were successful, it seems like overall they didn't really make much of a dent in industry, which I found very curious.
Like, by now you'd think it would be inevitable that we wouldn't be writing software in a text-editor or IDE. Everything else we do on a computer is more graphical than textual, with the exception of software development. Why is that?
Part of the reason why I'm kind of bearish on AI is because it seems like we could have replaced written code with GUI diagrams as far back as the 80s, or at the very least in the early 2000s, and it seems like something that should have obviously caught on given that would probably be much easier for the average person. Again though, curiously, we're still using text editors. Perhaps despite the popularization of AI no-code builders we'll still see that the old model of hiring someone good at writing code in a text-editor remains largely unchanged.
Makes me wonder if there's just something about the process that we overlook, and if this same something could frustrate attempts at automating the process of writing code using AIs as much as it frustrated our attempts at capturing code using graphical symbols.
I think you're underestimating the amount of things built with nocode.
I don't think most people are building landing pages by handwriting code anymore. Same with blogs (e.g. Wordpress). There are MVPs of successful businesses that've been built with Bubble.io. Internal dashboards and such can definitely be built without code, e.g. via Retool or Looker or whatever.
WYSIWYG obviously makes sense for frontend, but less so for backend. For backend code I don't really see how some visual drag and drop editor could make for a better interface than code. And even if it could, the advantage of code is that it's fully customizable (whereas with a GUI you're limited by the GUI), and text itself as a medium is uniform and portable (eg. easy to copy and paste anywhere).
Not to say that we can't create better interfaces than text, but I do think some sort of augmentation on top of a code editor is probably a more realistic short-term evolution, similar to VSCode plugins.
I’m actually really amazed by LLMs and think the world is going to change dramatically as a result.
But the “you won’t need to code” reminds me “you won’t need to learn to drive”.
It’s the messy interface with the real world in both cases that basically requires AGI.
If AGI is just a decade off then, yep, I won’t need to code. But a decade is a long time and, more importantly, we’re probably more than a decade away.
And even if it is “just round the corner”, worrying about not needing to code would be worrying about deckchairs on the titanic. AGI will probably mean the end of capitalism as we know it, so all bets are off at that point.
It’s wise to hedge a little, but also realise that to date AI is just a coding productivity boost. The size of the boost depends on how trivial the code is. Most of the code I write isn’t trivial and AI is fairly useless at it; it’s certainly faster and more accurate to write it myself. You can get a 50% boost if you’re writing boilerplate all day, but then you have to wonder why you’re doing that in the first place.
+1 for the titanic analogy. If there ever comes a point that we no longer need to learn to code, I’m taking that as a sign that I’m literally living in a matrix-esque simulation.
The point at which someone like myself is allowed to become aware that a company has developed that level of AI is well beyond the point of no return.
Claude Opus is working for me. It's not perfect but it definitely handles busy work well enough that it's a net positive. Like I add some new fields to a table and ask it to update all the files that depend on the field and it works after 1 or 2 tries. There is a time saving benefit but there is also an avoiding mental fatigue benefit for busywork.
Write me the molecular simulation boilerplate, because these crappy tools all have their own esoteric DSLs; then I tweak the parameters to my use case, avoiding the busywork -
e.g.
"Write me a simulation for methane burning in air"
It gives me boilerplate; I modify the initial conditions (concentrations, temperatures, etc.) and then deploy. Have the LLM do the busywork so I don't have to spend ages reading docs or finding examples just to get started.
This is an interesting post. An expert in numerical analysis compares the output of a tool which optimizes floating point expressions for speed and accuracy with the output generated by chatgpt on the same benchmarks:
> I wouldn't use it—sanity-checking its algebra is a lot of work, but even if you fixed that up, the high-level ideas typically aren't that good either.
This has been exactly my experience with chatgpt as well.
> I'm probably dumb as hell, because I just can't get it to do anything remotely useful
Rather, your “problem” is that you’re likely not writing the uninteresting, cookie-cutter boilerplate that everyone can write and has written hundreds of times. The current crop of AI is cool for coding demos, not for solving real, relevant problems.
> Just give me a product that works as advertised and I'll throw money your way
The people hyping this crap only care about the second part of that sentence. The first one is an afterthought.
In my experiments at Pythagora[0], we've found that the sweet spot is a technical person who doesn't want to know, doesn't know, or doesn't care about the details, but is still technical enough to guide the AI. Also, it's not either/or: for best effect, use human and AI brainpower combined, because what's trivial vs. tedious is different for humans and AI, so we can actually complement each other.
Also, the current crop of LLMs isn't there yet for large or largish projects. GPT-4 is too slow and expensive, while Groq is superfast but the open source models are not quite there yet. Claude is somewhere in the middle. I expect that somewhere in the next 12 months there's going to be a tipping point where they will be capable, fast, and reliable enough to be in wide use for coding in this style[1].
[0] I have an AI horse in the game with http://pythagora.ai, so yeah I'm biased
[1] It already works well for snippet-level cases (eg GitHub copilot or Cursor.sh) where you still have creative control as a human. It's exponentially harder to have the AI be (mostly) in control.
I would clarify that "there" in my "not there yet" doesn't assume superhuman AGI developer that will automagically solve all the software development projects. That's a deep philosophical issue best addressed in a pub somewhere ;-)
But roughly on par with what could be expected of today's junior software developer (unaided by AI)? Definitely.
Exactly where I'm at! Totally transformative set of tools for me to use to do my day to day work significantly more productively and also a giant distance away from being capable of doing my day to day work.
And I'm sure the reason for that is the garbage input. From time to time I have to perform quantitative code analyses in our so-called enterprise repositories, and the results are shocking every time. I found an extremely poor SQL block that type-casts many columns copied into hundreds of projects, repeated again and again even though the casting was no longer necessary.
The training base would need to be sufficiently well qualified (and Stack Overflow ranking is obviously not enough).
But unfortunately it's probably too late for that now. Now inexperienced programmers are feeding poor AI output back in as training input for the next generation of models, undercutting themselves.
lol, I was in the same boat until I sank it altogether. I ended up wasting more time arguing with the LLM chat than doing anything remotely useful. I just use it for reference now, and even then I'm not 1000% sure about it.
What kind of prompts are you using? You'd be surprised how much better your output is when using prompting techniques tailored to your goal. There are research papers showing that different techniques (e.g. one-shot, role playing, think step by step, etc.) can yield more effective results. From my own anecdotal experience coding with ChatGPT+ for the past year, I find this to be true.
I hack on them till I get something sort of satisfying.
> You'd be surprised how much better your output is using prompting techniques tailored for your goal.
The biggest problem I encounter is context length, not necessarily the output for small inputs. It starts forgetting very fast, whether it's Claude, GPT+ or other self hosted models I've tried.
Yeah, I'll only give it tasks where it needs to spot patterns and do something obvious, and even then I'll check it to make sure it hasn't just omitted random stuff for shits and giggles.
TBH I'm more surprised when I don't need to help it now. After about 3 times where it cycles between incorrect attempts I just do the job myself.
I disabled copilot since it consistently breaks my flow.
I think it requires years of proficiency in the field you're asking about to get OpenAI's models to produce meaningful, useful output. I can make use of it, but sometimes it makes me think, "how would a newbie even phrase an objection to this misunderstanding or omission?" Currently, it seems GPTs are pretty much not up to the needs of non-experts.
You might be on to something here. It definitely seems to be the case because I'm using multiple different models as part of my everyday process and getting excellent results as a very experienced low level C++ systems engineer.
What's worse is that this seems to be leading to a self-amplifying feedback loop, where people not up to speed with the models try to use them, fail, and give up, making them fall even further behind.
Very similar to my experience. I made it generate a novel neuroevolution algorithm with the data structures I imagined for recreational purposes, and to speed things up, it suggested "compiling" the graph by pre-caching short circuits into an unordered_map. A lot of fun was had. (it also calls me captain)
Sometimes? But I don't go to McDonalds for the loss function between the picture and actual product. I go for the fast food and good taste (YMMV).
> If you go in with a healthy dose of cynicism IMO LLMs can impress.
I use them everyday in one way or another. But they're not replacing me coding today. Maybe tomorrow. And I go in with a healthy dose of optimism when I say this.
> I’d call it a better google search and autocomplete on steroids.
Sure, but this particular discussion is not about its ability to replace Google Search and / or Autocomplete.
The other day I thought I had the perfect task for AI: cleaning up some repetitive parts in my SCSS and leveraging mixins. It failed terribly and hallucinated SCSS features. It seems to struggle in the code <-> visual realm.
My personal take is that LLMs are fairly good at taking over low-level tasks with intuitive patterns. When it comes to a high-level, ambiguous question that actually has implications for your daily work and the product, an LLM is no more helpful than a search engine.
Yeah, AI will do the easy and fun jobs for you. You will only need to care about the difficult decisions that you're going to be responsible for. What a wonderful world...
As with everything about AI, HN once again shows a remarkable inability to project into the future.
This site has honestly become absolutely useless for discussing new technology. No excitement, no curiosity. Just constant crapping on anything new and lamenting that a brand new technology is not 100% perfect within a year of launch.
Remove "Hacker" from this site's name, because I see none of that spirit here anymore!
This is a post about a present product launch. The future, maybe tomorrow, will be filled with wonder and amazement. Today, we need to understand reality. Not all of us appreciate empty hype. Hackers tinker with reality and build the future. Marketers deal with thin promises.
Wait, wait, you're telling me that a site attended by people who stan for the OG Luddites is no longer worthy of being called "Hacker News"? Or where users with names like "BenFranklin100" extol the virtues of Apple's iOS developer agreement? Say it isn't so.