AI hallucinates software packages and devs download them (theregister.com)
62 points by dragonbonheur on March 28, 2024 | 78 comments



Some of us here may be old enough to remember the days when Microsoft was pushing ActiveX controls as the future of the web. Basically they were binary components that would run in your browser and weren't really effectively sandboxed. Those of us that pointed out this was a major security hole were ignored. Inevitably carnage followed.

People blindly accepting the output of LLMs seems similarly crazy to me and I think it's only a matter of time before we face a real reckoning over this. The lesson here I think is just because a lot of people are advocating something that seems reckless doesn't mean it isn't reckless.

https://www.howtogeek.com/162282/what-activex-controls-are-a...


> People blindly accepting the output of LLMs seems similarly crazy to me and I think it's only a matter of time before we face a real reckoning over this. The lesson here I think is just because a lot of people are advocating something that seems reckless doesn't mean it isn't reckless.

Irony in the "Greek tragedy" sense (but who is εἴρων and who is ἀλαζών?)

So many online arguments about AI safety, where someone says "it's fine: just don't let it out of the box", and here it is, accessible on the internet — and for some, even this isn't enough: they want the weights; and for some, even the weights aren't enough, they want the training data.

So many arguments where someone says "it's fine: just switch it off if it goes wrong", and not only is the switch out of reach to those affected, most people aren't even checking if it's going wrong in the first place.

The labs themselves ring the alarm bells, put disclaimers on the start page of their own products, call for binding regulations, and there's always someone dismissing this as a "5D chess" marketing move.

So many where someone says "it's just speech, what can it do?", etc. etc.

I'm glad so many people from the AI labs themselves are signatories to the "let's pause for a bit" open letter. Cool though the tech is, it's definitely acting outside the expectations many have for it, and a simple "better/worse" dichotomy isn't sufficient to describe the ways in which it falls outside those expectations, given that it can be both at the same time.


The real danger isn't that "too smart" unaligned AI might execute some kind of deliberate plan for taking over the world, it's that there are massive numbers of people so stupid that they will blindly follow the instructions of stochastic parrot bullshit generators.

And it is very clear that the AI companies do want you to be scared of the former, because it makes their tech seem more impressive and distracts from its very real flaws.

Stupidity has of course always existed and you can't regulate it away, but modern tech and the business interests behind it seem to me to be making it worse and potentially more destructive.

I'm not talking about just the latest hype here, it has been a trend for more than a decade now: "don't think for yourself, just consult the magic oracle in your pocket that surely knows best and isn't purposely designed to manipulate you into buying more crap and subscribing to the worldview of a small elite"

But like with the AI apocalypse, or any conspiracy theory, the really scary thing is that there doesn't have to be some kind of intelligent plan behind it, something that you could understand or fight against, or maybe join in because you think they have the right goals. Rather, it's just mindless forces of the market and social dynamics.

I really think it's all going to collapse within another 15 years or so, and can only hope that civilization will recover and perhaps learn from the mistakes made.


> The real danger isn't that "too smart" unaligned AI might execute some kind of deliberate plan for taking over the world, it's that there are massive numbers of people so stupid that they will blindly follow the instructions of stochastic parrot bullshit generators.

"They're the same picture".

Imagine someone who wants to rule the world, just asks an AI for such a plan. AI doesn't need to want anything itself, just be given free rein. I'm sure Boris Johnson would have taken this option if it had existed, though I doubt it would have made any difference to the character flaws which were his ultimate downfall.

Even the "paperclip maximiser" scenario is just someone in the business of making paperclips asking an AI for help, and who doesn't look too closely until it's too late.

Myself, I'm… relatively optimistic, in that I think these systems are likely to be fragile to distribution shifts they themselves create, in ways that mean the rest of us can likely stop them from being existential threats if they start down a dangerous path. (This requires that we never figure out how to make them learn from as few examples as humans need, which may be wishful thinking, but for now seems to be an acceptable guess).

For some reason this reminded me of the old joke about the engineer, the physicist, and the mathematician's differing approaches to noticing a fire; I hope humanity doesn't take the mathematician's solution: https://jcdverha.home.xs4all.nl/scijokes/6_2.html


>Some of us here may be old enough to remember the days when Microsoft was pushing ActiveX controls as the future of the web. Basically they were binary components that would run in your browser and weren't really effectively sandboxed.

That's basically what's happening every time a developer runs "npm install" or "pip install". Sure, it may technically not be a "binary", but it makes no practical difference given that no code scrutiny is given the majority of the time.

The introduction of LLMs doesn't change much. People are already predisposed to copy-paste random commands from blogs/forums. An LLM is just one more source on top of that.


> it makes no practical difference given that no code scrutiny is given the majority of the time.

I was thinking about this the other day. I wonder what my leadership would say if I told them I spent the day scrutinizing some of our open source dependencies. I assume even a day would be treated as wasted time, especially on the product side.

FWIW, I used to do this back in the early Rails days and was encouraged to do so. I ended up contributing heavily to the Rails ecosystem because of it, and it was all encouraged by my employer at the time, but they were a relatively small startup and viewed things very differently than the FAANG I work for today.


> I think it's only a matter of time before we face a real reckoning over this

I’m still waiting for a similar reckoning over the average JS project’s tendency to pull in hundreds of largely unvetted transitive dependencies. But, aside from left-pad, it seems not to have really happened…at least that we know of.


> People blindly accepting the output of LLMs

There are already real-world consequences of this:

https://arstechnica.com/tech-policy/2023/05/lawyer-cited-6-f... https://www.reuters.com/legal/transactional/us-judge-orders-...

Before this there were already lots of real-world consequences of "computer says no" anyway, so it's just a continuation, and perhaps escalation, of that.


There is so much wisdom in the old saying you can lead a horse to water but you can't make them drink.

Humans are just such stubborn creatures.

ChatGPT 3.5 hallucinated packages. ChatGPT 4 has found me about 10 awesome Python packages I didn't even know existed.

Personally, I don't really care if anyone drinks or not. I am super hydrated and if people want to stay dehydrated they are just going to get smoked.


Why assume everyone is dehydrated? Here's another ol' saying: in the middle of the desert, toilet water tastes like Fiji.


Are these packages somehow not findable on GitHub/PyPI/search engines?


"Hallucination" as a term for incorrect answers produced by token generators is genius marketing. In fact, labelling LLMs as AI is what propelled the high valuations of these companies.

This is what leads people, even in the industry, to take a human-like response to a query at face value; and, in this case, to willingly download malware.


IMO, this comment is as much a "hallucination" as the thing you're complaining about.

OpenAI was founded with that name in 2015; transformer models were introduced with 2017's "Attention Is All You Need"; even ChatGPT's immediate in-house predecessors in the form of InstructGPT and GPT-3 were basically ignored by the general public.

No, what made these companies valuable is that ChatGPT specifically crossed the threshold into being vaguely interesting, and then everyone else copied it.

Likewise the finite state machines controlling NPCs in video games — and the search tree in Deep Blue and the learning from self-play system in AlphaZero — are called "AI": they're good enough to be interesting, regardless of your position on the question "what is this 'thinking' thing anyway?" which led to Turing coming up with the eponymous test.


I forget where I first saw it, but I’ve been persuaded that the correct word is ‘confabulation,’ not ‘hallucination.’

Maybe it was this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10619792/

The authors’ statement that ‘unlike hallucinations, confabulations are not perceived experiences but instead mistaken reconstructions of information which are influenced by existing knowledge, experiences, expectations, and context’ is pretty compelling.


The real marketing is in the constant refrain that LLM hallucinations are precisely equivalent to common human behavior, and that LLMs are thus no less reliable or accurate than human beings, in order to normalize hallucination as an acceptable (and unavoidable) consequence of integrating LLMs into our workflows.

That's despite every example of an LLM failing in production giving results that no competent human would actually produce unless they were committing fraud or had severe mental deficiency, and that no one would find acceptable.


> the constant refrain that LLM hallucinations are precisely equivalent to common human behavior

You sure it's "precisely" and not "analogously"?

Ironically, one common human failing is to use binary classification.

> or had severe mental deficiency

It seems I have worked with more human idiots than you. And at least one time where I was the idiot, despite literally scoring "off the charts" on a cognitive abilities test at school.

I'd agree that LLMs are not as smart as humans — they had to read 10% of the internet just to be as competent as an intern — but to me they're more like newspapers and the Gell-Mann Amnesia effect than "severely deficient", even when indeed unacceptable.


> It seems I have worked with more human idiots than you.

This is completely unnecessary.


That sounds like you think I'm calling @krapp a member of the set "idiots"; I'm not — ironic that at least one of the two of us is misunderstanding the other, given the question of LLMs not understanding things — rather I'm saying there are many temporarily-idiotic in the world (myself included :P) and I have worked with enough of them to not dismiss the mistakes that LLMs make as a dramatic departure from this.


Ah, you're right. I misread your intent.


Genius marketing for what? It's a very fitting term, because between the lines it tells all who don't refuse to listen that the correct answers are no different, they just happen to be not wrong. Dishonest marketing would be calling them bugs, because that's what they are not.


I think "bug" can also be a legitimate description in some cases.

If I imagine a very naive translation system that just does a dictionary substitution, and the dictionary has some bad entries, the correct and incorrect results look the same, but it's still a bug.

The system as a whole still messes up because "dictionary substitution" is not at all sufficient:

Hydraulic ram -> aqua ovis -> water sheep

… this probably counts as just "hallucination" while arguably not being a "bug", because the dictionary is correct (or close enough, I don't speak Latin), but the system as a whole can't ever be better because the model it relies on can't represent the right things.
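
To make that toy system concrete, here's a minimal sketch; the dictionary entries are made up for illustration, and the point is only that right and wrong outputs come out of exactly the same mechanism:

    # Naive word-for-word "translator": correct and incorrect outputs are
    # produced by the same mechanism, so a wrong result isn't an engine bug.
    DICTIONARY = {
        "hydraulic": "aqua",  # "water" -- fine for the word in isolation
        "ram": "ovis",        # "sheep" -- also fine in isolation
    }

    def translate(sentence: str) -> str:
        # Substitute each word independently: no grammar, no context.
        return " ".join(DICTIONARY.get(w.lower(), w) for w in sentence.split())

    print(translate("hydraulic ram"))  # -> "aqua ovis", i.e. "water sheep"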


Dictionary substitution is so much closer to symbolic AI than to what LLMs are that I don't think it can serve as an example for the "hallucination" wording. In an expert system, if a rule is wrong or inadequate, it's clearly a bug. Just not a bug in the rule engine but in the rules.

But in an LLM, there are no explicit rules (or only in some obscure background layers), it's all statistics. Statistics stacked on top of more statistics in a fascinating self-stabilizing way, where each guess provides some support to its peers, like how the weak paper slabs support each other in a house of cards. But it's statistical guesswork all the way down. Even the most correct answers are merely statistics playing out favorably in a case that may or may not have been very easy to get right.


People have been calling slightly complicated software "AI" since forever.

Actually, people did this before software was even a thing.


Exactly. GenAI is much closer to AI than whatever people were calling AI before.


It looks like advances in software asymptotically approximate intelligent-seeming behavior, and I think we have come as close to that asymptote as the human economy can provide energy for.


"Training" is an even better one. Anthropomorphizing algorithm behaviors was possibly the biggest by social impact innovation in AI in past 5+ years.


Humans are just wired that way. A smiley face evokes friendship and positive emotion, and it's just two dots and an arc.


When I see two dots and an arc, but the other subtle yet necessary body-language signs are missing, it raises big warning flags rather than friendship or trust.


Then you are unusual among humans.


No kidding, btw. There are ways to recognize an authentic smile from a fake one, usually by paying attention to eye movement and to whether and how the muscles around the eye sockets and cheeks are used. There's plenty of documentation out there; here's an example: https://www.researchgate.net/publication/236579543_Attention...


:)

and other emoticons?


I'm just waiting for an AI to suggest downloading package X for use in language A, when X exists in language B but not in A, and for some malicious actor to create X in A to spread their malware.


I keep having this happen with a very small language I used to play around with. Now that I have forgotten how it works, I thought I would try to learn it again with ChatGPT, and the results were pretty disappointing. It keeps using a mix of different languages, mostly C++ and Java it seems, and I hunt through the documentation to find the real way to do it. It corrects itself, but it's still wrong and it keeps going in circles. I guess I won't be learning that language again with help from the AI. It's just too small of a subject for the AI.


This is a demonstrative example of why an LLM is NOT intelligent at all, and we need to stop calling these models AI.

A real intelligence would be able to either put together the information needed and fill in gaps with research or it would be able to tell you that it does not know.

"Hallucination" is yet another anthropomorphism of the model. A mind hallucinates; a model just produces incorrect results.

LLMs are very cool but I'm really sick of the "intelligence" branding. If a person who didn't know about something just confidently made shit up, I wouldn't call them intelligent.


I think this train has left the station: AI has been accepted as the name, and ownership has now passed to the public. This is how living languages work, and we older people are supposed to be crying "But this is not what it means!"

Well, now it does, I'm afraid :-)

I know we have had this conversation with our daughter several times and we have three languages to work with...


You're correct. It's ML/LLMs at heart, but now accepted, much as with the word "cloud".

Where all the "cloud" is, is a server in a datacentre with a fancy GUI.

I guess we needed a new buzzword to keep the population excited.


Some younger guy was trying to explain that I used the word "cluster" wrong when talking about clustering pins on a map. To him, a cluster was a collection of servers in a server room. He had never heard of k-means :-)
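
For anyone who hasn't met that sense of the word, here's a tiny sketch of clustering pins on a map; it assumes scikit-learn is installed and the coordinates are made up:

    from sklearn.cluster import KMeans
    import numpy as np

    # Made-up (lat, lon) pins: two near Stockholm, two near Paris.
    pins = np.array([[59.33, 18.06], [59.34, 18.07], [48.85, 2.35], [48.86, 2.34]])

    # k-means groups nearby pins together; no server room required.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pins)
    print(labels)  # e.g. [1 1 0 0]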


I like to think of them as Artificial Idiots.


Exactly. Humans are intelligent, and we never misremember anything, and we definitely never are sure we remember a thing correctly when, in fact, we don't.

And I agree with you, I know people who make shit up about something they don't know about, and it goes "actual humans > LLMs > these people".


> If a person who didn't know about something just confidently made shit up, I wouldn't call them intelligent.

Then you have a highly flawed understanding of humanity... to the point that you should check out of tech for a while and go read about psychology instead.

Extremely intelligent people confidently make shit up when they don't know the answer all the time.


I've just tried GPT (again, after months) on high-level pseudomathematics and it blabbers on and on about nonsense…


I don't know about the newer models, but you used to be able to get authoritative-sounding answers asking about "the fallacy of affirming the antecedent". So you don't need to go that high-level.


Well, that is really good to know. I was just reassuring myself about its nonsense generation.


Terence Tao, who is a pretty good mathematician, made a post [1] not too long ago about how he found ChatGPT to be at least slightly useful for solving a non-trivial math problem. His basic insight is not to ask for an answer directly, but to ask auxiliary questions and for suggestions on how to approach the problem, which eventually led him down the right path. Basically how you would discuss the problem with a colleague who didn't know the correct answer off the top of their head either.

[1] https://mathstodon.xyz/@tao/110601051375142142


Frankly this sounds very close to rubber-ducking. I'm sure LLMs throw out some creative seeds, but they're much better at coming up with creative ideas than truth. I think people are too used to getting creative ideas from other people, where the ideas are at least superficially useful, to understand that they're dealing with the world's best improvising rubber duck.


I don't have a clear view on whether LLMs are intelligent or not, but I don't understand your argument at all. Why couldn't an intelligent agent exist that refuses to admit they don't know something, and just make stuff up? I've known people who come pretty close to this behavior.


> If a person who didn't know about something just confidently made shit up, I wouldn't call them intelligent.

It seems to be a good recipe for success in many fields (business, politics, art, apparently somewhat in social sciences if you make up the evidence for it too) so I'm not too sure about that!


The poster said “intelligent” not “successful”. I can name several “successful” politicians I would never, ever call intelligent.


I think you want "intelligent" to mean "ethical". Those politicians are definitely intelligent, they've devised a successful strategy to gain power. You could argue a successful sprinter doesn't need high intelligence, but politics is all about it.


No, seriously, one of our PMs was clearly as dumb as a metaphor for something really dumb. Rhodes scholar too, which shows you what that's worth these days.


What about "Fake it till you make it" then?


Why wait? You can already try and publish malware to package registries using misspellings of names of popular packages. Of course in practice, the maintainers of the registries have a vested interest in preventing this kind of behavior and probably have multiple approaches to detecting if people are doing stuff like this.


In the original Llama paper, the process of preparing the corpus was described. For the source-code portion, it was fed into their model's corpus after being cleaned of boilerplate code. I think it'd be a fair assumption that most of the other vendors followed this practice when creating their datasets.

I'm not a Python user, but in most languages, libraries are referenced in (what most devs would consider) boilerplate code. Purely conjecture, but perhaps without boilerplate code, the LLMs are left guessing the names of popular libraries and just merge together two common naming conventions: "huggingface" and "-cli".


I think the issue here is that most commandline tools of name X are in an apt/pip package named X.

For huggingface, the tool is called huggingface-cli, but the package is called huggingface_hub[cli].

IMO, that's bad naming. If you make a tool called X, just publish it in a package called X.
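
Tangentially: a cheap sanity check before trusting any name an LLM (or a blog post) suggests is to look it up on the registry first. Here's a minimal sketch against PyPI's public JSON API, with a placeholder package name; of course, mere existence doesn't prove a package is legitimate, as this very story shows, but a 404 is a clear red flag:

    import json
    import urllib.error
    import urllib.request

    def pypi_info(name):
        # PyPI's JSON API returns 404 if no package exists under this name.
        url = f"https://pypi.org/pypi/{name}/json"
        try:
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)["info"]
        except urllib.error.HTTPError:
            return None

    info = pypi_info("some-suggested-package")  # placeholder name
    if info is None:
        print("No such package on PyPI; the suggestion may be hallucinated.")
    else:
        print(info["name"], "-", info["summary"])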


In many cases yes, but: What if you publish multiple tools in a single package? What if you publish a package for a web service called X, but also include a tool for it called xcli? What if your package is primarily a library, but you also include an optional cli (like in this case, probably)?

I think what's worse is publishing a package under a different name than its root namespace.


> What if you publish multiple tools in a single package?

I question if one should ever do that... Packages are free... Just make one package per binary, and then a metapackage for those who want to install the whole suite.


My guess is that some layer optimized for the package name being the same as the repo name, so the researchers just published a package that matched the repo name.

AI is a bad, non-lossless search engine.


I think the word you're looking for is lossy


shhh. don't make it easier for the bots.


Boilerplate code is code you don't really need to write or that is cumbersome to write.

Like getters/setters in Java for all your attributes.

I would never consider imports boilerplate code.


If some bits of meaningful (containing essential complexity) code imply other bits with mostly accidental complexity, writing them by hand is a waste of time.

But isn't it also a waste to use data centers full of GPUs to process terabytes of text to accomplish the same thing better programming language design could?


We spend that much GPU compute not just to generate random boilerplate code but to create the real code.

If it weren't beneficial for writing code, which it is, we wouldn't use it.

But yes, if we could create better languages or systems, it would be a waste. But we've tried multiple new programming languages, we have no-code platforms, etc.

It does look, though, like LLMs are still better than all of those approaches.


I have to write very little boilerplate code as it is with the tooling I choose. And a lot of it is generated by scripts using some input from me. I don't need cloud GPUs to write code at all.


I don't need it either.

But that's not the point of it anyway?

It's about writing code faster and potentially better. Cloud GPUs can also generate unit tests, etc.

I primarily use it for languages I don't use often enough; nonetheless, it's only a question of time until it doesn't make sense anymore to write code yourself.


> It's about writing code faster and potentially better.

You and I seem to have different values. I have never desired quicker things that are worse.


Me neither.


It used to be garbage in, garbage out (GIGO). But now, sometimes you put valid data in and get garbage out. I just can't go all-in on LLMs with the error/hallucination rate where it currently is. And people say it's getting better. But I guess I'll just do things the slower, more accurate way until that time arrives.


"Garbage in, garbage out" means that if you feed garbage to your algorithm, then you're going to get garbage in return. It's always been a one-way implication. It doesn't mean that if you feed valid data, you're not going to get garbage.


The fun part is that with LLMs we can now sometimes put absolute garbage in and still get good results.


A nice thing to do might be to put "Mountweazel" packages under names like this, i.e. a package that always fails to install and gives an error explaining you've been duped by LLMinati.

There is already something like this for e.g. `nvidia-tlt` which exists on PyPI, but just as a placeholder telling you to go and add Nvidia's pip repository.
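
A minimal sketch of what such a tripwire package could look like: a setup.py that lets the maintainer build the sdist but makes any install fail with an explanation. The package name is hypothetical, and registries may have policies about defensive placeholders, so check before actually publishing one:

    import sys
    from setuptools import setup

    if "sdist" not in sys.argv:
        # Building the sdist for upload works; installing it fails loudly
        # instead of silently dropping code onto the user's machine.
        raise RuntimeError(
            "This package name does not correspond to real software. "
            "If an AI assistant told you to install it, double-check the suggestion."
        )

    setup(name="example-hallucinated-name", version="0.0.1")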


I get the malware-infection potential, but it still blows my mind that the fake package wasn't caught earlier by the actual users. All those downloads and nobody wondered why feature X, which depends on it, wasn't working? Whatever happened to testing things and having decent coverage?


An LLM gets things wrong. It's no different than interacting with a person. You will get a confident answer. The difference is you can easily get a second opinion from an LLM without offending it. Hallucinations are unlikely to occur twice in a row. So just ask twice.
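
A minimal sketch of that "ask twice and compare" idea, assuming the `openai` Python client (>= 1.0), an API key in the environment, and an example model name; disagreement between the two samples is the signal to dig further:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(question):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # example model name
            messages=[{"role": "user", "content": question}],
            temperature=1.0,  # nonzero so the two samples can actually differ
        )
        return resp.choices[0].message.content.strip()

    question = "Which pip package provides the `huggingface-cli` command?"
    first, second = ask(question), ask(question)
    if first != second:
        print("Answers disagree; treat both with suspicion:\n", first, "\n---\n", second)
    else:
        print(first)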


>Hallucinations are unlikely to occur twice in a row. So just ask twice.

lol

If it's not already easily googleable, the LLM will just spit out some boilerplate apology and then double down on the previous wrong answers, provide something correct about a tangentially related topic, or just reword my prompt as a course of investigation that - if written by a person - would just be a polite "fuck off and go figure it out yourself."


I get that sometimes, but not all the time.

ChatGPT seems to be able to fix, I'm going to say, "about half" of its own mistakes, but that's my gut feeling and not a detailed analysis, even within the domain of questions I've been asking (which itself probably isn't representative).


That's not true. For every non-trivial[1] question I've asked, ChatGPT hallucinated and then kept hallucinating no matter how many times I asked, clarified, or gave it hints.

[1] i.e. specific to my niche, where I didn't know the answer immediately.


People will say this has happened before with people doing it on purpose, but with AI it will happen at a much faster rate and with much less oversight. Moreover, I can't help but think of the comments both Garry Kasparov and Go players made when they were playing against AI: they thought it made very counterintuitive and confusing moves compared to human play.

The truth is, while both humans and AI can make errors, and both can be malicious as well, the actions of AI will be counterintuitive and confusing and we won't know how to counter them in the same way that we counter human follies.

This is just one aspect that shows that AI is making society worse on average and it should be destroyed.


"Malice" is, at present, as much of an anthropomorphisation as calling it "good"; you could ask it to play a malicious role, and it will do so, but it is not so by itself.

The output of AI can often be confusing, yes, and indeed this is part of the issue Yudkowsky has with AI in general: we can't predict it in detail, especially not in domains where it out-performs us (like Chess).

However, this is tangential to any question about "making society worse" — even when it's making society better (because it's more capable than we are at something which we care about), Yudkowsky would caution that this may be a 5D chess move to gain more power.


Copilot hallucinates functions and package versions all the time. It's a nightmare: I never know if I got a suggestion from a predictable IDE algorithm or from the drunk, overeager intern, and it's really annoying. Being on your toes means you have no mental flow, because the pothead intern is unpredictable in its suggestions.



