What Meta learned from Galactica, the doomed model (venturebeat.com)
112 points by swyx on Nov 17, 2023 | 54 comments



I think the reason is that expectations were different. ChatGPT was released to the general public for general use. Galactica was really only noticed by the egghead press and egghead professionals (like myself) who rightly identified the failure of these statistical language models to have a grounding in facts. People whose job it is to notice details are going to push your model to the limit.

But it’s not like ChatGPT was better. If I recall, RLHF actually made hallucinations slightly worse. Even today OpenAI wouldn’t claim to be able to accurately summarize papers. It’s just that there was an endless supply of “write X in the style of Y” that was like catnip for journalists and kept them busy for months.


Meta/Yann LeCun were generating the hype around Galactica that caused expectations to be different. You can't say things like 'Type a text and Galactica will generate a paper with relevant references, formulas, and everything'[1] and expect your LLM not to make you look foolish. It wasn't about pushing the model to the limit, it was about setting realistic expectations of what it could do.

[1] https://statmodeling.stat.columbia.edu/2022/11/23/bigshot-ch...


> the failure of these statistical language models to have a grounding in facts

I feel the "grounding in facts" was a big mistake, and to a large extent still is. The worst case of it was symbolic linguistics and knowledge graphs. Why? Because symbols don't exist; facts and concepts and words have fuzzy boundaries, their meaning defined mostly or entirely through associations with other words. We don't learn symbolically, and we don't have a strong "grounding in facts" - not at the language layer.

I'd rather say that language and understanding, the way we do it, is statistical in nature, and "grounding in facts" is done through feedback.


Grounding in facts is the one thing where machines could actually improve upon humans (in the same way a calculator doesn't fail on every 10th calculation).


In the context of LLMs, it's only ever going to work out that way with math/stats, because garbage in, garbage out will taint LLM output to roughly the same degree as human generation, since humans are the garbage producers in both scenarios.

To bypass the garbage-in problem, you'll almost certainly need a different AI structure from LLMs, a proper logic model.


Is it the fundamental architecture of an LLM that makes this problematic, or the input?

Based on how it's reported that they work, I assume that even without garbage in, their statistical prediction of the next word would occasionally produce bad output. E.g. they might come up with 2 + 2 = 5 because of mathematical inputs such as "1 + 2 + 2 = 5".


Except it's unlikely because they have a bit larger context and are modeling a bit more than one-dimensional probabilities of "what comes after this?".

I'd say that LLMs are more resistant to GIGO, as long as the fraction of garbage in training data is small - it'll look like outliers to the larger model.
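
To illustrate both of these points with a toy sketch (nothing like a real transformer, just a frequency count over made-up training lines): a small amount of arithmetic garbage gets outvoted, but a large amount would not.

    from collections import Counter

    # Toy "training data": mostly correct arithmetic, plus a little garbage.
    training_lines = ["2 + 2 = 4"] * 97 + ["1 + 2 + 2 = 5"] * 3

    # Count which token follows the context "2 + 2 =" anywhere in the data.
    next_token_counts = Counter()
    for line in training_lines:
        tokens = line.split()
        for i in range(len(tokens) - 4):
            if tokens[i:i + 3] == ["2", "+", "2"] and tokens[i + 3] == "=":
                next_token_counts[tokens[i + 4]] += 1

    print(next_token_counts)  # Counter({'4': 97, '5': 3})
    # With garbage as a small minority, "4" still dominates the distribution;
    # raise the garbage fraction and "5" starts winning: garbage in, garbage out.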


ChatGPT relies very much on confirmation bias to do its magic. You ask something trivial you already know (and could have probably Googled in 15 seconds), you get back something glib and smooth in reply and you are wowed by how smart GPT seems to be.

It is significantly less impressive when you ask something you don't already know and can't Google easily.


That's more or less my experience with code generation. To be fair, I haven't paid for ChatGPT-4, which I'm told is much better. Still, if I ask 3.5 a leetcode style problem I've done, it comes up with a good solution. If I ask it a leetcode style problem I just made up, it generally comes up with garbage -- the solution doesn't work, solves a different problem than I asked, or ignores a restriction I imposed.

The place where I've found it really handy is if I want to make stuff up. Things like "Give me a mundane event that could happen to a 25-year-old leatherworker for a story I am writing." or "What should I name a piece of software that allows people to collaboratively draw a landscape scene?"


> It is significantly less impressive when you ask something you don't already know and can't Google easily.

What's more, you absolutely cannot trust it with that kind of query.

It's insane to me that LLMs are primarily being pitched as fact givers. That is such a bad fit... They are so much better at writing poetry and fiction in chunks, or maybe analyzing stuff from the context.


I actually love the Bing bot when searching for something very nuanced ("no, not like that ... [explanation]"). I haven't used Google's AI search yet; why learn a new tool?


I use it mostly for complex queries I can't Google and that's where it excels for me.


I use it _after_ doing complex googling to get an answer, and it is almost universally wrong, or misses the point completely.


It often just takes a few minutes of verifying against your understanding and use case, and asking a few more questions, and you can identify incorrect answers and arrive at the verified correct answer, or a useful answer if there is no correct answer, rather quickly, which should be how most people operate in most interactions with information. Not to mention it can now be forced to search the web and cite its sources.

I think at some point it just becomes a preference of your interface to the data you're searching, and I've been enjoying natural language.


How would you even know that it's less impressive if you don't know the subject matter?


Could you discern a good teacher from a bad teacher when you were in school and the topics were new? Did you ever find out someone was unreliable after they gave you some facts and later you discovered they were wrong? You won't have a 100% rate of detection and it'll be lower than in subjects you know about, but after a few times you can see if the new stuff it tells you is right or not.


Because it can be easy to test the result even if you don't know how to come up with a solution?

If I asked you to build a light switch I don't need to know anything about EE to see that the light turns on/off when I flip the switch.

LLMs often fail to provide functional solutions for non-trivial stuff (not impressive) - and once you know more about the domain, you usually figure out that the light switch will likely burn down your house in a few hours because it's overheating.


Yeah, ChatGPT wasn't better at not hallucinating, but RLHF made the model much more useful for the average consumer, so consumers used it, instead of just other ML people. And, like you said, consumers are a bit less critical than experts from the field.


>“write X in the style of Y”

"Write this proof in the style of a program"

I think there is something a lot deeper behind this
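
(One reading of the "deeper" thing, purely my interpretation: the proofs-as-programs idea, where a proof literally is a program. A one-line Lean example, just as illustration:)

    -- A proof of "A and B implies A" is just a function projecting out the left component.
    theorem and_left (A B : Prop) (h : A ∧ B) : A := h.1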


Rookie mistake, it should have been named Galactus. Though of course, even though Galactus has omniscience, it doesn't have futuresight, so perhaps it would still have been doomed.


Reference for everyone else: https://m.youtube.com/watch?v=y8OnoxKotPQ


The all knowing user service provider aggregator


Sure, but it still doesn't support ISO timestamps: https://github.com/acchiao/galactus/issues/34


Maybe I’m not understanding the article correctly, but what was the lesson? That Facebook should have told people it would lie, similar to how OpenAI did?

I'm actually interested in what the internal lessons for Facebook would've been on seeing their product replaced by a similar product only two weeks later, but unless I'm misunderstanding everything, the article doesn't really seem to touch on it. At least not beyond the fact that people still want the model, and that it's part of Llama and Facebook's new push for more open models.


They do state the lesson. It just isn't interesting:

> “The gap between the expectation, and where the research was, was too big.”

> Overall, Pineau said, “If I was to do it today, we would just manage the release.”


I feel that the article doesn't really cover what made Galactica more open to criticism and led to its reception, outside of pointing to a lack of awareness of hallucinations among the wider public, which was equally present during the initial launch of ChatGPT (most users never encountered OpenAI's disclosures on that topic; heck, we had lawyers relying on these models months after hallucinations as a phenomenon had been widely covered). Maybe I missed it, but honestly, the article doesn't contain much information about what Meta learned specifically from Galactica (rather than from the general reception of LLMs across the industry), other than that misadventure being the claimed reason for Llama 1's release strategy, which has already been superseded by the less restrictive access to Llama 2.

For me, the main reasons why Galactica was more heavily criticized than ChatGPT were twofold. For one, the more scientific aspirations that this model had from the outset, proclaimed boldly in the associated paper [0], made any factual errors or problematic output far less justifiable, compared to a model that, through its branding and design, was more conversational in nature.

That, I feel, leads directly into the second important reason. The initial launch of ChatGPT, very cleverly, targeted users of all backgrounds, making it more likely that the initial interaction the majority of users had with the model was more conversational and humorous than the type of systematic picking apart the more scientifically inclined crowd tends to partake in. This led to initial coverage of ChatGPT being filled with amazement by laypeople, drowning out a lot of criticism and more measured reactions in a sea of hype, whereas Galactica was consistently bombarded with less favorable reactions.

Overall, I still feel that the approach they took in building Galactica is one that should be explored further, and I am somewhat saddened that more focused LLMs have become less favored by researchers. I remain hopeful that as we explore the limitations of universal, conversational models, there will be a resurgence of more specialized LLMs similar to Galactica.

[0] https://galactica.org/static/paper.pdf


I think you are spot on; an LLM focused on science and medicine has a much higher bar to pass when it comes to accuracy.

I tried Galactica when it came out, and I have to say that subjectively at least the results looked much inferior to what the benchmarks suggest. In the paper they claim to be substantially better than GPT-3 on their larger models, while in my personal experience even some generous queries produced garbage output. I cannot remember whether the version available at the site was the largest model, however.


GPT-3 came before ChatGPT. Since it was not an instruction-tuned model, the output was quite bad unless prompted precisely.


GPT-3 was instruct finetuned a while before ChatGPT was even released (see InstructGPT paper and announcement). What ChatGPT went through is RLHF to align its responses to a more conversational level, but you could give GPT-3 (davinci) instructions long before ChatGPT.


> the initial interaction the majority of users had with the model was more conversational and humorous than the type of systematic picking apart the more scientifically inclined crowd tends to partake in

If I remember correctly, ChatGPT was grilled and prompt-hacked for months; it was the Twitter obsession of the day.


That is very true, but you had a simultaneous stream of amazed reactions that counteracted a lot of the negative reporting in the beginning. Such a luxury wasn't afforded to Meta's efforts at the time.


Yes, Meta's launch was pretty salty. No goodwill.


Maybe I misremember, but isn't the fault about expectations vs reality due to how this was introduced?

I had the feeling Meta released this with great fanfare saying "do science with it" so of course there was backlash when it was shown to hallucinate.

I mean the website still sounds like it https://galactica.org/mission/


"We got the introduction completely wrong" seems to've been one of the takeaways according to the article.

I suspect the underlying problem was that the limitations were obvious to the research team, who were excited that it was -less- limited than previous attempts in the same direction, and didn't realise how people who didn't find its remaining limitations obvious would react.

All very predictable in hindsight but because they didn't see it as a 'real' launch I don't think people with enough of an outside view were involved, and ... I think we've all been there with the "but that's obvious" problem, so (a) you're right (b) I'm still kinda sympathetic.


Galactica did nothing wrong. They released a language model to the one community well equipped to handle the limitations of such models. It was advertised as experimental. I think that is far better than releasing a model to the public at large, which in aggregate is less able to understand that hallucinations can occur.

They got trashed because (i) it came from Meta and (ii) it got targeted by (maybe well-meaning, yet...) short-sighted scientists on Twitter. On the first point, big companies are scrutinised more and tend to be criticised more; it is unlikely that any big company could have released ChatGPT without a huge backlash. I'm not sure "move fast and break things" would still work for Meta today. OpenAI had a privileged position for that kind of moonshot.

Then there are the well-documented Twitter clashes between Yann LeCun and opponents, which I believe made Galactica the perfect target when it was advertised on Twitter by LeCun. It felt a bit ridiculous at the time, like a wave of identity politics reaching science: my enemy did that so I'm going to work hard on trashing it regardless, and I'll double down on my opinions irrespective of new evidence.


> They released a language model to the one community well equipped to handle the limitations of such models. It was advertised as experimental. [...] Then there are the well-documented Twitter clashes between Yann LeCun and opponents, which I believe made Galactica the perfect target when it was advertised on Twitter by LeCun.

This is how it was advertised on Twitter by LeCun: "Type a text and http://galactica.ai will generate a paper with relevant references, formulas, and everything."

If he wanted to advertise it as experimental and make clear its limitations he could have written something like "Type a text and http://galactica.ai will generate something that looks like a paper with made up references, incorrect formulas, and anything."


Disagree. They claimed it did useful stuff; many people (including ex-scientists like me who are not short-sighted) saw the examples and how easily it was made to produce inaccurate information. In science, truthfulness matters a lot, and if you say you have an LLM that can summarize scientific articles, users can reasonably expect the LLM to produce accurate information at a much higher rate.


I think the only real mistake they made was taking Twitter drama seriously.


Exactly. There was the exact same pushback on ChatGPT, but they just didn't care.


So say we all.



I think I was in this meeting at Facebook. Yikes.


> A week after its release, Llama’s model weights were leaked by someone who posted the download link to 4chan

You filled in a form and downloaded them…

Not really a leak in my books


This model had exactly the feature I have been looking for for the last three days! What I need is something to semi-reliably "Translate Python code to Math." Sadly, it doesn't seem to be available anymore, and I can't find anything I could use. (There are only tools for translation in the opposite direction, i.e., math to code.)
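
In the meantime, a rough sketch of the kind of thing I'm after, not a working tool: the llm helper below is a hypothetical stand-in for whatever completion API is available, and the prompt and example mapping are made up.

    def llm(prompt: str) -> str:
        """Hypothetical stand-in for any chat/completion API."""
        raise NotImplementedError

    def python_to_math(code: str) -> str:
        # Ask a general-purpose model to render Python as LaTeX notation.
        prompt = (
            "Translate the following Python code into conventional mathematical "
            "notation (LaTeX). Output only the formula, no prose.\n\n" + code
        )
        return llm(prompt)

    # The desired direction (code -> math), the opposite of most existing tools:
    #   sum(w[i] * x[i] for i in range(n))   ->   \sum_{i=1}^{n} w_i x_i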


Which is interesting (at least from my simpleton perspective). The mathematical form is designed to communicate the essence of an idea rather than the exact process, so it seems much more suited to LLM-sourced expression.


Your comment made me realize that we probably have much more publicly available code than math, and we have only started writing code en masse in the last few decades. This, in turn, dovetails nicely with your point that math communicates the essence.


All knowing user info provider service aggregator?


The problem was that they told us that it was an AI for science, and when you confabulate science you definitely make a very bad impression.


From the article, it seems that the lesson they learned was that public feedback is bad, which was the wrong lesson to learn.


I remember the outcry when this thing was released and thinking: did we learn nothing from Microsoft's "racist" Tay bot? How did they not anticipate that journalists and Twitter were going to try to make it parrot things that make Facebook look bad?


Anyone know if they plan to continue work on Galactica? OpenAI sucks at reviewing papers.


Not related to Galactus, the all-knowing user provider service aggregator.


This article is absolutely whitewashing how bad the Galactica project is. On day one, people were using the scientific paper generator to make fake papers and wiki pages with real-looking citations supporting racist, sexist, homophobic, and genocidal ends.

This thing didn’t actually do much aside from hallucinate about the world in a scientific voice. The bot was allegedly supposed to help scientists explore connections between different complex theoretical fields, but it didn’t have a concept of what was real or practical, and the citations were all fake.

What value did they think was going to come from this aside from misinformation? At least ChatGPT could do your homework for you.


Huh? Is this implying that Galactica was there "first" and yet Meta lost out because they wouldn't tolerate hallucinations whereas OpenAI does?

Does VentureBeat realise there was GPT-3 before ChatGPT?



