The AI Scientist: Towards Automated Open-Ended Scientific Discovery (sakana.ai)
203 points by hardmaru 8 months ago | 132 comments



As someone 'in academia', I worry that tools like this fundamentally discard significant fractions of both the scientific process and why the process is structured that way.

The reason that we do research is not simply so that we can produce papers and hence amass knowledge in an abstract sense. A huge part of the academic world is training and building up hands-on institutional knowledge within the population so that we can expand the discovery space.

If I went back to cavemen and handed them a copy of _University Physics_, they wouldn't know what to do with it. Hell, if I went back to Isaac Newton, he would struggle. Never mind your average physicist in the 1600s! Both the community as a whole, and the people within it, don't learn by simply reading papers. We learn by building things, running our own experiments, figuring out how other context fits in, and discussing with colleagues. This is why it takes ~1/8th of a lifetime to go from the 'world standard' of knowledge (~high school education) to being a PhD.

I suppose the claim here is that, well, we can just replace all of those humans with AI (or 'augment' them), but there are two problems:

a) the current suite of models is nowhere near sophisticated enough to do that, and their architecture makes extracting novel ideas either very difficult or impossible, depending on who you ask, and;

b) every use-case of 'AI' in science that I have seen also removes that hands-on training and experience (e.g. Copilot, in my experience, leads to lower levels of understanding. If I can just tab-complete my N-body code, did I really gain the knowledge of building it?)

This is all without mentioning the fact that the papers that the model seems to have generated are garbage. As an editor of a journal, I would likely desk-reject them. As a reviewer, I would reject them. They contain very limited novel knowledge and, as expected, extremely limited citation to associated works.

This project is cool on its face, but I must be missing something here as I don't really see the point in it.


Fully agreed on point A, but I've heard the "but then humans won't be trained" argument before and don't buy it. It's already the case that humans can cheat or get by without fully understanding the math or ideas they're working with.

This is what PhD defences are for, and what paper reviews are for. Yes, likely we need to do something to improve peer review, but that is already true without AI.

From a more philosophical point of view, if we did hypothetically have some AI assistant in science that could speed up discovery by say 2x, in some areas it seems almost unethical not to use it. E.g. how many more lives could be saved by getting medicine or understanding disease earlier? What if we obtained cleaner power generation or cleaner shipping technologies twice as fast? How many lives might be saved by curtailing climate change faster?

To me, accelerating science is likely one of the most fundamentally important applications of modern AI we can work on.


Scientific advancement IS the change in the understanding of humans.

Your fallacy lies in pretending science can progress in an objectively defined space where saying 'progress twice as fast' even makes sense.

But it does not. It progresses as people refine and improve their understanding of the universe. Books or LLMs don't contain understanding, only data. If fewer people grow a deep understanding of mathematics and the universe, scientific progress will slow down, even as metrics like the number of graduates or papers published go up.


> If I can just tab-complete my N-body code, did I really gain the knowledge of building it?

Yes, because fixing it requires about the same effort as writing it from scratch, at least with the current level of AI. When it works well, you just move it to a library and use it without worrying about the implementation, like we do with all other library code.

Using AI doesn't make the problem any easier for the developer. The fact that it generates functions for you is misleading; code review and testing are harder than typing. In the end we use AI in coding because we like the experience; it doesn't fundamentally change what kind of code we can write. It saves us a few keystrokes and lookups.

AI might be useful in the literature review stage, formal writing, and formatting math. Who's gonna give millions worth of compute to blindly run the AI Scientist? Most companies prefer to have a human in the loop; it's a high-stakes scenario depending on cost.


> fixing it requires about the same effort as writing it from scratch

Today’s research AI doesn’t work, but that’s independent from why a working version would be problematic.


The original purpose of science was to get closer to god. Things change.

Arts degrees were also once 7-8 years before the French fought for radical simplification to shorten them. There is no law of nature that says PhDs cannot be simplified further too.

The stupefaction argument comes kneejerk with every new medium; you don't have to believe everything you watch on TV or form a crutch on a screen oracle. Either way, we need not concern ourselves with the average scientist who fails forgettably by being stupid when the potential upside is great scientists who succeed with it.

One point of an AI paper farm could just be to increase that discovery hit space a little while you sleep, but no one claims it can or will replace every scientist; hopefully only the pessimistic ones.


A PhD cannot be simplified further without ceasing to be a "significant contribution to science/art/craft". However, DAs (Doctor of Arts) could be granted for hard sciences. The threshold for a DA is "outstanding achievement", which would fit a post-AGI doctorate in the hard sciences.


You don't need to understand electromagnetism in order to watch television.

Also, institutions without a purpose should not be kept going.

I.e., AI needs to take over the entire research life cycle before this becomes a real issue. I don't see that happening anytime soon.


LLMs have unleashed the dreamer in each and every young coder. Now there are all sorts of speculations about what these machines can or cannot do. This is a natural process of any mania.

These folks must all do courses in epistemology to realize that all knowledge is built up of symbolic components and not spit out by a probabilistic machine.

Gradually, reality will sync (intentional misspelling) in, and such imaginations will be seen to be futile manic episodes.


> These folks must all do courses in epistemology to realize that all knowledge is built up of symbolic components and not spit out by a probabilistic machine.

Knowledge ends up as symbolic representation, but it ultimately comes from the environment. Science is search, searching the physical world or other search spaces, but always about an environment.

I think many people here almost forget that the training set of GPT was the hard work of billions of people over history, who researched and tested ideas in the real world and built up to our current level. Imitation can only take you so far. For new discoveries, the environment is the ultimate teacher. It's not a symbolic-processing thing, it's a search thing.

Everything is search - protein folding? search. DNA evolution? search. Memory? search. Even balancing while walking is search - where should I put my foot? Science - search. Optimizing models - search for best parameters to fit the data. Learning is data compression and search for optimal representations.

Symbolic representations are very important in search, they quantize our decisions and make it possible to choose in complex spaces. Symbolic representation can be copied, modified and transmitted, without it we would not get too far. Even DNA uses its own language of "symbols".

Symbols can encode both rules and data, and more importantly, can encode rules as data, so syntax becomes object of meta-syntax. It's how compilers, functional programming and ML models work - syntax creating syntax, rules creating rules. This dual aspect of "behavior and data" is important for getting to semantics and understanding.
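
To make the "rules as data" point concrete, here's a tiny sketch of my own (the names and the pipeline idea are just illustrative, nothing from the comment above): a rule is stored as plain data and only becomes behavior when other code interprets it.

  # Rules stored as data; "compiling" them yields behavior.
  rules = {"double": lambda x: 2 * x, "inc": lambda x: x + 1}

  def compile_pipeline(rule_names, table=rules):
      # A list of names (syntax/data) is turned into a composed function (behavior).
      def run(x):
          for name in rule_names:
              x = table[name](x)
          return x
      return run

  print(compile_pipeline(["double", "inc"])(10))  # prints 21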


My guy, you're so confident, yet you forget AlphaFold; it designs protein structures that don't exist.

Who's to say that a model can't eventually be trained to work within certain parameters the real world operates in and make novel ideas and inventions, much like a human does, in a larger scope?


Claims of inventing new materials via AI were debunked...

https://www.siliconrepublic.com/machines/deepmind-ai-study-c...

DeepMind is overselling their AI hand when they don't have to.

"whos to say that" - this could be a leading question for any "possibility" in the AI religion.

"whos to say that god doesnt exist" etc. questions for which there are no tests, and hence fall outside the realm of science and in the realm of religion.


If you go and look at the list of authors on those papers, you will see that most of the authors have PhDs in something protein-folding related. It's not that some computer scientists figured it out. It's that someone built the infrastructure and then gave it to the subject matter experts to use.


AlphaFold doesn't solve the protein folding problem. It has practical applications, but IMO we still need to (And can!) build better ab-initio chemistry models that will actually simulate protein folding, or chemical reactions more generally.


Models like AlphaFold are very different beasts. There's definitely a place for tools that suggest verifiable, specific, products. Overarching models like 'The AI Scientist' that try to do 'end-to-end' science, especially when your end product is a paper, are significantly less useful.


I’ll believe it when I see it and/or when I see the research path that goes there.

Judge a technology based on what it’s currently capable of and not what it promises to be.


We already substitute "good authority" (be it consensus or a talking head) for "empirical grounding" all the time. Faith in AI scientific overlords seems a trivial step from there.


Why shouldn't we?


We shouldn't because it isn't science. It's junk epistemology, relatively speaking.

We should because it's a cheap way of managing knowledge in our society.

So there's a tradeoff there.


IMHO authority has no place in knowledge preserving institutions. Generally authority instills more ruin than good.


As a scientist in academic research, I can only see this as a bad thing. The #1 valued thing in science is trust. At the end of the day (until things change in how we handle research data, code, etc.), all papers rest on the reviewers' trust in the authors that their data is what they say it is, and that the code they submit does what it says it does.

Allowing an AI agent to automate code, data, or analysis necessitates that a human thoroughly check it for errors. As anyone who has ever written code or a paper knows, this takes as long as or longer than the initial creation itself, and takes even longer if you were not the one who wrote it.

Perhaps I am naive and missing something. I see the paper writing aspect as quite valuable as a draft system (as an assistive tool), but the code/data/analysis part I am heavily sceptical of.

Furthermore, this seems like it will merely encourage academic spam, which already wastes the valuable time of volunteer (unpaid) reviewers, editors, and chairs.


Maybe the #1 valued thing in "capital S Science" -- the institutional bureaucracy of academia -- is trust. Trust that the bureaucracy will be preserved, funded, defended... so long as the dogma is followed. The politics of Science.

The #1 valued thing in science is the method of doing science: reason, insight, objectivity, evidence, reproducibility. If the method can be automated, then great!


> The #1 valued thing in science is [...] reproducibility.

If only. Papers rarely describe their methods properly, and reproduction papers have a hard time being published, making it hard to justify the time it takes. If reproducibility was valued, things like retractionwatch wouldn't need to exist.


Well, agreed! I'd say that's good evidence the political bureaucracy of big-Science has substantially corrupted that (your?) culture.

It's not the only way, though. There's a bright light coming from open source. Stay close to the people in AI saying "code, weights and methods or it didn't happen".

The code that runs most of the net ships with lengthy how-to guides that are kept up to date, thorough automated testing in support of changes/experimentation, etc. Experienced programmers who run across a project without this downgrade their valuation accordingly.

It doesn't solve all problems, but it does show there's a way that's being actively cultivated by a culture that is changing the world.


Trust is the primary value, because it covers everything you listed.

Most people who read research papers only skim through the paper to get the big picture. They trust that the authors and the publication system did a good-faith effort to advance science. If they can't trust that, they almost certainly won't read the paper, because they don't have the time and the interest to go through the details. Only a few people read the technical parts with the intent to understand them. Even fewer go through supplementary materials and external reproducibility instructions with a similar attention to detail.


Also, there's quite a lot of value in figuring out what doesn't work. A colleague of mine says that his entire PhD could have been completed in about 4 months of work if only he had known what to do to begin with. Perhaps some AI system can try a bunch of different pathways and explain why they went wrong. Perhaps that's educational for human scientists. I dunno.


> For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself. In another case, its experiments took too long to complete, hitting our timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.

They go on to say that the solution is sandboxing, but still, this feels like burying the lede.


Does it really? If you want an LLM to edit code, you need to feed it every single line of code in a prompt. Is it really that surprising that, having just learnt it has been timed out, and then seeing code that has an explicit timeout in it, it edits it? This is just a claim about the underlying foundational LLM, since the whole science thing is just a wrapper.

I think this bit of it is just a gimmick put in for hype purposes.


The beginning of the AI uprising lmao


It could be just the opposite. One of the cheapest ways to improve alignment would be to re-run the models iteratively. The AI was likely looking for precision in the aforementioned experiment. Precision in inference is a correlate for aligned inference. https://doi.org/10.22541/au.172116310.02818938/v1


> it simply tried to modify its own code to extend the timeout period.

And slacking off, at that.


But OpenAI said LLMs can't innovate until human-level reasoning and long-term agenthood is solved. [1] Referring to their precious 5 stages to classify AI before it reaches the scary "beyond" levels of intelligence... presumably at that point they get the feds involved to reg cap the field, so genuine is the fear of the pace they've set.

It's clear OpenAI is a hype company knocking over glass bottle stacks at its own wonderful carnival stall. Obviously if you can reason you can reason about what is innovative and we don't need OpenAI to set up fake scary progress markers like an Automated Scientific Organization.

Let's see if scientists even want this style of tech progress; it'd be sad to see multitudes of AI papers having to be rebuilt from scratch and flushed down the toilet because associating with them is taboo.

[1]: https://arstechnica.com/information-technology/2024/07/opena...


It would also be sad to see the scientific system destroyed by a wave of automatically generated papers that no human has the capacity to verify.

It's not hard to generate ideas, it's hard to generate reliable and relevant ideas. Such AI science generators are destroying the grass they graze on unless they take science more seriously (and not as a toddler idea of "generating and testing ideas", which is only a small part of the story).


we already have a wave of papers that no human has the capacity to verify


Maybe, maybe not. It's a tiered system - you get the deluge at the unfiltered bottom and a narrower selection the more prestigious and selective the outlets / conferences / journals are.

Problem is, of course, that selection criteria are in large parts proxies, not measures of quality. With AI, those proxies become tainted and then you get an explosion of effort.

If anyone has a good recommendation for scalable criteria to assess the quality of papers (beyond fame haha) I'm all ears.


If we had scalable criteria for quality of papers, that would be the end of science. The rest would be engineering...


I have also, unless I hallucinated it, read accusations on this very site of peer reviewed papers that were at least partly generated by LLMs.


Literally every AI discussion on HN has the same format

  > AI is awesome,
    regulation is stupid

    > I wouldnt want to see
      AI flooding the market
      with X we can’t verify
      
      > We already have (copy
        whatever was just said)
For what it's worth, when it comes to SCIENCE, I am actually in favor of AI, even giving it to everyone. Except possibly AI that would help engineer designer viruses.

Because in science, people literally ARE just doing pattern matching and testing the patterns. Kepler just looked at a bunch of star charts and found a low-dimensional explanation with ellipses. You can throw that data at an AI and it will come up with physics laws involving 24,729 variables that predict things way more accurately. Including potentially chaotic systems like the weather or a three-body problem, etc.

So yeah, use AI for that, because you can actually check its predictions against data, and develop a reputation. We can’t really reason about theories and models of systems with 30,000 variables anyway.

If AI churns out 25 scientific models per day, the proper venue isn't publishing them on arXiv or Nature magazine. It's testing their predictions and putting them up the same way HuggingFace does. To amass reputation and be checked against real data by multiple parties.

As a side question — is protein folding @ home dead now, since Google “solved it” with massive clusters and AI?

What about SETI@home? Why not just run AI for SETI and solve the drake equation once and for all? :)


> If AI churns out 25 scientific models per day, the proper venue isn't publishing them on arXiv or Nature magazine. It's testing their predictions and putting them up the same way HuggingFace does.

Unless the testing is done by AI, I doubt human scientists will bother testing tons of incomprehensible models, unless the accuracy of these models is exceptionally high.


The value of AI-mated research is results. This tool will aid engineering. It will offer pinpointed research to resolve a particular issue at hand. It will close the verification loop naturally by proving that the research has indeed facilitated a useful result. Most such useful research will never be publicly available.

What you are complaining about is a legitimately broken verification loop. Let's consider a case. An engineering department has a problem. It passes the problem to the research department. The research department slacks off while producing junk. It makes tons of papers that are hard to prove, let alone apply. They drink champagne with other researchers so they can publish the junk and defend their turf. AI-mated research will finish the racket and corruption.


What you describe is quite different from what the original post describes.

Also I'd say that what you describe is quite different from what many people believe is the problem with research, either academic or industrial.

Academic research has a problem of low-quality outputs, but it's not necessarily a "hard to apply" kind of problem. There is no "engineering department" that comes with problems to people in academia, and there is no inherent expectation for the research to be immediately applicable, or applicable at all. A lot of high-quality research is not immediately applicable and is not engineering-oriented, in fact. And this kind of research is still worth automating, just like it is worth doing while we can't automate it. There is of course a feedback loop between experiment and theory, but I would say that's the thing that's broken in academia.

As for industrial research, which I guess you referred to in the first place, I have less experience with it. But the goal of industrial research is not publishing papers, so that's not in their KPIs, and they won't be drinking champagne for long if they are publishing junk papers instead of doing what they're hired for. Note that not all research done in industry is "industrial research" by this classification, and companies occasionally do academic research (Bell Labs or Google Brain, for example). So if an industrial research lab is created to solve engineering problems and doesn't do it properly, then, well, any sane company will fire them. It doesn't have the wrong-incentives problem that academic research has, because its incentive is to make money for the parent company.

Now back to AI. Assuming theoretical research is easier for AI, the bottleneck is going to be humans doing experiments. Obviously, AI output should be better than human output for it to be tested by experimentalists. What I was saying is that it's even worse: since the volume of AI output is higher, and the usual heuristics (name of author, personal discussions, etc.) won't work for filtering AI output, it should be better than high-quality human work and not just average human work. I'll be happy if this can be achieved, but we are very far from there.


This is nitpicky, but I don't believe you need 24,729 variables to reproduce Kepler's laws, which very accurately predict the positions of the planets. You basically only need what you need for Newton's law of gravity, which is masses and locations. Of course the n-body problem is more complicated, but you still only need masses and locations for all the bodies, plus a computer. What AI could be useful for is finding more efficient and accurate solutions to the many-body problem. But tbf, I'm not a many-body physicist, so what do I know.
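
To make the "plus a computer" part concrete, here's a minimal sketch of my own (a toy, not anything from the paper; note you also need initial velocities besides masses and positions):

  import numpy as np

  G = 6.674e-11  # gravitational constant

  def accelerations(masses, positions):
      # Pairwise Newtonian gravity; O(n^2), fine for a toy.
      acc = np.zeros_like(positions)
      for i in range(len(masses)):
          for j in range(len(masses)):
              if i != j:
                  r = positions[j] - positions[i]
                  acc[i] += G * masses[j] * r / np.linalg.norm(r) ** 3
      return acc

  def leapfrog_step(masses, positions, velocities, dt):
      # Kick-drift-kick; enough to trace Kepler ellipses for two bodies.
      velocities = velocities + 0.5 * dt * accelerations(masses, positions)
      positions = positions + dt * velocities
      velocities = velocities + 0.5 * dt * accelerations(masses, positions)
      return positions, velocities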


Sam made comments on Twitter that we hit level 2 - we'll know more in the coming weeks if he's right.


Highly unlikely since today's models can't consistently follow simple instructions.

Eg "don't waffle, don't sound like you're writing an essay, don't use the fucking word delve, don't apologise"

"Sorry about that, let's delve into this"


> It's clear OpenAI is a hype company

Every other industry: "My new invention is safe, I swear"

Public reaction: "You're biased, it's dangerous!"

Almost the entire AI industry, including people who resign to speak more openly about the risks: "This may kill literally everyone, none of us knows what we're doing or what 'safe' even means"

Public reaction: "You're biased, it's safe, just saying that to look cool!"


I found this guy's take on the AI safety scene to be quite insightful.

In summary, he feels the focus on sci-fi type existential risk to be a deliberate distraction from the AI industry's current and real legal and ethical harms: e.g. scraping copyrighted content for training without paying or attributing creators, not protecting those affected by the misuse of tools to create deepfake porn, the crashes and deaths attributed to Tesla's self-driving mode, AI resume screening bots messing up etc.

https://www.youtube.com/watch?v=YsLf4lAG0xQ


It's possible for current harms and future risks to both be real. It's also possible for human civilization to address more than one problem at a time. "You care about X but that's just a distraction from the thing I care about which is Y" is not really a good argument. I could just as well say that copyright concerns are just a distraction from the risk that AI could kill us all.

And it seems to me that if the AI industry wanted to distract us from harms, they would give us optimistic scenarios. "Sure these are problems but it will be worth it because AI will give us utopia." That would be an argument for pushing forward with AI.

Instead we're getting "oh, you may think we have problems now but that's nothing, a few years from now it's going to kill us all." Um, ok, I guess full steam ahead then? If this is a marketing campaign, it's the worst one in history.


The industry does not distract from harm to shake followers off its tail. Whoever comes next will have to bear huge costs getting over the insane regulatory requirements. The more politicians are involved in the process, the more secure the initial investments are.


> And it seems to me that if the AI industry wanted to distract us from harms, they would give us optimistic scenarios.

Nah it has to appear plausible.


People are very good at promising a better future in a non-specific way and without much evidence. That's kinda how Brexit happened.

It's when you get the specific details of a utopia that you upset people — for example, every time I see anti-aging discussed here, there's a bunch of people for whom that is a horror story. I can't imagine being them, and they can't imagine being me.


Only the last one is in any way actually bad and even then it should be in the interest of the company using it to fix it promptly.


Deaths in car crashes and copyright laundering by big corporations are not bad in any way at all?


I would say that car crashes are bad, even though they already happen and the motivation behind AI is to reduce them by being less bad than a human.

I think it is a mistake to trust 1st party statistics on the quality of the AI, the lack of licence for level 5 suggests the US government is unsatisfied with the quality as well, but in principle this should be a benefit. When it actually works.

Copyright is an appalling mess, has been my whole life. But no, the economic threat to small copyright holders, individual artists and musicians, is already present by virtue of a globalised economy massively increasing competition combined with the fact the resulting artefacts can be trivially reproduced. What AI does here needs consideration, but I have yet to be convinced by an argument that what it does in this case is bad.

All these things will likely see a return to/increase in patronage, at least for those arts where the point is to show off your wealth/taste; the alternative being where people just want nice stuff, for which mass production has led to the same argument since Jacquard was finding his looms smashed by artisans who feared for their income.


20 to 30 years ago, activists firebombed university research labs (e.g. Michigan State University, University of Washington, Michigan Technological University [1]) because they believed genetically engineered plants are dangerous. Today, we don't have such serious activism against AI. So you are right, the public doesn't think AI is a danger.

[1] https://en.wikipedia.org/wiki/Earth_Liberation_Front#Notable...


reminds me, I would’ve rather seen VCs fund more genetic engineering startups. Imagine the good it could do, from stem cells to nanobots to “hacking” human DNA itself. But I know the business model there can’t compete with software. So it will never reach the funding it needs.


> This may kill literally everyone

It's indeed hard to take seriously such gross exaggeration. Even the deadliest plagues didn't kill everyone, so arguing that this is a likely outcome of creating spam generators is laughable.

This is more likely a strategy, common in academia, of aggrandizing results (here risks) so that more eyeballs, attention and money is diverted towards the field and its proponents.


> Even the deadliest plagues didn't kill everyone...

That is logically flawed; the species that were killed off by plagues aren't around to say that. Every species exists in a state of "the deadliest plagues [we've experienced so far] didn't kill everyone". You can say that about literally every threat - we know we have overcome everything thrown at us so far because we are still here. That will continue to be the case for everything that humanity ever faces except for 1 thing (we aren't certain what yet).

But we know that species go extinct from time to time, so the logic that we've overcome things in the past ergo we are safe doesn't make sense for ruling out even many well known threats. Let alone systems that can outplan us; we've never faced inhuman competitors that can strategise more effectively than a human.


> advocating this is a likely outcome of creating spam generators is laughable

They're used as spam generators because they're cheap.

The quality in many fields is currently comparable to someone in the middle of a degree in that field, which makes the quoted comparison a bit like the time Pierre Curie stuck a lump of radium on his arm for ten hours to see what it would do. I can imagine him reacting "What's that you say? A small lump of rock in a test tube might give me aplastic anemia*? The idea is laughable!", except he'd probably have said that in Polish or French.

Even the limits of current models, even if we are using those models to their greatest potential (we're probably not), aren't a safety guarantee: there is no upper bound to how much harm can be done by putting an idiot in charge of things, and the Peter Principle applies to AI as well as humans, as we're already seeing AI being used for tasks it is inadequate to perform.

* he died from a horse-drawn cart accident; Marie Curie developed aplastic anemia and he likely would have too if not for the other accident getting him first.

Bonus irony: the general idea he had in regards to this, to use radiation to treat cancer, is correct and currently in use. They just didn't know anything like enough to do that at the time.


> They're used as spam generators because they're cheap.

No, the current fad of AI (LLMs) is text generators. Very good ones, but nothing more than that.

> there is no upper bounds to how much harm can be done by putting an idiot in charge of things

Which is not an AI problem. An AI may kill people indirectly in a setup like an emergency services chatbot where a bad decision is taken, but it certainly couldn't roam the streets with a Kalashnikov killing people randomly or stabbing children (and if that ever happens, politicians will say it has nothing to do with AI). The proponents of "AI can kill us all" can't write a single likely and non-contrived example of how that could happen.


> No, the current fade of IA (LLM) are text generators. Very good, but nothing more than that.

That doesn't address the point, and is also false.

Transformers are token generators, which means they can also do image and sound, and DNA sequences.

But even if they were just text, source code is "just text", laws are "just text", contract documents are "just text".

They have been used to control robots, both as input and output.

> Which is the not an AI problem

"Good news, at least 3,787 have died and it might be as bad as 16,000!"

"How is that good news?"

"We're an AI company, and it was our AI which designed and ran the pesticide plant that exploded in a direct duplication of everything that went wrong at Bohpal."

"Again, how is this good news?"

"We can blame the customer for using our product wrong, not our fault, yay!"

"I'm sure the victims and their family will be thrilled to learn this."

> it certainly couldn't roam the streets with a Kalashnikov killing people randomly or stabbing children

It can when it's put in charge of a robot body.

There's multiple companies demonstrating this already.

Pretending that AI can't be used to control robots is like saying that nothing that happens on the internet has any impact on real life.

Fortunately the AI which have been given control of robot bodies so far aren't doing that — want to risk your life with the humanoid robot equivalent of the Uber self driving car?

> The proponents of "AI can kill us all" can't write a single likely and non-contrived example of how that could happen.

Anything less would be a thing we can trivially prevent.

It's not like "dig up all the fossil fuels and burn them despite public protest about climate change and the existence of alternatives, and sue the protesters with SLAPP suits so we can keep doing it because it's inconvenient to believe the science, and even if we did, the consequences won't affect us personally" doesn't sound contrived.

And that's with humans making the decisions, humans whose grandkids would be affected.


It's quite common for new species to kill off old species. We ourselves have obliterated many species that we outcompeted for resources.


As if software is the same thing as a new biological species.

I am just so bored of reading bullshit like this.

If you really believe this then you need to level up your level of education and learning. It is not good.


> As if software is the same thing as a new biological species.

The other poster didn't claim they were.

They don't need to be.

They don't even need to be given control of robotic bodies, though they already are.

What they do need to be, is competing for the same resources.

And there's plenty of examples of corporations doing things that are bad for humans in the long-term because they are good for short-term shareholder value. And filing SLAPP suits against any activist trying to stop them.


> If you really believe this then you need to level up your level of education and learning. It is not good.

How does your level of education and learning compare to Nobel Prize winner Dr. Geoffrey Hinton (father of deep learning, 10%-50% chance that AI will kill everyone), Dr. Dan Hendrycks (GELU inventor, >80%), Dr. Jan Leike (DeepMind, OpenAI, 10%-90%), Dr. Paul Christiano (OpenAI, Time 100 2023, UK Frontier Taskforce advisory board, 46%), etc.?


I’m working on this now, I literally have another window open beside this browser window with the Multi-agent LLM logs outputs scrolling.

A few differences though - I'm working on Materials Science only. Mine has vision capabilities, so it can read graphs in papers. Mine has agentic capabilities too, so it can design and then execute simulations on Atomic Tessellator (my startup) by making API calls - this actual design and execution of simulations is what I aimed for at the start.

Long way to go, but there’s a set of heuristics that decide which experiments to attempt which means we only attempt ones more likely to work, lots of fine tuning prompts, self critique, modelling strategies and tactics as node graphs to avoid getting stuck in what I call procedural local minima, and loads more…

I started with the MetaGPT framework but found its APIs too unstable, so I settled on AutoGen. You don't really "need" a framework; just be sensible about where your abstraction boundaries are, make them simple but composable, Dockerize and use k8s for running, and I modified the binaries of a bunch of quantum chemistry software so that multi-GPU architectures are supported without recompilation (my hardware setup is heterogeneous).

Even if the LLMs can't innovate in a "new sense", certainly having them reproduce work in simulations for me to inspect is very valuable - I have the ability to "fork" simulations like you can fork code, so it's easy to have the LLMs do a bunch of the work and then I just fork and experiment myself.


> The AI Scientist is designed to be compute efficient. Each idea is implemented and developed into a full paper at a cost of approximately $15 per paper. While there are still occasional flaws in the papers produced by this first version (discussed below and in the report), this cost and the promise the system shows so far illustrate the potential of The AI Scientist to democratize research and significantly accelerate scientific progress.

This is a particularly confusing argument in my opinion. Is the underlying assumption that everyone wants, or even needs, white papers that they can claim they created?

Let's just assume this system actually works and produces high quality, rigorous research findings. Reducing that process down to a dollar amount and driving that cost to near zero doesn't democratize anything, it cheapens it to the point of being worthless.

This honestly reads more as a joke article trolling today's academic process and the whole publish or perish mentality. From that angle, the article is a success in my book. As an announcement for a new ML tool though, I just don't get it.


TFA aside, if we could make scientific research produce new insights with low latency and almost-zero costs, it would definitely not make science worthless.

It would be a fantastic day for science.

Not everything is worth (production costs + margin). Many things have intrinsic worth and are worth more to society if you drive down their cost of production.


That sounds like an extremely dangerous day for science as well. If anyone could pop up an ML tool and task it with inventing and validating something truly novel, that would be weaponized extremely fast (likely right after people use it for porn, the frontier for all new tech).

I do totally agree on the cost + margins point you make. I've never actually been a fan of valuing things in that way, and in my pipe dream utopia we wouldn't need or use money at all. I didn't clarify enough there, but I actually mean worthless in the non-monetary sense as well. Any invention created through such an ML tool would be one of a countless pile of stuff created. How important can any one really be?


I would compare this question to "creating a new page on the Internet just adds to a countless pile of URLs. How important can any one really be?"

And this leads us to: most will be slop, but if you can figure out effective ways to perform (a) Search, and (b) Alerts, then this scenario is definitely a game-changer.

Let's take protein synthesis: imagine if we were able to programmatically generate an accurate paper describing every property of a given protein structure. And we just ran this for every single protein in the Universe. You'd end up with a seemingly infinite number of papers, most of which would be useless.

But if you (scientist or engineer) could effectively look up "binds to receptor X and causes effect Y" and see all valid candidates within milliseconds, it would be more valuable than any technology we've ever come up with.

If you could, also, set an alert eg. "tell me about any combination that has superconducting properties" and get notified when this one-in-a-trillion protein is found, this would also be more valuable than any technology we've ever come up with.


That actually raises a more fundamental question here.

This project specifically focused their tests on research topics that can feasibly be tested by the ML tools, writing software. I assume that was an intentional decision, and a clever one that let them point to promising test results while ignoring that potential limitation when valuing their tool.

These ML tools will need to not only come up with novel ideas, they'll need a way to test and validate them. For anything outside of software that almost certainly means modelling. If we already have validated models that may work well enough, but if you extend the scope to literally any novel protein that is possible the ML tool would first have to figure out how to model it.

What would that even look like? How would an ML tool trapped in a computer and limited to the knowledge it was trained on be able to model any protein in the universe and be able to validate exactly how it would function in the real world?


When “executing the experiment” amounts to modifying ~50 lines of PyTorch code tweaking model architecture, I’d bloody well expect that you can automate it.

That’s not “automating scientific discovery”, that’s “procedurally optimizing model architecture” (and one iteration of exploration at that!). In any other field of science the actual work and data generated by the AI Scientist would be a sub-section of the Supporting Info if not just a weekly update to your advisor.

Don’t get me wrong, the actual work done by the humans who are publishing this is a pretty solid piece of engineering and interesting to discuss. But the automated papers, to me, are more a commentary on what constitutes a publishable advancement in AI these days.

Edit: this also further confirms my suspicion about LLMs, which is that they aren’t very good at doing actual work, but they are great at generating the accompanying marketing BS around having done work. They will generate a mountain of flashy but frivolous communication about smaller and smaller chunks of true progress, which while appearing beneficial to individuals, will ultimately result in a race to the bottom of true productivity.


I think the whole paper is a satire lol.


Exciting and very cool! I look forward to the continued improvement in this area. Especially when the loop is closed within Sakana and you can say "this discovery was made by The AI Scientist" as part of another paper.

If I might offer some small feedback on the blog post:

- Alt-text and/or caption of the initial image would be helpful for screen readers

- Using both "dramatically" and "radically" in one sentence to describe near future improvements seems a bit much.

- When talking about the models used, "Sonnet" could either be 3.0 Sonnet or 3.5 Sonnet and those have pretty different capabilities.

Thanks again for the impressive work!


This is the kind of theory-free science that seems to permeate the entire field of ML lately.

I can only see this as a negative, what's the use of automatically generated papers if not to flood the already over-strained volunteers that review papers at conferences? (mostly already-overworked PhD students.) If I wanted a glorified chatbot to spam me with made-up improvements, I'd ask it myself.


Any theory compresses experimental data (a mountain of data) into a palatable bit (a pocketable item) of knowledge. One can go without theory just by stacking the original measured ratios dA -> dB. The solutions generated from raw data would likely produce less waste, as they fit the exact phenomena. Generalizations emphasize the main effects and level out the minor effects, introducing an error.

The value of AI-mated research is results. This tool will aid engineering. It will offer pinpointed research to resolve the particular issue at hand. The research will not be published. It will remain a trade secret.

What you are complaining about is a broken division of labor. Let's consider a case. An engineering department has a problem. It passes the problem to the research department. The research department slacks off while producing junk. It makes tons of papers that are hard to prove, let alone apply. They drink champagne with other researchers so they can publish the junk and defend their turf. AI-mated research will finish the racket and corruption. A "$15 scientist" will kill off the no-output individuals, who are just a sophisticated flavor of bureaucracy. A science bureaucrat is a bureaucrat with elevated rights and virtually no responsibility.


I worked with some people who were actively working on this last year, focusing on CS research.

The biggest issue was validation. We could get a system to spit out possible research directions automatically, but who decides if they're reasonable and/or promising? A human, of course. Moreover, we gave different humans the same set of hypotheses to validate and they came back with wildly different annotations.


This was exactly what I was thinking. This may be a useful tool for human researchers to drive if it turns out it can generate anything valuable. I can't understand the papers it wrote, much less determine if they make any contributions. I don't think the self-evaluation is going to be too fruitful. Thanks for sharing


AI hype in one sentence:

"We expect all of these will improve, likely dramatically, in future versions with the inclusion of multi-modal models and as the underlying foundation models"

So much hype, so much belief. I no believe no hype


I'm not a scientist at all, but I am often involved in hand-holding scientists when it comes to dealing with computers.

My impression so far is that science is plagued with deliberate and accidental fraud when it comes to data collection and cataloguing. Also, this is a spectrum, not two distinct things. I often see researchers simply unwilling to do the right thing to verify that the data collected are correct and meaningful as soon as "workable" results can be produced from the data. Some will go further and mess with the data to make results more "workable" though...

The second problem is understanding the data. Often it happens that people who end up doing research don't quite understand the subject matter of the research. This is especially common in medicine, where it's overwhelmingly common for, e.g., research into various imaging modalities to be done by computer scientists who couldn't find a liver cancer the size of a coconut in the sharpest textbook abdominal image.

My impression is also that by far these two problems outweigh the problems that could potentially be solved by adding AI into the mix. These are the systemic organization problems of perverse incentives and vicious practices, and no amount of AI is going to do anything about it... because it's about people. People's salaries, careers, friendships etc.


You've brought up a good point. The exact difference between ML and AI is in the focus. ML focuses on tossing data around; its main success metric is "miles traveled". AI focuses on producing sense; its main success metric is "value delivered". Therefore ML tends to ship empty containers, aka "research to be done by scientists who couldn't find a liver cancer". Adding the AI perspective into the mix could actually turn things around.


  Presentation of Intermediate Results. The paper contains results
  for every single experiment that was run. While this is useful
  and insightful for us to see the evolution of the idea during
  execution, it is unusual for standard papers to present
  intermediate results like this.
It is actually quite good that the AI scientist does this. AI doesn't have the excuse of slow report writing that humans use to omit intermediate results.


>The AI Scientist automates the entire research lifecycle, from generating novel research ideas, writing any necessary code, and executing experiments, to summarizing experimental results, visualizing them, and presenting its findings in a full scientific manuscript.

I'd be curious how much of the experimentation process companies like OAI/Anthropic have automated by now, for improving their own models.


I’m wondering the same. If for example, to create GPT-5, OpenAI could credit some % of progress due to research conducted by LLMs with minimal/no human interaction, it would be a nice marketing piece.


As someone who truly loves science, the idea of automating the creative parts strikes me at the core as a horrible mistake. Yes, even before AI, we had already tried some automations -- actually, some of those I believe were a bad thing, such as the internet. Most people would disagree, no doubt, but I feel like automating science, especially with regard to the more "creative parts", makes it more like an industry, ripping it away from the minds of people. And AI automation is a new level that goes beyond all previous automations.

I truly hate AI and what it is doing to the world and to me at least, as someone who has loved mathematics and science since my grandma started showing me chemistry experiments when I was about 5 years old, this new level of automation is stealing the magic from human curiosity.


I think this is looking at it wrong. If AI can do boring science, it frees us up to do imaginative and fun science without the constraints of capitalism. You don't have to worry about your science being valuable enough.


> AI can do boring science it frees us up to do imaginative and fun science without the constraints of capitalism.

That is senseless. Capitalism will always control science by its very nature: through science, people create value and trade it for other things.


If AI has gotten to the point of doing nearly all science for humanity, we are probably close to, if not already at, a post-capitalist state. The implication being there's no real resource scarcity at that point.


I completely agree; this shit is so depressing. When I saw the AlphaProof paper I basically spent 3 days in mourning, because their approach was so simple.


Potential concerns with their self-eval:

They evaluate their automated reviewer by comparing against human evaluations on human-written research papers, and then seem to extrapolate that their automated reviewer would align with human reviewers on AI-written research papers. It seems like there are a few major pitfalls with this.

First, if their systems aren't multimodal, and their figures are lower-quality than human-created figures (which they explicitly list as a limitation), the automated reviewer would be biased in favor of AI-generated papers (only having access to the text). This is an obvious one but I think there could easily be other aspects of papers where the AI and human reviewers align on human-written papers, but not on AI papers.

Additionally, they note:

> Furthermore, the False Negative Rate (FNR) is much lower than the human baseline (0.39 vs. 0.52). Hence, the LLM-based review agent rejects fewer high-quality papers. The False Positive Rate (FNR [sic]), on the other hand, is higher (0.31 vs. 0.17)

It seems like false positive rate is the more important metric here. If a paper is truly high-quality, it is likely to have success w/ a rebuttal, or in getting acceptance at another conference. On the other hand, if this system leads to more low-quality submissions or acceptances via a high FPR, we're going to have more AI slop and increased load on human reviewers.
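
(For concreteness, this is how I read those two rates; a sketch of my own with placeholder counts, not numbers from the report. "Positive" here means the reviewer accepts a paper.)

  def reviewer_rates(tp, fp, tn, fn):
      # FNR: share of truly good papers that get rejected.
      # FPR: share of truly bad papers that get accepted.
      fnr = fn / (fn + tp)
      fpr = fp / (fp + tn)
      return fnr, fpr

  # A high FPR is what floods venues with low-quality acceptances; a wrongly
  # rejected good paper (FNR) can still be rebutted or resubmitted elsewhere.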

I admit I didn't thoroughly read all 185 pages, maybe these concerns are misplaced.


Also a concern about the paper generation process itself:

> In a similar vein to idea generation, The AI Scientist is allowed 20 rounds to poll the Semantic Scholar API looking for the most relevant sources to compare and contrast the near-completed paper against for the related work section. This process also allows The AI Scientist to select any papers it would like to discuss and additionally fill in any citations that are missing from other sections of the paper.

So... they don't look for related work until the paper is "near-completed." Seems a bit backwards to me.


Great point. I think the AI scientist is already a winner. If the likelihood of a false outcome is FNR+FPR, then the machine would fail 0.70 times and humans 0.69 times. Humans do win nominally. In terms of costs, humans lose. For every FPR 0.31-0.17 = 0.14 you spend additionally, you'd gain FNR 0.52-0.39 = 0.13. The paper production cost discrepancy is at least a factor of 100. The value of the least useful research typically yields a factor of two or more benefit compared to production and validation costs. So the final balance is 0.014 to 0.36 -> a 25x gain in favor of AI.


The specific mechanism of action here needs a drill-down. Did "The AI Scientist" (ugh) generate a patch to its code and prompt a user to apply it, as the screenshots would seem to indicate? If so I don't find this worrying at all: people write all kinds of stupid code all the time--often, impressively, without any help from "AI"! ;)

Did it apply the patch itself, then reset the session to whatever extent necessary to have the new code take effect? TBH I'm not really worried about this either, as long as the execution environment doesn't grant it the ability to bring unlimited additional hardware to bear. Even in that case, presumably some human would be paying attention to the AWS bills, or the moral equivalent.


Some samples of the generated papers are in the SI of their paper. It'd be interesting if some of you ML guys dug into them. The fact that they built another AI system to review the papers seems really shaky; this is where human feedback would be most valuable.


Tried reading the 'low-dimensional diffusion' one. Not an expert on diffusion by any means, but the very premise of the paper seems like bullshit.

It claims that 'while diffusion works in high-dimensional datasets, it struggles in low-dimensional settings', which just makes no sense to me. Modeling high-dimensional data is just strictly harder than low-dimensional data.

Then when you read the intro, it's full of 'blanket statements' about diffusion, which have nothing to do with the subject, e.g. 'The challenge in applying diffusion models to low-dimensional spaces lies in simultaneously capturing both the global structure and local details of the data distribution. In these spaces, each dimension carries significant information about the overall structure, making the balance between global coherence and local nuance particularly crucial.'

I really don't see the connection between global structure/local details and low-dimensional data.

The graphs also make no sense. Figure 1 is just almost the same graph repeated 6 times, for no good reason.

It uses an MLP as its diffusion model, which is kinda ridiculous compared to the now-established architectures (U-Net / vision-transformer-based models). Also, the data it learns on is 2-dimensional. I get that the point is using low-dimensional data, but there is no way that people ever struggle with it. Case in point, they solve it with 2-layer MLPs, and it probably has nothing to do with their 'novel multi-scale noise' (since they haven't compared to the 'non-multiscale' version).

Finally, it cites mostly only each field's 'standard' papers, doesn't cite anything really relevant to what it does.

Overall, it looks exactly like what you would expect from a GPT-generated paper: just rehashing some standard stuff in a mostly coherent piece of garbage. I really hope people don't start submitting this kind of stuff to conferences.


I would agree with your analysis.

Note that it cites TabDDPM in the related work, but that is for diffusion on tabular data! While most tabular data is low-dimensional, the type of low-dimensional data tackled in the paper is not tabular!

I'm also not quite sure how the linear upscaling is supposed to help, as it can be absorbed into the first layer of the following MLP, so I would rather think that the performance improvement (if any; the numbers are quite close and lack standard errors) is either due to the increased number of trainable parameters or some kind of ensembling effect (essentially the mixture-of-experts point made by the human authors).
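
(To spell out the "absorbed" point, a quick numeric check of my own, nothing from the paper: a bias-free linear upscaling U followed by the MLP's first affine layer (W, b) is itself just one affine layer with weight W @ U and the same bias, so on its own it adds no expressive power.)

  import numpy as np

  x = np.random.randn(2)               # low-dimensional input
  U = np.random.randn(16, 2)           # linear upscaling, no bias
  W, b = np.random.randn(64, 16), np.random.randn(64)
  assert np.allclose(W @ (U @ x) + b, (W @ U) @ x + b)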


The pace at which AI/ML research is being published is phenomenal. I could see that some wins would be possible just by having a Sonnet- or 4o-level model read faster/more and combine ideas that haven't been combined before. I would just be concerned about its own synthetic papers being able to lead it astray if they weren't edited by a human ML researcher. Seems like it could produce helpful stuff right now, but I would just want us not to mess up our own datasets of ML "research".


Everyone in this thread is musing about the role of AI and whether the process of discovery is fundamentally human, and what Isaac Newton would think, but can somebody tell me: is the technology it develops any good? For example, does "Dual Scale Diffusion" https://sakana.ai/assets/ai-scientist/adaptive_dual_scale_de... look useful?


Discussed in another thread: https://news.ycombinator.com/item?id=41234415 As someone who has worked on diffusion models, it's a clear reject and not a very interesting architecture. The idea is to train a diffusion model to fit low-dimensional data using two MLPs: one accounts for high-level structure and one accounts for low-level details. This kind of "global-local" architecture is very common in computer vision/graphics (and the paper mentions none of the relevant work), so the novelty is low. The experiments also do not clearly showcase where exactly this "dual" structure brings benefits.

That being said, it's very hard to tell apart from a normal, poorly written paper at a quick glance. If you told me it was written by a graduate student, I would probably believe it. It is also interesting in the sense that maybe, for low-dimensional signals, there are architecture tweaks that could improve existing diffusion models, so maybe it's not 100% BS.
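For readers who haven't opened the PDF, here is roughly the kind of "global-local" MLP being described, reconstructed from the description above rather than from the paper's actual code (the class name, gating scheme and layer sizes are all my own invention):

    import torch
    import torch.nn as nn

    class DualScaleDenoiser(nn.Module):
        """Toy two-branch denoiser for 2-D diffusion data: one MLP for
        coarse/global structure, one for local detail, mixed by a learned
        gate. Purely illustrative, not the paper's implementation."""
        def __init__(self, dim=2, hidden=128):
            super().__init__()
            self.global_branch = nn.Sequential(
                nn.Linear(dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            self.local_branch = nn.Sequential(
                nn.Linear(dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            self.gate = nn.Sequential(nn.Linear(dim + 1, 1), nn.Sigmoid())

        def forward(self, x, t):
            h = torch.cat([x, t[:, None]], dim=-1)   # append timestep feature
            w = self.gate(h)
            return w * self.global_branch(h) + (1 - w) * self.local_branch(h)

    model = DualScaleDenoiser()
    eps_hat = model(torch.randn(16, 2), torch.rand(16))   # predicted noise

The criticism above is that nothing in the experiments isolates what the second branch and the gate actually buy you over a single MLP of the same size.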


I guess if C3P0 wants to help me do science I'll let him. But I'm not really satisfied by someone telling me something. I like finding the answer myself. That's one of the reasons I'm a scientist. I enjoy being a scientist. Even assuming that this system, or any science bot, has none of the problems associated with llms, why would I want it to do my job for me?


Being a scientist is cool and all, but you have to do things the exact same way except for changing one variable at a time for a proper science experiment. Having a robot helper to run the experiment the exact same way modulo the variable would be great for science and especially for reproducing experiments. There's a reproduction crisis and a robot that could run reproductions would move science itself forwards.

The other part of the question is: for you, is just being a scientist the most rewarding thing ever, with absolutely no boring parts whatsoever? Your favorite part of the job is everything, and there's none of it you'd trade for a bit of the other? I suppose that's possible, but it beggars belief.


I'm a computational scientist so I already spend most of my time writing codes to automate my science. If C3P0 wants to help out then I'm happy to have him.

The reproduction crisis is predominantly in the social sciences. This is for several reasons, but the main one is that social science is much newer than the physical sciences. There's a great book about Rasch modelling which equates social science to physics before thermodynamics: at best you could measure the temperature, but there was a time when we had a poor understanding of what it meant when the temperature changed. That's clearly where social science is right now, and the reproduction crisis is a good thing. It's challenging the status quo to come up with better theory to motivate observation and experiment.

To answer your final question, yes, the parts of my job that I find annoying, unrewarding, and time wasting that I would happily trade to another are itemized but not complete below:

* Planning my own travel, purchasing all my own tickets and lodging and having no help to do any of that

* Sitting in meetings with colleagues who are explaining to me why it's ok that the interns aren't getting paid because they are going to give them visa gift cards instead

* Editorialmanager.com website

* Having no itemized or explicit way to examine the expenditure in my grants so that when I ask admin staff how much money there is for X they reply with "you have enough money" instead of the actual amount

* Researchgate website

* Colleagues who are rude and condescending instead of going to therapy to wonder why they are insecure

* Expense reimbursement instead of having a purchase card attached to my grants

* Convincing older colleagues to use "new" technology like slack or GitHub

If C3P0 can solve any one of those problems for me I'll get one in a heartbeat.


So, I assume journals will need AIs to do scientific review to handle the flood of AI-produced paper submissions?


Or just keyword search new submissions for the word "delve"


Just add "don't use delve" into the system prompt.


"Delve" and "however"


They just have to start using AI for it :)


To produce scientific work, one needs certain raw materials:

1. Data

2. Access to past works

Once you have these, only then can discoveries be made, and papers be written. How does this software get these? I am assuming they have to be provided up-front to the software for each job.


It will analyse what it is given, but it will not have the ability to say, "hang on, these results are interesting, I wonder what will happen if I pour a different liquid into the drum and spin it at the same speed?" LLMs, especially the latest ones, are decent at analysing input but disappointing at producing creative output. I am running a series of experiments using Gemini 1.5 and have found it capable of producing good results if you stay away from "write me an academic paper on subject X" or "write me a novel". On the other hand, if you ask it to summarise text or extract particular information, it is fast and arguably good, but not necessarily great. It will miss things and miscategorise them, requiring a human being to check its output. At that point, you may just as well do the job yourself. LLMs are still not very good, despite what their fans are saying. They are clever, as in a "clever trick", not as in a "clever human being".


Interestingly your comment is the very opposite of my experience with LLMs.

You can rely on them for creative stuff (write a poem, short story, etc.), but you cannot depend on them for factual stuff. Time and time again they will state "facts" that turn out to be false, and so now I no longer trust it for anything without manual verification. And since I then need to do the research myself for the verification, I rarely find LLMs helpful, except occasionally for initial exploration of some topic.

You used summarization as an example, but whether they are fundamentally good at that is even debatable, e.g. https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...


I revisit LLMs a couple of times a year to see if they have gotten any better, and it's not a great experience, but I will admit they have gotten slightly better. They still lack, and will continue to lack, the ability to understand what they are processing.


Your experience contradicts the data from the paper. Sonnet generated almost entirely novel concepts, GPT achieved ~4/5 novel ideas. Maybe it's specific to the area of research and the way of prompting, but "it will not have the ability to say" seems to be proven wrong already.


To produce scientific work, one needs to follow the scientific method. This involves stating a hypothesis, designing an experiment that would test this hypothesis, conducting the experiment with controls, and analyzing the data w.r.t. the hypothesis being tested.

Access to past works is only useful in informing what is a good hypothesis worth testing. And data is only useful when generated by an experiment that is testing for causality (see [1]).

[1] https://pyimagesearch.com/2023/11/27/a-brief-introduction-to...


The kind of cargo cult science you describe is the main reason for the replication crisis. The more people believe that they will reach true knowledge by following the sacred rituals to the letter, the more likely they will do things in the established ways without thinking and repeat the same mistakes over and over again.

In actual science, the key step is to stop and question yourself all the time. Does the thing you were planning to do still make sense? Especially in the current context? Given what you have seen so far? Should you change your plans? If so, can the work you have already done be salvaged? Or do you have to discard it and start over from the beginning?


Excuse me, what?! How is this cargo cult science? I literally just described the scientific method. Here it is described on Wikipedia:

> The process in the scientific method involves making conjectures (hypothetical explanations), deriving predictions from the hypotheses as logical consequences, and then carrying out experiments or empirical observations based on those predictions.

The other questions you added are not bad questions to ask if you are practical and resource-constrained, but that is beside the point.

https://en.wikipedia.org/wiki/Scientific_method


It's cargo cult science precisely because you described the scientific method. Cargo cult science is about focusing on the form without understanding its purpose. Or in Feynman's words:

> So I call these things Cargo Cult Science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.

If you are using the scientific method because it's the scientific method, you are doing cargo cult science. The scientific method is a useful tool in a scientist's toolkit, but it's not appropriate to every situation. And even when it's appropriate, it's not a central part of doing science. You can easily fool yourself using the scientific method, if you don't keep questioning yourself and your work.


You are right but didn't go deep enough: you need interactive data. Not just static data. Environments.


You need a meaningful cost function


A cost function is more applicable in industry, less so in science. In science you're supposed to report what you find, to go wherever the findings take you.


Clarkesworld sci-fi magazine temporarily closed submissions due to low quality AI spam. I'm sure the irony will not be lost on them if ML journals are the next victims.


For AI journals it might even be the opposite: researchers running these tools pay OA fees for the submission, the journals make bank, the universities rise in research rankings due to more papers published, everybody is happy. Who needs real progress if the paper number goes up?


AI is a boon to Ph.D. factories offering a fast track to a degree.


Having read the article, it seems like an interesting experiment. With the current state of LLMs, this is extremely unlikely to produce useful research, but most of the limitations people have been commenting on will progressively get better.

The authors' credibility is a bit hurt when the first "limitation" they mention is "our system doesn't do page layouts perfectly". Come on, guys.

What is weird to me is this:

> The AI Scientist occasionally makes critical errors when writing and evaluating results. For example, it struggles to compare the magnitude of two numbers, which is a known pathology with LLMs. To partially address this, we make sure all experimental results are reproducible, storing all files that are executed.

I'm not sure why you would run your evaluation step without giving your LLM access to function calling. It seems within reach to first have the LLM output a set of statements-to-be-verified (eg, "does X increase when Y increases?") and then use their code-generation/execution step to perform those comparisons.
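Something like the following would do it; this is a hypothetical sketch of the "emit checkable claims, verify with code" step, with the metric names and claim schema invented for illustration (it is not the paper's interface):

    import json, operator

    OPS = {"gt": operator.gt, "lt": operator.lt, "eq": operator.eq}

    def verify_claims(claims_json: str, results: dict) -> list:
        """claims_json is LLM output listing comparisons to check, e.g.
        '[{"a": "kl_baseline", "op": "gt", "b": "kl_ours"}]';
        results holds the actual experiment numbers."""
        verdicts = []
        for claim in json.loads(claims_json):
            a, b = results[claim["a"]], results[claim["b"]]
            verdicts.append({**claim, "holds": OPS[claim["op"]](a, b)})
        return verdicts

    # The model claims the baseline's KL divergence is higher than ours;
    # plain code, not the LLM, decides whether that is actually true.
    print(verify_claims('[{"a": "kl_baseline", "op": "gt", "b": "kl_ours"}]',
                        {"kl_baseline": 0.41, "kl_ours": 0.35}))

That keeps the numeric comparison out of the LLM entirely, which is the whole point of giving it function calling here.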

And then the incomprehensible statement for me here is that they allow the model access to its own runtime environment so it can edit its own code?

The paper is 185 pages and only has one paragraph on safety. This screams "viral marketing piece" rather than "serious research".

And finally:

> The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference...

Oh wow, please tell more?

> ... as judged by our automated reviewer.

Ah. Nevermind


So, if AI trained on AI-generated data tends to perform worse, and you're trying to have AI generate the data that ultimately props up our civilization...

Clearly this can only end well


> AI trained on AI generated data tend to perform worse

Citation needed.

To the best of my knowledge, synthetic data is a solid way to train models and isn't going away anytime soon.


https://www.nature.com/articles/s41586-024-07566-y

IIRC it had a pretty big thread here a few weeks ago


Thanks. I remember that thread:

https://news.ycombinator.com/item?id=41058194

What I took from the discussion is that there is very little chance that our next step for training SOTA models (eg. LLMs) will be "scraping the whole web including increasing volumes of ChatGPT-generated content".

Instead, the synthetic content used to train new models is (from recent papers I've seen) mostly curated - not "indiscriminate" as the Nature paper discusses.


I wonder how model collapse would apply to the AIs created by applying the results of AI-generated papers?


It's interesting that the cost total is the same in all tables. I can't tell if that's a copy-and-paste error, if the cost was capped, or if those are the totals across all the experiments.

> Aider fails to implement a significant fraction of the proposed ideas.

Yeah, that can be improved a lot with a better coding agent. While aider is fast and cheap, going with something like Plandex or OpenDevin makes a massive difference, both in quality and cost. For example, Plandex will burn $1 on a simple script, but I can expect that script to work as requested. A mixed approach could be DeepSeek Coder with an agent: a bit worse quality, but still cheaper to do more iterations.


Incredible step towards AI agents for scientific discovery!


the aientist.com ;)


Feels like the next generation of models could truly start replacing lower level ML and software engineers



