Large language models generate functional protein sequences across families (nature.com)
172 points by samwillis on May 13, 2023 | 33 comments



Some notable technical information: it is a 1.2bln-param model, trained on raw sequences only, that can generate full-length functional proteins of about 150-200 residues (approximately lysozyme size). The generated proteins are very different from native ones (30-40% similarity).

The interesting thing about this model is that it also exhibits emergent capabilities. It was trained only on raw sequences but somehow managed to capture information about the functionality and solubility of the folded proteins, and then reflected that in the generated sequences.

Amino acid sequences are just a bunch of jumbled words if you compare them to English. They usually have to go through folding to form proper "sentences" with meanings. I guess you can compare this to "grammar". This probably means the model managed to learn protein grammar purely by brute force. Now if only we can get a model in the range of 100bln parameters...


> purely by brute force

the new underlying model of the world. First it was: Who needs to understand an algorithm when you can just simulate everything? Now it's: Who needs to simulate everything when you can just let a large enough black box approximate a solution?


It’s only a black box because our brains aren’t good enough.


But if our brains came up with the black box, can that really be true?


Yes. All that is required to make that true is that our brains come up with a model that we cannot comprehend or that reflects a reality that we do not understand. That has been true since cycles and epicycles were used to describe the motion of celestial bodies and probably long before that.


> Who needs to simulate everything when you can just let a large enough black box approximate a solution

You are kind of describing science here. Take chemistry; it's a series of black boxes: chemical reaction patterns, the Bohr model, VSEPR, orbital shell theory, perturbation theory.


> Now if only we can get a model in the range of 100bln parameters

Do we have enough data to train such a large model in a meaningful way?


we've gotten pretty good at predicting proteins from genome assemblies, but we don't have the manpower to manually curate that data.

Swissprot, the project that manually curates and scores protein sequences, has about half a million 'supported' protein models in its last release: https://web.expasy.org/docs/relnotes/relstat.html

TrEMBL, the database that has curated + uncurated, has 249,308,459 proteins in its last release: https://www.ebi.ac.uk/uniprot/TrEMBLstats

Generating uncurated protein sequences has become fairly easy: I can generate a draft genome assembly in about a day, and predicting proteins using MAKER or BRAKER or other HMM-based pipelines takes another day or so. Most eukaryotes have about 20,000 to 30,000 genes that translate into proteins. I'm involved in a project that assembled 300 draft genomes in the last 6 months, so now we have between 6 and 9 million proteins (most of them not 'novel' in function; novel genes/proteins are rare, and an HMM trained on known sequences cannot easily find novel genes).
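
The 6-9 million figure is just a back-of-envelope multiplication; a minimal sketch, assuming the rough per-genome gene counts quoted above:

    # Back-of-envelope for the protein counts above; the per-genome gene
    # counts are the rough figures quoted in the comment, not exact values.
    genomes = 300
    genes_per_genome = (20_000, 30_000)
    low, high = (genomes * g for g in genes_per_genome)
    print(f"{low:,} to {high:,} predicted proteins")  # 6,000,000 to 9,000,000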


Addendum:

To train an LLM you want 'diverse' training data, right? Not just Reddit posts but also technical manuals.

In protein space I'm confident we've sequenced the majority of what is out there. For most of these proteins we don't know what they do, but we know that they are there (being expressed in the host). In every genome sequencing project we can find a similar-looking gene for >95% of genes, with much of the remaining 5% being noise or contaminants.

To train LLMs we can't really generate more 'new' data in this space; this is it! Even 'my' 6 to 9 million proteins won't add much new information; at most the LLMs will learn the 'space' of possible changes for those existing proteins.


I think English is also just a rehash of the same 2,000-3,000 words. The ability to meaningfully arrange those words in new ways and in longer sentences is what makes the current LLMs so powerful. In that way, these protein models can still learn more from new data, but they will definitely need some sort of reinforcement learning and grounding. That is what wet-lab scientists will have to do. Verifying that a generated protein is functional and stable, i.e. "makes sense", is where we will need more data.

Only a minuscule fraction of proteins have verified structure and function. While an HMM can predict their functions pretty well based on homology, it can often be wrong. Having that kind of noise in the training data can degrade the performance of these protein models. Perhaps we need better curated datasets before we can scale these up.


We have only scratched the surface of what's out there.

It's unlikely that we will find totally novel proteins. But it's a mistake to forget about prokaryotes and phages. Everywhere we look there is something new. The oceans are vast and full of life. And we have barely begun to dig into the life deeper in the crust.

Seeing how all of this varies will only make models stronger. It's an enormous collection relative to human language.


> Amino acid sequences are just a bunch of jumbled words if you compare them to English. They usually have to go through folding to form proper "sentences" with meanings. I guess you can compare this to "grammar".

Thanks for a very informative comment! Would you have a minute to give a concrete example of such a sentence (something that could be in a training set)? Whenever I add "folding" to queries containing a variation of amino acids, Google starts showing images instead of sequences. (I have no background in bioinformatics.)


There's a LOT of interesting stuff happening with large language models in biology:

Transformer trained on human genomes, learns to identify different genomic elements like enhancers without knowing what enhancers are https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2

Similarly, a transformer trained on only plant genomes 'knows' the strength of genomic variants' impact https://www.biorxiv.org/content/10.1101/2022.08.22.504706v2

Lots of experiments happening around GPT-4 too:

Using regular GPT-4 to curate cell type annotations https://www.biorxiv.org/content/10.1101/2023.04.16.537094v1

Testing GPT-4 across a variety of 'standard' tasks like linking gene IDs to protein IDs https://www.biorxiv.org/content/10.1101/2023.03.11.532238v1


It would be super interesting if we figured out that everything in life, including language, is a fractal of molecular biology.


When evaluating this work, it's important to remember that the functional labels and protein family assignments on each of the 280 million input sequences were originally assigned by an HMM using human-curated sequence groups as part of the Pfam project, so the model is predicting a prediction (or perhaps "conditioned on a prediction" would be more accurate).

Furthermore, the authors must engage in a lot of human curation to ensure the sequences they generate are active. First, they pick an easy target. Second, they apply by-hand classical bioinformatics techniques to their predicted sequences after they are generated. For example, they manually align them and select those which contain specific important amino acids at specific positions, amino acids which are present in 100% of functional proteins of that class and are required for function. This is all done by a human bioinformatics expert (or automated) before they test the generated sequences. It is the protein equivalent of cherry-picking great ChatGPT responses and presenting them as if the model only made predictions like that.
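
To make that post-generation filtering concrete, a minimal sketch; the alignment positions, required residues, and candidate sequences here are invented for illustration and are not the paper's actual criteria:

    # Hypothetical filter: keep generated sequences only if they carry the
    # conserved residues at fixed alignment positions. Positions, residues
    # and sequences below are made up for illustration.
    REQUIRED_RESIDUES = {3: "E", 7: "D"}  # hypothetical catalytic positions

    def passes_filter(aligned_seq: str) -> bool:
        """True if every required residue is present at its position."""
        return all(
            pos < len(aligned_seq) and aligned_seq[pos] == res
            for pos, res in REQUIRED_RESIDUES.items()
        )

    candidates = {
        "gen_001": "MKAEIVKDGT",  # E at 3, D at 7 -> kept
        "gen_002": "MKAQIVKAGT",  # missing both -> discarded
    }
    kept = [name for name, seq in candidates.items() if passes_filter(seq)]
    print(kept)  # ['gen_001']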

One other comment: in protein science, a sequence with 40% identity to another sequence is not "very different" if it is homologous. Since this model is essentially generating homologs from a particular class, it's no surprise that, at a pairwise amino acid level, the generated sequences have this degree of similarity. Take proteins in any functional family and compare them. They will have the same overall 3-D structure, called their "fold", yet have pairwise sequence identities much lower than 30-40%. This "degeneracy", the notion that there are many diverse sequences that all fold into the same shape, is both a fundamental empirical observation in protein science and a grounded physical theory.
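
Concretely, "pairwise sequence identity" is just the fraction of aligned positions carrying the same amino acid; a minimal sketch with toy, already-aligned sequences:

    # Toy percent-identity calculation over two already-aligned sequences.
    # Real pipelines use proper alignment tools and more careful gap handling.
    def percent_identity(a: str, b: str) -> float:
        assert len(a) == len(b), "sequences must be aligned to the same length"
        matches = sum(x == y and x != "-" for x, y in zip(a, b))
        return 100.0 * matches / len(a)

    print(percent_identity("MKT-AYIAKQR", "MKTLAYLGKQR"))  # ~72.7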

Not to be negative. I really enjoyed reading this paper and I think the work is important. Some related work by Meta AI is the ESM series of models [1] trained on the same data (the UniProt dataset [2]).

One thing I wonder about is the vocabulary size of this model. The number of tokens is 26 for the 20 amino acids plus some extras, whereas for an LLM like Meta's LLaMa the vocab size is 32,000. I wonder how that changes training and inference, and how we can adapt the transformer architecture for this scenario.
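
As a rough illustration of that gap, a minimal sketch of a character-level protein vocabulary; the special tokens are assumptions, not ProGen's actual token set:

    # Hypothetical ~26-token protein vocabulary: one token per residue,
    # plus a few assumed special tokens (not ProGen's actual set).
    AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")            # 20 standard residues
    SPECIAL = ["<pad>", "<bos>", "<eos>", "X", "B", "Z"]  # assumed extras
    vocab = {tok: i for i, tok in enumerate(AMINO_ACIDS + SPECIAL)}
    print(len(vocab))  # 26, versus ~32,000 subword tokens in a text LLM

    def encode(seq: str) -> list[int]:
        """Character-level tokenization: one integer id per residue."""
        return [vocab[aa] for aa in seq]

    print(encode("MKTAYIAKQR"))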

1. https://github.com/facebookresearch/esm

2. https://www.uniprot.org/help/downloads


I consider all the manual curation effectively a form of RLHF that can be imposed automatically later on; we saw how much that can improve a raw LLM by looking at the output of ChatGPT. Without it, the criticism of LLMs being just glorified autocomplete machines isn't that far from reality. In other words, such curation is just an expected requirement for LLMs to be effective.

You are probably right that lysozyme is an easy target and may have large sequence variety between homologs, so calling 30-40% identity "very different" is not correct. But that is only in the context of biology and protein structure and function. This is an LLM trained on primary sequences only. It doesn't know anything about folds or domains or functional sites (unless I am wrong and those are part of the metadata fed to it during training). Yet it learned enough to generalize to the point that, even at only 30-40% identity, it still produces soluble proteins with the same function. I am sure you know that with only 40% identity, one protein can be in an entirely different superfamily from another. So it is still an impressively low identity score.

Also, I think it is more appropriate to compare the amino acids to something like letters of the alphabet rather than to vocabulary tokens. Domains would probably be the equivalent of the LLaMa vocab.


No, because an RLHF step is kind of independent; manual curation is really hard to fully disentangle from the original prediction.

There are a lot of proteins with names that are "legacy", sometimes assigned by homology, which probably miss important ways that biology uses the protein that were discovered later.


Perhaps fine-tuning is a better word? I am unsure what process lets an LLM switch from being just a next-word prediction tool to a chatbot. Instruction tuning?

The authors basically chose some of the output based on set criteria. I think this can eventually be automated and embedded into the protein language model, the same way ChatGPT now has guardrails and specific ways to answer questions instead of just continuing with the most likely next sentence (e.g., being asked what the capital of France is and responding with another question about what the capital of Germany is).


Instruction finetuning or RLHF. Both instances are "just" next-word predictors. Instruction tuning just changes the goals of the predictions. It doesn't necessarily make a model "smarter" (it didn't for GPT-4), but it does make it more accessible.


I'm surprised that Salesforce has a research division, and they're working on something like this.


The game theory of useless corporate research departments is that by spinning plates for you they’re not building moats for your competitors. There is quite a lot of money to be saved by essentially nerd sniping with large wads of cash.


I know that Salesforce is committed to the 1% initiative, so maybe this falls under that. 1% of their revenue can do a lot.


AI generating a virus/prion/whatever that we synthesize without understanding what it does is the easiest way to the bad singularity people were warning us about.


The bad singularity is for those who can't afford the coming life subscription plan.




Pretty sure the preprint is at https://www.biorxiv.org/content/10.1101/2021.07.18.452833v1..... Good for checking out what the editors have done.


That “large language models” are unfortunately named becomes clearer when they’re applied to non-language domains. They’re auto-associative models that learn mappings between discrete elements in a sequence or set.


2022, btw


>Published: 26 January 2023

Well, technically it came out in 2023, but sure, you can argue it has probably been in the preprint state since 2022...


This was supposed to be a reply to another comment. The GitHub repo is from 2022:

https://github.com/salesforce/progen


This is so coool!


Ooh, can't wait for generative cancer. Thanks AI.



