When evaluating this work, it’s important to remember that the functional labels and protein family assignments on each of the 280 million input sequences were originally assigned by an HMM using human-curated sequence groups as part of the Pfam project. In other words, the model is predicting a prediction (or perhaps “conditioned on a prediction” would be more accurate).
Furthermore, the authors must engage in a lot of human curation to ensure the sequences they generate are active. First, they pick an easy target. Second, they apply classical bioinformatics techniques by hand to the predicted sequences after they are generated. For example, they align the generated sequences and select those that contain specific important amino acids at specific positions, residues that are present in 100% of functional proteins of that class and are required for function. All of this is done by a human bioinformatics expert (or an automated pipeline) before the generated sequences are ever tested. It is the protein equivalent of cherry-picking great ChatGPT responses and presenting them as if the model only produced output of that quality.
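To make that concrete, here is a minimal sketch of this kind of post-generation filter, assuming the generated sequences have already been put into a common alignment. The file name, column indices, and required residues are illustrative stand-ins (loosely inspired by the catalytic Glu/Asp pair of lysozymes), not the paper’s actual criteria:

    # Keep only generated sequences that conserve required residues at
    # specific alignment columns. Assumes an aligned FASTA in which every
    # record has the same length, so columns are directly comparable.
    REQUIRED = {34: "E", 51: "D"}  # 0-based alignment column -> required residue (illustrative)

    def read_aligned_fasta(path):
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                else:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    def passes_filter(seq, required=REQUIRED):
        # True only if every key column carries the required amino acid.
        return all(pos < len(seq) and seq[pos] == aa for pos, aa in required.items())

    kept = [(name, seq) for name, seq in read_aligned_fasta("generated_aligned.fasta")
            if passes_filter(seq)]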
One other comment: in protein science, a sequence with 40% identity to another sequence is not “very different” if the two are homologous. Since this model is essentially generating homologs from a particular class, it’s no surprise that, at the pairwise amino-acid level, the generated sequences show this degree of similarity. Take the proteins in any functional family and compare them: they will have the same overall 3-D structure—called their “fold”—yet have pairwise sequence identities much lower than 30–40%. This “degeneracy”, the notion that there are many diverse sequences that all fold into the same shape, is both a fundamental empirical observation in protein science and a well-grounded physical theory.
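For concreteness, pairwise identity here is just the fraction of matching residues over the aligned, ungapped columns; a minimal sketch (conventions differ, e.g. whether to normalize by alignment length or by the shorter sequence):

    def percent_identity(row_a, row_b):
        # row_a and row_b are two rows of the same alignment (equal length),
        # with "-" marking gaps; identity is counted over ungapped columns only.
        assert len(row_a) == len(row_b)
        pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
        if not pairs:
            return 0.0
        return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

    # Two short, artificial aligned fragments
    print(percent_identity("MKT-AYIAKQR", "MRTLAWLGKQ-"))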
That said, I don’t mean to be negative; I really enjoyed reading this paper and I think the work is important. Related work from Meta AI is the ESM series of models [1], trained on the same data (the UniProt dataset [2]).
One thing I wonder about is the vocabulary size of this model. The number of tokens is 26, covering the 20 amino acids plus some extras, whereas for an LLM like Meta’s LLaMA the vocabulary size is 32,000. I wonder how that changes training and inference, and how the transformer architecture should be adapted for this scenario.
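For a sense of scale, a character-level protein tokenizer is tiny compared to a subword vocabulary; a minimal sketch (the exact 26-token inventory used in the paper may differ in its special tokens):

    # 20 standard amino acids plus a few special tokens; compare with the
    # ~32,000 subword tokens in LLaMA's vocabulary.
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]
    vocab = SPECIALS + list(AMINO_ACIDS)
    stoi = {tok: i for i, tok in enumerate(vocab)}

    def encode(seq):
        # One token per residue, bracketed by begin/end-of-sequence markers.
        unk = stoi["<unk>"]
        return [stoi["<bos>"]] + [stoi.get(aa, unk) for aa in seq] + [stoi["<eos>"]]

    print(len(vocab))            # 24 here
    print(encode("MKTAYIAKQR"))

The main practical effect is on the embedding table and output projection, which shrink to a couple of dozen rows, while sequences stay one token per residue since there is no subword merging.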
I consider all the manual curation to be effectively a form of RLHF that could be applied automatically later on. We saw how much such a step can improve a raw LLM by looking at ChatGPT's output. Without it, the criticism that LLMs are just glorified autocomplete machines isn't that far from reality. In other words, it is simply an expected requirement for LLMs to be effective.
You are probably right that lysozyme is an easy target and may have large sequence variety between homologs, so calling 30-40% identity "very different" is not correct. But that is only true in the context of biology and protein structure and function. This is an LLM trained on primary sequences only; it doesn't know anything about folds, domains, or functional sites (unless I am wrong and those are part of the metadata fed to it during training). Yet it learned enough to generalize to the point that, even at only 30-40% identity, it still produces soluble proteins with the same function. I am sure you know that at that level of difference, one protein can be in an entirely different superfamily from another. So it is still an impressively low identity score.
Also, I think it is more appropriate to compare amino acids to letters of an alphabet than to vocabulary tokens. Protein domains would probably be the closer equivalent of LLaMA's vocabulary.
No, because an RLHF step is somewhat independent of the base model, whereas manual curation is really hard to fully disentangle from the original prediction.
A lot of proteins carry "legacy" names, sometimes assigned by homology, that probably miss important ways biology uses the protein that were only discovered later.
Perhaps fine-tuning is a better word? I am unsure what process lets an LLM go from being just a next-word prediction tool to a chatbot. Instruction tuning?
The authors basically chose some of the outputs based on set criteria. I think this can eventually be automated and embedded into the protein language model, the same way ChatGPT now has guardrails and specific ways to answer questions instead of just continuing with the most likely next sentence (e.g., asking it for the capital of France and getting back another question about the capital of Germany).
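A minimal sketch of what that automation could look like, with a placeholder sampler standing in for the protein language model and a placeholder rule standing in for the curation criteria (both are hypothetical, not anything from the paper):

    import random

    def generate_candidates(n, length=120):
        # Placeholder: random sequences; a real system would sample from the model.
        return ["".join(random.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(length))
                for _ in range(n)]

    def passes_criteria(seq):
        # Placeholder rule standing in for the curation criteria
        # (conserved residues, length bounds, alignment checks, ...).
        return "E" in seq and "D" in seq

    def sample_until(target=10, batch=64):
        # Generate in batches and keep only candidates that pass the criteria.
        kept = []
        while len(kept) < target:
            kept.extend(s for s in generate_candidates(batch) if passes_criteria(s))
        return kept[:target]

    print(len(sample_until()))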
Instruction fine-tuning or RLHF. In both cases the model is "just" a next-word predictor; instruction tuning only changes the goals of the predictions. It doesn't necessarily make a model "smarter" (it didn't for GPT-4), but it does make it more accessible.
1. https://github.com/facebookresearch/esm
2. https://www.uniprot.org/help/downloads