We've gotten pretty good at predicting proteins from genome assemblies, but we don't have the manpower to manually curate all that data.
Swiss-Prot, the project that manually curates and annotates protein sequences, has about half a million 'supported' protein models in its latest release: https://web.expasy.org/docs/relnotes/relstat.html
Generating uncurated protein sequences has become fairly easy: I can generate a draft genome assembly in about a day, and predicting proteins with MAKER, BRAKER, or other HMM-based pipelines takes another day or so. Most eukaryotes have about 20,000 to 30,000 protein-coding genes.
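For a rough sense of those numbers, here is a minimal sketch (Python with Biopython; "predicted_proteins.faa" is a hypothetical filename standing in for MAKER/BRAKER output) that counts the predicted protein models in a FASTA file:

    # Count predicted protein models in the FASTA output of a gene-prediction
    # pipeline. Assumes Biopython is installed; "predicted_proteins.faa" is a
    # hypothetical path standing in for the MAKER/BRAKER output.
    from Bio import SeqIO

    n_proteins = sum(1 for _ in SeqIO.parse("predicted_proteins.faa", "fasta"))
    print(f"{n_proteins} predicted proteins")  # ~20,000-30,000 for a typical eukaryote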
I'm involved in a project that assembled 300 draft genomes in the last six months; now we have between 6 and 9 million proteins (most of them not 'novel' in function - novel genes/proteins are rare, and an HMM trained on known sequences cannot easily find novel genes).
To train an LLM you want 'diverse' training data, right? Not just Reddit posts but also technical manuals.
In protein space I'm confident we've sequenced the majority of what is out there. For most of those proteins we don't know what they do, but we know they exist (they are being expressed in the host).
In every genome sequencing project we can find a similar-looking known gene for >95% of the predicted genes, with much of the remaining 5% being noise or contaminants.
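A back-of-the-envelope way to check that >95% figure: search the predicted proteins against a reference database (with BLAST or DIAMOND, say) and count how many queries get at least one hit. A sketch, assuming hypothetical files "predicted_proteins.faa" (the queries) and "hits.tsv" (tabular outfmt-6 output, which puts the query ID in the first column):

    # Fraction of predicted proteins with at least one homolog in a reference
    # database. "hits.tsv" is assumed to be tabular BLAST/DIAMOND output
    # (outfmt 6), where the first column is the query ID.
    from Bio import SeqIO

    queries = {rec.id for rec in SeqIO.parse("predicted_proteins.faa", "fasta")}

    with_hit = set()
    with open("hits.tsv") as fh:
        for line in fh:
            with_hit.add(line.split("\t")[0])

    frac = len(with_hit & queries) / len(queries)
    print(f"{frac:.1%} of predicted proteins have a known homolog")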
To train LLMs we can't really generate more 'new' data in this space; this is it! Even 'my' 6 to 9 million proteins won't add much new information; at most the LLMs will learn the 'space' of possible variation within those existing proteins.
I think English is also just a rehash of the same 2,000-3,000 words. The ability to meaningfully arrange those words in new ways and in longer sentences is what makes the current LLMs so powerful. In that sense, these protein models can still learn more from the new data, but they will definitely need some sort of reinforcement learning and grounding. That is where wet-lab scientists will come in: verifying that a generated protein is functional and stable, i.e. that it "makes sense", is where we will need more data.
Only a minuscule fraction of proteins have experimentally verified structure and function. While an HMM can predict function pretty well based on homology, it is often wrong. Having that kind of noise in the training data can degrade the performance of these protein models. Perhaps we need better-curated datasets before we can scale these up.
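One way to build such a set is to keep only entries with direct experimental evidence. UniProt records a "protein existence" level for each entry, and its TSV exports can include that column. A sketch, assuming a hypothetical "uniprot.tsv" export with "Entry" and "Protein existence" columns:

    # Filter a UniProt export down to entries with evidence at the protein
    # level, dropping annotations inferred purely from homology (the noisy
    # part of the data). "uniprot.tsv" is a hypothetical TSV export with
    # "Entry" and "Protein existence" columns.
    import csv

    with open("uniprot.tsv", newline="") as fh:
        verified = [row["Entry"]
                    for row in csv.DictReader(fh, delimiter="\t")
                    if row["Protein existence"] == "Evidence at protein level"]

    print(f"{len(verified)} entries with protein-level evidence")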
We have only scratched the surface of what's out there.
It's probably unlikely that we will find totally novel proteins. But it would be a mistake to forget about prokaryotes and phages: everywhere we look there is something new. The oceans are vast and full of life, and we have barely begun to dig into the life deeper in the crust.
Seeing how all of this sequence space varies will only make the models stronger. It's an enormous corpus relative to human language.
Do we have enough data to train such a large model in a meaningful way?