"Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. "
^ this is pretty cool and interesting. The collaboration they're doing with Boston Public Library to make articles similarly accessible also sounds pretty exciting.
Was Teletubbies a useful source for our current jobs? Probably not directly, but these things help build a general understanding of the world when you're starting from zero.
Not a lot of call to discern a hawk from a heronshaw (or handsaw) anymore, although that does say something about the power of LLMs, transcription errors, and judgement.
It seems testable: you could train a small model, like GPT-2, on the whole dataset, then on the dataset minus Shakespeare, and compare the loss after training.
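A minimal sketch of that ablation, assuming a JSONL dump with "text" and "author" fields (placeholder names, the real corpus schema will differ) and the Hugging Face datasets/transformers stack:

```python
# Sketch of the ablation: train the same small GPT-2-style model from
# scratch on the full corpus and on the corpus minus Shakespeare, then
# compare loss on a fixed held-out slice of Shakespeare.
from datasets import concatenate_datasets, load_dataset
from transformers import (
    DataCollatorForLanguageModeling, GPT2Config, GPT2LMHeadModel,
    GPT2TokenizerFast, Trainer, TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

def is_shakespeare(ex):
    return "shakespeare" in ex["author"].lower()

# "books.jsonl" and the "text"/"author" fields are placeholders, not the
# dataset's real schema.
books = load_dataset("json", data_files="books.jsonl")["train"]
shakespeare = books.filter(is_shakespeare)
others = books.filter(lambda ex: not is_shakespeare(ex))

# Hold out a fixed slice of Shakespeare so both runs are scored on the same
# text, and so the full-corpus run never trains on its own eval set.
split = shakespeare.train_test_split(test_size=0.1, seed=0)
eval_set = split["test"].map(tokenize, batched=True,
                             remove_columns=books.column_names)

def train_and_eval(train_set, run_name):
    # Fresh small model per run so both conditions start identically.
    model = GPT2LMHeadModel(GPT2Config(n_layer=6, n_head=8, n_embd=512))
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=run_name, num_train_epochs=1,
                               per_device_train_batch_size=8, report_to=[]),
        train_dataset=train_set.map(tokenize, batched=True,
                                    remove_columns=books.column_names),
        eval_dataset=eval_set,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    return trainer.evaluate()["eval_loss"]

full_train = concatenate_datasets([others, split["train"]])
print("loss, full corpus:      ", train_and_eval(full_train, "run_full"))
print("loss, minus Shakespeare:", train_and_eval(others, "run_ablated"))
```

Scoring both runs on the same held-out slice of Shakespeare is what makes the two losses comparable; a worse loss for the ablated run would suggest the model really was leaning on the Bard's text rather than on paraphrases of it elsewhere in the corpus.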
In the long tail of questions people ask, I would think asking about Shakespeare is relatively common (explain a play, explain a character, identify a quote, ...).
Unless you want to “write a Shakespearean sonnet about Bayes’ rule”. Here’s what Claude Sonnet (ahem) came up with:
"Bayes's Revelation"
What knowledge held before new facts arrive,
Our prior thoughts, though incomplete they be,
Must yield and change as evidence does strive
To shape posterior probability.
Through multiplicative steps we find our way,
As likelihood with prior beliefs combine,
The normalizing constant holds its sway,
While truth emerges, clearer line by line.
O Reverend Bayes, thy theorem divine
Shows how our minds should learn from what we see,
Each observation helps our thoughts refine,
Till posterior wisdom sets uncertainty free.
In statistics' realm, this rule holds fast:
Each present builds upon what's known from past.
The sonnet weaves together the key components of Bayes' theorem - prior probability, likelihood, and how they combine to form the posterior probability. Would you like me to explain any of these statistical concepts in more detail?
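For reference, the pieces the sonnet name-checks map onto the theorem directly: the posterior P(H|E) = P(E|H) * P(H) / P(E), where P(H) is the prior, P(E|H) the likelihood, and P(E) the normalizing constant.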
Although it would be an easier change now that Bard is Gemini…