
"Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. "

^ this is pretty cool and interesting. The collaboration they're doing with Boston Public Library to make articles similarly accessible also sounds pretty exciting.



Not sure if Shakespeare is going to be a useful source for most of the tasks users want to use LLMs for.


Was Teletubbies a useful source for our current jobs? Probably not directly, but these things help build a general understanding of the world when you're starting off with zero understanding.


Not a lot of call to discern a hawk from a heronsaw (or handsaw) anymore, although that does say something about the power of LLMs, transcription errors, and judgement.


My thoughts exactly, except for either creative writing or essays on Shakespeare


It seems testable in the sense that you could train a small model, like GPT-2, on the whole dataset, then on the dataset minus Shakespeare, and compare the loss after training.
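
Something like this, as a rough sketch using Hugging Face transformers — the corpus paths, held-out split, and single-epoch settings are placeholders, not the actual IDI data:

  # Rough sketch of the ablation: train the same small GPT-2 twice, once on
  # the full corpus and once with Shakespeare removed, then compare loss on
  # the same held-out set. File paths below are placeholders.
  from datasets import load_dataset
  from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                            TrainingArguments, DataCollatorForLanguageModeling)

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
  tokenizer.pad_token = tokenizer.eos_token

  def tokenize(batch):
      return tokenizer(batch["text"], truncation=True, max_length=512)

  def train_and_eval(train_files, run_name):
      train_ds = load_dataset("text", data_files=train_files)["train"].map(tokenize, batched=True)
      eval_ds = load_dataset("text", data_files="heldout/*.txt")["train"].map(tokenize, batched=True)
      model = GPT2LMHeadModel.from_pretrained("gpt2")  # or GPT2LMHeadModel(GPT2Config()) to train from scratch
      trainer = Trainer(
          model=model,
          args=TrainingArguments(output_dir=run_name, num_train_epochs=1,
                                 per_device_train_batch_size=4, report_to=[]),
          train_dataset=train_ds,
          eval_dataset=eval_ds,
          data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
      )
      trainer.train()
      return trainer.evaluate()["eval_loss"]  # cross-entropy on the held-out set

  print("full corpus:      ", train_and_eval("corpus/*.txt", "run_full"))
  print("minus Shakespeare:", train_and_eval("corpus_no_shakespeare/*.txt", "run_ablation"))

Whether any gap in the two losses is meaningful would of course depend on how Shakespeare-like the held-out text happens to be.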


In the long tail of questions people ask, I would think asking about Shakespeare is relatively common (explain a play, explain a character, identify a quote, ...).


Why not? Shakespeare's prose is exceptional, and all in the form of dialog. It seems an exemplary, even ideal source of language to emulate.


Unless you want to “write a Shakespearean sonnet about Bayes’ rule”. Here’s what Claude Sonnet (ahem) came up with:

"Bayes's Revelation"

  What knowledge held before new facts arrive,
  Our prior thoughts, though incomplete they be,
  Must yield and change as evidence does strive
  To shape posterior probability.

  Through multiplicative steps we find our way,
  As likelihood with prior beliefs combine,
  The normalizing constant holds its sway,
  While truth emerges, clearer line by line.

  O Reverend Bayes, thy theorem divine
  Shows how our minds should learn from what we see,
  Each observation helps our thoughts refine,
  Till posterior wisdom sets uncertainty free.

  In statistics' realm, this rule holds fast:
  Each present builds upon what's known from past.

  The sonnet weaves together the key components of Bayes' theorem - prior probability, likelihood, and how they combine to form the posterior probability. Would you like me to explain any of these statistical concepts in more detail?                
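For what it's worth, here is the rule the sonnet is versifying as a toy calculation — all numbers made up for illustration:

  # Toy illustration of the components the sonnet names; numbers are invented.
  prior = 0.3                    # P(H): belief before the new evidence
  likelihood = 0.8               # P(E | H)
  likelihood_alt = 0.2           # P(E | not H)
  evidence = likelihood * prior + likelihood_alt * (1 - prior)  # normalizing constant P(E)
  posterior = likelihood * prior / evidence                     # P(H | E)
  print(round(posterior, 3))     # 0.632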

Although it would be an easier change now that Bard is Gemini…



