
We've barely trained these things?

The entirety of Common Crawl is 424 terabytes. That's merely 6 days of 8K raw video.
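The exact multiple depends on what counts as "raw", but the arithmetic is easy to sanity-check. A quick sketch, using a couple of purely illustrative 8K data rates (the frame rates and bit depths below are assumptions, not a claim about any particular camera format):

  # How many days of 8K video fit in 424 TB, for two assumed "raw"
  # data rates (illustrative figures only).
  common_crawl_bytes = 424e12          # 424 TB, decimal units
  seconds_per_day = 86400

  assumed_rates = {                    # bytes per second
      "8-bit 4:2:0 @ 30 fps": 7680 * 4320 * 1.5 * 30,
      "10-bit 4:4:4 @ 24 fps": 7680 * 4320 * 3.75 * 24,
  }
  for name, rate in assumed_rates.items():
      days = common_crawl_bytes / (rate * seconds_per_day)
      print(f"{name}: ~{days:.1f} days")

Either way you land on days, not years, of footage.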



This is for LLMs, which deal mainly with text. An entire book can be stored in about 0.42 MB according to https://www.quora.com/How-many-megabytes-are-in-a-book.

424 terabytes of text is over a billion books' worth of data. The Common Crawl website even says "Over 250 billion pages spanning 17 years." That's an impressive amount of information.
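A quick back-of-the-envelope check, using the ~0.42 MB-per-book figure from the parent (decimal units throughout):

  # 424 TB of text expressed in "books", assuming ~0.42 MB per book.
  common_crawl_bytes = 424e12
  book_bytes = 0.42e6
  print(f"{common_crawl_bytes / book_bytes:.2e} books")  # ~1.01e+09, i.e. about a billion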


LLMs can deal with more than text. What's impressive today will be nothing tomorrow.


The technology that allows an LLM to "see" images and video is completely different, though. It's not what is being trained on Common Crawl.


Not really. Embeddings are embeddings. Check out LLaVA.
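For anyone curious what that looks like in practice, here is a minimal sketch of the LLaVA-style idea (not LLaVA's actual code; the dimensions and names below are illustrative assumptions): features from a vision encoder get projected into the LLM's token-embedding space and concatenated with the text embeddings, so the language model consumes them like any other tokens.

  # Minimal sketch of a LLaVA-style vision adapter (illustrative only).
  import torch
  import torch.nn as nn

  vision_dim, llm_dim = 1024, 4096          # assumed CLIP feature / LLM hidden sizes
  num_patches, num_text_tokens = 256, 32    # arbitrary example lengths

  # Stand-ins for vision-encoder output and tokenized text embeddings.
  image_features = torch.randn(1, num_patches, vision_dim)
  text_embeddings = torch.randn(1, num_text_tokens, llm_dim)

  # A learned projection maps vision features into the LLM embedding space.
  projector = nn.Linear(vision_dim, llm_dim)
  image_tokens = projector(image_features)                         # (1, 256, 4096)

  # Concatenate and feed to the LLM exactly like ordinary text embeddings.
  inputs_embeds = torch.cat([image_tokens, text_embeddings], dim=1)
  print(inputs_embeds.shape)                                       # torch.Size([1, 288, 4096])

That projector is the main new piece; the backbone is still a standard text LLM.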


Comparing Common Crawl to video makes no sense. Common Crawl is text extracted from webpages. 424 terabytes of pure text contains orders of magnitude more text than I will read in my entire life.
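To put rough numbers on it (assuming, say, 50 books a year for 70 years at ~0.42 MB of text per book; these are made-up but generous figures):

  # Lifetime reading vs. 424 TB, under loose assumptions.
  books_per_year, reading_years = 50, 70
  book_bytes = 0.42e6
  lifetime_bytes = books_per_year * reading_years * book_bytes     # ~1.5 GB
  print(f"Common Crawl is ~{424e12 / lifetime_bytes:,.0f}x a lifetime of reading")

That comes out to a factor of a few hundred thousand.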


I think this is a good thing to keep in mind. If you compare that to how much information a young human gets as input, for example, it really puts things into perspective.



