
Did I miss any discussion of what the "processing" is?

Using the Stanford part-of-speech tagger, my goofy project, Ashurbanipal, can tag all the words in one book in about 8 seconds on a single core, or all ~25,000 books from the Project Gutenberg 2010 DVD image in about 8 hours on my 4-core (hyperthreaded) laptop with a 10 GB JVM heap.
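
For anyone curious what that looks like in code, here's a minimal sketch of tagging one sentence with the tagger's MaxentTagger class. The model path shown is the stock English left3words model and varies by distribution; Ashurbanipal's actual pipeline may differ:

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class TagOneSentence {
        public static void main(String[] args) {
            // Stock English model path -- an assumption; adjust to match
            // wherever your tagger distribution keeps its model files.
            MaxentTagger tagger = new MaxentTagger(
                "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");

            // tagString appends _TAG to each token,
            // e.g. "The_DT quick_JJ brown_JJ fox_NN ..."
            System.out.println(tagger.tagString(
                "The quick brown fox jumps over the lazy dog."));
        }
    }

The bulk run is then mostly a matter of loading the model once and fanning books out across a thread pool; a loaded tagger is generally treated as read-only, though one instance per thread is the cautious option.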

Nope, there was almost no mention of what this was actually used for. The closest I found was a mention of the final output:

"single output files, tab-delimited with data available for each year, merging in publication metadata and other information about each book"

[edit] More info in a link at the bottom of the article: http://blog.gdeltproject.org/3-5-million-books-1800-2015-gde...
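
Going by that description, each output row presumably keys on year plus per-book metadata. Here's a minimal sketch of filtering such a file for a single year, with the file name and the year-in-first-column layout as purely hypothetical stand-ins, since the article doesn't document the actual schema:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class ReadBooksTsv {
        public static void main(String[] args) throws IOException {
            // "books.tsv" and year-in-column-0 are illustrative assumptions.
            try (Stream<String> lines = Files.lines(Paths.get("books.tsv"))) {
                lines.map(line -> line.split("\t", -1)) // -1 keeps empty trailing fields
                     .filter(cols -> cols[0].equals("1905")) // pick out one year
                     .forEach(cols -> System.out.println(String.join(" | ", cols)));
            }
        }
    }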



