To put some perspective on this:
ryan@3G08:~/Desktop/bleh$ pdftotext David-Foster-Wallace-Infinite-Jest-v2.0.pdf
ryan@3G08:~/Desktop/bleh$ python dfw.py
size of vocabulary: 30725
The man passed Shakespeare by 1,896 words with that one book (30,725 − 1,896 = 28,829 stems for Shakespeare).
code:
import string

import nltk
from nltk.stem import PorterStemmer

# word_tokenize needs the punkt models: nltk.download('punkt')

# read the pdftotext output, strip punctuation, lowercase everything
with open("/home/ryan/Desktop/bleh/David-Foster-Wallace-Infinite-Jest-v2.0.txt") as f:
    raw = f.read()
exclude = set(string.punctuation)
raw = ''.join(ch for ch in raw if ch not in exclude).lower()

# tokenize, then reduce each token to its Porter stem so that
# inflected forms of the same word count as one vocabulary item
tokens = nltk.word_tokenize(raw)
stemmer = PorterStemmer()
stemmed_tokens = {stemmer.stem(token) for token in tokens}

print("size of vocabulary:", len(stemmed_tokens))