
The problem you cite only exists if you explicitly want to estimate the underlying vocabulary of the writer. However, as a description of this particular corpus the vocabulary sizes are perfectly valid and exact rather than an estimate.


If we are really only interested in the number of unique words in the first 35,000 words of lyrics each of these artists has produced, and not in what that number says about the artists themselves or how it generalizes to the rest of their body of work, then yes, the analysis is exact. I don't think that's really the goal, though. We are interested in drawing inferences about the artists and their work. As I say above, the rankings are correct modulo noise, but the magnitudes of the differences could be pretty far off. And there is noise, unless we believe two things: that it would not matter if these numbers differed across different 35,000-word samples from the same artist, and that the count for the first 35,000 words cannot be affected by causes unrelated to the ones we are trying to measure.
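To make the noise point concrete, here is a minimal sketch of how one might check it: count unique words in several different fixed-size windows of the same artist's lyrics and look at the spread. Everything here is an assumption for illustration. The function names are hypothetical, and the toy corpus is synthetic (random draws with 1/rank weights), not real lyrics, so the numbers it prints say nothing about any actual artist.

```python
import random


def unique_counts_over_windows(tokens, window, step):
    """Count distinct tokens in each contiguous window of `window` tokens."""
    counts = []
    for start in range(0, len(tokens) - window + 1, step):
        counts.append(len(set(tokens[start:start + window])))
    return counts


def make_toy_corpus(n_tokens, vocab_size, seed):
    """Synthetic stand-in for one artist's lyrics (NOT real data).

    Words are drawn with weight 1/rank, a rough Zipf-like shape.
    """
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(vocab_size)]
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    return rng.choices(vocab, weights=weights, k=n_tokens)


# Hypothetical sizes, scaled down from 35,000 to keep the example fast.
tokens = make_toy_corpus(n_tokens=20000, vocab_size=5000, seed=0)
counts = unique_counts_over_windows(tokens, window=5000, step=2500)

# The first entry plays the role of "the first N words"; the others are
# equally valid samples from the same source.
print("unique words per window:", counts)
print("spread (max - min):", max(counts) - min(counts))
```

If the spread across windows from one artist is comparable to the gap between two artists, the gap is hard to interpret. With real lyrics the windows would also differ for non-random reasons (era, collaborators, album themes), which this toy version does not model.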



