
The estimate of vocabulary size here is based on the number of unique words used. This seems strongly biased: if two artists have the same size vocabulary, but one has released more albums and thus used more words, that artist will probably have used more unique words. To underscore this point, the number of unique words used by Aesop Rock is half of the estimated vocabulary size of the average college student, although to be fair that estimate is the number of words an individual can recognize, not the number of words they use. (Edit: the bias is somewhat mitigated by the fact that the same number of words is used to estimate the vocabulary for each artist, but the bias depends not only on sample size but also on the size of the artist's underlying vocabulary; see my comments below.)

The underlying problem is one of estimating the cardinality of a multinomial distribution given a fixed number of samples. In isolation this problem is ill-posed, since it is always possible that there is a word in a given lyricist's vocabulary that he uses with very low frequency and that is unlikely to appear in any sample, but with appropriate prior information it may be possible to obtain an accurate estimate.
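To make the bias concrete, here is a quick simulation sketch (the Zipf-like 1/rank frequencies and the 10,000-word vocabulary are assumptions of mine, not anything from the article): the same underlying vocabulary yields more unique words when more words are sampled.

```python
import random

random.seed(0)

def unique_words(vocab_size, sample_size):
    # Hypothetical lyricist: the rank-k word is used with weight 1/k (Zipf-like).
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    draws = random.choices(range(vocab_size), weights=weights, k=sample_size)
    return len(set(draws))

# Same underlying vocabulary, different amounts of released material:
few = unique_words(10_000, 35_000)
many = unique_words(10_000, 70_000)
print(few, many)  # the larger sample surfaces more unique words
```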

This is not my field, but a brief Google Scholar search shows that there are several papers on estimating vocabulary size, or equivalently, estimating the number of species based on sampling. There is a somewhat dated review (http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf) that details some methods of estimation (in this case, I believe we are in the domain of "infinite population, multinomial sample" with unequal class sizes). The paper notes that there is no unbiased estimator available without assumptions on the distribution of word use frequencies, but some of the proposed estimators may be more accurate than the naive estimate used here.
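For a flavor of that literature, here is a sketch of one widely used species-richness estimator, the Chao1 lower bound, applied to a toy token sample (my illustration; not the method used in the article, and not necessarily the one the review would recommend for this setting):

```python
from collections import Counter

def chao1(tokens):
    """Chao1 lower-bound estimate of total vocabulary size from a sample.

    S_est = S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of
    words observed exactly once and exactly twice.
    """
    counts = Counter(tokens)
    s_obs = len(counts)
    freq_of_freq = Counter(counts.values())
    f1, f2 = freq_of_freq.get(1, 0), freq_of_freq.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2  # bias-corrected form for f2 == 0

tokens = "the cat sat on the mat and the dog sat too".split()
print(chao1(tokens))  # 8 observed types, estimated 26.0 in total
```

The singletons (f1) carry the information: seeing many words only once suggests many more words were never seen at all.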



This is why it's "number of unique words used within the artist's first 35,000 lyrics." Sample size is held constant. (Except, maybe, those who haven't yet written 35k words?)


Sample size is already controlled for. See the second paragraph, immediately before the graphic: "I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake."


Are you looking at a different link than us? The intro reads "I decided to compare this data point against the most famous artists in hip hop. I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake." and the title of the infographic (if that's all you looked at) reads "# of Unique words used within artist’s first 35,000 lyrics"

This seems to address your concern completely?


Yes, I missed that, even though it is very clearly spelled out (oops!). It makes the ordinal comparison valid (modulo noise), but it does not completely address the concern. If you have two artists, and artist 1 uses 5,000 unique words in 35,000 lyrics while artist 2 uses 10,000 unique words in 35,000 lyrics, artist 2's vocabulary may be substantially more than twice as large as artist 1's. It is unlikely that a lyricist exhausts their entire vocabulary in such a small sample, particularly if their vocabulary is large and contains many words that they use infrequently. http://www.jstor.org.libproxy.mit.edu/stable/2284147 has a correction that can be applied, although even there the author notes that, when applied to James Joyce's Portrait of an Artist, their technique appears to greatly underestimate Joyce's total vocabulary.
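A quick simulation shows how much the ratio can compress. Everything here is assumed (Zipf-like 1/rank frequencies, made-up vocabulary sizes of 5,000 and 20,000): a true 4x difference in vocabulary shows up as a much smaller gap in unique words at 35,000 tokens.

```python
import random

random.seed(1)

def observed_unique(vocab_size, sample_size=35_000):
    # Rank-k word drawn with weight 1/k (Zipf-like frequencies).
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    draws = random.choices(range(vocab_size), weights=weights, k=sample_size)
    return len(set(draws))

small = observed_unique(5_000)   # artist 1: true vocabulary 5,000
large = observed_unique(20_000)  # artist 2: true vocabulary 20,000
print(small, large, large / small)  # observed ratio falls well short of 4x
```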


This is a very good point - Aesop Rock, for example, uses one unique word in every 5 words (7k unique in 35k), and if that rate held, we might find the same ratio at 70k or 120k words. After all, you still have to use filler words like "to", "a", "the", "have", etc. - he could be saturating the spots where he can put unique words.

So this could substantially underrepresent vocabularies. There are only so many unique words you can put in a sentence. As an extreme case, if we looked at the first hundred words of every rapper, we would not find a hundred unique words in any of them (due to repeats of grammatically common words), even though, clearly, every rapper has a vocabulary of well over a hundred words.

I wonder if this is a fatal flaw? How can we estimate where the distortion stops? (For example, if someone uses 1000 words in their first 35,000, intuitively this seems to imply to me that's most of their stock. But if someone uses 5,000 in 35,000 - that is not so clear at all.)
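One way to probe where the distortion stops is the type-token curve: plot unique words against words sampled and see whether it has flattened by 35,000. A simulated sketch (the 10,000-word Zipf-like vocabulary is an assumption of mine):

```python
import random

random.seed(2)

def type_token_curve(vocab_size, max_tokens, step):
    # Zipf-like frequencies: rank-k word drawn with weight 1/k.
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    draws = random.choices(range(vocab_size), weights=weights, k=max_tokens)
    seen, curve = set(), []
    for i, word in enumerate(draws, 1):
        seen.add(word)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Growth slows but does not stop: each 7,000-word chunk still adds new words.
for n, uniq in type_token_curve(vocab_size=10_000, max_tokens=35_000, step=7_000):
    print(n, uniq)
```

If the curve is still climbing at the 35,000-word cutoff, the unique-word count is still far from the true vocabulary.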


The paper I linked to in my previous comment uses Zipf's law (briefly, the frequency of word use is inversely proportional to its rank; more at http://en.wikipedia.org/wiki/Zipf%27s_law) to estimate the "distortion." This should produce a better estimate than the naive method, but there are still problems: the plot on the Wikipedia page shows that Zipf's law is not a particularly good fit to word frequency for Wikipedia past the ~10,000th word, and it's not clear that rap music represents a typical natural language corpus. It is probably still possible to devise a correction if one knows how word use frequencies are distributed.
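As a toy version of that idea (pure 1/rank frequencies, draws treated as independent - both assumptions the paper relaxes, so this is not its actual estimator): compute the expected number of unique words for a candidate vocabulary size under Zipf's law, then invert by bisection.

```python
import math

def expected_unique(vocab_size, sample_size):
    # E[unique] under Zipf: word k has probability (1/k)/H, and the chance
    # it appears at least once is approximately 1 - exp(-n * p_k).
    h = sum(1.0 / k for k in range(1, vocab_size + 1))
    return sum(1.0 - math.exp(-sample_size / (k * h))
               for k in range(1, vocab_size + 1))

def estimate_vocab(observed, sample_size, lo=1, hi=200_000):
    # expected_unique grows with vocab_size, so bisect for the vocabulary
    # size whose expected unique-word count matches the observation.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_unique(mid, sample_size) < observed:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical numbers in the ballpark of the article's: 7,000 uniques in 35,000.
print(estimate_vocab(7_000, 35_000))
```

Even this crude correction suggests a true vocabulary well above the raw unique-word count, though, as noted, Zipf's law fits natural-language tails poorly, so the absolute number should not be trusted.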

A second related problem that that paper touches on toward the end is that sequential words from the same text are not independent samples from an author's vocabulary. Two artists may have the same vocabulary, but if one artist uses more non-sequiturs, fewer articles, fewer repeated phrases, or generally tries to use more unique words within a given song, then that artist will come out ahead in the measure used here. I'm not sure how much of a problem this really is for comparing lyrics between artists (depending on what is of interest, it may actually be desirable), but it may explain the poor showings for Shakespeare and Melville, since prose is likely to repeat words more frequently than rap lyrics for reasons unrelated to the authors' vocabularies. (FWIW, even conservative estimates put Shakespeare's vocabulary at >15,000 words, which would be hard to measure in a sample of 35,000 words.)


OP here. vocab is non-linear:

mdaniels.com/vocab/scatter.png

so the average would change as your sample size grows.

35K was the threshold where things didn't get garbled (like the 100 word example that you mention).

The threshold is also impacted by who I include. If I went to 50K, I'd lose out on rappers like Drake.


Repetition also seems like a huge issue given a 35,000-lyric limit. If someone repeats the same line 5-7 times, it's hardly reasonable to count that when estimating vocabulary.

Edit: A quick check found 7 repetitions of the same 8-word phrase in a DMX song, which I chose because he was at the bottom of the list. http://www.azlyrics.com/lyrics/dmx/whatthesebitcheswant.html
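Checking for this mechanically is straightforward. A sketch that counts repeated n-word phrases (the lyric below is invented for illustration, not the actual DMX line):

```python
from collections import Counter

def repeated_phrases(words, n, min_repeats=2):
    # Count every n-word window and keep those occurring more than once.
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_repeats}

# A hook-like 7-word phrase repeated 7 times:
lyric = ("what these words really want from me " * 7).split()
hits = repeated_phrases(lyric, n=7)
print(max(hits.values()))  # 7 - the full hook shows up seven times
```

One could deduplicate repeated lines like this before counting tokens, which would shift where the 35,000-word cutoff falls for hook-heavy artists.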


This is an issue that I was hoping would cancel out, given that I use the same analysis for every rapper. In short, it's an exact unique-word count, but only relationally accurate as a vocabulary measure.


The problem you cite only exists if you explicitly want to estimate the underlying vocabulary of the writer. However, as a description of this particular corpus, the vocabulary sizes are perfectly valid and exact rather than estimates.


If we are really only interested in the number of unique words in the first 35,000 lyrics each of these artists has produced, and not in what they say about the artists themselves or how the numbers generalize to the rest of their body of work, then yes, the analysis is exact and perfect. I don't think that's really the goal, though. We are interested in drawing inferences about the artists and their work. As I say above, the rankings are correct modulo noise (and there is noise: the counts would differ across different 35,000-word samples from the same artist, and even the first 35,000 words can differ for causes unrelated to those we are trying to measure), but the magnitudes of the differences could be pretty far off.
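The size of that noise is easy to gauge by simulation (again with made-up Zipf-like frequencies and a 10,000-word vocabulary): draw several independent 35,000-word samples from the same hypothetical artist and compare the unique-word counts.

```python
import random
import statistics

random.seed(3)

def unique_count(vocab_size=10_000, sample_size=35_000):
    # Rank-k word drawn with weight 1/k (Zipf-like frequencies).
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    return len(set(random.choices(range(vocab_size), weights=weights,
                                  k=sample_size)))

# Several independent 35,000-word samples from one "artist":
runs = [unique_count() for _ in range(5)]
print(min(runs), max(runs), round(statistics.pstdev(runs), 1))
```

In simulations like this the run-to-run spread is small relative to gaps of thousands of words between artists, consistent with the ordinal rankings being fairly robust even though the magnitudes are not.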



