
The estimate of vocabulary size here is based on the number of unique words used. This seems strongly biased: if two artists have the same size vocabulary, but one has released more albums and thus used more words, that artist will probably have used more unique words. To underscore this point, the number of unique words used by Aesop Rock is half of the estimated vocabulary size of the average college student, although to be fair that estimate is the number of words an individual can recognize, not the number of words they use. (Edit: the bias is somewhat mitigated by the fact that the same number of words is used to estimate the vocabulary for each artist, but the bias depends not only on sample size but also on the size of the artist's underlying vocabulary; see my comments below.)

The underlying problem is one of estimating the cardinality of a multinomial distribution given a fixed number of samples. In isolation this problem is ill-posed, since it is always possible that there is a word in a given lyricist's vocabulary that he uses with very low frequency and that is unlikely to appear in any sample, but with appropriate prior information it may be possible to obtain an accurate estimate.
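To make the bias concrete, here is a quick simulation sketch (the Zipf-like 1/rank frequencies and the 10,000-word vocabulary are assumptions of mine, not anything from the article): the same underlying vocabulary yields more unique words when more words are sampled.

```python
import random

random.seed(0)

def unique_words(vocab_size, sample_size):
    # Hypothetical lyricist: the rank-k word is used with weight 1/k (Zipf-like).
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    draws = random.choices(range(vocab_size), weights=weights, k=sample_size)
    return len(set(draws))

# Same underlying vocabulary, different amounts of released material:
few = unique_words(10_000, 35_000)
many = unique_words(10_000, 70_000)
print(few, many)  # the larger sample surfaces more unique words
```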

This is not my field, but a brief Google Scholar search shows that there are several papers on estimating vocabulary size, or equivalently, estimating the number of species based on sampling. There is a somewhat dated review (http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf) that details some methods of estimation (in this case, I believe we are in the domain of "infinite population, multinomial sample" with unequal class sizes). The paper notes that there is no unbiased estimator available without assumptions on the distribution of word use frequencies, but some of the proposed estimators may be more accurate than the naive estimate used here.
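For a flavor of that literature, here is a sketch of one widely used species-richness estimator, the Chao1 lower bound, applied to a toy token sample (my illustration; not the method used in the article, and not necessarily the one the review would recommend for this setting):

```python
from collections import Counter

def chao1(tokens):
    """Chao1 lower-bound estimate of total vocabulary size from a sample.

    S_est = S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of
    words observed exactly once and exactly twice.
    """
    counts = Counter(tokens)
    s_obs = len(counts)
    freq_of_freq = Counter(counts.values())
    f1, f2 = freq_of_freq.get(1, 0), freq_of_freq.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2  # bias-corrected form for f2 == 0

tokens = "the cat sat on the mat and the dog sat too".split()
print(chao1(tokens))  # 8 observed types, estimated 26.0 in total
```

The singletons (f1) carry the information: seeing many words only once suggests many more words were never seen at all.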



This is why it's "number of unique words used within the artist's first 35,000 lyrics." Sample size is held constant. (Except, maybe, those who haven't yet written 35k words?)


Sample size is already controlled for. See the second paragraph, immediately before the graphic: "I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake."


Are you looking at a different link than us? The intro reads "I decided to compare this data point against the most famous artists in hip hop. I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake." and the title of the infographic (if that's all you looked at) reads "# of Unique words used within artist’s first 35,000 lyrics"

This seems to address your concern completely?


Yes, I missed that, even though it is very clearly spelled out (oops!). It makes the ordinal comparison valid (modulo noise), but it does not completely address the concern. If you have two artists, and artist 1 uses 5,000 unique words in 35,000 lyrics while artist 2 uses 10,000 unique words in 35,000 lyrics, artist 2's vocabulary may be substantially more than twice as large as artist 1's. It is unlikely that a lyricist exhausts their entire vocabulary in such a small sample, particularly if their vocabulary is large and contains many words that they use infrequently. http://www.jstor.org.libproxy.mit.edu/stable/2284147 has a correction that can be applied, although even there the author notes that, when applied to James Joyce's Portrait of an Artist, their technique appears to greatly underestimate Joyce's total vocabulary.
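A quick simulation shows how much the ratio can compress. Everything here is assumed (Zipf-like 1/rank frequencies, made-up vocabulary sizes of 5,000 and 20,000): a true 4x difference in vocabulary shows up as a much smaller gap in unique words at 35,000 tokens.

```python
import random

random.seed(1)

def observed_unique(vocab_size, sample_size=35_000):
    # Rank-k word drawn with weight 1/k (Zipf-like frequencies).
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    draws = random.choices(range(vocab_size), weights=weights, k=sample_size)
    return len(set(draws))

small = observed_unique(5_000)   # artist 1: true vocabulary 5,000
large = observed_unique(20_000)  # artist 2: true vocabulary 20,000
print(small, large, large / small)  # observed ratio falls well short of 4x
```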


This is a very good point - Aesop Rock, for example, uses one unique word in every 5 words (7k unique in 35k), and if that rate held, we might find the same ratio at 70k or 120k words. After all, you still have to use filler words like "to", "a", "the", "have", etc. - he could be saturating the spots where he can put unique words.

So this could substantially underrepresent vocabularies. There are only so many unique words you can put in a sentence. As an extreme case, if we looked at the first hundred words of every rapper, we would not find a hundred unique words in any of them (due to repeats of grammatically common words), even though, clearly, every rapper has a vocabulary of well over a hundred words.

I wonder if this is a fatal flaw? How can we estimate where the distortion stops? (For example, if someone uses 1000 words in their first 35,000, intuitively this seems to imply to me that's most of their stock. But if someone uses 5,000 in 35,000 - that is not so clear at all.)
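One way to probe where the distortion stops is the type-token curve: plot unique words against words sampled and see whether it has flattened by 35,000. A simulated sketch (the 10,000-word Zipf-like vocabulary is an assumption of mine):

```python
import random

random.seed(2)

def type_token_curve(vocab_size, max_tokens, step):
    # Zipf-like frequencies: rank-k word drawn with weight 1/k.
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    draws = random.choices(range(vocab_size), weights=weights, k=max_tokens)
    seen, curve = set(), []
    for i, word in enumerate(draws, 1):
        seen.add(word)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Growth slows but does not stop: each 7,000-word chunk still adds new words.
for n, uniq in type_token_curve(vocab_size=10_000, max_tokens=35_000, step=7_000):
    print(n, uniq)
```

If the curve is still climbing at the 35,000-word cutoff, the unique-word count is still far from the true vocabulary.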


The paper I linked to in my previous comment uses Zipf's law (briefly, the frequency of word use is inversely proportional to its rank; more at http://en.wikipedia.org/wiki/Zipf%27s_law) to estimate the "distortion." This should produce a better estimate than the naive method, but there are still problems: the plot on the Wikipedia page shows that Zipf's law is not a particularly good fit to word frequency for Wikipedia past the ~10,000th word, and it's not clear that rap music represents a typical natural language corpus. It is probably still possible to devise a correction if one knows how word use frequencies are distributed.
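As a toy version of that idea (pure 1/rank frequencies, draws treated as independent - both assumptions the paper relaxes, so this is not its actual estimator): compute the expected number of unique words for a candidate vocabulary size under Zipf's law, then invert by bisection.

```python
import math

def expected_unique(vocab_size, sample_size):
    # E[unique] under Zipf: word k has probability (1/k)/H, and the chance
    # it appears at least once is approximately 1 - exp(-n * p_k).
    h = sum(1.0 / k for k in range(1, vocab_size + 1))
    return sum(1.0 - math.exp(-sample_size / (k * h))
               for k in range(1, vocab_size + 1))

def estimate_vocab(observed, sample_size, lo=1, hi=200_000):
    # expected_unique grows with vocab_size, so bisect for the vocabulary
    # size whose expected unique-word count matches the observation.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_unique(mid, sample_size) < observed:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical numbers in the ballpark of the article's: 7,000 uniques in 35,000.
print(estimate_vocab(7_000, 35_000))
```

Even this crude correction suggests a true vocabulary well above the raw unique-word count, though, as noted, Zipf's law fits natural-language tails poorly, so the absolute number should not be trusted.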

A second related problem that that paper touches on toward the end is that sequential words from the same text are not independent samples from an author's vocabulary. Two artists may have the same vocabulary, but if one artist uses more non-sequiturs, fewer articles, fewer repeated phrases, or generally tries to use more unique words within a given song, then that artist will come out ahead in the measure used here. I'm not sure how much of a problem this really is for comparing lyrics between artists (depending on what is of interest, it may actually be desirable), but it may explain the poor showings for Shakespeare and Melville, since prose is likely to repeat words more frequently than rap lyrics for reasons unrelated to the authors' vocabularies. (FWIW, even conservative estimates put Shakespeare's vocabulary at >15,000 words, which would be hard to measure in a sample of 35,000 words.)


OP here. vocab is non-linear:

mdaniels.com/vocab/scatter.png

so the average would change as your sample size grows.

35K was the threshold where things didn't get garbled (like the 100 word example that you mention).

The threshold is also impacted by who I include. If I went to 50K, I'd lose out on rappers like Drake.


Repetition also seems like a huge issue given a 35,000-lyric limit. If someone repeats the same line 5-7 times, it's hardly reasonable to count that when estimating vocabulary.

Edit: A quick check found 7 repetitions of the same 8-word phrase in a DMX song, which I chose because he was at the bottom of the list. http://www.azlyrics.com/lyrics/dmx/whatthesebitcheswant.html
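Checking for this mechanically is straightforward. A sketch that counts repeated n-word phrases (the lyric below is invented for illustration, not the actual DMX line):

```python
from collections import Counter

def repeated_phrases(words, n, min_repeats=2):
    # Count every n-word window and keep those occurring more than once.
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_repeats}

# A hook-like 7-word phrase repeated 7 times:
lyric = ("what these words really want from me " * 7).split()
hits = repeated_phrases(lyric, n=7)
print(max(hits.values()))  # 7 - the full hook shows up seven times
```

One could deduplicate repeated lines like this before counting tokens, which would shift where the 35,000-word cutoff falls for hook-heavy artists.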


This is an issue that I was hoping would cancel out, given that I use the same analysis for every rapper. In short, it's an exact unique-word count, but only relationally accurate as a vocabulary measure.


The problem you cite only exists if you explicitly want to estimate the underlying vocabulary of the writer. However, as a description of this particular corpus, the vocabulary sizes are perfectly valid and exact rather than estimates.


If we are really only interested in the number of unique words in the first 35,000 lyrics each of these artists has produced, and not in what they say about the artists themselves or how the numbers generalize to the rest of their body of work, then yes, the analysis is exact and perfect. I don't think that's really the goal, though. We are interested in drawing inferences about the artists and their work. As I say above, the rankings are correct modulo noise (and there is noise: the counts would differ across different 35,000-word samples from the same artist, and even the first 35,000 words can differ for causes unrelated to those we are trying to measure), but the magnitudes of the differences could be pretty far off.
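The size of that noise is easy to gauge by simulation (again with made-up Zipf-like frequencies and a 10,000-word vocabulary): draw several independent 35,000-word samples from the same hypothetical artist and compare the unique-word counts.

```python
import random
import statistics

random.seed(3)

def unique_count(vocab_size=10_000, sample_size=35_000):
    # Rank-k word drawn with weight 1/k (Zipf-like frequencies).
    weights = [1.0 / k for k in range(1, vocab_size + 1)]
    return len(set(random.choices(range(vocab_size), weights=weights,
                                  k=sample_size)))

# Several independent 35,000-word samples from one "artist":
runs = [unique_count() for _ in range(5)]
print(min(runs), max(runs), round(statistics.pstdev(runs), 1))
```

In simulations like this the run-to-run spread is small relative to gaps of thousands of words between artists, consistent with the ordinal rankings being fairly robust even though the magnitudes are not.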



