The paper I linked to in my previous comment uses Zipf's law (briefly, a word's frequency of use is inversely proportional to its frequency rank; more at http://en.wikipedia.org/wiki/Zipf%27s_law) to estimate the "distortion." This should produce a better estimate than the naive method, but there are still problems: the plot on the Wikipedia page shows that Zipf's law is not a particularly good fit to word frequencies in Wikipedia past roughly the 10,000th word, and it's not clear that rap lyrics are a typical natural language corpus. It is probably still possible to devise a correction if one knows how word use frequencies are actually distributed.
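To make the "distortion" concrete, here is a rough sketch of how many distinct words one would expect to see in a sample, assuming each word is drawn independently from a pure Zipf distribution. This is not the paper's method. The function names, the exponent of 1, and the independence assumption are all mine, and the vocabulary and sample sizes are only illustrative.

```python
def zipf_probabilities(vocab_size, exponent=1.0):
    """Probability of each word, assuming frequency is proportional to 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_unique_words(vocab_size, sample_size, exponent=1.0):
    """Expected number of distinct words in sample_size independent draws.

    A word with probability p is missed by every draw with probability
    (1 - p)**sample_size, so it shows up at least once with probability
    1 - (1 - p)**sample_size. Summing over the vocabulary gives the expectation.
    """
    probs = zipf_probabilities(vocab_size, exponent)
    return sum(1.0 - (1.0 - p) ** sample_size for p in probs)


if __name__ == "__main__":
    # Illustrative numbers only: a 15,000-word vocabulary seen through 35,000 words.
    vocab_size, sample_size = 15_000, 35_000
    seen = expected_unique_words(vocab_size, sample_size)
    print(f"expected distinct words: {seen:.0f} of {vocab_size}")
```

Under these assumptions the count of unique words in a fixed-size sample understates the true vocabulary, and by an amount that depends on the shape of the distribution. That is why a correction needs to know how the frequencies are actually distributed.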
A second, related problem, which the paper touches on toward the end, is that sequential words from the same text are not independent samples from an author's vocabulary. Two artists may have the same vocabulary, but if one artist uses more non-sequiturs, fewer articles, fewer repeated phrases, or generally tries to use more unique words within a given song, then that artist will come out ahead in the measure used here. I'm not sure how much of a problem this really is for comparing lyrics between artists (depending on what is of interest, it may actually be desirable), but it may explain the poor showings for Shakespeare and Melville, since prose is likely to repeat words more frequently than rap lyrics for reasons unrelated to the authors' vocabularies. (FWIW, even conservative estimates put Shakespeare's vocabulary at >15,000 words, which would be hard to measure in a sample of 35,000 words.)
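The non-independence point can be illustrated with a toy simulation. Two hypothetical authors draw from the same Zipf-distributed vocabulary, but one of them sometimes repeats an earlier phrase instead of picking a fresh word. Everything here is an assumption for illustration (the `simulate_text` helper, the repeat probability, the phrase length). It is not a model of any real artist or of the paper's analysis.

```python
import random
from itertools import accumulate


def simulate_text(vocab_size, length, repeat_prob, phrase_len, rng):
    """Generate a list of word ids of the given length.

    At each step, with probability repeat_prob, copy a random earlier phrase of
    phrase_len words; otherwise draw one fresh word from a Zipf distribution
    (weight 1 / rank). Words are represented by their rank minus one.
    """
    cum_weights = list(accumulate(1.0 / rank for rank in range(1, vocab_size + 1)))
    text = []
    while len(text) < length:
        if len(text) >= phrase_len and rng.random() < repeat_prob:
            start = rng.randrange(len(text) - phrase_len + 1)
            text.extend(text[start:start + phrase_len])
        else:
            text.extend(rng.choices(range(vocab_size), cum_weights=cum_weights, k=1))
    return text[:length]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Same vocabulary and same sample length; only the repetition habit differs.
    independent = simulate_text(15_000, 35_000, repeat_prob=0.0, phrase_len=8, rng=rng)
    repetitive = simulate_text(15_000, 35_000, repeat_prob=0.3, phrase_len=8, rng=rng)
    print("unique words, no repeats:  ", len(set(independent)))
    print("unique words, with repeats:", len(set(repetitive)))
```

In this toy setup the repeating author scores lower on unique words even though both vocabularies are identical, because each repeated phrase uses up part of the sample without revealing anything new.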