Text Mining South Park (kaylinwalker.com)
202 points by eamonncarey on Feb 10, 2016 | 48 comments



I was in the process of reading this when I thought to check who this person is. Of course, by that time the site had failed, so I haven't read the whole thing yet.

But, it seems to me that the author is falling into a trap many an unwary data "scientist" falls into by not understanding the discipline of Statistics.

When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.

If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.
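To make this concrete, ranking a character's words is a one-liner; a minimal R sketch (assuming a data frame `lines` with Character and Line columns, as in the dataset discussed below):

    # rank Cartman's words by raw count -- no test needed; this IS the population value
    cartman <- subset(lines, Character == "Cartman")$Line
    words   <- unlist(strsplit(tolower(cartman), "[^a-z']+"))
    words   <- words[words != ""]                      # drop empty tokens
    head(sort(table(words), decreasing = TRUE), 10)    # top 10 most-said words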

No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of, because, we have the entire population (in this specific instance, ALL the words spoken by all the characters).

FYI, all budding data "scientists" ...


Why so bitter and angry? As far as I can see, his calculations make sense and lead to interesting results. Instead of philosophical nitpicking, why not help him improve his understanding by explaining how you would have calculated/formalized/modeled this thing, so the scare-quote data "scientists" can learn something?

By the way, we definitely don't hear all words that these characters speak in their lives. It's implied in the story that there are conversations that we don't get to see in the actual episodes, but nevertheless these imaginary characters speak a lot more. For example we don't see each and every breakfast, lunch and dinner discussion, we don't hear all their words in the classroom etc.

Now of course the sampling isn't random, because the creators obviously "select" the more interesting bits of the characters' lives, but in statistics we always make assumptions that simplify the procedure but are known to be technically wrong.


But we must protect ourselves from the research parasites! Man the ramparts and ready the harsh words!


You're treating this sample-is-the-population issue as if it's resolved in the statistics literature. It is not. Gelman has written on this [1][2], as the issue comes up frequently in political science data. As Gelman points out, the 50 states are not a sample of states; they are the entire population. Similarly, the Correlates of War [3] data is every militarized international dispute between 1816-2007 that fits certain criteria—it too is not a sample but the entire population.

Treating his population as a large sample of a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.

[1]: http://andrewgelman.com/2009/07/03/how_does_statis/

[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)


> Similarly, the Correlates of War [3] data is every militarized international dispute between 1816-2007 that fits certain criteria—it too is not a sample but the entire population.

It's the entire population of wars meeting certain criteria in that time frame. If that is the topic of interest, then it is also the whole population. OTOH, datasets like that are often used in analyses intended to apply to, for instance, "what-if" scenarios about hypothetical wars that could have happened in that time frame. In that case the studied population is clearly not the population of interest, but is taken to be a representative sample of a broader population (there may be specific reasons to criticize this in specific cases, but "it's the whole population, not a sample" isn't one of them).


Exactly. There is an interpretation in which the "population" is a mathematical ideal process (with potentially infinite information content) and any real, physical manifestation is considered a "sample".

The old-school interpretation is stricter and considers both the "population" and the "sample" to be physical real things. It's understandable because these methods were developed for statistics about human populations (note the origin of the terminology), medical studies etc. (The word "statistics" itself derives from "state").

Somehow, frequentist statisticians are usually very conservative and set in one way of thinking, and do not even like to entertain an alternative interpretation or paradigm... I'm not sure why that is.


As an economist, I am also aware of the logical contortions we have to go through to be able to run regressions on historical data (i.e. pretty much all of economic data). None of this applies here. The data generating process consists of the minds of the writers.

For your reasoning to be applicable here, you have to put together a model of the data generating process from which you can derive a proper model that allows inference. What exactly are the assumptions on P( word_i | character_j ) that make it compatible with these particular tests' assumptions?


Hi, I'm the author. I appreciate the time you've taken to read and provide constructive criticism of my work. Here's my full write up (on GitHub, so it should continue to work): https://github.com/walkerkq/textmining_southpark/blob/master...

I was working under the assumption that we do not know ALL the words since the show's been renewed through 2019. This covers the first 18 seasons.

Additionally, counting up their most frequent words produced results with very little semantic meaning - things like "just" and "dont" - which can be seen in this (really boring) wordcloud: https://github.com/walkerkq/textmining_southpark/blob/master...

Looking into the log likelihood of each word for each speaker produced results that were much more intuitive and carried more meaning, like ppod said below: I think the idea is that what we are really trying to measure is something unobservable like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking.
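For reference, the standard keyness log-likelihood (Dunning/Rayson G²) compares a word's observed counts in one character's speech against the rest of the corpus. A minimal R sketch of that textbook formulation (which may differ in detail from my exact code; the counts are made up):

    # G^2 keyness: a = word count for character, b = word count elsewhere,
    #              c = total words for character, d = total words elsewhere
    loglik <- function(a, b, c, d) {
      e1 <- c * (a + b) / (c + d)   # expected count for the character
      e2 <- d * (a + b) / (c + d)   # expected count for the rest
      # zero counts need care: the convention is 0 * log(0) = 0
      2 * (a * log(a / e1) + b * log(b / e2))
    }
    loglik(a = 950, b = 400, c = 90000, d = 760000)  # made-up counts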


The point I am making is simple: you can calculate whatever you want to calculate, but there is no room for statistical testing because you do not have a probability sample, and hence no sampling variation.

Yes, there will be future episodes, but you are not claiming that you are predicting what these characters will say in those future episodes (in which case your whole setup is rather inappropriate).

Also, I suggest you think very hard about this statement:

> The log likelihood value of 101.7 is significant far beyond even the 0.01% level, so we can reject the null hypothesis that Cartman and the remaining text are one and the same.

Even if the statistical test you employed were appropriate, this is not the conclusion you draw from it.

Also, are you confusing the 1% level (p = 0.01) with the 0.01% level (p = 0.0001), or did you really choose 0.01% as the significance level for your test?


A simple tf-idf would get you similar results without a significance test.

I think that is what parent is implying.
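Roughly, given a hypothetical term-by-character count matrix m (rows are terms, columns are characters), a hand-rolled version looks like:

    tf    <- t(t(m) / colSums(m))           # share of each character's words
    idf   <- log(ncol(m) / rowSums(m > 0))  # down-weight words every character uses
    tfidf <- tf * idf                       # idf recycles down each column
    head(sort(tfidf[, "Cartman"], decreasing = TRUE), 5)  # most characteristic terms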


> If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.

From the text, the author is performing statistical testing (chi-sq) for which words are most unique to a character, not which words they say the most. (although the two metrics are somewhat correlated)


As I said, I could not read the whole thing. As I was skimming, I noticed the tests, tried to load the main page, and I was disconnected.

Once again, "words that are most unique" to a character is a parameter that can easily be computed from the set of ALL words with no sampling uncertainty because, yes, we have the population.


I wouldn't say easily. Keep in mind that to check whether a word is "unique," it needs to be compared against every other character as well.

For example, the Top 5 Unique Words for Randy Marsh per the analysis are:

stan, stanley, lorde, shelly, son

I downloaded the dataset and quickly calculated the Top 5 Most Frequently Said Words for Randy from the entire population. Those are:

what, stan, yeah, ok, huh

All characters on the show say those words (except "stan"). That's why log-likelihood/tf-idf is used on a per-character basis.


It's the likelihood part he is bitching about, not the inverse frequency.


I think the idea is that what we are really trying to measure is something unobservable, like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking. We can say that Stan uses a word at a certain rate, corrected for that word's base rate in the corpus, and compare this with the rate for another character. If that difference in rates is very small, it's true that the difference holds exactly for this corpus, but it may not reflect any substantive difference between the characters.

If this is the view taken, then the population is all of the text that might have been generated by the data-generating process of the scripts -- things like the writers' mental models of the characters. In this view the actual scripts are just a sample from all of the scripts that could have been written while keeping the variable of interest (the characters' character) constant.


Also, I am going to go out on a limb here and guess that R's `read.csv` doesn't do what one hopes it would when fed this CSV:

    10,3,Brian,"You mean like the time you had tea with
    Mohammad, the prophet of the Muslim faith?
    Peter:
    Come on, Mohammad, let's get some tea.
    Mr. T:
    Try my ""Mr. T. ...tea.""
    "
Well, it seems people are not understanding the problem with this line. Here is the screenshot of the original script: http://imgur.com/pcu5N2U

    Brian: 	You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? [flashback #3]
    Peter: 	Come on, Mohammad, let's get some tea. [Mohammad is covered by a black box with the words "IMAGE CENSORED BY FOX" printed several times from top to bottom inside the box. They stop at a tea stand.]
    Mr. T: 	Try my "Mr. T. ...tea." [squints]
There, three characters speak.

However, R's read.csv will assign all three characters' speech to Brian: http://imgur.com/gLpPKdl

    > x[596, ]
        Season Episode Character
    596     10       3     Brian
        Line
    596 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \n

    > x[597, ]
        Season Episode Character
    597     10       3     Brian
        Line
    597 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \nMr. T:\nTry my "Mr. T. ...tea." \n
as well as seemingly duplicating part of the conversation.

PS: In addition, both Muhammad and Mohammad appear, presumably under-counting the references to the prophet.
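A rough way to flag the affected rows from R (a heuristic sketch; the regex is my guess at how the stray speaker tags look inside the quoted Line field):

    # rows where another speaker's "Name:" appears inside the Line column
    bad <- grepl("\n[A-Z][A-Za-z. ]*:\n", x$Line)
    x[bad, c("Season", "Episode", "Character")]   # candidates for manual splitting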


I took a look at the code in the author's GitHub repository.

The data sources are CSVs in this repository: https://github.com/BobAdamsEE/SouthParkData/

Looks like all the data is preprocessed, with each record mostly holding a single line of dialogue. (Actually, it appears the line you note in 10-3 is broken!) You can make an argument that the script isn't processed correctly, but that's beyond the scope of the analysis, although a note might be helpful.


It's my repository. I'll look at how the python script handles flashback events later today. Thanks for the feedback!


It appears that there are two issues that affect small parts of the captured datasets:

1) Colored character names are not handled properly. I looked for <th> tags, not <th bgcolor="beige"> tags (see the sketch after this list).

2) Character names that start with a lower-case character are not handled. This may have to do with other episodes using lower-case-prefixed table headers for stage directions; I have to double-check.
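FWIW, a selector-based scrape (in whatever language) would sidestep issue 1, since a CSS selector matches the tag regardless of its attributes. A minimal R/rvest sketch with a placeholder URL:

    library(rvest)
    # placeholder URL; the real script pages live elsewhere
    page <- read_html("http://example.com/episode-script.html")
    # the CSS selector "th" matches <th> and <th bgcolor="beige"> alike
    speakers <- html_text(html_nodes(page, "th"), trim = TRUE)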


Why not? That's a valid single CSV record with 4 "columns". When surrounded by quotes, it IS legal for a CSV field to span multiple lines.


And, did you notice that the other lines comprise other characters' speech?


Just tested, it handles that fine. (R 3.1.3)


Sure, if you mean attributing Mr. T and Peter's speech to Brian is fine, then, yes, it handles it fine.


This implies there aren't future episodes upon which this type of statistical analysis could be applied.

This also strongly implies you think the author is a 'budding data scientist' out of his/her league.

This is very much a 'sample' given the context that South Park is still releasing new episodes.

FYI all elitist 'statisticians' ...


If one is trying to figure out what characters will say in future episodes based on their speech in previous episodes, then one is in a prediction context, not a significance-testing context.

As far as I can tell, there are a lot of people out of their leagues going around with the title "data scientist".

This is not a sample. This is a census at this point in time. The fact that there will be another population tomorrow does not change the fact that you have the entire population of all words spoken by all characters up to today.

I am not a statistician. I am an economist who knows enough about statistics and econometrics to know when a significance test is applicable.

Also, do note that R's CSV parsing is going to mis-attribute some characters' speech to others. GIGO speaks loudly.


You're the worst kind of intelligent person tbh.

Why be a nitpicking pedant when it is clear this is intended as a throwaway exercise whose only application is predictive...?


You're the one calling people "data scientists", OP didn't even use the word "science" anywhere in the article.


Would the fact that he/she does not have the future text in the sample/population, and uses this dataset as a sample of all the South Park ever to be written (in a prediction mode), make this make sense?


Hm. The show is still running? Then the show can be considered a sample of what the characters (ok, the writers) will say/put in their mouths. The statistics then have predictive value.



By that definition: a complete sample?


No.

> A complete sample is a set of objects from a parent population that includes ALL such objects that satisfy a set of well-defined selection criteria. For example, a complete sample of Australian men taller than 2m would consist of a list of every Australian male taller than 2m. But it wouldn't include German males, or tall Australian females, or people shorter than 2m ...

So, the entire set of all words spoken by South Park characters, by definition, is the population of all words spoken by South Park characters.

For this to be a complete sample, it needs to be a sample out of a larger population. What is that population?


Your argument is sound.

This seems fitting, though: https://giphy.com/gifs/week-media-person-RL0xU1daTlMoE


It's necessary because others refuse to listen and change their position, even when presented with evidence they are wrong.


Here's the accompanying GitHub repo: https://github.com/walkerkq/textmining_southpark


> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying characteristic words using log likelihood versus using tf-idf?


Relevant line in code:

    # remove sparse terms
    all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215

I believe it drops terms that are absent from at least 75% of the documents (so only the ~3,100 widely used words survive); it isn't tf-idf weighting.
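A tiny self-contained demo of the semantics (assuming the tm package):

    library(tm)
    tdm <- TermDocumentMatrix(Corpus(VectorSource(
             c("kenny dies", "kenny lives", "timmy"))))
    # 0.34 cutoff: drop terms absent from at least 34% of documents,
    # so only "kenny" (present in 2 of 3 docs) survives
    inspect(removeSparseTerms(tdm, 0.34))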


I've found an image, which I'm guessing is taken from the site: http://imgur.com/IEudyni. Worth looking at if the site's still down.


I would have loved to see the log-likelihood characterization for the Canadian characters, even if they aren't part of the main cast.


This is amazing. I wonder what results you'd get from The Simpsons.


Not sure the subtitles contain character information, but the people running https://frinkiac.com/ might have the data.


Pretty interesting. This Large Scale Study of Myspace (http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_...) paper shows a similar method for finding characteristic terms, using Mutual Information.


This should be nominated for an Ig Nobel.


I wonder how the results would change if it were based not on words but on lines (not string lines, but actors' lines in conversation).

It's also funny how Stan talks more than Kyle, given that the show now has a recurring joke making fun of Kyle's long educational dialogues.


Maybe because of Kyle's decision to not give long speeches last season (:


It would definitely change. For instance, I'd expect Kyle's words-per-sentence (or at least his 90th-percentile sentence length) to be higher than Stan's, due to his speeches.
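Easy to check from the dataset; a quick sketch (again assuming a `lines` data frame with Character and Line columns):

    # words per scripted line as a crude proxy for sentence length
    wpl <- function(ch) {
      l <- subset(lines, Character == ch)$Line
      quantile(lengths(strsplit(l, "[[:space:]]+")), c(0.5, 0.9))  # median, 90th pctile
    }
    rbind(Stan = wpl("Stan"), Kyle = wpl("Kyle"))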


  Error establishing a database connection
Does anyone have a cached version, please?




