"I suggest that ‘Big Data’ analyses are no more prone to this kind of problem than any other kind of analysis."
The notion that 'big data' is just as susceptible to bad statistical analysis if you ignore dynamics in the incoming data is entirely true. In that sense 'big data' is the same as every other kind of analysis. But there is another notable difference between 'big data' analysis and any other kind: the marketing. 'Big data' solutions are often sold as something you can just plug in to your infrastructure, set up some data feeds, and out pops an insightful trend analysis telling you things about your customers that you could never have understood with Excel alone. That is what's wrong with 'big data'.
Used properly, by an intelligent statistician ("data scientist") who knows their field well, big data tools are very useful for doing the things that statisticians do, only faster. But that is all. Big data tools don't magically do clever statistical analysis on their own, and they are not a replacement for statisticians, despite what some people seem to think.
It just makes practical analyses that would previously have been too costly. I suspect most of the backlash is from MBA students who could make a spreadsheet draw purty pictures but crash and burn when required to do any real work.
By that argument, Lyons' tea shops should never have built LEO - after all, they had a manual system that could track the cost of a bun down to fractions of a farthing (1/4 of a penny).
I feel Dr. Goodson is fighting a straw man here. Nowhere does the original article make the statement that Google used an "unbelievably complex model that no one could ever understand", or that we had no understanding of why GFT worked (merely that the assumptions were not made explicit).
What it did say though was that "Big Data practitioners" (whatever that means) too often tend to make the mistake of ignoring threats to external validity [1] because dealing with big datasets gives researchers a false sense of confidence in the assumption that "N = all". This is a very valid point and IMHO the most important takeaway of the original article, but it is not addressed here.
On a side note, I feel the pothole example (where using a mobile app to crowdsource pothole detection in Boston unknowingly led to a bias towards younger and more affluent areas) would have been more relevant towards discussing that thesis than focusing on the details of GFT.
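To make that concrete, here is a toy simulation sketch in Python (the numbers and the "smartphone ownership" trait are invented for illustration, not the actual Boston data): a huge but passively collected sample stays confidently wrong, while a small random sample does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 1 million residents, where an unobserved trait
# (say, smartphone ownership) is correlated with the quantity we estimate.
pop_size = 1_000_000
owns_phone = rng.random(pop_size) < 0.6
outcome = rng.normal(loc=np.where(owns_phone, 8.0, 3.0), scale=2.0)

true_mean = outcome.mean()

# "Big" sample: passively collected, so only phone owners appear.
# N is huge, but the sampling mechanism is biased.
big_biased = outcome[owns_phone][:500_000]

# Small sample: 1,000 residents drawn uniformly at random.
small_random = rng.choice(outcome, size=1_000, replace=False)

print(f"true mean:          {true_mean:.2f}")
print(f"biased, n=500,000:  {big_biased.mean():.2f}")   # stays wrong
print(f"random, n=1,000:    {small_random.mean():.2f}")  # close to the truth
```

No amount of extra n from the biased collection mechanism moves the first estimate toward the truth.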
What Tim Harford (the author of said article) denounces is the recent trend that sees people claiming Big Data is "the end of theory" [2], where traditional theories and hypothesis testing are being replaced by theory-free procedures such as validating metrics against a hold-out set. The problem is that while previously the relative scarcity of data led social science researchers to carefully consider their data sources, the abundance of passively generated data we have now tends to cause them to forget to do their due diligence when assessing the threats to the external validity of their findings. Normally such concerns naturally arise when you're deciding on your data collection methodology or your experiment design, but Big Data analyses do tend to be different in that they focus much more on finding latent relationships in existing raw data. This also increases the risk of unknowingly falling prey to the multiple comparisons problem [3], which is the other important point Harford touches on but which isn't really addressed in this post.
In that respect, yes, Big Data analyses can be more prone to this kind of problem. Even if you dismiss it as a case of bad scientists and not bad science, it remains that the general population is much more prone to overly trust findings that came out of such analyses, and articles such as Harford's are important to remind both practitioners and laypeople to be careful.
Nowhere does the original article make the statement […] that we had no understanding of why GFT worked (merely that the assumptions were not made explicit).
It’s very difficult to read the following line from the FT article and reach that conclusion:
“The problem was that Google did not know – could not begin to know – what linked the search terms with the spread of flu.”
But the authors of the Science paper that Tim Harford refers to did ‘begin to know’ how Google Flu Trends worked [1]. That’s how they developed several reasonable suggestions for what caused the over-prediction problem. In particular, the suggestion that changes in the Google search algorithm caused a bias in the Flu Trends results could easily be tested [2]. Perhaps we would use a large dataset to do that. And that’s ok.
The suggestion that statisticians suddenly forget all of their training when n reaches a certain threshold is a misrepresentation of the facts. There are bad analyses based on large data sets just as there are bad analyses based on small data sets.
We have tools to deal with large datasets and multiple comparisons [3]. We don't need to throw our hands in the air and panic.
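For instance, here is a minimal sketch of one such tool, the Benjamini-Hochberg false-discovery-rate correction as implemented in statsmodels, applied to p-values simulated from pure noise (the specific test and numbers are just for illustration):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# 1,000 t-tests on pure noise: with no correction, roughly 5% come out
# "significant" at alpha = 0.05 by chance alone.
pvals = [stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
         for _ in range(1000)]

naive = sum(p < 0.05 for p in pvals)

# Benjamini-Hochberg controls the false discovery rate across all 1,000 tests.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(f"uncorrected 'discoveries': {naive}")         # around 50 false positives
print(f"after FDR correction:      {reject.sum()}")  # typically 0
```

The machinery exists; the hard part, as discussed below, is knowing which comparisons you implicitly made.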
Thanks for weighing in, but you are completely misrepresenting my argument by reducing "Big Data" to "data that reaches a certain n" and responding to claims I did not make.
As you are probably aware, doing these kinds of analyses requires more than pure statistics; it also requires a solid understanding of good experiment design, and this is precisely what I was arguing is at higher risk of breaking down in these types of analyses.
To clarify my previous post, what I was referring to (and what I believe is what is commonly referred to) when I was talking about Big Data is a specific albeit vaguely defined trend of analysis that tends to focus on:
a) mining data out of large, unstructured existing datasets
b) leveraging data that has been passively generated, i.e. data that is a byproduct of normal activity rather than the result of a conscious experiment design decision
c) maximizing predictive power as opposed to validating a theory
And yes, there are many ways to do bad analyses on small data sets, but that is not the point I (or, I believe, Harford) was making. The point is these types of analyses, because of their nature, tend to require additional care regarding external validity, because:
a) many make the mistake of thinking that a big n means you don't need to worry about sampling biases, i.e. what Harford was referring to as the "N = all" fallacy and what I believe was his main point; you'll agree it tends to be more common in "big n" analyses
b) since data collection is not an issue, the real challenge is in data cleaning, which requires special care because you need to think about potential biases in the way the data was generated (a process you had no control over); this can be trickier than it sounds when exploring large datasets that were passively generated, because every feature is potentially subject to these biases and some are less than obvious. The Boston case was a good example, but now consider that in many real-world datasets almost all features are subject to similar considerations (and may all be subject to different biases)
c) the focus on predictive power when using theory-free metrics leads to a risk of overfitting when the possible sources of heterogeneity are not understood (i.e. the assumptions are not made explicit)
d) since they've been optimized for predictive power, they give a false sense of security ("it worked on the validation set!"); this is compounded by the fact that these models will often work for a while (as was the case with GFT) before breaking down [1] (see the toy sketch after this list)
e) since there is a stronger focus on exploratory analysis, addressing the multiple comparisons problems is not as trivial as you make it sound; the issue is not the statistical tools we have at our disposal [2] but making all your assumptions explicit, which is trickier in the exploratory phase (because by doing this initial phase, you are already implicitly dismissing or selecting relationships to study)
f) since these analyses tend to be very application-oriented and to function at a large scale, mistakes have the potential to be much more destructive (for example, false positives in the Target case). This is compounded by the fact that due to the technical challenges in handling complicated data, and because applications are often found in tech companies, many people who do these analyses come from a computer science background and are not necessarily well trained in statistics or econometrics
Again, none of these are insurmountable; no one is actually dismissing Big Data analyses as a whole, but they present some unique opportunities for screwing up.
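As a toy sketch of point (d), entirely synthetic and not meant to represent the actual GFT model: a model can validate beautifully on a hold-out set drawn from the same regime and still break quietly once the data-generating process drifts.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n, drift=0.0):
    # Synthetic "search volume" x and "flu incidence" y; `drift` shifts the
    # relationship, standing in for e.g. a change in the search algorithm.
    x = rng.normal(size=n)
    y = 2.0 * x + drift * 3.0 + rng.normal(scale=0.5, size=n)
    return x, y

# Fit and validate on data generated under the same regime.
x_train, y_train = make_data(5000)
x_valid, y_valid = make_data(1000)
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def rmse(x, y):
    return float(np.sqrt(np.mean((y - (slope * x + intercept)) ** 2)))

print(f"validation RMSE (same regime): {rmse(x_valid, y_valid):.2f}")  # looks great

# Later the data-generating process drifts; the hold-out score said nothing
# about this, and the model quietly mis-predicts.
x_new, y_new = make_data(1000, drift=1.0)
print(f"RMSE after drift:              {rmse(x_new, y_new):.2f}")
```

The hold-out score is only a statement about data that looks like the data you already had.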
[1] Incidentally, this is precisely why I said earlier that discussing the details of GFT seemed only tangential to the point: yes, the Google researchers were well aware of the limitations of the method, and so is Harford ("Google Flu Trends will bounce back, recalibrated with fresh data – and rightly so"). The relevant point is not whether the Google researchers were right, but the false sense of certainty it instills in consumers of the research, something I also made explicit at the end of my last post.
Edit: I forgot to address the first part of your reply. While of course the researchers knew that people looked for these terms because they are concerned by their health, what is missing is why these specific queries are important: you may well find that some of these queries are more related to general concern, while some are specifically about treatment options, and others about vaccination. These may not evolve at the same time and in the same direction, and while they may have been indistinguishable in the past, it's entirely possible the first type of queries will be disproportionately affected by changes in the Google algorithm vs. others, or that some of the assumptions are only valid for one type of query and not the others. In economics (which is Harford's background), this is often considered insufficient when deciding whether to add a variable to a model.
Thanks for this respectful, detailed, and analytical contribution.
I agree that "Big Data" is a tendency worth talking about as if it were a new thing. There are edge cases that reside on the border between conventional moderate-n statistical analysis and large-n "vacuum up lots of data and try to extract correlative information" approaches. Examples like the pothole-location collection and GFT are emblematic of something new that is well beyond this fuzzy boundary.
So it's not surprising that there are new issues. Some of the fixes may be old-fashioned, but some may not.
We should also face the fact that the people doing this work often don't have any formal statistical training, so it's on the community to highlight the important pitfalls.
My interpretation of the original article was the FT saying, "Let's not get too ahead of ourselves on promising the world." Today's Big Data promises were the AI promises of the past. The hype may be overdone (the Oakland A's couldn't win the World Series on Moneyball ideas, and eventually those ideas stopped working so well), but that doesn't change the reality that many businesses run better when driven by data.
"I suggest that ‘Big Data’ analyses are no more prone to this kind of problem than any other kind of analysis."
To an extent, large data volumes make it more difficult for the statistician to be as nimble. Trying different algorithms, different specifications, different ways to approach the data is part of the statistical workflow; not everything can be easily parallelized and run on a Hadoop cluster.
There are insights a statistician can quickly obtain (few hours) from a carefully selected random sample of a few million observations, in memory, in a single R or Python process. The same analysis for the complete, multi-terabyte data would be rather more painful or costly to obtain.
Of course data scientists such as Martin Goodson know that (though their bosses do not always) and are used to doing exploratory analysis or prototyping on samples that fit in RAM.
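As a rough illustration of that workflow (the file name and chunk size are placeholders): stream a file too large for memory, keep a random fraction of each chunk, and do the actual exploration on the in-memory sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
sample_frac = 0.01          # keep roughly 1% of rows
chunks = []

# Stream the large file in chunks, keeping a random subset of each chunk,
# so only the sample ever has to fit in memory.
for chunk in pd.read_csv("events.csv.gz", chunksize=1_000_000):  # placeholder path
    keep = rng.random(len(chunk)) < sample_frac
    chunks.append(chunk[keep])

sample = pd.concat(chunks, ignore_index=True)

# From here on it is ordinary, fast, in-memory exploratory work.
print(sample.describe())
```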
The "Big Data Backlash" isn't anything new: Nassim Taleb's discussion of it in his most recent book, in which he presents a mathematical argument that the noise-to-signal ratio increases exponentially with the amount of data, is scathing.
"Big data" means anyone can find fake statistical relationships, since the spurious rises to the surface.
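A quick simulation sketch of that point: correlate one stream of pure noise against an ever larger pool of equally random candidate "predictors", and the best-looking, entirely spurious correlation keeps climbing.

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs = 100
target = rng.normal(size=n_obs)   # pure noise

for n_vars in (10, 100, 1000, 10000):
    candidates = rng.normal(size=(n_vars, n_obs))  # more pure noise
    corrs = np.abs([np.corrcoef(target, c)[0, 1] for c in candidates])
    # With enough candidates, something always looks impressively correlated.
    print(f"{n_vars:>6} candidate variables -> best |r| = {corrs.max():.2f}")
```

None of those variables has anything to do with the target; the "relationship" is manufactured purely by searching over enough candidates.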