Wednesday, May 21, 2014

Big data, the problem(s)

In a recent press essay, we learn (gasp!) that there are problems with big data. And not just one, but several. Who knew?!

Well, here's the way Gary Marcus and Ernest Davis see it [annotated by me in the brackets]:
[First]... although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful [that is, which have utility for your situation or application].

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement.

Third, many tools that are based on big data can be easily gamed.

Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound.

A sixth worry is the risk of too many [bogus] correlations.

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions.

FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common.
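
[To make the sixth worry concrete, here is a small Python sketch of my own; it is not from the essay. It fills a table with purely random, unrelated columns and counts how many pairs look "correlated" anyway. The function names, the 50 columns, the 30 observations, and the 0.36 cutoff (roughly the usual 5% significance threshold for 30 observations) are all my assumptions, chosen only for illustration.]

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_vars, n_obs, threshold, seed=0):
    """Count pairs of independent random columns with |r| above threshold.

    Hypothetical helper for illustration: every column is independent
    Gaussian noise, so any "hit" is a correlation produced by chance.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > threshold
    )
    return hits, len(pairs)


# Assumed illustrative sizes: 50 unrelated variables, 30 observations each.
hits, pairs = count_chance_correlations(n_vars=50, n_obs=30, threshold=0.36)
print(f"{hits} of {pairs} unrelated pairs exceed |r| > 0.36")
```

[The point of the sketch: the more variables you compare, the more pairs there are to test, and some fraction of them will clear the bar by luck alone. None of those hits means anything.]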