Google Flu Trends has long been the go-to example for anyone asserting the revolutionary potential of big data. Since 2008 the company has claimed it could use counts of flu-related Web searches to forecast flu outbreaks weeks ahead of data from the Centers for Disease Control and Prevention.

Unfortunately, this turned out to be what I call big-data hubris. Colleagues and I recently showed that Google’s tool has drifted further and further from accurately predicting CDC data over time. Among the underlying problems was that Google assumed a constant relationship between flu-related searches and flu prevalence, even as the search technology changed and people began using it in different ways.

That failure is the big-data era’s equivalent of the Chicago Tribune’s “Dewey Defeats Truman” headline in 1948. After public-opinion surveys erroneously predicted Dewey’s victory, the New York Times declared polling “unable to compute statistically the unpredictable and unfathomable nuances of human character.” Yet 64 years later, polling is used widely and successfully. In aggregate it predicted the overall margin of the latest presidential election within tenths of a percentage point, as well as the outcome in all 50 states. Surveys remain the bread and butter of social-science research.

Read the article at MIT Technology Review →