Why Google Flu is a failure: the hubris of big data

It seemed like such a good idea at the time.

People with the flu (the influenza virus, that is) will probably go online to find out how to treat it, or to search for other information about the flu. So Google decided to track such behavior, hoping it might be able to predict flu outbreaks even faster than traditional health authorities such as the Centers for Disease Control and Prevention (CDC).

Instead, as the authors of a new article in Science explain, we got "big data hubris." David Lazer and colleagues write:
"'Big data hubris' is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis."
The folks at Google figured that, with all their massive data, they could outsmart anyone.

The problem is that most people don't know what "the flu" is, and relying on Google searches by people who may be utterly ignorant about the flu does not produce useful information. Or to put it another way, a huge collection of misinformation cannot produce a small gem of true information. Like it or not, a big pile of dreck can only produce more dreck. GIGO, as they say.

Google's scientists first announced Google Flu in a Nature article in 2009. With what now seems to be a textbook definition of hubris, they wrote:
"...we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day."
They obtained this remarkable accuracy entirely from analyzing Google searches. Impressive - if true.

Ironically, just a few months after Google Flu was announced, the world was hit with the 2009 swine flu pandemic, caused by a novel strain of H1N1 influenza. Google Flu missed it.

The failures have continued. As Lazer et al. show in their Science study, Google Flu overestimated flu prevalence in 100 out of 108 weeks starting in August 2011.

One problem is that Google's scientists have never revealed what search terms they actually use to track the flu. A paper they published in 2011 declares that Google Flu does a great job. The official Google blog last October makes it appear that they do an almost perfect job predicting the flu for previous years.

Haven't these guys been paying attention? It's easy to predict the past. Does anyone remember the University of Colorado professors who had a model that correctly predicted every election since 1980? In August 2012, they confidently announced that their model showed Mitt Romney winning in a landslide. Hmm.
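To see why "predicting the past" proves so little, here is a minimal sketch in Python. It is not Google's method or the Colorado model, and every number in it is invented for illustration. It fits a curve that passes exactly through six made-up weeks of flu-like-illness data, then asks that same curve about two later weeks it never saw.

```python
# Illustrative only: all numbers below are invented, not Google or CDC data.

def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical "past": week index -> percent of doctor visits for flu-like illness.
past_weeks = [0, 1, 2, 3, 4, 5]
past_ili = [1.2, 1.5, 2.1, 3.4, 2.8, 2.0]

# Hypothetical "future" weeks that the model never saw.
future_weeks = [6, 7]
future_ili = [1.6, 1.3]

def mean_abs_error(weeks, actual):
    """Average absolute gap between the fitted curve and the actual values."""
    preds = [lagrange_predict(past_weeks, past_ili, w) for w in weeks]
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)

in_sample_error = mean_abs_error(past_weeks, past_ili)
out_of_sample_error = mean_abs_error(future_weeks, future_ili)

print(f"error on the past:   {in_sample_error:.3f}")
print(f"error on the future: {out_of_sample_error:.3f}")
```

The curve reproduces the past with no error at all, because it was built to pass through those points, and then goes badly wrong on the weeks that follow. A perfect record on old data says nothing about how a model handles new data.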

[Figure: Flu cases this year, which are dominated by H1N1.]

A bigger problem with Google Flu, though, is that most people who think they have "the flu" do not. The vast majority of doctors' office visits for flu-like symptoms turn out to be other viruses. CDC tracks these visits under "influenza-like illness" because so many turn out to be something else. To illustrate, the CDC reports that in the most recent week for which data is available, only 8.8% of specimens tested positive for influenza.

When 80-90% of people visiting the doctor for "flu" don't really have it, you can hardly expect their internet searches to be a reliable source of information.
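The arithmetic is simple enough to sketch in a few lines of Python. The 8.8% figure is the CDC number quoted above. The group of 1,000 searchers is hypothetical, and so is the assumption that people who search for "flu" have real influenza at the same rate as the specimens the CDC tested.

```python
# Share of tested specimens positive for influenza (CDC figure quoted in the post).
positive_rate = 0.088

# Hypothetical: 1,000 people with flu-like symptoms who all search for "flu".
# Assumption: searchers have true influenza at the same rate as tested specimens.
searchers = 1000

true_flu = round(searchers * positive_rate)
something_else = searchers - true_flu

print(f"searching because of real influenza: {true_flu}")
print(f"searching because of something else: {something_else}")
```

Under those assumptions, fewer than one in ten of the searches would come from someone who actually has influenza.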

Google Flu is still there, and you can still look at its predictions, even though we know they are wrong. I recommend the CDC website instead, which is based on actual data about the influenza virus collected from actual patients. Big data can be great, but not when it's bad data.

7 comments:

  1. 'The Hubris of Big Data' hits the nail on the head. People have been trying for decades to predict the stock market with piles of data (and that data is likely far more accurate than search terms entered into a Google search box), yet the market still remains an unpredictable place.

    A few months ago a friend asked me what he could do with the firehose of Twitter data (serious question, he's a software engineer). I told him he could collect memes (as a joke, of course).

  2. Why do they get to spy on us without asking? Even if I did get the flu, which I'm not, God willing, it's not the law to follow me without my consent. HIPAA law would not allow non-health-care employees to know my med history. f Google for lies

    1. Why do they get to spy on you without asking? Did you really ask why they can track what you search for on their search engine...from your Google+ account?

  3. Interesting piece.. as always, your output is only as good as your input - hope this can be used positively as a warning not to rush to conclusions when dealing with big data.

  4. google flu is still useful. Even if they were just copying from CDC etc.
    Because their data is well formatted, easily available and international
    and nicely computer-readable and covers many regions.

    Their scaling is sometimes bad, but they are still good at predicting the
    (timing of the) peak of the flu-waves. CDC gives no such predictions at all.
    (I saw, the French give them somehow)

    1. sorry, that was from me, gsgs. See also threads about it at flutrackers.com

      I tried to add this to the post above and deliberately gave a wrong
      spam-bot-number, but it was posted anyway ;-)


      gsgs

