Why Google Flu is a failure: the hubris of big data

It seemed like such a good idea at the time.

People with the flu (the influenza virus, that is) will probably go online to find out how to treat it, or to search for other information about the flu. So Google decided to track such behavior, hoping it might be able to predict flu outbreaks even faster than traditional health authorities such as the Centers for Disease Control (CDC).

Instead, as the authors of a new article in Science explain, we got "big data hubris." David Lazer and colleagues write:
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.
The folks at Google figured that, with all their massive data, they could outsmart anyone.

The problem is that most people don't know what "the flu" is, and relying on Google searches by people who may be utterly ignorant about the flu does not produce useful information. Or to put it another way, a huge collection of misinformation cannot produce a small gem of true information. Like it or not, a big pile of dreck can only produce more dreck. GIGO, as they say.

Google's scientists first announced Google Flu in a Nature article in 2009. With what now seems to be a textbook definition of hubris, they wrote:
"...we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day."
They obtained this remarkable accuracy entirely from analyzing Google searches. Impressive - if true.

Ironically, just a few months after Google announced Google Flu, the world was hit with the 2009 swine flu pandemic, caused by a novel strain of H1N1 influenza. Google Flu missed it.

The failures have continued. As Lazer et al. show in their Science study, Google Flu was wrong for 100 out of 108 weeks since August 2011.

One problem is that Google's scientists have never revealed what search terms they actually use to track the flu. A paper they published in 2011 declares that Google Flu does a great job. The official Google blog last October makes it appear that they do an almost perfect job predicting the flu for previous years.

Haven't these guys been paying attention? It's easy to predict the past. Does anyone remember the University of Colorado professors who had a model that correctly predicted every election since 1980? In August 2012, they confidently announced that their model showed Mitt Romney winning in a landslide. Hmm.

[Figure: Flu cases this year, which are dominated by H1N1.]

A bigger problem with Google Flu, though, is that most people who think they have "the flu" do not. The vast majority of doctors' office visits for flu-like symptoms turn out to be other viruses. CDC tracks these visits under "influenza-like illness" because so many turn out to be something else. To illustrate, the CDC reports that in the most recent week for which data is available, only 8.8% of specimens tested positive for influenza.

When 80-90% of people visiting the doctor for "flu" don't really have it, you can hardly expect their internet searches to be a reliable source of information.

Google Flu is still there, and you can still look at its predictions, even though we know they are wrong. I recommend the CDC website instead, which is based on actual data about the influenza virus collected from actual patients. Big data can be great, but not when it's bad data.

A DNA Sequencing Breakthrough for Pregnant Women

DNA sequencing has made its way to the clinic in a dramatic new way: detecting chromosomal defects very early in pregnancy. We've known for 25 years that traces of fetal DNA can be detected in a pregnant woman's blood. But these traces are very small, and until now, we just didn't have the technology to detect an extra copy of a chromosome, where the DNA itself is otherwise normal.

Last week, in a study published in The New England Journal of Medicine, Diana Bianchi and colleagues showed how DNA sequencing can detect an extra copy of a chromosome with remarkable accuracy. This report heralds a new era in prenatal DNA testing.

First, some background: having three copies of chromosome 21 causes Down syndrome, a genetic disease that causes intellectual disability and growth delays. Down syndrome is also called trisomy 21, where trisomy = 3 copies of a chromosome instead of the normal 2 copies. Much less common is Edwards syndrome, caused by three copies of chromosome 18. Edwards syndrome, or trisomy 18, has much more severe effects, with the vast majority of pregnancies not making it to full term. Having an extra copy of any other chromosome almost always causes an early miscarriage. For many reasons, prospective parents want to know if a fetus carries any of these abnormalities.

The accuracy of the new test is remarkable. Out of 1914 young, healthy pregnant women, there were just 8 pregnancies where the fetus had an extra chromosome, and the test detected all 8. What was most impressive was its low false positive rate: in total, the new DNA-based test had just 9 false positives (for either chromosome 21 or chromosome 18 trisomy).  By contrast, the conventional screening test, which also identified all 8 true cases, produced 80 false positives, nearly 9 times as many as DNA sequencing.
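For readers who want to check the arithmetic, here is a small sketch that uses only the counts quoted above (9 versus 80 false positives). The variable names are mine, and this is a back-of-the-envelope check rather than the study's own analysis.

```python
# Counts quoted in the paragraph above.
dna_false_positives = 9
conventional_false_positives = 80

# How many times more false positives the conventional screen produced.
ratio = conventional_false_positives / dna_false_positives

# Relative reduction in false positives when switching to the DNA test.
relative_reduction = (
    conventional_false_positives - dna_false_positives
) / conventional_false_positives

print(f"Conventional / DNA false positives: {ratio:.1f}x")
print(f"Relative reduction in false positives: {relative_reduction:.1%}")
```

The ratio comes out just under 9, which is where "nearly 9 times as many" comes from.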

Why does this matter? In most cases, women with a positive result on one of these tests will opt for amniocentesis ("amnio"), an invasive procedure where a doctor inserts a long needle directly into the womb and collects a sample of amniotic fluid. Amnio almost always gives a definitive answer about Down syndrome. The conventional method's false positive rate is so high that even after a positive test, over 95% of amnios will be negative, versus 55% with the new DNA sequencing test. Or to put it another way, as Bianchi et al. wrote:
"if all women with positive results had .. decided to undergo an invasive procedure, there would have been a relative reduction of 89% in the number of diagnostic invasive procedures."
89% fewer invasive procedures is a huge reduction, not only in costs but in stress for the parents and risk to the baby (because amnio carries a small risk of miscarriage).

With DNA sequencing getting faster and cheaper every year, it might be surprising that we are only now seeing it used to detect trisomy. The difficulty with detecting an extra copy of a chromosome is that the DNA sequence itself is normal. If you sequence the genome, you won't find any mutations that indicate that the fetus has an extra chromosome copy. This is where the remarkable efficiency of next-generation sequencing comes in.

In a matter of hours, modern sequencing machines can sample millions of small fragments of DNA. We can use computational analysis to determine which fragments come from the fetus, and how many came from each chromosome. If any chromosome has three copies, we'll see a 50% increase in DNA from that chromosome. The power of sequencing lies in large numbers: because we can sequence many fragments from each chromosome, a 50% increase is easy to detect.
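To see why large numbers matter, here is a rough sketch, not the method used in the study. It treats fragment counting as simple binomial sampling and asks how many standard deviations of counting noise a 50% excess amounts to. The function name and the 1.5% baseline share of fragments for the chromosome are assumptions I made up for illustration.

```python
import math


def excess_z_score(n_fragments, baseline_fraction, relative_increase=0.5):
    """Size of the excess, in standard deviations of counting noise.

    Assumes each fragment independently lands on the chromosome with
    probability `baseline_fraction` (a simplified binomial model).
    """
    expected = n_fragments * baseline_fraction
    noise = math.sqrt(n_fragments * baseline_fraction * (1 - baseline_fraction))
    return relative_increase * expected / noise


# Hypothetical share of fragments that come from the chromosome of interest.
CHR_FRACTION = 0.015

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} fragments: z = {excess_z_score(n, CHR_FRACTION):.1f}")
```

Under these assumptions, a thousand fragments leave the excess at about 2 standard deviations, which is hard to tell apart from noise, while a million fragments push it past 60. The signal grows with the square root of the number of fragments, so the same calculation with a smaller `relative_increase` shows how much more sequencing a subtler excess would need.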

The method that Bianchi used to detect trisomy was published in 2011 by Amy Sehnert and colleagues, some of whom are contributors to the new NEJM study. [Side note: they use a software program called Bowtie, developed by my former student Ben Langmead, to do the analysis.] The method is likely to get even better over time, further reducing the false positive rate.

The American College of Obstetricians and Gynecologists has already recommended DNA testing for pregnant women at high risk of fetal aneuploidy (an extra chromosome). To be precise, they recommend that high-risk pregnant women be offered fetal DNA testing as an option, after they get genetic counseling. This new study, which was conducted in a low-risk population, shows that the benefits of prenatal DNA testing should be offered to all women.