No, Google didn't just create AI that could "build your genome"

Most scientists don't have their own PR machine to hype their work. After reading the announcement last week from Google's genomics group, I realized that's probably a good thing.

A Wired article last Friday reported that "Google is giving away AI that can build your genome sequence." Sounds impressive–two high-tech innovations (AI and genomes) in the same title! Unfortunately, the truth is somewhat different. It turns out that Google's new "AI" software is little more than an incremental improvement over existing software, and it might be even less than that.

I'm going to have to get into the (technical) weeds a bit to explain this, but it's the only way to set the record straight. The Wired piece opens with this intriguing challenge:
"Today, a teaspoon of spit and a hundred bucks is all you need to get a snapshot of your DNA. But getting the full picture—all 3 billion base pairs of your genome—requires a much more laborious process."
Interesting, I thought. The writer (Megan Molteni) seems to be talking about genome assembly–the process of taking billions of tiny pieces of DNA sequence and putting them together to reconstruct whole chromosomes. This is something I've been working on for nearly 20 years, and it's a fascinating but very complex problem. (See our recent paper on the wheat genome, as one of dozens of examples I could cite.)

So does Google have a new genome assembly program, and is it based on some gee-whiz AI algorithm?

No. Not even close. Let's look at some of the ways that the Google announcement and the Wired article are misleading, over-hyped, or both.

1. The Google program doesn't assemble genomes. That's right: even though the Wired piece opens with the promise of "getting the full picture" of your genome, the new Google program, DeepVariant, doesn't do anything of the sort. DeepVariant is a program for identifying small mutations, mostly changes of a single letter (called SNPs). (It can find slightly larger changes too.) This is known as variant calling, or SNP calling, and it's been around for more than a decade. Lots of programs can do this, and most of them do it very well, with accuracy exceeding 99.9%.

How could Wired get this so wrong? Well, the Wired piece is based on a Google news release from a few days earlier, called "DeepVariant: Highly Accurate Genomes With Deep Neural Networks," written by the authors of the software itself. Those authors, who obviously know what their own software does, make the misleading statement that DeepVariant is
"a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods."
If you read on, though, you quickly learn that DeepVariant is just a variant caller (as the name implies). This software does not "reconstruct the true genome sequence." That's just wrong. To reconstruct the sequence, you would need to use a program called a genome assembler, a far more complex algorithm. (I should add that many genome assemblers have been developed, and it's an active and thriving area of research. But I digress.)
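
For readers who want a concrete picture of the difference, here is a minimal sketch of what variant calling involves. It is a toy, not how DeepVariant or any real caller works: the function name `call_snps`, the thresholds, and the example reads are all made up for illustration, and real callers use statistical models, base qualities, and handle insertions and deletions. Note that it needs a finished reference sequence and reads that have already been aligned to it. It only reports where the reads disagree with that reference; it never builds a genome from scratch.

```python
from collections import Counter

def call_snps(reference, reads, min_depth=3, min_fraction=0.8):
    """Toy SNP caller (hypothetical, for illustration only).

    reads: list of (start, sequence) pairs, assumed to be already
    aligned to the reference with no insertions or deletions.
    Returns a list of (position, reference_base, called_base).
    """
    # Count the bases observed at each reference position (the "pileup").
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads here to say anything
        base, n = counts.most_common(1)[0]
        # Report a SNP when most reads agree on a base that differs
        # from the reference at this position.
        if base != reference[pos] and n / depth >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls

reference = "ACGTACGTAC"
reads = [(0, "ACGTAAGT"), (2, "GTAAGTAC"), (3, "TAAGTAC"), (1, "CGTAAGTA")]
print(call_snps(reference, reads))  # [(5, 'C', 'A')]
```

Everything here depends on the reference being given as input. A genome assembler has no such luxury: it has to work out how the reads fit together in the first place.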

The Wired article also points out that
"the data produced by today’s [sequencing] machines still only produce incomplete, patchy, and glitch-riddled genomes."
Yes, that's true. Again, though, DeepVariant does nothing to fix this problem. It can't assemble a genome, and it can't improve the assembly of an "incomplete, patchy" genome.

2. Wild hyperbole: the caption on the lead image in the Wired piece says "Deep Variant is more accurate than all the existing methods out there." The Google press release, presumably the source for that caption, claims that DeepVariant has "significantly greater accuracy than previous classical methods."

No, it does not. This is the kind of claim you'd never get away with in a scientific paper, not unless you rigorously demonstrated your method was truly better than everything else. The Google team hasn't done that.

How good is it? First, let me remind you that variant calling programs have been around a long time, and they work very well. An incremental improvement would be nice, but not "transformative" or a "breakthrough"–words that the Google team didn't hesitate to use in their press release. They also used the word "significant," which they'd never get away with in a scientific paper, not without statistics to back it up. Press releases can throw around dramatic claims like these without anyone to check them. That's not a good thing.

About a year ago, the Google team released a preprint on bioRxiv showing that their method is more accurate (on a limited data set) than an earlier method called GATK. GATK was developed by one of the same authors, Mark DePristo, in his former job at the Broad Institute of MIT and Harvard, which he left to join Google. GATK is quite good, and is very widely used, but other, newer methods are much faster and (at least sometimes) more accurate. The Google team basically ignored all of the other variant calling programs, so we just don't know if DeepVariant is better or worse than all of them. If they want to get this preprint published in a peer-reviewed journal, they're going to have to make a much better case.

(As an aside: a much less hyped method called 16GT, published earlier this year by a former member of my lab, Ruibang Luo, is far faster than DeepVariant, just as accurate, and runs on commodity hardware, unlike DeepVariant, which requires special resources only available in the Google Cloud. And it does all this with math and statistics–no AI required. But I digress.)

(Another aside: if we really wanted to get into the weeds, I would explain here that the "AI" solution in DeepVariant is a transformation of the variant calling problem into an image recognition problem. The program then uses a method called deep neural networks to solve it. I have serious reservations about this approach, but suffice it to say that there's no particular reason why treating the problem as an image recognition task would provide a large boost over existing methods.)
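
To give a rough sense of what "turning variant calling into image recognition" could mean, here is a toy sketch that renders aligned reads as a grid of pixel intensities, one row per read and one column per genome position. This is emphatically not DeepVariant's actual encoding: the function `pileup_image`, the `BASE_TO_PIXEL` table, and the intensity values are all assumptions invented for illustration. The idea is only that a grid like this could be handed to an image classifier, which would then be asked whether the column of interest contains a variant.

```python
# Arbitrary pixel intensities for each base; assumed for illustration only.
BASE_TO_PIXEL = {"A": 64, "C": 128, "G": 192, "T": 255}

def pileup_image(reads, window_start, window_width):
    """Render aligned reads as a 2D grid (hypothetical toy encoding).

    reads: list of (start, sequence) pairs, assumed already aligned.
    Rows are reads, columns are positions in the window, and 0 means
    the read does not cover that position.
    """
    image = []
    for start, seq in reads:
        row = [0] * window_width
        for offset, base in enumerate(seq):
            col = start + offset - window_start
            if 0 <= col < window_width:
                row[col] = BASE_TO_PIXEL.get(base, 0)
        image.append(row)
    return image

reads = [(0, "ACGT"), (2, "GTAC")]
for row in pileup_image(reads, window_start=0, window_width=6):
    print(row)
```

Notice that the grid contains nothing a conventional caller doesn't already have: it is the same pile of aligned bases, just laid out as pixels.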

3. More wild hyperbole. The Google news release opens with a sentence containing this:
"in the field of genomics, major breakthroughs have often resulted from new technologies."
It then goes on to describe several true breakthroughs in DNA sequencing technology, such as Sanger sequencing and microarrays, none of which had any contribution from the Google team. Then–pause for a deep breath and a paragraph break–we learn that "today, we announce the open source release of DeepVariant." Ta-da!

I can only shake my head in wonder. Does the Google team truly believe that DeepVariant is a breakthrough on a par with Sanger sequencing, which won Fred Sanger the 1980 Nobel Prize in Chemistry? This is breathtakingly arrogant.

4. DeepVariant is computationally inefficient. Even if it is better than earlier programs (and I'm not convinced of that), DeepVariant is far slower. While other programs run on commodity hardware, it appears that Google's DeepVariant requires a large, dedicated grid of computers working in parallel. The Wired article explains that two companies (DNAnexus and DNAStack) had to invest in new GPU-based computer hardware in order to run DeepVariant. An independent evaluation found that DeepVariant was 10 to 15 times slower than the competition. Coincidentally, perhaps, Google's press release also announces the availability of the Google Cloud Platform for those who want to run DeepVariant.

No thanks. My lab will continue to use 16GT, or Samtools, or other variant callers that do the job much faster, and just as well, without the need for the Google Cloud. As a colleague remarked on Twitter, the "magic pixie dust of 'deep learning' and 'google'" doesn't necessarily make something better.

Genomics is indeed making great progress, and although I applaud Google for dedicating some of its own scientific efforts to genomics, it's not helpful to exaggerate what they've done so far, especially when they take it to this level. Both the Google news release and the Wired article contain the sort of over-statements that make the public distrust science reporting. We don't need to do that to get people excited about science.
