Top 10 genome papers of all time

It’s December, and that means “top 10” lists are starting to appear for the year. I’ve put together a top 10 list, but it’s not just for the year 2008. The genome era is far enough along that we can now ask the question, “what are the top 10 genome papers to date?” The first complete genome of a free-living organism was published in 1995 (Haemophilus influenzae) and literally hundreds of genomes have appeared since then - thousands, if you count virus genomes.

How does one measure the importance of a genome to science? Of course I could give you my subjective list, but I was looking for an objective measurement, one that anyone would have to admit is reasonable. The one I chose – the obvious one, really – is the number of scientific citations that the original genome paper has collected. This measure has a bias towards older papers, because newer papers haven’t yet had time to accumulate as many citations, but all of the papers on the Top 10 Genomes list are at least 6 years old. I will revise this list in the future to accommodate updates in the citation counts.

The other question is how to count citations. After looking at several sources, I chose ISI’s Web of Science citation index. Google Scholar is another option, and I used it as well, but I found that Google is less accurate – it uses a heuristic method to collect citations, and it frequently double-counts references, especially for papers with large numbers of authors. I listed both counts in the Top 10 list, but the ranking follows ISI where there’s a disagreement.

So here they are! The Top 10 Genome Papers include 5 prokaryotes (4 bacteria and 1 archaeon), 3 model organisms, and the two human genome papers right at the top. Not surprisingly, all 10 appear in Nature or Science (5 in each journal). All of the first authors are different, and three were authored by consortia without a traditional first author. And for those who want to argue about which of the two human papers deserves #1, ISI gives a clear edge to the publicly-funded effort, while Google Scholar, curiously, ranks the Celera Genomics effort (which I was part of) well ahead of the public project. My subjective list would have included the malaria genome paper (MJ Gardner et al, Nature 2002) – TB and malaria are the two greatest infectious disease killers of humans – but it came in at #12 using citation criteria. But it’s much newer than #9 and #10, so I'm betting it will move up in the future – stay tuned.

[Note that I’ve also created a separate web page for this list.]

Top 10 genome papers of all time

Criteria for inclusion: a paper must be the first description of the complete or near-complete genome of a species, and it must describe the DNA sequence as well as relevant sequencing methods and biological discoveries revealed by the initial sequencing of the genome. Rankings are based on citation counts, with the ISI Web of Science taking priority over Google Scholar, which is less accurate as it uses heuristic rules to gather citations. Counts from both databases are provided. Citation counts are current as of December 2008.

1. Initial sequencing and analysis of the human genome
International Human Genome Sequencing Consortium
Nature 409:6822 (15 Feb 2001), 860-921.
Times Cited: 6,416
Google Scholar: 5,542

2. The sequence of the human genome
JC Venter, MD Adams, EW Myers, et al. (274 authors)
Science 291:5507 (16 Feb 2001), 1304-1351.
Times Cited: 4,588
Google Scholar: 6,502 [Note that Google places this paper at #1]

3. The Complete Genome Sequence of Escherichia coli K-12
FR Blattner, G Plunkett, CA Bloch, et al.
Science 277:5331 (5 Sept 1997), 1453-1462.
Times Cited: 3,327
Google Scholar: 3,625

4. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd
RD Fleischmann, MD Adams, O White, et al.
Science 269:5223 (28 July 1995), 496-512.
Times Cited: 3,075
Google Scholar: 2,651

5. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence
ST Cole, R Brosch, J Parkhill, et al.
Nature 393:6685 (11 June 1998), 537-544.
Times Cited: 2,858
Google Scholar: 3,163 [Note that Google places this paper at #4]

6. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana
The Arabidopsis Genome Initiative (143 authors)
Nature 408:6814 (14 Dec 2000), 796-815.
Times Cited: 2,689
Google Scholar: 1,728 (Google has real trouble tracking this "group author" name)

7. The genome sequence of Drosophila melanogaster
MD Adams, SE Celniker, RA Holt, et al.
Science 287:5461 (24 Mar 2000), 2185-2195.
Times Cited: 2,632
Google Scholar: 3,002

8. Initial sequencing and comparative analysis of the mouse genome
Mouse Genome Sequencing Consortium
Nature 420:6915 (5 Dec 2002), 520-562.
Times Cited: 2,188
Google Scholar: 1,763

9. The complete genome sequence of the gastric pathogen Helicobacter pylori
JF Tomb, O White, AR Kerlavage, et al.
Nature 388:6642 (7 Aug 1997), 539-547.
Times Cited: 1,960
Google Scholar: 1,325

10. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii
CJ Bult, O White, GJ Olsen, et al.
Science 273:5278 (23 Aug 1996), 1058-1073.
Times Cited: 1,811
Google Scholar: 1,425 [Note that Google places this paper at #9]
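
To make the ranking rule concrete, here is a minimal sketch that sorts the ten papers above by each citation source and flags where the two orderings disagree. The counts are the December 2008 figures from the list; the short labels and the function name are only illustrative, and the Google Scholar ranks are computed among these ten papers alone.

```python
# (short label, ISI Web of Science count, Google Scholar count)
# Counts are the December 2008 figures from the list above; labels are illustrative.
papers = [
    ("Human (public consortium)", 6416, 5542),
    ("Human (Celera)", 4588, 6502),
    ("E. coli K-12", 3327, 3625),
    ("H. influenzae", 3075, 2651),
    ("M. tuberculosis", 2858, 3163),
    ("A. thaliana", 2689, 1728),
    ("D. melanogaster", 2632, 3002),
    ("Mouse", 2188, 1763),
    ("H. pylori", 1960, 1325),
    ("M. jannaschii", 1811, 1425),
]


def rank_by(column):
    """Return {label: rank}, where rank 1 has the highest count in that column."""
    ordered = sorted(papers, key=lambda p: p[column], reverse=True)
    return {p[0]: i for i, p in enumerate(ordered, start=1)}


isi_rank = rank_by(1)      # the ranking the list follows
google_rank = rank_by(2)   # the alternative ranking, among these ten only

for label, isi, google in papers:
    flag = "" if isi_rank[label] == google_rank[label] else "  <- sources disagree"
    print(f"{isi_rank[label]:>2} (Google {google_rank[label]:>2})  {label}{flag}")
```

Sorting on the ISI column reproduces the order of the list, while the Google Scholar column swaps the two human genome papers and reshuffles several others.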

5 comments:

  1. Interesting... on a per gene basis, some of the viral genomes might make the list!!

  2. Mark my words, in 10 years the platypus will be the top of this list!

  3. I do not think citation count is the right metric. I mean, sure, it is useful to see which genome papers have lots of citations. But I think in many cases people cite a collection of genome papers. I would perhaps be more interested in normalizing the citations by paper if one could, so that if a paper cites three genome papers, each one only gets 1/3 of a citation. For example, many, many papers cite both the Celera and the public human genome papers when they are talking about the human genome.

    Another issue is that, for some reason, it seems to me that when the human genome data is used, people cite the genome papers. But when data is used from other genome projects (e.g., E. coli), the genome paper is not cited, and either just a GenBank ID is used or even that is not listed. I am not sure what the sociology of this is, but I think the impact of some genomes is being undercounted by this citation metric because of how people decide when to cite the genome paper when they use the genome data.

  4. Jonathan,
    I agree that citations over-simplify, but I don't have a better metric that is easy to compute, and that is objective. I've often been frustrated, like you, when I see a paper cite nothing more than a GenBank ID for a genome, rather than the genome paper.
    If anyone wants to suggest (or compute) another ranking metric, please do.

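The per-paper normalization suggested in the comments (a paper that cites three genome papers gives each of them 1/3 of a citation) could be sketched roughly as follows. The citing papers and genome labels here are invented purely for illustration; real input would have to come from a citation database export, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical input: each citing paper mapped to the genome papers it references.
# These entries are made up for illustration and are not real citation data.
citing_papers = {
    "citing_paper_A": ["human_public", "human_celera"],
    "citing_paper_B": ["human_public", "human_celera", "mouse"],
    "citing_paper_C": ["e_coli"],
}


def fractional_counts(citations):
    """Split one citation unit per citing paper evenly among the genome papers it cites."""
    totals = defaultdict(float)
    for genome_refs in citations.values():
        unique_refs = set(genome_refs)  # ignore accidental duplicate references
        if not unique_refs:
            continue
        share = 1.0 / len(unique_refs)
        for genome in unique_refs:
            totals[genome] += share
    return dict(totals)


scores = fractional_counts(citing_papers)
for genome, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{genome}: {score:.3f}")
```

Under this scheme every citing paper contributes exactly one unit in total, so a genome paper that is usually cited alongside others scores lower than one that is cited on its own the same number of times.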
