AGBT talk: David Jaffe, Broad Institute of MIT and Harvard

Title: “High-Quality Draft Assemblies of a Dozen Vertebrate Genomes from Massively Parallel Sequence Data”

[EDIT (2011-02-10): Note the speaker has provided some clarifications to these notes in the comments below.  I have struck out comments that have been clarified, and made references to the author’s comments where warranted.]

[I walked in at 2:00 as the 454 seminar was wrapping up, instead of 2:30, so I’ve been waiting a while for this talk… the suspense has been building!]

In 2000, vertebrate sequencing cost ~$2,000,000 per genome.  29 were done.

By 2010, BGI used SOAPdenovo to do more genomes using Illumina alone.  Coverage of segmental duplications, etc. were raised as issues.

Goal: “Convince you that it is possible”

Evidence: 1. control genomes (human + mouse) 2. new vertebrate genomes.

How do we get there?

  • ALLPATHS-LG (lg is for large genomes)
  • new algorithms need to evolve with the new lab techniques
  • approach: sequence and assemble blind, then compare to reference when available.
  • Goal: each new genome should not be a research project.

Lab recipe: PET 200bp, 2000bp, 6,000bp and 40kbp (which required development of new methods. – poster by Louise Williams)

Algorithm philosophy: Discussion of small/large k values for assembly (use large k for specificity, small k for sensitivity).  Goal with ALLPATHS is to use all possible biological information – don’t lose reads that are relevant.
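To make the k trade-off concrete, here is a minimal Python sketch on a toy sequence. It is not ALLPATHS-LG code; the function names, the toy genome, and the two reads are all made up for illustration. It shows that a large k makes k-mers unique in the genome (specificity), while a small k still lets two reads with a short overlap find each other (sensitivity).

```python
from collections import Counter


def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def repeated_kmer_fraction(genome, k):
    """Fraction of distinct k-mers that occur more than once in genome."""
    counts = Counter(kmers(genome, k))
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)


def share_a_kmer(read_a, read_b, k):
    """True if the two reads have at least one k-mer in common."""
    return bool(set(kmers(read_a, k)) & set(kmers(read_b, k)))


# Hypothetical toy genome: the 6-base repeat "ACGTAC" appears twice.
toy_genome = "ACGTACTTGGCAACGTACGGATC"

# Two hypothetical reads that overlap by only 5 bases ("CTTGG").
read_a = toy_genome[0:10]
read_b = toy_genome[5:15]

for k in (3, 4, 7):
    print(k, repeated_kmer_fraction(toy_genome, k), share_a_kmer(read_a, read_b, k))
```

With k = 7 no k-mer is ambiguous, but the two reads no longer share a k-mer; with k = 4 the overlap is detected, at the price of k-mers that occur in both copies of the repeat.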

Challenge 1: Reads are inaccurate.  Remove errors, but don’t remove SNPs.  (Algorithm discussed: use k to identify read stacks, then look outside the k to try to correct poor-quality non-matching positions.)
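The read-stack idea can be sketched roughly as follows. This is a simplified illustration of the general approach as I noted it, not the algorithm actually used in ALLPATHS-LG; the function names, the thresholds (`min_qual`, `min_support`, `min_depth`), and the toy reads are assumptions. Reads that share a k-mer are stacked on it, and outside the k-mer a base is replaced by the column consensus only if it is low quality and the consensus is well supported. A high-quality mismatch is left alone, since it might be a SNP.

```python
from collections import Counter, defaultdict


def build_stacks(reads, k):
    """Group (read_index, offset) pairs by the k-mer found at that offset."""
    stacks = defaultdict(list)
    for idx, (seq, _quals) in enumerate(reads):
        for off in range(len(seq) - k + 1):
            stacks[seq[off:off + k]].append((idx, off))
    return stacks


def correct_reads(reads, k, min_qual=20, min_support=0.8, min_depth=3):
    """Return corrected sequences; reads is a list of (sequence, qualities)."""
    corrected = [list(seq) for seq, _quals in reads]
    for stack in build_stacks(reads, k).values():
        if len(stack) < min_depth:
            continue
        # Tally bases per column, with columns numbered relative to the k-mer start.
        columns = defaultdict(Counter)
        for idx, off in stack:
            for pos, base in enumerate(reads[idx][0]):
                columns[pos - off][base] += 1
        for idx, off in stack:
            seq, quals = reads[idx]
            for pos, base in enumerate(seq):
                rel = pos - off
                if 0 <= rel < k:
                    continue  # inside the shared k-mer, where all reads agree
                consensus, votes = columns[rel].most_common(1)[0]
                depth = sum(columns[rel].values())
                if (base != consensus and quals[pos] < min_qual
                        and depth >= min_depth and votes / depth >= min_support):
                    corrected[idx][pos] = consensus
    return ["".join(bases) for bases in corrected]


# Hypothetical toy data: four clean reads, one read with a low-quality error
# at position 6, and one read with a high-quality difference at position 4.
good = [30] * 8
reads = [("ACGTTGCA", good)] * 4
reads = reads + [("ACGTTGGA", [30, 30, 30, 30, 30, 30, 5, 30]),
                 ("ACGTAGCA", good)]

result = correct_reads(reads, k=4)
print(result[4])  # the low-quality mismatch is replaced by the consensus base
print(result[5])  # the high-quality mismatch is kept as a possible SNP
```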

[I missed challenge 2… I thought that was all challenge 1.]

Challenge 3: Coverage is uneven.

Solutions: Use high coverage!  Improve lab process, improve algorithms. (See Gnirke talk for better methods for GC rich areas. [Unfortunately, I didn’t go to that session.])

[Again, I missed challenge 4, which I couldn’t differentiate from part of challenge 3.]

Challenge 5: You don’t know the exact answer for what the genome says.

Challenge 6: Computations are large: billions of reads, need all of your reads in memory at once for assembly.

Solution: Buy more ram. [Ok, really?  that’s not a challenge, if throwing money at it solves the problem that easily, it’s just a funding issue – EDIT: See comments below.]

On to the experiment: (control)

Sequenced Ms.B6 mouse (finished genome available) and NA12878 cell line (sequenced by 1000 Genomes).  Three-way comparison: capillary, ALLPATHS-LG and SOAPdenovo. (No other one is available.) [really?  I thought trans-abyss and velvet were also available and able to do this…. maybe not, I’m not an expert on assembly.]

Criteria: Continuity, Completeness, Accuracy.
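My notes don’t record which metrics were used for each criterion. For continuity, a common summary statistic is N50: the length such that contigs (or scaffolds) of at least that length contain half of the assembled bases. Here is a minimal Python sketch, assuming that definition; the function name and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Length L such that pieces of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("need at least one length")


# Two hypothetical assemblies with the same total size (290 kb).
contiguous = [80, 70, 50, 40, 30, 20]   # lengths in kb
fragmented = [10] * 29                  # lengths in kb

print(n50(contiguous), n50(fragmented))
```

Both toy assemblies cover the same number of bases, but the first one has a much larger N50, which is the kind of gap a continuity comparison is meant to expose.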

Comparison summary: Capillary and ALLPATHS-LG do OK (capillary is slightly better) and SOAPdenovo is always trailing the pack.

[A whole LOT of slides that look like identical bar graphs – you’ll have to make do with the summary above.]

Doing de-novo assemblies for 15 genomes (7 fish, 8 mammals).  [ooo coelacanth genome!]

Excluding genomes [did he mean scaffolds?] in which 40kb data sets are not good (good def’n by physical coverage 20x or greater).  This removes a lot of the variability in the quality.
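For reference, physical coverage counts the whole span between paired reads rather than just the sequenced bases, so it is (number of pairs × insert size) / genome size. A small Python sketch of that arithmetic follows; the 3 Gb genome size and the pair counts are illustrative assumptions, not figures from the talk.

```python
import math


def physical_coverage(num_pairs, insert_size_bp, genome_size_bp):
    """Average number of paired-read spans covering each base of the genome."""
    return num_pairs * insert_size_bp / genome_size_bp


def pairs_needed(target_coverage, insert_size_bp, genome_size_bp):
    """Smallest number of pairs that reaches the target physical coverage."""
    return math.ceil(target_coverage * genome_size_bp / insert_size_bp)


# Assumed example: a 3 Gb genome and 40 kb jumping pairs.
genome_size = 3_000_000_000
insert_size = 40_000

print(pairs_needed(20, insert_size, genome_size))
print(physical_coverage(1_500_000, insert_size, genome_size))
```

Under these assumptions, about 1.5 million usable 40 kb pairs would reach the 20x threshold.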

“Fish are hard”

Assembly discovers sequences you can’t get any other way.

Even if you do alignment-based stuff, there is great value in assembly as a complementary approach.

Allpaths-lg future:  lots of improvement, error can be driven towards zero.

Yes, you can do whole genome assembly of vertebrates.

[I don’t think I learned much new in this talk – but I don’t really see much difference between the assemblers – EDIT: See comments below.  My take home message is that ALLPATHS-LG can be used on larger data sets, but then again, they claimed that they need massive amounts of RAM…  I’ll just suggest people check out the software and decide for themselves.]

4 thoughts on “AGBT talk: David Jaffe, Broad Institute of MIT and Harvard”

  2. Hi,

    I don’t think we’ve met. I’m very impressed that you were able to blog so much information….

    I hope you’re not offended, but I do have a few comments on your blog on my talk:

    – “Solution: Buy more ram. [Ok, really? that’s not a challenge, if throwing money at it solves the problem that easily, it’s just a funding issue.]”

    It’s not more money. One needs a large computer to solve the problem. The choice in architecture is between a distributed cluster and a shared memory box. There are reasons why a distributed cluster might be preferable, but it does not cost less than a shared memory box.

    – “[really? It thought trans-abyss and velvet were also available and able to do this…. maybe not, I’m not an expert on assembly.]”

    As we report in our PNAS paper, we tried abyss. In fact we had an extensive series of emails seeking help, but the end results were unsatisfactory. According to Daniel Zerbino, the upper limit on genome size for Velvet is ~100 Mb. Believe me, we tried to get other assembly software to run. It doesn’t exist (or at least didn’t exist at the time).

    – “I don’t really see much difference between the assemblers.”

    I’m not sure what you mean by this. For example, there is a staggering difference in scaffold size between ALLPATHS-LG and SOAPdenovo.

    All the best,


    [Note: edited for formatting only – apf.]

    • Hi David,

      Thanks for your comment – I’m pretty sure we haven’t met, although I did enjoy your talk – and I’m glad you allowed me to blog it. I’m not offended in the slightest by your corrections. In fact, I’m thrilled that you’ve taken the time to let me know where my notes were incorrect. Frankly, that’s one of the most striking advantages of having put them on the web – I can find out where I’ve drawn the wrong conclusion and correct it. I’ll make a note at the top of the post that others should read your comments below.

      In reply to your comments, I thought I’d explain some of my thoughts:

      Your comment: “It’s not more money. One needs a large computer to solve the problem”

      I understand the difference, but one usually has to work within the constraints of RAM/disk access speed. This is probably the most basic algorithm trade-off there is in computer science. As far as solving it, you can either write more elegant code, or increase the RAM. In this case, it sounded like the solution was simply to buy a machine large enough to support the algorithm, rather than trying to find an elegant solution to run with less. In the business world, I’ve often heard it described that there are two types of problems: those that can be solved by throwing money at the problem (e.g., buying more RAM), and those that can’t (e.g., innovation is needed). I didn’t mean it as an insult, just that adding more RAM to solve difficult problems is an approach that can be paid for by funding ever larger computers.

      My comment: “[really? It thought trans-abyss and velvet were also available and able to do this…. maybe not, I’m not an expert on assembly.]”

      Thank you for the correction – I haven’t used either myself, although I know the authors of both Velvet and Abyss. I had thought that Abyss had been scaled to larger projects, but as I pointed out, I’m not an expert, so I’ll simply withdraw that statement.

      My Comment: “I don’t really see much difference between the assemblers.”

      I think this may be a reflection of what I was able to take away from the talk – although I enjoyed it, I wasn’t entirely certain what the main mechanistic differences were between ALLPATHS-LG and the other assemblers – I realize that this talk may not have been the place to get into it, but it wasn’t clear to me. I think the best interpretation of this comment should be “I don’t see much difference between the assemblers, and I should look further into it.”, rather than “there is no difference”, because I am certain there are many.

      Again, thanks for your comments and clarifications.
