Title: High-Quality Draft Assemblies of a Dozen Vertebrate Genomes from Massively Parallel Sequence Data
[EDIT (2011-02-10): Note the speaker has provided some clarifications to these notes in the comments below. I have struck out comments that have been clarified, and made references to the author's comments where warranted.]
[I walked in at 2:00 as the 454 seminar was wrapping up, instead of 2:30, so I’ve been waiting a while for this talk… the suspense has been building!]
In 2000, vertebrate sequencing cost ~$2,000,000 per genome. 29 were done.
By 2010, BGI used SOAPdenovo to do more genomes using Illumina data alone. Coverage of segmental duplications, etc. was raised as an issue.
Goal: “Convince you that it is possible”
Evidence: (1) control genomes (human + mouse); (2) new vertebrate genomes.
How do we get there?
- ALLPATHS-LG (LG is for large genomes)
- new algorithms need to evolve with the new lab techniques
- approach: sequence and assemble blind, then compare to reference when available.
- Goal: each new genome should not be a research project.
Lab recipe: PET 200 bp, 2,000 bp, 6,000 bp and 40 kbp (the 40 kbp libraries required development of new methods – poster by Louise Williams).
Algorithm philosophy: Discussion of small/large k values for assembly (use large k for specificity, small k for sensitivity – toy illustration below). Goal with ALLPATHS is to use all possible biological information – don't lose reads that are relevant.
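To make that trade-off concrete, here's a toy sketch of my own (not ALLPATHS-LG code, and the sequence is hypothetical): at larger k, more k-mers are unique in the genome (specificity), but a noisy read is less likely to contain an error-free k-mer (sensitivity).

```python
# Toy illustration of the k-mer size trade-off (not from the talk).
from collections import Counter

genome = "ACGTACGTTTACGTACGGCCACGTAC" * 4  # hypothetical repetitive sequence

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

for k in (4, 8, 16):
    counts = kmer_counts(genome, k)
    unique = sum(1 for c in counts.values() if c == 1)
    # Chance a k-mer in a read is error-free, assuming a 1% per-base error rate:
    p_clean = 0.99 ** k
    print(f"k={k:2d}: {unique}/{len(counts)} k-mers unique, "
          f"P(error-free k-mer)={p_clean:.2f}")
```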
Challenge 1: Reads are inaccurate. Remove errors, but don't remove SNPs. (Algorithm discussed: use k-mers to identify read stacks, then look outside the k-mer to try to correct poor-quality, non-matching positions.)
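As I understood the general idea, it's consensus correction over a pile-up of reads sharing an anchoring k-mer. A minimal sketch of that idea (my assumption of how it works, not the actual ALLPATHS-LG algorithm; real correctors also weigh base qualities):

```python
# Stack-based error correction sketch: correct singleton mismatches
# against a strong column consensus; leave balanced columns (likely
# SNPs) alone.
from collections import Counter

def correct_stack(stack, min_support=3):
    """stack: list of equal-length reads already aligned on a shared k-mer."""
    corrected = [list(r) for r in stack]
    for col in range(len(stack[0])):
        column = Counter(r[col] for r in stack)
        base, support = column.most_common(1)[0]
        # Only overwrite singletons against a strong consensus; a true SNP
        # is supported by several reads and survives untouched.
        if support >= min_support:
            for row in corrected:
                if row[col] != base and column[row[col]] == 1:
                    row[col] = base
    return ["".join(r) for r in corrected]

stack = ["ACGTACGT",
         "ACGTACGT",
         "ACGAACGT",   # likely sequencing error at position 3
         "ACGTACGT"]
print(correct_stack(stack))  # all four reads now read 'ACGTACGT'
```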
[I missed challenge 2… I thought that was all challenge 1.]
Challenge 3: coverage is uneven.
Solutions: Use high coverage! Improve the lab process, improve the algorithms. (See the Gnirke talk for better methods for GC-rich areas. [Unfortunately, I didn't go to that session.])
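A quick way to see the kind of unevenness being described (my sketch, not from the talk): bin the reference by local GC content and check mean read depth per bin; GC-extreme bins typically show depressed coverage with standard Illumina preps.

```python
# Bin the genome into fixed windows by GC fraction and report mean
# depth per GC decile. Inputs are assumed: a reference string and a
# per-base depth list of the same length.

def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def coverage_by_gc(reference, depth, window=100):
    bins = {}  # GC decile -> (total depth, bases counted)
    for i in range(0, len(reference) - window + 1, window):
        gc = gc_fraction(reference[i:i + window])
        decile = min(int(gc * 10), 9)
        tot, n = bins.get(decile, (0, 0))
        bins[decile] = (tot + sum(depth[i:i + window]), n + window)
    return {d: tot / n for d, (tot, n) in sorted(bins.items())}

# usage: coverage_by_gc(ref_string, per_base_depth_list)
```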
[Again, I missed challenge 4, which I couldn’t differentiate from part of challenge 3.]
Challenge 5: You don't know what the genome actually says, so there's no exact answer to check against.
Challenge 6: Computations are large: billions of reads, and you need all of your reads in memory at once for assembly.
Solution: Buy more RAM. [OK, really? That's not a challenge; if throwing money at it solves the problem that easily, it's just a funding issue – EDIT: See comments below.]
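A back-of-envelope check on that point (my numbers, not the speaker's): with billions of reads held in memory at once, even a compact 2-bit base encoding is a serious machine requirement before any graph structures are built.

```python
# Rough RAM estimate for holding all reads in memory; every figure
# here is an assumption for illustration.
reads = 3_000_000_000    # ~3 billion 100 bp reads (~100x of a 3 Gb genome)
read_len = 100           # bases per read
bytes_per_base = 0.25    # 2-bit encoding, ignoring qualities and overhead

ram_gb = reads * read_len * bytes_per_base / 1e9
print(f"~{ram_gb:.0f} GB just for the bases")  # ~75 GB
```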
On to the experiment: (control)
Sequenced the Ms. B6 mouse (finished genome available) and the NA12878 cell line (sequenced by the 1000 Genomes Project). Three-way comparison: capillary, ALLPATHS-LG and SOAPdenovo. (No other assembler is available. [Really? I thought Trans-ABySS and Velvet were also available and able to do this… maybe not, I'm not an expert on assembly.])
Criteria: Continuity, Completeness, Accuracy.
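Continuity is usually summarized with N50 – the contig length at which contigs of that size or longer account for half the assembly. Since the metric underlies most of the comparison slides, here's a minimal version (my sketch, not tied to any of the assemblers):

```python
# N50: sort contig lengths descending and walk down until half the
# total assembly length is covered.
def n50(lengths):
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

print(n50([100, 200, 300, 400, 500]))  # 400
```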
Comparison summary: capillary and ALLPATHS-LG do OK (capillary is slightly better), and SOAPdenovo always trails the pack.
[A whole LOT of slides that look like identical bar graphs – you’ll have to make do with the summary above.]
Doing de novo assemblies for 15 genomes (7 fish, 8 mammals). [ooo coelacanth genome!]
Excluding genomes [did he mean scaffolds?] in which the 40 kb data sets are not good (good is defined as physical coverage of 20x or greater). This removes a lot of the variability in the quality.
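For a jumping library, physical (clone) coverage counts each read pair as spanning its whole insert, so the 20x criterion is a simple calculation; the example numbers below are hypothetical, not from the talk.

```python
# Physical coverage = pairs x insert size / genome size.
def physical_coverage(n_pairs, insert_size, genome_size):
    return n_pairs * insert_size / genome_size

# e.g. 1.5M 40 kb pairs on a 3 Gb genome:
print(physical_coverage(1_500_000, 40_000, 3_000_000_000))  # 20.0x
```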
“Fish are hard”
Assembly discovers sequences you can’t get any other way.
Even if you do alignment-based stuff, there is great value in assembly as a complementary approach.
ALLPATHS-LG future: lots of room for improvement; error can be driven towards zero.
Yes, you can do whole genome assembly of vertebrates.
[I don't think I learned much new in this talk – but I don't really see much difference between the assemblers. EDIT: See comments below. My take-home message is that ALLPATHS-LG can be used on larger data sets, but then again, they claimed that it needs massive amounts of RAM… I'll just suggest people check out the software and decide for themselves.]