Steven Salzberg – Center for Bioinformatics and Computational Biology.

[Told we should blog everything, so I’ll take that at face value.  I did miss copying the title of the talk, but it’s spanning his group’s three applications.]

A tale of three programs:

  • Bowtie (published 2 years ago)
  • Tophat (spliced alignment) (published 2 years ago)
  • Cufflinks (Nature Biotech – less than a year ago)

By the way, these are all open source and free.  [YAY!!!]

First challenge: do alignment, and do it fast.  3 billion base pair reference, and millions of reads – if you don’t do it fast, you’ll wait a LONG time.  Goal: get the best alignment.

Subproblem: if you want ALL alignments, it’s not necessarily the same problem.

Mapping requires an “index”.  NGS needed better ways because of time constraints.  Suffix trees, hash tables, etc. are all WAY too slow, and their indexes for a genome this size are huge (12 GB to 35 GB).  Thus, an index was needed that requires 2 GB or less of memory.

Solution: Burrows-Wheeler.  Takes 1/3 to 1/2 byte per base, so the human genome index is about 1.1 GB.  You also get better speed: time to search is linear, proportional to the length of the read. [Very cool.  Loved it when I saw it the first time 2 years ago.]
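To make the "linear in the length of the read" point concrete, here is a minimal Python sketch of backward search over a Burrows-Wheeler index (the transform itself is explained just below). This is a toy and not Bowtie's code: the function names, the `$` end marker, and the uncompressed occurrence table are my assumptions for illustration only.

```python
def bwt(text, end_marker="$"):
    """Last column of the sorted rotation matrix ('$' sorts before A, C, G, T)."""
    text += end_marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_index(text):
    """Build the two lookup tables needed for backward search (toy, uncompressed)."""
    last_column = bwt(text)
    counts = {}
    for ch in last_column:
        counts[ch] = counts.get(ch, 0) + 1

    # first_row[c]: first matrix row whose rotation starts with character c
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    # occ[c][i]: how many times c appears in last_column[:i]
    occ = {ch: [0] * (len(last_column) + 1) for ch in counts}
    for i, seen in enumerate(last_column):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == seen else 0)
    return first_row, occ, len(last_column)


def count_occurrences(read, index):
    """Count exact matches of `read`; one loop step per base of the read."""
    first_row, occ, n = index
    lo, hi = 0, n  # half-open range of matrix rows still matching
    for ch in reversed(read):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("ACGTACGTAC")
print(count_occurrences("AC", index))
print(count_occurrences("CGT", index))
```

The loop in `count_occurrences` runs once per base of the read regardless of how long the reference is, which is where the linear search time comes from. A real index would compress the occurrence table rather than storing a full column per character.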

Bowtie: can use this index to behave like Maq, SOAP, BWA, and can call SNPs using the Maq interface.

Explanation of Burrows-Wheeler:  Must start by creating a Burrows-Wheeler matrix.  Generate all rotations of the original text. (Move first letter to end, ad nauseam).  Then sort alphabetically.  The transform is simply the last column – which is simply a re-ordering of the original string.
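The steps above can be sketched in a few lines of Python. This is the naive textbook construction, which is fine for a short string but not how you would index a genome. The `$` end-of-text marker is my addition (it was not mentioned in the talk notes) and the function names are made up for the example.

```python
def bwt_matrix(text, end_marker="$"):
    """All rotations of text + end marker, sorted alphabetically."""
    text = text + end_marker
    # Rotation i moves the first i letters to the end.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)


def bwt(text, end_marker="$"):
    """The transform is the last column of the sorted matrix."""
    return "".join(row[-1] for row in bwt_matrix(text, end_marker))


for row in bwt_matrix("GATTACA"):
    print(row)
print(bwt("GATTACA"))  # ACTGA$TA
```

The output has exactly the same letters as the input plus the marker, just in a different order.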

Original algorithm was for compression. [I did not know that.]
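A rough sketch of why the transform is useful for compression: it tends to put identical letters next to each other, and it can be undone to get the original text back. The Python below is a toy under my own assumptions (a `$` end marker, the slow sort-based inverse, made-up function names), not anything shown in the talk.

```python
from itertools import groupby


def bwt(text, end_marker="$"):
    """Last column of the sorted rotation matrix."""
    text += end_marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end_marker="$"):
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    # The row that ends with the marker is the original text plus marker.
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


def run_lengths(text):
    """Collapse runs of identical letters into (letter, count) pairs."""
    return [(letter, len(list(group))) for letter, group in groupby(text)]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(run_lengths(transformed))
print(inverse_bwt(transformed))  # banana
```

Runs like "nn" and "aa" are what a run-length or similar encoder can then squeeze down.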

Bowtie: hugely faster than its competitors.  An older table of speeds is outdated, but still convincing: 20 seconds for Bowtie versus 91 hours for SOAP.  (It was about a 38x speedup over Maq.)

Maq has now been replaced by BWA, which also uses the Burrows-Wheeler transform, as does SOAP2.

Bowtie now handles longer reads, paired reads, and SOLiD reads (natively in colourspace).

Current comparison with a new dataset:

  • Bowtie: 1805 seconds
  • SOAP2: 2969 seconds
  • BWA: 11,659 seconds

[Wow, that’s still a big disparity.]  Percent aligned, however, isn’t so good for SOAP2 (81.6%) and BWA (89.5%) versus Bowtie (93.1%).
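For a sense of scale, the ratios between those numbers can be worked out directly. The short Python below only uses the runtimes and percentages from the slide as I copied them; the function and variable names are mine.

```python
# Runtimes and alignment rates as noted from the slide.
runtime_seconds = {"Bowtie": 1805, "SOAP2": 2969, "BWA": 11659}
percent_aligned = {"Bowtie": 93.1, "SOAP2": 81.6, "BWA": 89.5}


def slowdown_versus(baseline, runtimes):
    """How many times longer each program ran compared to the baseline."""
    base = runtimes[baseline]
    return {name: seconds / base for name, seconds in runtimes.items()}


for name, ratio in slowdown_versus("Bowtie", runtime_seconds).items():
    gap = percent_aligned["Bowtie"] - percent_aligned[name]
    print(f"{name}: {ratio:.2f}x Bowtie's runtime, {gap:.1f} points fewer reads aligned")
```

That works out to roughly 1.6x for SOAP2 and roughly 6.5x for BWA on this dataset.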

Do the programs map the same reads?  Mostly yes – ~7% of  reads are mapped by only one program… and they don’t always agree. [I’ve seen this comparing other aligners, but neat to see here as well.]
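If you want to run this kind of comparison on your own data, it boils down to set operations on read IDs. A minimal Python sketch follows; the read IDs and positions are invented toy data, and the function names are hypothetical, not part of any of these aligners.

```python
def mapped_by_only_one(mapped):
    """For each aligner, the reads that no other aligner mapped.

    `mapped` is a dict: aligner name -> set of mapped read IDs.
    """
    unique = {}
    for name, reads in mapped.items():
        others = set()
        for other_name, other_reads in mapped.items():
            if other_name != name:
                others |= other_reads
        unique[name] = reads - others
    return unique


def disagreements(positions_a, positions_b):
    """Reads mapped by both aligners but placed at different positions.

    Each argument is a dict: read ID -> (chromosome, position).
    """
    shared = positions_a.keys() & positions_b.keys()
    return {read for read in shared if positions_a[read] != positions_b[read]}


# Invented toy data, for illustration only.
mapped = {
    "Bowtie": {"r1", "r2", "r3", "r4"},
    "SOAP2": {"r1", "r2", "r5"},
    "BWA": {"r1", "r2", "r3"},
}
print(mapped_by_only_one(mapped))

bowtie_pos = {"r1": ("chr1", 100), "r2": ("chr2", 500)}
bwa_pos = {"r1": ("chr1", 100), "r2": ("chr7", 42)}
print(disagreements(bowtie_pos, bwa_pos))
```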

Bowtie2 is coming soon – fully functional handling of indels, any number of indels permitted, any read length permitted, expect it in the spring… soon!  [No date promised tho.]

http://bowtie.cbcb.umd.edu

44,000 downloads of the paper, 41,000 downloads of program.

Ok – other programs!

Challenge #2: Spliced alignment

Designed for RNA-Seq.  Do alignments first with Bowtie.  Assemble contigs (exons) from reads.  Then, use TopHat to find putative splice sites, no more than one read length away.  Set how long you want your longest intron, concatenate all possible mates, then use Bowtie to assemble a matrix on the fly and do the remaining alignments.
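Here is how I picture the "concatenate all possible mates" step, as a very simplified Python sketch. It is my own reading of the description and not TopHat's actual code: exon coordinates are given as half-open `(start, end)` intervals, the `flank` parameter is a hypothetical stand-in for "one read length", and plain substring matching stands in for the Bowtie alignment of the leftover reads.

```python
def candidate_junctions(reference, exons, max_intron, flank):
    """Pair up exons and join the flanking sequence on each side of the gap.

    Returns a list of (upstream_end, downstream_start, left_flank, right_flank).
    """
    exons = sorted(exons)
    candidates = []
    for i, (up_start, up_end) in enumerate(exons):
        for down_start, down_end in exons[i + 1:]:
            intron_length = down_start - up_end
            if intron_length <= 0 or intron_length > max_intron:
                continue  # overlapping exons, or intron longer than allowed
            left = reference[max(up_start, up_end - flank):up_end]
            right = reference[down_start:min(down_end, down_start + flank)]
            candidates.append((up_end, down_start, left, right))
    return candidates


def spliced_hits(reads, candidates):
    """Map each unaligned read to the first junction it crosses exactly."""
    hits = {}
    for read in reads:
        for up_end, down_start, left, right in candidates:
            joined = left + right
            for pos in range(len(joined) - len(read) + 1):
                # The read must match and must actually span the join point.
                if joined.startswith(read, pos) and pos < len(left) < pos + len(read):
                    hits.setdefault(read, (up_end, down_start))
    return hits


# Toy reference: exon, intron, exon (invented sequence).
reference = "AAAACCCC" + "GTTTTTAG" + "GGGGTTTT"
exons = [(0, 8), (16, 24)]
junctions = candidate_junctions(reference, exons, max_intron=10, flank=4)
print(junctions)
print(spliced_hits(["CCCGGG", "AAAA"], junctions))
```

Lowering `max_intron` below the gap length removes the candidate, which is the knob mentioned in the talk for setting the longest intron.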

Now uses colourspace natively.

A new “flavour” called “TopHat-Fusion”, which finds fusion genes.  [Discussion of some of the findings – which are cool, and include a known case.]  Does this entirely ab initio.

Challenge #3: Assembling the transcriptome

Cufflinks: for isoform assembly,

  • assemble as many reads into as few (highly probable) transcripts as possible, ab initio.
  • Quantitate all those transcripts.

How many transcripts are there?  ~90% of human genes have isoforms.  We’ve been ignoring this too long.

Cufflinks helps you disambiguate phasing of exons – you can guess this from coverage, but the results are pretty poor. [Yes, I’ve tried it – fail.]  Instead, you can use Cufflinks, and it does much better because you can trace the spans.
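A toy illustration of why tracing the spans beats coverage, in Python. Coverage alone cannot say which alternative exon goes with which, but a fragment that spans both choices can. This is only my sketch of the idea with invented exon names and made-up function names, and it is not the Cufflinks algorithm.

```python
def is_compatible(fragment, isoform):
    """True if the exons a fragment spans appear as a contiguous run in the isoform."""
    n = len(fragment)
    return any(
        tuple(isoform[i:i + n]) == tuple(fragment)
        for i in range(len(isoform) - n + 1)
    )


def isoforms_explaining(fragments, candidates):
    """For each fragment, list the candidate isoforms it is compatible with."""
    return {
        fragment: [iso for iso in candidates if is_compatible(fragment, iso)]
        for fragment in fragments
    }


# Two alternative exons (B1/B2) and two more (D1/D2): four possible combinations.
candidates = [
    ("A", "B1", "C", "D1", "E"),
    ("A", "B1", "C", "D2", "E"),
    ("A", "B2", "C", "D1", "E"),
    ("A", "B2", "C", "D2", "E"),
]

# Invented fragments, each listed by the exons it spans in order.
fragments = [("A", "B1"), ("B1", "C", "D2"), ("B2", "C", "D1")]

for fragment, isoforms in isoforms_explaining(fragments, candidates).items():
    print(fragment, "->", isoforms)
```

The short fragment `("A", "B1")` fits two candidates and so says nothing about phasing, while each of the long spanning fragments fits exactly one.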

Example case, using 430M reads – found 13.5k known isoforms, 14.5k novel isoforms, 4k of which were detected at multiple time points.  Each time this is redone, you find new isoforms for human and for mouse.

Also works for different transcription initiation. [showing some neat data here – can’t draw on my blog… too bad.]

[Ok, as always, score one for the open source software – this was pretty cool, I may have to check it out.]

BTW, Open invitation for anyone who wants to join the project! [awesome… I <3 this lab.]

4 thoughts on “Steven Salzberg – Center for Bioinformatics and Computational Biology.”

  2. Bowtie is impressive speed-wise – great for RNA-seq, which is probably what he stuck to in his talk for the 3 applications. But what about non-RNA applications such as exome resequencing? Bowtie is clearly not suited.

    For mutation finding, which requires accurate alignment of SNVs and indels, BFAST is the best. I hope Bowtie2 is compared directly with BFAST…
