>Bowtie and Single End Mapped Paired End Data

>Strange title for a posting, but it actually makes sense, if you think about it.

One of the researchers here has been working on an assembly of a reasonably small fungal genome, using short read technology with which I’ve been somewhat involved. It’s been an educational project for me, so I’m glad I had the opportunity to contribute. One of the key elements of the paper, however, was the use of Paired End Tags (PET) from an early Illumina machine to assess the quality of the assembly.

Unfortunately, an early run of the software I’d written to analyze the Maq alignments of the paired ends to the assemblies had a bug, which made it look like the data supported the theory quite nicely – but alas, it was just a bug. The bug was fixed a few weeks ago, and a re-run of the data turned up something interesting:

If you use Maq alignments, you’re biasing your alignments towards the Smith-Waterman aligned sequences, which are not an independent measure of the quality of the assembly. Not shocking? I know – it’s pretty obvious.

Unfortunately, we hadn’t put it in context beforehand, so we had to find another way of doing this. We wanted to get a fairly exhaustive set of alignments for each short read – and we wanted to get it quickly – so we turned to Bowtie. While I wasn’t the one running the alignments, I have to admit I’m impressed with the speed of the aligner. Multi-matches work well, the file format is intuitive – similar to an Eland file – and the quality of the information seems to be good. (The lack of a good scoring mechanism is a problem, but wasn’t vital for this project.)

Anyhow, by performing a Single End Tag style alignment on PET data, while retaining multimatches, we were able to remove the biases of the aligners, and perform the pairings ourselves – learning more about the underlying assembly to which we were aligning. This isn’t something you’d want to do often, unless you’re assembling genomes fairly regularly, but it’s quite cool.
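For the curious, the pairing step itself is easy to sketch. What follows is a minimal Python version, not the actual code from the Vancouver Package; the insert-size window and the (contig, position, strand) hit tuples are assumptions for illustration only:

```python
from itertools import product

# Hypothetical insert-size window; the real values would come from
# the library prep, not from anything in this post.
MIN_INSERT, MAX_INSERT = 100, 400

def pair_alignments(read1_hits, read2_hits):
    """Pair independently-aligned single-end hits from the two ends of a PET.

    read1_hits / read2_hits: dict of read name -> list of
    (contig, position, strand) tuples, one per multi-match.
    Keeps every combination that lands on the same contig, on
    opposite strands, within the insert-size window.
    """
    pairs = {}
    for name in read1_hits.keys() & read2_hits.keys():
        found = []
        for (c1, p1, s1), (c2, p2, s2) in product(read1_hits[name],
                                                  read2_hits[name]):
            if c1 != c2 or s1 == s2:
                continue  # mates must share a contig, on opposite strands
            insert = abs(p2 - p1)
            if MIN_INSERT <= insert <= MAX_INSERT:
                found.append((c1, min(p1, p2), insert))
        if found:
            pairs[name] = found
    return pairs
```

Reads whose mates pair to several places, or to none at all, are exactly the ones that tell you something about the assembly underneath.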

Alas, for the paper, I don’t think this data was good enough quality – there may not be a good story in the results that will replace the one we thought we had when the data was first run… but there were other good results in the data, and the journey was worth the trip, even if the destination wasn’t all we’d hoped it would be.

As a sidenote, my code was run on an otherwise unused 16-way box, and managed to hit 1592% CPU usage at one point, while taking up 46Gb of RAM. (I should have processed each lane of reads separately, but didn’t.) That’s the highest CPU usage I’ve ever seen from my code; my previous record was 1499%. I should note that I only observe these scalings during the input processing – so the overall speed-up was only 4x.

The code is part of the Vancouver Short Read Analysis Package, called BowtieToBedFormat.java, if anyone needs a similar process for Bowtie reads. (There’s no manual for it at the moment, but I can put one together upon request.)

>Maq Bug

>I came across an interesting bug today, when trying to work with reads aligned by Smith-Waterman (flag = 130), in PET alignments. Indeed, I even filed a Bug Report on it.

The lowdown on the “bug” or “feature” (I’m not sure which it is yet) is that sequences containing deletions don’t actually show the deletions – they show the straight genomic sequence at that location. The reason I figure this is a bug rather than by design is that sequences with insertions do show the insertion. So why the discrepancy?

Anyhow, the upshot of it is that I’m only able to use half of the Smith-Waterman alignments Maq produces when doing SNP calls in my software. (I can’t trust that the per-base quality scores align to the correct bases with this bug.) Since other people are doing SNP calls using Maq alignments… what are they doing?

>Field Programmable Gate Arrays

>Yes, I’m procrastinating again. I have two papers, two big chunks of code and a thesis proposal to write, a paper to review (it’s been done, but I have yet to type out my comments…), several major experiments to do and at least one poster looming on the horizon – not to mention squeezing in a couple of manuals for the Vancouver Package Software. And yet, I keep finding other stuff to work on, because it’s the weekend.

So, I figured this would be a good time to touch on the topic of Field Programmable Gate Arrays, or FPGAs. I’ve done very little research on this topic, since it’s so far removed from my own core expertise, but it’s a hot topic in bioinformatics, so I’d be doing a big disservice by not touching on it at all. However, I hope people will correct me if they spot errors.

So what is an FPGA? I’d suggest you read the wikipedia article linked above, but I’d sum it up as a chip that can be added to a computer, which has the ability to optimize the way in which information is processed, so as to accelerate a given algorithm. It’s a pretty cool concept – move a particular part of an algorithm into the hardware itself to speed it up. Of course, there are disadvantages as well. Reprogramming is (was? – this may have changed) a few orders of magnitude slower than processing information, so you can’t change the programming on the fly while processing data and still hope to get a speed up. Some chips can change the programming of unused sub-sections while other algorithms are running… but now we’re getting really technical.

(For a very good technical discussion, I suggest this book, of which I’ve read a few useful paragraphs.)

Rather than discuss FPGAs, which are a cool subject on their own, I’d rather discuss their applications in Bioinformatics. As far as I know, they’re not widely used for most applications at the moment. The most processor intensive bioinformatics applications, Molecular Modeling and drug docking, are mainly vector-based calculations, so vector chips (eg. Graphics Processing Units – GPUs) are more applicable to them. As for the rest, CPUs have traditionally been “good enough”. However, two recent developments seem to have accelerated this potential marriage of technology:

  1. The makers of FPGAs have been looking for applications for their products for years and have targeted bioinformatics because of its intense computer use. Heavy computer use is always considered a sign that more efficient processing speed is an industry need – and FPGAs appear to meet that need, on the surface.
  2. Bioinformatics was doing well with the available computers, but suddenly found itself behind the processing curve with the advent of Second Generation Sequencing (SGS). Suddenly, the amount of information being processed spiked by an order of magnitude (or more), causing the bioinformaticians to scream for more processing power and resources.

So, it was inevitable that FPGA producers would hear about the demand for more power in the field, and believe that it’s the ideal market into which they should plunge. To the casual observer, Bioinformatics needs more efficiency and power, and FPGA producers are looking for a market where efficiency and power are needed! Is this a match made in heaven or what?

Actually, I contend that FPGAs are the wrong solution for several reasons.

While Second Generation Sequencing produces tons more data, the algorithms being employed haven’t yet settled down. Every 4 months we pick a different aligner. Every 3 months we add a new database. Every month we produce a more efficient version of our algorithms for interpreting the data. The overhead of translating an algorithm into the hardware description necessary to use the FPGA (which seems large to me, but may not be to people more fluent in HDL) means that you’d spend a disproportionate amount of time getting the chips set up to process your data – which you’re only going to use for a short period before moving on. And any gain in efficiency would probably be wiped out by the amount of effort introduced.

Furthermore, even when we do know the algorithms being used are going to stay around, a lot of our processing isn’t necessarily CPU bound – but rather is I/O or memory bound. When you’re trawling through 16Gb of memory, it’s not necessarily obvious that adding more speed to the CPU will help. Pre-fetching and pre-caching are probably doing more to help you out than anything else bound to your CPU.

In the age of multi-CPU machines, multi-threaded programs already relieve many of the pains that plague bioinformaticians. Most of my Java code is thrilled to pull 2, 3 or more processors in to work faster – without a lot of explicit multi-threading. (My record so far is 1496% CPU usage – nearly 15 processors.) I would expect that buying 16-way processors is more cost-efficient than buying 16 FPGAs for processing data with many of the current algorithms in use.

Buying more conventional resources will probably alleviate the sudden bottleneck in compute power more readily than innovating entirely new solutions to meet the need. It’s likely that many groups getting into the second generation genomics technologies failed to understand the processing demands of the data, and thus didn’t plan adequately for the resources. This means that much of the demand for data processing is just temporary, and may even be alleviated by more efficient algorithms in the future.

So where does the FPGA fit in?

I’d contend that there are very few products out there that would benefit from FPGAs in Bioinformatics… but there are a few. Clearly, all bioinformaticians know that aligning short reads is one of those areas. Considering that a full Maq run for a flow cell from an Illumina GAII takes 14+ hours on a small cluster, that would be one area in which they’d clearly benefit.

Of course, no bioinformatician wants to have to reprogram an FPGA on the fly to use it. Were I to pick a model, it would probably be to team up with an aligner group to produce a stand-alone, multi-FPGA/CPU hybrid box with 32Gb of RAM and a 3-4 year upgrade path. Every 4 months you produce a new aligner algorithm and HDL template, and users pick up the aligner and HDL upgrade and “flash” their computer to use the new software/hardware. This would follow the Google Appliance model: an automated box that does one task, and does it well – with the exception that hardware “upgrades” come along with the software patches. That would certainly turn a few heads.

At any rate, only time will tell. If the algorithms settle down, FPGAs may become more useful. If FPGAs become easier for bioinformaticians to program, they may find a willing audience. If the FPGA makers begin to understand the constraints of bioinformatics groups, they may find niche applications that will truly benefit from this technology. I look forward to seeing where this goes.

Ok… now that I’ve gone WAY out on a limb, I think it’s time to tackle a few of those tasks on my list.

>Indel Calling

>William left an interesting comment yesterday, which I figured was worthy of its own post, albeit a short one. (Are any of my posts really short, tho?) His point was that everyone in the genomics field is currently paying a lot of attention to SNPs, and very little attention to INDELs (Insertions and Deletions). And, I have to admit – this is true. I’m not aware of anyone really trying to do indels with Solexa reads yet. But there are very good reasons for this.

The first is a lack of tools – most aligners aren’t doing gapped alignments. Well, there’s Exonerate and ZOOM, and possibly Blast, but all of those have their problems. (Exonerate’s gapped mode is very slow and poorly supported, and we had a very difficult time making some versions work at all; Blast is completely infeasible for short reads on a reasonable time scale, and isn’t guaranteed to give the right answer; and ZOOM just hasn’t been released yet. If there are others, feel free to let me know.) Nothing currently available will really do a good gapped short read alignment. And, if you’ve noticed the key words – “short read” – you’re on to the real reason why no one is currently working with indels.

Yep – the reads are just too short. If you have a 36bp read, or even, say a 42bp read, you’re looking at (best case) having your indel right in the middle of the sequence, giving you two 21-base sequences, one on each side of the gap. Think about that for a moment and let that settle in. How much of the genome is unique for 21bp reads, which may have 2 or more SNPs or sequencing errors? I’d venture to say it’s 60% or so. With the 36 base pair read, you’re looking at two 18-bp reads, which is more like 40-50% of the genome. (Please don’t quote me on those numbers – they’re just estimates.) And that’s best case.
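To make the arithmetic concrete – a trivial Python sketch, but it shows where the numbers above come from:

```python
def split_seeds(read_length, gap_position):
    """Lengths of the two anchor sequences flanking an indel that
    falls after gap_position bases of the read."""
    return gap_position, read_length - gap_position

# Best case, the gap splits the read evenly:
#   split_seeds(36, 18) -> (18, 18)
#   split_seeds(42, 21) -> (21, 21)
```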

If your gap is closer to the end, you’ll get something more like a 32bp read and a 10bp read…. and I wouldn’t trust a 10bp seed to give the correct match against the genome no matter what aligner you’ve got – especially if it comes from the “poor” end of an Illumina sequence.

So that leaves you with two options: use a paired end tag, or use a longer read.

Paired end tags (PET) have been around for a couple of months now. We’re still trying to figure out the best way of using the technology, but it’s coming. People are mostly interested in using them for other applications – gross structural abnormalities, inversions, duplications, etc – but indels will be in there. It should be a few more months before we really see some good work done with PETs in the literature. I know of a couple of neat applications already, but a lot of the difficulty was just getting a good PET aligner going. Maq is there now, and it does an excellent job, though post-processing the .map files for PET is not a lot of fun. (I do have software for it, tho, so it’s definitely a tractable problem.)

Longer reads are also good. Once you get gaps with enough bases on either side to do double-seed searches, we’ll get fast Indel capable aligners – and I’m sure it’s coming. There are long reads being attempted this week at the GSC. I don’t know anything about them, or the quality, but if they work, I’d expect to see a LOT more sequences being generated like this in the future, and a lot more attention being paid to indels.

So, I can only agree with William: we need to pay more attention to indels, but we need the technology to catch up first.

P.S. For 454 fans out there, yes, you do get longer reads, but I think you also need a lot of redundancy to show that the reads aren’t artifacts. As 454 ramps up its throughput, we’ll see both the Solexa and 454 platforms converge towards better data for indel studies.

>ZOOM-ing along.

>I’ve recently been having a public conversation with the people at Bioinformatics Solutions Inc about their latest project – a short read aligner called ZOOM. While I joked about them only accepting resumes in Microsoft Word format (Update: apparently that’s already been taken off their web page), they do seem to have their act together in other respects. Of course, I haven’t actually seen their application yet, and I have only their word to go by; however, what I am hearing seems promising. I suggest people go read that thread on SeqAnswers to get an idea of what they’re offering.

Unfortunately (- or maybe not – ) the timing is a bit off for me. I’m in the process of clearing things off my desk for my two week vacation, starting Friday, and I believe they’ll be releasing the demo version of their software next week sometime. Still, I suspect I won’t let that bother me too much. Between coconuts, hammocks and snorkeling, I don’t think I’ll have much time to think about missing a new aligner. (Just a hunch.)

Disclosure: I suppose, since people might perceive this as a conflict of interest, I should point out that I do have several ties to the company. I believe Ming Li is one of their founders, and I did (briefly) work with him and Paul Kearney for a thesis project at the University of Waterloo close to a decade ago. A relation of mine works there, as well, though I don’t have any insider information, so don’t ask. I don’t own stock or have any interest in the company otherwise. Bioinformatics is just too small of a community in Canada not to have some connections to anyone else doing bioinformatics here. Six degrees of separation, and all.

>SNP calling from MAQ

>With that title, you’re probably expecting a discussion on how MAQ calls snps, but you’re not going to get it. Instead, I’m going to rant a bit, but bear with me.

Rather than just use the MAQ snp caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my snp caller has several added functionalities that I wanted to use, and *of course*, I thought it would be easy. Was it, you might also ask? No – but not for the reasons you might expect.

I spent the last 4 days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don’t get me started on platform dependent binary files!), I have a working snp caller, I even have all the code to link them together. What I was missing was all of the little tricks, particularly the ones for intron-spanning reads in transcriptome data sets, and the code that links together the “kludges” with the method I didn’t know about when I started. After hacking away at it, bit by bit things began to work. Somewhere north of 150 code commits later, it all came together.

If you’re wondering why it took so long, it’s three fold:

1. I started off depending on someone else’s method, since they came up with it. As is often the case, that person was working quickly to get results, and I don’t think they had the goal of writing production quality code. Since I didn’t have their code (though, honestly, I didn’t ask for it either, since it was in perl – which is another rant for another day), it took a long time to settle all of the off-by-one, off-by-two and otherwise unexpected bugs. They had given me all of the clues, but there’s a world of difference between being pointed in the general direction of your goal and having a GPS to navigate you there.

2. I was trying to write code that would be re-usable. That’s something I’m very proud of, as most of my code is modular and easy to re-purpose in my next project. Halfway through this, I gave up: the code for this SNP calling is not going to be re-usable. Though, truth be told, I think I’ll have to redo the whole experiment from the start at some point because I’m not fully satisfied with the method, and we won’t be doing it exactly this way in the future. I just hope the change doesn’t happen in the next 3 weeks.

3. Name space errors. For some reason, every single person has a different way of addressing the 24-ish chromosomes in the human genome. (Should we include the mitochondrial genome in the count?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the mitochondrial genome a “MT” or just “M”? What case do we use for “X” and “Y” (or is it “x” and “y”?) in our files? Should we prepend “chr” to our chromosome names? And what on earth is “chr5_random” doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read… bleh.
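Here’s the sort of helper I keep rewriting. A Python sketch; the target convention here (“chr” prefix, upper-case X/Y, “chrM” for the mitochondrion) is an arbitrary pick for illustration, not any kind of standard:

```python
def normalize_chrom(name):
    """Map the many spellings of a chromosome name onto one convention:
    'chr' prefix, upper-case X/Y, and 'chrM' for the mitochondrion."""
    n = name.strip()
    if n.lower().startswith("chr"):
        n = n[3:]  # strip any existing 'chr' prefix before renaming
    if n.upper() in ("M", "MT"):
        n = "M"
    elif n.upper() in ("X", "Y"):
        n = n.upper()
    return "chr" + n
```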

Anyhow, I fully admit that SNP calling isn’t hard to do. Once you’ve read all of your sequences in, determined which bases are worth keeping (prb scores), and applied a minimum level of coverage and a minimum number of bases needed to call a SNP, there’s not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then: tada, you have all your SNPs.
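Stripped of the Ensembl lookup and the intron-spanning tricks, the core filter really is only a few lines. A Python sketch with made-up thresholds (the real cutoffs in my pipeline differ – these are just for illustration):

```python
from collections import Counter

# Hypothetical thresholds, chosen for illustration only.
MIN_BASE_QUALITY = 20    # prb-style per-base quality cutoff
MIN_COVERAGE = 4         # kept bases required to cover the position
MIN_VARIANT_BASES = 2    # non-reference observations needed to call

def call_snp(ref_base, observed):
    """observed: list of (base, quality) pairs covering one position.
    Returns the called variant base, or None if no call is made."""
    kept = [b for b, q in observed if q >= MIN_BASE_QUALITY]
    if len(kept) < MIN_COVERAGE:
        return None
    counts = Counter(b for b in kept if b != ref_base)
    if not counts:
        return None  # every kept base matches the reference
    base, n = counts.most_common(1)[0]
    return base if n >= MIN_VARIANT_BASES else None
```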

However, once you’re done all of this, you realize the big issue: there are now too many SNP callers, and everyone and their pet dog is working on one. There are several now in use at the GSC: mine, at least one custom one that I’m aware of, one built into an aligner (Bad Idea(tm)) under development here, and the one tacked on to the swiss army knife of aligners and tools: Maq. Do they all give different results, or is one better than another? Who knows. I look forward to finding someone who has the time to compare, but I really doubt there’s much difference beyond the alignment quality.

Unfortunately, because the aligner field is still immature, there is no single file output format that’s common to all aligners, so the comparison is a pain to do – which means it’s probably a long way off. That, in itself, might be a good topic for an article, one day.

>MAQ mapview format. – Updated

>Well, I promised a quick update on the MAQ mapview format, after I wrote the interpreter for it, but there isn’t much to say.

Key bits of information:

  • The file is simply a zipped text file, which you can unzip with ‘gunzip’ on a linux system. (Update: this file is not normally gzipped. While the .map file is zipped, the .mapview file is not normally found in the gzipped state.)
  • Reads are pre-sorted by chromosome, then position.
  • The format is 1 based, so if you’re using a zero based format, you’ll need to convert.
  • Starting points are for the “left end”, ie, regardless of which strand the sequence aligned to, the matching position with the lowest position is reported.
  • Sequences are not contained in this file, but you can retrieve them from the original fastq file. If you do so, you will need to take the reverse complement of any read that maps to the reverse strand before comparing it to your fasta file sequence. Forward strand sequences will map correctly to the fasta sequence as-is.
  • Most of the fields are not useful for any form of analysis, and what’s given is mostly incomprehensible.
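On the reverse-strand point above, the conversion is a one-liner in Python:

```python
# Translation table covering the bases (and N) you'd expect in a read.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a read before comparing it against the
    forward strand of the fasta sequence."""
    return seq.translate(COMPLEMENT)[::-1]
```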

The best information I had was from the MAQ manpage. In a slightly more readable format:

  1. read name
  2. chromosome
  3. position
  4. strand
  5. insert size from the outer coordinates of a pair
  6. paired flag
  7. mapping quality
  8. single-end mapping quality
  9. alternative mapping quality
  10. number of mismatches of the best hit
  11. sum of qualities of mismatched bases of the best hit
  12. number of 0-mismatch hits of the first 24bp
  13. number of 1-mismatch hits of the first 24bp on the reference
  14. length of the read
  15. read sequence
  16. quality

Most strikingly, you’ll notice 16 fields are listed above, while the file appears to have 14 fields. It seems not all files have the last two fields. I don’t know if it’s just the file I have, or if it’s usually that way. (Update: Actually, there are normally 16 fields. The files I was given were generated using the mapview -B flag, which strips out some of the information, which I believe are the final two fields. Thus, the comments above reflect the -B flag output only! Thanks to Ryan for catching that!)
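Putting the field list and the -B caveat together, a tolerant parser might look like the following Python sketch. The field names are just my own labels for the manpage entries above:

```python
# My own labels for the 16 manpage fields, in order.
FIELDS = ["name", "chrom", "pos", "strand", "insert_size", "paired_flag",
          "map_qual", "se_map_qual", "alt_map_qual", "mismatches",
          "mismatch_qual_sum", "hits0_24bp", "hits1_24bp", "length",
          "sequence", "quality"]

def parse_mapview_line(line):
    """Parse one mapview record into a dict. Tolerates both the full
    16-field output and the 14-field form produced with mapview -B."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) not in (14, 16):
        raise ValueError("unexpected field count: %d" % len(parts))
    rec = dict(zip(FIELDS, parts))
    rec["pos"] = int(rec["pos"]) - 1  # mapview is 1-based; convert to 0-based
    return rec
```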

If I come across anything else that needs to be added, I’ll update this entry.

>Downloads and aligners

>At the Genome Science Centre, we have a platform Scientific Advisory Board meeting coming up in the near future, so a lot of people are taking the time to put together some information on what’s been going on since the last meeting. As part of that, I was asked to provide some information on the FindPeaks application, since it’s been relatively successful. One of those things was how many times it’s been downloaded.

Surprisingly, it’s not an insignificant number: FindPeaks 2.x has been downloaded 280 times, while FindPeaks 3.1.x has been downloaded 325 times. I was amazed. Admittedly, I filtered out google and anything calling itself “spider”, and didn’t filter on unique IPs, but it’s still more than I expected.

The other surprise to me was the institutions from which it was downloaded, which are scattered across the world. Very cool to see it’s not just Vancouver people who took an interest in FindPeaks, though I somewhat suspected as much, already.

Thinking of FindPeaks, I put in one last marathon cleaning session for all my software. I’m sure I’m somewhere north of 15,000 lines of code modified in the past week. Even when you consider that some of those changes were massive search-and-replace jobs, (very few, really), or refactoring and renaming variables (many many more), it’s still an impressive number for two weeks. With that done, I’m looking to see some significant gains in development speed, and in developer productivity. (I learned a LOT doing this.)

The last 122 warnings will just have to get cleaned up when I get around to it – although they’re really only 4 warning types repeated many times. (The majority of them are just that the branched logic goes deeper than 5 levels, or that my objects have public variables – you can only write so many accessor functions in a week.)

Anyhow, I’m looking forward to testing out the FindPeaks 4.0 alphas starting tomorrow, and to putting some documents together on it. (And to catching up on the flood of emails I received in the last 2 days.)

Finally, I’m working on a MAQ file interpreter for another project I’m working on. If I ever manage to figure out how to interpret the (nearly undocumented) files, I’ll post that here. If anyone’s done this already (though I didn’t see anything publicly available on the web), I’d love to hear from them.


>Eland file Format

>I haven’t written much over the past couple of days. I have a few things piled up that all need doing urgently… and it never fails, that’s when you get sick. I spent today in bed, fighting off a cold, sore throat and fever. Wonderful combination.

Anyhow, since I’m starting to feel better, I thought I’d write a few lines before going to bed, and wanted to mention that I’ve finally seen a file produced by the new Eland. It’s a little different, and the documentation provided (ahem, that I was able to obtain from a colleague who uses the pipeline) was pretty scarce.

In fact, much of what you see in the file is pretty obvious, with the same general concept as the previous Eland files, except this has a few caveats:

1) The library name and 4-coordinate position of the sequence are all separated by tabs in one of the files I saw, but concatenated with a separating “:” in another. I’m not sure which is the real format, but there are at least two formats for line identification.

2) There’s a string that seems to encode the base quality scores from the prb files, but it’s in a format for which I can’t find any information.

3) There’s a new format for mismatches within the alignment. Instead of telling you the location of the mismatch, you now get a summary of the alignment itself. If Eland could do insertions, it would work well for those too. From the document, it tells you the number of aligned bases, with letters interspersed to show the mismatches. (e.g. if you had a 32 base alignment with a mismatched A at character 10, you’d get the string “9A22”.) I also understand that upper and lower case mismatches mean something different, though I haven’t probed the format too much.
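Assuming the descriptor really is just alternating run-lengths of matching bases and mismatch letters – which is my reading of the scarce documentation, so treat it as a guess – expanding it per base is straightforward:

```python
import re

def expand_descriptor(desc):
    """Expand an Eland-style match descriptor (e.g. '9A22') into a
    per-base string: '=' for a match, the letter for a mismatch."""
    out = []
    # Alternation captures either a run-length of matches or one letter.
    for num, base in re.findall(r"(\d+)|([A-Za-z])", desc):
        out.extend("=" * int(num) if num else base)
    return "".join(out)
```

For “9A22” this gives a 32-character string with the mismatched A at (1-based) position 10, matching the example above.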

So, in the discussion of formats, I understand there’s some community effort around using a so-called “short read format” or SRF format. It’s been adopted by Helicos, GEO, as well as several other groups.

Maybe it’s time I start converting Eland formats to this as well. Wouldn’t it be nice if we only had to work with one format? (If only Microsoft understood that too! – ps, don’t let the community name fool you, it’s well known Microsoft sponsored that site.)

>Catching up….

>I can’t believe it’s been nearly a month since my last post! I feel like I’ve been neglecting this a bit more than I should, but I’ll try to rectify that as best I can.

For an indication of how busy I’ve been, I sat down to update my resume yesterday, and ended up adding 3 papers (all in submission) and two posters. That just about doubles what was in there previously in the papers section.

Anyhow, Next-generation sequencing doesn’t stand still, so I thought I’d outline some of the things I want to talk about in my next posts, and set up a few pointers to other resources:

1. SeqAnswers. This aptly named forum has been around for a few months now, but has recently become more popular, and a great forum for discussing relevant Next-gen technology and sequencing methods. I’m especially happy to see the automated posts triggered by new literature on the subject, which are a great resource for those of us who are busy and forget to check for new publications ourselves.

2. There’s one thread in particular that’s of great interest: Benchmarking different aligners. This appears to be a well done comparison (if lightweight) that may be a good focus for other people who are interested in comparing aligners, and discussing it in a wider forum.

3. For people interested in ChIP-Seq, or Chromatin immunoprecipitation and massively parallel sequencing, I’ve finally gotten around to posting FindPeaks 3.1 on the web. I’d consider this release (3.1.3) an alpha release. I’d love to get more people contributing by using this application and telling me what could be improved on it, or what enhancements they’d like to see. I’m always happy to discuss new features, and can probably add most of them in with a relatively quick turn around time.

4. For people interested in assessing the quality of the whole transcriptome shotgun sequencing (WTSS), I’m about to break out a tool that should fit that purpose. If anyone is interested in giving suggestions on ways they’d like to see quality tests performed, I’d be more than happy to code those into my applications. (And yes, if you contribute to the tool, I will provide you a copy of the tool to use. Collaborations, etc, can be discussed, as well.)

5. A quick question, of which I’ll post more in the future. Has anyone here managed to get Exonerate 2.0 to work in client/server mode on two separate machines?

6. I’ll also post a little more about this in the future: building environments, ant and java. Why are people still doing large projects in perl?

7. One last thing I wanted to mention. I was going to write more on this topic, but eh… I’ll let slashdot do it for me: The more you drink, the less you publish. So much for keeping a bottle of tequila under the desk. Now I know what to get the competition for x-mas, though…