>Taking control of your documents

>It’s always a mystery to me how bioinformaticians, who are generally steeped in computer culture, can be Microsoft users. Not that Microsoft’s software is necessarily bad (although I maintain that it doesn’t come with all of the tools bioinformaticians need, depending on what form of bioinformatics you’re doing), but for those who have been immersed in the high-tech environment, Microsoft’s well-documented business practices and bad-neighbour behaviour seem somewhat unenlightened. That’s what led me to leave the MS ecosystem in search of friendlier environments nearly a decade ago.

Ever since then, I’ve been trying to move people away from Microsoft products and towards either the truly open Linux ecosystem or the proprietary (but less open) Apple Macintosh ecosystem. (I run three Linux machines and a Mac laptop at home.) As part of that move – and probably the most important part – I always suggest people take control of their documents rather than entrusting them to Microsoft.

One of the great proponents of this is Rob Weir who, despite having a vested interest in the process, is able to provide a fantastically objective perspective on the subject, in my opinion. (Microsoft employees frequently disagree.)

Anyhow, I just thought it was worth linking to a particular article of his on that subject. Even if you don’t want to move away from your Microsoft-supplied word processor, he gives advice on how to keep your documents as open as possible. I highly recommend giving the article a quick read – and maybe taking some of Mr. Weir’s advice.


>TomTom has no Linux support?

>I’m still procrastinating – A plumber is supposed to show up to cut a hole in my ceiling in a few minutes, basically as exploratory surgery on my new house, in order to find a leak that’s developed in the pipes leading away from the washer and dryer. So, I thought I’d spend the intervening moments doing something utterly useless. I looked up TomTom’s web site and took a look at what they have to offer.

If you don’t know TomTom, they’re a company that produces GPS units for personal and car use. They’ve recently shot to fame because Microsoft decided to sue them for a bunch of really pointless patents. The most interesting ones of the bunch are the ones that Microsoft seems to think are being infringed just because TomTom is using Linux.

Anyhow, this post wasn’t going to be about the patents, since I’ve already given my opinion on those. Instead, since I’d been thinking about buying a GPS unit for a while, I thought it might be worth buying one from someone who uses embedded Linux – and I’d like to support TomTom in their fight against the Redmond monopoly. Unfortunately – and this is the part that boggles my mind – TomTom offers absolutely zero support for people who run Linux as their computer operating system. Like many other companies, they’re a Windows/Mac-only support shop.

This strikes me as rather silly – all of the open source users out there would probably be interested in buying a Linux-based GPS, and would probably be happy to support TomTom in their fight… but TomTom has completely neglected that market. They’ve generated a great swell of goodwill in many communities by standing up to Microsoft’s bullying, but then shut that same market segment out of purchasing their products.

Well, that’s some brilliant strategy right there. I only hope TomTom changes their mind at some point – since otherwise all that goodwill is just going right down the toilet…

And thinking of plumbing, again, it’s time to go see about a hole in my ceiling.

>Searching for SNPs… a disaster waiting to happen.

>Well, I’m postponing my planned article, because I just don’t feel in the mood to work on it tonight. Instead, I figured I’d touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they’re not SNPs, they’re variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They’re only polymorphisms if they’re common to a portion of the population.)

In this case, they’re from cancer cell lines, so after I filter out all the real SNPs, what’s left are SNVs… and they’re bloody annoying. This is the second major project I’ve done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymer errors were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were false positives, and I learned to check for homopolymer runs – a simple trick, easily accomplished by visualizing the data.
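For what it’s worth, the homopolymer check is simple enough to sketch in a few lines of Python. This is a toy version, not the actual pipeline code – the run-length threshold is an arbitrary choice of mine:

```python
def homopolymer_run(seq, pos, min_len=4):
    """Report whether the base at `pos` sits inside a homopolymer run of
    at least `min_len` identical bases - a common source of false SNV
    calls in early 454 data."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:               # walk left
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:  # walk right
        right += 1
    return (right - left + 1) >= min_len

# A candidate variant sitting inside an AAAA run is suspect:
print(homopolymer_run("GCTAAAAGC", 4))   # True
print(homopolymer_run("GCTAGCTAG", 4))   # False
```

In practice I did this by eye in a visualization tool, but the same flagging could easily be automated as above.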

We moved on to Illumina after that. (Actually, it was still Solexa at the time.) Yes, this is older data – nearly a year old. It wasn’t particularly reliable, and I’ve now used several different aligners, references and so on, each time (I thought) improving the data. We came down to a couple of very intriguing variations and decided to sequence them. After several rounds of primer design, we finally got one that worked… and lo and behold: 0/2. Neither of them is real. So now comes the post-mortem: why did we get false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious… do we have evidence of edited RNA? Who knows. The game begins all over again in the quest to answer the question “why?” Why do we get unexpected results?

Fortunately, I’m a scientist, so that question is really something I like. I don’t begrudge the last year’s worth of work – which apparently is now more or less down the toilet – but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I’m hooked, otherwise I’d have tossed in the towel long ago.

>Decision time

>Well, now that I’ve heard that there’s a distinct possibility that I might be done my PhD in about a year, it’s time to start making some decisions. Frankly, I didn’t think I’d be done that quickly – although, really, I’m not done yet. I have a lot of publications to put together, and things to make sense of before I leave, but the clock to start figuring out what to do next has officially begun.

I suppose all of those post-doc blogs I’ve been reading for the last year have influenced me somewhat: I’m going to look for a lab where I’ll find a good mentor, a good environment, and a commitment to publishing and completing post-docs relatively quickly. Although that sounds simple, judging by other blogs I’ve been reading, it’s probably not all that easy to work out. Add to that the fact that my significant other isn’t interested in leaving Vancouver (and that I would prefer to stay here as well), and I think this will be a difficult process.

I do need to put together a timeline, however – and since I’m not yet entirely convinced which track I should follow (academic vs. industry), it’s going to be a somewhat complex one. Anyhow, the point of blogging this is that it’s an excellent way to open communication channels with people you wouldn’t be able to connect with in person – and the first channel I’d like to open is to ask readers if they have any suggestions.

Input at this time would be VERY welcome, both on the question of academia vs. industry and on what I should be looking for in a good post-doc position, if that ends up being the path I go down. (=

Anyhow, just to mention, I have another blog post coming, but I’ll save it for tomorrow. I’d like to comment on another series of blog posts from John Hawks and Daniel McArthur. I’m sure the whole blogosphere has heard all about the subject of training bioinformatics students from both the biology and computer science paths by now, but I feel I have something unique to say on that issue. In the meantime, I’d better get back to debugging and testing code. FindPeaks has a very cool new method of comparing different samples – and I’d like to get the testing finished. (=

>Universal format converter for aligned reads

>Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that perform useful functions. That raised a distinctly interesting point: all of them could be pulled together in one place… and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie… and several other more obscure formats.
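To make the conversion target concrete, here’s a minimal Python sketch of what emitting a BED record looks like. The `read` dictionary is a made-up stand-in for a parsed aligned read, not the interface of any of the real readers:

```python
def to_bed_line(chrom, start, end, name=".", score=0, strand="+"):
    """Emit one BED line; BED coordinates are 0-based and half-open."""
    return "\t".join(str(f) for f in (chrom, start, end, name, score, strand))

# A hypothetical aligned read parsed from one of the richer formats:
read = {"chrom": "chr1", "pos": 1000, "length": 36, "strand": "+"}
print(to_bed_line(read["chrom"], read["pos"], read["pos"] + read["length"],
                  strand=read["strand"]))
```

Whatever extra fields the source format carries, only the interval (and optionally name, score and strand) survive the trip to BED – which is exactly why it’s the safe default output.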

For the moment, the “conversion utility” forces the output to bed file format (since that’s the file type with the least information, so I don’t have to worry about unexpected information loss), which can then be viewed with the UCSC browser or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned-read information.) But why stop there?

Why not add a very simple piece of functionality that converts any one format to any other? Actually, there’s no good reason not to, but it does involve some heavy caveats. Conversion from one format to another is relatively trivial until you hit the quality strings. Since these aren’t being scaled or altered, you could end up with some rather bizarre conversions unless they’re handled cleanly. Unfortunately, doing this scaling is such a moving target that it’s just not possible to keep up with it and do all the other development work I have on my plate. (I think I’ll be asking for a co-op student for the summer to help out.)
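To illustrate what that quality-string scaling involves, here’s a hedged Python sketch – converting between ASCII offsets and, optionally, from Solexa’s odds-based scale to Phred. The formulas are the standard published ones; the function names are mine, not anything in FindPeaks:

```python
import math

def solexa_to_phred(q_solexa):
    """Solexa qualities are odds-based (Q = -10*log10(p/(1-p)));
    Phred qualities are probability-based (Q = -10*log10(p))."""
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)

def rescale_quality_string(qual, in_offset=64, out_offset=33, solexa=False):
    """Re-encode a quality string from one ASCII offset to another,
    optionally converting Solexa-scaled values to Phred on the way."""
    out = []
    for ch in qual:
        q = ord(ch) - in_offset
        if solexa:
            q = int(round(solexa_to_phred(q)))
        out.append(chr(q + out_offset))
    return "".join(out)

# Illumina 1.3+ (Phred+64) to Sanger (Phred+33): 'h' is Q40 either way.
print(rescale_quality_string("hhhh"))    # IIII
```

The moving-target part is knowing which of the three encodings a given file actually uses – the code itself is the easy bit.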

Anyhow, I’ll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

>FindPeaks 3.3… continued

>Patch, compile, read bug, search code, compile, remember to patch, compile, test, find bug, realize it’s the wrong bug, test, compile, test….

Although I really enjoy working on my apps, sometimes a whole day goes by in which tons of changes are made, and I don’t really feel like I’ve gotten much done. I suppose it’s more about the scale of what’s left to do than the number of tasks. I’ve managed to solve a few mysteries and make an impact for some people using the software, but I haven’t gotten around to testing the big changes I’ve been working on for a few days: the different compare mechanisms for FindPeaks.

(One might then ask why I’m blogging instead of doing that testing… and that would be a very good question.)

Some quick ChIP-Seq things on my mind:

  • Samtools: there is a fairly complete Java Samtools/Bamtools API that I could be integrating, but after staring at it for a while, I’ve realized that the complete lack of documentation on how to integrate it is really slowing the effort down. I will probably return to it next week.
  • Compare and Control: It seems people are switching to this paradigm on several other projects – I just need to get the new compare mechanism in, and then integrate it with the control handling at the same time. That will provide a really nice method for doing both at once, which is key for moving forward.
  • Eland “extended” format: I ended up reworking all of the Eland export file functions today. All of the original files I worked with were pre-sorted and pre-formatted; unfortunately, that’s not how they exist in the real world. I’ve now updated the sort and separate-chromosome functions for extended Eland. I haven’t done much testing on them yet, but that’s coming up too.
  • Documentation: I’m so far behind. Writing one small piece of the manual a day seems like a good target – I’ll try to hold myself to it. At that pace, I might catch up by the end of the month.
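The sort-and-separate step for export-style files is conceptually simple; here’s a toy Python version of the idea. The field indices are assumptions for illustration, not a statement of the real Eland export layout:

```python
from collections import defaultdict

def split_and_sort(lines, chrom_field=10, pos_field=12):
    """Group tab-delimited export-style lines by chromosome and sort each
    group by match position. Field indices are illustrative assumptions."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        by_chrom[fields[chrom_field]].append((int(fields[pos_field]), line))
    return {c: [rec for _, rec in sorted(rows)] for c, rows in by_chrom.items()}

# Three synthetic records: two on chr2 (out of order) and one on chr1.
rows = ["\t".join(["x"] * 10 + [c, "F", p])
        for c, p in [("chr2", "900"), ("chr1", "100"), ("chr2", "250")]]
result = split_and_sort(rows)
print(sorted(result))        # ['chr1', 'chr2']
```

The real implementation has to work file-to-file rather than in memory, of course, since export files don’t fit comfortably in RAM.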

Anyhow, lots of really fun things coming up in this version of FindPeaks… I just have to keep plugging away.

>CSCBC 2009

>Someone raised the good point that I forgot to mention the origin of the talks I made notes on last week. Although the conference is over, it was a neat little conference that deserves a little publicity. Additionally, it’s now planning its fifth year, so it’s worth mentioning in case people are interested but weren’t aware of it.

The full title of the conference is the Canadian Student Conference on Biomedical Computing, although I believe next year’s title will be expanded to include Biomedical Computing and Engineering explicitly (CSCBCE 2010). This year’s program can be found at http://www.cscbc2009.org/, and my notes for it can all be found under the tag of the same name.

As for why I think it was a neat conference, I suppose I have several reasons. It doesn’t hurt that one of the organizers sits in the cubicle next to mine at the office, and that many of this year’s organizers are friends through the bioinformatics program at UBC/SFU. But just as important (to me, anyhow), I was invited to be an industry panelist for the Saturday morning session and to help judge the bioinformatics poster session. Both of those were a lot of fun. (Oddly enough, another member of the industry panel was one of my committee members, and he suggested I would probably graduate in the coming year in front of a room full of witnesses…)

Anyhow, back to the point: CSCBCE 2010 is now officially in planning, and the torch has formally been passed along to the new organizers. I understand next year’s conference is going to be held in May 2010 at my alma mater, the University of Waterloo, which is a beautiful campus in the spring. (I strongly concur with their decision to host it in May instead of March, by the way. Waterloo is typically a rainy, grey and bleak place in March.) And, for those of you who have never been, Waterloo now has its own airport. I’m not sure if I’ll be going next year – especially if I’ve completed my degree by then – but if this year’s attendance was any indication of where the conference is heading, it’ll probably be worth checking out.

>xorg.conf file for vostro 1000 using compiz in Ubuntu 9.04

>I’m sure most people aren’t interested in this, but I finally got my laptop (a Dell Vostro 1000) to work nicely with Compiz under Ubuntu 9.04 (Jaunty). I think the key steps were removing every fglrx package on the computer (apt-get remove fglrx*), switching to the “ati” driver in xorg.conf, and getting the BusID right (I tried copying it from my earlier xorg.conf file, but the value seems to have changed). However, I added a lot of other things along the way which seem to have helped the performance, so, for those who are interested, this is the Ubuntu 9.04 Jaunty alpha 5 xorg.conf file for the Vostro 1000:

Section "Device"
    Identifier "Configured Video Device"
    Driver "ati"
    BusID "PCI:1:5:0"
    Option "DRI" "true"
    Option "ColorTiling" "on"
    Option "EnablePageFlip" "true"
    Option "AccelMethod" "EXA"
    Option "RenderAccel" "true"
EndSection

Section "Monitor"
    Identifier "Configured Monitor"
EndSection

Section "Screen"
    Identifier "Default Screen"
    Monitor "Configured Monitor"
    Device "Configured Video Device"
    DefaultDepth 24
    Option "AddARGBGLXVisuals" "True"
    SubSection "Display"
        Modes "1280x800"
    EndSubSection
EndSection

Section "Module"
    Load "glx"
    Load "dri"
EndSection

Section "DRI"
    Group "video"
    Mode 0660
EndSection

Section "ServerFlags"
    Option "DontZap" "false"
EndSection

Section "Extensions"
    Option "Composite" "Enable"
EndSection

>Dr. Michael Hallett, McGill University – Towards a systems approach to understanding the tumour microenvironment in breast cancer

>Most of this talk is from 2-3 years ago. Breast cancer is now more deadly for women than lung cancer. Lifetime risk for women is 1 in 9. The two most significant risk factors: being a woman, and aging.

Treatment protocols include surgery, irradiation, hormonal therapy, chemotherapy, directed antibody therapy. Several clinical and molecular markers are now available to decide the treatment course. These also predict recurrence/survival well… but…

Many caveats: only 50% of Her2+ tumours respond to trastuzumab (Herceptin). There is no regime for (Her2-, ER-, PR-) “triple negative” patients other than chemo/radiation. Many ER+ patients do not benefit from tamoxifen. 25% of lymph-node-negative patients (a less aggressive cancer) will develop micrometastatic disease and possibly recurrence (an example of under-treatment). There are many other examples of under-treatment.

Microarray data brought a whole new perspective on breast cancer treatment and created a taxonomy of breast cancer – breast cancer is at least 5 different diseases (Luminal Subtype A, Luminal Subtype B, ERBB2+, Basal Subtype, Normal Breast-like; left to right, best prognosis to worst).

[background into cellular origin of each type of cell. Classification, too.]

There are now gene expression biomarker panels for breast cancer, and most of them do very well in clinical trials. The point was made that we almost never find single-gene biomarkers; most of the time you need to look at many, many genes to figure out what’s going on. (“A good sign for bioinformatics.”)

Microenvironment: samples used on arrays, as above, include the surrounding environment when run. We end up averaging over the tumour, so the contribution of the microenvironment is lost – the epithelial gene expression signature “swamps out” the signatures from other cell types. However, tumour cells interact extensively with their surrounding tissues.

Most therapies target epithelial cells. Genetic instability in epithelial cells leads to therapeutic resistance. Stromal cells (endothelial cells in particular) are genetically stable (i.e., non-cancerous).

Therefore, if you target the stable microenvironment cells, the tumour shouldn’t become resistant.

Method: invasive tumours, patient selection, laser capture microdissection, RNA isolation and amplification (two rounds) -> microarray.

BIAS: Bioinformatics Integrative Application Software (the tool they’ve built).

LCM + linear T7 amplification leads to 3′ bias. Nearly 48% of probes are “bad”, so it’s very hard to pick out the quality data.

Looking at just the tumour epithelial profiles (the tumours themselves) confirmed that the subtypes cluster as before. (Not new data – the breast cancer profiles we already have are basically epithelium-driven.) When you look at just the stroma (the microenvironment), you find 6 different categories, each with its own distinct traits. There is almost no agreement between the endothelial and epithelial categorizations; they are orthogonal.

Using both of these categorizations together predicts outcomes even more accurately. The stroma is better at predicting outcome than the tumour type itself.

They found a “bad outcome cluster”, then investigated each of the 163 genes that were differentially expressed between that cluster and the rest; these can be used to create a predictor. (The subtypes are more difficult to work with and become confounding effects.) Genes were ordered by p-value from logistic regression, then fed into a simple naive Bayes classifier with cross-validation on subsets. This identified 26 (of 163) genes as the optimal classifier set.
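As a rough illustration of that classifier-building recipe, here’s a toy Python reconstruction on synthetic data. This is not the group’s BIAS code: I’ve substituted a t-statistic for their logistic-regression p-values, and all the numbers are invented – only the rank-then-classify shape matches the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 40 patients x 163 genes, binary outcome.
n_patients, n_genes = 40, 163
X = rng.normal(size=(n_patients, n_genes))
y = rng.integers(0, 2, size=n_patients)
X[y == 1, :10] += 1.5            # plant signal in the first 10 genes

# Step 1: rank genes by a two-sample t statistic (stand-in for the
# logistic-regression p-values used in the talk).
def t_stat(col):
    a, b = col[y == 0], col[y == 1]
    return abs(a.mean() - b.mean()) / np.sqrt(a.var() / len(a) + b.var() / len(b) + 1e-9)

ranking = np.argsort([-t_stat(X[:, j]) for j in range(n_genes)])
top = ranking[:26]               # keep a 26-gene classifier set, as in the talk

# Step 2: Gaussian naive Bayes on the selected genes.
def nb_fit(Xs, ys):
    return {c: (Xs[ys == c].mean(0), Xs[ys == c].var(0) + 1e-9,
                (ys == c).mean()) for c in (0, 1)}

def nb_predict(params, x):
    def log_post(c):
        mu, var, prior = params[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max((0, 1), key=log_post)

model = nb_fit(X[:, top], y)
acc = np.mean([nb_predict(model, X[i, top]) == y[i] for i in range(n_patients)])
print(f"training accuracy on synthetic data: {acc:.2f}")
```

A real version would, as they did, select genes and evaluate within cross-validation folds rather than on the training set; the sketch skips that to stay short.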

“If you can’t explain it to a clinician, it won’t work.”

The stroma classifier is stroma-specific: it didn’t work on epithelial cells. But it performs as well as or better than other predictors (new, valuable information that wasn’t previously available).

Cross-validation of stromal targets against other data sets: it worked on 8 datasets that were run on bulk tumour. That was surprising, given that stromal signal is normally swamped in bulk tumour. You can also replicate this with blood vessels from a tumour.

Returning to the biology, the genes represent: angiogenesis, hypoxic areas, immunosuppression.

[Skipping a few slides that say “on the verge of submission.”] Point: linear orderings are more informative than clustering! Things are not binary – it’s a real continuum, with transitions between the classic clusters. (Crosstalk between activated pathways?)

In a survey (2007, Breast Cancer Research 9-R61?), almost everything that breast cancer clinicians would like research done on is bioinformatics-driven classification/organization, etc.:


  • define all relevant breast cancer signatures
  • analysis of signatures
  • focus on transcriptional signatures
  • improve quality of signatures
  • aims for better statistics/computation with signatures.

There are too many papers coming out with new signatures. Understanding the breast cancer data in the literature involves a lot of grouping and teasing out of information – and avoiding noise. Signatures are heavily dependent on tissue type, etc.

Traditional pathway analysis always needs an experiment and a control, and requires rankings. If that’s just two patients, that’s fine; if it’s a broad panel of patients, you won’t know what’s going on – you’re now in an unsupervised setting.

There are more than 8000 patients for whom array data has been collected. Even outcome is difficult to interpret.

Instead, they use “BreSAT” to do linear ranking instead of clustering, and try to tease out signatures that way.

Each signature has an activity level – and clinicians have always been ordering patients, so an ordering is what they want.

What is the optimal ordering that matches with the ordering… [sorry, missed that.] Many more trends show up when you do this than with hierarchical clustering (Wnt, hypoxia). You can even order by two things at once (e.g. BRCA and interferon) and see tremendously strong signals – you start to see dependencies between signatures.

They’re working with several major technologies (ChIP-chip, microarray, smallRNA) and towards a more precise view of the microenvironment.