>Wolfram Alpha recreates Ensembl?

>Ok, this might not be my most coherent post – I’m finally getting better after being sick for a whole week, which has left my brain feeling somewhat… spongy. Several of us AGBT-ers have come down with something after getting back, and I have a theory that it was something in the food we were given…. Maybe someone slipped something into the food to slow down research at the GSC??? (-; [insert conspiracy theory here.]

Anyhow, I just received a link [aka spam] from Wolfram Alpha, via a posting on LinkedIn, letting me know all about their great new product: Wolfram Alpha now has genome information!

Somehow, looking at their quick demo, I’m somewhat less than impressed. Here’s the link, if you’d like to check it out yourself: Wolfram Alpha Blog Post (Genome)

I’m unimpressed for two reasons: the first is that there are TONS of other resources that do this – and apparently do it better, from the little I’ve seen on the blog. For the moment, they have 11 genomes in there, which they hope to expand in the future. I’m going to have to look more closely, if I find the motivation, as I might be missing something, but I really don’t see much that I can’t do in the UCSC genome browser or the Ensembl web page. The second thing is that I’m still unimpressed by Wolfram Alpha’s insistence that it’s more than just a search engine, and that if you use it to answer a question, you need to cite it.

I’m all in favour of using really cool algorithms, and searches are no exception. [I don’t think I’ve mentioned this to anyone yet, but if you get a chance check out Unlimited Detail’s use of search engine optimization to do unbelievable 3D graphics in real time.] However, if you’re going to send links boasting about what you can do with your technology, do something other people can’t do – and be clear what it is. From what I can tell, this is just a mash-up meta analysis of a few small publicly available resources. It’s not like we don’t have other engines that do the same thing, so I’m wondering what it is that they think they do that makes it worth going there for… anyone?

Worst of all, I’m not sure where they get their information from… where do they get their SNP calls from? How can you trust that, when you can’t even trust dbSNP?

Anyhow, for the moment, I’ll keep using resources that I can cite specifically, instead of just citing Wolfram Alpha… I don’t know how reviewers would take it if I cured cancer… and cited Wolfram as my source.

Happy searching, people!

>How to be a better Programmer: Tactics.

>I’m a bit too busy for a long post, but a link was circulating around the office that I thought was worth passing on to any bioinformaticians out there.

http://dlowe-wfh.blogspot.com/2007/06/tactics-tactics-tactics.html

The article above is on how to be a better programmer – and I wholeheartedly agree with what the author proposed, with one caveat that I’ll get to in a minute. The point of the article is that learning to see the big picture (not specific skills) will make you a better programmer. In fact, this is the same advice Sun Tzu gives in “The Art of War”, where understanding the terrain, the enemy, etc. are the tools you need to be a better general. [This would be in contrast to learning how to wield each weapon, which would only make you a better warrior.] Frankly, it’s good advice, and this leads you down the path towards good planning and clear thinking – the keys to success in most fields.

The caveat, however, is that there are times in your life where this is the wrong approach: i.e., grad school. As a grad student, your goal isn’t to be great at everything you touch – it’s to specialize in some small corner of one field, and tactics are no help here. If grad school existed for Ninjas, the average student would walk out being the best (pick one of: poisoner/dart thrower/wall climber/etc.) in the world – and likely knowing little or nothing about how to be a real ninja beyond what they learned in their Ninja undergrad. Tactics are never a bad investment, but they aren’t always what is being asked of you.

Anyhow, I plan to take the advice in the article and to keep studying the tactics of bioinformatics in my spare time, even though my daily work is more on the details and implementation side of it. There are a few links in the comments of the original article to sites the author believes are good comp-sci tactics… I’ll definitely be looking into those tonight. Besides, when it comes down to it, the tactics are really the fun parts of the problems, although there is also something to be said for getting your code working correctly and efficiently…. which I’d better get back to. (=

Happy coding!

>Link Roundup Returns – Dec 16-22

>I’ve been busy with my thesis project for the past couple of weeks, which I think is understandable, but all work and no play kinda doesn’t sit well with me. So, over the weekend, I learned Go, Google’s new programming language, and wrote myself a simple application for keeping track of links – and dumping them out in a pretty HTML format that I can just cut and paste into my blog.

While I’m not quite ready to release the code for my little Go application, I am ready to test it out. I went back through the last 200 Twitter posts I have (about 8 days’ worth), and grabbed the ones that looked interesting to me. I may have missed a few, or grabbed a few less-than-thrilling ones. It’s simply a consequence of me skimming some of the articles less carefully than others. I promise the quality of my links will be better in the future.

Anyhow, this experiment gave me a few insights into the process of “reprocessing” tweets. The first is that my app only records the person from whom I got the tweet – not the people from whom they got it. I’ll try to address that in the future. The second is that it’s a very simple interface – and a lot of things I wanted to say just didn’t fit. (Maybe that’s for the better… who knows.)

Regardless (or irregardless, for those of you in the U.S.) here are my picks for the week.

Bioinformatics:

  • Bringing back Blast (Blast+) (PDF) – Link (via @BioInfo)
  • Incredibly vague advice on how to become a bioinformatician – Link (via @KatherineMejia)
  • Cleaning up the Human Genome – Link (via @dgmacarthur)
  • Neat article on “4th paradigm of computing: exaflood of observational data” – Link (via @genomicslawyer)

Biology:

  • Gene/Protein Annotation is worse than you thought – Link (via @BioInfo)
  • Why are Europeans white? – Link (via @lukejostins)

Future Technology:

  • D-Wave Surfaces again in discussions about bioinformatics – Link (via @biotechbase)
  • Changing the way we give credit in science – Link (via @genomicslawyer)

Off topic:

  • On scientists getting quote-mined by the press – Link (via @Etche_homo)
  • Give away of the best science cookie cutters ever – Link (via @apfejes)
  • Neat early history of the electric car – Link (via @biotechbase)
  • Wild (inaccurate and funny) conspiracy theories about the Wellcome Trust Sanger Institute – Link (via @dgmacarthur)
  • The Eureka Moment: An Interview with Sir Alec Jeffreys (Inventor of the DNA Fingerprint) – Link (via @dgmacarthur)
  • Six types of twitter user (based on The Tipping Point) – Link (via @ritajlg)

Personal Medicine:

  • Discussion on mutations in cancer (in the press) – Link (via @CompleteGenomic)
  • Upcoming Conference: Personalized Medicine World Conference (Jan 19-20, 2010) – Link (via @CompleteGenomic)
  • deCODEme offers free analysis for 23andMe customers – Link (via @dgmacarthur)
  • UK government waking up to the impact of personalized medicine – Link (via @dgmacarthur)
  • Doctors not adopting genomic based tests for drug suitability – Link (via @dgmacarthur)
  • Quick and dirty biomarker detection – Link (via @genomicslawyer)
  • Personal Genomics article for the masses – Link (via @genomicslawyer)

Sequencing:

  • Paper doing the rounds: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data – Link (via @BioInfo)
  • Archiving Next Generation Sequencing Data – Link (via @BioInfo)
  • Epigenetics takes aim at cancer and other illnesses – Link (via @BioInfo)
  • (Haven’t yet read) Changing economics of DNA Synthesis – Link (via @biotechbase)
  • Genomic players for investors. (Very light overview) – Link (via @genomicslawyer)
  • Haven’t read yet: Recommended review of 2nd and 3rd generation seq. technologies – Link (via @nanopore)
  • De novo assembly of Giant Panda Genome – Link (via @nanopore)
  • Wellcome Trust summary of 2nd Gen sequencing technologies – Link (via @ritajlg)

>Useful error messages…. and another format rant.

>I’ll start with the error message, since it had me laughing, while everything else seems to have the opposite effect.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It’s one of my “go-to” sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I’d get 2-3 MB into the file before the download would die. (It was a LONG list of SNPs, and the file size was supposed to be in the 10 MB ballpark.)

Anyhow, in frustration, I tried the “email results to you” option, whereupon I got the following email message:

Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress

That has to be the first time I’ve ever had a server shutdown cause a result failure. Ok, it’s not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable – I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I’m also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF-compatible files, and unfortunately, it’s not freely available, which is somewhat unhelpful. (I don’t know when or if it will be available, although I’ve heard rumours about it being put into our pipeline…)

That’s kind of a sad state of affairs – although I really shouldn’t complain. I have more than enough work on my plate, and I’m sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I’ll just have to sit here and be patient… and maybe write an 8th snp format iterator.
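For anyone wondering what the output side of such a converter might look like, here is a minimal sketch in Java. It is not taken from any existing SNP caller or from my own code: the class name `VcfSketch` and its methods are made up for illustration, and it only covers the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), writing "." wherever it has no value. I'm assuming the 4.0 header line here; a real converter would also need to handle indels, genotypes and INFO fields.

```java
import java.util.Locale;
import java.util.StringJoiner;

public class VcfSketch {
    // Header for a minimal VCF file: the fileformat line, then the column names.
    public static String header() {
        return "##fileformat=VCFv4.0\n"
             + "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO";
    }

    // Formats one SNP as a tab-separated VCF data line.
    // pos is 1-based; fields this sketch has no value for are written as ".".
    public static String record(String chrom, int pos, String ref, String alt, double qual) {
        StringJoiner line = new StringJoiner("\t");
        line.add(chrom);
        line.add(Integer.toString(pos));
        line.add(".");   // ID: no known identifier for this variant
        line.add(ref);
        line.add(alt);
        line.add(String.format(Locale.ROOT, "%.1f", qual));
        line.add(".");   // FILTER: no filtering applied
        line.add(".");   // INFO: nothing extra to report
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(header());
        System.out.println(record("chr1", 12345, "A", "G", 30.0));
    }
}
```

The hard part of a real converter is not this formatting step but parsing each of the input formats, which is exactly why a single shared output format would be so welcome.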

>new repository of second generation software

>I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum. However, the brilliant people at SeqAnswers have spun off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers

>SNP Database v0.1

>Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There’s a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc…).

At this point, there are no documents on how to use the software, since writing them is the plan for this afternoon, and I’m assuming everyone who uses this already has access to a PostgreSQL database (aka, a simple Ubuntu + psql setup.)

But, I’m ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who’s interested in joining onto this project, I’m only a few hours away from having some neat toys to play with!

>New Project Time… variation database

>I don’t know if anyone out there is interested in joining in – I’m starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I’m currently on my first pass – consider it version 0.1 – but already have some basic functionality assembled. Currently, it uses a built in snp caller to identify locations with variations and to directly send them into a postgresql database, but I will shortly be building tools to allow SNPs from any snp caller to be migrated into the db.

Anyhow, just putting it out there – this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=

>Picard code contribution

>

Update 2: I should point out that the subject of this post has been resolved. I’ll mark it down to a misunderstanding. The patches I submitted were initially rejected, but were accepted several days later, once the purpose of the patch was clarified with the developers. I will leave the rest of the post here, for posterity’s sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.

Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I’ll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) – and I was ecstatic when I discovered the Java-based Picard project, an open source set of libraries for working with SAM/BAM files. I’ve been slowly reading through the code, as I’d like to use it in my project for reading/writing SAM format files – which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It’s an Eclipse plug-in designed to help you write better Java code by making suggestions about things that can be improved. A lot of its suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing javadoc flags, etc. They’re not key pieces, but they are important to maintain your code’s good health. It also points the way to things that will likely cause bugs (such as doing string comparisons with the “==” operator).
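To make the “==” example concrete, here is a small sketch of why that warning matters. This is not Picard code, and the class and method names are made up for illustration: in Java, “==” on two strings checks whether they are the same object, while `equals` checks whether they contain the same characters.

```java
public class StringCompareDemo {
    // Compares references: true only if both variables point at the same object.
    public static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Compares contents: true if the two strings hold the same characters.
    public static boolean sameText(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "chr1";
        // new String(...) forces a distinct object with identical contents,
        // much like a value parsed out of a line of a file would be.
        String parsed = new String("chr1");

        System.out.println(sameReference(literal, parsed)); // false
        System.out.println(sameText(literal, parsed));      // true
    }
}
```

The nasty part is that “==” often appears to work in quick tests, because string literals are shared, and then fails once the strings come from a file or a parser.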

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It’s not in bad shape, but it’s got a lot of little “problems” that could easily be fixed. Mainly a lot of missing javadoc, un-cast generic types, arrays being passed between classes and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I’d fix these details. Before ramping up into the more complex warnings, I wanted to start small while still making a contribution. Open source at its best, right?

The sad part of the tale is that open source only works when the community’s contributions are welcome. Apparently, with Picard, code cleaning and maintenance aren’t. My first set of patches (dealing mainly with the trivial warnings) was rejected. With that reception, I’m not going to waste my time submitting the second set of changes I made. That’s kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I’d begin making larger code contributions as my familiarity with the code improved – and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren’t interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they’d like to accomplish with the software – and including contributions may mean including someone on a publication downstream if and when it does become published. I don’t know if that was the case here: it was well within the project leader’s rights to reject my patches on any grounds they like, but I can’t say it makes me happy. I still don’t enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see “Open Source” software, I’ll remember that just because it’s open source, it doesn’t mean all contributions are welcome – I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I’ll be tempering my excitement for open source science software projects.

update: A friend of mine pointed me to a link that’s highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.

>New Tool: KeepNote

>Obviously I haven’t updated much here lately – I’ve been pretty busy and inspiration hasn’t struck me much in the last few days to get anything written. However, I started using some new software this morning, and I’m enjoying it so much I figured I have to share.

One of the big problems I have, as a bioinformatician, is keeping track of all the notes and one-off scripts I write. I don’t want to use SVN, because it’s just a repository with no organization. I don’t want to use a wiki, because it’s a huge hassle to maintain for small projects, and I hate using text files.

The compromise, it seems, is to use standards compliant files with a hell of a wrapper around them that does the organization for you, and the one I found is called KeepNote. The project page and downloads can be found at http://rasm.ods.org/keepnote/. The software is available for all major OS (Linux, Mac and even Windows), and can be installed relatively quickly and (for the most part) painlessly. (Linux builds are missing a library in the dependencies, but that can be figured out pretty quickly – just apt-get the missing lib and re-install if you hit this problem.)

While it may not fit everyone’s workflow, my few hours of using it have already helped me get my tools organized and assembled in a logical manner, and it’s allowed me to remove a load of files from my desktop. There are still bugs with it: I had to manually do some configuration of the web browser, text editor and such before I could get started, but so far I haven’t hit any of the bugs.

It also claims to help you organize notes – which I can clearly see. Next time I go to a conference, I’ll be using this for recording and organizing the usual 30-40 pages of notes I take.

For me, this falls under the heading of required tools for bioinformaticians and students alike and I look forward to seeing the project evolve and grow.

>Science Cartoons – 3

>I wasn’t going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that’s generally called “fair use” in the US, and copyright laws in most countries have similar exceptions), but you can’t publish it, take credit for it, or profit from it without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don’t steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!