>GFF3 undocumented feature…

>Earlier today, I tweeted:

Does anyone know how to decypher a diBase GFF3 file? They don’t identify the “most abundant” nucleotide uniquely. seems useless to me.

Apparently, there is a solution, albeit undocumented:

The attribute “genotype” contains an IUB code that is limited to using either a single base or a double base annotation (eg, it should not contain, H, B, V, D or N – but may contain R, Y, W, S, M or K ), which then allows you to subtract the “reference” attribute (that must be canonical) from the “genotype” attribute IUB code to obtain the new SNP – but only when the “genotype” attribute is not a canonical base.

If only that were documented somewhere…

UPDATE: Actually, this turns out not to be the case at all — there are still positions for which the “genotype” attribute is an IUB code, and the reference is not one of the called bases. DOH!

>Useful error messages…. and another format rant.

>I’ll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It’s one of my “go-to” sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I’d get 2-3Mb into the file before the download would die. (It was a LONG list of snps, and the file size was supposed to be in the 10Mb ballpark.)

Anyhow, in frustration, I tried the “email results to you” option, whereupon I got the following email message:

Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress

That has to be the first time I’ve ever had a server shutdown cause a result failure. Ok, it’s not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable – I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I’m also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it’s not freely available, which is somewhat un-helpful. (I don’t know when or if it will be available, although I’ve heard rumours about it being put into our pipeline…)

That’s kind of a sad state of affairs – although I really shouldn’t complain. I have more than enough work on my plate, and I’m sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I’ll just have to sit here and be patient… and maybe write an 8th snp format iterator.

>SNP Database v0.2

>My SNP database is now up and running, with the first imports of data working well. That’s a huge improvement over the v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks, better indexes, and I’ve even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1Gb RAM.)

So, what’s most interesting to me? Some of the early stats:

11,545,499 snps in total, made from:

  • 870549 snp calls from the 1000 genome project
  • 11361676 snps from dbsnp

So, some quick math:
11,361,676 + 870,549 – 11,545,499 = 686,726 Snps that overlapped between the 1000 genome project (34 data sets) and the dbSNP calls.

That is a whopping 1.6% of the SNPs in my database were not previously annotated in dbSNP.

I suppose that’s not a bad thing, since those samples were all “normals”, and it’s good to get some sense as to how big dbSNP really is.

Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets….

This is starting to become fun.

>new repository of second generation software

>I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum, however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy to use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers

>SNP Datatabase v0.1

>Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There’s a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc…).

At this point, there are no documents on how to use the software, since that’s the plan for this afternoon, and I’m assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I’m ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who’s interested in joining onto this project, I’m only a few hours away from having some neat toys to play with!

>New Project Time… variation database

>I don’t know if anyone out there is interested in joining in – I’m starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I’m currently on my first pass – consider it version 0.1 – but already have some basic functionality assembled. Currently, it uses a built in snp caller to identify locations with variations and to directly send them into a postgresql database, but I will shortly be building tools to allow SNPs from any snp caller to be migrated into the db.

Anyhow, just putting it out there – this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=

>Searching for SNPs… a disaster waiting to happen.

>Well, I’m postponing my planned article, because I just don’t feel in the mood to work on that tonight. Instead, I figured I’d touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they’re not SNPs, they’re variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They’re only polymorphisms if they’re common to a portion of the population.

In this case, they’re from cancer cell lines, so after I filter out all the real SNPs, what’s left are SNVs… and they’re bloody annoying. This is the second major project I’ve done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymers were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs – a simple trick, easily accomplished by visualizing the data.

We moved onto Illumina, after that. Actually, it was still Solexa at the time. Yes, this is older data – nearly a year old. It wasn’t particularly reliable, and I’ve now used several different aligners, references and otherwise, each time (I thought) improving the data. We came down to a couple very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked… and lo and behold. 0/2. Neither of them are real. So, now comes the post-mortem: Why did we get the false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious… do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest for answering the question “why?” Why do we get unexpected results?

Fortunately, I’m a scientist, so that question is really something I like. I don’t begrudge the last year’s worth of work – which apparently is now more or less down the toilet – but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I’m hooked, otherwise I’d have tossed in the towel long ago.

>Maq Bug

>I came across an interesting bug today, when trying to work with reads aligned by Smith-Waterman (flag = 130), in PET alignments. Indeed, I even filed a Bug Report on it.

The low down on the “bug” or “feature” (I’m not sure which it is yet), is that sequences containing deletions don’t actually show the deletions – they show the straight genomic sequence at that location. The reason that I figure this is a bug instead of by design is because sequences with insertions show the insertion. So why the discrepancy?

Anyhow, the upshot of it is that I’m only able to use 1/2 of the Smith-Waterman alignments maq produces when doing SNP calls in my software. (I can’t trust that the per-base quality scores align to the correct bases with this bug) Since other people are doing SNP calls using MAQ alignments… what are they doing?