A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data?

Before you read my post, you might want to zip over to Complete Genomics’ blog where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?”  It’s an interesting perspective where he suggests that, as the article’s title might indicate, we should keep a rich variant file as the only trace of a sequencing run.

I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template.  As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter.)

I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space and could be regenerated at will from frozen samples of DNA, if needed.  As a scientist, I’ve very rarely had to go back into the read-space files and look at the pre-variant calling data – only a handful of times in the past 2 years.  As such, it really doesn’t make much sense to store alignments, raw reads or other data. If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.

I’d like to bring out a real-world analogy, to summarize this argument. I had recently been considering a move, only to discover that storage in Vancouver is not cheap.  It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me.  If the total value of those possessions is only $5,000, storing them for more than 36 months means that (if I were to move) it would have been cheaper to sell it all and then buy a new set when I come back.
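
For anyone who wants to check the arithmetic, here is a minimal sketch of the break-even calculation in Python.  The function name is made up for illustration; the dollar figures are the ones quoted above.

```python
def break_even_months(replacement_value, monthly_storage_cost):
    """Months of storage after which the fees paid exceed the value of the goods."""
    return replacement_value / monthly_storage_cost

# Figures from the analogy: $5,000 of possessions, $120-150/month for storage.
fastest = break_even_months(5000, 150)   # expensive locker: break even sooner
slowest = break_even_months(5000, 120)   # cheap locker: break even later

print(f"Break-even after {fastest:.1f} to {slowest:.1f} months")
```

The 36-month figure sits inside that range, at roughly $140/month.  The same calculation applies to sequencing data if you swap in the cost of resequencing for the replacement value and the monthly cost of disk for the locker fee.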

Where the analogy comes into play quite elegantly is if I have some interest in those particular items.  My wife, for instance, is quite attached to our dining room set.  If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price.  It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only thing that comes into play.

While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelihood that we will return to the data for verification purposes.  You can’t always recreate something of interest in a data set by resequencing.

That is a poor reason to keep the data around, most of the time.  We rarely find things of interest that couldn’t be recreated in a specific read set when we’re doing genome-wide analysis. Since most of the analysis we do uses only the variants and we frequently verify our findings with other means, the cost/value argument is probably strongly in favour of throwing away raw reads and only storing the variants.  Considering my recent project on storage of variants (all 3 billion of them I’ve collected), people have probably heard me make the same arguments before.

But let’s not stop here. There is much more to this question than meets the eye. If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds.  Although the question stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”

Here, I actually would contend the answer is no, depending on the platform.  Let me give you examples of what data I feel we do a poor job of extracting, right now.

  • Structural variations:  My experience with structural variations is that no two SV callers give you the same information or make the same calls.  They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg.  (Same goes for structural rearrangements in cancers, etc.)
  • Phasing information:  Most SNP callers aren’t giving phasing information, and some aren’t capable of it.  However, those details could be teased from the raw data files (depending on the platform).  We’re just not capturing it efficiently.
  • Exon/Gene expression:  This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq.  Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information available that is obviously not captured in variant files.  (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically stored in rich variant files.)
  • Variant Quality information: we may have solved the base quality problems that plagued early data sets, but let me say that variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between variant callers.  There’s really no substitute for comparing other than to re-run the data set with the same tools.
  • The variants themselves!  Have you ever compared the variants observed by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown.)
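
To make the exon/gene expression item concrete, here is a rough sketch of counting aligned reads per exon.  This is not the code I mentioned above, just an illustration: the function name, the tuple layout (chromosome, start, end, half-open coordinates) and the toy reads and exons are all invented for the example.

```python
def count_reads_per_exon(reads, exons):
    """Count how many aligned reads overlap each annotated exon.

    reads: iterable of (chrom, start, end) tuples, half-open coordinates.
    exons: dict mapping exon name -> (chrom, start, end), half-open.
    """
    counts = {name: 0 for name in exons}
    for r_chrom, r_start, r_end in reads:
        for name, (e_chrom, e_start, e_end) in exons.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if r_chrom == e_chrom and r_start < e_end and e_start < r_end:
                counts[name] += 1
    return counts

# Toy data, invented for illustration only.
exons = {
    "geneX_exon1": ("chr1", 100, 200),
    "geneX_exon2": ("chr1", 300, 400),
}
reads = [
    ("chr1", 90, 140),
    ("chr1", 150, 190),
    ("chr1", 195, 245),
    ("chr1", 350, 400),
    ("chr2", 100, 150),
]

exon_counts = count_reads_per_exon(reads, exons)
print(exon_counts)
```

The nested loop is fine for a toy example but far too slow for a real read set, where you would want sorted intervals or an interval tree.  The point is only that none of these counts can be rebuilt from a variant file once the reads are gone.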

Even dealing only with the last item, it should be obvious:  If we can’t have two SNP callers produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data because we should always be double checking interesting findings with (at least) a second set of software.
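
If you’d like to try that comparison yourself, a minimal sketch looks something like the following.  It assumes each call has been reduced to a (chromosome, position, ref, alt) tuple; the function name and the toy calls are made up for illustration and are not output from any real caller.

```python
def compare_calls(calls_a, calls_b):
    """Summarize agreement between two sets of variant calls.

    Each call is a (chrom, pos, ref, alt) tuple, so two calls only
    match if they agree on position and on both alleles.
    """
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Fraction of all distinct calls made by only one of the two callers.
        "discordance": (len(union) - len(shared)) / len(union) if union else 0.0,
    }

# Toy calls, invented for illustration only.
caller_a = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
            ("chr2", 500, "G", "A"), ("chr3", 42, "T", "C")]
caller_b = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
            ("chr2", 500, "G", "A"), ("chr4", 7, "A", "T")]

summary = compare_calls(caller_a, caller_b)
print(summary)
```

Exact tuple matching is the strictest possible definition of agreement; it ignores quality scores and treats differently normalized indels as disagreements, so real comparisons usually need a normalization step first.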

For me, the answer is clear:  If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete.  I just don’t think we’ve hit those milestones yet.

Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads.  You’ll never return to them because you already have all the answers you want.  If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.

So, my answer to the question “Are we ready to stop storing raw reads?” is easy:  That depends on what you think you need from a data set and if you think you’re already an expert at extracting it.  Personally, I think we’ve barely scratched the surface on what information we can get out of genomic and transcriptomic data; we just don’t know what it is we’re missing yet.

Completely off topic, but related in concept: I’m spending my morning looking for Scalable Vector Graphics (.svg) files for many jpg and png files I’d created along the course of my studies.  Unfortunately, jpg and png are raster formats (and jpg is lossy on top of that), so they don’t reproduce as nicely in the Portable Document Format (PDF) export process.  Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png format, I’m now at the point where I might have to recreate them to properly export the files again in a vector format for my thesis.  If I’d just stored them (as the cost is negligible) I wouldn’t be in this bind….   meh.


How I would improve Google plus, or “Squares, not Circles”

First off, I absolutely think Google+ is fantastic and I’m thinking of giving up twitter in favour of it, which should say a lot.  I think it has the ability to be THAT good.  However, Google missed something big.  As “Science of the Invisible” points out (via TechCrunch), Google circles are great for organizing, but don’t really help with the signal-to-noise ratio.

So, I’d like to propose a new idea:  Google Squares.

Instead of the loosely grouped people that make up circles, a Square would be a rigidly defined group, with moderators.  The moderators have two roles: determining who can post to a Square, and who can follow a Square.

Imagine, if you will, a private Square. Let’s imagine I want to start a Square for my family.  I can first decide to make it private – only people who are invited to the Square are allowed to see the posts, and only those in my family are invited.  It becomes, instantly, the equivalent of an email mailing list (but much easier to set up and manage) for my family members.  I don’t have to worry about typing in the email addresses of my family members every time I want to post something to the Square – and neither do my other family members.  Only one person needs to set it up (the moderator), and it instantly becomes a useful communication tool for everyone I add.

It would be useful for labs, social groups, clubs, etc.   And, it moves people away from email – a long time holy grail of security experts.

So, what about public Squares? They could be even more useful – providing the equivalent of a blogging consortium, or twittering group (which don’t even really exist.)  Imagine if there were a Square with all of your favorite twitter-ers. You can still follow them all, one by one, if you like, or add them to your circles, but the Square would give you instant access to all those people whom someone else has pre-screened as being good communicators and worth following.  Instant increase in signal-to-noise.

Finally, the last piece lacking is direct URLs.  Seriously, I’m a bit miffed that Google didn’t set this up from the start, even based on the Google ID.  Really, I’ve had the Google ID apfejes for a LONG time – why can’t I have plus.google.com/apfejes?  Even twitter has this one figured out.

In any case, these are minor grievances…. but I’m waiting for Google to up their game once more.  In the meantime:

Do not disturb my circles! – Archimedes

Google+ goes to battle…

After playing with Google+ for part of a day, I have a few comments to make.  Some are in relation to bioinformatics, others are just general comments.

My first comment is probably the least useful:  Google, why the hell did you make me wait 3 days to get into Google+, only to let EVERYONE in 3 hours after activating my invite?  Seriously, you could have told me that I was wasting my time when I was chasing one down.

Ok, that’s out of my system now. So on to the more interesting things.  First, this isn’t Google’s first shot into the social media field.  We all remember “The Wave”.  It was Google’s “Microsoft Moment”, that is to say, their time to release something that was more hype than real product.  Fortunately, Google stepped back from the brink and started over – so with that in mind, I think Google deserves a lot of credit for not pulling a Microsoft. (In my dictionary pulling a Microsoft is blowing a lot of money on a bunch of ads for products that really suck, but will get critical mass by sheer advertising.  e.g.  Bling. Cloud. Need I say more?)

Ok, what did google get right?  Well, first, it seems that they’ve been reading the diaspora mailing list, or at least paying attention.  The whole environment looks exactly like Diaspora to me.  It’s clean, it’s simple, and unlike facebook, is built around communities that don’t have to overlap!  With facebook, everyone belongs to a single group, while Diaspora brought the concept of groups, so that you can segment your circles.  Clicking and dragging people into those groups was what convinced me that Diaspora would be a Facebook killer.

Instead, Google+ has leapfrogged and beaten Diaspora.  And rightly so – Diaspora had its faults, but this isn’t the right place for me to get into that.  As far as I can tell, everything I wanted from Diaspora has found its way into Google+ with one exception: You can’t host your own data.  Although, really, if there’s one company out there that has done a good job of managing user data (albeit it has stumbled a few times) it’s Google.  The “Don’t be evil” motto has taken a few beatings, but it’s still a good start.

[By the way, Diaspora fans, the code was open source, so if you’re upset that Google replicated the look and feel, you have to remember that that is the purpose of open source: to foster good development ideas. ]

So, where does this leave things?

First, I think Google has a winner here.  The big question is, unlike the wave, can it get critical mass?  I think the answer to that is a profound yes.  A lot of the trend setters are moving here from facebook, which means others will follow.  More importantly, however, I think getting security right from the start will be one of the big draws for Google.  They don’t need to convince your grandmother to switch from facebook – they just need you to switch, and your grandmother will eventually be dragged along because she’ll want to see your pictures. (And yes, Picasa is about to be deluged with new accounts.)

More importantly, all those kids who want to post naked pictures of themselves dancing on cars during riots are going to move over pretty damn quickly.  Whether that’s a good thing or not, I think EVERYONE learned something from the Vancouver Riots aftermath.

So great, but how will this be useful to the rest of us?  Actually, I’ve heard that Google+ is going to be the twitter killer – and I can see that, but I don’t see that as the main purpose.  Frankly, the real value is in the harmonization of services.  Google has, hands down, been one of the best Software as a Service (SaaS) providers around, in my humble opinion.  When your Google+ account talks to your email, documents, images – and lets you have intuitive fine-grained control over who sees what, I think people will find it to be dramatically more useful than any of the competition.  Twitter will either have to find a way to integrate into Google+ or to figure out how to implement communities of their own. It may be a subtle change, but it’s a sea change in how people interact on the web.

For those of you who are bioinformaticians, you won’t be able to take Google+ lightly either.  Already, I’ve found some of my favorite scientist twitterers on Google+ and some have started posting things.  Once people start getting the hang of the groups, it won’t be long till we’ll see people following industry groups, science networks and celebrities.  (Hell, even PZ Myers has an account already.)

The more I think about it, the more I see its potential as a good collaboration tool, as well.  Let me give an example.  If group management can be made into something like a mailing list (i.e., opt-in with a moderator), a PI could create a “My Lab” group that he only allows his own students and group members to join.  It would be a great way to communicate group announcements.  It doesn’t spill information out to people who aren’t interested, and other people can’t get into those communications unless someone intentionally “re-tweets” the content.  Merge this with Google calendar, and you have an instant scheduling system as well.

What does Google get out of this?  Well, think targeted Google ads.  As long as you’re logged in, Google will know everything about you that you’ve ever thought about.  And is that a bad thing?  Well, only if you’re Microsoft and want to complain about Google’s absolute monopoly of the online advertisement market.  You know what Microsoft?  Better them than you.  (And hey, if Google ads do help me find a good canoe when I’m in the market for one, who’s going to complain?)

ArsenicLife – a deeper shift than meets the eye.

Unfortunately, I don’t really have the time to write this post out in full, so please don’t mind the rough format in which I’ve assembled it.  I’ve been trying to twitter more than blog recently so that I wouldn’t distract myself by spending hours researching and writing blog posts.  Go figure.

However, I don’t think I can keep myself from commenting this time, even if it’s just in passing.  The whole arsenic-based-life (or #arseniclife) affair is really a much deeper  story than it appears.  It’s not just a story about some poor science, but rather a clash of cultures.

First, I read through the list of published critiques of the arsenic paper, as well as the response on the same page.  The critiques are pretty thoughtful and clear, giving me the overall impression that the authors of the original paper just didn’t bother to talk to specialists outside of their own narrow field.   That’s the first clash of cultures:  Specialists vs. interdisciplinary researchers.  If you neglect to consult people who can shed light on your results, you’re effectively ignoring alternative hypotheses.   Biologists have been guilty of this in the past, failing to consult statisticians before constructing a story their data doesn’t support.  In this case, however, it’s more blatant because the authors should have consulted with other biologists, at the very least specialists in microbial biology. (Rosie Redfield’s blog comes to mind for some of the critiques that should have been solicited before the paper was sent to journals.)

If that wasn’t enough, this clash is also underpinned by “oldschool” meets “newschool” – aka, technology.  This isn’t an original idea of mine, as I’ve seen it hinted at elsewhere, but it’s an important idea.  Underneath all of the research, we have a science battle that’s being fought out in the journals, while new media runs circles around it.  It took almost 6 months for Science to print 6 critiques that stretch from a half page to just over a page.  In the world of blogs, that is about 2 hours worth of work.

I really don’t know what’s involved in having a small half-page article go to press, but I’d be quite surprised if it took 6 months to do that amount of work.  In contrast, a great many blogs popped up with serious scientific criticisms within hours or days of the original embargo on the paper being lifted. (The embargo itself was totally ridiculous, but that’s another tangent I’m not going to follow.)  The science discussion in the blogs was every bit as valid as the critiques Science published.

Frankly, the arsenic life paper leaves me stunned on many levels:  How long will researchers continue to believe they can work in independent fields, publishing results without considering the implications of their work on other fields?  How long will journals be the common currency of science given their sluggish pace to keep up with the discussions?  How long will blogs (and not the anonymous kind) be relegated to step-child status in science communication?

Given the rapid pace with which science progresses as a whole, it’s only a matter of time before something needs to be done to change the way we choose to publish and collaborate in the open.

For more reading on this, I suggest the Nature article here.


Illumina’s MiSeq.

Really, I have nothing to add here.  Keith Robison on the Omics! Omics! blog has already said it all – right down to the email from Illumina’s PR person.

Admittedly, the poster they sent is quite pretty, but as always, I’m waiting to see how the instrument performs in other people’s hands.  (Though, that’s not to say I doubt the results, but I have been bitten by Illumina’s optimistic reports in the past – with specific emphasis on shipping dates for software.)

At this point in the game, we’ve now entered into a long protracted arms race, with each company trying to out-perform the others, but with very few new features.  Improving chemistry, longer reads,  cheaper per-base costs, faster sequencing time and better base qualities will continue to ratchet up – so the MiSeq is Illumina raising the bar again.  Undoubtedly we’ll continue to see other competitors showing off their products for the next few years, trying to push into the market.  (A market which grows to include smaller labs every time they can reduce the cost of the machine, the sequencing, and the bioinformatics overhead.)

However, let me say that we’ve certainly come a long way from the single end 22bp reads that I first saw from the Solexa machines in 2008.   mmmmm… PET 151bp reads in 27 hours. *drool*.

Edit:  Oops.  I missed the link to Illumina’s page for the MiSeq.  Yes, it’s exactly what you’d expect to see on a vendor page, but they do have a link to the poster on the left hand side so that you can check it out for yourself.