A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data?

Before you read my post, you might want to zip over to Complete Genomics’ blog, where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?”  It’s an interesting perspective in which he suggests, as the title indicates, that we should keep a rich variant file as the only trace of a sequencing run.

I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template.  As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter.)

I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space, and could be regenerated at will from frozen samples of DNA, if needed.  As a scientist, I’ve very rarely had to go back into the read-space files and look at the pre-variant-calling data – only a handful of times in the past two years.  As such, it really doesn’t make much sense to store alignments, raw reads or other data. If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.

I’d like to bring out a real-world analogy to summarize this argument. I had recently been considering a move, only to discover that storage in Vancouver is not cheap.  It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me.  If the total value of those possessions is only $5,000, storing them for more than 36 months means that (if I were to move) it would have been cheaper to sell it all and then buy a new set when I come back.
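The break-even arithmetic is simple enough to sketch out (the numbers are the hypothetical ones from the analogy above):

```python
# Break-even point for storing possessions vs. selling and re-buying them.
# Figures are the hypothetical ones from the analogy above.
replacement_cost = 5000.0  # dollars to re-buy everything
monthly_storage = 140.0    # midpoint of the $120-150/month estimate

break_even_months = replacement_cost / monthly_storage
print(f"Break-even after ~{break_even_months:.0f} months")  # ~36 months
```

Past that point, every month of storage is money you would have saved by just replacing the items.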

Where the analogy comes into play quite elegantly is if I have some interest in those particular items.  My wife, for instance, is quite attached to our dining room set.  If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price.  It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only thing that comes into play.

While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelihood that we will return to the data for verification purposes.  You can’t always recreate something of interest in a data set by resequencing.

That is a poor reason to keep the data around, most of the time.  We rarely find things of interest that couldn’t be recreated in a specific read set when we’re doing genome-wide analysis. Since most of the analysis we do uses only the variants, and we frequently verify our findings with other means, the cost/value argument is probably strongly in favour of throwing away raw reads and only storing the variants.  Considering my recent project on storage of variants (all 3 billion of the variants I’ve collected), people have probably heard me make the same arguments before.

But let’s not stop here. There is much more to this question than meets the eye. If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds.  Although the question stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”

Here, I actually would contend the answer is no, depending on the platform.  Let me give you examples of the data I feel we currently do a poor job of extracting.

  • Structural variations:  My experience with structural variations is that no two SV callers give you the same information or make the same calls.  They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg.  (Same goes for structural rearrangements in cancers, etc.)
  • Phasing information:  Most SNP callers aren’t giving phasing information, and some aren’t capable of it.  However, those details could be teased from the raw data files (depending on the platform).  We’re just not capturing it efficiently.
  • Exon/Gene expression:  This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq.  Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information that is obviously not captured in variant files.  (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically stored in rich variant files.)
  • Variant Quality information: we may have solved the base quality problems that plagued early data sets, but let me say that variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between variant callers.  There’s really no substitute for comparing other than to re-run the data set with the same tools.
  • The variants themselves!  Have you ever compared the variants observed by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown.)
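The last point is easy to demonstrate in miniature: treat each caller’s output as a set of (chromosome, position, alt allele) calls and measure the overlap. (A toy sketch with made-up calls, not real data from any particular caller.)

```python
# Toy concordance check between two hypothetical SNP callers.
# Each call is a (chromosome, position, alt allele) tuple.
caller_a = {("chr1", 1000, "A"), ("chr1", 2500, "T"),
            ("chr2", 300, "G"), ("chr3", 42, "C")}
caller_b = {("chr1", 1000, "A"), ("chr1", 2500, "C"),
            ("chr2", 300, "G")}

shared = caller_a & caller_b   # calls both callers agree on
union = caller_a | caller_b    # all distinct calls made by either

print(f"shared: {len(shared)} of {len(union)} distinct calls")
print(f"concordance: {len(shared) / len(union):.0%}")
```

Note that a disagreement on the alt allele (chr1:2500 here) counts against both callers – exactly the kind of discrepancy a rich variant file from a single pipeline would silently hide.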

Even dealing only with the last item, it should be obvious:  If we can’t get two SNP callers to produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data, because we should always be double-checking interesting findings with (at least) a second set of software.

For me, the answer is clear:  If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete.  I just don’t think we’ve hit those milestones yet.

Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads.  You’ll never return to them because you already have all the answers you want.  If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.

So, my answer to the question “Are we ready to stop storing raw reads?” is easy:  That depends on what you think you need from a data set and if you think you’re already an expert at extracting it.  Personally, I think we’ve barely scratched the surface on what information we can get out of genomic and transcriptomic data, we just don’t know what it is we’re missing yet.

Completely off topic, but related in concept: I’m spending my morning looking for Scalable Vector Graphics (.svg) files for many jpg and png files I’d created along the course of my studies.  Unfortunately, jpg is lossy and both jpg and png are raster formats, so they don’t reproduce as nicely in the Portable Document Format (PDF) export process.  Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png format, I’m now at the point where I might have to recreate them to properly export the files again in a lossless, scalable format for my thesis.  If I’d just stored them (as the cost is negligible) I wouldn’t be in this bind….   meh.


How I would improve Google plus, or “Squares, not Circles”

First off, I absolutely think Google+ is fantastic and I’m thinking of giving up twitter in favour of it, which should say a lot.  I think it has the ability to be THAT good.  However, Google missed something big.  As “Science of the Invisible” points out (via TechCrunch) – Google circles are great for organizing, but don’t really help you with the signal-to-noise ratio.

So, I’d like to propose a new idea:  Google Squares.

Instead of the loosely grouped people that make up circles, a Square would be a rigidly defined group, with moderators.  The moderators have two roles: determining who can post to a Square, and who can follow a Square.
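The permission model behind those two roles is tiny – a Square is essentially two moderated lists. Here’s a toy sketch (all names and methods are hypothetical, of course; this is just my proposal, not any real Google API):

```python
# Toy sketch of the proposed "Square": a moderated group where the
# moderator controls both who may post and who may follow.
class Square:
    def __init__(self, name, moderator, private=False):
        self.name = name
        self.moderator = moderator
        self.private = private
        self.posters = {moderator}    # who may post
        self.followers = {moderator}  # who may see posts (if private)
        self.posts = []

    def add_poster(self, requester, user):
        if requester == self.moderator:  # only the moderator grants posting rights
            self.posters.add(user)
            self.followers.add(user)     # posters implicitly follow

    def post(self, user, message):
        if user in self.posters:         # posting is restricted to approved members
            self.posts.append((user, message))

family = Square("family", moderator="anthony", private=True)
family.add_poster("anthony", "mom")
family.post("mom", "Dinner on Sunday?")
family.post("stranger", "spam")          # silently dropped: not an approved poster
print(len(family.posts))  # 1
```

The point of the sketch: the moderator sets things up once, and after that nobody has to manage addresses or filter out noise themselves.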

Imagine, if you will, a private Square. Let’s imagine I want to start a Square for my family.  I can first decide to make it private – only people who are invited to the Square are allowed to see the posts, and only those in my family are invited.  It becomes, instantly, the equivalent of an email mailing list (but much easier to set up and manage) for my family members.  I don’t have to worry about typing in the email addresses of my family every time I want to post something to the Square – and neither do my other family members.  Only one person needs to set it up (the moderator), and it instantly becomes a useful communication tool for everyone I add.

It would be useful for labs, social groups, clubs, etc.   And, it moves people away from email – a long-time holy grail of security experts.

So, what about public Squares? They could be even more useful – providing the equivalent of a blogging consortium or twittering group (which don’t even really exist).  Imagine if there were a Square with all of your favorite twitter-ers. You could still follow them all, one by one, if you like, or add them to your circles, but the Square would give you instant access to all those people whom someone else has pre-screened as being good communicators and worth following.  Instant increase in signal-to-noise.

Finally, the last piece lacking is direct URLs.  Seriously, I’m a bit miffed that Google didn’t set this up from the start, even based on the Google ID.  Really, I’ve had the Google ID apfejes for a LONG time – why can’t I have plus.google.com/apfejes?  Even twitter has this one figured out.

In any case, these are minor grievances…. but I’m waiting for Google to up their game once more.  In the meantime:

Do not disturb my circles! – Archimedes

Is blogging revolutionizing science communication?

There’s been a lot of talk about blogging changing the nature of science communication recently that I think is completely missing the mark.  And, given that I see this really often, I thought I’d comment on it quickly.   (aka, this is a short, and not particularly well researched post… but deal with it.  I’m on “vacation” this week.)

Two of the articles/posts that are still on my desktop (that discuss this topic, albeit in the context of changing the presentation of science, not really in science communication) are:

But I’ve come across a ton of them, and they all say (emphatically) that blogging has changed the way we communicate in science.  Well, yes and no.

Yes, it has changed the way scientists communicate between themselves.  I don’t run to the journal stacks anymore when I want to know what’s going on in someone’s lab, I run to the lab blog.  Or I check the twitter feed… or I’ll look for someone else blogging about the research.  You learn a lot that way, and it is actually representative of what’s going on in the world – and the researcher’s opinions on a much broader set of topics.  That is to say, it’s not a static picture of what small set of experiments worked in the lab in 1997.

On the other hand, I don’t think that there are nearly enough bloggers making science accessible for lay people.  We haven’t made science more easily understood by those outside of our fields – we’ve just made it easier for scientists inside our own field to find and compare information.

I know there are a few good blogs out there trying to make research easier to understand, but they are few and far between.  I, personally, haven’t written an article trying to explain what I do for a non-scientist in well over a year.

So, yes, blogging has changed science communication, but as far as I can tell, we’ve only changed it for the scientists.

CPHx: Sponsored talk – Roland Wicki, Life Technologies

Ion Torrent Semiconductor Sequencing

Three sequencing technology concepts: Sanger, Post Light, Massively Parallel Sequencing.

PostLight fits somewhere in between Sanger and massively parallel.  Long, fast, quantity.

Low-cost, convenient, single-use device.  Everything is on the chip.  55,000 euros for the full machine, plus 17,000 euros for the server.

[They’ve trademarked “The Chip is the Machine”.]

Everything works on pH changes.  Very low cost, no cameras, no modified enzymes, etc.  It’s all nature.

Accuracy is 99.9% (from EdgeBio.)

Automatable, simple, quarterly updates and rapidly improving raw accuracy.

Chemistry.  Nucleotides are incorporated into DNA, releasing a proton.  That pH change can be measured.  If no nucleotide is incorporated, no pH change is detected.  If multiple bases are incorporated, a larger change is registered.  [I assume this means they wash each of the bases over in sequence…  yep, it’s on the next slide.]
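[That flow scheme is easy to mock up: wash each nucleotide over the template in a fixed cycle and record a signal proportional to how many bases get incorporated. This is just my toy model of the concept, not Life Tech’s actual signal processing:]

```python
# Toy model of pH-based flow sequencing: nucleotides are washed over the
# sequence being read in a fixed cycle; each flow yields a signal
# proportional to the number of bases incorporated (0 = no pH change,
# homopolymer runs give proportionally larger changes).
def flow_signals(sequence, flow_order="TACG", n_flows=8):
    signals, pos = [], 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(sequence) and sequence[pos] == base:
            run += 1  # each incorporated base adds to the pH change
            pos += 1
        signals.append((base, run))
    return signals

print(flow_signals("TTAGG"))
# e.g. the TT homopolymer registers as a single flow with signal 2
```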

Raw accuracy is still improving all the time.  Homopolymer accuracy is now 99% for up to 6 bases being incorporated….  [nothing shown beyond that, however.]

Scalability: expect 100Mb in Q2 2011, 1Gb in Q4 2011.  [Some of that is longer reads.] [RE-EDIT: I have changed the above to reflect the accurate values.  The original numbers I had noted down were VERY wrong.  Please see my post here for more information.]

Single day workflow available.  2 hour sequencing runs.

[Some discussion of products available… I don’t take notes on this type of stuff.  Consult your Life Tech rep if you want that info.]

Supported apps:

  • microbial
  • mitochondrial
  • amplicon
  • custom targeted
  • validation of whole genome/exome mutations
  • library assessment
  • RNA-Seq

ChIP-Seq and whole transcriptome RNA-Seq for human are coming up with the next chip.  You need more coverage than you would get with the current chip.

Example given using DH10b data, available from Ion Torrent, BGI and EdgeBio.  You can get full data sets from Life Tech’s web page. [no url given.]

TargetSeq Enrichment Kits, announced recently.  Works for exome enrichment.

Details given for amplicon library prep.

You can use existing sanger-based amplicons on Ion Torrent.

You can also use PGM/Ion Torrent to do QC.  It’s a short run, so you can test library construction by sequencing small sets before tossing it on a higher throughput sequencer.

There is an Ion Community… if you want more info.

And a quick plug for BGI’s use of PGM to sequence E. coli found in recent outbreak.


ArsenicLife – a deeper shift than meets the eye.

Unfortunately, I don’t really have the time to write this post out in full, so please don’t mind the rough format in which I’ve assembled it.  I’ve been trying to twitter more than blog recently so that I wouldn’t distract myself by spending hours researching and writing blog posts.  Go figure.

However, I don’t think I can keep myself from commenting this time, even if it’s just in passing.  The whole arsenic-based-life (or #arseniclife) affair is really a much deeper  story than it appears.  It’s not just a story about some poor science, but rather a clash of cultures.

First, I read through the list of published critiques of the arsenic paper, as well as the response on the same page.  The critiques are pretty thoughtful and clear, giving me the overall impression that the authors of the original paper just didn’t bother to talk to specialists outside of their own narrow field.   That’s the first clash of cultures:  specialists vs. interdisciplinary researchers.  If you neglect to consult people who can shed light on your results, you’re effectively ignoring alternative hypotheses.   Biologists have been guilty of this in the past, failing to consult statisticians before constructing a story their data doesn’t support.  In this case, however, it’s more blatant because the authors should have consulted with other biologists, not least specialists in microbial biology. (Rosie Redfield’s blog comes to mind for some of the critiques that should have been solicited before the paper was sent to journals.)

If that wasn’t enough, this clash is also underpinned by “oldschool” meets “newschool” – aka, technology.  This isn’t an original idea of mine, as I’ve seen it hinted at elsewhere, but it’s an important idea.  Underneath all of the research, we have a science battle that’s being fought out in the journals, while new media runs circles around it.  It took almost 6 months for Science to print 6 critiques that stretch from a half page to just over a page.  In the world of blogs, that is about 2 hours worth of work.

I really don’t know what’s involved in having a small half-page article go to press, but I’m quite surprised it would take 6 months to do that amount of work.  In contrast, a great many blogs popped up with serious scientific criticisms within hours, if not days, of the embargo on the paper being lifted. (The embargo itself was totally ridiculous, but that’s another tangent I’m not going to follow.)  The science discussion in the blogs was every bit as valid as the critiques Science published.

Frankly, the arsenic life paper leaves me stunned on many levels:  How long will researchers continue to believe they can work in independent fields, publishing results without considering the implications of their work on other fields?  How long will journals be the common currency of science, given their sluggish pace in keeping up with the discussions?  How long will blogs (and not the anonymous kind) be relegated to step-child status in science communication?

Given the rapid pace with which science progresses as a whole, it’s only a matter of time before something needs to be done to change the way we choose to publish and collaborate in the open.

For more reading on this, I suggest the Nature article here.


Illumina’s MiSeq.

Really, I have nothing to add here.  Keith Robison on the Omics! Omics! blog has already said it all – right down to the email from Illumina’s PR person.

Admittedly, the poster they sent is quite pretty, but as always, I’m waiting to see how the instrument performs in other people’s hands.  (Though, that’s not to say I doubt the results, but I have been bitten by Illumina’s optimistic reports in the past – with specific emphasis on shipping dates for software.)

At this point in the game, we’ve now entered into a long protracted arms race, with each company trying to out-perform the others, but with very few new features.  Improving chemistry, longer reads,  cheaper per-base costs, faster sequencing time and better base qualities will continue to ratchet up – so the MiSeq is Illumina raising the bar again.  Undoubtedly we’ll continue to see other competitors showing off their products for the next few years, trying to push into the market.  (A market which grows to include smaller labs every time they can reduce the cost of the machine, the sequencing, and the bioinformatics overhead.)

However, let me say that we’ve certainly come a long way from the single end 22bp reads that I first saw from the Solexa machines in 2008.   mmmmm… PET 151bp reads in 27 hours. *drool*.

Edit:  Oops.  I missed the link to Illumina’s page for the MiSeq.  Yes, it’s exactly what you’d expect to see on a vendor page, but they do have a link to the poster on the left hand side so that you can check it out for yourself.


Teens and risk taking… a path to learning.

I read an article on the web the other day describing how teenagers weigh risk and reward differently than either young children or adults, due to a chemical change that emphasizes the benefits of the rewards without fully processing the risks.

The idea is that the changes in the adolescent brain emphasize the imagined reward for achieving goals, but fails to equally magnify the resulting negative impulse for the potential outcomes of failure. (I suggest reading the linked article for a better explanation.)

Having once been a teenager myself, this somewhat makes sense to me in terms of how I learned to use computers. A large part of the advantage of learning computers as a child is the lack of fear of “doing something wrong.” If I didn’t know what I was doing, I would just try a bunch of things till something worked, never worrying about the consequences of making a mess of the computer.  I have often taught people who came to computers late in their lives, and the one feature that comes to the forefront is always their (justified) fear of making a mess of their computer.

In fact, that was the greatest difference between my father and me, in terms of learning curve: when encountering an obstacle, my father would stop as though hitting a brick wall until he could find someone to guide him to a solution, while I’d throw myself at it till I found a hole through it, or a way around it. (Rewriting DOS config files, editing registries and modifying IRQ settings on add-on boards were not for the faint of heart in the early 90’s.)

As someone now in my 30’s, I can see the value of both approaches. My father never did mess up the computer, but managed to get the vast majority of things working. On the other hand, I learned dramatically faster, but did manage to make a few messes – all of which I eventually cleaned up (learning how to fix computers in the process). In fact, learning how to fix your mistakes is often more painful than causing the mistake in the first place, so my father’s method was clearly superior in sheer pain avoidance (e.g., negative reinforcement).

However, in the long run, I think there’s something to be said for the teen’s approach: you can move much more agilely (is that a word?) if you throw yourself at problems with the full expectation that you’ll just learn how to solve them in the end.  One can’t be a successful researcher if fear of the unknown is what drives you.  And, if you never venture out into the fringes of the field, you won’t make the great discoveries.  Imagine if Columbus hadn’t been willing to test his theories (which were wrong, by the way) about the circumference of the earth – and no, even the ancient Greeks knew that the earth was round.

Incidentally, fear of making a mess of my computer was always the driving fear for me when I first started learning Linux.  Back in the days before good package management, I was always afraid of installing software because I never knew where to put it.  Even worse, however, was the possibility of doing something that would cause an unrecoverable partition or damage hardware – both of which were actual possibilities in those days if you used the wrong settings in your config files.  However, with a distinct risk/reward ratio towards the benefit of getting a working system, I managed to learn enough to dull that fear.  Good package management also meant that I didn’t have to worry about making messes of the software while installing things, but that’s another story.

Anyhow, I’m not sure what this says about communicating with teenagers, but it does reinforce the idea that older researchers (myself included) have to lose some of their fear of failure – or fear of insufficient reward – to keep themselves competitive.

Perhaps this explains why older labs depend upon younger post-docs and grad students to conduct research… and the academic cycle continues.

Pet Peeve

Ok, my pet peeve is Microsoft’s repeated abuse of terminology.  Today, I’m annoyed with their use of the word “cloud”, because Microsoft just doesn’t get it… again.

This is how Wikipedia explains cloud computing:

Cloud computing is computation, software, data access, and storage services that do not require end-user knowledge of the physical location and configuration of the system that delivers the services.

Microsoft, it seems, doesn’t bother to explain what cloud computing is, despite having a full web site, packed with white papers and video clips, devoted to it.   However, the example they use in their commercial is using the cloud to get a specific file from your computer at home – instead of having someone named Claude go to your house to get it for you.

How on earth is that an example of cloud computing? The file is in a defined location, and you’d better damn well know where your own computer is if Claude is going to bring it to you… I think Microsoft has just reinvented ssh/scp (available since 1995) or rlogin (available since 1986) and decided to call it “the cloud.”

It’s hardly the first time Microsoft has grossly misused terms for their own marketing purposes (e.g. “I’m a PC” or “Open XML”), so you’d think I’d be used to it by now. Unfortunately, it still grates on my nerves every time.

Just in case you think I’m the first to complain about this (I’m not), here are a few others – and one possible reason why.

AGBT talk: Ellen Wright Clayton, Vanderbilt University

[Speaker Encourages tweeting and blogging!]

Title: Surfing the tsunami of whole genome sequencing.


  • Complete disclosure of the results of whole genome sequencing could lead to disaster.
  • Suggests strategies for handling the flood of information.

Medicine: Based on genetic and environmental contributions. Prevention plays a smaller part in medical care, and is based entirely on phenotype + age.

Future: Personalized medicine [Francis Collins quote on sequencing newborns].


  1. Separating the wheat from the chaff: false positives increase as data increases.
  2. Incidental findings: Most people say they want incidental findings, even if they don’t know what that means.  When deciding what results to return, however, there are many categories (reproductive outcomes, actionability, personal value – but what about standards in clinical practice?)  The debate about this is ongoing, but possibly paternalistic.
  3. What are the downstream costs?  Parallel debate in radiology where you have to factor in everything – and where the actual cost of following up incidental findings is not trivial.  Maybe it’s not worth following up on everything.
  4. Pleiotropy: ApoE4 story and PheWAS (no detail given, but much information available elsewhere.) As we look at genomes, we’ll find a lot of pleiotropic effects, which means we’ll have a LOT of incidental findings.
  5. Bad Science: Discussion of “GATTACA”.

[This discussion is subtly directed at an American audience….  finding it less convincing as a Canadian, where healthcare is free, and the cost savings of personal genomics will outweigh the cost of following up on accessory conditions.]

Thus: disclosure of all this information threatens to sweep away the health care system.  [Meh… doubtful.]


  1. Consider utility and actionability… don’t disclose things until [someone] decides it’s ready for “primetime”.  [Who is this someone who decides this for me?]
  2. Age for testing and disclosure
  3. Impact of costs of follow up
  4. what about people who don’t want to know?

The real question:

  • We all assume we can control who gets access to this data. [No, not really – I assume it is mostly irrelevant to everyone but the person to whom it pertains, unless you’re an American with private healthcare.]

What do we do when the information is available?

  • Better information for electronic medical records
  • Develop better policies now.
  • Patients’ desires will probably play a minor role.  This will be REALLY controversial.  Limits will make people unhappy.

[I’m leaving out the discussion of “parents have a constitutional right to their child’s information”… it’s very much irrelevant and seems like a non-sequitur to me, and childhood stories don’t belong on a blog.  See, I know where to stop blogging.]

To clarify:

  1. Scientific analysis of variations and their impacts must proceed at full speed. [Yes, but why would you assume it isn’t????]  The public doesn’t know this. [Ok, we need to be better at communication.  How about more blogging and tweeting? (-:]
  2. Policies determining access and use.
  3. We need to engage the public and explain what it shows and doesn’t show.  [Communicate limits.  I agree with this, but media needs to be better informed on the point…  yada yada]

If we’re going to “surf the tsunami” of medical data, we have to do a better job of engaging, recognizing that it will be controversial, and knowing its limits.

[Interesting talk, but I fail to see most of her points. First question makes light of the American/non-American divide… (-:  ]

Evening Festivities or being snarky about pac bio’s movie.

Possibly the most exciting thing that’s happened in the past hour is the fact I’ve won a million pounds in a lottery that I don’t remember entering…. although really, I don’t think I’m going to send my personal information to the lottery corporation – after all, the lottery was sponsored by the British tobacco promo.

More to the point, I was underwhelmed by “The New Biology” film shown by Pac Bio. That’s not to say it was bad, but that they’d picked the wrong audience. Really, it might have been good if you were, say, a complete newbie in the field of next-gen sequencing, or if you like snazzy graphics that don’t tell you much. (Yes, I’m being snarky… but I’ve been good all day, so here it goes.)

Personally, I found myself trying to read the lines of Perl code that would scroll by periodically in batches of random numbers. I did catch the line:

while (1) {

which has me wondering where the code came from. (This is one of those things that good coders just don’t do.) I got a copy of the video, so I’ll try to figure this out later. My guess is it wasn’t Pac Bio’s code. I think much more highly of them than that, and this movie was really not designed for an audience of bioinformaticians. I hope the biologists in the audience got more out of it.

I’m also a little wary of the “new biology” paradigm, which was alternately defined as personalized medicine, drug screening, network biology and next generation sequencing. They can’t ALL be new biology… can they? Or did I miss the memo that everything in the future is new biology… hrm.

I suppose it also didn’t help that there were a lot of facebook analogies in the introduction… I’m rather anti-facebook because of its policies, and really, I think my database of a billion rows of searchable variations across 2000 samples faces entirely different challenges than the mechanisms used when my nephew tells all of his friends about how much he hates math class… Don’t get me wrong – I love social networking in the abstract, but facebook isn’t my device of choice… and then there was the Monsanto thing, but let’s not get into that.

Anyhow, I guess I can say the movie wasn’t to my taste, unfortunately. I can see it doing well as a one-hour TV special on the National Geographic channel – or even uploaded to YouTube, where I’m sure it would quickly accrue several million hits – but my further viewing pleasure will all be with an eye to figuring out where the code came from… or possibly as a drinking game. (A shot every time someone says “new biology” might work well.) Bottoms up!

Ah, Pac Bio, I was hoping for more snazzy technology this year, rather than a Disney-esque version of the future. But that’s ok, you’re still my favorite technology… Long live single molecule sequencing!