How to Improve Bioinformatics Graduate Work

This is only a half formed thought, but I wanted to get something “on paper”, so to speak.

I was thinking about my grad school experience and what was positive (many things) and what was negative (many things) – and, of course, what I’d do to improve things. In fact, it came down to one major theme: the vast majority of my coding work – implementing new algorithms and developing tools (much to my advisor’s dismay) – was done independently.

I don’t mean that I never had buy-in. It was slow to build, but my projects were generally adopted and used by a large population of bioinformaticians, both where I work and elsewhere. (I’m honoured that they used them, for the record.)

What I mean is that I had a difficult time recruiting people to participate in the coding.  The two most productive periods of the last year were – respectively – when I had a co-worker who contributed code to FindPeaks for a few months before leaving to return to Europe, and in the last few months, when a post-doc joined my project and helped implement a small section of the database project.  In each case, their help (neither of which was full-time) was enough to free me (temporarily) from the support cycle that tends to consume bioinformatics projects.  Just off-loading a few requests can be the difference between having time to write a paper, or not. The more popular the project, the more you find yourself dealing with users.

Unfortunately, graduate work, as implemented where I am, tends to be done in solo projects.  No one else in my lab works on anything remotely close to what I do, and when the projects do overlap, the incentive isn’t to collaborate but to use someone else’s tool to do something completely different.  That tends to leave everyone isolated in their own codebase.

Naturally, I’m not saying that my isolation was someone else’s fault – I could have chosen to work in Perl, which might have helped bring in other developers (Java is not a popular language for bioinformatics), and I probably could have been more welcoming to a few people who showed interest. (I can be short with people I feel are wasting my time, which is something I’m working on. None of them were interested in contributing, as far as I could tell – just in using my project to advance their own work, which is itself a common academic technique.)

However, if I had to pick one thing that would have made my graduate work go faster and more smoothly, it would have been to have some structure around me that encouraged collaborative work.  Specifically, collaborative development work.

I don’t think it’s sufficient anymore to sit and write code on your own and call that bioinformatics – I really believe the future of the field lies in larger group efforts, much the same way that no single physicist would go off and build his own cyclotron. It’s just that bioinformatics hasn’t been forced to catch up yet.

Those graduate students who are given the collaborative support to grow and develop will find themselves more in demand – and better equipped to handle the tasks that will slowly come to dominate the field. Seriously, gone are the days when a coder could hack up a blockbuster game like Tetris in their garage, develop a new search engine on their own, or even write an operating system – these things all take teams to do successfully now. And I think it’s time we started to consider bioinformatics to be the same way.

Once I’m done my graduate work (which will be soon, thank goodness), I’m going to make sure I’m in an environment where I’m not working in isolation. The last four years have been hard. One-man teams are not all they’re cracked up to be!

I should mention, however, that I’m fortunate my own advisors let me pave my own trail with open source code and wikis.  I may not have had much opportunity to work with other people on my projects, but at least I gave the world a chance to participate – and that made all the difference for me.

All right, that’s all I’ve got for tonight… so for all you bioinformatics grad students just starting out: go forth and collaborate.

Dr. Robert Weinberg lecture at the BC Cancer Agency

A LONG long time ago, I wrote a review of Dr. Weinberg’s textbook, The Biology of Cancer, which actually drew a response from Dr. Weinberg himself. Needless to say, I’ve been a big fan of Dr. Weinberg’s since reading his textbook, and I’ve learned to keep my opinions to myself – mostly.

In any case, I couldn’t resist the opportunity to see him in person now that he’s visiting the BC Cancer Agency. I arrived at the lecture theatre 15 minutes early and wasn’t at all surprised to discover that the hall was nearly full – there was a full overflow room as well.

The lecture was organized by GRASSPODS, the Bennet family and several other sources – none of whom, by the way, are sponsoring me, so I don’t feel obliged to list them all.

I’ll revisit this later to add some reasonable formatting.

—————-

* Lab: involved in mechanisms leading to tumours. Outline of how primary tumours begin.
* 90% of cancer associated mortality is associated with metastasis, not primary tumours.
* Discussion of tumour progression – you can take two tumours and introduce identical mutations, and in one case you get metastasis; in the other, no metastasis. Tumourigenicity is also vastly different. Suggests the involvement of stem cells.
The nature of the normal cell of origin is a strong determinant of the phenotype of the tumorigenic cell.

Cascade model of stem cells -> intermediate ->  post-mitotic differentiated cells.  Most cells give up the stem cell properties, and can no longer seed new tumors.

Invasion -metastasis cascade:

Primary tumour -> invasion -> intravasation -> transposition -> extravasation -> micrometastasis -> metastasis.

Most cancers don’t make it all the way to metastasis.  Thus, there are serious challenges for tumour cells to get all the way to the end of the cascade.

So, what changes must occur to metastasize? Several images: human xenograft, stained. Human epithelial cells show the EMT transformation preferentially on the outside of the human xenograft, at the interface with mouse cells. Thus, EMT transformation seems probable as a metastasis intermediate.

The microenvironment of the primary tumour can contribute importantly to the phenotypic conversion occurring during an EMT… (missed the end)

Thus, we probably can’t find out EVERYTHING from just sequencing the tumour – we need to know about the environment of the cells, e.g., tumour-structure-specific changes. There is also more to an EMT transformation than just shape – it’s a profound shift in expression, possibly involving up to 1000 genes.

Much of what we know about EMT comes from transcription factor research, e.g., the gene Twist, which reprograms cells involved in EMT.

Cancer cells appropriate early embryogenic programs in order to acquire traits of high-grade malignancy.  (slide heading)

(EMT also occurs outside of cancer, such as in wound healing.)

Image: metastases are reduced by 85% in one experiment using siRNA-Twist. We don’t know if Twist is sufficient for activating EMT, however.

How many of the steps of the invasion-metastasis cascade can the Twist transcription factor program? (header)

Colonization is unlikely to be activated by Twist, but this changes how we see the metastatic process. It simplifies things, however: by turning on Twist (or an equivalent factor), you can probably get metastasis.

Anecdote: 15 year adenoma -> cancer, 1 year cancer -> metastasis.  Very suggestive, anyhow.

Gupta’s work: melanocyte -> tumour state (SV40, ER, hTERT and Ras). Behaviour was “mild”. Putting the same genes into non-melanocyte cells caused many other changes.

Some discussion of the Slug protein, which enables emigration from the primitive neural crest – it also seems to be activated in metastatic cells.

transcription factors that enable neural crest emigration are likely also involved.

A member of Weinberg’s lab (Mani?) asked about the nature of cells induced by an EMT: epithelial cells -> (EMT) -> fibroblast??

Tested it, and found that you don’t get a standard fibroblast. Then what DOES it produce? Found FOXC2 in the cell nuclei of normal human mammary epithelial cells. Proves nothing, but suggests: epithelial cells -> EMT -> mesenchymal cells -> ??? -> stem cells.

In fact, tried FACS (CD44, CD24) and showed migration from non-stem-cell to putative stem-cell groupings.

Induction of EMT by the Snail and Twist EMT-inducing TFs also generates CD44hi CD24lo cells. A similar experiment was done and tied to mRNA expression, showing that epithelial markers drop dramatically while mesenchymal markers rise dramatically (N-cad, Vim, FN1, FOXC2, Slug, SIP1, Twist). Thus, epithelial cells transformed this way are going through the “front door” of the pathway towards stem cell characteristics.

This is also shown by morphology: mammosphere-forming abilities.

Kornelia Polyak looked at primary human breast samples and demonstrated that this also seems to happen in cancers. Marked overexpression of mesenchymal markers.

The stem cell program of EMT cells is the same (we believe) as in breast cancer, etc. EMT is likely highly “dangerous” for cancer patients: it allows transportation to new locations AND likely self-renewal!

Can one produce more compelling biological proof that EMTs generate epithelial stem cells?

A mouse model was used by Wenjun Guo (using cleared mammary stromal fat pads). Slug was found to be highly upregulated in the stem cell population. [leaving out much detail here about the experimental design, which was very elegant.]

Performed a competitive reconstitution assay using a Slug vector, with controls lacking the Slug vector. Before induction, they’re the same. After Slug induction: able to recreate a complete ductal tree, as though completely stem-cell-like. Controls had some activity, but the ductal tree recreated was small and incomplete. [again leaving out lots of the evidence presented]

Thus, it seems that Slug CAN be used to induce a stem-cell behaviour in a cell population.  It’s possible, but not shown, that this could be used for generating complete organs.  [neat]

Slug is necessary AND sufficient for breast cell Stem Cell transformation.

Most currently used chemotherapeutics preferentially kill non-CSCs (CSC = cancer stem cell). Thus, as long as the CSC population is present, recurrence is likely, if not certain. We need to target the CSC population instead – thus, we need new drugs, etc.

Force epithelial cancer cells to undergo an EMT and acquire stem-cell traits… they used this to screen for drugs that preferentially kill stem cells rather than non-stem cells. For two common drugs (doxorubicin and paclitaxel), a 2x dose was required to kill CSCs relative to non-CSCs.

Screened, and came up with two drugs, both of which are anti-parasitics: salinomycin (used in chicken feed!). (Gupta and Onder)

Example of cell populations tested via FACS. Originally, 4.9% are in the stem-cell state -> paclitaxel takes you to a 70% stem-cell state (by killing non-CSCs), whereas salinomycin takes you to 0.2% (by killing CSCs). This compound isn’t a good drug for cancer, but it may lead to new drugs that could be used.

C. Chaffer did an experiment showing that floating cells (flopc) in media aren’t all dead, and some of them adhere to the surface if re-cultured – there appears to be a small sub-population of stem-cell-like cells. Followed over time, this showed that non-stem-cells spontaneously go into the stem-cell state and vice versa!

There is a rate of conversion, depending on the cell.

This cannot be due to rapidly growing populations of stem cells, because they grow SLOWER than non-stem-cells.

Thus, there is a spontaneous transformation from the non-stem-cells to stem-like states.

Thus, the cascade figure is wrong! The arrow must go both ways – FROM and TO the stem-cell state. This also forces a re-think of CSC-based cancer treatments – you can’t just kill the CSCs once, as they’ll regenerate! Thus, you must hit both populations – the non-CSCs and the CSCs – at the same time.

Also have to revisit the cascade…  CSCs can go through non-CSC stages!  [kinda mind blowing, actually]  Thus, not all mutations must happen in the stem cells!  Mutations happen more rapidly in the more rapidly growing cells, which are not stem-cells, but that later BECOME stem cells.

[Great Talk!  Tons to think about, but I’ll just post it now and revisit it later for cleanup.]

Rhodes Scholars

I read an interesting article on the trend for “well educated” students to be increasingly book smart and decreasingly worldly. (Link here) Frankly, I found the article to be a rather sad reflection of the authors’ unwillingness to re-evaluate their pre-conceived notions about what makes a good student.

Rhodes scholars are picked from the best of the best – using marks, leadership and sports participation as the criteria for picking the first crop, from which the finalists are chosen. In the past this might have worked – there were probably a lot of book smart students among those who applied, but they were easily weeded out. However, as the number of eligible students increases, with selection criteria that enrich for book smart students, it’s no wonder the trend they see is away from those who see the big picture. Getting perfect marks AND seeing the big picture – let alone sports participation on top of that – are not (in my humble opinion) commonly found in the same people.

Anyhow, your mileage may vary – but I think they’ve failed to grasp the big picture: those who move the world generally aren’t going to fixate on perfect test scores. But I suppose if you’re only looking for three or four per year, you’ll have to dredge through a lot of applicants to find the few who do meet the criteria – but then why complain that you have to look harder as your pool gets larger and larger?

Kubuntu 10.10 on a MacBook Pro

After spending a few hours setting up a MacBook Pro with Kubuntu 10.10 (Maverick), I figured there were a few things that would be useful to jot down.

First of all, I found this page very helpful: https://help.ubuntu.com/community/MacBookPro6-2/Maverick – instructions on setting up a MacBook Pro 6,2 with Kubuntu Maverick, provided by the community.

Unfortunately, not all of their advice was spot on, so there are a few things I figured I’d need to point out:

  1. run sudo apt-get update; sudo apt-get dist-upgrade once you’ve got things set up – that will upgrade and solve a few problems right off the bat
  2. you will need to install mbp-nvidia-bl-dkms to get the screen brightness controls working (more on that later), and you’ll want macfanctld to control the fans. I’m not sure what btusb-dkms does, but it wasn’t a problem down the road, so you can install all three at once with: sudo apt-get install mbp-nvidia-bl-dkms macfanctld btusb-dkms
  3. Mac keys: the equivalent instruction on the community page suggests installing pommed – do not install pommed! It’s unnecessary, it constantly takes up about 4% CPU and it doesn’t work well. You can configure all of the buttons elsewhere. (E.g., I used Compiz to set up the desktop widget and scale applications.)
  4. You’ll want the mbp-nvidia-bl module loaded at boot, so add the word “mbp-nvidia-bl” to the end of the /etc/modules file. (I also have coretemp in that file, having followed the instructions for sensors on the community page.)
  5. At some point, I installed “nvidia-bl-dkms”, which was causing mbp-nvidia-bl-dkms to not be loaded (aka, blacklisted), so I had to remove it – if you haven’t added it, you won’t have that problem.
  6. I used the command “alsamixer” to unmute the speakers, as suggested on the community page, but then also had to make a few changes to the sound control widget. First, I had to change the master control from the High Def sound controller to the onboard sound controller, and then un-mute that. That enabled sound on my system AND made the sound buttons on the keyboard work.
  7. I found (and still find) the touchpad to be extremely annoying – so I tried touchfreeze. Unfortunately, touchfreeze also takes a constant 4% of CPU, so I killed that. The next best solution I had was to run the command “syndaemon -d -i 1” on the command line. This runs syndaemon in the background and disables the touchpad while you’re typing and for one second afterwards. It’s not perfect, but WAY better than nothing. You can put it into System Settings -> Startup and Shutdown -> Autostart, so that it will run every time you log in.
  8. Screen brightness is somewhat of a pain. Once you have mbp-nvidia-bl loaded as a module, you can use the screen brightness slider on the power control (battery) applet to control the screen brightness. Unfortunately, I haven’t gotten the buttons working yet – though I’ve read that you can add “acpi_backlight=vendor” to the end of the GRUB_CMDLINE_LINUX_DEFAULT parameter in your /etc/default/grub file. I haven’t tried this yet. EDIT: Tried it – didn’t work. EDIT 2: There appears to be a KDE bug report for this problem, which was fixed by a patch in Nov 2010 – which means it’s unlikely that there’s a trivial fix for it, and furthermore, that it will be fixed in Kubuntu 11.04. (see this)

Everything else seems to have worked as advertised on the community page.

Credit for everything here should go to the people who posted these solutions in forums – I didn’t invent any of this myself, and, unfortunately, I didn’t write down the sources for what I did use – I googled from too many other computers on the way to getting things set up.

Further Notes by means of EDIT:

  • To enable the speakers on the macbook pro, you do need to install the gnome-alsamixer package, which will then give you the option of un-muting the speakers. I couldn’t track down this option using any of the other interfaces (sound applet, control panel or the alsamixer for the command line.)
  • The best way to set up the touch pad is to use the control panel to disable the single tap (set it to none), but allow tapping to be on otherwise. That fixes most (if not all) of the weird track pad behaviors.

Happy Kubuntu-ing.

Variant Call Format Redundancy

I’ve been working with the Variant Call Format (VCF) files released for the latest version of dbSNP, which contain SNP calls from the 1000 Genomes Project. I have to say that it’s nice to have it all in one file – that makes my life easier, really. (Links to the files and some discussion of them can be found here.)

However, I’m a little surprised at the execution. First, VCF is a text-based format, which is supposed to make it easy to add information to a record by standardising the first couple of columns (chromosome, position, etc.) and then providing one column for flags in text format. That seems just fine to me. However, the VCF files for dbSNP are somewhat… funky.

Instead of sticking with text flags, they’ve also incorporated a binary flag field. First, that somewhat defeats the point of using text flags in the first place – you’re either making something easy for your users to parse, or you’re not, and binary flag fields don’t really fall into that category. (There are instructions for deciphering them, which is a big plus, but it’s not something your average perl hacker can deal with.) This binary field, of course, is mixed in with a large number of text flags… and that’s where the weirdness begins.

Example line:

1 10327 rs112750067 T C . . dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP

That is to say: on chromosome 1, at the 10,327th position, there is a SNP which has been given the name “rs112750067”, and which changes a reference T into a C. After the two dots, you’ll see the text-flag field I was talking about, and VP={blah} is the bit-flag field.

Stranger than just forcing binary information into a semicolon-delimited text field, the binary flags are actually redundant with the text flags – that is to say, they contain no information beyond what the text flags already hold! In the example above, the VP field actually encodes the same information given by the WGT, VC, R5 and ASP flags.
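To make the point concrete, here’s a minimal Python sketch of how you might pull that record apart – it’s an illustration only (not any official dbSNP or VCF tooling), and it assumes the standard tab-separated VCF columns for the example line above:

# Minimal sketch: parse the example dbSNP VCF record shown above.
# Illustration only -- for real work, use a proper VCF parser.
record = ("1\t10327\trs112750067\tT\tC\t.\t.\t"
          "dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP")

chrom, pos, rsid, ref, alt, qual, filt, info = record.split("\t")

flags = {}    # boolean text flags, e.g. R5, ASP
values = {}   # key=value entries, e.g. VC=SNP
for entry in info.split(";"):
    if "=" in entry:
        key, val = entry.split("=", 1)
        values[key] = val
    else:
        flags[entry] = True

print(chrom, pos, rsid, ref, "->", alt)
print("text flags:", sorted(flags))   # ['ASP', 'R5']
print("key=value pairs:", values)     # includes WGT=1, VC=SNP, VP=...

# The VP field is a hex-encoded bitfield; dbSNP documents the bit layout,
# but since the text flags already carry the same information, we just show
# the raw bytes rather than interpreting them.
vp_bytes = bytes.fromhex(values["VP"])
print("VP as raw bytes:", list(vp_bytes))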

I would be very interested to know what exactly is going on here. There are a few logical possibilities: maybe this is just an intermediate format before they break with the VCF format and switch to an all-binary flag set? Or maybe they haven’t realised how redundant it is to duplicate all of the information in the flags field by embedding it again in a binary format within that same field? I’m not really sure… but I’d love to know.

For the moment, I’m just going to ignore the binary data and deal with the text flags… I can’t see any reason to do otherwise.

Ode to Expedited Delay

28 hours after my new laptop was supposed to have left China, it turns out it’s still there – and has left again! I’m not sure how many times it has to leave, but I certainly didn’t expect to see it sit in China that long after UPS claimed it had left.

In any case, to tide you over till I have something else to write (and to keep myself from hitting refresh on the UPS screen another 1000 times) I thought I’d provide a fantastic “video” of Beethoven’s 9th symphony, 2nd movement. I often listen to classical music while working, but this visualisation is so compelling, I’ve been unable to drag myself away from it. It’s very different than most other pieces I’ve seen visualised in this manner, which is just another way that you can see Beethoven’s genius in action. If you’d like to see other videos, the same user has a whole bunch of them posted – and apparently sells them as videos for $1 each.

Actually, it’s given me a few genome visualisation ideas as well – but I’ll talk more about that soon. I need to finish the revisions for my paper first. Then I’ll talk about what I worked on over the holidays, which might be a good starting point for some future work. whee!

Buying a laptop… i5-2537M processor?

Let me summarise: My own laptop, bought 4 years ago just before starting grad school, is dying. The monitor supports one resolution (and dies any time it leaves it, e.g., games, tty1-7, shutdown screen, etc) and has a battery life that would terrify a mayfly. Replacing the battery costs more than 1/3rd of the original price of the laptop, and just seems like a silly waste of money. And, although I’m technically competent enough to purchase new cells and re-wire the battery pack, I don’t own my own soldering iron.

Thus, when I was told I have some money in the research component of my scholarship that will expire shortly, I thought I’d buy a new laptop and put this one out of its misery.

It didn’t take me long to come up with a list of requirements, which are all self-contradictory: light, powerful, good screen resolution. Unfortunately, I’ve only found one laptop that fits the bill: the just-announced Samsung 9 Series laptop:

Unfortunately, the best I could find was that it won’t be released until February 2011. No one seems to have a date for it yet, however. So I did my own sleuthing.

The big mystery to me was the CPU, which I couldn’t track down in any other laptop – it seems to be the front runner of Intel’s low-powered line of dual-core processors – a low-power i5. And that’s what led me to the answer:

Intel Core i5-2537M Processor

The CPU itself is going to be released on February 20th, 2011, so there is no way the laptop itself will be released before that. And that means I won’t be able to get my hands on one of these before AGBT 2011.

Nutsit. I guess that means I’ll be buying a macbook.

Chinese military shovel….

I came across the video for the “Chinese military shovel” today, billed as the Swiss Army knife of shovels. Frankly, it’s worth watching the video just for the music, which makes me feel like I’m watching “The Magnificent Seven” for the whole eight and a half minutes.

http://www.wimp.com/militaryshovel/

The one thing that worries me, however, is that it does have a sharp side – sharp enough to be used as an axe and to julienne potatoes during the demo! Shortly thereafter, they show someone using it as a wire twister, gripping that side to spin the shovel… and THEN using it to paddle an inflatable dinghy! These Chinese soldiers don’t seem to recognize that sharp objects and inflatable ones don’t really mix. eep!

Despite the fact that I already have access to three shovels (one for snow, one for digging soil, and one for moving gravel), I never expected that I’d want a shovel as badly as I want one of these!

American politics

First off, I am a Canadian, so I obviously am not involved in American politics in any way. However, Canadians and Americans share a lot of tv channels and the vast majority of the North American continent. We’re neighbours whether we like it or not.

Also a lot like neighbours, it’s hard not to notice when things are going on next door.

Admittedly, Canadians don’t like to be mistaken for Americans – there are a lot of American stereotypes we don’t want to be associated with, but we’re usually good natured about it. We’re all people, after all, and we can usually find common ground. We have a decent fence, which the Americans usually respect (except in the Arctic and a few odd things like the Pig war). We can respect each other’s differences, and move on.

However, the Tea Party fills me with fear. They’re exactly the kind of neighbour you don’t want. In a weird way, it makes me think I’m living next to a couple with an abusive child. Most of the time, not much happens, but you can see the way the abusive child treats his (or her) parents. There are derogatory remarks, there’s a constant shifting of blame, and a serious disrespect for the family members. And, of course, there’s the occasional screaming match when the child threatens violence.

In this case, the Tea Party is in exactly the same relationship with its political opponents – and when the violence starts, there’s a serious denial of responsibility. It’s always the parents’ fault – they shouldn’t have made the curfew that the child broke! And if they’d just given the child the money he (or she) wanted, then they wouldn’t have had to hold up the gas station…

Right. Anyhow, you get the idea. Honestly, I feel that your right to free speech ends the moment it incites violence against someone else – a crime the Tea Party is guilty of, regardless of how you want to interpret “2nd amendment remedies” or painting cross-hairs on an opponent’s face. No matter how much you want to hide behind your rhetoric, you are responsible for your actions – and your words.

Anyhow, I hope some of the more violent members of the Tea Party come to their senses after the shooting of Gabrielle Giffords. When you suggest violence, even if your hand isn’t on the gun, you are promoting a crime – as the “hate radio” broadcasters in Rwanda found out. (I suggest Ferdinand Nahimana as a case in point.) Your responsibility for your words doesn’t end when someone else does what you tell them to.

Before this goes any further, I hope my American neighbours can rein in the violent talk long before it boils over.. any further.

In the meantime, I hope Gabrielle Giffords pulls through and is able to recover and continue her work.

Mini Experiment – how many mistakes are there in the human genome?

I thought I’d try something new today. I often come up with neat little experiments that I could try but usually don’t have the time for, and since I’ve been trying to make more time to blog, I thought to myself: why not kill two birds with one stone and use the time I was going to set aside for my blog to try an actual experiment?

Yesterday, a tweet was going around about the number of errors in the human genome, i.e., the number of rare alleles incorporated into the reference sequence. (Unfortunately, I don’t know how to search the logs to figure out what that tweet was, or who sent it…) Interestingly enough, I’m in a neat and unique position to try to answer that question – or at least estimate it. I think the claim was something like 10% of SNPs called are actually caused by rare alleles in the reference genome, which I thought might be an understatement. In any case, it made me wonder about the overall number of rare alleles in the human reference genome.

Hypothesis:
Using my database of ~780M SNP calls in ~1750 NGS libraries, I should be able to estimate how many positions in the human genome (hg18) incorporate a rare allele, or perhaps an incorrect base.

Method:
Interrogate the database using simple SQL commands: using the count of observations at each position in the genome, we can see how frequently we call polymorphisms at a given location. If a location is called polymorphic at a high enough frequency (e.g., in 100% of the libraries), we clearly have an error in the reference. If it’s observed in more than 50% of the libraries, then we can at least say that the reference base is the minor allele. We’ll use the 50% mark for our estimates.

To do this, we’ll use the following SQL to establish the number of distinct libraries:

select count(distinct name) from library;

This tells us that there are 1631 libraries in total – which I happen to know are heavily biased towards transcriptomes and exomes, but we’ll come back to that. Next, we’ll make use of the total number of distinct libraries to find out how many libraries each SNP is found in:

select count(snp_id) from snpcancernormalcount where (cancer + normal >=1631) ;

I’d guessed that this query would be pretty slow, having to run over a table with about 125M entries, so we don’t want to run it too frequently. I’ll include the time taken just for entertainment. Let’s run it for 10%, 25%, 50%, 75% and 100% of the libraries, replacing 1631 with the appropriate threshold each time.
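If it helps to see it spelled out, here’s a small Python sketch that builds those per-threshold queries – the table and column names are taken from the queries above, and it only prints the SQL rather than assuming anything about how the database is connected to:

# Build the per-threshold queries described above. The table and column
# names (snpcancernormalcount, cancer, normal, snp_id) come from the post;
# this sketch just prints the SQL instead of connecting to the database.
total_libraries = 1631  # from: select count(distinct name) from library;

for fraction in (0.10, 0.25, 0.50, 0.75, 1.00):
    threshold = int(round(total_libraries * fraction))
    query = ("select count(snp_id) from snpcancernormalcount "
             f"where (cancer + normal >= {threshold});")
    print(f"{fraction:.0%} of libraries (>= {threshold}): {query}")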

Results:

Percent   Result    Search Time
100%      0         260s
75%       26        39s
50%       753       24s
25%       22,205    24s
10%       908,249   24s

Discussion:
It should come as no surprise to anyone that the number of bases found in disagreement across all 1631 libraries sequenced is zero. Many of the libraries are transcriptome or exome, which means the vast majority of bases in the genome aren’t covered at all. That also means the odds of a particular location being observed in all of our libraries is a function of the number of cell types it’s expressed in, limiting our search to housekeeping genes, for the most part. However, that means that any polymorphism found in 50%-75% of our libraries is that much more compelling, and it really does make the point that this experiment is a vast underestimate. Since we already know that more than half of the libraries in the database are transcriptome- or exome-based, anything that does turn up in 50% or more of the libraries is probably reliable – and, more importantly, confined to the ~2% of the genome that is actively transcribed.

We could repeat the same experiment limiting it only to transcriptomes, exomes or even complete genomes, but we’ll save that for another day.

In any case, keep in mind that this is a vast understatement of the true result: as mentioned above, we’re likely only polling about 2% of the genome in this experiment, so the raw counts are really about 1/50th of the true number. (Actually, it would be somewhat more complex, as the exons have different selective pressures on them than intergenic DNA does, but again, for this experiment, we’ll ignore that.) Thus, we can estimate the number of “rare allele” positions in the reference to be somewhere around (26 × 50 = ) 1,300 to (753 × 50 = ) 37,650, with the expected number being towards the higher end of that range.

Of course, 37,650 “mistakes” among the roughly 2.8 billion called bases in the human genome is tiny – on the order of 0.001% of the total – meaning the hg18 reference is well above 99.9% accurate by this measure. Even allowing that the estimate itself is likely an understatement, it does suggest that we can trust the human genome (hg18) to be pretty reliable for single base substitutions. (Indels would be another matter entirely.)
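For completeness, here’s the back-of-the-envelope arithmetic as a tiny Python snippet – the 2% polled fraction and the ~2.8 billion called bases are the assumptions discussed above, not measured values:

# Scale the raw query counts into genome-wide estimates, using the
# assumptions from the text (not measured values).
count_75pct = 26      # positions polymorphic in >= 75% of libraries
count_50pct = 753     # positions polymorphic in >= 50% of libraries

polled_fraction = 0.02          # assume ~2% of the genome is being polled
scale = 1 / polled_fraction     # so multiply the raw counts by ~50

low_estimate = count_75pct * scale    # 1,300
high_estimate = count_50pct * scale   # 37,650

called_bases = 2.8e9   # approximate number of called bases in hg18
print(f"estimated rare-allele positions: {low_estimate:,.0f} to {high_estimate:,.0f}")
print(f"upper-bound error rate: {high_estimate / called_bases:.4%} of called bases")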

As a final note on the result, it’s a neat concept to be able to estimate the accuracy of the reference genome using nothing but an aggregate table from the variation database. As the number of libraries grows, it’s likely that we’ll be able to re-calculate this much more accurately in the future. In particular, as the cost of sequencing continues to drop, it’s likely we’ll see more whole genomes appear in the database, which will allow us to converge on the correct answer.

On the subject of the timing, there’s a steady progression of speed improvements as each query was run in order. This has to do with the database itself, which caches tables and indexes as they’re used. The table in use here was probably not cached at all for the first run, so reading it from disk took just over four minutes. Once the table had been pulled into memory, all of the subsequent runs were dramatically faster. I didn’t make use of prepared statements or temporary tables, since this was a one-off “experiment”, but that could have been done to get better performance.