I thought I’d try something new today. I often come up with neat little experiments that I could try but usually don’t have the time for, and since I’ve been trying to make more time to blog, I thought to myself: why not kill two birds with one stone and use the time I was going to set aside for my blog to try an actual experiment?
Yesterday, a tweet was going around about the number of errors in the human genome, i.e., the number of rare alleles incorporated into the reference sequence. (Unfortunately, I don’t know how to search the logs to figure out what that tweet was, or who sent it…) Interestingly enough, I’m in a neat and unique place to try to answer that question – or at least estimate it. I think the claim was something like 10% of SNPs called are actually caused by rare alleles in the reference genome, which I thought might be an understatement. In any case, it made me wonder about the overall number of rare alleles in the human reference genome.
Using my database of ~780M SNP calls in ~1750 NGS libraries, I should be able to estimate how many positions in the human reference genome (hg18) incorporate a rare allele, or perhaps an outright incorrect base.
Interrogating the database takes nothing more than simple SQL: using the count of observations at each position in the genome, we can see how frequently we call a polymorphism at a given location. If a polymorphism is observed at high enough frequency (e.g., in 100% of the libraries), the reference base is clearly an error. If it’s observed in more than 50% of libraries, then we can at least say that the reference base is the minor allele. We’ll use the 50% mark for our estimates.
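The decision rule above can be sketched as a tiny classifier. This is just an illustration of the logic (the function name and its labels are mine, not anything from the database):

```python
def classify_reference_base(libraries_with_call: int, total_libraries: int) -> str:
    """Classify the reference base at a position from how often a
    polymorphism is called there across sequencing libraries."""
    fraction = libraries_with_call / total_libraries
    if fraction == 1.0:
        # Every single library disagrees with the reference base.
        return "reference error"
    if fraction > 0.5:
        # More libraries disagree than agree: the reference carries the minor allele.
        return "reference is minor allele"
    return "reference is major allele"

print(classify_reference_base(1631, 1631))  # every library disagrees
print(classify_reference_base(900, 1631))   # a bit over half disagree
```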
To do this, we’ll use the following SQL to establish the number of distinct libraries:
select count(distinct name) from library;
This tells us that there are 1631 libraries in total – which I happen to know are heavily biased towards transcriptomes and exomes, but we’ll come back to that in a bit. Next, we’ll use that total to find out how many libraries each SNP is found in:
select count(snp_id) from snpcancernormalcount where (cancer + normal >=1631) ;
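For anyone who wants to play with the same logic without the real database, here’s a minimal sketch using Python’s sqlite3 on toy data. The table and column names follow the post, but the rows are invented purely for illustration:

```python
import sqlite3

# Toy stand-in for the real variation database (names follow the post;
# the data here is invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snpcancernormalcount (snp_id INTEGER, cancer INTEGER, normal INTEGER)"
)
conn.executemany(
    "INSERT INTO snpcancernormalcount VALUES (?, ?, ?)",
    [
        (1, 900, 800),  # called in 1700 libraries: passes every threshold
        (2, 500, 400),  # called in 900 libraries: passes 25% and 50%, fails 75%
        (3, 10, 5),     # called in 15 libraries: a genuinely rare call
    ],
)

total_libraries = 1631
for fraction in (0.25, 0.50, 0.75, 1.00):
    threshold = int(total_libraries * fraction)
    (count,) = conn.execute(
        "SELECT count(snp_id) FROM snpcancernormalcount "
        "WHERE (cancer + normal) >= ?",
        (threshold,),
    ).fetchone()
    print(f"{int(fraction * 100)}% ({threshold} libraries): {count} positions")
```

The parameterized `?` placeholder is what lets us sweep the 25%/50%/75%/100% thresholds without rewriting the query each time.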
I’d guessed that this query would be pretty slow, having to run over a table with about 125M entries, so we don’t want to run it too frequently. I’ll include the time taken just for entertainment. Let’s run it for 25%, 50%, 75% and 100% of the libraries, replacing 1631 with the appropriate number each time.
It should come as no surprise to anyone that the number of bases found with disagreement in all 1631 libraries sequenced is zero. Many of the libraries are transcriptome or exome based, which means the vast majority of bases in the genome aren’t covered at all. It also means the odds of a particular location being observed in all of our libraries is a function of how many cell types express it, limiting our search to housekeeping genes, for the most part. However, that makes any polymorphism found in 50%–75% of our libraries that much more compelling, and really does make the point that this experiment is a vast underestimate. Since we already know that more than half of the libraries in the database are transcriptome or exome based, anything that does turn up in 50% or more of the libraries is probably reliable, and, more importantly, confined to the ~2% of the genome that is actively transcribed.
We could repeat the same experiment limiting it to transcriptomes, exomes, or even complete genomes, but we’ll save that for another day.
In any case, keep in mind that this is a vast underestimate of the true result: as mentioned above, we’re likely only polling about 2% of the genome in this experiment, so the raw counts are really about 1/50th of the true number. (Actually, it’s somewhat more complex than that, as exons are under different selective pressures than intergenic DNA, but again, for this experiment, we’ll ignore that.) Thus, we can estimate the number of “rare allele” positions in the reference to be somewhere around (26×50 = ) 1,300 to (753×50 = ) 37,650, with the expected number towards the higher end of that range.
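The back-of-the-envelope extrapolation can be written out explicitly. The 26 and 753 are the low and high raw counts from the queries, and the 2% transcribed fraction is the approximation driving the ×50 scaling:

```python
# Extrapolation from the post: the libraries only poll roughly 2% of the
# genome, so scale the raw query counts up by ~50x.
coverage_fraction = 0.02           # approximate transcribed fraction of the genome
scale = 1 / coverage_fraction      # = 50

low_count, high_count = 26, 753    # raw counts from the threshold queries
low_estimate = low_count * scale
high_estimate = high_count * scale

print(f"estimated rare-allele positions: {low_estimate:,.0f} to {high_estimate:,.0f}")
# -> estimated rare-allele positions: 1,300 to 37,650
```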
Of course, 37,650 “mistakes” among the roughly 2.8 billion called bases in the human reference genome is pretty small – only about 0.0013% of the total, meaning that the hg18 assembly is probably somewhere in the 99.99% accuracy range. Then again, since the 0.0013% is likely an underestimate, the accuracy figure is probably an overstatement. Still, it suggests that we can trust the human reference genome (hg18) to be pretty reliable for single base substitutions. (Indels would be another matter entirely.)
As a final note on the result, it’s a neat concept to be able to estimate the accuracy of the reference genome using nothing but an aggregate table from the variation database. As the number of libraries grows, it’s likely that we’ll be able to redo this calculation much more accurately in the future. In particular, as the cost of sequencing continues to drop, it’s likely we’ll see more whole genomes appear in the database, which will allow us to converge on the correct answer.
On the subject of the timing, there’s a steady progression of speed improvements as each query was run in order. This has to do with the database itself, which caches tables and indexes as they’re used. The table in use here was probably not cached at all for the first run, so reading it from disk took just over 4 minutes. Once the table had been pulled into memory, all of the subsequent runs were dramatically faster. I didn’t make use of prepared statements or temporary tables, since this was a one-off “experiment”, but that could have been done to get better performance.
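To illustrate the prepared-statement idea, here’s a sketch in Python with sqlite3 on a toy version of the table. The real database wasn’t sqlite and a toy in-memory table won’t reproduce the disk-cache effect, but the pattern of reusing one parameterized statement across thresholds is the same:

```python
import sqlite3
import time

# Toy version of the table; names follow the post, data is synthetic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snpcancernormalcount (snp_id INTEGER, cancer INTEGER, normal INTEGER)"
)
conn.executemany(
    "INSERT INTO snpcancernormalcount VALUES (?, ?, ?)",
    [(i, i % 1000, i % 700) for i in range(50_000)],
)

# One parameterized statement, executed with different thresholds: the
# driver can reuse the parsed statement instead of re-parsing it each time.
query = "SELECT count(snp_id) FROM snpcancernormalcount WHERE (cancer + normal) >= ?"
for threshold in (407, 815, 1223, 1631):
    start = time.perf_counter()
    (count,) = conn.execute(query, (threshold,)).fetchone()
    elapsed = time.perf_counter() - start
    print(f"threshold {threshold}: {count} positions in {elapsed * 1000:.2f} ms")
```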