Mini Experiment – how many mistakes are there in the human genome?

I thought I’d try something new today. I often come up with neat little experiments that I could try but usually don’t have the time for, and since I’ve been trying to make more time to blog, I thought to myself: why not kill two birds with one stone and use the time I was going to set aside for my blog to try an actual experiment?

Yesterday, a tweet was going around about the number of errors in the human genome, i.e., the number of rare alleles incorporated into the reference sequence. (Unfortunately, I don’t know how to search the logs to figure out what that tweet was, or who sent it…) Interestingly enough, I’m in a neat and unique place to try to answer that question – or at least estimate it. I think the claim was something like 10% of SNPs called are actually caused by rare alleles in the reference genome, which I thought might be an understatement. In any case, it made me wonder about the overall number of rare alleles in the reference genome.

Hypothesis:
Using my database of ~780M SNP calls in ~1750 NGS libraries, I should be able to estimate how many positions in the human reference genome (hg18) incorporate the rare allele, or perhaps outright incorrect bases.

Method:
Interrogate the database using simple SQL commands. Using the count of observations at each position in the genome, we can see how frequently we call a polymorphism at a given location. If a location is called a polymorphism at a high enough frequency (e.g., in 100% of the libraries), the reference base is clearly an error. If it’s called in more than 50% of the libraries, then we can at least say that the reference base is the minor allele. We’ll use the 50% mark for our estimates.

To do this, we’ll use the following SQL to establish the number of distinct libraries:

select count(distinct name) from library;

This tells us that there are 1631 libraries in total – which I happen to know are heavily biased towards transcriptomes and exomes, but we’ll talk about that in a bit. Next, we’ll make use of the total number of distinct libraries to find out in how many libraries each SNP is found:

select count(snp_id) from snpcancernormalcount where (cancer + normal >=1631) ;

I’d guessed that this query would be pretty slow, having to run over a table with about 125M entries, so we don’t want to run it too frequently. I’ll include the time taken just for entertainment. Let’s run it for 10%, 25%, 50%, 75% and 100% of the libraries, replacing 1631 with the appropriate number each time.
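As an aside, the whole sweep could also be done in a single pass over the table with conditional counts – a rough sketch below, against the same snpcancernormalcount table, with the thresholds simply being 10%, 25%, 50%, 75% and 100% of the 1631 libraries, rounded up:

-- sketch: count SNPs at every threshold in one scan instead of five separate queries
select
    count(case when cancer + normal >= 1631 then 1 end) as pct_100,
    count(case when cancer + normal >= 1224 then 1 end) as pct_75,
    count(case when cancer + normal >=  816 then 1 end) as pct_50,
    count(case when cancer + normal >=  408 then 1 end) as pct_25,
    count(case when cancer + normal >=  164 then 1 end) as pct_10
from snpcancernormalcount;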

Results:

Percent    Result     Search Time
100%       0          260s
75%        26         39s
50%        753        24s
25%        22,205     24s
10%        908,249    24s

Discussion:
It should come as no surprise to anyone that the number of bases found in disagreement in all 1631 libraries sequenced is zero. Many of the libraries sequenced are transcriptome or exome, which means the vast majority of bases in the genome aren’t covered. That also means the odds of a particular location being found in all of our libraries is a function of the number of cell types it’s expressed in, limiting our search to housekeeping genes, for the most part. However, that means any polymorphism found in 50%–75% of our libraries is that much more compelling, and it really does make the point that this experiment is a vast underestimate. Since we already know that more than half of the libraries in the database are transcriptome or exome based, anything that does turn up in 50% or more of the libraries is probably reliable and, more importantly, confined to the ~2% of the genome that is actively transcribed.

We could repeat the same experiment limiting it only to transcriptomes, exomes, or even complete genomes, but we’ll save that for another day.
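If we did, the query might look something like the hypothetical sketch below – hypothetical because it assumes a per-library call table (snp_call, with snp_id and library_id columns) and a type column on the library table, neither of which appears in the queries above:

-- hypothetical sketch: count SNPs seen in at least half of the whole-genome libraries only
select count(*)
from (
    select sc.snp_id
    from snp_call sc
    join library l on l.id = sc.library_id
    where l.type = 'genome'
    group by sc.snp_id
    having count(distinct sc.library_id) >=
           (select count(*) * 0.5 from library where type = 'genome')
) as frequent_snps;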

In any case, keep in mind that this is a vast understatement of the true result: as mentioned above, we’re likely only polling about 2% of the genome in this experiment, so these counts are really only about 1/50th of the true number. (Actually, it would be somewhat more complex, as the exons have different selective pressures on them than intergenic DNA does, but again, for this experiment, we’ll ignore that.) Thus, we can estimate the number of “rare allele” polymorphisms in the reference to be somewhere around (26×50 = ) 1,300 to (753×50 = ) 37,650, with the expected number being towards the higher end of that range.
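Just to spell out the arithmetic, that’s a straight multiplication – it could even be done in the same SQL session, where the ×50 factor is nothing more than the rough “2% of the genome polled” correction assumed above:

-- back-of-the-envelope scaling of the 75% and 50% threshold counts
select 26 * 50 as low_estimate, 753 * 50 as high_estimate;   -- 1300 and 37650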

Of course, 37,650 “mistakes” among the 2.8 million called bases in the human genome is pretty small. It’s only 1.3% of the total, meaning that the hg18 reference is probably somewhere in the 98.7% accuracy range. And since the 1.3% is likely an understatement, the 98.7% is probably an overstatement. Still, it does suggest that we can trust the human genome (hg18) to be pretty reliable for single base substitutions. (Indels would be another matter entirely.)

As a final note on the result, it’s a neat concept to be able to estimate the accuracy of the reference genome using nothing but an aggregate table from the variation database. As the number of libraries grows, it’s likely that we’ll be able to re-calculate this much more accurately in the future. In particular, as the cost of sequencing continues to drop, it’s likely we’ll see more whole genomes begin to appear in the database, which will allow us to converge on the correct answer.

On the subject of the timing, there’s a steady progression of speed improvements as each query was run in order. This has to do with the database itself, which caches tables and indexes as they’re used. The table in use here was probably not cached at all for the first run, so reading it from disk took just over 4 minutes. Once the table had been pulled into memory, all of the subsequent runs were dramatically faster. I didn’t make use of prepared statements or temporary tables, since this was a one-off “experiment”, but that could have been done to get better performance.
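If I were going to run this sweep regularly, one option would be a prepared statement, so the threshold query only has to be parsed and planned once – something like the following, in PostgreSQL-flavoured syntax, reusing the table and columns from the queries above:

-- prepare the threshold query once, then execute it with different cut-offs
prepare snp_threshold_count (integer) as
    select count(snp_id) from snpcancernormalcount where (cancer + normal >= $1);

execute snp_threshold_count(1631);  -- 100% of the 1631 libraries
execute snp_threshold_count(816);   -- 50%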

2 thoughts on “Mini Experiment – how many mistakes are there in the human genome?”

  1. Like most good Medical Scientists, you are willing to make false assumptions about your premises and hurry to get to the Math to prove your point.
    If you take a step back, you will see that all Human Genomes are based on assumption of 23 chromosomes from Dad and 23 from Mom. The truth is that is the mean average; but, some people have 40 to 60, while other men may have 1-3 y chromosomes along with one x. But, not women, why?
    No red blood cells have any D.N.A. and RBC are 45% of all cells, why?
    Mitochondrial DNA is 100% from Mom, and nothing from Dad, why?
    And, are we supposed to forget about hormones, enzymes, and good bacteria (11 cells for every Human cell), and just assume they are “along for the ride”? Why is the Nat.’l Inst. of Health spending Billions to have it “adjusted”?
    The Human Genome is a very, very small contribution to our knowledge of the Human Being, if any.

    • Wow, Mike…

      Thanks for the rant, but I’m afraid you’re just flat out wrong. I rather suspect you have no idea what you’re talking about.

      “Human Genome” is really a term that is used in three senses: as the collection of genomes of all human beings, as the reference “human genome” (often treated as an “average” of many human beings), or as the individual genome of a specific human being. In none of these cases, however, do your comments make any sense.

      “Human genomes” are in no way based on assumptions of 23 chromosomes from each parent. Nor do people have 40-60 chromosomes. Cancer cells do, but they are really only borderline human anymore. While it is possible to have multiple copies of the Y chromosome, it’s not a common condition – but nor does it change anything about the “human genome”. A rough analogy would be if a book has two copies of page 1. That doesn’t mean that the whole book is wrong, just that there are two copies of a single page. It does not invalidate any assumptions in a meaningful way.

      Mature red blood cells don’t have nuclei, but that has nothing to do with anything… I fail to see why you raise the point at all.

      Likewise, mitochondrial DNA is from the mother – but what does that have to do with anything?

      Your point about the NIH spending billions to have it “adjusted” is really an obtuse and misleading comment. As we improve the technology used for sequencing, we improve our understanding about what information exists in the common reference genome, and money is being spent to understand, improve, correct and interpret the genome.

      No one has ever said that the Human Genome Project is the sum total of human knowledge… so what on earth is the point of your rant?
