My SNP database is now up and running, with the first imports of data working well. That’s a huge improvement over v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks and better indexes, and I’ve even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1 GB of RAM.)
So, what’s most interesting to me? Some of the early stats:
11,545,499 SNPs in total, made from:
- 870,549 SNP calls from the 1000 Genomes Project
- 11,361,676 SNPs from dbSNP
So, some quick math:
11,361,676 + 870,549 - 11,545,499 = 686,726 SNPs that overlapped between the 1000 Genomes Project (34 data sets) and the dbSNP calls.
That leaves 870,549 - 686,726 = 183,823 new SNPs, so a whopping 1.6% of the SNPs in my database were not previously annotated in dbSNP.
I suppose that’s not a bad thing, since those samples were all “normals”, and it’s good to get some sense as to how big dbSNP really is.
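As a sanity check on the quick math above, here is a minimal Python sketch of the same calculation. The counts are the ones quoted in this post; the variable names are just mine for illustration and have nothing to do with the database schema or API.

```python
# Counts quoted in the post (variable names are illustrative only).
total_snps = 11_545_499        # distinct SNPs in the database
from_1000_genomes = 870_549    # SNP calls imported from the 1000 Genomes data
from_dbsnp = 11_361_676        # SNPs imported from dbSNP

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
overlap = from_dbsnp + from_1000_genomes - total_snps

# 1000 Genomes calls that were not already in dbSNP
novel = from_1000_genomes - overlap

# Share of the whole database made up of those novel SNPs
novel_pct = 100 * novel / total_snps

print(f"overlap: {overlap:,}")
print(f"novel:   {novel:,}")
print(f"novel share of database: {novel_pct:.1f}%")
```

The novel count is also just the total minus the dbSNP count, which is a handy cross-check that the numbers are consistent.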
Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets….
This is starting to become fun.