Yes, I am one of the luckiest grad students around. One of the things that very few bioinformaticians get to work with during their PhD is scalability… and I don’t just mean moving from their own project to a larger dataset. I mean, industrial size scalability.
Interestingly enough, I’m now working on my second “scalable” project. My first one, FindPeaks, was reasonably scalable – I’ve yet to find a dataset that it couldn’t handle – which includes some pretty monstrous illumina data sets. We’ve even forced through individual peaks will several millions of reads, which is not a bad test of how scalable the app is. However, even that pales in comparison to my current project.
This second project is a database rather than an application, which adds yet another level of scalability. We’ve gone from single datasets to 10’s of datasets to 100’s of data sets and now we’re in storing the condensed information of about 1,700 libraries of next generation sequencing. That’s no small feat!
Of course, that’s not where it ends. In the next year, I can see that easily doubling or more, and heading rapidly towards the 10,000 library level. What’s interesting about this is that you start to tax the hardware pretty hard, and not evenly. This size of database is now rather far beyond what most grad students are likely to encounter in any project, so it’s pure gain for me. However, even then, the database won’t stop growing. At the rate that sequencing has grown, there’s likely to be a lot more sequencing done next year than this year, and so forth. 100,000 genomes sequenced is not really out of the question for a large sequencing centre in the lifespan of this database.
Just imagining where it’s going to go, you can think about how many SNVs will you find in 100,000 genomes. (A rough estimate is somewhere around 1,000,000 per dataset x 100,000 genomes = 100 billion records.) Somehow, that number is pretty daunting for any database. There will undoubtedly have to be purges of low quality information, or further division of the tables at some point.
Regardless of where the end point is, when you contemplate data sets this large, you have to start questioning everything. Was the database designed well? Indexes, clustering, triggers? Did you pick the right database? There’s no end of places you can look to improve performance. However, it all comes down to two things: experience and money.
Experience is a wonderful thing, it gives you a framework to work from. The biggest database I’d worked with before this was in my days working for the University of Waterloo’s Student Information Systems Project, where I was building a reporting database to allow staff to instantly pull up records for any of the thousands of students. No matter how many students the university can handle, it can’t compete with the number of variations across 100,000 genomes. Of course, the university was willing to throw some money (and a db admin and myself) at the problem, so we came up with solutions.
Experience is not cheap, however, particularly when you have to bring it in. Resourcefulness, however is cheap, so I’ve recently been turning to the people who work on postgres. They have a great channel on IRC (Freenode – #postgres), where you can talk with experts on the subject. They helped debug a trigger, clued me in to clustering, and provided me with a set of linux tools I’d never seen before. They became my proxy for experience, and helped me bootstrap myself to where I’m comfortable with how the database works. That takes care of experience.
So, what’s left is money. And, that’s where the commitment to the project comes in. So far, I’ve been fortunate to get some great hardware, most of which comes from cast-offs of other projects, but is still pretty high quality. All I can say at this point, is that I’m glad I work where I do, because I think I’m going to be able to rack up some pretty cool experiences with “large datasets” and big hardware, which was exactly what I put on my MSFHR scholarship application as my educational goal. The specific header was “Develop new competencies in mining large databases”.
So there you have it – I get to be an incredibly lucky grad student, with the resources of an entire genome science centre behind me, and thousands of genomes to analyse… and I’ve even managed to fit in the goals that I told proposed to my funding source.
The next couple months will be a lot of fun, as I start to wrap this project up.
For those of you in Vancouver, who’d like to know more, I’ll be presenting some of this work at Vanbug tomorrow as the student speaker. After that, I get to start making sense of the data… now THAT should be fun. If my luck holds, that should make a few good papers, and chapters for my thesis.