Diaspora with Apache2

Update: Note that apache2 doesn’t actually work – the diaspora interface is apparently too tightly coupled to the “thin” webserver to run under apache2. It looks like it works, but you can’t connect to friends.

Update 2: It does seem to be working again – using an updated version from git. (sudo service apache2 stop; git pull; bundle install; rake; sudo service apache2 start)
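In case it’s useful, that one-liner expands to something like the following (assuming a stock Ubuntu install and that you run it from the diaspora checkout):

  cd /var/www/diaspora            # or wherever your diaspora checkout lives
  sudo service apache2 stop       # take the site down while updating
  git pull                        # grab the latest diaspora code
  bundle install                  # pull in any new or updated gems
  rake                            # run the default rake task
  sudo service apache2 start      # bring the site back up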

If you’re wondering how I got diaspora working with apache2, I followed the instructions here:

https://help.ubuntu.com/community/RubyOnRails

And used the following in my apache2.conf file:

  LoadModule passenger_module /usr/lib/ruby/gems/1.8/gems/passenger-2.2.15/ext/apache2/mod_passenger.so
  PassengerRoot /usr/lib/ruby/gems/1.8/gems/passenger-2.2.15
  PassengerRuby /usr/bin/ruby1.8
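Note that those paths are specific to the passenger gem version I happened to have installed – if yours is different, passenger will print the correct lines for you:

  gem list passenger                      # check which passenger version is installed
  sudo passenger-install-apache2-module   # prints the LoadModule/PassengerRoot/PassengerRuby lines to paste into apache2.conf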

And used the following in my sites-available/default:

<VirtualHost *:80>
    DocumentRoot /var/www/diaspora/public
    ServerName diaspora.fejes.ca
    ServerPath /diaspora
    RewriteEngine On
    RewriteRule ^(/diaspora/.*) /www/diaspora
    RailsBaseURI /diaspora
    Alias /diaspora/ "/var/www/diaspora/public"
    <Directory /var/www/diaspora/public>
      AllowOverride all
      Options -MultiViews
    </Directory>
</VirtualHost>
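If apache2 refuses to start, or the rewrites seem to be ignored, it’s worth checking that mod_rewrite is actually enabled and that the config parses. On a standard Ubuntu setup, something like:

  sudo a2enmod rewrite          # enable mod_rewrite for the RewriteEngine/RewriteRule lines
  sudo apache2ctl configtest    # check the config for syntax errors
  sudo service apache2 restart  # restart so the new virtual host takes effect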

Happy Diaspora-ing.

Scalability

Yes, I am one of the luckiest grad students around.  One of the things that very few bioinformaticians get to work with during their PhD is scalability…  and I don’t just mean moving from their own project to a larger dataset.  I mean, industrial size scalability.

Interestingly enough, I’m now working on my second “scalable” project.  My first one, FindPeaks, was reasonably scalable – I’ve yet to find a dataset it couldn’t handle, including some pretty monstrous Illumina data sets.  We’ve even forced through individual peaks with several million reads, which is not a bad test of how scalable the app is.  However, even that pales in comparison to my current project.

This second project is a database rather than an application, which adds yet another level of scalability.  We’ve gone from single datasets to tens of datasets to hundreds of datasets, and now we’re storing the condensed information from about 1,700 libraries of next-generation sequencing.  That’s no small feat!

Of course, that’s not where it ends.  In the next year, I can see that easily doubling or more, and heading rapidly towards the 10,000-library level.  What’s interesting is that you start to tax the hardware pretty hard, and not evenly.  A database of this size is already far beyond what most grad students are likely to encounter in any project, so it’s pure gain for me.  Even then, the database won’t stop growing.  At the rate sequencing has grown, there’s likely to be a lot more sequencing done next year than this year, and so on.  100,000 genomes sequenced is not out of the question for a large sequencing centre within the lifespan of this database.

Just imagining where it’s going to go, think about how many SNVs you’d find in 100,000 genomes. (A rough estimate is somewhere around 1,000,000 per dataset x 100,000 genomes = 100 billion records.)  That number is pretty daunting for any database.  There will undoubtedly have to be purges of low-quality information, or further division of the tables, at some point.

Regardless of where the end point is, when you contemplate data sets this large, you have to start questioning everything. Was the database designed well?  Indexes, clustering, triggers? Did you pick the right database? There’s no end of places you can look to improve performance.  However, it all comes down to two things: experience and money.
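Just to give a flavour of the kind of tuning I mean – with completely made-up table and column names, not the real schema – the difference between a sequential scan and an indexed lookup can be dramatic once a table holds billions of rows:

  # hypothetical SNV table; names are invented purely for illustration
  psql -d snv_db -c "CREATE INDEX idx_snv_chrom_pos ON snv (chromosome, position);"
  # check that the planner actually uses the index for a typical range query
  psql -d snv_db -c "EXPLAIN ANALYZE SELECT * FROM snv WHERE chromosome = '1' AND position BETWEEN 100000 AND 200000;"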

Experience is a wonderful thing: it gives you a framework to work from.  The biggest database I’d worked with before this was in my days with the University of Waterloo’s Student Information Systems Project, where I built a reporting database that let staff instantly pull up records for any of thousands of students.  But no matter how many students the university handles, it can’t compete with the number of variations across 100,000 genomes.  Of course, the university was willing to throw some money (and a db admin and myself) at the problem, so we came up with solutions.

Experience is not cheap, however, particularly when you have to bring it in.  Resourcefulness, on the other hand, is cheap, so I’ve recently been turning to the people who work on postgres.  They have a great channel on IRC (Freenode – #postgres), where you can talk with experts on the subject.  They helped me debug a trigger, clued me in to clustering, and introduced me to a set of linux tools I’d never seen before.  They became my proxy for experience, and helped me bootstrap myself to the point where I’m comfortable with how the database works.  That takes care of experience.
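For anyone who hasn’t run into it, clustering in postgres means physically reordering a table to match one of its indexes, which can make a big difference for range queries.  A rough sketch, again with made-up names, and assuming postgres 8.4 or later for the USING syntax:

  # physically reorder the (hypothetical) snv table along the index created above
  psql -d snv_db -c "CLUSTER snv USING idx_snv_chrom_pos;"
  # CLUSTER is a one-off reordering, so it needs re-running as the table grows
  psql -d snv_db -c "ANALYZE snv;"   # refresh the planner statistics afterwards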

So, what’s left is money.  And that’s where commitment to the project comes in.  So far, I’ve been fortunate to get some great hardware, most of which comes from the cast-offs of other projects but is still pretty high quality.  All I can say at this point is that I’m glad I work where I do, because I think I’m going to be able to rack up some pretty cool experience with “large datasets” and big hardware, which is exactly what I put on my MSFHR scholarship application as my educational goal.  The specific header was “Develop new competencies in mining large databases”.

So there you have it – I get to be an incredibly lucky grad student, with the resources of an entire genome science centre behind me, and thousands of genomes to analyse… and I’ve even managed to fit in the goals that I proposed to my funding source.

The next couple months will be a lot of fun, as I start to wrap this project up.

For those of you in Vancouver, who’d like to know more, I’ll be presenting some of this work at Vanbug tomorrow as the student speaker.  After that, I get to start making sense of the data…  now THAT should be fun.  If my luck holds, that should make a few good papers, and chapters for my thesis.

Frustration with Nature Networks

About a month ago, I volunteered to be a Nature Network blogging guinea pig, testing out the site improvements and giving my feedback.  After sending in comments on what worked (a couple of things) and what didn’t (most things), I thought I was doing my part to help make Nature Blogs into something worth using.  Turns out my feedback wasn’t particularly helpful after all.

Yes, this is a whiney post.  I found out today, via scienceblogging.org that they’ve rolled out the changes to people – without fixing the things I told them didn’t work.  Why did I bother?  It’s not like I could have been using my time more efficiently to find cures for cancer…. oh wait, that’s exactly what I’d have been doing otherwise!  DOH.

Anyhow, here I am, no better off afterwards.  The interface for custom banners completely fails for me, there’s no “republish button”, and I still have none of the things I’ve been pining for:

  1. Custom banners that work (they don’t),
  2. anonymous comments (they never will),
  3. a tag cloud (bizarrely, I can see it, but the people who read my blog can’t) and
  4. stats on who reads my blog, and what they find interesting (the only stats they provide are those on how often I write on my blog, which is more like a guilt trip than useful information).

So, I think I’m slowly coming to the conclusion that Nature Blogs subscribes to the theory that you can’t always get what you want, but if I try real hard, they’ll promise me what I need.

In the name of encouraging myself to regain my passion for blogging, I’m just going to strike out on my own, again, for a while, and see what happens.