New developments…

I haven’t been blogging lately because I’ve managed to convince myself that blogging was taking time away from other things I need to be focusing on. The most important of those is getting a paper done, which will be the backbone of my thesis. Clearly, it’s high priority; however, it’s becoming harder and harder not to talk about things that are going on, so I thought I’d interrupt my “non-blogging” with a few quick updates. I have a whole list of topics I’m dying to write about but just haven’t found the time for yet – trust me, they will get done. Moving along…

First, I’m INCREDIBLY happy that I’ve been invited to attend and blog the Copenhagenomics 2011 conference (June 9/10, 2011).  I’m not being paid, but the organizers are supporting my travel and hotel (and presumably waiving the conference fee), so that I can do it.  That means, of course, that I’ll be working hard to match or exceed what I was able to do for AGBT 2011. And, of course, I’ll be taking a few days to see some of Denmark and presumably do some photography.  Travel, photography, science and blogging!  What a week that’ll be!

Anyhow, this invitation came just before the wonderful editorial in Nature Methods, in which social media is discussed as a positive channel of scientific communication for conference organizers. I have much to say on this issue, but I don’t want to get into it at the moment – it will have to wait until I’m a few figures further along in my paper. Needless to say, I believe very strongly in it, and I think conferences can get a lot of value out of supporting bloggers.

Moving along (again), I will also be traveling in June to give an invited talk – my first outside of Vancouver. The details haven’t been arranged yet, but once things have settled down, I’ll definitely share more information.

And, a little closer to home, I’ve been invited to sit on a panel for VanBug (the Vancouver Bioinformatics Users Group) on their “Careers in Bioinformatics” night (April 14th).  Apparently, my bioinformatics start-up credentials are still good and I’ve been told I’m an interesting speaker.  (In case you’re wondering, I will do my best to avoid suggesting a career as a permanent graduate student…) Of course, I’m looking forward to sitting on a panel with the other speakers: Dr. Inanc Birol, Dr. Ben Good and Dr. Phil Hieter – all of whom are better speakers than I am.  I’ve had the opportunity to interact with all of them at one point or another and found them to be fascinating people. In fact, I took my very first genomics course with Dr. Hieter nearly a decade ago, in an interesting twist of fate.  (You can find the poster for the event here.)

Even with just the few things I’ve mentioned above, the next few months should be busy, but I’m really excited. Not only can I start to see the proverbial light at the end of the grad school tunnel, I’m also starting to get excited about what comes after it. It’s hard not to want to work when you can see the results taking shape in front of your eyes. If only there were a few more hours in the day!

 

Pet Peeve

OK, my pet peeve is Microsoft’s repeated abuse of terminology. Today, I’m annoyed with their use of the word “cloud”, because Microsoft just doesn’t get it… again.

This is how Wikipedia explains cloud computing:

Cloud computing is computation, software, data access, and storage services that do not require end-user knowledge of the physical location and configuration of the system that delivers the services.

Microsoft, it seems, doesn’t bother to explain what cloud computing is, despite having a full web site – packed with white papers and video clips – devoted to it. However, the example they use in their commercial is using the cloud to get a specific file from your computer at home, instead of having someone named Claude go to your house to get it for you.

How on earth is that an example of cloud computing? The file is in a defined location, and you’d better damn well know where your own computer is if Claude is going to bring it to you… I think Microsoft has just reinvented ssh/scp (available since 1995) or rlogin (available since 1986) and decided to call it “the cloud.”
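In fact, pulling a file off a machine whose address you already know has been a one-liner for a decade and a half. Here’s what that looks like with scp – the hostname and path are made up, of course:

scp yourname@home.example.com:Documents/report.pdf .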

It’s hardly the first time Microsoft has grossly misused terms for their own marketing purposes (e.g. “I’m a PC” or “Open XML”), so you’d think I’d be used to it by now. Unfortunately, it still grates on my nerves every time.

Just in case you think I’m the first to complain about this (I’m not), here are a few others – and one possible reason why.

Bullying.

I just saw the video that’s been doing the rounds this morning, where a bullied kid has had enough and stands up for himself. I’ll include a link here, but really, I don’t suggest watching it. It’s not pretty.

However, this video provoked a strong response from me. Frankly, as a victim of high school bullying myself, I have a lot of vehement opinions on the subject. Even thinking about those who bullied me in high school can still upset me… and I’ve been out of high school for a long time.

Frankly, I only see two reasons why this video is getting any airtime:

  1. The bully has his ankle broken by the tormented kid.
  2. The principal’s (insane) comments that both boys are equally at fault and the tormented boy should have found an alternate resolution.

Frankly, I have only one thing to say to the principal: You’re way out of your league.  If you think bullies respond to anything but retribution, you have no business dealing with children.

Otherwise, my overwhelming response is: Good for the kid who stood up for himself! I understand violence is not the answer to anything – which is why I was beat on as a kid.  If you don’t defend yourself, the bullies know they’re able to torment with impunity.

As a general rule, I foolishly listened to the people in my life who spouted stuff like the principal in the video. Consequently, I didn’t defend myself and the tormenting never ended, putting me through six years of hell. The one time I had had enough, I wasn’t able to win the fight, so it wasn’t a deterrent. I wasn’t big enough back then to pile-drive my tormentors… but oh, I wished I had been. It also didn’t help that the outcome of the fight was for me to be punished equally with my tormentor.

To the bullied kid: Thank you – in the same circumstances, I wish I could have done something other than take the punches. I’m glad you had the strength to defend yourself. I don’t condone violence, but I understand, and I know that any lesser measure would have been ineffective.

To the bully: You were trying to hurt someone intentionally and it backfired. Think about the consequences of your actions next time – and empathize with the kid whose nose you tried to break. If you’d kicked him in the ankle and broken it instead of punching him, he’d have suffered the pain you’re experiencing now. Why did you want to inflict that on someone else?

All in all, there’s no room for bullying, in the classroom or outside of it. The age of the kids or the size of the kids doesn’t matter. While all of the bullied kids may enjoy the schadenfreude, I think we need to address the bigger issue here. Bullies feed on the attention they get when they bully – if the kids around them weren’t enabling it, it would simply evaporate. This isn’t just between the bully and the bullied; it involves all of the kids in that school who watched, clapped, cheered and even videotaped stuff like this.

Come on, Mr. Principal: I think it’s time you educated the kids in your school about bullying – and maybe got off your sorry ass and did something proactive before the next time. Punishing the two kids who came off worst from the incident is not a long-term (or a short-term) solution.

Fixing the screen/LCD brightness keys on a MacBook Pro, Ubuntu 10.10.

If you’ve been following my posts, you’ll know I have a MacBook Pro running Ubuntu Linux, and you’ll also know I love tweaking things. I’m not obsessive about tweaking, but if something could work better, I’d like it to work better. So after a month of having four dead keys on my laptop, I figured I had to do something about it.

The keys are:

  • keyboard backlight brightness up
  • keyboard backlight brightness down
  • monitor brightness up
  • monitor brightness down

They’re hardly the most important buttons on a keyboard, but I figured they’d sat idle long enough.

Getting them to work turns out to be relatively simple. Initially, I just followed the instructions here to get the keyboard LED brightness working, and that did a decent job… but not a perfect one: it didn’t actually let you take the brightness all the way down to zero or up to 255, the maximum. Thus, I modified the script:

(You can download it here)

#!/bin/bash
# Francisco Diéguez Souto (frandieguez@ubuntu.com)
# This script is licensed under MIT License.
# Modified by Anthony Fejes (apfejes@gmail.com)
#
# This program adjusts the value of the keyboard backlight for Apple laptops.
# You must run it as the root user or via sudo.
# As a shortcut, you can allow admin users to run it via sudo without a
# password prompt. To do this, add the following line to the sudoers file
# (assuming your admin users are in the %admin group):
#
#   %admin ALL = (ALL) NOPASSWD: /usr/sbin/keyboard-backlight
#
# You must then install the script in the path given above, e.g. /usr/sbin/.
# If you choose another path, the location in the sudoers file must reflect
# that path.
#
# After this you can use the script as follows:
#
#     Increase keyboard backlight:
#           $ sudo keyboard-backlight up
#     Decrease keyboard backlight:
#           $ sudo keyboard-backlight down
#
# You can customize the backlight step size by changing the INCREMENT
# variable.

BACKLIGHT=$(cat /sys/class/leds/smc::kbd_backlight/brightness)
INCREMENT=10

if [ $UID -ne 0 ]; then
    echo "Please run this program as superuser"
    exit 1
fi

SET_VALUE=0

case $1 in

    up)
        TOTAL=`expr $BACKLIGHT + $INCREMENT`
        if [ $BACKLIGHT -eq "255" ]; then
            exit 1
        fi
        if [ $TOTAL -gt "255" ]; then
            TOTAL="255"
        fi
        echo $TOTAL > /sys/class/leds/smc::kbd_backlight/brightness
        ;;
    down)
        TOTAL=`expr $BACKLIGHT - $INCREMENT`
        if [ $BACKLIGHT -eq "0" ]; then
            exit 1
        fi
        if [ $TOTAL -lt "0" ]; then
            TOTAL="0"
        fi
        echo $TOTAL > /sys/class/leds/smc::kbd_backlight/brightness
        ;;
    *)
        echo "Use: keyboard-backlight up|down"
        ;;
esac

Following the instructions in the script’s header to add the entry to the sudoers file, you can then go into your windowing environment and associate the keys with the command. In my case, I went to the Settings menu in the KDE launcher, clicked on the System Settings toolbox and went into the Gestures and Shortcuts menu. I created a new group called “brightness controls” in the custom input action settings, then used the menu to create new global shortcuts and picked the “command/URL” type. At that point, all you need to do is move to the “Trigger” tab, click the key you want to associate with each command, and enter the command itself into the “Action” tab. The commands are:

sudo /usr/sbin/keyboard-backlight up

and

sudo /usr/sbin/keyboard-backlight down

Getting the screen brightness to work wasn’t that much harder. The script looks like:

(You can download it here)

#!/bin/bash

# Anthony Fejes (apfejes@gmail.com)
# Template taken from post by Fran Diéguez at
# http://www.mabishu.com/blog/2010/06/24/macbook-pro-keyboard-backlight-keys-on-ubuntu-gnulinux/
#
# This program modifies the value of the video (LCD) brightness for Apple laptops.
# You must run it as the root user or via sudo.
# As a shortcut, you can allow admin users to run it via sudo without a
# password prompt. To do this, add the following line to the sudoers file
# (again assuming your admin users are in the %admin group):
#
#   %admin ALL = (ALL) NOPASSWD: /usr/sbin/mbp_backlight

# After this you can use the script as follows:
#
#     Increase screen brightness:
#           $ sudo mbp_backlight up
#     Decrease screen brightness:
#           $ sudo mbp_backlight down
#

BACKLIGHT=$(cat /sys/devices/virtual/backlight/mbp_backlight/brightness)
MAX=$(cat /sys/devices/virtual/backlight/mbp_backlight/max_brightness)
MIN=4
INCREMENT=1

if [ $UID -ne 0 ]; then
    echo "Please run this program as superuser"
    exit 1
fi

case $1 in

    up)
        TOTAL=`expr $BACKLIGHT + $INCREMENT`
        if [ $BACKLIGHT -eq $MAX ]; then
            exit 1
        fi
        if [ $TOTAL -gt $MAX ]; then
            let TOTAL=MAX
        fi
        echo $TOTAL > /sys/devices/virtual/backlight/mbp_backlight/brightness
        ;;
    down)
        TOTAL=`expr $BACKLIGHT - $INCREMENT`
        if [ $BACKLIGHT -eq $MIN ]; then
            exit 1
        fi
        if [ $TOTAL -lt $MIN ]; then
            let TOTAL=MIN
        fi
        echo $TOTAL > /sys/devices/virtual/backlight/mbp_backlight/brightness
        ;;
    *)
        echo "Use: mbp_backlight up|down"
        ;;
esac

The method is identical to the one above, associating the appropriate keys with the commands:

sudo /usr/sbin/mbp_backlight up

and

sudo /usr/sbin/mbp_backlight down
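For reference, the sudoers entry (added with visudo) that lets both scripts run without a password prompt might look something like the line below – assuming, as in the script headers, that your admin users are in the %admin group and that the scripts live in /usr/sbin:

%admin ALL = (ALL) NOPASSWD: /usr/sbin/keyboard-backlight, /usr/sbin/mbp_backlight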

And I’m now happily able to use all of the keys on my keyboard!

Thoughts on Andrew G. Clark’s Talk and Cancer Genomics

Last night, I hung around late into the evening to hear Dr. Andrew G. Clark give a talk focusing on how most of the variations we see in the modern human genome are rare variants that haven’t had a chance to equilibrate into the larger population. This enormous expansion of rare variants is courtesy of the human population explosion since the dawn of the agricultural age, and particularly over the past 2000 years with the rise of modern science and education.

I think the talk was very well done and it managed to hit a lot of points that struck home for me. In particular, my own database of human variations in cancers and normals has shown me much of the same information that Dr. Clark illustrated using 1000 Genomes data, as well as data from his 2010 paper on deep re-sequencing.

However interesting the talk was, one particular piece didn’t click until after it was over. During a conversation prior to the talk, I described my work to Dr. Clark and received a reaction I wasn’t expecting. Paraphrased, this is how the conversation went:

Me: “I’ve assembled a very large database in which all of the cancers and normals that we sequence here at the Genome Sciences Centre are stored, so that we can investigate the frequency of variations in cancers to identify mutations of interest.”

Dr. Clark: “Oh, so it’s the same as a HapMap project?”

Me: “Yeah, I guess so…”

What I didn’t understand at the time was that what Dr. Clark was really asking was: “So, you’re just cataloging rare variations, which are more or less meaningless?” That is exactly what HapMap projects are: nothing more than large surveys of human variation across genomes. While they could be the basis of GWAS studies, the huge number of rare variants in the modern human population means that many of these GWAS studies are doomed to fail. There will not be a large convergence of variations causing the disease, but rather an extreme number of rare variations with similar outcomes.

However, I think the problem was that I handled the question incorrectly.  My answer should have touched on the following point:

“In most diseases, we’re stuck using lineages to look for points of interest (variations) passed on from parent to child, and the large number of rare variants in the human population makes this incredibly difficult, since each child will carry a significant number of variations that neither parent passed on to them. In cancer, however, we have the unique ability to compare diseased cancer cells with a matched normal from the same patient, which allows us to effectively mask all of the rare variants that are not contributing to the cancer. Thus, the database does act like a large HapMap database if you’re interested in studying non-cancer samples, but the matched-normal pairing available in cancer studies means we’re not confined to using it that way: it enables incredibly detailed and coherent information about the drivers and passengers involved in oncogenesis, without the same level of rare variants interfering in the interpretation of the genome.”
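To make that masking idea concrete, here is a minimal sketch of the principle. The file names are hypothetical, and it assumes each file simply lists one called variant (e.g. “chr1:19154656:T>A”) per line for the tumour and for its matched normal:

# Sort the two variant lists so they can be compared line by line.
sort tumour_variants.txt > tumour.sorted
sort normal_variants.txt > normal.sorted

# Keep only the variants seen in the tumour but NOT in the matched normal:
# the patient's own rare germline variants are masked out, leaving the
# candidate somatic mutations behind.
comm -23 tumour.sorted normal.sorted > candidate_somatic.txt

In practice there is far more to somatic calling than a set difference, but that subtraction is exactly what the matched normal buys you.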

Alas, in the way of all things, that answer only came to me after I heard Dr. Clark’s talk and understood the subtext of his question.  However, that answer is very important on its own.

It means that while many diseases will be hard slogs through deep pools of rare variants (which SNP chips will never be detailed enough to elucidate, by the way, for those of you who think 23andMe will solve a large number of complicated diseases), cancer is bound to be a more tractable disease in comparison! We can bypass the misery of studying every single rare variant – a sizeable fraction of each new genome sequenced!

Unfortunately, unlike many other human metabolic diseases that target a single gene or pathway, cancer is really a whole genome disease and is vastly more complex than any other disease.  Thus, even if our ability to zoom in on the “driver” mutations progresses rapidly as we sequence more cancer tissues (and their matched normal samples, of course!), it will undoubtedly be harder to interpret how all of these work and identify a cure.

So, as with everything, cancer’s somatic nature is a double-edged sword: it can be used to sort the wheat from the chaff more efficiently, but it will also be a source of great consternation when it comes to finding cures.

Now, if only I could convince other people of the dire necessity of matched normals in cancer research…

VanBUG: Andrew G. Clark, Professor of Population Genetics, Cornell University

[My preamble to this talk is that I was fortunate enough to have the opportunity to speak with Dr. Clark before the talk, along with a group of students from the Bioinformatics Training Program. Although he had been asked to speak today on his 1000 Genomes work, I was able to pose several questions to him, including “If you weren’t talking about the 1000 Genomes project, what would you have been speaking about instead?” I have to admit, I had a very interesting tour of the chemistry of Drosophila mating, parental-specific gene expression in progeny and even some chicken expression work. Rarely has 45 minutes of science gone by so quickly. Without further ado (and with great respect to Rodrigo Goya, who is speaking far too briefly – and at a ridiculous speed – on RNA-seq and alternative splicing in cancer before Dr. Clark takes the stage), here are my notes.]

Human population genomics with large sample size and full genome sequences

Talking about two projects – one sequencing a large number of genomes (1000 Genomes project), the other sequencing a very large number of samples in only 2 genes (Rare Variant studies).

The ability to predict phenotype from genotype is still limited – where is the heritability? Using simple SNPs is insufficient to figure out disease and heritability. Perhaps it’s rare variation that is responsible. That question launched the 1000 Genomes project.

The 1000 Genomes project was looking to find variants down to 1% frequency in the population (in accessible regions).

See Nature for the publication of the 1000 Genomes pilot project. This included several trios (parents and child). Found more than 15 million SNPs across the human genome. The biggest impact, however, has been on informatics – how do you deal with that large a volume of SNPs? SNP calling, alignment, codification, etc…

Many of the standard file formats, etc. came from the 1000 Genomes groups working on that data. The biggest issue is (of course) avoiding mapping to the wrong reference! “High quality mismatches” -> many false positives that failed to validate: misalignments of reads. Read-length improvements helped keep this down, as did using the insertions found in other 1000 Genomes project subjects.

Tuning of SNP calling made a big difference. A process with validations made a significant impact. However, for rare variants, it’s still hard to call SNPs.

Novel SNPs tend to be population specific. E.g. Yoruban vs. European samples have different patterns of SNPs. There is a core of common SNPs, but each population has its own distribution of the rare or population-specific SNPs.

“Imputation” using haplotype information (phasing) was a key item for making sense of the different sources of the data.

Great graph of the frequency spectrum (number of variants, log scale, vs. allele frequency from 0.01 to 1). It gives a hockey stick lying flat: lots of very rare SNPs, decreasing towards 1, but with a spike at 1.

>100kb from each gene there is reduced variation (e.g. transcription start sites).

Some discussion of recombination hotspots, which were much better mapped by using the 1000 genome project data.

Another application: de novo mutation. Identify variations in the offspring that are not found in either parent. Roughly 1000 mutations per gamete; ~3×10^-8 substitutions per generation.

1000 Genomes project is now expanding to 2500 samples.  Trying to distribute across 25 population groups, with 100 individuals per group.

Well, what do we expect to discover from ultra-deep sampling?

There are >3000 mutations in dystrophin (ascertained cases of muscular dystrophy – Flanagan et al., 2009, Human Mutation).

If you think of any gene, you can expect to find it mutated at every position, across every population… eventually. [Actually, I do see this in most genes, but not all… some are hyper-conserved, if I’ve interpreted it correctly.]

Major problem, though: sequencing error. If you’re sampling billions of base pairs with a 1/100,000 error rate, you’ll still find bad base calls! (Three billion bases at that error rate still means on the order of 30,000 erroneous calls.)

Alex Coventry: There are only six types of heterozygotes (CG, CT, GT, AC, AG, AT)… ancient technology, not getting into it – it was developed for Sanger sequencing.

Studied the HHEX and KCNJ11 genes, sequenced in 13,715 people. Validated by barcoding and 454 sequencing.

Using the model from Alex’s work, you could assign a posterior probability to each SNP, which helped in validation. When dealing with rare variants, there isn’t a lot of information.

The punchline: “There are a lot of rare SNPs out there!”

Some data shown (site frequency spectra) as the sample size increases. The vast majority of what you get in the long run is rare SNPs.

Human rare variation is “in excess” of what you’d expect from classical theory.  So why are there so many variants?

The historical population was small, but underwent a recent population explosion in the last 2000 years. This allows rapid diversity to be generated, as each new generation brings new variants and there have been no dramatic culls to force this rare variation to consolidate.

How many excess rare variants would you expect from the population explosion? (Gutenkunst et al., 2009, PLoS Genetics.) The population has expanded 100x in about 100 generations. Thus, we see the core set, which was present in the population before the explosion, followed by the rapid diversification explosion of rare SNPs.

You can do age inference, then, with the frequency of SNPs: older SNPs must be present across more of the population. Very few SNVs are older than 100 generations. If you fit the population model back to the expected SNV frequency 100 generations ago, the current data fits very well.

When fitting to the effective population size of humans, you can see that we’re WAY out of equilibrium from what the common SNPs would suggest. [I’m somewhat lost on this, actually. Ne (parents) vs. n (offspring). I think the point is that we’ve not yet seen consolidation (coalescence?) of SNPs.]

“Theory of Multiple Mergers”  Essentially, we have a lot of branches that haven’t had the chance to blend – each node on the variation tree has a lot of unique traits (SNPs) independent of the ancestors.  (The bulk of the weight of the branch lengths is in the many many leaves at the tips of the trees.)

[If that didn’t make sense, it’s my fault – the talk is very clear, but I don’t have the population genetics vocabulary to explain this on the fly.]

What proportion of SNPs found in each newly sequenced full genome do we expect to be novel (for each human)? “It’s a fairly large number.” It’s about 5-7%, with outliers from 3-17%. [I see about the same in my database, which is a neat confirmation.] You can fit this to models: a constant population size would give a low fraction (0.1%), while the explosive-growth model gives 1.4% over very large sample sizes.

Rare variants are enriched for non-synonymous changes and premature terminations (Marth et al., submitted). [Cool – not surprising, and very confounding if you don’t take population frequency into account in your variant discovery.]

What does this mean for complex diseases? Many of our diseases are going to be caused by rare variants rather than common variants. Analogy: jets have 4x redundancy, versus humans with 2x redundancy at the genome level.

Conclusions:

  • The human population has exploded, and this has had a huge effect on rare variation.
  • Huge samples must be sequenced to detect and test their effects.
  • This will impact our studies of diseases, as we will have to come to terms with the effects of rare variations.

[Great talk!  I’ve enjoyed this tremendously!]

 

File Formats and Genomic Data

I’ve been thinking a lot about file formats lately. I know it doesn’t sound like a particularly interesting topic, but it really can be interesting. Well, at least useful. Interesting might be a bit overstating the case, but designing a new file format isn’t a simple job – and finding the right balance between bare bones and overly verbose can be a big challenge.

Over the past few years, I’ve collected a large number of interfaces in my codebase for reading all sorts of files, from plain text all the way up through binary alignment data (MAQ and SAM), spanning all sorts of information. It’s not like I set out to build a library of genomic file interpreters, but it happened. Consequently, I spend a lot of time looking at file formats, and recently I’ve found myself saying “I’ve been thinking about file formats a lot lately…”, which is exactly as geeky as it sounds.

To make matters worse, I’m finding myself reversing an opinion I held earlier. I have been on record cheering for VCF, saying that I would love to have one file format that EVERYONE can dump their SNV information into. I still stand by this, but I’m somewhat disgruntled with the format. Well, as much as one can be disgruntled about a file format, anyways. (Or even gruntled, for that matter.)

When VCF came out, I thought it was brilliant – and it is – because it’s simple. A few mandatory fields are listed, followed by an info field into which anything and everything can be crammed. (See here for a few examples.) While I love the mandatory fields, I don’t think they’ve gone quite far enough. I’d have tossed a few more fields into that category, such as individual read qualities, total coverage, and the number of times the variant was called with confidence… you get the idea. Anyhow, as far as formats go, it really isn’t that bad. The part that annoys me is the lack of enforced fields in the info column. By going this route, they’ve created a format with infinite flexibility and extensibility. For the most part, I support that.
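For illustration, a VCF record looks something like the line below (an invented example, not taken from any real data set): eight fixed columns, followed by the free-form INFO field where callers can stash whatever key=value pairs they like:

#CHROM  POS       ID  REF  ALT  QUAL  FILTER  INFO
1       19154656  .   T    A    35    PASS    DP=42;AF=0.48;SOMATIC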

Infinite flexibility, however, can also be extended to infinite abuse. I am waiting for the first time someone sends me a file with 2k characters in the info field on each line and not a single useful piece of information in it. It hasn’t happened yet, but given the plethora of stuff people are now jamming into VCFs, it’s just a matter of time.

In fact, I’ve been working with the Complete Genomics data (second hand, since a co-worker has been processing it, and I’m not sure how much it has been processed) for much of this week. I’ll be thrilled to see this incredible volume of genomic data, in both cancers and normals, appear in my database once the import is complete. With the first import running when I left for the day, there’s a good chance it’ll be in pretty soon. Unfortunately, not all of the information I’d like was in it. The format is similar to VCF in spirit, but a bit different in style. All in all, it’s not bad (albeit incredibly confusing when my co-worker provided the documentation somewhat second hand and left out all the gooey bits necessary to actually interpret the files), but even here there were issues with insufficient information. The files had to be massaged a bit to get everything we needed. (Although, really, with an expected ~270 million variants called for these data sets, complaining about a lack of information needs to be kept in perspective!)

I have to say, however, that of all the variant formats I deal with, my favorite this week is the varfilter output from samtools. It’s a pain in the ass with its cryptic quality data at the end, but that’s also what makes it easily the most informative – and self-checking. Consider the following “invalid” line:

1       19154656        T       A       22      22      30      3     ,.^?n   "%@

In it, the T and A signify a T->A variant. The base-call markers (“,.^?n”) tell you what was found at that position – “,” and “.” are reference-matching bases on the reverse and forward strands respectively, and the “n” is a single indeterminate, poor-quality base with no call (I’ll admit, I don’t actually know what the “^?” part means) – which allows you to figure out what evidence was used to make this call… which turns out to be none whatsoever. It’s a bad SNP call. If that information hadn’t been there, I’d have had to blindly accept the above SNP into my analysis, and there would be no way to verify the integrity of the SNP caller. (Which seems to be poor, since the example above came from an actual varfilter file, and lines like this are frequently found in other data sets as well, regardless of the SNP caller used, as far as I am aware.)

In a properly supported SNP call, you would expect to see a base-call string that looks more like “.,.,.aAAaa”, which would tell you that five reference bases (the dots and commas, describing the reference T’s in the example above) and five A’s were found, confirming the variant call. Now THAT’s useful information: easily interpreted and easily parsed. That information might or might not be included in a VCF – it’s optional there – and it’s completely absent from the Complete Genomics files.
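As a rough sketch of that sanity check – this is not part of samtools, just an illustration, and the script name is made up – here’s how you could count reference-supporting versus variant-supporting reads from the base-call column of a pileup/varfilter line:

#!/bin/bash
# check_support.sh - count reads supporting the reference vs. the variant call,
# given the base-call column of a samtools pileup/varfilter line.
# Usage: ./check_support.sh ',.^?n' A

BASES=$1    # the cryptic base-call string, e.g. ",.^?n" or ".,.,.aAAaa"
ALT=$2      # the alternate allele reported at this position, e.g. A

# "^" marks the start of a read segment and the character after it encodes that
# read's mapping quality, so strip both; "$" marks the end of a read segment.
CLEAN=$(echo "$BASES" | sed -e 's/\^.//g' -e 's/\$//g')

# "." and "," are reads that match the reference (forward and reverse strand).
REF_COUNT=$(echo "$CLEAN" | tr -cd '.,' | wc -c)

# Letters are mismatches; count only the ones matching the reported variant.
ALT_UPPER=$(echo "$ALT" | tr '[:lower:]' '[:upper:]')
ALT_LOWER=$(echo "$ALT" | tr '[:upper:]' '[:lower:]')
ALT_COUNT=$(echo "$CLEAN" | tr -cd "$ALT_UPPER$ALT_LOWER" | wc -c)

echo "reference-supporting reads: $REF_COUNT"
echo "variant-supporting reads:   $ALT_COUNT"

if [ "$ALT_COUNT" -eq 0 ]; then
    echo "WARNING: no read evidence for the variant call at this position"
fi

Run against the “invalid” line above, it reports zero variant-supporting reads – exactly the red flag you want to catch before a bad call sneaks into your analysis.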

So where am I going with this? Simple: if you’re designing a SNP-calling file format, make sure you include enough information that the calls can be verified easily. I’m willing to pay a slight penalty in storage to be certain that the data is correct. This comes with the sub-point: if you’re using VCF, some of those optional fields shouldn’t be optional.

Graduate Students who Twitter about Science

I’m trying to collect a list of graduate students who twitter about science – please nominate anyone you know who fits this category.

If you’d like to follow everyone on the list in one shot, Check out this list: http://twitter.com/#!/apfejes/science-graduate-students

TwitterId Blog Subject
@23becka Biology
@aanaqvi Biology and Bioinformatics
@aemonten Biology and Molecular Biology
@agvia Biology and Bioinformatics
@aindap Bioinformatics
@ajebsary Biochemistry
@alexaflu0r Molecular Drawing Board Chemistry
@anandvithu Polymers and Materials Science
@apfejes fejes.ca Bioinformatics and Next Generation Sequencing
@audyyy Bioinformatics
@bensaunders Biopsychology
@bgrassbluecrab southernfriedscience Biology
@Bioinf_Beccy Bioinformatics and Biology
@brycewdavis Analytical Chemistry
@CBC_psi
@Cephalover (*)
@cloudskinner Clouds, Satellites and Hydrological models
@conorbrien Evolution, Development, and Genomics Biology
@Daniella battlestar daniellica Media Violence and History
@davidmanly The Wonderful World of Animals Biology and Zoology
@dragonflywoman2 The Dragonfly Woman entomology
@dsquintana Neurobiology
@erikacule Blogging the PhD Biology
@fatgenius Biochemistry
@FLOSciences Biotechnology
@GradStudentity Bioinformatics
@ianstrutt The Boiling Point Chemistry
@icanhasscience I can has science? Chemistry
@JacquelynGill Paleoecology, Climate change and Biogeography
@jbyoder Denim and Tweed Evolutionary Biology
@jgold85 The Thoughtful Animal Psychology, Neuroscience, Behaviour
@joangabriel Biochemistry
@joelshmelly
@KatherineMejia Biology and Bioinformatics
@Katie_PhD KatiePhD.com Biochemistry
@katiesci Neuroscience
@kshameer Bioinformatics
@lads4life Molecular Biology and Functional Genomics
@LeighJKBoerner The Bunsen Boerner Chemistry
@lja_allen Chemistry
@lswenson lukeswenson.ca Biology and HIV
@lunardecorazon Anthropology and Biology
@LyndellMBade
@mcjenmcg
@mmontaner Sociology
@MeinHermitage The Hermitage Engineer
@NerdyChristie Observations of a Nerd Marine Biology
@newreactions New Reactions Organic Chemistry
@oystersgarter Deep Sea News Marine Biology
@palaeobeth Physical Anthropology
@paleophile Paleoanthropology
@PhDtweeting
@physilology C6-H12-O6 Physiology
@polarwander Geology
@pop_gen_JED Animal Genomics
@PsiWaveFunction (*) Skeptic Wonder Biology
@Rachelannmcg Medical Science
@rockstarscience Rocket Scientista Astronomy
@rothzilla Geography and Cartography
@SapphireSeaLion (*)
@seelix This View of Life Physical Anthropology
@SFriedScientist southernfriedscience Biology
@shaenasaurus Comparative Biology
@Synthesist88 (*) Chemistry
@TheAtavism The Atavism Physical anthropology
@thegr8caitbait Environmental Toxicology
@TheStudentT The Student T Nanotechnology
@UdPrfrNAgo Molecular Biology and Neuroscience
@wakebright You’d Prefer An Argonaute RNA Biology
@willandbeyond
@WhySharksMatter southernfriedscience Biology

* indicates Undergraduate.

Why don’t grad students twitter?

This post was sparked by a recent Twitter conversation:

pvanbaarlen: just had to teach 55 BSc students, age ca. 21.. NONE of them ever used Twitter! or read a weblog! age gap yes, but different direction..

larry_parnell: Same here. I know no grad students on Twitter

apfejes: I can’t be the only one…

larry_parnell: Nutrition research here very traditional; no Tweeps

apfejes: would be interesting to see # of grad school tweeters by field…

larry_parnell: I agree.

Chris_Evelo: Would be interesting to know why grad students prefer other things (e.g. biostar, discussionlists) over twitter.

Obviously, I’ve cleaned up the conversation (and left out a few other comments by others) for readability, but there are two good questions in this conversation:

  1. Why aren’t there a lot of Grad Students on Twitter?
  2. What are they doing instead?

I have a hard time imagining that grad students aren’t using social media at all – that would be very… unexpected for a generation that is currently 25-35.

From my own personal experience (just in Bioinformatics), I know there are a number of grad students on IRC (Internet Relay Chat). Although its popularity has waned in general, it is still the best way I know of to have a real-time, text-based conversation with multiple parties. Still, the 20-30 grad students I know of who use IRC are a tiny minority. There are some fantastic discussions online, but it’s a relatively quiet (though friendly) group. (For those who are interested, it’s #bioinformatics on @Freenode.)

There are also a lot of grad students with blogs… in fact, so many that I’m not going to bother posting any links or telling you where to find them. When I started out, blogs were the “gateway drug” of having an online presence. Twitter has vastly lowered the bar, but complex thoughts are really not well suited to the 140-character limit – and grad students usually have complex thoughts.

I’ll go out on a limb here and suggest that a) grad students are probably using their online time for social networking rather than career networking, and b) the question above is a great example of network bias (if that term doesn’t already exist, I’m officially coining it).

For part a), I propose that most grad students are on Facebook instead of Twitter, using their time for less science-y (less geeky?) communication. I actually can’t think of anyone I know who isn’t on Facebook (except me – I have a page, but I just don’t use it; that’s a personal choice about the platform).

If I’m right, that suggests that social – rather than professional – networking is grad students’ primary use of social media, which would be supported by the presence of organizations like the BC Student Biotechnology Network on Facebook. (I’d give you the link, but Facebook is blocked at work.)

My proposal of “network bias” is probably the more interesting of the two ideas above. I would explain it very simply: professors are interested in communicating about their field, and thus tend to write more science-oriented posts and to follow people who tweet about science-oriented topics. This would lead science people to heavily bias their Twitter networks towards people who are either profs or the rare grad student willing to stick their neck out and tweet about science. Like attracts like, and the people I follow are generally science tweeters. If there are grad students on Twitter who are using it as a Facebook replacement, I just wouldn’t be following them anyhow.

I admit the above is all speculation, so I’d be interested in hearing comments – and experiments – to support or disprove it. I’d also be interested in collecting a list of twittering grad students (who aren’t just twittering about how much beer they drank last night, of course). If you know of any, please pass along their Twitter names, and I’ll collect them into a single post.

Periodic table of visualization

I’m not actually all that impressed with it, but I’m spending a lot of time thinking about visualization these days and I came across this on Twitter, so I thought I’d toss it out there:

A Periodic table of Visualization

I’m kinda disappointed in the end result, but the concept is neat. (And yes, the chemist in me is screaming about the wedging of their figure into the periodic table format: the blocks mean nothing in this table! A gross misuse of the format!) Anyhow, it’s an interesting place to start in your quest for better visualization.