CPHx: John Quackenbush, Dana-Farber Cancer Inst. & Harvard – Network & State Space Models

Network & State Space Models: Science and Science Fiction Approaches to Cell Fate Predictions

John Quackenbush, Dana-Farber Cancer Inst. & Harvard

—–

Challenge the way you think about biological systems.

“Science is built with facts as a house is with stones, but a collection of facts is no more a science than a heap of stones is a house” – Jules Henri Poincaré

What is a model?

“The purpose of the models is not to fit the data, but to sharpen the questions.”

The question in biology – Is the mean large, given the variance?

Example: determining gender by height.  There is a correlation, but the variance is huge.

We would like small variance compared to the difference in mean.

An alternative:  Is the difference in variance large independent of the mean?
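[My own toy sketch of the two questions, not from the talk – using the height example above. A t-test asks whether the difference in means is large given the variance; something like Levene's test asks whether the variances themselves differ, largely independent of the means. The numbers are made up purely for illustration.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
heights_a = rng.normal(178, 7, size=200)  # made-up heights for group A (cm)
heights_b = rng.normal(165, 7, size=200)  # made-up heights for group B (cm)

print(stats.ttest_ind(heights_a, heights_b))  # is the difference in means large, given the variance?
print(stats.levene(heights_a, heights_b))     # do the variances differ, independent of the mean?
```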

Modeling cell fate transitions.  How does one cell morph into another cell type based on a stimulus?  Also want to identify pathways that underlie various cell types.  All of this comes from building models.

Referee #3 always contests the use of the word model on all his papers.

Phenomenology tries to look at the past.  Ultimately we look to develop a theory that describes the interactions that drive biological systems.  Build an approximate model that describes a body of knowledge, relating empirical observations of phenomena to each other, consistent with fundamental theory but not derived from it.

A journey through Variation.  Jess Mar’s PhD work.

Cells converge to attractor states.  Stuart Kauffman presented the idea of a gene expression landscape with attractors.  Great illustration of gene networks on a landscape: distinct patterns of gene expression.  States are attractors, and pathways tend to self-organize towards them.

There are only 250 stable cell types, and each of them represents an attractor.

Can we push cells from one state to another based on external stimulus?

An example of promyelocytes (HL-60) transforming into another cell type.  Arrays were done to profile the gene expression states between the two end points over several days.

Cells Display Divergent Trajectories That Eventually Converge as they Differentiate.  What accounts for the divergence?

There are multiple processes occurring during this observed change.  What you see is actually the sum of all of the different processes.  You can, in fact, divide the genes into different groups: transients and core changing genes.  Transients tend to be related to external stimuli.

Waddington’s hypothesis.  Waddington was a developmental biologist, with publications on attractor states, etc.

Waddington’s model calls for “canalization” of the landscape, in which you move from start to end along defined paths.

The paths, however, don’t have to be straight.  You can get paths that wander up the walls of the canals.  Individual cells can follow random courses down that path… thus when you look at the population, you see the canal, but if you look at individual cells, you’d see a high amount of variation.

Had to come up with a method to find pathways that characterize various cell types.  What are the signatures?  “Attract”, soon to be published, finds core pathways that underlie cell fate transitions.  Pull out pathways from KEGG, then build a new method of gene set phenotyping.  Ranking pathways based on cell type informativeness.

Need to look at separate expression groups.  Some profiles are common across various states, so you need to deconstruct the pathway profiles to make sense of them.  This can then be used to define an “informativeness” metric, which in turn can be used for identification of core pathways that identify states.

A variational approach to expression analysis.

A stem cell model for neurological disease, based on olfactory cells.  Nasal biopsies, culture pluripotent stem cells, then allow the stem cells to differentiate.  9 healthy, 9 schizophrenia, 13 Parkinson’s.

What are the pathways that characterize the differentiation of the stem cells?

A bunch of pathways were identified that stood out with significant p-values.  One can then ask if anything stood out between the control and the neurological disease patients.  There was no real difference in average pathway expression… but there were significant differences in their variance!

How important is the difference in variance in defining phenotype?

When overlaid, you can observe skews in the data for pathways.  If the change in variance is important, you should see an even greater skew in the pathways that are key in defining the phenotype.

Indeed, when looking at key pathways, the skew becomes more apparent.  Top 5 pathways show the same skew each time.  There is a robust difference in the profiles, then.

You can also observe the same type of phenomena when using 5% top/5% bottom cutoffs.

High-variance genes tend to be cell surface and nuclear genes; low-variance genes tend to be kinases, signalling proteins, etc.

Variance constraints alter network topology.  This suggests schizophrenia and Parkinson’s sit at opposite ends of a spectrum of neural disease.  (Referring to variance being high in one and low in the other.)

Now, trying to understand the mechanisms underlying this variance.

Path integral formulations of quantum mechanics… Newtonian objects follow one path; subatomic particles follow EVERY path.  You must consider cells in the same way: they follow many paths that converge to the average path.

[Ok, I really like this analogy.]

Where are we going?

  • Biology is really driving this
  • Integrated data types must be considered intelligently
  • We may be in a position to start developing functional biology models. [My words.. it was expressed more clearly by the speaker.]

Genomics is here to stay.  Even bus drivers have DNA kits to help identify people who spit on them. (-:

 

CPHx: Peter Jabbour, Sponsored by BlueSEQ – An exchange for next-generation sequencing

An exchange for next-generation sequencing

Peter Jabbour, Sponsored by BlueSEQ

—–

A very new company, just went live last month.

What is an exchange?  A platform that brings together buyers and sellers within a market.  A web portal that helps place researchers, clinicians, individuals, etc. with providers of next-gen sequencing services.

[web portals?  This seems very 1990s… time warp!]

Why do users need an exchange?  Users have limited access and need better access to technology, platforms, applications, etc.

Why do providers need an exchange?  Providers may want to fill their queues.

[This is one stop shopping for next-gen sequencing providers?  How do you make money doing this?]

BlueSEQ platform: 3 parts.

  1. Knowledge Bank:  Comprehensive collection of continuously updated Next Generation Sequencing information, opinions, evaluations, tech benchmarks.
  2. Project Design: Standardized project parameters.  eg, de novo, etc. [How do you standardize the bioinformatics?  Seems… naive.]
  3. Sequencing exchange:  Providers get a list of projects that they can bid on.

[wow… not buying this. Keeps referring back to the model with airline tickets.]

Statistics will come out of the exchange – cost of sequencing, etc.

No cost to users.  Exchange fees for providers. [again, why would providers want to opt in to this?] 100 users have already signed up.

Future directions:  Specialized project design tools, quoting tools, project management tools, comparison tools, customer reviews.

There are extensive tools for giving feedback, and rating other users’ feedback.

[Sorry for my snarky comments throughout.  This just really doesn’t seem like a well thought out business plan.  I see TONs of reasons why this shouldn’t work… and really not seeing any why it should.  Why would any provider want customer reviews of NGS data… the sample prep is a huge part of the quality, and if they don’t control it, it’s just going to be disaster.  I also don’t really see the value added component.  Good luck to the business, tho!]

 

CPHx: Jacob Glanville, Pfizer – Discovery of biologics & biomarkers in the antibody repertoire

Discovery of biologics & biomarkers in the antibody repertoire

Jacob Glanville, Pfizer

—–

Antibody repertoire: a sum of distinct antibodies in a population.  Eg, your antibody repertoire is all of the antibodies in your body at this moment in time.

Each B-cell displays a single antibody type, but there is a huge range of antibodies that can be made.  Quick review of VDJ recombination.

Antigen + antibody repertoire -> anti-antigen antibody.  One of them is picked out, if one binds.

The antibody repertoire is often considered a black box, but that’s a problem for many reasons, because it can be understood, and can have major effects.

Pfizer’s natural antibody phage display library comes from 650 patients and gives you 35 billion different molecules.  Aka, a high throughput way to generate new molecules.  However, the synthetic repertoire still treats it like a black box.  There is no way to optimize, etc.

Glanville et al., 2009, PNAS.  Used a phage display library with 454 sequencing to overcome this.  Algorithm development needed to be undertaken to analyze antibody repertoires.  Used an HMM for CDR recognition.

Focus on results instead of algorithms because of time.

There are biases in heavy/light chain families.  You don’t sample the full sequence space because of pairing biases.  There are also biases because only the ones that humans allow into their blood stream are found in the library.  Good for future synthetic design, but not great for sampling sequence space.  Third, representation of clones is not even.  Some dominate and waste time.

Using NGS, they are learning to build better synthetic libraries.

Nice graphics shown to illustrate diversity and type of synthetic libraries available.

New library designed – picked pairings from nature as well as variations found in nature.  By learning what’s acceptable, you can design better sequence spaces.  They were able to get significant improvements in folding. (average for synthetic library is 75%, and they got greater than 90%.) Also, good binding was observed.

What can be detected in the human repertoire?  There are repertoire deficiencies, selection biases, and signal amplification.

There are systematic biases in some Vh genes in viral infections, and possibly in cancers.  We should be looking at the antibody repertoire to learn more about how these influence health.  We can’t continue to treat it as a black box.

BTW, there is a huge difference between having code that can run an algorithm, and having the published algorithm.  All of the code is available on sourceforge. [I couldn’t agree more!  Great job!]

 

 

Normalizing reads for Next-Gen Sequencing.

This isn’t a particularly in-depth post.  Rather, I just want to touch on a few points in reply to a Twitter question asking about how to normalize reads.

Actually, normalization is something I haven’t studied in great depth beyond the applications for ChIP-Seq, where – of course – it’s still an open question.  So, if my answer is incomplete, please feel free to point out other resources in the comments.  Any contributions would be welcome.

First, one can do all sorts of fancy things to perform a normalization.  Frankly, I think the term is pretty abused, so a lot of what passes for normalization is a bit sketchy.  Anything that makes two samples “equivalent” in any way is often referred to as normalization, so yeah, your mileage may vary when it comes to applying any approach.

For RNA-Seq, I’ve heard all sorts of techniques being applied.  The most common is to simply count the number of reads in the two samples, and then normalize by dividing by the ratio of reads between the two samples.  Very crude, and often very wrong, depending on what the two samples actually are.  I don’t use this approach unless I’m feeling lazy, someone has asked me to normalize a sample, and it’s 5:40 on a Thursday evening. (I have better things to do on Thursday evenings.)
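For what it’s worth, here’s a minimal sketch of that lazy total-count approach – the function and variable names are just placeholders for illustration:

```python
import numpy as np

def total_count_scale(counts_a, counts_b):
    """Crude normalization: scale sample B so its total read count matches sample A."""
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    ratio = counts_a.sum() / counts_b.sum()  # single global scaling factor
    return counts_b * ratio                  # sample B on sample A's scale

# e.g. per-gene read counts from two libraries:
# scaled_b = total_count_scale([120, 30, 0, 55], [240, 70, 4, 100])
```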

The second method is to bin the reads, then apply the same normalization as above.  The bins can be as large as a whole chromosome or as small as a few hundred base pairs, but the general method is the same: Use some average over a collection of bins to work out the ratio, then use that ratio to force the two samples to have the same approximate total read numbers.  It’s a bit better than what’s above, but not a whole lot better.
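A sketch of the binned version, under the same caveats – bin size and names are made up, and I’m just averaging the per-bin ratios to get the global factor:

```python
import numpy as np

def binned_ratio(positions_a, positions_b, chrom_length, bin_size=10_000):
    """Estimate a scaling ratio from per-bin read counts along one chromosome.
    positions_a/positions_b are arrays of read start coordinates."""
    edges = np.arange(0, chrom_length + bin_size, bin_size)
    bins_a, _ = np.histogram(positions_a, bins=edges)
    bins_b, _ = np.histogram(positions_b, bins=edges)
    keep = (bins_a > 0) & (bins_b > 0)           # ignore empty bins
    return np.mean(bins_a[keep] / bins_b[keep])  # average of per-bin ratios
```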

The third method that I’m aware of is to use a subset of reads to determine the normalization ratio.  This is a bit better – assuming you know enough to pick a good subset.  For instance, if you know housekeeping genes, you can use the total coverage over that set to approximate the relative abundance of reads in order to set the correct ratios.  This method can be dramatically better, if you happen to know a good set of genes (or other subset) to use, as it prevents you from comparing non-equivalent sets.

Just to harp on that last point, if you’re comparing a B-Cell- and a Mammary-derived cell line, you might be tempted to normalize on the total number of reads; however, it would quickly become apparent (once you look at the expressed genes) that some B-Cell genes are highly expressed and swamp your data set.  By paring those out of the normalization subset, you’d find your core genes in common to be more comparable – and thus less prone to bias introduced by genes only expressed in one sample.
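A sketch of the subset idea, assuming you already have a list of genes you trust to be comparable between the samples – the dictionaries and the gene list here are hypothetical placeholders:

```python
def subset_ratio(expr_a, expr_b, trusted_genes):
    """expr_a/expr_b: dicts mapping gene name -> read count.
    trusted_genes: genes assumed to be comparably expressed in both samples
    (e.g. housekeeping genes), used to work out the scaling ratio."""
    total_a = sum(expr_a.get(g, 0) for g in trusted_genes)
    total_b = sum(expr_b.get(g, 0) for g in trusted_genes)
    return total_a / total_b

# ratio = subset_ratio(bcell_counts, mammary_counts, ["ACTB", "GAPDH", "TUBB"])
# normalized_b = {gene: count * ratio for gene, count in mammary_counts.items()}
```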

You’ll notice, however, that all of the methods above use a simple ratio, with increasingly better methods of approximation.  That’s pretty much par for the course, as far as I’ve seen in RNA-Seq.  It’s not ideal, but I haven’t seen much more elegance than that.

When it comes to ChIP-Seq, the same things apply – most software does some variation of the above, and many of them are still floundering around with the first two types, of which I’m not a big fan.

The version I implemented in FindPeaks 4.0 works a little differently, but can be applied to RNA-Seq just as well as to ChIP-Seq. (Yes, I’ve tried.)  The basic idea is that you don’t actually know the subset of house-keeping genes in common in ChIP-Seq because, well, you aren’t looking at gene expression.  Instead, you’re looking at peaks – which can be broadly defined as any collection of reads above the background.  Thus, the first step is to establish the best subset of your data to use for normalization – this can be done by peak calling your reads.  (Hence, using a peak caller.)

Once you have peak-calling done for both datasets, you can match the peaks up.  (Note, this is not a trivial operation, as it must be symmetrical, and repeatable regardless of the order of samples presented.)  Once you’ve done this, you’ll find you have three subsets: peaks in sample 1 but not in sample 2, peaks in sample 2 but not in sample 1, and peaks common to both. (Peaks missing in both are actually important for anchoring your data set, but I won’t get into that.)  If you only use the peaks common to both data sets, rather than peaks unique to one sample, you have a natural data subset ideal for normalization.
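To give a flavour of the matching step, here’s a deliberately naive sketch – the real FindPeaks matching is more careful about one-to-many overlaps so that the result is truly order-independent, but the idea of splitting peaks into the three subsets is the same:

```python
def split_peaks(peaks_1, peaks_2):
    """Peaks are (chrom, start, end, height) tuples.  Two peaks 'match' if they
    overlap on the same chromosome.  Returns (common pairs, only in 1, only in 2)."""
    def overlaps(p, q):
        return p[0] == q[0] and p[1] <= q[2] and q[1] <= p[2]

    common, only_1, matched_2 = [], [], set()
    for p in peaks_1:
        hits = [j for j, q in enumerate(peaks_2) if overlaps(p, q)]
        if hits:
            common.append((p, peaks_2[hits[0]]))  # naive: take the first overlap
            matched_2.update(hits)
        else:
            only_1.append(p)
    only_2 = [q for j, q in enumerate(peaks_2) if j not in matched_2]
    return common, only_1, only_2
```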

Using this subset, you can then perform a linear regression (again, it’s not a standard linear regression, as it must be symmetrical about the regression line, and not Y-axis dependent) and identify the best fit line for the two samples.  Crucially, this linear regression must pass through your point of origin, otherwise, you haven’t found a normalization ratio.

In any case, once all this is done, you can then use the slope of the regression line to determine the best normalization for your data sets.
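Here’s one way to get a symmetric best-fit line through the origin – an orthogonal (total least squares) fit on the matched peak heights.  I’m not claiming this is the exact FindPeaks implementation, but it has the properties described above: it passes through the origin, and swapping the two samples simply gives the reciprocal slope.  The variable names in the usage comments are placeholders.

```python
import numpy as np

def symmetric_slope_through_origin(heights_1, heights_2):
    """Orthogonal regression line through the origin for matched peak heights.
    The slope comes from the leading eigenvector of the (uncentred) second-moment
    matrix, so the two axes are treated symmetrically."""
    x = np.asarray(heights_1, dtype=float)
    y = np.asarray(heights_2, dtype=float)
    m = np.array([[np.dot(x, x), np.dot(x, y)],
                  [np.dot(x, y), np.dot(y, y)]])
    vals, vecs = np.linalg.eigh(m)
    v = vecs[:, np.argmax(vals)]  # best-fit direction
    return v[1] / v[0]            # slope: heights_2 ≈ slope * heights_1

# slope = symmetric_slope_through_origin(common_heights_1, common_heights_2)
# normalized_2 = np.asarray(sample_2_heights) / slope  # put sample 2 on sample 1's scale
```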

The beauty of it is that you also end up with a very nice graph, which makes it easy to understand the data sets you’ve compared, and you have your three subsets, each of which will be of some interest to the investigator.

(I should also note, however, that I have not expanded this method to more than two data sets, although I don’t see any reason why it could not be.  The math becomes more challenging, but the concepts don’t change.)

Regardless, the main point is simply to provide a method by which two data sets become more comparable – the method by which you compare them will dictate how you do the normalization, so what I’ve provided above is only a vague outline that should provide you with a rough guide to some of the ways you can normalize on a single trait.  If you’re asking more challenging questions, what I’ve presented above may not be sufficient for comparing your data.

Good luck!

[Edit:  Twitter user @audyyy sent me this link, which describes an alternate normalization method.  In fact, they have two steps – a pre-normalization log transform (which isn’t normalization, but it’s common.  Even FindPeaks 4.0 has it implemented), and then a literal “normalization” which makes the mean = 0, and the standard deviation =1.   However, this is only applicable for one trait across multiple data sets (eg, count of total reads for a large number of total libraries.)  That said, it wouldn’t be my first choice of normalization techniques.]
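[For completeness, that mean-0/SD-1 step is just a z-score standardization; a quick sketch, as I understand the linked approach – applied to a single trait across many libraries:]

```python
import numpy as np

def log_then_zscore(values):
    """Log-transform, then shift/scale to mean 0 and standard deviation 1.
    Intended for a single trait (e.g. total reads) across many libraries."""
    logged = np.log(np.asarray(values, dtype=float))  # values must be > 0
    return (logged - logged.mean()) / logged.std()

# z = log_then_zscore([1_200_000, 950_000, 2_100_000, 1_750_000])
```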

 

Working with Jackie Chan.

Since I’ve been posting jobs, I figured I may as well point people to another set of open positions.  Of course, again, I have no relationship with the people posting it… however, I just couldn’t not say anything about this set.

Apparently, if you work in the Pallen Group, you get to work on next gen sequencing pipelines in the lab with Jackie Chan. (Research Fellow in Microbial Bioinformatics)

How cool is that?

Anyhow, the other position (Research Technician in Bioinformatics) doesn’t (apparently) involve martial arts.

Java 1.6 based fork of the Ensembl API.

Just in case anyone is still interested, I have started an ensj (Ensembl Java API) project at sourceforge, using the latest version of the ensj-core project as the root of the fork.  It fixes at least one bug with using the java API on hg19, and makes some improvements to the code for compatibility with java 1.6.

There are another few thousand changes I could make, but I’m just working on it slowly, if at all.

I’m not intending to support this full time, but interested parties are welcome to join the project and contribute, making this a truly open source version of the ensembl interface.  That is to say, community driven.

Ensembl isn’t interested in providing support (they no longer have people with the in-depth knowledge of the API to provide support), so please don’t use this project with the expectation of help from the Ensembl team.  Also note that significant enhancements or upgrades are unlikely unless you’re interested in contributing to them! (I have my own dissertation to write and am not looking to take this on as a full time job!)

If you’re interested in using it, however, you can find the project here:

https://sourceforge.net/projects/ensj/

and a few notes on getting started here and here.  I will get around to posting some more information on the project on the SourceForge web site when I get a chance.

 

Celebrations

Some days you celebrate the little victories, other days, you celebrate the big ones.

Today, I get to celebrate a pretty significant victory, in my humble opinion: I managed to get the ensembl java API to compile and generate a fully operational battle station jar file that works with my Java code.

I know, it doesn’t sound like such a big deal, but that means I worked out all of its dependencies, managed to get all of it to compile without errors and THEN managed to fix a bug.  Not bad for a project I thought would take months.  In fact, I’ve even made some significant upgrades; for instance, it now creates a java 1.6 jar file, which should run a bit faster than the original java 1.4.  I’ve also gone through and upgraded some of the code – making it a bit more readable and in the java 1.6 format with “enhanced loops”.  All in all, I’m pretty pleased with this particular piece of work.  Considering I started on Friday, and I’ve managed to make headway on my thesis project in the meantime, I’d say I’m doing pretty well.

So, as I said, I get to celebrate a nice little victory…. and then I’ll have to immediately get back to some more thesis writing.

—–

For posterity’s sake, here are the steps required to complete this project:

  1. Get the full package from the Ensembl people. (They have a version that includes the build file and the licence for the software.  The one I downloaded from the web was incomplete.)
  2. Get all of the dependencies.  They are available on the web, but most of them are out of date and new ones can be used.
  3. Figure out that java2html.jar needs to be in ~/.ant/lib/, not in the usual ./lib path
  4. Fix the problem of new data types in AssemblyMapperAdaptorImpl.java. (It’s a 2 line fix, btw.)
  5. Modify the build.properties file to use the latest version of the mysql API, and then copy that to the appropriate ./lib path.
  6. Modify the build.properties file to reflect that you’re generating a custom jar file.
  7. Modify the build.xml to use java 1.6 instead of 1.4
  8. Figure out how to use the ant code.  Turns out both “ant build” and “ant jar” work.
  9. Note, the project uses a bootstrap manifest file which isn’t available in the source package on the web. If you use that code, you have to modify the build.xml file to generate a custom manifest file, which is actually pretty easy to do.  This isn’t required, however, if you have the full source code.

When you write it out that way, it doesn’t sound like such a big project, does it?  I’m debating putting the modified version somewhere like sourceforge, if there’s any interest from the java/bioinformatics community.  Let me know if you think it might be useful.

Cancer as a network disease

A lot of my work these days is in trying to make sense of a set of cancer cell lines I’m working on, and it’s a hard project.  Every time I think I make some headway, I find myself running up against a brick wall – mostly because I’m finding myself returning to the same old worn out linear cancer signaling pathway models that biochemists like to toss about.

If anyone remembers the biochemical pathway chart you used to be able to buy at the university chem stores (I had one as a wall hanging all through undergrad), we tend to perceive biochemistry in linear terms.  One substrate is acted upon by one enzyme, the product is picked up by another enzyme, which acts on that substrate in turn, ad nauseam.  This is the model by which the electron transport chain works, and by which most common metabolites are synthesized.  It is the default model to which I find myself returning when I think about cellular functions.

Unfortunately, biology rarely picks a method because it’s convenient to the biologist.  Once you leave cellular respiration and metabolite synthesis and move on to signaling, nearly all of it, as far as I can tell, works along a network model.  Each signaling protein accepts multiple inputs and is likely able to signal to multiple other proteins, propagating signals in many directions.  My colleague referred to it as a “hairball diagram” this afternoon, which is pretty accurate.  It’s hard to know which connections do what and if you’ve even managed to include all of them into your diagram. (I won’t even delve into the question of how many of the ones in the literature are real.)

To me, it rather feels like we’re entering an era in which systems biology will be the overwhelming force driving deep insight.  Unfortunately, our knowledge of systems biology in the human cell is pretty poor – we have pathway diagrams which detail sub-systems, but they are next to impossible to link together. (I’ve spent a few days trying, but there are likely people better at this than I am.)

Thus, every time I use a pathway diagram, I find myself looking at the “choke points” in the diagram – the proteins through which everything seems to converge.  A few classic examples in cancer are AKT, p53, Myc and the MAPKs.  However, the more closely I look into these systems, the more I realize that these choke points are not really the focal points in cancer.  After all, if they were, we’d simply have to come up with drugs that target these particular proteins and voila – cancer would be cured.

Instead, it appears that cancers use much more subtle methods to effect changes on the cell.  Modifying a signaling receptor, which turns on a set of transcription factors that up-regulate proto-oncogenes and down-regulate cancer suppressors, in turn shifting the reception of signals that reinforce this pathway…

I don’t know what the minimum number of changes required are, but if a virus can do it with only a few proteins (EBV uses no more than 3, for instance), then why should a cell require more than that to get started?

Of course, this is further complicated by the fact that in a network model there are even more ways to create that driving mutation.  Tweak a signaling protein here, a receptor there… in no time at all, you can drive the cell into an oncogenic pattern.

However, there’s one saving grace that I can see:  Each type of cell expresses a different set of proteins, which affects the processes available to activate cancers.  For instance inherited mutations to RB generally cause cancers of the eye, inherited BRCA mutations generally cause cancers of the breast and certain translocations are associated with blood cancers.  Presumably this is because the internal programs of these cells are pre-disposed to disruption by these particular pathways, whereas other cell types are generally not susceptible because of a lack of expression of particular genes.

Unfortunately, the only way we’re going to make sense of these patterns is to assemble the interaction networks of the human cells in a tissue specific manner.  It won’t be enough to know where the SNVs are in a cell type, or even which proteins are on or off (although it is always handy to know that).  Instead, we will have to eventually map out the complete pathway – and then be capable of simulating how all of these interactions disrupt cellular processes in a cell-type specific manner.  We have a long way to go, yet.

Fortunately, I think tools for this are becoming available rapidly.  Articles like this one give me hope for the development of methods of exposing all sorts of fundamental relationships in situ.

Anyhow, I know where this is taking us.  Sometime in the next decade, there will need to be a massive bioinformatics project that incorporates all of the information above: Sequencing for variations, indels and structural variations, copy number variations and loss of heterozygosity, epigenetics to discover the binding sites of every single transcription factor, and one hell of a network to tie it all together. Oh, and that project will have to take all sorts of random bits of information into account, such as the theory that cancer is just a p53 aggregation disease (which, by the way, I’m really not convinced of anyhow, since many cancers do not have p53 mutations).  The big question for me is if this will all happen as one project, or if science will struggle through a whole lot of smaller projects.  (AKA, the human genome project big-science model vs. the organized chaos of the academic model.)  Wouldn’t that be fun to organize?

In the meantime, getting a handle on the big picture will remain a vague dream at best, and I tend to think cancer will be a tough nut to crack.  Like my own work, it will, for the time being, be limited to one pathway at a time.

That doesn’t mean there isn’t hope for a cure – I just mean that we’re at a pivotal time in cancer research.  We now know enough to know what we don’t know and we can start filling in the gaps. But, if we thought next gen sequencing was a deluge of data, the next round of cancer research is going to start to amaze even the physicists.

I think we’re finally ready to enter the realms of real big biology data, real systems biology and a sudden acceleration in our understanding of cancer.

As we say in Canada… “GAME ON!”

Why I haven’t graduated yet and some corroborating evidence – 50 breast cancers sequenced.

Judging a cancer by its cover – er, tissue of origin – may be the wrong approach.  It’s not a publication yet, as far as I can tell, but summaries are flying around about a talk presented at AACR 2011 on Saturday, in which 50 breast cancer genomes were analyzed:

Ellis et al. Breast cancer genome. Presented Saturday, April 2, 2011, at the 102nd Annual Meeting of the American Association for Cancer Research in Orlando, Fla.

I’ll refer you to a summary here, in which some of the results are discussed.  [Note: I haven’t seen the talk myself, but have read several summaries of it.] Essentially, after sequencing 50 breast cancer genomes – and 50 matched normal genomes from the same individuals – they found nothing of consequence.  Everyone knows TP53 and signaling pathways are involved in cancer, and those were the most significant hits.

“To get through this experiment and find only three additional gene mutations at the 10 percent recurrence level was a bit of a shock,” Ellis says.

My own research project is similar in the sense that it’s a collection of breast cancer and matched normal samples, but using cell lines instead of primary tissues.  Unfortunately, I’ve also found a lot of nothing.  There are a couple of genes that no one has noticed before that might turn into something – or might not.  In essence, I’ve been scooped with negative results.

I’ve been working on similar data sets for the whole of my PhD, and it’s at least nice to know that my failures aren’t entirely my fault. This is a particularly difficult set of genomes to work on and so my inability to find anything may not be because I’m a terrible researcher. (It isn’t ruled out by this either, I might add.)  We originally started with a set of breast cancer cell lines spanning across 3 different types of cancer.  The quality of the sequencing was poor (36bp reads for those of you who are interested) and we found nothing of interest.  When we re-did the sequencing, we moved to a set of cell lines from a single type of breast cancer, with the expectation that it would lead us towards better targets.  My committee is adamant  that I be able to show some results of this experiment before graduating, which should explain why I’m still here.

Every week, I poke through the data in a new way, looking for a new pattern or a new gene, and I’m struck by the absolute independence of each cancer cell line.  The fact that two cell lines originated in the same tissue and share some morphological characteristics says very little to me about how they work. After all, cancer is a disease in which cells forget their origins and become, well… cancerous.

Unfortunately, that doesn’t bode well for research projects in breast cancer.  No matter how many variants I can filter through, at the end of the day, someone is going to have to figure out how all of the proteins in the body interact in order for us to get a handle on how to interrupt cancer-specific processes.  The (highly overstated) announcement of p53’s tendency to mis-fold and aggregate is just one example of these mechanisms – but only the first step in getting to understand cancer. (I also have no doubts that you can make any protein mis-fold and aggregate if you make the right changes.)  The pathway-driven approach to understanding cancer is much more likely to yield tangible results than the genome-based approach.

I’m not going to say that GWAS is dead, because it really isn’t.  It’s just not the right model for every disease – but I would say that Ellis makes a good point:

“You may find the rare breast cancer patient whose tumor has a mutation that’s more commonly found in leukemia, for example. So you might give that breast cancer patient a leukemia drug,” Ellis says.

I’d love to get my hands on the data from the 50 breast cancers, merge it with my database, and see what features those cancers do share with leukemia.  Perhaps that would shed some light on the situation.  In the end, cancer is going to be more about identifying targets than understanding its (lack of) common genes.

New developments…

I’ve not been blogging lately because I have managed to convince myself that blogging was taking time away from other things I need to be doing and focusing on.  The most important thing at the moment is to get a paper done, which will be the backbone of my thesis.  Clearly, it is high priority; however, it’s becoming harder and harder not to talk about things that are going on, so I thought I’d interrupt my “non-blogging” with a few quick updates.  I have a whole list of topics I’m dying to write about, but just haven’t found the time to work on yet – trust me, they will get done.  Moving along….

First, I’m INCREDIBLY happy that I’ve been invited to attend and blog the Copenhagenomics 2011 conference (June 9/10, 2011).  I’m not being paid, but the organizers are supporting my travel and hotel (and presumably waiving the conference fee), so that I can do it.  That means, of course, that I’ll be working hard to match or exceed what I was able to do for AGBT 2011. And, of course, I’ll be taking a few days to see some of Denmark and presumably do some photography.  Travel, photography, science and blogging!  What a week that’ll be!

Anyhow, this invitation came just before the wonderful editorial in Nature Methods, in which social media is discussed as a positive form of scientific communication for conference organizers.  I have much to say on this issue, but I don’t want to get into it at the moment.  It will have to wait till I’m a few figures further into my paper, but needless to say, I believe very strongly in it and think that conferences can get a lot of value out of supporting bloggers.

Moving along (again), I will also be traveling in June to give an invited talk, which will be my first outside of Vancouver. Details have not been arranged yet, but once things are settled down, I’ll definitely share some more information.

And, a little closer to home, I’ve been invited to sit on a panel for VanBug (the Vancouver Bioinformatics Users Group) on their “Careers in Bioinformatics” night (April 14th).  Apparently, my bioinformatics start-up credentials are still good and I’ve been told I’m an interesting speaker.  (In case you’re wondering, I will do my best to avoid suggesting a career as a permanent graduate student…) Of course, I’m looking forward to sitting on a panel with the other speakers: Dr. Inanc Birol, Dr. Ben Good and Dr. Phil Hieter – all of whom are better speakers than I am.  I’ve had the opportunity to interact with all of them at one point or another and found them to be fascinating people. In fact, I took my very first genomics course with Dr. Hieter nearly a decade ago, in an interesting twist of fate.  (You can find the poster for the event here.)

Even with just the few things I’ve mentioned above, the next few months should be busy, but I’m really excited.  Not only can I start to see the proverbial light at the end of the tunnel for grad school, I’m really starting to get excited about what comes after that.  It’s hard to not want to work, when you can see the results taking shape in front of your eyes.  If only there were a few more hours in the day!