A stab at the future of bioinformatics

I had a conversation the other day about where bioinformatics is headed, and it has kept me thinking for the past few days.  Generally, the question was more about whether bioinformatics (and biotechs) are at the start of something big, or whether this is all a fad.  Unfortunately, I can’t tell the future, but that doesn’t mean I shouldn’t take a wild stab in the dark.

Some things are clear because some things never change.  Unless armageddon is upon us or aliens land, we can be sure that sequencing will continue to get cheaper until it hits bottom – by which I mean about the same cost as any other medical test. (At which point, profit margins go up while sequencing costs go down, of course!)  But, that means that for the foreseeable future, we should expect the volume of human sequencing data to continue to rise.

That, naturally, translates pretty directly to an increase in the amount of data that needs to be processed.  Bioinformatics, unlike many other fields, is all about automation and discovery – and in this case, automation is really the big deal.  (I’ll get back to discovery later.)  Pipelines that take care of the human data are going to be more and more valuable, particularly when they add value to the automation and interpretation.  (Obviously, I should disclose that I work for a company that does this.)  I can’t say that I see this need going away any time soon.  However, doing it well requires significant investment and (I’d like to think) skill.  (As an aside, sorry for all of the asides.)

Clearly, though, automation will probably be a big employer of bioinformaticians going forward.  A great pipeline is one that is entirely invisible to the people using it, and keeping a pipeline for the automation of bioinformatics data current isn’t an easy task.  Anyone who has ever said “Great! We’re done building this pipeline!” isn’t on the cutting edge.  Or even on the leading edge.  Or any edge at all.  If you finish a pipeline, it’ll be obsolete before you can commit it to your git repository.

But, the state of the art in any field, bioinformatics included, is all about discovery.  For the most part, I suspect that it means big data.  Sometimes big databases, but definitely big data sets.  (Are you old enough to remember when big data in bioinformatics came in a fasta file, and people thought perl was going to take over the world?)  There are seven billion people on earth, and they all have genomes to be sequenced.  We have so much to discover that every bioinformatician on the planet could work on that full time, and we could keep going for years.

So yes, I’m pretty bullish on the prospects of bioinformaticians in the future.  As long as we perceive that knowledge about ourselves is useful, and as long as our own health preoccupies us – for insurance purposes or diagnostics – there will be bioinformatics jobs out there.  (Whether there are too many bioinformaticians is a different story for another post.)  Discovery and re-discovery will really come sharply into focus for the next few decades.

We can figure out some of the more obvious points:

  • Cancer will be a huge driver of sequencing because it changes over time, and so we’ll constantly be driven to sequence again and again looking for markers or subpopulations. It’s a genetic disease and sequencing will give us a window into what it’s doing where nothing else can.  Like physicists and the hunt for subatomic particles, bioinformaticians are going to spend the next hundred years analyzing cancer data sets over and over and over.  There are 3 billion bases in the human genome, and probably as many unique variations that make a cell oncogenic. (Big time discovery)
  • Rare disease diagnostics should become commonplace.  Can you imagine catching every single childhood disease within two weeks of the birth of a child?  How much suffering would that prevent?   Bioinformaticians will be at the core of that, automating systems to take genetic counsellors out of the picture. (discovery turning to automation)
  • Single cell sequencing will eventually become a thing…. and then we’ll have to spend the next decade figuring out how the heck we should interpret it.  That’ll be a whole new field of tools. (discovery!)
  • Integration with medical records will probably happen.  Currently, it’s far from ideal, but mostly because (as far as I can tell) electronic medical records are built for doctors. Bioinformaticians will have to step in and have an impact.  Not that we haven’t seen great strides, but I have yet to hear of an EMR system that handles whole genome sequencing.  (automation.)
  • LIMS.  ugh. It’ll happen and drain the lives from countless bioinformaticians.  No further comment necessary. (automation)

At some point, however, it’s going to become glaringly obvious that the bioinformatics component is the most expensive part of all of the above processes.  Each will drive massive cost savings in healthcare and efficiency, but the actual process of building the tools doesn’t scale the same way as the data generation.

Where does that leave us?  I’d like to think that it’s a bright future for those who are in the field.  Interesting times ahead.

Ants..

This is a strange way to begin, but moving to California has reminded me of an interest in an algorithm that I’ve always found fascinating: Ant Walks.

I hadn’t expected to return to that particular algorithm, but it turns out there’s a reason why people become fascinated with it: it’s an attempt to describe the behaviour of ants… which California has given me an opportunity to study first hand.

I’m moving in a week or two, but I have to admit, I have a love/hate relationship with the ant colony in the back yard. I won’t really miss them, because they’re seriously everywhere. Although I’ve learned how to keep them out of the house, and they don’t really bother me much, they’re persistent and highly effective at finding food. Especially crumbs left on the kitchen floor. (By the way, with strategic placement of ant repellent, the ants actually have a pretty hard time finding their way in… but that’s another post for another day.)

Regardless, the few times that the ants have found their way inside have inspired me to watch them and learn a bit about how they do what they do – and it’s remarkably similar to the algorithm based off of their behaviour. First, they take advantage of sheer numbers. They don’t really care about any one individual, and thus they just send each ant out to wander around. Basically, it’s just a divide and conquer, with zero planning. The more ants they send out, the more likely they are to find something. If you had only two or three ants, it would be futile… but 50-100 ants all wandering in a room with a small number of crumbs will result in the crumbs all being found.
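For the curious, the “sheer numbers” part can be sketched in a few lines of Python. This is only a toy model of the wandering stage, not the full ant walk algorithm with trails: the function name, the 20x20 room, the crumb positions and the step count are all made up for illustration. Each ant gets its own reproducible random stream, so adding ants can only add to the crumbs that get found.

```python
import random

def crumbs_found(num_ants, crumbs, steps=200, size=20, seed=0):
    """Count the crumbs reached by independent, unplanned random walkers."""
    found = set()
    for ant in range(num_ants):
        # One reproducible random stream per ant (an illustrative choice).
        rng = random.Random(seed * 1000 + ant)
        x = y = size // 2  # every ant starts at the nest, mid-room
        for _ in range(steps):
            dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            # Clamp to the walls so the ant stays inside the room.
            x = min(max(x + dx, 0), size - 1)
            y = min(max(y + dy, 0), size - 1)
            if (x, y) in crumbs:
                found.add((x, y))
    return len(found)

# Hypothetical crumb positions on the kitchen floor.
crumbs = {(2, 3), (17, 5), (10, 18), (4, 15), (15, 12)}
for n in (3, 50, 100):
    print(n, "ants found", crumbs_found(n, crumbs), "of", len(crumbs), "crumbs")
```

There is zero planning in there: no ant knows where any other ant has been, and the colony still covers the room simply by sending out enough walkers.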

And then there’s the whole thing about the trails. Watching them run back and forth along the trails really shows you that the ants do know exactly where they’re going, when they have somewhere to be. When they get to the end, they seem to go back into the “seeking” mode, so you can concentrate the search for relevance to a smaller area, for a more directed random search.

All in all, it’s fascinating. Unfortunately, unlike Richard Feynman, I haven’t had the time to set up Ant Ferries as a method of discouraging the ants from returning – my daughter and wife are patient, but not THAT patient – but that doesn’t mean I haven’t had a chance to observe them.  I have to admit, of all the things that I thought would entertain me in California, I didn’t expect that ants would be on that list.

Anyone interested in doing some topology experiments? (-;

(A bit over) A year in Oakland.

Naturally, I say “I’m going to blog more”, and then I get sick for a week and a half, and nothing gets written. Never fails! I should have said, “I’m never going to blog again”, at which point, I don’t doubt the health-fairy would come and make me all better.

But I never seem to do things that way.

I was thinking about blogging about work a bit more, since I’ve been given some leeway to do so, but I kinda feel like there’s a bit of low hanging fruit I wanted to tackle first…

Guess what – I’ve been living in California for 14 months. And you know what? It’s been fascinating. I’ve been amused, frustrated, annoyed and thrilled at the experience, and I think I should share some of it with you. As one zookeeper said to the other, “Do you want the good gnus or the bad gnus?”

Ok, let’s start with the down side. Oakland – and much of what I’ve seen of the bay area – is far less clean than Canada. I’d heard Americans come north and say that Canada is clean, but I have to say that the overwhelming impression when you come south is the opposite. For bonus points, I’ve been living near an overpass, where people love to dump their garbage. It’s not pretty, and there’s a level of grit that’s just always there, presumably courtesy of the vast volume of traffic from the highway behind our house, and the busy street in front of it. Worse, though, I’ve seen people toss stuff out of moving cars, open their doors at street lights to casually let garbage fall out, and sometimes even just walking along, drop whatever they’re carrying if they don’t want it anymore. It has been very hard to teach my daughter how to be responsible when we’re constantly seeing examples of what not to do. Fortunately, my daughter has figured it out – and she likes to tell people that littering is wrong. I support her campaign entirely!

There are also the unexpected social constructs of the bay area – it’s hard not to notice the racial divides that are present here. I could be wrong, given that I don’t spend a lot of time exploring that aspect of life here, but there seems to be a socioeconomic divide that falls along racial lines. Oddly enough, I don’t recall that happening in Canada to the same extent. It’s there, but not nearly as close to the surface as I seem to find it here.

Finally, I have to admit I’ve had a LOT of dealings with the IRS and the CRA over the past year. For those of you who aren’t yet 18, or have only lived on one side of “the border”, those are the U.S. and Canadian tax agencies, respectively. Overwhelmingly, I have to say that the attitudes of the people at the two agencies are night and day. After dealing with the IRS, I actually look forward to dealing with the Canada Revenue Agency. Where the IRS gives off an air of “we’re too big to care about you” in pretty much all of its interactions, the CRA seems friendly and almost like they’re really there to help you – even when they’re trying to extract more money out of you than you’ve ever owned. Bizarre, that. At any rate, I’ve amassed a significant number of stories, if anyone ever wants to hear them.

In contrast to the above, I also have to admit, there are some amazing things about living in the bay, which make me really glad I’m here.

First, watching Oakland transform is pretty damn cool. Despite the garbage and selfish attitude of a small minority of the residents, Oakland is transforming. You can see the city is repaving streets to create bike paths, new buildings are going up everywhere, and houses everywhere are starting to get a little more care. It’s probably mostly “gentrification”, as the rich from San Francisco realize that this side of the bay is actually convenient for living and working, but it’s not a dirty word. It may be displacing some people, but the influx of families and artists and all of that is kinda like watching a flower bloom in slow motion. The neighbourhood I’ve lived in for the past 14 months has seen a creeping increase in the number of children in the area… and come spring, I have no doubt my daughter would be out there making friends at the park again.

Well, she would be, if we weren’t moving. We’ve found an apartment that will fit us better – and upon doctor’s orders, we will be further from the aforementioned traffic, which has been an issue for us. But, after a year, you start to find all these interesting pocket neighbourhoods, which you’d never find if you don’t make your way off the beaten path. I grew up somewhere where a “hill” was a couple of meters tall – and there weren’t many of them, so this is fascinating. The bay area has hills and valleys and microclimates and beaches and wineries… (don’t mind the meandering topic but)… oh my god, if you go a bit out of town, there are parks that blow your mind. On our first real family outing, we ended up at Limantour beach on a semi-foggy day, and were entertained by several pods of whales and dolphins parading up and down the beach, close enough that I probably could have hit one with a rock, if I’d tried. (And I have a lousy arm for tossing rocks…) When California decides to put on a show, it’s mind blowing. Sonoma in the fall was incredible, where my daughter and I played for an hour in the falling leaves, and Napa had us ooo-ing and aah-ing over incredible produce in the market in the summer.

And then there’s the people. Yes, there are homeless people, and aggressive panhandlers – especially around the Berkeley BART station! – but the overwhelming majority of Americans have absolutely no problem suddenly breaking out into a conversation at the drop of a hat. Random people will cheerfully begin chatting with you, when you least expect it. It’s the opposite of living in a Canadian Suburban Centre. At times, it’s surprising, but it’s always interesting and it makes you feel just a little more connected into a community that has as many people as any random 3-4 Canadian provinces combined. While I can’t say I’ve made a lot of friends outside of work, I can say I’ve met a lot of interesting people in my neighbourhood.

I haven’t decided which of the above stories, or even the many unmentioned ones, I want to tell on my blog yet, but I’m starting to think that a bit of Oakland is going to spill over into my writing, along with a bit of bioinformatics. To be entirely candid, sometimes it’s hard to tell which one is stealing the show.

A surprise revelation today.

I feel like blogging about blogging tonight… but I’ll keep it short.

I’ve realized that blogging and twitter were an extension of my interest in living on the cutting edge of bioinformatics. I’m always interested in new technologies and new developments, and when I cut back on blogging (mainly because my priorities shifted as a parent… sleep, glorious sleep.), I also cut back on my interactions with the field around me.

That was mostly OK for a while. When I was working at UBC, I was able to attend lectures and interact in the academic world, so I had a bit of a life line. However, in Oakland, I haven’t been tied in to that greater flow of information. It sort of snuck up on me this morning, and I realized two things today.

The first is that I’m way out of touch, and it’s time to re-engage. I actually don’t begrudge the loss of interactions over the past year or so, really. The things my team and I have accomplished in the past year have been nothing short of awe inspiring, and I’ve learned a lot about my work, pipeline bioinformatics and what you can really make a computer do if you try hard enough. (Order of magnitude performance increases just make me feel warm and tingly… one day I should post the performance graph of Omicia’s software.) But, it’s time.

The second thing I realized is that the lack of engagement drove me to reddit’s bioinformatics forum for a similar reason. I love writing, but without stimulation, you have nothing to write about. Reddit gives you a series of writing prompts, which can be fun, but I can get the same thing from reading other blogs and twitter – and that’s far more interesting than Reddit’s usual repertoire. (How many times can you give the same advice to people who want to get into bioinformatics?)

Regardless, if you’re looking for me, I’m going to be back on twitter, feeding my addiction to science and bioinformatics.

And yes, as of a few days ago, my daughter finally learned to sleep through the night. Strange how everything is connected, isn’t it?

Pac Bio Sequel

This isn’t anything others haven’t heard about, I’m sure, but I just saw the announcement for the Pac Bio Sequel.

It’s a pretty looking machine, and its promise (according to the press release) is pretty awesome. Actually, I’ve always had a soft spot for Pac Bio, despite never having worked with Pac Bio data. It’s just that I so much want it to work. There’s just something appealing to me about tethered enzymes and single molecule sequencing.

Anyhow, I don’t have much commentary, though I’d love to hear if others do, about the Sequel.

http://blog.pacificbiosciences.com/2015/09/introducing-sequel-system-scalable.html

Something they don’t tell you about PyMongo 3.0 and Multiprocessing.

EDIT: This post turned into a bug report over at the mongo python driver wiki, where it was confirmed to be a bug, and not a feature. Ultimately, the issue hasn’t been resolved yet, but version 3.0.4 will now throw a warning, preventing this issue from failing silently. Thanks to A. Jesse Jiryu Davis for suggesting I file it as a bug, and Anna Herlihy for the patch!

I had an interesting bug in a piece of software that I’ve been working on, which involves some heavy multiprocessing.  Running 18 processes simultaneously, of which at least 9 require some form of database interaction with MongoDB, is really not all that complicated… but I hit something that tossed in a wrench and confused me for 2 days.  What was it, you might ask?

Well, it looked like this:

  File "something.py", line 177, in flush
    b.execute()
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/bulk.py", line 582, in execute
    return self.__bulk.execute(write_concern)
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/bulk.py", line 430, in execute
    with client._socket_for_writes() as sock_info:
  File "/usr/local/Cellar/python/2.7.10/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/mongo_client.py", line 663, in _get_socket
    server = self._get_topology().select_server(selector)
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/topology.py", line 121, in select_server
    address))
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/topology.py", line 97, in select_servers
    self._error_message(selector))
ServerSelectionTimeoutError: No servers found yet

Basically, the new pymongo drivers (3.0.x) have changed their initialization, so that they no longer actually create the connection pool when you initialize them.  You say:

mongo = MongoClient()

and they go off and do a non-blocking initialization of everything pymongo needs to connect to the server. All is good.

However, if you’re doing multiprocessing, the temptation is to allow each of your processes to launch a new instance of the MongoClient. Indeed, I’ve done that before with the 2.8.x series of pymongo, and it worked well. However, in this case, pymongo 3.0.2 REALLY doesn’t like it, and you’ll get the “No servers found yet” error when you try to retrieve results from your database. Oddly enough, it’s especially hard to figure out because pymongo has one more hidden surprise for you: serverSelectionTimeoutMS.

You probably have never heard of this parameter, but it’s kinda important, now. It goes on your initialization of the MongoClient:

self.mongo = MongoClient(mongo_url, mongo_port, serverSelectionTimeoutMS=500) 

If you don’t put it there, the default value is 30 seconds… Which means your application sits there, waiting to see if the mongo database will connect for 30 seconds, once it realizes that the database is missing. When it finally does fail, you’ll get the error above… 30 seconds after your database went down. That’s cool… except when the issue is actually not related to the database going down.

In my case, the issue was not that the database went down, but that each process should not be initializing a new instance of MongoClient! The fix that worked for me: have the parent process create one instance of MongoClient, and then pass that as a parameter to the child processes. Tada! – the error disappears, and your program starts to run, instead of failing and waiting 30 seconds to tell you.

On the subject of indels..

Ah, a blog post. It’s been a while, as life has been busy lately. My daughter turned 3 last week, and I’ve moved half way across the world and back, but I have slowly found myself with things to say again.

And, the one that needs saying first is that, as a community, NGS people have done a terrible job on standardizing how we deal with Indels. SNVs aren’t bad – we only have half a dozen ways to mess them up – but indels are just something else.

After a year of working hard on SNVs, indels have fallen back on the menu, and I’ve been beating my head on the wall trying to solve it all in one shot. Needless to say, it’s not going to be that easy, but there are a few things that are really worth pointing out:

If you can represent something in the genome two different ways, you should pick the easiest, right? Wrong, there are people who don’t agree with this, and I can give you an example. Let’s say you have a reference sequence GAAAC, and you delete two As. Personally, I’d pick the left justified version and say GAA -> G. That’s pretty clear: you’ve removed two A’s after the G. Using the single redundant G makes it left justified, and anchored or rooted, and intuitively obvious. However, other people might disagree.

For instance, if you use a more old school style, that pre-dates Next-gen sequencing, you’d probably right justify it: AAC -> C… or take it one step further and drop the C, giving you AA -> -. Yes, that’s a dash. Between the left and right justification, there’s not much to say: it’s either one standard or the other. Right justification is used by a lot of databases, such as ClinVar, where many (most? all?) of the known deletions are pulled from clinical papers, which adopted that as the standard.

However, that’s far from the worst you can do.  You can also add one step to the confusion and pad your variant.  For instance, you could also represent the deletion of the two As with GAAAC->GAC.  Now, you’ll see it’s anchored on the left and the right, which is not necessarily a bad thing, but it is redundant.  You don’t need both for an unambiguous representation of the indel.  This is a non-reduced representation of the variant.  You can make them more confusing, if you try, though.  There are no bounds to the padding you can add.  Want a simple SNV to look more complicated?  How about: ACGTACTCGGCTAG->AGGTACTCGGCTAG. I would probably just shift the position over by one to the right and call it a C->G variant, and drop the padding.
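To make the idea of a reduced representation concrete, here is a minimal Python sketch of stripping the padding from a variant. The function name and the 1-based position argument are my own inventions for illustration, not anything taken from ExAC, dbSNP or the VCF tooling. It also only trims padding: it cannot turn a right justified variant into a left justified one, since that takes the surrounding reference sequence.

```python
def trim_padding(pos, ref, alt):
    """Strip shared padding from a variant, keeping one anchor base.

    pos is the (hypothetical) 1-based position of the first base of ref.
    Bases shared at the right end are removed first, then bases shared
    at the left end; each left trim moves the position one to the right.
    """
    # Trim the right side, always leaving at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim the left side, which shifts the variant's start position.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt

# The padded deletion from above collapses to the left justified form.
print(trim_padding(1, "GAAAC", "GAC"))
# The over-padded SNV shifts one base to the right and becomes C -> G.
print(trim_padding(1, "ACGTACTCGGCTAG", "AGGTACTCGGCTAG"))
```

Trimming the right side first is what leaves the deletion anchored on its left-hand G, so the padded GAAAC->GAC comes out as GAA -> G.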

Why do people not use reduced representations, though?  Because padding is more convenient for them.   Here’s an example I got from ExAC:  GAAA -> G,GA,GAA.  See what they’ve done there?  It’s actually three variants at the same position that I would represent with three different reference sequences, but by padding the variants, they can place them all on one line.  GA->G, GAA->G and GAAA->G.  If you don’t know that they’ve done this, it’s a bit surprising.  Indeed, I had to write to them to ask about it, because it wasn’t intuitively obvious to me why they show reduced variants on their web page, but distribute a VCF file with non-reduced variants.  There is a blog post about how to reduce variants, but as of last week, it wasn’t referenced in the readme files of their FTP site.
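As a rough sketch of undoing that, the snippet below splits a padded multi-allelic line into one variant per alternate allele and reduces each one separately. The function name and the position of 100 are made up for illustration, and this is not the method described in the ExAC blog post, just the obvious trimming approach.

```python
def split_and_reduce(pos, ref, alts):
    """Split a comma-separated ALT field and reduce each variant on its own."""
    reduced = []
    for alt in alts.split(","):
        r, a, p = ref, alt, pos
        # Drop shared bases on the right, keeping one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Drop shared bases on the left, shifting the position as we go.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        reduced.append((p, r, a))
    return reduced

# The ExAC-style line from above, at a hypothetical position 100.
for p, r, a in split_and_reduce(100, "GAAA", "G,GA,GAA"):
    print(p, r, "->", a)
```

Each alternate allele ends up with its own reference sequence, which is exactly why the three variants can’t share a line once they’re reduced.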

Regardless, ExAC isn’t the only one to use non-reduced representations – dbSNP does it as well, and I haven’t even begun to look at the myriad of other data sources we depend on for indel interpretation. It was rightly pointed out to me that non-reduced representations are not forbidden in the VCF 4.2 standard.  It’s definitely not forbidden, but then again, as a community, taking the position that anything not forbidden is allowed is a dangerous path for those who would like to see a unified standard.  We’re just not going to converge on the same page, if we keep stuff like this going.

Alas, Indels are a difficult minefield.  They are hard to call, hard to represent and hard to interpret.  We have a long path ahead of us to straighten it all out, but I don’t doubt we’ll get there.  This is just one more step we’ll have to take, in order to make sure we start getting these things right.

 

AMA for fun.

I’ve been asked by a few people to do an AMA, since I seem to be one of the few PhD-level Bioinformaticians working in industry who are active on the Reddit bioinformatics forum.  There are probably a lot of others, but I suspect that the bulk of people there are mostly graduate students or academics.

Anyhow, if anyone is interested in such silliness, here’s the link.

Of course, I’m going to feel pretty silly about the whole thing if no one asks any questions…

The glamour of Pipeline bioinformatics

I’m going to have to eat a bit of humble pie.  When I was a grad student, I may have just slightly looked down on “pipeline bioinformatics”, thinking it was a subject that was boring.  It clearly wasn’t as glamorous as designing new algorithms or plucking hidden bits of information out of giant data sets… I may have even thought it was something you just did as an afterthought.

I was wrong.

I have to admit, now that I’ve had a taste of it, I’m enjoying it for exactly the opposite reasons:  It’s a fascinating game of balancing everything you know about computers and biology all at the same time, while making sure you get the right answer consistently.  It’s a cross between doing jigsaw puzzles and playing Jeopardy…  and I’m kinda liking it.

In order to build a good pipeline, you need infrastructure that glues all the parts together, you need planning to make sure that it has room for growth, and you need to know what constraints the pipeline will face…  And, you need to understand how everything, from the bits of data you’re pushing through it to the hardware on all of the machines and wires it’s going to run on, will interact.  That’s no small feat – but it’s an exhilarating challenge.

While I may have thought algorithm design was the cat’s pyjamas, building a pipeline is the same resource management challenge scaled up to include a whole lot more moving parts.  And, to those who manage all of those working parts, I finally grok what it is that drives you – and I am only working on a pipeline that was assembled by others, not even one of my own creation – which just increases my respect for those who have built pipelines out of nothing:

The thrill of watching data cascade through the waterfall that is the pipeline.

The excitement of having each individual piece operating in harmony, squeezing out that last bit of performance.

The fun of adding in three more pieces you thought would never fit, but making it work.

The satisfaction of knowing you managed to tame the mangy electrons that seemed so unruly before they entered your pipeline.

The reward of having someone look at the data afterwards, and learning something new from it.

Yes, pipeline bioinformaticians, I owe you an apology, your product is a magnificent work of art in its own right – and it is only truly completed when people are able to forget that it’s there. Cheers to you!

 

How do you become a bioinformatician?

I’ve been following the bioinformatics sub-reddit for the past couple of months, ever since I stumbled upon it when a colleague asked me about bioinformatics resources on the web.  It’s a fascinating place to visit, but it’s incredibly repetitive in that people keep asking “How do I become a bioinformatician?”

Unfortunately there is not a single answer, because bioinformatics isn’t a single job – it’s a collection of people who have found a way to live with one foot in each of two worlds: computer programming and biology.  Getting a firm footing in each can be a serious challenge, as people spend years studying just one of those to become proficient at it.

However, I think there are some common threads that tie the field together.  You need to invest the time in at least a handful of basic fields: some basic programming, some elementary cell biology and at least a simple understanding of math or statistics.  What you can accomplish with just that little can be incredibly productive.  Mostly in terms of automation of data processing or modelling of your results.

On the other hand, bioinformatics also includes a lot of sub-disciplines.  Great programmers can build incredible pipelines.  Great mathematicians can invent or apply algorithms to create new ways of interpreting data, and great biologists can develop heuristics and re-interpret data in new ways to generate insights that others have overlooked.  There’s even room for “neat freaks” in organizing and imposing order on unruly data.

The challenge of becoming a bioinformatician is learning where your strengths and weaknesses lie, and using them to your advantage.  Finding a research group that shores up your weaknesses – or helps you fill them in – can be a great boost to your career.  After my master’s degree, I felt I had two big gaping holes in my resume: big data and databases, which I made the focus of my PhD research. Coming out of my defence, I felt I was able to bring a more balanced approach to the table – and had simultaneously purged any instinct I might have ever had to reach for a spreadsheet to interpret information. (Spreadsheets and big data don’t mix.)

So, where does that lead an aspiring bioinformatician?  Unless you take the time to do both a computer science degree and a biology degree, you probably won’t be able to shoehorn everything in to become an expert in both, and not everyone wants to get their PhD to fill in the gaps left in an undergrad education.

With that said, let me lay down a few useful points:

  1. Pick and choose to study subjects that interest you because you’ll at least end up with strengths in things you enjoy, which leads to jobs doing things you enjoy.
  2. You can always learn something new later… but take opportunities to try new things when they come.
  3. Remember that you’re not going to be the expert in every field you put your foot into – so look for opportunities to collaborate with the people who are.  (If you’re going into bioinformatics and expect to do everything yourself, you’re probably doing it wrong.)
  4. Don’t be afraid of the fact that you don’t know stuff.  Your job isn’t to be the best biologist and best computer scientist at the same time – it’s to be the bridge between.  The stronger your foundations, the better a bridge you can be, but unlike a concrete bridge, you can always invest in learning more.
  5. Yes, higher education does help in this field.  Bioinformatics is still dominated by research based organizations, and the academic hierarchy saturates the mindset of bioinformaticians everywhere.  (Or, almost everywhere.)
  6. Bioinformatics is also about the “soft” skills.  Don’t forget that bioinformaticians are also in a good place to be good leaders – since you’ll be one of the few people who can speak both languages, and tie together groups that would otherwise lack a common language.
  7. Don’t believe the hype about what you should learn:  R isn’t really the only language for doing bioinformatics.  Perl isn’t always evil (just most of the time, though it did save the human genome…), Java isn’t the slowest language out there, and C isn’t only for hardcore programmers. (Python, though, is a pretty good all-around language.)  Everyone has an opinion on where bioinformatics is going – but it’s just an opinion, so make your own choices.

At the end of the day, I always give students the same piece of advice:  As you go through life, you will learn new skills that you can apply as you see fit.  Each of these skills will be a tool in your toolbox that you can turn to when you hit a problem.  If you only have a hammer in your toolbox, your repertoire is pretty limited.  On the other hand, if you collect a fantastic assembly of tools, you’ll be equipped to handle just about anything that comes your way.  Your job is to invest your time into building the best toolkit you can, so that when you get out of school, you’ll be ready to solve as many problems as you can.

Bioinformatics is just a special case of toolbox building, in that you need the tools of at least two disciplines in your toolbox.  What you choose to put into your toolbox is entirely up to you, but (to stretch the toolbox analogy just a little too far), take a few minutes to ask if you’d like to be a plumber or a carpenter before you start collecting your tools. Or, without the metaphoric toolkit, ask yourself what kind of bioinformatician you want to be.

Once you know the answer to that question, you’ll figure out pretty quickly which tools you want to start collecting.  And the path towards becoming a bioinformatician will start to become clear.  It may not take you where you expect, but I can guarantee that you’ll be walking down an interesting road.