>Community

>This week has been a tremendous confluence of concepts and ideas around community. Not that I’d expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate’s successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (undrinkable dirty martinis), several of us had a half-hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, one I just can’t imagine having with many other groups of people. (By the way, I think Obi’s suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community-related event was an invitation to move my blog over to a larger community of bloggers. While I’ve temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a “distinct” source of information, easily distinguishable from other people’s blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks – I don’t want to sink into a “borg” experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation – and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of “successes and failures in developing an open source community”, as applied to the Vancouver Short Read Analysis Package. Yes, I’m happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us – not to mention how we integrate into them, both in our work and outside. If you can’t maximize your ability to motivate them (or their ability to motivate you), then you’re at a serious disadvantage. How we balance all of that is an open question, and one I’m still working hard at answering.

I’ve attached my presentation from this morning, just in case anyone is interested. (I’ve decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it’s about 7 MB).

>4 Freedoms of Research

>I’m going to venture off the beaten track for a few minutes. Ever since the discussion about conference blogging started to take off, I’ve been thinking about what the rights of scientists really are – and then came to the conclusion that there really aren’t any. There is no scientist’s manifesto or equivalent oath that scientists take upon receiving their degree. We don’t wear the iron ring like engineers, which signifies our commitment to integrity…

So, I figured I should do my little part to fix that. I’d like to propose the following four basic freedoms of research, without which science cannot flourish.

  1. Freedom to explore new areas
  2. Freedom to share your results
  3. Freedom to access findings from other scientists
  4. Freedom to verify findings from other scientists

Broadly, these rights should be self-evident. They are tightly intermingled, and cannot be separated from each other:

  • The right to explore new ideas depends on us being able to trust and verify the results of experiments upon which our exploration is based.
  • The right to share information is contingent upon other groups being able to access those results.
  • The purpose of exploring new research areas is to share the results with people who can build upon them.
  • Being able to verify findings from other groups requires that we have access to their results.

In fact, they are so tightly intermingled that they are a direct consequence of the scientific method itself.

  1. Ask a question that explores a new area
  2. Use your prior knowledge, or access the literature, to make a best guess at the answer
  3. Run the test and confirm/verify whether your guess matches the outcome
  4. Share your results with the community.

(I liked the phrasing on this site.) Of course, if your question in step 1 is not new, you’re performing the verification step.

There are constraints on what we are allowed to do as scientists as well: we have to respect the ethics of the field in which we do our exploring, and we have to respect the fact that, ultimately, we are responsible for reporting to the people who fund the work.

However, that’s where we start to see problems. To the best of my knowledge, funding sources define the directions science is able to explore. For the past eight years, we saw the U.S. restrict funding to science in order to throttle research in various fields (violating Research Freedom #1), which effectively halted stem cell research, suppressed work on alternative fuel sources, and so on. In the long term, this technique won’t work, because scientists migrate to where the funding is. As the U.S. restores funding to these areas, the science is returning. Unfortunately, it’s Canada’s turn, with the conservative government (featuring a science minister who doesn’t believe in evolution) removing all funding from genomics research. The cycle of ignorance continues.

Moving along, and clearly in a related vein, Freedom #4 is also a problem of funding. Researchers who would like to verify other groups’ findings (a key responsibility of the basic peer-review process) aren’t funded to do this type of work. While I admit my lack of exposure to granting committees, I’ve never heard of a grant being awarded to verify someone else’s findings. However, this is the basic way by which scientists are held accountable. If no one can repeat your work, you will have many questions to answer – and yet the funding for ensuring accountability is rarely present.

The real threat to an open scientific community comes with Freedoms #2 and #3: sharing and access. If we’re unable to discuss the developments in our field, or are not even able to learn about the latest work being done, then science will grind to a halt. We’ll waste our time and money exploring areas that have already been exhaustively covered or, worse yet, come to the wrong conclusions about which areas are worth exploring, in our ignorance of what’s really going on.

Ironically, Freedoms #2 and #3 are the most eroded in the scientific community today. Even considering only the academic world, where these freedoms are taken for granted, our interaction with the forums for sharing (and accessing) information is horribly stunted:

  • We do not routinely share negative results (causing unnecessary duplication and wasting resources)
  • We must pay to have our results shared in journals (limiting what can be shared)
  • We must pay to access other scientists’ results in journals (limiting what can be accessed)

It’s trivial to think of other examples of how these two freedoms are being eroded. Unfortunately, it’s not so easy to think of how to restore these basic rights to science, although there are a few things we can all do to encourage collaboration and sharing of information:

  • Build open source scientific software and collaborate to improve it – reducing duplication of effort
  • Publish in open access journals to help disseminate knowledge and bring down the barriers to access
  • Maintain blogs to help disseminate knowledge that is not publishable

If all scientists took advantage of these tools and opportunities to further collaborative research, I think we’d see a shift away from conferences and towards online collaboration, along with the development of tools favoring faster and more efficient communication. This, in turn, would provide a significant speed-up in the generation of ideas and technologies, leading to more efficient and productive research – something I believe all scientists would like to achieve.

To close, I’d like to propose a hypothesis of my own:

By guaranteeing the four freedoms of research, we will be able to accomplish higher-quality research, more efficient use of resources, and more frequent breakthroughs in science.

Now, all I need to do is to get someone to fund the research to prove this, but first, I’ll have to see what I can find in the literature…

>More on conference blogging…

If you’ve been following along with the debate on conference blogging, you’ve surely been reading Daniel MacArthur’s blog, Genetic Future. His latest post on the subject provides a nifty idea: presenters who are OK with their talks being discussed should have an icon in the conference proceedings, beside the announcement of their talks, so that members of the audience know it’s safe to discuss the work. He even goes so far as to present a few icons that could be used.

On the whole, I’m not opposed to such a scheme – particularly at conferences like Cold Spring, where unpublished information is commonly presented and even encouraged by the organizers. However, Cold Spring is one of the rare venues where attendance is “open” but the policy on disclosing the information is restricted: it’s entirely regulated for journalists, but in the past has not been an issue for scientists. That said, if a conference begins to restrict what scientists are allowed to disclose outside of the meetings, the organizers are really removing themselves from the free and open scientific debate. A conference that does that isn’t technically a conference – at best it’s a closed-door meeting – and the material should explicitly be labeled as confidential.

Assuming that the vast majority of presentations can’t be discussed without explicit permission is anathema to science. If you look at the way technology is handled in western society, you’ll see a general trend: the patent system is based around the idea of disclosure, copyright is based on the idea of retaining rights after disclosure, and even our publication/peer-review system demands full disclosure as the minimum standard. (Well, that plus a wad of cash for most journals…) For most conferences, then, I suggest we use a more fitting model than the opt-in to disclosure proposed by Daniel. Rather, we should provide the opportunity to opt out.

All presenters should have the option of choosing “I do not want my presentation disclosed.” We can even label their presentation with a nice little doohickey that indicates that the material is not for public discussion.


Audience members who attend the talk then agree that they are not allowed to discuss this information after leaving the room. Why operate in half measures? It’s either confidential or it’s not. Why should we forbid people from discussing it online, and then turn a blind eye to someone reading their notes in front of the non-attending members of their institution?

Hyperbole aside, what we’re all after here is a common middle ground. Science bloggers don’t want to bite the hands of the conference organizers, and I can’t really imagine conference organizers not being interested in fostering a healthy discussion. After all, conferences like AGBT have done well because of the buzz that surrounds them.

As I said in my last post on the topic, science does well when the free and open exchange of ideas is allowed to take place, and people presenting at conferences should be aware of why they’re presenting. (I leave figuring out those reasons as an exercise for the student.)

Let’s not throw the blogger out with the bathwater in our haste to find a solution. Conferences are about disclosure and blogs are about communication: aren’t we all working towards the same goal?

>Another day, another result…

>I had the urge to just sit down and type out a long rant, but then common sense kicked in and I realized that no one is really interested in yet another graduate student’s rant about their project not working. However, it only took a few minutes for me to figure out why it’s relevant to the general world – something that’s (unfortunately) missing from most grad student projects.

If you follow along with Daniel MacArthur’s blog, Genetic Future, you may have caught the announcement that Illumina is getting into the personal genome sequencing game. I can’t say I was surprised by the news, but I will admit that I’m somewhat skeptical about how it’s going to play out.

If your business is using arrays, then you’ll have an easy time sorting through the relevance of the known “useful” changes to the genome – there are only a couple hundred or thousand that are relevant at the moment, and several hundred thousand more that might be relevant in the near future. However, when you’re sequencing a whole genome, interpretation becomes a lot more difficult.

Since my graduate project is really the analysis of transcriptome sequencing (a subset of genome sequencing), I know firsthand the frustration involved. Indeed, my project was originally focused on identifying changes to the genome common to several cancer cell lines. Unfortunately, this is what brought on my need to rant: there is vastly more going on in the genome than small sequence changes.

We tend to believe blindly what we were taught as the “central dogma of molecular biology”: genes are copied to mRNA, mRNA is translated to proteins, and the protein goes off to do its work. However, cells are infinitely more complex than that. Genes can be inactivated by small changes; they can be chopped up and spliced together to become inactivated or even deregulated; interference can be run by distally modified sequences; gene splicing can be completely co-opted by inactivating genes we barely even understand yet; and desperately over-expressed proteins can be marked for degradation by over-activated garbage-collection systems so that they never get where they were needed in the first place. And here we are, looking for single nucleotide variations, which make up a VERY small portion of the information in a cell.

I don’t have the solution yet, but whatever we do in the future, it’s not going to involve just $48,000 genome re-sequencing. That information on its own is pretty useless – we’ll also have to study expression (WTSS or RNA-Seq, so figure another $30,000), changes to epigenetics (of which there are many histone marks, so figure 30 x $10,000), and even DNA methylation (I don’t begin to know what that process costs).
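Just to make the scale of that explicit, here’s a quick back-of-the-envelope tally using the ballpark figures above (they’re my rough estimates, not quotes, and the methylation line is missing entirely):

```python
# Rough tally of the ballpark per-genome costs quoted above.
# These are the post's own estimates, not vendor pricing.
costs = {
    "genome re-sequencing": 48_000,
    "expression (WTSS / RNA-Seq)": 30_000,
    "histone marks": 30 * 10_000,  # ~30 marks at ~$10k each
    # DNA methylation omitted: cost unknown
}

total = sum(costs.values())
print(f"Known line items: ${total:,}")  # -> Known line items: $378,000
```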

So, yes, while I’m happy to see genome re-sequencing move beyond the confines of array-based SNP testing, I’m pretty confident this isn’t the big step forward it might seem. The early adopters might enjoy having a pretty piece of paper that tells them something unique about their DNA, and I don’t begrudge them that. (In fact, I’d love to have my DNA sequenced, just for the sheer entertainment value.) Still, I don’t think we’re seeing a revolution in personal genomics – not quite yet. Various experiments have shown we’re on the cusp of a major change, but this isn’t the tipping point: we’re still going to have to wait for real insight into the use of this information.

When Illumina offers a nice toolkit that allows you to get all of the SNVs, changes in expression and full ChIP-Seq analysis – and maybe even a few mutant transcription factor ChIP-Seq experiments thrown in – and all for $48,000, then we’ll have a truly revolutionary system.

In the meantime, I think I’ll hold out on buying my genome sequence. $48,000 would buy me a couple more weeks in Tahiti, which would currently offer me a LOT more peace of mind. (=

And on that note, I’d better get back to doing the things I do…. new FindPeaks tag, anyone?

>Once more into the breach…

I haven’t been able to follow the whole conversation going on with respect to conference blogging, since I’m still away at a conference for another day. Technically, the conference ended on Thursday, but I’m still here visiting with some of the more important people in my life – so that is my excuse.

At any rate, I received an interesting comment from someone posting as “such.ire”, to which I wrote a reply. In the name of keeping the argument going (since it is such a fascinating topic), I thought I’d post my reply to the front page. For context, I suggest reading such.ire’s comment first:

click here for his comment.

My reply is below:

——-

Hi Such.ire,

I really appreciate your comment – it’s a great counterpoint to what I said, and it really emphasizes the fact that this debate has plenty of nuances, which will undoubtedly carry this conversation on long after the blogosphere has finished with it.

To rebut a few of your points, however, I should point out that your examples aren’t all correct.

Yes, conferences are well within their rights to ask you to sign NDAs as an attendee – or to require that confidentiality be a condition of attending – there is no debate on that point. However, if you attend a conference that is open and does not have an explicit policy, then it really is an open forum, and the organizers do not have the right to retroactively dictate what you can (or can’t) do with the information you gathered at the conference.

I think all of us would agree that the boundaries for a conference should be clearly specified at the time of registration.

As for lab talks for your lab members – those are not “public disclosures” in the eyes of the law. All of your lab colleagues are bound by the rules that govern your institution, and I would be surprised if your institution hadn’t asked you to agree to various confidentiality rules or disclosure policies at the time you joined.

Department seminars are somewhat different – if they are advertised outside the department to individuals who are not members of the institution, then, again, I would suggest they are fair game.

I don’t blog departmental talks or RIP talks for that reason. They are not public disclosures of information.

Finally, my last point was not that journalists and bloggers do anything different up front, but that the medium in which they publish should have a major impact on how they are treated. Bloggers can make corrections that reach all of their audience and can update their stories, while journalists cannot.

If a conference demands to see the material a journalist will publish up front, that makes sense. If they demand the same thing of a blogger, it completely ignores the context of the medium in which the communication occurs.

>The Rights of Science Blogging

An article recently appeared on scienceweb in relation to Daniel MacArthur’s blog coverage of a conference he attended at Cold Spring Harbor, which has raised a few eyebrows (the related article is here). Cold Spring Harbor has a relatively strict policy for journalists, but it appears that Daniel wasn’t constrained by it, since he’s not a “journalist” by the narrow definition of the word. More than half of the advice I’ve ever received on blogging science conferences comes from Daniel, and I would consider him one of the more experienced and professional of the science bloggers – which makes this whole affair just that much more interesting. If anyone takes exception to blogging, Daniel’s coverage of an event is guaranteed to be the least offensive, best-researched and most professional of the blogs, and hence the least likely to be the one that causes the outcry.

As far as I can tell from the articles, Cold Spring is relatively upset about this whole affair, and is going down the path that many other institutions have chosen: trying to suppress blogging instead of embracing it.

Unfortunately, there are really very few reasons for this to be an issue – and I thought I’d put forward a few counter-points to those who think science blogging should be restrained.

1.  Public disclosure

Unless the conference organizers have explicitly asked each participant to sign a non-disclosure agreement, the conference contents are considered a form of public disclosure. This is relevant not because the potential for people to talk about the work matters in itself, but because, legally, this is when the clock starts ticking if you intend to profit from your discovery. In most countries, the first public disclosure of an invention is when you begin to lose rights to it – broadly speaking, it often means that you have one year to officially file the patent, or the rights to it become void. Public disclosure can be as simple as emailing your invention in an unencrypted file or leaving a copy of a document in a public place… the bar for public disclosure is really quite low. More crucially, you can lose the right to patent something at all if it’s disclosed publicly before the patent is filed.

Closer to home, you might have to worry about academic competition. If you stand up in front of a room and tell everyone what you’ve just discovered (before you’ve submitted it), anyone can then replicate that experiment and scoop you… The academic world runs on who published what first – so we already have a built-in instinct to keep our work quiet until we’re ready to release it. (There’s another essay in that on open source science, but I’ll get to it another day.) So, when academics stand up in front of an audience, it’s always with something that’s ready to be broadcast to the world. The fact that it’s then being blogged to a larger audience is generally irrelevant at that point.

2.  Content quality

An argument raised by Cold Spring suggests that they are afraid that the material being blogged may not be an accurate reflection of the content of the presentation.  I’m entirely prepared to call B*llsh!t on this point.

Given a journalist with a bachelor’s degree in general science, possibly a year or two of journalism school, and maybe a couple of years of experience writing articles, versus a graduate student with several years of experience tightly focussed on the subject of the conference, who is going to write the more accurate article?

I can’t seriously believe that Cold Spring or anyone else would have a quality problem with science blogging – when it’s done by scientists with an interest in the field.  More on this in the conclusion.

3. Journalistic control

This one is more iffy to begin with. Presumably, the conference would like to have tighter control over the journalists who write articles, in order to make sure that the content is presented in a manner befitting the institution at which the conference took place. Frankly, I have a hard time separating this from the last point: if the quality of the article is good, what right does the institution have to dictate the way it’s presented by anyone who attended? If I sit down over beers with my colleagues and discuss what I saw at the conference, we’d all laugh if a conference organizer tried to censor my conversation. It’s both impossible and a violation of the right to free speech. (Of course, if you’re in Russia or China, that argument might have a completely different meaning, but in North America or Europe, this shouldn’t be an issue.) The fact that I record that conversation and allow free access to it, in print or otherwise, should not change my right to freely convey my opinions to my colleagues.

Thus, I would argue that you can have either a closed conference or an open conference – you have to pick one or the other, and not hold different attendees to different standards depending on the mode by which they converse with their colleagues.

4. Bloggers are journalists

This is a fine line. Daniel and I have very different takes on how we interact with the blogosphere: I tend to publish notes and essays, where Daniel focusses more on news, views and well-researched topic reviews. (Sorry about the alliteration.) There is no one format for bloggers, just as there isn’t one for journalists. Rather, it’s a continuous spectrum of how information is distributed, and for journalists to get upset about bloggers in general makes very little sense. Most bloggers work in the niches where journalists are sparse. In fact, for most people, the niches are what make blogs interesting. (I’m certainly not aware of any journalists who work on ChIP-Seq full time, and that is, I suspect, the main reason why people read my feeds.)

Despite anything I might have to say on the subject, the final answer will be decided by the courts, who have been working on this particular thorny issue for years. (Try plugging “are bloggers journalists” into Google, and you’ll find more nuances to the issue than you might expect.)

What it comes down to is that bloggers are generally protected by the same laws that protect journalists, such as the right to keep their sources confidential, and bound by the same limits, such as the ability to be sued for spreading false information.  Responsibility goes hand in hand with accountability.

And that, of course, is how institutions like Cold Spring Harbor should address the issue.

Conclusion:

Treating science bloggers the way Cold Spring Harbor treats journalists doesn’t make sense. Specialists talking about a field in public is something the community has been trying to encourage for years: greater disclosure, more open dialog and sharing of ideas are fundamental pillars of western science. To force bloggers into the category of journalists in the world of print magazines is utterly ridiculous: bloggers’ articles can be updated to fix typos, to adjust the content and to ensure clarity, whereas journalists work in a world in which a typo becomes part of the permanent record and misunderstandings can remain in the public mind for decades. Both have the power to reach a large audience – but only bloggers have the ability to go back and make corrections. Working with bloggers is a far better strategy than working against them.

No matter how you slice it, institutions with a vested interest in a single business model always resist change – and so do those who have not yet come to terms with the advances of technology. Unfortunately, it sounds like Cold Spring Harbor hasn’t yet adapted to the internet age and is trying to fit a square peg into a round hole.

I’d like to go on the record in support of Daniel MacArthur – blogging a conference is an important method of creating dialog in the science community. We can’t all attend every conference, but we shouldn’t all be left out of the discussion – and blogs are one important way to achieve that.

If Cold Spring Harbor has a problem with Daniel’s blog, let them come forward and identify the problem. Sure, they can ask bloggers to announce their blog URLs before the conference, allowing the organizers to follow along and be aware of the reporting – I wouldn’t argue against that. It provides accountability for those blogging the conference – which serious bloggers won’t object to – and it allows the bloggers to go forth and engage the community.

To strangle the communication between conference attendees and their colleagues, however, is to throttle the scientific community itself. Let’s all challenge Cold Spring Harbor to do the right thing and adapt with the times, rather than ask scientists to drop a useful tool just because it’s inconvenient and doesn’t fit in with the way the conference organizers currently interact with their audience.

>It never rains, but it pours…

Today is a stressful day. Not only do I need to finish my thesis proposal revisions (which are not insignificant, because my committee wants me to focus more on the biology of cancer), but we’re also in the middle of real estate negotiations. Somehow, this is more than my brain can handle on the same day… At least we should know by 2pm whether our counter-offer was accepted on the sales portion of the transaction, which would officially trigger the countdown on the purchase portion. (Of course, if it’s not accepted, then more rounds of offers and counter-offers will probably take place this afternoon. WHEE!)

I’m just dreading the idea of doing my comps the same week as trying to arrange moving companies and insurance – and the million other things that need to be done if the real estate deal happens.

If anyone was wondering why my blog posts have dwindled over the past couple of weeks, well, now you know! If the deal does go through, you probably won’t hear much from me for the rest of the year. Some of the key dates this month:

  • Dec 1st: hand in completed and reviewed Thesis Proposal
  • Dec 5th: Sales portion of real estate deal completes.
  • Dec 6th: remove subjects on the purchase, and begin the process of arranging the move
  • Dec 7th: Significant Other goes to Hong Kong for ~2 weeks!
  • Dec 12th: Comprehensive exam (9am sharp!)
  • Dec 13th: Start packing 2 houses like a madman!
  • Dec 22nd: Hanukkah
  • Dec 24th: Christmas
  • Dec 29th: Completion date on the new house
  • Dec 30th: Moving day
  • Dec 31st: New Year’s!

And now that I’ve procrastinated by writing this, it’s time to get down to work. I seem to have stuff to do today.

>Bioinformatics Companies

>I was working on my poster this afternoon, when I got an email asking me to provide my opinions on certain bioinformatics areas I’ve blogged on before, in return for an Apple iPod Touch in a survey that would take about half an hour to complete. Considering that ratio of value to time (roughly 44x what I get paid as a graduate student), I took the time to take the survey.

Unfortunately, at the very end of the survey, it told me I wasn’t eligible to receive the iPod. Go figure. Had they told me that first, I probably would have (wisely) spent that half hour on my poster or studying. (Then they told me they’d ship it in 4-6 weeks… OK, then.)

In any case, the survey asked very targeted questions with multiple-choice answers that really didn’t encompass the real/full answers, and questions that were so leading that there really was no way to give a complete answer. (I like boxes in which to give my opinions… which kind of describes my blog, I suppose – a box into which I write my opinion. Anyhow…) In some ways, I have to wonder if the people who wrote the survey were trying to sell their product or get feedback on it. Still, it led me to think about bioinformatics applications companies. (Don’t worry, this will make sense in the end.)

The first thing you have to notice as a bioinformatics software company is that you have a small audience. A VERY small audience. If Microsoft could only sell its OS to a couple hundred or a thousand labs, how much would it have had to charge to make several billion dollars? (Answer: too much.)

And that’s the key issue – bioinformatics applications don’t come cheap. To make a profit on a bioinformatics application, you can only do one of four things:

  1. Sell at a high volume
  2. Sell at a high price
  3. Find a way to tie it to something high-priced, like a custom machine.
  4. Sell a service using the application.

The first is hard to do – there aren’t enough bioinformatics labs for that. The second is common, but really alienates the audience. (Too many bioinformaticians believe that a grad student can just build their own tools from scratch cheaper than buying a pre-made and expensive tool, but that’s another rant for another day. I’ll just say I’m glad it’s not a problem in my lab!) The third is good, but buying a custom machine has hidden support costs and, in a world where applications get faster all the time, runs the risk of the device becoming obsolete all too quickly. The last one is somewhat of a non-starter. Who wants to send their results to a third party for processing? Data ownership issues aside, even when the bandwidth isn’t prohibitively expensive, the network transfer time usually negates the advantages of doing it that way.

So that leaves anyone who wants to make a profit in bioinformatics in a tight spot – and I haven’t even mentioned the worst part of it yet:

If you are writing proprietary bioinformatics software, odds are someone out there is writing a free version of it too. How do you compete against free software, which is often riding on the cutting edge? Software patents are also going to be hard to enforce in the post-Bilski legal world, and even if a company managed to sue a piece of software out of existence (e.g., via injunctions), someone else would just come along and write their own version. After all, bioinformaticians are generally able to program their own tools if they need to.

Anyhow, all this was sparked by the survey today, making me want to give the authors of the survey some feedback.

  1. Your audience knows things – give them boxes to fill in to give their opinions. (Even if they don’t know things, I’m sure it’s entertaining.)
  2. Don’t try to lead the respondents to the answers you want – let them give you their opinions. (That can also be paraphrased as “less promotional material, and more opinion asking.” Isn’t that the point of asking for their opinions in the first place?)
  3. Make sure your survey works! (The one I did today asked a few questions to test whether I was paying attention to what I was reading, and then told me I got the answers wrong, despite confirming that the answer I checked was correct. Oops.)

So how does all of that tie together?

If you ask questions whose only possible answers are the ones you’ve provided, you’re going to convince yourself that the audience and pricing for your product are something they may not be. Bioinformatics software is a hard field to be successful in – and asking the wrong questions will only make it harder to understand the pitfalls ahead. With pressure on both the business side and the software side, this is not a field in which you can afford to ask the wrong questions.

>SNP calling from MAQ

With that title, you’re probably expecting a discussion of how MAQ calls SNPs, but you’re not going to get it. Instead, I’m going to rant a bit, but bear with me.

Rather than just use the MAQ SNP caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my SNP caller has several added functionalities that I wanted to use, and *of course* I thought it would be easy. Was it, you might also ask? No – but not for the reasons you might expect.

I spent the last four days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don’t get me started on platform-dependent binary files!), I have a working SNP caller, and I even have all the code to link them together. What I was missing was all of the little tricks – particularly the ones for handling intron-spanning reads in transcriptome data sets – and the code that links the “kludges” together with the method I didn’t know about when I started. After hacking away at it, bit by bit things began to work. Somewhere north of 150 code commits later, it all came together.
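For anyone wondering what the intron-spanning headache looks like in practice, here’s a hypothetical sketch of the coordinate bookkeeping – it is not the actual package code, and it assumes a junction reference built by joining the last k bases of an upstream exon to the first k bases of a downstream exon:

```python
# Hypothetical sketch only -- not the actual Vancouver Short Read Analysis
# Package code.  Assumes a junction reference made of the last k bases of an
# upstream exon joined to the first k bases of a downstream exon; a read
# aligned at offset `pos` in that reference covers one or two genomic blocks.

def junction_read_to_genomic_blocks(pos, read_len, k, exon1_end, exon2_start):
    """Return (start, end) genomic blocks in 0-based, half-open coordinates.

    exon1_end   -- coordinate one past the last base of the upstream exon
    exon2_start -- coordinate of the first base of the downstream exon
    """
    blocks = []
    left = min(read_len, max(0, k - pos))   # bases landing in the upstream exon
    if left:
        start = exon1_end - (k - pos)
        blocks.append((start, start + left))
    right = read_len - left                 # bases landing in the downstream exon
    if right:
        start = exon2_start + max(0, pos - k)
        blocks.append((start, start + right))
    return blocks

# A 36-mer starting 10 bases before the junction point (k = 30):
print(junction_read_to_genomic_blocks(pos=20, read_len=36, k=30,
                                      exon1_end=5000, exon2_start=7500))
# -> [(4990, 5000), (7500, 7526)]
```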

If you’re wondering why it took so long, the answer is threefold:

1. I started off depending on someone else’s method, since they came up with it. As is often the case, that person was working quickly to get results, and I don’t think they had the goal of writing production-quality code. Since I didn’t have their code (though, honestly, I didn’t ask for it either, since it was in Perl – which is another rant for another day), it took a long time to settle all of the off-by-one, off-by-two and otherwise unexpected bugs. They had given me all of the clues, but there’s a world of difference between being pointed in the general direction of your goal and having a GPS to navigate you there.

2. I was trying to write code that would be re-usable. That’s something I’m very proud of, as most of my code is modular and easy to re-purpose for my next project. Halfway through this, I gave up: the code for this SNP calling is not going to be re-usable. Truth be told, I think I’ll have to redo the whole experiment from the start at some point, because I’m not fully satisfied with the method and we won’t be doing it exactly this way in the future. I just hope the change doesn’t happen in the next three weeks.

3. Namespace errors. For some reason, every single person has a different way of naming the 24-ish chromosomes in the human genome. (Should we count the mitochondrial genome as one of them?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the mitochondrial genome “MT” or just “M”? What case do we use for “X” and “Y” (or is it “x” and “y”?) in our files? Should we prepend “chr” to our chromosome names? And what on earth is “chr5_random” doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read… bleh.
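To make the namespace complaint concrete, here’s a minimal sketch of the sort of normalization function that keeps getting rewritten – the canonical form chosen here (“chr1” … “chrX”, “chrY”, “chrM”) is an arbitrary choice for illustration, not a standard:

```python
# Minimal sketch of chromosome-name normalization; the canonical "chr..."
# form (and mapping MT -> M) is an arbitrary choice for illustration.

def normalize_chromosome(name):
    """Map '1', 'chr1', 'x', 'MT', 'chrM', ... onto a single naming scheme."""
    name = name.strip()
    if name[:3].lower() == "chr":
        name = name[3:]
    upper = name.upper()
    if upper in ("X", "Y"):
        name = upper
    elif upper in ("M", "MT"):        # mitochondrial genome: 'MT' vs 'M'
        name = "M"
    return "chr" + name               # oddballs like 'chr5_random' pass through

for raw in ("1", "chrX", "y", "MT", "chr5_random"):
    print(raw, "->", normalize_chromosome(raw))
# 1 -> chr1, chrX -> chrX, y -> chrY, MT -> chrM, chr5_random -> chr5_random
```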

Anyhow, I fully admit that SNP calling isn’t hard to do. Once you’ve read all of your sequences in, determined which bases are worth keeping (prb scores), and settled on the minimum level of coverage and the minimum number of bases needed to call a SNP, there’s not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then, ta-da, you have all your SNPs.
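For what it’s worth, that “not much left to do” step boils down to a pileup and a couple of thresholds. The sketch below is a deliberately naive illustration with made-up cutoffs and input structures – it leaves out the prb handling, the intron-spanning logic and the Ensembl annotation step:

```python
# Naive illustration of the pileup-and-threshold step; the cutoffs and input
# structures are placeholders, not the values used in any real caller.
from collections import Counter

MIN_COVERAGE = 8        # minimum usable reads covering a position
MIN_VARIANT_READS = 3   # minimum reads supporting the non-reference base
MIN_BASE_QUALITY = 20   # discard low-quality base calls first

def call_snps(pileup, reference):
    """pileup:    {(chrom, pos): [(base, quality), ...]}
    reference: {(chrom, pos): reference_base}
    Yields (chrom, pos, ref, variant, coverage, support)."""
    for (chrom, pos), observations in pileup.items():
        bases = [b for b, q in observations if q >= MIN_BASE_QUALITY]
        if len(bases) < MIN_COVERAGE:
            continue
        ref = reference[(chrom, pos)]
        variants = Counter(b for b in bases if b != ref)
        if not variants:
            continue
        variant, support = variants.most_common(1)[0]
        if support >= MIN_VARIANT_READS:
            yield chrom, pos, ref, variant, len(bases), support
```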

However, once you’re done all of this, you realize that the big issue is that there are now too many SNP callers, and everyone and their pet dog is working on one. There are several in use at the GSC alone: mine; at least one custom one that I’m aware of; one built into an aligner (Bad Idea(tm)) under development here; and the one tacked onto the Swiss Army knife of aligners and tools, MAQ. Do they all give different results, or is one better than another? Who knows. I look forward to finding someone who has the time to compare them, but I really doubt there’s much difference beyond the alignment quality.

Unfortunately, because the aligner field is still immature, there is no single output file format common to all aligners, so the comparison is a pain to do – which means it’s probably a long way off. That, in itself, might be a good topic for an article one day.