This is entirely inspired by a blog post of a very similar name from Xianjun Dong on the r-bloggers.com site. The R-specific focus didn’t do much for me, given that R as a language leaves me annoyed and frustrated, although I do understand why others use it. I haven’t come across Xianjun’s work before, and have never met him either online or in person, but I hope he doesn’t mind me revisiting his list with a broader scope. Thanks to Xianjun for creating the original list!
I’ve paraphrased his points in underline, written out my response to each point, and highlighted what I feel is the takeaway. So, let’s upgrade the list a bit, shall we?
1. Use a non-random seed. Actually, that’s pretty good, but the real point should extend this to all areas of your work: determinism is the key both to debugging and to science – you need to be able to recreate all of your work upon demand. That’s the basis of how we do science.
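To make that concrete, here is a minimal Python sketch of a seeded, repeatable subsample. The function name `sample_reads`, the read IDs and the default seed of 42 are all made up for illustration; the only real point is that the generator is created from a fixed seed, so the same call gives the same answer every time.

```python
import random


def sample_reads(read_ids, n, seed=42):
    """Return n read IDs chosen with a private, seeded generator.

    `seed=42` is an arbitrary illustrative default; record whatever
    value you actually use alongside your results.
    """
    rng = random.Random(seed)  # local generator, no hidden global state
    return rng.sample(read_ids, n)


# Hypothetical read IDs, just to have something to sample from.
read_ids = [f"read_{i}" for i in range(1000)]

first = sample_reads(read_ids, 5)
second = sample_reads(read_ids, 5)

# Same input and same seed give the same subsample on every run.
print(first == second)
```

Using a local `random.Random` instance rather than the module-level functions is a design choice: nothing else in the program can reseed it behind your back.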
2. The original said “set your own tmp directory” so that you don’t step on the toes of other applications. Frankly, I’d skip that, and instead, suggest you learn how the other applications work! If you’re running a piece of code, take the time to learn it – and by extension, all of its parameters. The biggest mistake I see from novice bioinformaticians is trying to use code they’re not familiar with, and doing something the author never intended. Don’t just run other people’s tools, use them properly!
3. An R-specific file name hint. This point was far too R-centric, so I’ll just point you back to another key point: Take the time to learn the biology. Don’t get so caught up in the programming that you forget that underneath all of the code lies an actual biology or chemistry problem that you’re trying to study, simulate or interpret. Most often, the best bioinformatics solutions are the ones that are inspired by the biology itself.
4. Create a Readme file for your work. This is actually just the tip of the iceberg – Readme files are the last resort for any serious software project. A reasonable software project should have a wiki or a manual, as well as a host of other documentation. (Bug trackers, feature trackers, unit tests, example data files.) The list should grow with the size of the project. If your project is going to last more than a couple of weeks, then a readme file needs to grow into something larger. Documentation should be an integral part of your coding practice, however you do it.
5. Comment your code. Yes – please do. But, don’t just comment your code, write code that doesn’t need comments! One of the reasons why I love Python is because there is a pythonic way to do things, and minimal comments are necessary to make it obvious what it’s supposed to do. Of course, anytime you think of a “clever” trick, that’s a prime candidate for extra documentation, and the more clever you are, the more documentation I expect.
6. Back up your code. Yep – I’m going to agree with the original. However, I do disagree with the execution. Don’t just back up your code to an extra disk, get your code into version control. The only person who doesn’t need version control is the person who never edits their code… and I haven’t met them yet. If you expect your project to be successful, then expect it to mature over time – and in turn, that you’ll have multiple versions. Trust me, version control doesn’t just back up, it makes code management and collaboration possible. Three for the price of one… or for free if you use GitHub.
7. Clean up your intermediate data. Actually, I think keeping intermediate data around is a useful thing, while you’re working. Yes, biological data can create big files, and you should definitely clean up after yourself, but the more important lesson is to be aware of the resources that are available to you – of which disk space is just one. Indeed, all of programming is a tradeoff between CPU, memory and disk, and to a large extent you can spend more of one to save on another. If you’re not aware of the space-time tradeoff, then you really haven’t started your journey as a bioinformatician. Really – this is probably the most important lesson you can learn as a programmer.
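As a small illustration of that tradeoff, here is a sketch that spends memory to save CPU by caching results. The `gc_content` function, the `calls` counter and the toy sequences are my own illustrative choices; `functools.lru_cache` is from the Python standard library.

```python
from functools import lru_cache

calls = 0  # counts how often the real computation actually runs


@lru_cache(maxsize=None)  # unbounded cache: more memory, less CPU
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    global calls
    calls += 1
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy input with repeated sequences, as you might see with duplicate reads.
seqs = ["ACGT", "GGCC", "ACGT", "ACGT"]
values = [gc_content(s) for s in seqs]

print(values)  # GC fraction for each input, in order
print(calls)   # number of times the function body ran
```

Here the body runs only once per distinct sequence. Setting `maxsize` to a finite number moves you back along the same tradeoff: less memory, more recomputation.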
8. .bam, not .sam. This point is a bit limited in scope, so let’s widen it. All of the data you’ll ever deal with is going to be in a less-than-optimal format for storage, and it’s on you to figure out what the right format is going to be. Have VCFs? Gzip them! Have .sam files? Make them .bam files! Of course, this doesn’t just go for storage: Do the same for how you access them. That gzipped VCF? You should have bgzipped it and then tabix indexed it. Same goes for your FASTA file (faidx?), or whatever else you have. Don’t just use compression, use it to your advantage.
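For the storage half of that point, a minimal sketch using only Python’s standard `gzip` module looks like this. The VCF-like content and the file name `example.vcf.gz` are invented for the example. Note the assumption: this writes plain gzip, not bgzip, so it will not be tabix-indexable; for that you would use the bgzip and tabix tools themselves.

```python
import gzip
import os
import tempfile

# A tiny, made-up VCF-like file: two header lines plus 1000 records.
vcf_text = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n" + "".join(
    f"chr1\t{pos}\t.\tA\tG\n" for pos in range(1, 1001)
)
raw_size = len(vcf_text.encode())

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "example.vcf.gz")

    # Write compressed text directly; no uncompressed copy hits the disk.
    with gzip.open(path, "wt") as handle:
        handle.write(vcf_text)

    compressed_size = os.path.getsize(path)

    # Stream the records back line by line without decompressing to disk.
    with gzip.open(path, "rt") as handle:
        n_records = sum(1 for line in handle if not line.startswith("#"))

print(n_records, raw_size, compressed_size)
```

The reading loop is the part worth copying: most tools, and most of your own scripts, can work on the compressed file directly, so the uncompressed version never needs to exist.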
9. Parallelize your code. Oh man, this is a can of worms. On the one hand, much of bioinformatics is embarrassingly parallelizable. That’s the good news. The bad news is that threaded/multiprocessed code is harder to debug and maintain. This should be the last path you go down, after you’ve optimized the heck out of your code. Don’t parallelize what you can optimize – but use parallelization to overcome resource limitations. And only when you can’t access the resources in any other way. (If you work with a cluster, though, this may be a quick and dirty way to get more resources…)
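Here is the kind of thing I mean by optimizing first, as a hedged sketch. Both functions and the variant IDs are hypothetical; the only change between them is the data structure used for lookups, which turns a quadratic scan into a roughly linear one without a single extra core.

```python
def shared_ids_slow(a, b):
    """Membership test against a list scans b once per element of a."""
    return [x for x in a if x in b]


def shared_ids_fast(a, b):
    """Same result, but lookups go through a set built once up front."""
    b_set = set(b)
    return [x for x in a if x in b_set]


# Made-up variant IDs: every 2nd and every 3rd number below 3000.
variants_a = [f"rs{i}" for i in range(0, 3000, 2)]
variants_b = [f"rs{i}" for i in range(0, 3000, 3)]

slow = shared_ids_slow(variants_a, variants_b)
fast = shared_ids_fast(variants_a, variants_b)

# Identical answers; only the amount of work differs.
print(slow == fast, len(fast))
```

If the fast version is still too slow on real data, that is the point where splitting the work across processes or cluster nodes starts to earn its maintenance cost.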
10. Clean up and back up. This was just a repeat of earlier points, so let’s talk about networking. The best way to keep yourself current is to listen to what others have to say. That means making time to go to conferences, reading papers, blogs or even Twitter. Talk to other bioinformaticians because they’ll always have new ideas, and it’s far too easy to get into a routine where you’re not exposing yourself to whatever is new and exciting.
11. OOP: Inheritance, Encapsulation, Polymorphism. Actually, on this point, I completely agree. Understanding object oriented programming takes you from being able to write scripts to being able to write a program. A subtle distinction, but it will broaden your horizons in so many ways, of which the most important is clearly code re-use. And reusing your existing code means you start developing a toolkit instead of making everything a one-off.
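A toy sketch of those three ideas in Python, with class names (`Sequence`, `DNA`, `RNA`) that I’ve invented for illustration rather than taken from any real library:

```python
class Sequence:
    """Base class: validates and stores residues, shares common behaviour."""

    alphabet = ""
    _table = {}

    def __init__(self, residues):
        residues = residues.upper()
        invalid = set(residues) - set(self.alphabet)
        if invalid:
            raise ValueError(f"invalid characters: {sorted(invalid)}")
        self._residues = residues  # encapsulation: internal storage

    def __len__(self):
        return len(self._residues)

    def __str__(self):
        return self._residues

    def complement(self):
        # Returns an object of the same subclass it was called on.
        return type(self)(self._residues.translate(self._table))


class DNA(Sequence):  # inheritance: only the differences are written
    alphabet = "ACGT"
    _table = str.maketrans("ACGT", "TGCA")


class RNA(Sequence):
    alphabet = "ACGU"
    _table = str.maketrans("ACGU", "UGCA")


# Polymorphism: one loop, each object uses its own complement rules.
complements = [str(s.complement()) for s in (DNA("acgt"), RNA("acgu"))]
print(complements)
```

The validation, length and complement logic is written once in the base class and reused by both subclasses, which is the toolkit-building effect in miniature.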
12. Save the URL of your references. Again, great start, but don’t just save the URL of your references. Make notes on everything. Whatever you find useful or inspiring, make a note in your lab book. Wait, you think bioinformaticians don’t have lab books? If that’s true, it’s only because you’ve moved on to something else that keeps a permanent record, like version control for your code, or electronic notebooks for your commands. Make sure everything you do is documented.
13. Keep Learning. YES! This! If you find yourself treading water as a bioinformatician, you’re probably not far from sinking. Neither programming nor biology ever really stands still – there’s always something new that you should get to know. Keeping up with both fields is tough, but absolutely necessary.
14. Give back what you learn. Again, gotta agree here. There are lots of ways to engage the community: share your code, share your experience, share your opinions, share your love of science… but get out and share it somehow.
15. Stand up on occasion. Ok, I’ll go with this too. The sitting/standing desks are fantastic, and definitely worth the money, if you can get one. Bioinformaticians spend way too much time sitting, and you shouldn’t neglect your health. Or your family, actually. Don’t forget to work hard, and play hard.