Bioinformatics Post-doc position in Hawaii

A friend of mine asked me to distribute this.  It’s been a while since I’ve posted anything bioinformatics related, but if you’re looking for a bioinformatics post doc position in a beautiful location, this one sounds pretty good.  Yes, I used my danish keyboard to obfuscate the email address… shouldn’t be too hard to figure out, for those who aspire to be post docs. (-;

——–

Garmire Group (starting 09/01/2012 or later)

University of Hawaii Cancer Research Center

Job description: Located on the beautiful shores of Honolulu, Hawaii, overlooking the Pacific Ocean, the University of Hawaii Cancer Center (UHCC) is one of only 66 research organizations in the country designated by the National Cancer Institute. Its mission is to focus on the key cancers that impact the multi-ethnic population of Hawaii, as well as populations worldwide. Home to some of the nation’s leading cancer experts, UHCC is ambitiously recruiting new talent to work collaboratively towards international excellence: developing new strategies for discovery, driving findings into clinical practice and delivering the optimal outcome when a patient is faced with the prospect of treatment and recovery.

Projects available in the bioinformatics/genomics group include, but are not limited to: (1) integrative analysis of next-generation sequencing and other data to decipher the regulatory role of microRNAs genome-wide; (2) microRNA biomarker discovery in various cancers using next-generation sequencing, microarray and survival data; (3) GWAS of DNA methylation under different nutritional statuses; (4) characterization and comparative genomics of long intergenic non-coding RNAs using next-generation sequencing data; and (5) development of sensitive structural variation detection algorithms for targeted deep sequencing of cancer samples.

Requirements:

  • PhD degree in bioinformatics, bioengineering, biostatistics, (bio)physics, electrical engineering, or computer science
  • Strong independent research capabilities, with high-profile, first-author publications
  • Experience with high-throughput data processing (next-generation sequencing, microarrays, etc.)
  • Proficiency in a Linux/Unix environment
  • Advanced experience in two or more programming languages: Perl, R, MATLAB, C, C++, Java
  • Familiarity with machine learning and data mining methods
  • Experience collaborating with experimental biologists
  • Experience with high-performance computing is a plus

Salaries are competitive, depending on qualifications and experience. If interested, please send your CV and the contact information of three references to lgarmire åt gmail døt com

Where’s the collaboration?

I had another topic queued up this morning, but an email from my sister-in-law reminded me of a more pressing beef: Lack of collaboration in the sciences. And, of course, I have no statistics to back this up, so I’m going to put this out there and see if anyone has anything to comment on the topic.

My contention is that the current methods for funding scientists are the culprit behind less efficient science, mixed with a healthy dose of zero-sum-game thinking.

First, my biggest pet peeve is that scientists – and bioinformaticians in particular – spend a lot of time reinventing the wheel.  How many SNP callers are currently available?  How many ChIP-Seq packages? How many aligners?  And, more importantly, how can you tell one from the other?  (How many of the hundreds of SNP callers have you actually used?)

It’s a pretty annoying aspect of bioinformatics that people seem to feel the need to start from scratch on a new project every time they say “I could tweak a parameter in this alignment algorithm…”  and then off they go, writing aligner #23,483,337 from scratch instead of modifying an existing aligner.  At some point, we’ll have more aligners than genomes!  (Ok, that’s shameless hyperbole.)

But the point stands.  Bioinformaticians create a plethora of software packages that solve problems which are not entirely new.  While I’m not saying that bioinformaticians are working on solved problems, I am asserting that the creation of novel software packages is an inefficient way to tackle problems that someone else has already invested time and money into building software for. But I’ll come back to that in a minute.

But why is the default behavior to write your own package instead of building on top of an existing one?  Well, that’s clear: Publications.  In science, the method of determining your progress is how many journal publications you have, skewed by some “impact factor” for how impressive the name of the journal is.  The problem is that this is a terrible metric to judge progress and contribution.  Solving a difficult problem in an existing piece of software doesn’t merit a publication, but wasting 4 months to rewrite a piece of software DOES.

The science community in general, and the funding community more specifically, will reward you for doing wasteful work instead of focusing your energies where they’re needed. This tends to squash software collaborations before they can take off, simply by encouraging a proliferation of useless software that is rewarded because it’s novel.

There are examples of bioinformatics packages where collaboration is a bit more encouraged – and those provide models for more efficient ways of doing research.  For instance, in the molecular dynamics community, CHARMM and AMBER are the two software frameworks around which most people have gathered. Grad students don’t start their degrees by being told to rewrite one or the other package; instead they’re told to learn one and then add modules to it.  Eventually the modules are released along with a publication describing the model.  (Or left to rot on a dingy hard drive somewhere if they’re not useful.)  Publications come from the work done and the algorithm modifications being explained.  That, to me, seems like a better model – and it means everyone doesn’t start from scratch.

If you’re wondering where I’m going with this, it’s not towards the Microsoft model where everyone does bioinformatics in Excel, using Microsoft generated code.

Instead, I’d like to propose a coordinated bioinformatics code-base.  Not a single package, but a unified set of hooks.  Imagine one code base, where you could write a module and add it to a big git hub of bioinformatics code – and re-use a common (well debugged) core set of functions that handle many of the common pieces.  You could swap out aligner implementations and have modular common output formats.  You could build a ChIP-Seq engine, and use modular functions for FDR calculations, replacing them as needed.  Imagine you could collaborate on code design with someone else – and when you’re done, you get a proper paper on the algorithm, not an application note announcing yet another package.
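To make that a little more concrete, here’s a rough sketch of what I mean by a common set of hooks.  Every name in it is made up (and it’s in Python only because that’s quick to sketch in): a shared core defines the interfaces, individual modules implement them, and an aligner or an FDR method can be swapped without touching the rest of the pipeline.

```python
# A toy sketch of the "common hooks" idea. All names here are hypothetical;
# this is not an existing package, just an illustration of swappable modules
# built against a small, shared core.
from abc import ABC, abstractmethod


class Aligner(ABC):
    """Interface every aligner module would implement."""

    @abstractmethod
    def align(self, reads_path: str, reference_path: str) -> str:
        """Align reads to a reference; return the path of a common-format (e.g. BAM) output."""


class FDRCalculator(ABC):
    """Interface for swappable FDR methods, e.g. inside a ChIP-Seq engine."""

    @abstractmethod
    def fdr(self, pvalues: list) -> list:
        """Return a q-value for each p-value."""


class BenjaminiHochberg(FDRCalculator):
    """One interchangeable FDR implementation (the standard B-H step-up procedure)."""

    def fdr(self, pvalues):
        n = len(pvalues)
        order = sorted(range(n), key=lambda i: pvalues[i])  # indices sorted by ascending p-value
        qvalues = [0.0] * n
        running_min = 1.0
        for rank, idx in reversed(list(enumerate(order, start=1))):
            running_min = min(running_min, pvalues[idx] * n / rank)
            qvalues[idx] = running_min
        return qvalues


class ChipSeqEngine:
    """A pipeline written only against the interfaces, so any module drops in."""

    def __init__(self, aligner: Aligner, peak_caller, fdr_method: FDRCalculator):
        self.aligner = aligner
        self.peak_caller = peak_caller  # another pluggable module, elided here
        self.fdr_method = fdr_method

    def run(self, reads_path: str, reference_path: str):
        alignments = self.aligner.align(reads_path, reference_path)
        pvalues = self.peak_caller(alignments)
        qvalues = self.fdr_method.fdr(pvalues)
        return [q for q in qvalues if q < 0.05]


# The FDR module alone is already usable and testable on its own:
print(BenjaminiHochberg().fdr([0.001, 0.01, 0.02, 0.8]))
```

The point isn’t the details of this particular sketch; it’s that the next person who thinks “I could tweak a parameter in this alignment algorithm” writes one new module instead of aligner #23,483,338.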

(We have been better in the past couple years with tool sets like SAMTools, but that deals with a single common file format.  Imagine if that also allowed for much bigger projects like providing core functions for RNA-Seq or CNV analysis…  but I digress.)

Even better, if we all developed around a single set of common hooks, you can imagine that, at the end of the day (once you’ve submitted your repository to the main trunk), someone like the Galaxy team would simply vacuum up your modules and instantly make your code available to every bioinformatician and biologist out there.  Instant usability!

While this model of bioinformatics development would take a small team of core maintainers for the common core and hooks, much the same way Linux has Linus Torvalds working on the kernel, it would also cut down severely on code duplication, bugs in bioinformatics code and the plethora of software packages that never get used.

I don’t think this is an unachievable goal, either for the DIY bioinformatics community, the Open Source bioinformatics community or the academic bioinformatics community.  Indeed, if all three of those decided to work together, it could be a very powerful movement.  What’s more, corporate bioinformatics could be a strong player in it, providing support and development for users, much the way corporate Linux players have done for the past two decades.

What is needed, however, is buy-in from some influential people, and some influential labs.  Putting aside their own home grown software and investing in a common core is probably a challenging concept, but it could be done – and the rewards would be dramatic.

Finally, coming back to the funding issue.  Agencies funding bioinformatics work would also save a lot of money by investing in this type of framework.  It would ensure that more time is spent on useful coding, that publications do more to describe algorithms, and that higher-quality code is produced at the end of the day.  The big difference is that they’d have to start accepting that bioinformatics papers shouldn’t be about “new software” being available, but about “new statistics”, “new algorithms” and “new methods” – which may require a paradigm change in the way we evaluate bioinformatics funding.

Anyhow, I can always dream.

Notes: Yes, there are software frameworks out there that could be used to get the ball rolling.  I know Galaxy does have some fantastic tools, but (if I’m not mistaken) it doesn’t provide a common framework for coding – only for interacting with the software.  I’m also aware that CHARMM and AMBER have problems – mainly because they were developed by competing labs that failed to be entirely inclusive of the community, or to invest substantially in maintaining the infrastructure in a clean way.  Finally, yes, the licensing of this code would determine the extent of corporate participation, but the GPL provides at least one successful example of this working.

“bioinformatics is in decline”

Actually, the original post said “Is bioinformatics on decline?”  That lack of consideration for English grammar might be a bit of foreshadowing for the flavour of the arguments that I’ll relay below.  (Granted, I would never ignore a post solely because of bad grammar, and the author is clearly not a native English speaker, so let’s not dwell on that point.)

Let me summarize the original post:

  • A link to a post that indicates the author is upset that some people have to ask what fields are encompassed by bioinformatics.
  • A link to a blog post that says “Less people are searching for the word bioinformatics using google”.
  • Some ambiguous claim that this applies to India, which is later clarified to mean “northern India”, which is never explained.
  • The author, by the way, claims to be a Search Engine Optimization (SEO) “expert” on his blog.

The last point is a great indication to me of the quality of the information you’re about to hear from him, in my opinion.  Anyone who claims to be interested in search engine optimization is all about hits, not about the quality of the information disseminated.  That’s a simple guarantee – and I’m certain that Mr. Brij has gotten a lot of hits on his web page by making what is not a controversial statement, but an obtuse one.

In essence, I believe this whole claim is likely more about driving traffic to his blog than making a valid contribution to a discussion.  I regret having visited it already, and I certainly won’t be linking to it.

In any case, let’s deal with his first two points.  The first is that “People are asking what bioinformatics is about.”  To me, that indicates that more people are interested in the field, rather than fewer.  If he never had to answer the question about bioinformatics, then I might be concerned that he’d either saturated the world with people knowledgeable about the field, or that no one new was hearing about it.  Clearly the first is impossible.  (So long as I have family members, I can guarantee first-hand knowledge that there are people who are clueless about what bioinformatics really is.)

In reality,  this is just anecdotal evidence that the author is relaying.  We’re better off ignoring it and assuming that the conclusions drawn by Mr. Brij are just inaccurate extrapolations from bad data.

Let’s move on to the second claim that Mr. Brij makes about the decline of bioinformatics on google trends.  I’m going to ignore the wonderful comment someone made about how all searches on google trends have to be considered in relation to the growing number of searches for “porn”.  I don’t know if it’s true or not, though it seems plausible, but it’s perhaps not the best refutation for Mr. Brij’s assertion that bioinformatics is declining.

Instead, let’s look at a different set of words: {bioinformatics, biotech, pharma and antibody}. This gives us a control experiment, as there are magazines and entire industries devoted to watching biotech and pharma for trends.  If either of those were in steep decline, it would undoubtedly be all over the front pages of every industry publication.

Guess what?  They all show exactly the same downward trend!  (Even down to the little dip over the North American winter holidays.  I’d definitely be interested in hearing more about that phenomenon!)  Does that mean that the pharma and biotech industries are going to be extinct in the near future?  Not so much – it’s simply a function of the overall traffic on google.
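If you’d like to check this for yourself, the comparison is easy to reproduce.  Here’s one way to do it with the unofficial pytrends package – this is just a sketch of the comparison, not how the numbers above were generated:

```python
# Reproduce the term comparison above with pytrends, an unofficial Google
# Trends client (pip install pytrends). Trends numbers are always relative
# to total search volume, which is exactly the point being made here.
from pytrends.request import TrendReq

terms = ["bioinformatics", "biotech", "pharma", "antibody"]

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(terms, timeframe="all")
trend = pytrends.interest_over_time()  # DataFrame: one column per term, plus an 'isPartial' flag

# Scale each term to its own peak so the shapes of the curves can be compared directly.
relative = trend[terms] / trend[terms].max()
print(relative.head(3))
print(relative.tail(3))
```

If all of the curves sag together, the sag is telling you about the denominator (total google traffic), not about any one field.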

I surmise that in order to predict the decline of something on the web, you’ll need much more useful information than the proportion of google searches that include a specific term.  People haven’t stopped using antibodies over the past decade and bioinformatics sure as heck hasn’t gotten any less popular.  Frankly, I doubt most bioinformaticians spend their time googling for “bioinformatics” anyhow, which makes this a pretty moot point.  We all have better things to do.

The third point from the original author is that this somehow relates specifically to bioinformatics in India.  I’m not even going to deal with this point other than to say that the data presented doesn’t support this claim.  No statistics are presented on Mr. Brij’s blog and the excerpt from an excel spreadsheet that is given only demonstrates that there is fluctuation in the frequency of the search terms by month.  It’s not at all enlightening.

So, if I may pass along one piece of advice to the “SEO expert” Mr. Brij: Instead of trying to get more traffic to your site by placing links to your blog next to poorly phrased questions on popular web sites, I suggest you try to produce quality analyses and insightful observations.  That will drive up the number of hits you get in the long term far more than trying to suck people in with ridiculous claims.

In the meantime, I won’t be visiting your page again.

A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data?

Before you read my post, you might want to zip over to Complete Genomics’ blog, where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?”  It’s an interesting perspective in which he suggests that, as the article’s title might indicate, we should keep a rich variant file as the only trace of a sequencing run.

I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template.  As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter.)

I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space and could be regenerated at will from frozen samples of DNA, if needed.  As a scientist, I’ve very rarely had to go back into the read-space files and look at the pre-variant-calling data – only a handful of times in the past two years.  As such, it really doesn’t make much sense to store alignments, raw reads or other data. If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.

I’d like to bring out a real-world analogy to summarize this argument. I had recently been considering a move, only to discover that storage in Vancouver is not cheap.  It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me.  If the total value of those possessions is only $5,000, storing them for more than about 36 months means that (if I were to move) it would have been cheaper to sell it all and buy a new set when I come back – at those rates, the storage bill catches up to the replacement cost somewhere between 33 and 42 months.

Where the analogy comes into play quite elegantly is if I have some interest in those particular items.  My wife, for instance, is quite attached to our dining room set.  If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price.  It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only thing that comes into play.

While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelihood that we will return to the data for verification purposes.  You can’t always recreate something of interest in a data set by resequencing.

That is a poor reason to keep the data around, most of the time.  We rarely find things of interest that couldn’t be recreated in a specific read set when we’re doing genome-wide analysis. Since most of the analysis we do uses only the variants, and we frequently verify our findings by other means, the cost/value argument is probably strongly in favour of throwing away the raw reads and only storing the variants.  Considering my recent project on the storage of variants (all 3 billion of the variants I’ve collected), people have probably heard me make the same arguments before.

But let’s not stop here. There is much more to this question than meets the eye. If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds.  Although the question as stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”

Here, I would actually contend that the answer is no, depending on the platform.  Let me give you examples of the data I feel we currently do a poor job of extracting.

  • Structural variations:  My experience with structural variations is that no two SV callers give you the same information or make the same calls.  They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg.  (Same goes for structural rearrangements in cancers, etc.)
  • Phasing information:  Most SNP callers aren’t giving phasing information, and some aren’t capable of it.  However, those details could be teased from the raw data files (depending on the platform).  We’re just not capturing it efficiently.
  • Exon/Gene expression:  This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq.  Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information that is obviously not captured in variant files.  (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically stored in rich variant files.)
  • Variant Quality information: we may have solved the base quality problems that plagued early data sets, but let me say that variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between variant callers.  There’s really no substitute for comparing other than to re-run the data set with the same tools.
  • The variants themselves!  Have you ever compared the variants observed by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown, but a sketch of how to run the comparison yourself follows below.)
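If you’ve never run that comparison yourself, it’s worth doing once.  Here’s a rough sketch using nothing but the standard library; the file names are hypothetical, and it assumes both callers were run on the same sample against the same reference:

```python
# A rough sketch of the concordance check described above. The two file names
# are hypothetical; both VCFs are assumed to come from the same sample and
# the same reference. No VCF parser dependency, just the standard library.
import gzip


def load_variants(path):
    """Return the set of (chrom, pos, ref, alt) tuples from a VCF or VCF.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    variants = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            for allele in alt.split(","):  # treat multi-allelic sites as separate calls
                variants.add((chrom, pos, ref, allele))
    return variants


caller_a = load_variants("sample1.caller_a.vcf")
caller_b = load_variants("sample1.caller_b.vcf")

shared = caller_a & caller_b
union = caller_a | caller_b
print(f"caller A only: {len(caller_a - caller_b)}")
print(f"caller B only: {len(caller_b - caller_a)}")
print(f"shared:        {len(shared)} ({100.0 * len(shared) / len(union):.1f}% of the union)")
```

Matching on the full (chrom, pos, ref, alt) tuple is the strictest comparison; loosening the match to position only will make the numbers look a little friendlier.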

Even dealing only with the last item, it should be obvious:  if we can’t get two SNP callers to produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data, because we should always be double checking interesting findings with (at least) a second set of software.

For me, the answer is clear:  If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete.  I just don’t think we’ve hit those milestones yet.

Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads.  You’ll never return to them because you already have all the answers you want.  If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.

So, my answer to the question “Are we ready to stop storing raw reads?” is easy:  that depends on what you think you need from a data set and whether you think you’re already an expert at extracting it.  Personally, I think we’ve barely scratched the surface of what information we can get out of genomic and transcriptomic data; we just don’t know what it is we’re missing yet.

Completely off topic, but related in concept: I’m spending my morning looking for Scalable Vector Graphics (.svg) files for many jpg and png files I’d created along the course of my studies.  Unfortunately, jpg and png are raster formats (and jpg is lossy to boot), so they don’t reproduce as nicely as vector graphics in the Portable Document Format (PDF) export process.  Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png format, I’m now at the point where I might have to recreate them to properly export the files again in a lossless format for my thesis.  If I’d just stored them (as the cost is negligible), I wouldn’t be in this bind…  meh.


Nature Comment : The case for locus-specific databases

There’s an interesting comment available in Nature today (EDIT: it came out last month, though I only found it today.) Unfortunately, it’s by subscription only, but let me save you the hassle of downloading it, if you don’t already have a subscription.  It’s not what I thought it was.

The entire piece fails to make the case for locus-specific databases, but instead conflates locus-specific with “high-resolution”, and then proceeds to tell us why we need high resolution data.  The argument can roughly be summarized as:

  • OMIM and databases like it are great, but don’t list all known variations
  • Next-gen sequencing gives us the ability to see the genome in high resolution
  • You can only get high-resolution data by managing data in a locus-specific manner
  • Therefore, we should support locus-specific databases

Unfortunately, point number three is actually wrong.  It’s just that our public databases haven’t yet transitioned to the high resolution format.  (ie, we have an internal database that stores data in a genome-wide manner at high resolution…  the data is, alas, not public.)

Thus, on that premise, I don’t think we should be supporting locus-specific databases specifically – indeed, I would say that the support they need is to become amalgamated into a single genome-wide database at high resolution.

You wouldn’t expect major gains in understanding of car mechanics if you, by analogy, insisted that all parts should be studied independently at high resolution.  Sure you might improve your understanding of each part, and how it works alone, but the real gains come from understanding the whole system.  You might not actually need certain parts, and sometimes you need to understand how two parts work together.  It’s only by studying the whole system that you begin to see the big picture.

IMHO, locus-specific databases are blinders that we adopt in the name of needing higher resolution, which is more of a comment on the current state of biology.  In fact, the argument can really be made that we don’t need locus-specific databases, we need better bioinformatics!

Dueling Databases of Human Variation

When I got in to work this morning, I was greeted by an email from 23andMe’s PR company, saying they have “built one of the world’s largest databases of individual genetic information.”   Normally, I wouldn’t even bat an eye at a claim like that.  I’m pretty sure it is a big database of variation…  but I thought I should throw down the gauntlet and give 23andMe a run for their money.  (-:

The timing for it couldn’t be better for me.  My own database actually ran out of auto-increment IDs this week, as we surpassed 2^31 SNPs entered into the db and had to upgrade the key field from int to bigint. (Some variant calls have been deleted and replaced as variant callers have improved, so we actually have only 1.2 billion variations recorded against the hg18 version of the human genome – and a few hundred million more than that for hg19.)  So, I thought I might have a bit of a claim to having one of the largest databases of human variation as well.  Of course, comparing databases really depends on the metric being used, but hey, there’s some academic value in trying anyhow.
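(For anyone curious about the mechanics: a signed INT auto-increment key tops out at 2^31 - 1 = 2,147,483,647 IDs, and since deleted-and-replaced calls still consume IDs, a table with “only” 1.2 billion live rows can still hit that ceiling.  The fix itself is a one-statement schema change.  A minimal sketch, assuming a MySQL-style database and made-up table and column names:)

```python
# A minimal sketch, assuming a MySQL-style database; the table and column
# names are made up. Widening the key from INT (signed 32-bit) to BIGINT
# raises the auto-increment ceiling from 2**31 - 1 to 2**63 - 1.
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="localhost", user="variant_user",
                               password="********", database="variant_db")
cursor = conn.cursor()
# Note: on a table with billions of rows this ALTER rewrites the whole table,
# so expect it to take a while (and plan the downtime accordingly).
cursor.execute(
    "ALTER TABLE snp_calls "
    "MODIFY id BIGINT NOT NULL AUTO_INCREMENT"
)
conn.close()
```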

In the first corner, my database stores information from 2200+ samples (cancer and non-cancer tissue), genome-wide (or transcriptome-wide, depending on the source of the information), giving us a wide sampling of data, including variations unique to individuals as well as common polymorphisms.  In the other corner, 23andMe has sampled a much greater number of individuals (100,000) using a SNP chip, meaning that they’re only able to sample a small fraction of the variation in an individual – roughly three hundredths of a percent of the total amount of DNA in each individual.

(According to this page, they look at only 1 million possible SNPs, instead of the 3 Billion bases at which single nucleotide variations can be found – although arguments can be made about the importance of that specific fraction of a percent.)

The nature of the data being stored is pretty important, however.  For many studies, the number of people sampled has a greater impact on the statistics than the number of sites studied and, since those are mainly the kinds of studies 23andMe is doing, clearly their database is more useful in that regard.  In contrast, my database stores data from both cancer and non-cancer samples, which allows us to make sense of variations observed in specific types of cancers – and because cancer-derived variations are less predictable (ie, not in the same 1M SNPs each time) than run-of-the-mill-standard-human-variation-type SNPs, the same technology 23andMe used would have been entirely inappropriate for the cancer research we do.

Unfortunately, that means comparing the two databases is completely impossible – they have different purposes, different data and probably different designs.  They have a database of 100k individuals, covering 1 million sites, whereas my database has 2k individuals, covering closer to 3 billion base pairs.  So yeah, apples and oranges.

(In practice, however, we don’t see variations at all 3 Billion base pairs, so that metric is somewhat skewed itself.  The number is closer to 100 Million bp –  a fraction of the genome nearly 100 times larger than what 23andMe is actually sampling.)

But, I’d still be interested in knowing the absolute number of variations they’ve observed…  a great prize upon which we could hold this epic battle of “largest database of human variations.”  At best, 23andMe’s database holds 10^11 variations, (1×10^6 SNPs x 1×10^5 people), if every single variant was found in every single person – a rather unlikely case.  With my database currently  at 1.2×10^9 variations, I think we’ve got some pretty even odds here.

Really, despite the joking about comparing database sizes, the real deal would be the fantastic opportunity to learn something interesting by merging the two databases, which could teach us something both about cancer and about the frequencies of variations in the human population.

Alas, that is pretty much certain to never happen.  I doubt 23andMe will make their database public – and our organization never will either.  Beyond the ethical issues of making that type of information public, there are pretty good reasons why this data can only be shared with collaborators – and in measured doses at that.  That’s another topic for another day, which I won’t go into here.

For now, 23andMe and I will just have to settle for both having “one of the world’s largest databases of individual genetic information.”  The battle royale for the title will have to wait for another day… and who knows what other behemoths are lurking in other research labs around the world.

On the other hand, the irony of a graduate student challenging 23andMe for the title of largest database of human variation really does make my day. (=

[Note: I should mention that when I say that I have a database of human variation, the database was my creation but the data belongs to the Genome Sciences Centre – and credit should be given to all of those who did the biology and bench work, performed the sequencing, ran the bioinformatics pipelines and assisted in populating the database.]

CPHx: Ion Torrent 316 chip is due out in Q2, 2011

So, I put my foot in my mouth, figuratively, today.  I wrote that the Ion Torrent 316 chip would be out in Q4 in an earlier post.  Oops.  That was totally wrong.  Let me set the record straight that the Ion Torrent 316 Chip is due out in Q2, 2011.  I’ve even been contacted by a rep for Ion Torrent with photographic proof of it.

So, in case anyone was wondering, whatever I wrote about the 316 chip, it was wrong.  It will be out shortly, and I’ve corrected the article to point here and reflect the correct date.

My humble apologies. I’ll do my best to ensure I don’t make any more mistakes.

CPHx: Roald Forsberg, Sponsored by CLC bio.

CLC Genomics Gateway

Roald Forsberg

CLC’s take on the next generation of genome informatics tools.

At the 10th anniversary of the human genome project, we now have a common and comprehensible coordinate system – that’s a real, tangible effect of the project.  So much of the work we now do is based upon this first human genome coordinate set.  There’s a HUGE list of things we can do because of it.

However, if you look at pre-NGS work, all of the efforts were focussed on making sense of the data and annotating it – and sometimes looking at it.  It was all centralized, curated, etc.

In the post-NGS era, sequencing has been democratized.  We have access to big datasets, changing our focus to analysis, mapping and modeling. Our data resources are now many, varied, decentralized and non-curated.  A complete paradigm shift in bioinformatics focus. [my words.]

CLC has built a platform for doing this type of work: server based, client based, scalable, etc.  It supports the majority of genomics workflows (de novo, re-seq, ChIP-Seq, RNA-Seq, smRNA-Seq, tag-seq).  We lack the tools for pushing further down into integrative biology – pathways, diagnosis, systems biology… there simply ARE no tools for doing this yet.

Genome browsing is no longer enough. [Thank you, I’ve been waiting for someone to say this!]

Perl IS NOT A FRAMEWORK.  [Hallelujah!]

You will not get the medical community to use this technology with perl and spreadsheets.

It is time to move downstream: the CLC Genomics Gateway.

  • an integrated framework to visualize, analyze and combine data from the same reference genome
  • seamlessly add your own experiments in an integrated manner
  • foundation for systems biology, functional genomics, etc.

Technical details

  • One data object to store and/or point to data sources
  • lazy fetching [toy sketch below]
  • federation of data sources
  • a graphical user interface
  • completely integrated with other bioinformatics tools
  • part of the CLC systems developer kit, so developers can add components and reuse existing ones without having to maintain the GUI.
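[A quick aside for anyone who hasn’t met these terms: “lazy fetching” and “federation” together just mean a single data object that holds nothing until you ask for a region, and then pulls that region from whichever backing source happens to have it.  A toy sketch of the concept, with made-up names and nothing to do with CLC’s actual code:]

```python
# A toy illustration of lazy fetching from federated sources. All names are
# made up; this illustrates the concept, not CLC's implementation.
from abc import ABC, abstractmethod


class AnnotationSource(ABC):
    """Anything that can answer 'which features overlap chrom:start-end?'."""

    @abstractmethod
    def fetch(self, chrom, start, end):
        ...


class FlatFileSource(AnnotationSource):
    """Backed by a local tab-delimited file: chrom, start, end, name."""

    def __init__(self, path):
        self.path = path  # nothing is read here - that's the lazy part

    def fetch(self, chrom, start, end):
        hits = []
        with open(self.path) as handle:
            for line in handle:
                c, s, e, name = line.rstrip("\n").split("\t")[:4]
                if c == chrom and int(s) < end and int(e) > start:
                    hits.append((c, int(s), int(e), name))
        return hits


class FederatedTrack:
    """One 'data object' that points at many sources and merges their answers on demand."""

    def __init__(self, sources):
        self.sources = sources

    def features(self, chrom, start, end):
        merged = []
        for source in self.sources:  # each source is queried only when a region is requested
            merged.extend(source.fetch(chrom, start, end))
        return sorted(merged, key=lambda feature: feature[1])
```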

Enables a wide variety of tools and integrations, so you can work with other tools like BLAST, primer design, etc.

Visualization tools:  Bring all of your tools into one place.  [looks clean, actually.  nice.]

Federation of data: can bring together sources like UCSC, Ensembl, your own data, flat files, etc.

Analysis types (beta): some tools for SNP comparison and annotation, filtering SNPs against federated data… basically, it brings together annotations from a variety of sources and puts them all in one place.

More analysis types to be added, including cancer biology, further advanced features.

Beta tool is out now, with new versions coming out soon.

Also, they’re hiring bioinformaticians.


CPHx: Mike Hubank, UCL Genomics, Inst. of Child Health – Genome-wide Transcriptional Modeling

Genome-wide Transcriptional Modeling

Mike Hubank, UCL Genomics, Inst. of Child Health

—-

Challenges in complex transcriptional analysis.

Interpretation beyond the “favorite genes/Top 10” style is less frequent.  Ask better questions:

  • What TF activities control the response?
  • Which genes are targets of which TF?
  • How does TF activity affect the expression pattern?
  • How do TF activities interact to shape the response?

Data driven mathematical modeling offers solutions.

[ooo.. actual math!  Can’t copy it, tho.]

The goal is to assign values to the parameters, including the rate of change of gene (transcript) concentration, the degradation rate, the activity of the transcription factor, etc.

[more equations..  rearrangements of the one before]

Everything you can measure goes on one side, everything you can’t on the other.  Discretisation is done by Lagrange interpolation.  Genome-wide transcriptional modeling (GWTM) generates production profiles for every transcript.
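[For anyone trying to follow the math from my notes alone: a linear activation model of this kind typically looks like dx_g(t)/dt = B_g + S_g * f(t) - D_g * x_g(t), where x_g(t) is the measured transcript concentration of gene g, B_g its basal production rate, S_g its sensitivity to the transcription factor, f(t) the unobserved TF activity and D_g the degradation rate.  That’s my reconstruction from the parameter list above, not a copy of the slide.  Rearranging puts the measurable terms (x_g and its rate of change) on one side and the unknown f(t) on the other, which is what the interpolation step is solving for.]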

Example: a DNA damage response model in an irradiated T-cell line.  Model-based screening of activity profiles identifies jointly regulated transcripts and groups them, with probabilities.

Validation done.  Knock out one gene, see if predictions make sense in taking out other genes as well.  “obviously they went quite well, otherwise I wouldn’t be here talking to you.”

GWTM explains a high proportion of the response, with 68% of upregulated genes correctly assigned.  Unexplained transcripts are the result of co-regulation or measurement error.  The aim is to eradicate error with sequencing-based measurement.

There is a problem with NGS data, however, when it gives “zeros”.  When you have a read count of zero, the gene may not actually be completely off, so it’s hard to model around that.  New correction tool – a Dirac-sigma truncated log-normal. [did I get that right? I’ve never heard of it.]

Conclusions: Dynamic transcriptional modelling can be used for:

  • Data driven deconstruction of complex responses
  • Ability to conduct complex “experiments” in silico (eg, vary TF activity or model parameters to make predictions.)
  • Generation of biologically meaningful parameter values (eg, correlation with apoptosis regulators.)
  • Linkage of functional modules
  • (Benefit of) Economy

Much of this is already available in R, the rest of it will be shortly.