#AGBTPH – Fowzan Alkuraya – It’s your variant, it’s your problem, and mine

Fowzan Alkuraya, Alfaisal University

We currently know a small number are benign, and a smaller number are pathogenic.  The idea is to drive it towards knowing every possible variant.  Even if we could classify every variant, it would be outdated shortly.  However, we can use phenotype, which keeps up with the gene pool – that way we can ask how the genotype translates to phenotype.  It’s not really that easy…

The formidable challenge of heterozygosity.

We are robust to heterozygous mutations, obviously.

Gene level challenge.  Is it dispensable?  Is there a non-disease phenotype?  Is there a recessive disease phenotype?

Variant level: Some we’ll never see because they’re embryonically lethal.  Some may never be clinically consequential.  Non-coding?  Truncating variants in dominant disease genes with no phenotype?

Fortunately, it’s all in the same species!  And, if we can show something is pathogenic, we can know that for next time.  Exploiting the special structure of the Saudi population to improve our understanding of the human genome.

  • High rates of consanguinity – endless source of homozygotes.
  • Large family size – great segregation power

Examples of discovery of novel disease genes.

A typical workflow: use predictive technologies, use frequency data, use model organisms, etc.  Use family data to identify how the variant exerts its effect.

At the end of the day, this data can be shared so that everyone can benefit from this knowledge.

In the second example, finding novel “lethal” genes.  Can’t do it statistically because it’s so rare.  Best hope is to observe biallelic variants in non-viable embryonic tissue.  Showed a case in which a homozygous variant was present in all non-viable embryos from a single family.  Able to do that without knowing anything about the biology of the gene.

What do they do with it?  They put it out so everyone can share in the knowledge.  You never know which family is going to be making life-altering decisions based on the variant.

Published it – turned out to be the most frequent mutation in fetal losses in the Saudi population.  Turned out to encode an important endothelial protein.  (Cerebral haemorrhages.)

Now in ClinVar.

Example where it’s hard to understand the mechanism of the disease, and an example where prediction tools aren’t able to get it right.

How many variants are we just missing because they’re in the dark matter of the genome?  variants in non-coding parts of the genome/variants in the coding part = ?

We don’t know either of these, so it’s a hard problem:  Homozygosity mapping to the rescue.  Challenge of non-coding mutations.

104 families with a recessive genotype that maps to a single locus.  101 of 104 were found to have genic mutations.  The vast majority of disease-causing mutations are in genes, then.

Good news: presumed non-genic mutations <3%.

Bad news: many others will be missed for other reasons.

Demonstrated this with a sample cohort (33 families)

Catalogue of biallelic LOF in well-phenotyped individuals.  Able to find several genes that have been erroneously linked to disease phenotypes.

[My paraphrasing: So, in the end, we should all be concerned about all of the variants, and getting them right.]

#AGBTPH – Hakon Hakonarson, CHP – Genomics-driven biomarker discovery and utilization: The precision medicine experience from CHOP

How they’ve leveraged their biobank into discoveries.

Novel gene editing technologies are coming – human examples are here as well.

CAG @ CHOP, founded in June 2006.  Recruit enough children that even rare conditions become common.  About 100,000 kids have been recruited, 70,000 of whom can be recontacted.  Almost 300,000 samples through collaborations in addition.

An early disease that was worked on, Neuroblastoma – usually found in advanced stages. Hard to treat.

Found some markers.  1% hereditary, 99% sporadic.

12% of sporadic cases had ALK mutations.  Existing drug for that, so was able to go straight to trial, which was successful and rapid.

Another project: Neurocognitive phenotyping. [Huge data collection effort, covering a very broad set of data gathering methods]  ADHD was a component of it.  Identified CNV in cohort, which clustered on Glutamate receptors in the brain (Elia, Glessner et al, 2011) Replicated  in 5 different cohorts. CNVRs overrepresented in cohort.

Have seen similar things in other neurocognitive diseases.

At the time, there was no drug for the mGluR pathway.  There was, however, a drug that had been indicated for another disease but didn’t make it to market.  Found up to 20% of patients have glutamate copy number variants.  They undertook new studies to demonstrate that the drug was useful for ADHD patients who have these mutations.  Ended up approved by the FDA, down to 12 years of age.  IRB in November 2014, completed by May 2015.  Efficacy was extremely robust in this preliminary setting: 80% of patients had improvement following the highest dose.

Expanded to include new mutations that influence mGluR signalling, then expanded further to genes that influence those.

Tier one response was much stronger, those with mutations in the expansion groups did not have as high responses.

Some overlap with children who had co-morbid autism (including 22q deletion).  Major improvement in social behaviour and language.

Started a separate trial for 22q11.2 deletion syndrome based on effects seen in earlier results.

Repurposing compounds that already have safety data makes for rapid drug trials.


#AGBTPH – Teri Manolio, NHGRI – Genomics and the Precision Medicine Initiative

Substitute speaker for Eric Green, who could not attend. “Sends his regrets… and me.”

Precision medicine initiative, announced by president in Jan 2015.  Foundation for something that will change the way we practice medicine.

  • Genomics: Clingen is one resource that will be huge help.
  • EHR have changed a lot in the last 12 years. (Paper replaced by banks of computers.)
  • Technologies, such as wearable devices. Sensors, for instance
  • Data science/Big Data is also transformative.
  • Participant Partnerships, patients become partners, not subjects.

PMI Cohort: one million volunteers, reflecting the makeup of the U.S., with a focus on underrepresented groups.  Longitudinal cohort.  (Anyone can volunteer, or join via selection processes.)

Reflect: People, health status, geography, data types.


  • Large and diverse,
  • support focus on underserved,
  • complementing existing cohorts, not duplicating.

Possible issue: Biasing towards Geeky people. [nice!]

Initial awards were made for pilot studies.  Developing brand, etc.

In July, $55M for Cohort Program Components.

Collaborate with Million Veteran program.

Start with the basic usual information, but will expand as the project grows.

Transformational approach to data access – data sharing with researchers and participants. Colleges, high school etc.  Industry, citizen science.

Will launch when ready and right – want to launch before current administration leaves office, but will happen “when it’s ready”.  Anticipate 3-4 years to reach one million participants.

Funnel of innovation being used: Exploration R&D -> Platform definition -> Advance definition -> Production -> Launch.  Also, Landing Zones: MVP, Goals and Stretch Goals.  Divided into areas that must be done.  [Basically, using industry practices for R&D on academic research?]

#AGBTPH – Howard Jacob, Hudson Alpha – Clinical sequencing for patients, adoptees and the health curious

Market segments: reference labs, sequencing technology companies, bioinformatics companies, data storage companies.

How do we get all this implemented into healthcare?

Why isn’t insurance paying?  Researchers are publishing conflicting information on many questions, ethics, costs, accuracy, etc.   NGS is not a validated test.

Rare disease is a huge problem

Lots of genes… lots of possible errors, therefore many possible combinations.  Diagnosis can be far off – 8 appointments,  7 years average.

How much of the genome should we test?  80% by ENCODE.  Exome is 1.5% of genome.  Which would you pick?

Panels are standard, but only useful relative to clinical phenotype.  Whole genome adds value over time.

Need WGS and bioinformatics to solve value of non-coding.  We need the data in the non-coding to make sense of it all.

3,000 genomes at St. Jude LIFE.  But how do we do this clinically?  Example: can you find genes for developmental delay?  376 families (primarily trios).  339 families done – just passed 100 diagnoses this week (102).  28% diagnosed.

Families not diagnosed are open to reanalysis…. can revisit the data over and over again.

Also part of Undiagnosed Diseases Network.  This is about patients.

Genetic testing is largely underused.  Policy is state by state – mainly because we’re still arguing over how accurate the data is.  Literature shows we’re not completely accurate; different labs are getting different results.  Exomes are being funded, but genomes aren’t.  Doesn’t make a lot of sense.

Picking on insurance companies: let’s start getting companies to pay for sequencing.

Is it really that inaccurate?  Lined up Baylor vs Hudson Alpha – not easy to do an apples-to-apples comparison.  Do they come up with the same thing?  There will, of course, be differences.  However, the analytical teams both came down to the same variants being diagnostic.

Reproducibility: It’s possible, requires new tests, still evolving.  More genomes -> More accuracy.

What data to return?

Have a lot of ethicists at Hudson Alpha – options are presented to parents: Primary, Child No-Rx, Adult Action and Adult No-Action.

Asked audiences: 31% of geneticist audiences say yes, they want it, compared to ~50% of lay people.  Not all that different.

Huge implications:  ethical, legal and social.

Some paediatric geneticists consider “diagnosis” as “actionable” because it prevents you from having to run from place to place.

The way you view the data influences how you interact with it.  Personal decisions/personal medicine.  Precision medicine is for physicians.

Many excellent examples of where genomic medicine would have been really helpful and either saved lives, saved money or prevented suffering.

ROI is impressive.

Average workup for patients at each new hospital on your way to diagnosis is $20,000.  If it takes 8 hospitals on average to get a diagnosis, that’s a huge cost.

WGS can be done once, and re-used over and over.

Healthcare is about taking averages. Dosing is based off of averages, is it always useful that way? No.

Rolling out the Insight Genome, being driven by utility.  What data will people use?  On average, very few variants will have a major effect at the population level.  Physicians make decisions every day with incomplete data.

How do we get the system to care?


Julie Segre – Microbial Genomics in a clinical setting. #AGBTPH

Two cases.

Genetic disorders and microbial disorders often interact.  Nearly all microbes can be uniquely identified by shotgun nucleic acid sequencing.

Topic 1.  Infectious diseases in hospitalized patients.  Sometimes can’t tell the kingdom, even.   Sample -> sequencing -> Bioinformatics ->  hopefully identifying agent.

Human genome is often considered the contamination – can’t physically extract it out.  Opening cells for fungi requires some harsh treatment.

SURPI bioinformatics pipeline used.  What do you get out, and is it even in your database?

Case 1: 3 hospitalizations over 4 months – 44 days in ICU, over 100 inconclusive tests.  Cured 2 weeks after NGS diagnosis with appropriate treatment.

Very clear hit found with Leptospira santarosai.  There had been travel to Puerto Rico, and Leptospira is a water-borne disease.  Used appropriate treatment, and the infection resolved.  (Tests were run that validated the diagnosis.)

CLIA validation of these methods is required.  It’s a step-by-step process that happens over a year.  [Appears to take nearly 2 years?  April 2015 to March 2017.]

Aside: nanopore sequencing may also be a hugely exciting development for this field because it’s so fast.

Topic 2: Using sequencing to inform healthcare-associated infections.

CRE – carbapenem-resistant Enterobacteriaceae.  We have no antibiotics left to fight these bacteria.  (Klebsiella pneumoniae.)

Patient 1 in June, but several patients in August.  Either patient 1 was unrelated or transmission occurred.

Sequencing happened: patient 1 urine sample (used as the reference genome).  3 variants in the throat isolate, 3 different SNPs in the lung.  Patients 2 and 3: identical to the throat sample from the first patient (one extra SNP in patient 3).

Patients 1 and 3 overlapped in the ICU.  Patients 3 and 2 overlapped in the ICU.

Patient 4 had variants matching the lung isolate(?), so a separate transmission.

This data showed that transmission was happening – ultimately, a transmission map was created with other patients.  It was ultimately clear how it was transmitted.  Helped to identify which avenues needed to be tracked down by cohorting patients.
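The SNP-counting logic behind that transmission map can be sketched in a few lines. The isolates and positions below are made up for illustration; this is not the actual NIH pipeline or data.

```python
# Illustrative sketch: infer plausible transmission links by counting
# SNP differences between isolates. All positions are hypothetical.

# Each isolate is the set of SNP positions called against the
# patient-1 urine sample, which is used as the reference genome.
isolates = {
    "pt1_urine":  set(),                      # the reference itself
    "pt1_throat": {101, 2045, 39000},         # 3 SNPs vs reference
    "pt2":        {101, 2045, 39000},         # identical to throat isolate
    "pt3":        {101, 2045, 39000, 51234},  # throat SNPs + 1 extra
}

def snp_distance(a, b):
    """Number of SNP positions present in one isolate but not the other."""
    return len(isolates[a] ^ isolates[b])

# Pairs separated by 0-1 SNPs are candidates for direct transmission.
names = list(isolates)
pairs = [(a, b, snp_distance(a, b))
         for i, a in enumerate(names)
         for b in names[i + 1:]]
for a, b, d in sorted(pairs, key=lambda p: p[2]):
    print(f"{a} <-> {b}: {d} SNPs")
```

With real outbreak data the same idea, plus ward-overlap records, is what lets you draw the transmission map.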

Resistance genes are generally on plasmids, so we need to be aware of the possibility of transmission of the plasmid to other organisms.

National Pathogen Reference Database – CDC, FDA and NIH.

If you have a reference, you can pretty much assemble anything.


Notes from NGS applications – Stephen Kingsmore, Fowzan Alkuraya, Hakon Hakonarson

I’m dumping my notes as I’ve done for other conferences – obvious mistakes are obviously my fault, and not those of the speakers.

Fowzan Alkuraya, Alfaisal University

Case 1 – 13 month old girl – developmental delay.
Unrelated parents
MRI brain atrophy
Karyotype: 45,X (non-mosaic Turner syndrome)

In past, would have said “atypical Turner Syndrome”
Now, that’s not good enough – we can find something else. “Atypical” should really not be used anymore – there’s probably more than one “lesion”

Exome sequencing:
ADRA2B – Arg222* -> homozygous truncating mutation.

Lesson: Don’t assume – there’s no excuse for “atypical” in the genomics era.

Case 2 – 4 year old suspected autism.
Non-contributory family history with a healthy brother – should raise a flag: autism is more common in boys than girls. Mendelian form of autism?
Documented cognitive impairment – otherwise normal.
All guidelines : molecular karyotype: de novo 300kb deletion on chr10.
Is it pathogenic?
Use the DECIPHER database of structural variants. ***** Why don’t we use this?
found a match.

Were conducting a study that included clinical genomics approach, and exome sequencing found:
Homozygous mutation in CC2D1A, skipping exon 6. Not in ExAC, but found in Saudi Arabians (1 in 500). Known association with intellectual disability.

Beware of founder mutations in different ethnic groups.
Exome sequencing in parallel with molecular karyotyping for neurodevelopmental disorders.

But, when do we stop? Do we always need Exome sequencing?

Case 3 – A (consanguineous) couple lost two children to severe, unexplained lactic acidosis.
First child died on 2nd day
Second child died within hours.

Normal electron transport chain. Sequencing of candidate genes was negative. Clinical exome sequencing: negative.

clinical Whole Genome sequencing: Negative.

Research-grade exome sequencing: found a splicing mutation in ECHS1, a known cause of acidosis.

Severe transcript reduction via NMD.

30-50% of cases in exome sequencing remain without diagnosis. Are we normally missing the mutation at the capture and sequencing stage, or at the interpretation stage?

Analyzed 33 cases with negative clinical exome/genome sequencing. Found the mutation in 29 cases.

In 18 cases, the gene was novel, or had become known within 6 months of the time of diagnostics – probably not reported.
In 11 cases, mutations were in known genes.

Clinical labs are probably not reporting these because of filtering and interpretation issues.

If you have a novel mutation, it’s likely to be missed by clinical sequencing.

Stephen Kingsmore, Rady Children’s Hospital:

2 cases:
First: at birth, acute liver failure,
surgically corrected;
spine defects, renal defects, surgically correctable.
Doing well until day 40, when he started to develop liver dysfunction. Diagnostic workup was unrevealing.
On day 55, Rady was brought in. Race against the clock.
Whole genome sequencing time cut to just 26 hours.

1. consent at time 0:00
sample transport
Dna isolation 1 hour
18 hour genome sequencing, completed at time 24:30

40× genome. 120,000,000,000 bases
2.8M bases
5.1M variants
1.3M variants after 1% frequency filter applied.
1.3k pathogenic or likely pathogenic.
2 variants that could cause the 341 conditions (below), both in the same gene: perforin 1.

– Very typical, but has to be done fast. FPGA informatics.
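The funnel above (millions of raw variants whittled down to a short pathogenic list) can be caricatured in a few lines. The variant records, field names and thresholds here are invented for illustration; this is not Rady’s actual pipeline.

```python
# Toy variant-filtering funnel: frequency filter, then classification
# filter. All records and thresholds below are hypothetical.
variants = [
    {"gene": "PRF1", "af": 0.0001, "classification": "pathogenic"},
    {"gene": "PRF1", "af": 0.004,  "classification": "likely_pathogenic"},
    {"gene": "APOE", "af": 0.20,   "classification": "benign"},
    {"gene": "TTN",  "af": 0.005,  "classification": "uncertain"},
]

# Step 1: population-frequency filter (keep variants rarer than 1%).
rare = [v for v in variants if v["af"] < 0.01]

# Step 2: keep only pathogenic / likely pathogenic classifications.
candidates = [v for v in rare
              if v["classification"] in ("pathogenic", "likely_pathogenic")]

print([v["gene"] for v in candidates])
```

In the real case, the surviving candidates are then matched against the phenotype-derived condition list (the 341 conditions from Phenomizer).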

ACMG guidelines on how to build cases.

Focus on pathogenic and likely pathogenic

Big issue: what are the conditions that are related to the genotype of interest? Used Phenomizer, etc.; were able to narrow down to 341 conditions that might match the symptoms.

1st variant was very rare.
2nd variant was in 3%.

If the second variant is in trans with another pathogenic variant, it’s likely pathogenic as well.
Provisional diagnosis – FDA gave permission to give a verbal putative diagnosis in cases where a child’s life is in imminent danger.

Confirmatory testing was done, and the diagnosis was positive. Fortunately, there is a treatment, and the child is now thriving. He does still have the disorder, which may require a bone marrow transplant, but the child’s life was saved.

Case 2: firstborn with transient hypoglycemia. Transferred to the NICU.

At 1 month, the nurse practitioner noticed low blood sugar – hyperinsulinemia.

Similar numbers to previous, whittled down to 160 conditions, with only 1 variant that matched that disease.

Known pathogenic mutation : ABCC8.

Recessive condition, inherited from the father. There is also focal hyperinsulinism, which presents from the father (uniparental disomy).

Second event was a de novo mutation in the child, which was shown to be only at the head of the pancreas – so were able to remove the damaged segment of the pancreas.

Pancreatectomy was scheduled.

Total time: 7 days from start to cure.

Avoided major morbidity – probably major neurologic damage.

Does it scale?

35-case cohort: 57% diagnosis rate. In contrast, 9% are diagnosed by standard tests.

These cases were cherry-picked as having likely genetic diseases, but it still demonstrates the power.

2nd version:
80-case cohort: 58% diagnosis rate.

Brand new info: Kansas City test. However, the trial was discontinued because it was obvious that the diagnosis was working: 15% rate with normal tests, 41% with clinical-exome-based tests.

Makes a significant impact in all aspects of care.

For every child tested, 2.9 quality years’ improvement – or $3,500 per quality year.

Hakon Hakonarson – Children’s Hospital of Philadelphia

Centre for Applied Genomics at CHOP. Collaborate with Penn.

Case from Lipid Cohort. Familial form of lipid disease. 1700 subjects, 900 families.

Case: 55-year-old man – phenotype described (no fat, mild diabetes, lipoprotein panel appeared normal). [Missing much of it – don’t know the terms.]

Many features overlapped with adult progeria.

Initial genetic analysis: turned out to be homozygous for PLIN1 and heterozygous for WRN.
Balanced translocation t(8;10) as well.

Pedigree shown. Two brothers, both with much milder phenotypes; both had liver issues.

Goal became to map the breakpoint: were there any additional genes or elements contributing to the phenotype? The condition is far more advanced in proband.

Used Linked Read technology for translocation breakpoint mapping. (Quick review of barcoding for this technology) Gel beads in emulsion.

Fine mapping of region: near CYP26C1 and CYP26A1, and ADHFE1 on the other side.

No single gene jumped out.

Many hypotheses were considered – not clear what was going on. Next step is investigation into WRN. Cell lines used for WRN activity, protein expression and transcript assessment.

Assess changes in genes near the breakpoint.

Not totally solved, but very interesting case.

15 practical tips for bioinformaticians.

This is entirely inspired by a blog post of a very similar name from Xianjun Dong on the r-bloggers.com site.  The R-specific focus didn’t do much for me, given that R as a language leaves me annoyed and frustrated, although I do understand why others use it.  I haven’t come across Xianjun’s work before, and have never met him either online or in person, but I hope he doesn’t mind me revisiting his list with a broader scope.  Thanks to Xianjun for creating the original list!

I’ve paraphrased his points in underline, written out my responses, and highlighted what I feel is the takeaway.  So, let’s upgrade the list a bit, shall we?

1. Use a non-random seed.  Actually, that’s pretty good, but the real point should extend this to all areas of your work:  determinism is the key both to debugging and to science – you need to be able to recreate all of your work upon demand.  That’s the basis of how we do science.
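A minimal sketch of what a fixed seed buys you, using only the Python standard library (the function and read names are made up for the example):

```python
import random

# Point 1 in practice: seed the generator so that a "random" analysis
# step is reproducible run after run.
def shuffled_sample(seed=42):
    rng = random.Random(seed)          # seeded generator, not global state
    reads = [f"read_{i}" for i in range(10)]
    rng.shuffle(reads)
    return reads

# Two runs with the same seed give identical output; an unseeded run
# (or a different seed) generally would not.
assert shuffled_sample() == shuffled_sample()
```

Using a local `random.Random(seed)` instead of the module-level functions also keeps your determinism safe from other code that touches the global generator.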

2.  The original said “set your own tmp directory” so that you don’t step on other applications’ toes.  Frankly, I’d skip that and instead suggest you learn how the other applications work!  If you’re running a piece of code, take the time to learn it – and by extension, all of its parameters. The biggest mistake I see from novice bioinformaticians is trying to use code they’re not familiar with, and doing something the author never intended.  Don’t just run other people’s tools, use them properly!

3. An R-specific file name hint.  This point was far too R-centric, so I’ll just point you back to another key point: Take the time to learn the biology.  Don’t get so caught up in the programming that you forget that underneath all of the code lies an actual biology or chemistry problem that you’re trying to study, simulate or interpret.  Most often, the best bioinformatics solutions are the ones that are inspired by the biology itself.

4. Create a Readme file for your work. This is actually just the tip of the iceberg – Readme files are the last resort for any serious software project. A reasonable software project should have a wiki or a manual, as well as a host of other documentation. (Bug trackers, feature trackers, unit tests, example data files.)  The list should grow with the size of the project.  If your project is going to last more than a couple of weeks, then a readme file needs to grow into something larger.  Documentation should be an integral part of your coding practice, however you do it.

5. Comment your code.  Yes – please do.  But, don’t just comment your code, write code that doesn’t need comments!  One of the reasons why I love python is because there is a pythonic way to do things, and minimal comments are necessary to make it obvious what it’s supposed to do.  Of course, anytime you think of a “clever” trick, that’s a prime candidate for extra documentation, and the more clever you are, the more documentation I expect.
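For instance, here are two versions of the same toy function; the second is the pythonic one that largely documents itself (example mine, not from the original post):

```python
# Both functions compute the GC content of a DNA sequence.

def gc_content_verbose(seq):
    # Count G and C bases one at a time with an explicit loop.
    count = 0
    for base in seq:
        if base == "G" or base == "C":
            count = count + 1
    return count / len(seq)

def gc_content(seq):
    # The pythonic version: the expression reads as its own description.
    return sum(base in "GC" for base in seq) / len(seq)

assert gc_content("GATTACA") == gc_content_verbose("GATTACA")
```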

6. Backup your code.  Yep – I’m going to agree with the original.  However, I do disagree with the execution.  Don’t just back up your code to an extra disk, get your code into version control.  The only person who doesn’t need version control is the person who never edits their code… and I haven’t met them yet.  If you expect your project to be successful, then expect it to mature over time – and in turn, that you’ll have multiple versions.   Trust me, version control doesn’t just back up, it makes code management and collaboration possible.  Three for the price of one…. or for free if you use github.

7. clean up your intermediate data.  Actually, I think keeping intermediate data around is a useful thing, while you’re working. Yes, biological data can create big files, and you should definitely clean up after yourself, but the more important lesson is to be aware of the resources that are available to you – of which disk space is just one.  Indeed, all of programming is a tradeoff between CPU, Memory and Disk, and they’re interchangeable, of course.  If you’re not aware of the Space-Time tradeoff, then you really haven’t started your journey as a bioinformatician.  Really – this is probably the most important lesson you can learn as a programmer.
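A classic one-screen illustration of the space-time tradeoff is memoization: spend memory on a cache to avoid recomputation. A sketch with the standard library:

```python
from functools import lru_cache

# Naive Fibonacci recomputes subproblems exponentially many times
# (tiny memory, lots of CPU); the cached version stores each answer
# once (linear memory, linear CPU). Same result, different tradeoff.

def fib_slow(n):
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

@lru_cache(maxsize=None)
def fib_fast(n):
    return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)

assert fib_slow(20) == fib_fast(20) == 6765
```

The same tradeoff shows up everywhere in bioinformatics: indexes on disk to save CPU, data held in memory to save disk reads, and so on.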

8. .bam, not .sam. This point is a bit limited in scope, so let’s widen it.  All of the data you’ll ever deal with is going to be in a less-than-optimal format for storage, and it’s on you to figure out what the right format is going to be.  Have VCFs?  Gzip them!  Have .sam files?  Make them .bam files!  Of course, this doesn’t just go for storage: do the same for how you access them.  That gzipped VCF?  You should have bgzipped it and then tabix indexed it.  Same goes for your Fasta file (FAIDX?), or whatever else you have.  Don’t just use compression, use it to your advantage.
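A toy sketch of the storage point using only the standard library’s gzip module. Note this only shows the storage win: for real VCFs you’d use bgzip plus a tabix index so you also get random access by region, which plain gzip can’t give you.

```python
import gzip
import os
import tempfile

# Write a fake, highly repetitive VCF-like file compressed, then read
# it back transparently. (Records are invented for the demo.)
path = os.path.join(tempfile.gettempdir(), "toy.vcf.gz")
records = "".join(f"chr1\t{pos}\tA\tG\n" for pos in range(1000))

with gzip.open(path, "wt") as fh:      # write compressed
    fh.write(records)

with gzip.open(path, "rt") as fh:      # read transparently
    assert fh.read() == records

# The compressed file is a small fraction of the uncompressed size.
print(os.path.getsize(path), "bytes on disk vs", len(records), "uncompressed")
```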

9. Parallelize your code.  Oh man, this is a can of worms.  On the one hand, much of bioinformatics is embarrassingly parallelizable.  That’s the good news.  The bad news is that threaded/multiprocessed code is harder to debug and maintain.  This should be the last path you go down, after you’ve optimized the heck out of your code.  Don’t parallelize what you can optimize – but use parallelization to overcome resource limitations, and only when you can’t access the resources in any other way.  (If you work with a cluster, though, this may be a quick and dirty way to get more resources…)
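A minimal worker-pool sketch of an embarrassingly parallel per-chromosome task, on toy data. Threads are shown because they’re the simplest to demo; for genuinely CPU-bound work you’d reach for a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Fan a per-chromosome task out to a pool of workers. The "work" here
# (GC counting on made-up sequences) is a stand-in for something real.

def count_gc(chrom_seq):
    name, seq = chrom_seq
    return name, sum(base in "GC" for base in seq)

chromosomes = [("chr1", "GATTACA" * 100), ("chr2", "GGCC" * 50)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so results pair up cleanly.
    results = dict(pool.map(count_gc, chromosomes))

print(results)
```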

10. clean up and back up.  This was just a repeat of earlier points, so lets talk about networking.  The best way to keep yourself current is to listen to what others have to say.  That means making time to go to conferences, reading papers, blogs or even twitter.  Talk to other bioinformaticians because they’ll always have new ideas, and it’s far too easy to get in to a routine where you’re not exposing yourself to whatever is new and exciting.

11. OOP: Inheritance, Encapsulation, Polymorphism. Actually, on this point, I completely agree.  Understanding object oriented programming takes you from being able to write scripts to being able to write a program.  A subtle distinction, but it will broaden your horizons in so many ways, of which the most important is clearly code re-use.  And reusing your existing code means you start developing a toolkit instead of making everything a one off.
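A tiny, entirely hypothetical example of what that code reuse looks like: the base class captures shared parsing logic once, and each new format reader only overrides what differs.

```python
# Hypothetical classes (not from any real toolkit) showing the three
# OOP pillars in miniature.

class RecordReader:
    """Encapsulates iteration and comment-skipping for line formats."""
    comment = "#"

    def __init__(self, lines):
        self._lines = lines          # encapsulated state

    def records(self):
        for line in self._lines:
            if not line.startswith(self.comment):
                yield self.parse(line)

    def parse(self, line):           # polymorphic hook for subclasses
        return line.rstrip().split("\t")

class BedReader(RecordReader):       # inheritance: reuse records()
    def parse(self, line):           # specialize: typed coordinates
        chrom, start, end = line.rstrip().split("\t")[:3]
        return chrom, int(start), int(end)

reader = BedReader(["# a comment", "chr1\t10\t20"])
assert list(reader.records()) == [("chr1", 10, 20)]
```

Every future format reader now gets comment handling and iteration for free, which is exactly the toolkit-over-one-offs payoff.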

12. Save the URL of your references. Again, great start, but don’t just save the URL of your references.  Make notes on everything. Whatever you find useful or inspiring, make a note in your lab book.  Wait, you think bioinformaticians don’t have lab books?  If that’s true, it’s only because you’ve moved on to something else that keeps a permanent record, like version control for your code, or electronic notebooks for your commands.  Make sure everything you do is documented.

13. Keep Learning.  YES!  This!  If you find yourself treading water as a bioinformatician, you’re probably not far from sinking.  Neither programming nor biology ever really stands still – there’s always something new that you should get to know.  Keeping up with both fields is tough, but absolutely necessary.

14. Give back what you learn.  Again, gotta agree here.  There are lots of ways to engage the community: share your code, share your experience, share your opinions, share your love of science… but get out and share it somehow.

15. Stand up on occasion.  Ok, I’ll go with this too.  The sitting/standing desks are fantastic, and definitely worth the money, if you can get one.  Bioinformaticians spend way too much time sitting, and you shouldn’t neglect your health.  Or your family, actually.  Don’t forget to work hard, and play hard.

A stab at the future of bioinformatics

I had a conversation the other day about where bioinformatics is headed, and it left me thinking about it for the past few days.  Generally, the question was more about whether bioinformatics (and biotechs) are at the start of something big, or whether this is all a fad.  Unfortunately, I can’t tell the future, but that doesn’t mean I shouldn’t take a wild stab in the dark.

Some things are clear because some things never change.  Unless armageddon is upon us or aliens land, we can be sure that sequencing will continue to get cheaper until it hits bottom – by which I mean about the same cost as any other medical test. (At which point, profit margins go up while sequencing costs go down, of course!)  But, that means that for the foreseeable future, we should expect the volume of human sequencing data to continue to rise.

That, naturally, translates pretty directly to an increase in the amount of data that needs to be processed.  Bioinformatics, unlike many other fields, is all about automation and discovery – and in this case, automation is really the big deal.  (I’ll get back to discovery later.)  Pipelines that take care of the human data are going to be more and more valuable, particularly when they add value to the automation and interpretation.  (Obviously, I should disclose that I work for a company that does this.)  I can’t say that I see this need going away any time soon.  However, doing it well requires significant investment and (I’d like to think) skill.  (As an aside, sorry for all of the asides.)

Clearly, though, automation will probably be a big employer of bioinformaticians going forward.  A great pipeline is one that is entirely invisible to the people using it, and keeping a pipeline for the automation of bioinformatics data current isn’t an easy task.  Anyone who has ever said “Great! We’re done building this pipeline!” isn’t on the cutting edge.  Or even on the leading edge.  Or any edge at all.  If you finish a pipeline, it’ll be obsolete before you can commit it to your git repository.

But, the state of the art in any field, bioinformatics included, is all about discovery.  For the most part, I suspect that it means big data.  Sometimes big databases, but definitely big data sets.  (Are you old enough to remember when big data in bioinformatics came in a fasta file, and people thought perl was going to take over the world?)  There are seven billion people on earth, and they all have genomes to be sequenced.  We have so much to discover that every bioinformatician on the planet could work on that full time, and we could keep going for years.

So yes, I’m pretty bullish on the prospects of bioinformaticians in the future.  As long as we perceive knowledge about ourselves is useful, and as long as our own health preoccupies us – for insurance purposes or diagnostics – there will be bioinformatics jobs out there.  (Whether there are too many bioinformaticians is a different story for another post.)  Discovery and re-discovery will really come sharply into focus for the next few decades.

We can figure out some of the more obvious points:

  • Cancer will be a huge driver of sequencing because it changes over time, and so we’ll constantly be driven to sequence again and again looking for markers or subpopulations. It’s a genetic disease and sequencing will give us a window into what it’s doing where nothing else can.  Like physicists and the hunt for subatomic particles, bioinformaticians are going to spend the next hundred years analyzing cancer data sets over and over and over.  There are 3 billion bases in the human genome, and probably as many unique variations that can make a cell oncogenic. (Big time discovery)
  • Rare disease diagnostics should become commonplace.  Can you imagine catching every single childhood disease within two weeks of the birth of a child?  How much suffering would that prevent?   Bioinformaticians will be at the core of that, automating systems to take genetic counsellors out of the picture. (discovery turning to automation)
  • Single cell sequencing will eventually become a thing…. and then we’ll have to spend the next decade figuring out how the heck we should interpret it.  That’ll be a whole new field of tools. (discovery!)
  • Integration with medical records will probably happen.  Currently, it’s far from ideal, but mostly because (as far as I can tell) electronic medical records are built for doctors. Bioinformaticians will have to step in and have an impact.  Not that we haven’t seen great strides, but I have yet to hear of an EMR system that handles whole genome sequencing.  (automation.)
  • LIMS.  ugh. It’ll happen and drain the lives from countless bioinformaticians.  No further comment necessary. (automation)

At some point, however, it’s going to become glaringly obvious that the bioinformatics component is the most expensive part of all of the above processes.  Each will drive massive cost savings in healthcare and efficiency, but the actual process of building the tools doesn’t scale the same way as the data generation.

Where does that leave us?  I’d like to think that it’s a bright future for those who are in the field.  Interesting times ahead.


This is a strange way to begin, but moving to California has rekindled my interest in an algorithm I’ve always found fascinating: ant walks.

I hadn’t expected to return to that particular algorithm, but it turns out there’s a reason why people become fascinated with it: it’s essentially an attempt to describe the behaviour of ants… which California has given me an opportunity to study first hand.

I’m moving in a week or two, but I have to admit, I have a love/hate relationship with the ant colony in the back yard. I won’t really miss them, because they’re seriously everywhere. Although I’ve learned how to keep them out of the house, and they don’t really bother me much, they’re persistent and highly effective at finding food – especially crumbs left on the kitchen floor. (By the way, with strategic placement of ant repellent, the ants actually have a pretty hard time finding their way in… but that’s another post for another day.)

Regardless, the few times that the ants have found their way inside have inspired me to watch them and learn a bit about how they do what they do – and it’s remarkably similar to the algorithm based on their behaviour. First, they take advantage of sheer numbers. They don’t really care about any one individual, so they just send each ant out to wander around. Basically, it’s divide and conquer with zero planning. The more ants they send out, the more likely they are to find something. If you had only two or three ants, it would be futile… but 50-100 ants all wandering in a room with a small number of crumbs will result in all the crumbs being found.

And then there’s the whole thing about the trails. Watching them run back and forth along the trails shows you that the ants do know exactly where they’re going, when they have somewhere to be. When they get to the end of a trail, they seem to drop back into “seeking” mode, which concentrates the exploration in a smaller area – a more directed random search.
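The “sheer numbers” part of the behaviour above is easy to play with in code. Here’s a toy sketch – purely for illustration, not a real ant colony optimization implementation, and the `ant_search` function and its parameters are entirely my own invention. Each ant just wanders at random on a grid; a crumb counts as found when any ant steps on it. No planning, no communication – yet more ants reliably means more crumbs found.

```python
import random

def ant_search(num_ants, crumbs, grid=20, steps=200, seed=1):
    """Toy 'ant walk': every ant does an independent random walk
    on a grid; a crumb is found when any ant lands on its cell.
    (Hypothetical sketch -- no trails or pheromones modelled.)"""
    rng = random.Random(seed)
    crumb_set = set(crumbs)
    found = set()
    # All ants start in the middle of the room.
    ants = [(grid // 2, grid // 2) for _ in range(num_ants)]
    for _ in range(steps):
        for i, (x, y) in enumerate(ants):
            # Take one random step, staying inside the grid.
            dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x = max(0, min(grid - 1, x + dx))
            y = max(0, min(grid - 1, y + dy))
            ants[i] = (x, y)
            if (x, y) in crumb_set:
                found.add((x, y))
    return found

crumbs = [(3, 4), (15, 2), (8, 17), (12, 12)]
few = ant_search(3, crumbs)    # two or three ants: mostly futile
many = ant_search(80, crumbs)  # a proper colony
print(len(few), "vs", len(many), "crumbs found")
```

Extending this with trails – biasing new walks toward cells where food was already found – is what turns the toy into something like the real algorithm, but even this bare version shows the divide-and-conquer-with-zero-planning idea.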

All in all, it’s fascinating. Unfortunately, unlike Richard Feynman, I haven’t had the time to set up ant ferries as a method of discouraging the ants from returning – my daughter and wife are patient, but not THAT patient – but that doesn’t mean I haven’t had a chance to observe them.  I have to admit, of all the things that I thought would entertain me in California, I didn’t expect that ants would be on the list.

Anyone interested in doing some topology experiments? (-;

(A bit over) A year in Oakland.

Naturally, I say “I’m going to blog more”, and then I get sick for a week and a half, and nothing gets written. Never fails! I should have said, “I’m never going to blog again”, at which point, I don’t doubt the health-fairy would come and make me all better.

But I never seem to do things that way.

I was thinking about blogging about work a bit more, since I’ve been given some leeway to do so, but I kinda feel like there’s a bit of low-hanging fruit I wanted to tackle first…

Guess what – I’ve been living in California for 14 months. And you know what? It’s been fascinating. I’ve been amused, frustrated, annoyed and thrilled at the experience, and I think I should share some of it with you. As one zookeeper said to the other, “Do you want the good gnus or the bad gnus?”

Ok, let’s start with the down side. Oakland – and much of what I’ve seen of the bay area – is far less clean than Canada. I’d heard Americans come north and say that Canada is clean, but I have to say that the overwhelming impression when you come south is the opposite. For bonus points, I’ve been living near an overpass, where people love to dump their garbage. It’s not pretty, and there’s a level of grit that’s just always there, presumably courtesy of the vast volume of traffic from the highway behind our house, and the busy street in front of it. Worse, though, I’ve seen people toss stuff out of moving cars, open their doors at street lights to casually let garbage fall out, and sometimes, even just walking along, drop whatever they’re carrying if they don’t want it anymore. It has been very hard to teach my daughter how to be responsible when we’re constantly seeing examples of what not to do. Fortunately, my daughter has figured it out – and she likes to tell people that littering is wrong. I support her campaign entirely!

There’s also the unexpected social constructs of the bay area – It’s hard not to notice the racial divides that are present here. I could be wrong, given that I don’t spend a lot of time exploring that aspect of life here, but there seems to be a socioeconomic divide that falls along racial lines. Oddly enough, I don’t recall that happening in Canada to the same extent. It’s there, but not nearly as close to the surface as I seem to find it here.

Finally, I have to admit I’ve had a LOT of dealings with the IRS and the CRA over the past year. For those of you who aren’t yet 18, or have only lived on one side of “the border”, those are the U.S. and Canadian tax agencies, respectively. Overwhelmingly, I have to say that the attitudes of the people at the two agencies are night and day. After dealing with the IRS, I actually look forward to dealing with the Canada Revenue Agency. Where the IRS gives off an air of “we’re too big to care about you” in pretty much all of its interactions, the CRA seems friendly and almost like they’re really there to help you – even when they’re trying to extract more money out of you than you’ve ever owned. Bizarre, that. At any rate, I’ve amassed a significant number of stories, if anyone ever wants to hear them.

In contrast to the above, I also have to admit, there are some amazing things about living in the bay, which make me really glad I’m here.

First, watching Oakland transform is pretty damn cool. Despite the garbage and selfish attitude of a small minority of the residents, Oakland is transforming. You can see the city is repaving streets to create bike paths, new buildings are going up everywhere, and houses everywhere are starting to get a little more care. It’s probably mostly “gentrification”, as the rich from San Francisco realize that this side of the bay is actually convenient for living and working, but it’s not a dirty word. It may be displacing some people, but the influx of families and artists and all of that is kinda like watching a flower bloom in slow motion. The neighbourhood I’ve lived in for the past 14 months has seen a creeping increase in the number of children in the area… and come spring, I have no doubt my daughter would be out there making friends at the park again.

Well, she would be, if we weren’t moving. We’ve found an apartment that will fit us better – and on doctor’s orders, we will be further from the aforementioned traffic, which has been an issue for us. But, after a year, you start to find all these interesting pocket neighbourhoods, which you’d never find if you don’t make your way off the beaten path. I grew up somewhere where a “hill” was a couple of metres tall – and there weren’t many of them, so this is fascinating. The bay area has hills and valleys and microclimates and beaches and wineries… (don’t mind the meandering topic but)… oh my god, if you go a bit out of town, there are parks that blow your mind. On our first real family outing, we ended up at Limantour beach on a semi-foggy day, and were entertained by several pods of whales and dolphins parading up and down the beach, close enough that I probably could have hit one with a rock, if I’d tried. (And I have a lousy arm for tossing rocks…) When California decides to put on a show, it’s mind-blowing. Sonoma in the fall was incredible, where my daughter and I played for an hour in the falling leaves, and Napa had us ooo-ing and aah-ing over incredible produce in the market in the summer.

And then there’s the people. Yes, there are homeless people, and aggressive panhandlers – especially around the Berkeley BART station! – but the overwhelming majority of Americans have absolutely no problem suddenly breaking out into a conversation at the drop of a hat. Random people will cheerfully begin chatting with you, when you least expect it. It’s the opposite of living in a Canadian Suburban Centre. At times, it’s surprising, but it’s always interesting and it makes you feel just a little more connected into a community that has as many people as any random 3-4 Canadian provinces combined. While I can’t say I’ve made a lot of friends outside of work, I can say I’ve met a lot of interesting people in my neighbourhood.

I haven’t decided which of the above stories, or even the many unmentioned ones, I want to tell on my blog yet, but I’m starting to think that a bit of Oakland is going to spill over into my writing, along with a bit of bioinformatics. To be entirely candid, sometimes it’s hard to tell which one is stealing the show.