The point was to implement a C algorithm in Matlab (Pique-Regi et al., 2008). It uses sparse Bayesian learning (SBL) and backward elimination. (They used microarray data for this experiment.)
The goal is to identify gains, losses or neutral copy number. (In this case, they looked at specific genes, rather than regions.) [Probably because they were using array data, not 2nd gen sequencing.]
Novelty of algorithm: piece-wise constant (pwc) representation of breakpoints.
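[A minimal sketch of what a pwc representation could look like, to make the idea concrete. This is my own illustration, not the authors' code: the names `pwc_from_jumps` and `breakpoints` and the toy numbers are assumptions. The profile is the running sum of a mostly-zero vector of jump weights, so every nonzero weight after the first marks a breakpoint, which is why a sparse estimate of the weights gives you the breakpoints directly.]

```python
from itertools import accumulate


def pwc_from_jumps(jumps):
    """Rebuild a piecewise-constant profile as the running sum of jump weights.

    jumps[0] is the starting level; every later nonzero entry is a level change.
    """
    return list(accumulate(jumps))


def breakpoints(jumps):
    """Return the probe indices (>= 1) where the level changes."""
    return [i for i, w in enumerate(jumps) if i >= 1 and w != 0]


# Hypothetical toy data: 8 probes, a gain starting at probe 3, a loss at probe 6.
jumps = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, -1.5, 0.0]
profile = pwc_from_jumps(jumps)
print(profile)             # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, -0.5, -0.5]
print(breakpoints(jumps))  # [3, 6]
```

Only two of the eight weights after the baseline are nonzero here, which is the kind of sparsity the SBL prior is meant to encourage.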
Assume a normal distribution of weights, formulate as an a posteriori estimate, and apply SBL. Hierarchical prior on the weights and hyperparameters…
[some stats in here] The last step is to optimize using the expectation maximization (EM) algorithm.
Done in Matlab “because you can do fancy tricks with the code”, and it's easily readable. It’s fast, and diagonals from matrices can be calculated quickly and easily.
Seems to take 30 seconds per chromosome.
Have to filter out noise, which may indicate false breakpoints. So, a backward elimination algorithm: it measures the significance of each copy number variation found, and removes insignificant points. [AH! This algorithm is very similar to sub-peak optimization in FindPeaks… Basically you drop out the points until you find and remove all points below threshold.]
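[A rough sketch of the drop-the-weakest-point loop as I understood it. The significance score below (difference in adjacent segment means, scaled by segment lengths) is a stand-in I picked for illustration, not the statistic from the paper, and `segment_means`, `backward_eliminate` and the threshold value are hypothetical names and numbers.]

```python
from math import sqrt


def segment_means(y, bps):
    """Split y at the breakpoints and return (start, end, mean) per segment."""
    edges = [0] + sorted(bps) + [len(y)]
    return [(a, b, sum(y[a:b]) / (b - a)) for a, b in zip(edges, edges[1:])]


def backward_eliminate(y, bps, threshold):
    """Repeatedly drop the least significant breakpoint until all pass threshold.

    Breakpoints are distinct indices in 1..len(y)-1 where a new segment starts.
    The score is an assumed t-like statistic with unit noise variance.
    """
    bps = sorted(bps)  # work on a copy so the caller's list is untouched
    while bps:
        segs = segment_means(y, bps)
        scores = []
        for (a, b, m1), (_, c, m2) in zip(segs, segs[1:]):
            n1, n2 = b - a, c - b
            scores.append(abs(m2 - m1) / sqrt(1.0 / n1 + 1.0 / n2))
        weakest = min(range(len(bps)), key=scores.__getitem__)
        if scores[weakest] >= threshold:
            break  # every remaining breakpoint is significant
        del bps[weakest]  # remove it, then rescore with the merged segment
    return bps


# Hypothetical noisy profile with one real step at probe 4.
y = [0.1, -0.1, 0.0, 0.1, 2.0, 2.1, 1.9, 2.0]
print(backward_eliminate(y, [2, 4, 6], threshold=1.0))  # [4]
```

Rescoring after every removal matters, because merging two segments changes the means on either side of the neighbouring breakpoints.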
It’s slower, but more readable than C.
Used CNAHMMer by Sohrab Shah (2006) for comparison. It's an HMM with a Gaussian mixture model to assign CNA type (L,G,N). On the same data set, results were not comparable.
SBL was not much faster than CNAHMMer. (The code was not always vectorized, however, so some improvements are possible.)
Now planning to move this to Next-Gen sequencing.
Heh.. they were working from template code with Spanish comments! Yikes!
[My comments: this is pretty cool! What else do I need to say? Spanish comments sound evil, though… geez. Ok, so I should say that all their slagging on C probably isn’t that warranted… but hey, to each their own.]