What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really you want the per-transcript numbers). Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3′ biased tags?
Though I’ve been taking my time about answering, it’s a really good question. I’ve been working with transcriptomes for a while now, and have a fair amount of experience with them. I don’t want to give away all of my secrets, but I can give a few pointers. If anyone wants to collaborate on something, you know where to find me. (-;
So, first things first: with transcriptome sequencing using Illumina-based sequencing, each read you get is presumably from a single molecule of DNA, which presumably came from a single molecule of cDNA from your library. I can’t speak for all of the protocols used by the labs here at the Genome Science Centre, but the results I’ve seen have shown a very close correlation with expression levels measured by Nimblegen/Affymetrix DNA arrays, and so I tend to believe that the number of tags we’re observing per region (e.g. gene/transcript/exon) is a direct (or nearly direct) measurement of the RNA levels in the general cell population used to create the library.
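To make the tags-per-region idea concrete, here’s a minimal sketch of counting aligned reads against gene coordinates. The gene names, intervals, and read positions are entirely made up for illustration; a real pipeline would of course work from an aligner’s output rather than hand-typed positions.

```python
# Hypothetical sketch: count tags (read start positions) per gene region.
# All coordinates and names below are invented for illustration.

def count_reads_per_gene(genes, read_starts):
    """Count reads whose start position falls inside each gene's interval."""
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos <= end:
                counts[name] += 1
    return counts

genes = {"geneA": (100, 500), "geneB": (1000, 1800)}
reads = [120, 150, 480, 1100, 1200, 1750, 2500]  # last read lands outside both genes

print(count_reads_per_gene(genes, reads))  # → {'geneA': 3, 'geneB': 3}
```

In practice you’d use an interval tree or a sorted sweep rather than this nested loop, but the count itself is the quantity being treated as the expression measurement.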
I should also mention that this is very dependent upon the protocols being used. If your protocol involves amplifying the cDNA using PCR, you’re really not going to maintain that relationship. Consult an expert on this subject if you plan to try this at home. (-;
The other questions Jason asked are not quite as straightforward. We have a protocol here at the GSC that gives pretty darn even coverage across transcripts as a whole, which means that transcript end bias is pretty minimal. That largely removes the need to correct for positional biases at all. Of course, this comes down to a lot of lab technique (which is totally outside the scope of my post), as it seems to be dependent on following the appropriate protocols. I’ve seen libraries which are completely skewed, libraries that perfectly follow the transcript outlines, and libraries somewhere in between. As it stands, I now run my tools over each data set as it comes in, to judge its quality before I ask for more lanes.
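One simple way to judge the kind of end bias I’m describing is to compare mean coverage over the 5′ half of a transcript against the 3′ half. This is just a sketch of that idea with invented coverage values, not the actual tool I run; a ratio near 1.0 suggests even coverage, and values far from 1.0 suggest a skewed library.

```python
# Hypothetical sketch: a crude 5'/3' skew metric for one transcript.
# Coverage arrays below are invented for illustration.

def end_bias_ratio(coverage):
    """Ratio of mean per-base coverage in the 3' half vs. the 5' half."""
    mid = len(coverage) // 2
    five_prime = sum(coverage[:mid]) / mid
    three_prime = sum(coverage[mid:]) / (len(coverage) - mid)
    return three_prime / five_prime

even_library = [10, 11, 9, 10, 10, 9, 11, 10]
skewed_library = [2, 3, 2, 3, 18, 20, 19, 21]

print(round(end_bias_ratio(even_library), 2))    # → 1.0 (even coverage)
print(round(end_bias_ratio(skewed_library), 2))  # → 7.8 (strong 3' bias)
```

A real quality check would aggregate this over many transcripts (and handle strand and transcript orientation), but even a crude per-transcript ratio separates the skewed libraries from the well-behaved ones quickly.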
So, the short answer is: no, I don’t normalize or exclude any data when I deal with transcriptomes, but I’m in the fortunate position of being able to identify (and accept or reject!) which data sets meet a reasonable level of quality before I process them.
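For completeness, the length normalization Jason asked about is trivial to apply if you do want to compare transcripts of different lengths within a library. This is a generic tags-per-kilobase sketch with made-up numbers, not something from my own pipeline:

```python
# Hypothetical sketch: normalize a raw tag count by transcript length.
# Numbers are invented for illustration.

def tags_per_kb(tag_count, transcript_length_bp):
    """Raw tag count divided by transcript length in kilobases."""
    return tag_count / (transcript_length_bp / 1000.0)

# Two transcripts with the same raw count but different lengths:
print(tags_per_kb(300, 1500))  # → 200.0 tags per kb
print(tags_per_kb(300, 3000))  # → 100.0 tags per kb
```

The point of the example: two transcripts with identical raw counts can represent quite different expression levels once length is accounted for, which is exactly why the question matters when coverage is even across the transcript.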