(Planned) We can also use this inferred set of novel splice junctions to feed the pseudo-alignment quantifiers such as kallisto for a more accurate quantification.
Problem:
We cannot obtain any meaningful statistics from raw count data - they are affected by non-biological artifacts such as the length of transcripts (longer transcripts will have more reads than shorter transcripts, even if their abundances are equal, since we cut all reads to a certain length), and the total number of reads in a sample.
A solution:
Note that there are also count-based methodologies (based Negative Binomial count, or NB2 models, for example), but I do not cover them here.
Also note that there is still a subtle problem in between-sample normalization: for example, if a single gene is very highly expressed in a treatment sample, not only will the proportion be increased for that gene, the proportion will be decreased for all other genes, perhaps even significantly. Although I think this is not much of a problem for the whole expression profile, it may play an important role when we build models for transcript-specific abundances later - I'll cover this in more detail in the modeling sections below.
Possible paper topic: How to deal with the fact that in principle we cannot compare RPKM and TPM values directly across samples? (Maybe select a set of random or pre-selected genes across samples that are assumed to be stable across samples and conditions to normalize upon, such as 'housekeeping genes'?)
Problem:
Many reads are ambiguous - due to sequencing errors, rare variants in the sample, sequence similarity, etc.
A solution:
Many aligners have basic error correction schemes, such as tolerance level for the number of nucleotide errors when performing alignment. However, it is a well-known and nontrivial problem that there are biases in sequencing errors that can also be platform-specific, and that rare variants in the sample affect the alignment and quantification process. I will not go into the details of these research topics at the moment.
As for multi-mapping reads (reads that map to multiple locations in the genome due to sequence similarity), it is a bit embarrassing to admit that we don't have a good answer to this, especially in the case of trans-eQTLs. Both RSEM and pseudo-aligners such as kallisto have implemented an EM algorithm on the transcript abundances to account for multi-mapping reads probabilistically. However, although we haven't covered this concept yet, cis- and trans- eQTLs may require different considerations. Since we are doing a whole-genome scan for significant eQTLs in the case of trans-eQTLs, reads that map to genes in different chromosomes may induce false strong associations with trans- loci in the trans-eQTL scenario. Right now, we basically throw out multi-mapping reads in the case of trans-eQTLs, which indeed introduces a negative bias towards genes with sequence similarity, although for cis-eQTLs we don't need to be as careful.
Possible paper topic: How to deal with these multi-mapping reads in a more informed fashion? In principle, if a region in gene A and gene B are multi-mapping, and gene A has higher expression overall, we will see a deviation in expression levels in the overlapping region (lower for gene A, higher for gene B). We can think about a good implementation of this.
Problem:
These alignment/quantification steps are very slow and memory-intensive.
A solution:
Refer below to pseudo-alignment methods.
- An example comparison of Bowtie2, STAR, and many more:
It seems like for STAR is currently the best performing for splice-aware and SNP-tolerant alignment scenario.
Alternatively, we can use:
3. Quantify using pseudo-alignment (Kallisto, Sailfish, Salmon, etc.)
Kallisto
Kallitso uses a pseudo-alignment based on transcript-based colored De Bruijn Graph (T-DBG) to assign reads to equivalence classes, and then performs EM algorithm to optimize the following likelihood function: