Why does AltAnalyze need to filter my data?
Answer: Before alternative exon analysis, AltAnalyze can remove probesets that are not sufficiently expressed. This filtering is important because including non-expressed probesets can produce false-positive alternative exon calls. The problem arises when the expression of a non-expressed probeset is normalized to the expression of its gene (e.g., the constitutive probesets). If the gene is transcriptionally regulated, the gene-level value changes between conditions while the background-level probeset signal does not, so normalization creates an artificial change in probeset expression. For this reason, by default, AltAnalyze keeps only probesets with a non-log expression value above 70 and a DABG p-value below 0.05. These filters can be adjusted by the user. For example, a user who does not want to filter their data can set both thresholds to 1 (expression values below 1 do not exist for RMA).
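The normalization artifact described above can be illustrated with a small numeric sketch. This is not AltAnalyze code: the function name and all intensity values are hypothetical, and normalization is assumed to be a simple subtraction of the gene-level (constitutive) log2 value from the probeset log2 value.

```python
import math

def normalized_intensity(probeset_log2, constitutive_log2):
    """Probeset expression relative to gene expression (log2 scale).
    Assumed form of normalization, for illustration only."""
    return probeset_log2 - constitutive_log2

# Hypothetical non-expressed probeset: same background signal in both conditions.
background = math.log2(40)

# Hypothetical gene-level (constitutive) expression: 4-fold higher in cancer.
gene_normal = math.log2(200)
gene_cancer = math.log2(800)

ni_normal = normalized_intensity(background, gene_normal)
ni_cancer = normalized_intensity(background, gene_cancer)

# The probeset itself never changed, yet its normalized value appears
# to drop by the full gene-level fold change.
apparent_change = ni_cancer - ni_normal
print(f"apparent log2 change after normalization: {apparent_change:.2f}")
```

In this sketch the apparent change comes entirely from the gene-level difference, which is why probesets at background are best removed before normalization.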
Probeset Filtering Method for Affymetrix Data
For the two conditions that AltAnalyze compares (e.g., cancer versus normal), a probeset is removed if neither condition has a mean detection above background (DABG) p-value below the user threshold (e.g., 0.05). Likewise, a probeset is excluded from the analysis if neither condition has a mean probeset intensity greater than the user threshold (e.g., 70). In a pairwise comparison, the probesets used to determine gene transcription (e.g., constitutive-aligning probesets) must meet these expression thresholds in both conditions. This ensures that the gene is expressed in both conditions, so that the results reliably reflect alternative exon usage rather than changes in transcription. When comparing all biological groups in the user dataset, however, these additional filters are not used.
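The pairwise filtering rule above can be sketched as follows. This is a minimal illustration, not the AltAnalyze implementation: the function name, argument names, and data layout (one list of replicate values per condition) are assumptions made for the example.

```python
from statistics import mean

def passes_affy_filter(expr_by_group, dabg_by_group,
                       expr_threshold=70.0, dabg_threshold=0.05,
                       constitutive=False):
    """Return True if a probeset should be kept (hypothetical helper).

    expr_by_group: condition name -> list of non-log intensities
    dabg_by_group: condition name -> list of DABG p-values
    constitutive:  if True, BOTH conditions must pass each threshold;
                   otherwise at least ONE condition must pass each.
    """
    expr_ok = [mean(v) > expr_threshold for v in expr_by_group.values()]
    dabg_ok = [mean(v) < dabg_threshold for v in dabg_by_group.values()]
    combine = all if constitutive else any
    return combine(expr_ok) and combine(dabg_ok)

# Hypothetical probeset expressed in cancer but at background in normal.
expr = {"cancer": [120.0, 95.0, 110.0], "normal": [40.0, 35.0, 50.0]}
dabg = {"cancer": [0.01, 0.02, 0.01], "normal": [0.30, 0.20, 0.40]}

print(passes_affy_filter(expr, dabg))                     # regular probeset
print(passes_affy_filter(expr, dabg, constitutive=True))  # constitutive probeset
```

With these made-up values, the probeset is kept as a regular probeset because one condition passes both thresholds, but it would be rejected as a constitutive probeset because the normal condition fails.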
Probeset Filtering Method for RNASeq Data
For the two conditions that AltAnalyze compares (e.g., cancer versus normal), an exon or junction feature is removed if neither condition has a mean read count greater than the user threshold for that feature type (e.g., 2 or 10). Because exons tend to accumulate more reads than junctions due to their longer lengths, a larger threshold is typically recommended for exons. These features can also be filtered based on RPKM. For non-RPKM analyses (quantile-normalized or unadjusted read-count analyses), count data are filtered as described above for Affymetrix probeset expression values. For RPKM analyses, gene-level RPKM values are calculated and filtered prior to alternative exon analysis, using the same strategy outlined above for exon-level data. When exon-level results are present, only they (not junction results) are used to calculate gene expression values and statistics.
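The read-count filter for exon and junction features can be sketched in the same way. This is an illustrative example under stated assumptions, not AltAnalyze code: the function name and data layout are hypothetical, and the threshold values below are chosen only for demonstration (a larger value for exons than for junctions) rather than taken from the AltAnalyze defaults.

```python
from statistics import mean

def keep_feature(feature_type, counts_by_group, thresholds):
    """Keep a feature if at least one condition has a mean read count
    above the threshold for its type (hypothetical helper).

    feature_type:    "exon" or "junction"
    counts_by_group: condition name -> list of read counts per sample
    thresholds:      feature type -> minimum mean read count
    """
    threshold = thresholds[feature_type]
    return any(mean(c) > threshold for c in counts_by_group.values())

# Illustrative thresholds only; set these to match your own analysis.
thresholds = {"exon": 10, "junction": 2}

junction_counts = {"cancer": [4, 3, 5], "normal": [0, 1, 0]}
exon_counts = {"cancer": [6, 8, 7], "normal": [3, 2, 4]}

print(keep_feature("junction", junction_counts, thresholds))
print(keep_feature("exon", exon_counts, thresholds))
```

Here the junction is kept because its mean count in cancer exceeds the junction threshold, while the exon is removed because neither condition reaches the higher exon threshold.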