Kallisto-Splice: Ultrafast pseudoalignment for gene and splicing analysis in AltAnalyze

The latest version of AltAnalyze introduces a method that processes raw sequencing data (FASTQ files) directly into gene expression and alternative splicing estimates, without any additional software on your desktop or laptop computer. This integrated approach to the comprehensive analysis of raw RNA-Seq data is called Kallisto-splice. It builds upon the program kallisto, which performs ultra-fast pseudoalignment and isoform quantification from RNA-Seq FASTQ files and is integrated within AltAnalyze to automate transcriptome analyses. Kallisto-splice extends kallisto by producing direct splicing estimates (exon-exon junction and exon-intron junction) from FASTQ files. A typical FASTQ file can be processed by Kallisto-splice in 1-30 minutes on a desktop or laptop computer with at least 8GB of RAM (16GB recommended). The full workflow runs in 15 minutes to 4 hours, depending on the number of samples, dataset complexity and hardware.

Inputs to Kallisto-splice

  1. FASTQ files (single-end, paired-end, gz or uncompressed)
  2. AltAnalyze compatible species database

Direct Outputs of Kallisto-splice

  1. Sorted BAM files - pseudoaligned RNA-Seq reads with genome positions
  2. Isoform-level transcripts per million (TPM) expression values and read counts
  3. Gene-level TPMs and read counts
  4. Junction bed files for automated splicing analysis
  5. Summary read-count file

Outputs of Kallisto-splice Workflow in AltAnalyze

  1. Differential gene expression results with annotations
  2. ICGS results
  3. Alternative splicing results and annotations
  4. Pathway and network analysis
  5. Quality control, cell-type prediction, dimensionality reduction and heatmaps

Running Kallisto-splice

Demo Files

Two zip files with very small FASTQ files for demonstration purposes are available here:

  http://altanalyze.org/Data/Hs_GSE45419_FASTQs.zip (344MB - Human downsampled)
  http://altanalyze.org/Data/Mm-FASTQ-GSE70245.zip (59MB - Mouse scRNA-Seq)

The Breast Cancer dataset was downsampled from the original FASTQ files as described here: https://www.synapse.org/#!Synapse:syn7286377/files/ The human breast cancer samples correspond to two subtypes (ER-positive and Triple-negative). Unzip these files before proceeding (right-click and extract, or open the archive and extract to this directory, e.g., with WinZip).
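For scripted setups, the download and unzip step can be sketched in Python using only the standard library. This is a minimal sketch: the function names and the default DemoData target folder are illustrative assumptions and are not part of AltAnalyze; only the two URLs come from the list above.

```python
import urllib.request
import zipfile
from pathlib import Path

# Demo archive URLs as listed above.
DEMO_URLS = {
    "human": "http://altanalyze.org/Data/Hs_GSE45419_FASTQs.zip",
    "mouse": "http://altanalyze.org/Data/Mm-FASTQ-GSE70245.zip",
}


def extract_archive(zip_path, dest_dir):
    """Unpack a zip archive into dest_dir and return the archived file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


def fetch_demo(name, dest_dir="DemoData"):
    """Download one demo archive (hypothetical helper) and unpack it."""
    url = DEMO_URLS[name]
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / url.rsplit("/", 1)[-1]
    if not zip_path.exists():
        # Skip the download when the archive is already present.
        urllib.request.urlretrieve(url, str(zip_path))
    return extract_archive(zip_path, dest)
```

Calling fetch_demo("mouse") would download the 59MB mouse archive into DemoData and unpack it there; the resulting folder can then be selected as the FASTQ directory.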

Graphical User-Interface

  1. Install AltAnalyze (http://altanalyze.org or pip install altanalyze)
  2. Double-click on the AltAnalyze executable or type altanalyze on the command-line (PyPI installed) (see the Running-AltAnalyze file in the program directory for problems opening).
  3. Download the species database when prompted.
  4. For the demo dataset, ensure Homo sapiens is downloaded. Any version of the database should be compatible (e.g., EnsMart72).
  5. From the main menu in AltAnalyze, select RNA-Seq as the “Select vendor/data type”. Then select the Continue button.
  6. Select the “Process RNA-Seq reads” radio button. Then select the Continue button.
  7. Dataset Location: Enter a dataset name of choice (e.g., Breast_cancer). For “Select FASTQ files to run in Kallisto”, select the location of the unzipped FASTQ directory. The program will process all FASTQ files in that directory. Select the output directory, which is the folder where all results and subsequent input files are saved. Then select the Continue button.
  8. Expression Analysis Parameters: Choose the additional options you want to include or exclude for the pipeline analyses (optional). These include pathway analysis options and the statistical comparison tests to apply for differential expression analysis. Users can select “no” for the option “Perform alternative analysis”, which will skip the splicing analyses and process the Kallisto TPM expression file instead of the exon-exon junction derived gene RPKM file. When complete, select the Continue button.
  9. Pathway Analysis Options: Here, the user will be prompted to specify the statistical cutoff applied for differential gene and splicing analyses. The adjp option indicates an FDR-corrected p-value, as opposed to a non-corrected p-value.
  10. Alternative Splicing Analysis Options: The default recommended method for splicing analysis is selected (MultiPath-PSI); however, alternative and additional algorithm options are available. When finished, select the “Run Analysis” option.
  11. Groups Designation: Type a label for each FASTQ sample shown (e.g., ERpos, TripleNeg). This will create a “groups.” text file in the output directory folder ExpressionInput. This file will be reloaded when.
  12. Comparisons Designation: Select the experimental and control groups to compare (e.g., TripleNeg vs. ERpos). Select “Continue” to run the analysis.
  13. Analysis Progress: A black screen will appear once the analysis has begun. Be patient, as the software is performing a series of in-depth analyses, including indexing of the Kallisto transcriptome (run the first time FASTQ files are processed), Kallisto pseudo-alignment to the reference transcriptome, BAM file generation with genome coordinates for all pseudo-aligned reads, gene expression quantification, differential gene expression analysis, QC analysis, network analysis, marker identification, pathway analysis and alternative splicing analysis.

Command-Line

  1. Install AltAnalyze (http://altanalyze.org or pip install altanalyze)
  2. Download the species database:

     altanalyze --species Mm --update Official --version EnsMart72

  3. Run the analysis:

     altanalyze --platform "RNASeq" --species Mm \
       --fastq_dir /Users/altanalyze/DemoData/Mm-FASTQ-GSE70245-DownSampled/ \
       --groupdir /Users/altanalyze/DemoData/Mm-FASTQ-GSE70245-DownSampled/groups.Breast_cancer.txt \
       --compdir /Users/altanalyze/DemoData/Mm-FASTQ-GSE70245-DownSampled/comp.Breast_cancer.txt \
       --output /Users/altanalyze/DemoData/Mm-FASTQ-GSE70245-DownSampled/output \
       --expname Breast_cancer --runGOElite yes --returnPathways all
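When the same run has to be repeated for several datasets, the command above can be assembled and launched from a script. The following is a minimal Python sketch, assuming altanalyze is on the PATH (PyPI install); the two helper functions are hypothetical names introduced here, and the flags are only those shown in the command above.

```python
import subprocess


def build_altanalyze_command(species, fastq_dir, groups_file, comps_file,
                             output_dir, expname):
    """Assemble the argument list for the RNA-Seq run shown above."""
    return [
        "altanalyze",
        "--platform", "RNASeq",
        "--species", species,
        "--fastq_dir", fastq_dir,
        "--groupdir", groups_file,
        "--compdir", comps_file,
        "--output", output_dir,
        "--expname", expname,
        "--runGOElite", "yes",
        "--returnPathways", "all",
    ]


def run_altanalyze(command):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. For example, build_altanalyze_command("Mm", fastq_dir, groups_file, comps_file, output_dir, "Breast_cancer") reproduces the arguments of the demo command, and run_altanalyze then executes it.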

Outputs of Kallisto-Splice

There is a large array of results from this workflow, which can be found in the folders described below. Note that a separate PDF file describing the files in each of these folders is saved to the root directory. Please refer to those PDFs for details.

  1. ExpressionInput: This includes all expression estimates for exon-exon junctions, kallisto isoforms and genes as normalized values (TPM and RPKM) and counts. All Kallisto results are saved to the Kallisto_results folder along with the number and percentage of aligned reads.
  2. ExpressionOutput: This folder contains all computed differential gene expression results, primarily found in the DATASET file. The MarkerFinder folder contains the top markers assigned to each sample group (AllGenes_correlations-ReplicateBased.txt file).
  3. DataPlots: This folder contains the majority of saved plots as pdf and png files. Note that the MarkerFinder folder in this directory contains additional plots. Splicing-associated plots will also be saved to the folder AltResults/AlternativeOutput.
  4. AltResults: This directory contains all splicing analysis results. The most important file is “Hs_RNASeq_top_alt_junctions-PSI_EventAnnotation.txt”, which contains all MultiPath-PSI detected splicing events and associated annotations. Statistical comparison results are saved to the “Events-dPSI” folder and splicing graphs to the SashimiPlots folder in the output directory (derived from the BAM files in the output directory).
  5. SashimiPlots: This folder contains the PDF and PNG outputs for genome and exon-exon junction aligned reads associated with example top-significant alternative splicing events. Users can output additional such plots from the Additional Analyses menu option “AltExon Viewer”.
  6. GO-Elite: This folder contains all pathway and gene-set enrichment analysis results. See each folder for the “pruned-results_z-score_elite.txt” (open in Excel or equivalent). Network graphs, heatmaps comparing different comparison groups and optional colored WikiPathways are saved to these directories.
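To collect the main result files after a run, the folder layout described above can be searched with a short script. This is a minimal Python sketch: find_result_files is a hypothetical helper, and the search patterns assume the folder and file names given above, with recursive matching because the exact subfolder depth is not specified here.

```python
from pathlib import Path


def find_result_files(output_dir):
    """Collect key result files from an AltAnalyze output directory."""
    root = Path(output_dir)
    return {
        # MultiPath-PSI event annotation tables (the species prefix, e.g. Hs_, varies).
        "psi_events": sorted(root.glob("AltResults/**/*PSI_EventAnnotation.txt")),
        # Pruned pathway and gene-set enrichment tables.
        "goelite": sorted(root.glob("GO-Elite/**/*pruned-results_z-score_elite.txt")),
        # Saved plots in PDF or PNG format.
        "plots": sorted(
            p for p in root.glob("DataPlots/**/*")
            if p.suffix.lower() in {".pdf", ".png"}
        ),
    }
```

Each dictionary value is a sorted list of paths, so an empty list indicates that the corresponding analysis step was skipped or that its files are stored elsewhere.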