Unsupervised Single-Cell Expression Profile Identification - ICGS

Introduction

A significant challenge in the analysis of single-cell RNA-Seq gene expression profiles is the unbiased identification of the most coherent, correlated gene signatures that are able to segregate distinct developmental states or cell-types without prior knowledge. As a means to do this, we development an algorithm that ultimately identifies those genes and transcription factors that correspond to the predominant expression signatures present in a sample, while excluding signatures that are likely stochastic or isolated to a single sample (Figure 1).

Figure 1

Single cell expression clustering via driver gene analysis: Parameters, PCA stored derived gene-set, positive, top correlated genes (rho>0.4) with driver identification and BioMarker enrichment analysis.


To provide streamlined and automated analysis of single-cell datasets (without prior bioinformatics experiments), the Iterative Clustering and Guide-gene Selection (ICGS) algorithm was implemented as apart of the conventional AltAnalyze workflow in version 2.0.9. Associated source-code is provided with AltAnalyze (RNASeq module, singleCellRNASeqWorkflow function) (Figure 2). In addition to the algorithm details below, additional information is available on the AltAnalyze website and software documentation.

Figure 2

Single cell expression clustering via driver gene analysis: Parameters, PCA stored derived gene-set, positive, top correlated genes (rho>0.4) with driver identification and BioMarker enrichment analysis.


Software Algorithm

ICGS includes multiple options for filtering the expression data based on normalized gene values (e.g., TPM, FPKM), fold change, correlation thresholds for identifying the most coherent gene sets, clustering algorithms (e.g., HOPACH) and optional supervised clustering options using custom or established gene-sets, pathways or Ontology terms. Gene clusters containing multiple cell-cycle-annotated genes can be optionally excluded from the software graphical user interface. Results for each stage of clustering are stored and presented back to the user as a heatmap and HOPACH cluster sample. For the five separate heatmap results generated at each stage in the analysis, the user can select one of these for further analysis as pairwise group comparison between all possible HOPACH clusters using the default AltAnalyze gene and alternative splicing analysis workflows. While the ICGS analysis is being run, the software performs the following steps:

Step 1 - Pioneering Round Analysis

In this first round of analysis, all input gene expression values are analyzed and subsequently filtered to exclude low expression, low-variance transcript and weakly-correlated genes. For single-cell libraries an already normalized gene expression matrix (RPKM, FPKM or TPM) or aligned read counts files can be used as input for this workflow. For aligned read counts, these can be provided as TopHat-produced junction.bed, BedTools exon.bed or TCGA junction expression files, among others. Using the graphical user interface, ICGS can be parameterized to restrict the analysis to protein-coding genes, exclude genes with a maximum expression value (RPKM, FPKM or TPM) and a minimum expression difference (fold change) between the lowest and highest expressing libraries for each gene (number of samples differing must also be indicated by the user). A correlation threshold must also be provided to indicate the minimum relative similarity required to report correlated genes for downstream analyses. Based on the supplied correlation cutoff, ICGS retains genes (initial guide genes) correlated to at least 5 others (protocluster). A secondary filter is to exclude protoclusters in which a single outlier cell for the initial guide gene is driving the protocuster correlations. To this end, at least 50% of the protocluster genes must remain correlated to the initial guide gene after temporary exclusion of the most highly expressed cell for the initial guide gene. Using the HOPACH clustering algorithm, through an optional connection to R and bioconductor, “Pioneering Round” clustering results are obtained for all remaining filtered protocluster genes (Figure 1, left).

Step 2 – Reduce Cluster Complexity

As many gene clusters are produced following the initial round of HOPACH, each with different sizes, the software performs a pruning step to produce clusters of more similar number of genes (3 to 50 genes). This is accomplished by first identifying the genes in each cluster that are correlated to the most members in that cluster (greater than the input Pearson correlation cutoff). This step further improves cluster definitions by removing genes and clusters with lower overall intra-correlation. What is left are a smaller set of genes, highly correlated to each other and largely distinct from other HOPACH clusters. These genes are again clustered with HOPACH to produce more coherent signatures.

Step 3 –Selection of Guide Genes Round 1

Following HOPACH clustering, the algorithm selects the top intra-correlated gene(s) for each gene cluster (correlated to the most members in that cluster with the provided Pearson correlation cutoff), and any gene encoding transcription factors present in that cluster (UniProt or Gene Ontology defined) as drivers for the following round of clustering. At this stage, gene clusters and drivers with 2 or more Gene-Ontology-defined cell-cycle genes will be optionally excluded. All genes from the initially filtered expression dataset are correlated to these guide genes (positive correlations only) using the user provided cutoff (maximum of 140 correlated genes per guide gene), to obtain guide correlated gene sets that are clustered again with HOPACH.

Step 4 – Selection of Guide Genes Round 2

Following HOPACH of the guide-correlated genes, additional genes will be brought into the analysis that contain new potential guide genes. To discover these, the resulting HOPACH clustered genes from Step 3 are subjected to the protocol in Step 3 to select new distinct guide genes. As before, HOPACH clustering is run and the results stored.

Step 5 – Final Clustering Using All Guide Genes

Combining the guide genes identified in Steps 3 and 4, Step 3 is repeated once again and a final clustered heatmap is produced. These clustered results are presented with all other in the AltAnalyze graphical user interface. The user can choose to quit the analysis and work with the produced clustered sets or select specific heatmap and cell-clustering results for downstream comparison analyses in AltAnalyze.

Downstream Comparison Analyses in AltAnalyze (Optional)

In the graphical user interface, the user is presented with the option of selecting any of the resulting cluster outputs from ICGS for further AltAnalyze-based comparison analyses between the defined HOPACH cell cluster groups. These analyses include all pairwise group comparisons using the selected statistical test (e.g., moderated t-tests, Benjamini-Hochberg adjustment), pathway analysis in GO-Elite, PCA, MarkerFinder, pathway visualization and splicing analyses, when exon or junction data is available.

In addition to these AltAnalyze workflows, further analysis of the ICGS clusters can be performed using the built-in hierarchical clustering tool using the advanced options available. These advanced options include the ability to interactively display enriched pathways, ontologies or other gene sets to the left of the heatmap, run Monocle R library, obtain all correlated genes to a manually supplied guide gene or available gene sets. For the current analysis, genes associated with Neutropenia or MDS/AML were input as manual guide genes to obtain correlated gene sets using a supplied Pearson correlation cutoff of 0.4. Step-by-step tutorials and videos illustrating this workflow are available from the AltAnalyze website.