1 Introduction
1.1 Overview
1.1.1 Single-cell/single-nucleus RNA-seq technologies
Single-cell and single-nucleus RNA sequencing technologies (sc/snRNAseq) allow a detailed and comprehensive analysis of cellular processes in many sub-specialties of the life sciences (Alon et al. 2021; Birnbaum 2018; Davie et al. 2018; Soysa et al. 2019; Ding et al. 2020; Ji et al. 2019; Luecken and Theis 2019; Macosko et al. 2015; Majumdar et al. 2021; Mathys et al. 2019; Papalexi and Satija 2018; Polioudakis et al. 2019; Potter 2018; Ravenscroft et al. 2020; Shaw, Tian, and Xu 2021; Suva and Tirosh 2019; Tang et al. 2009; Wen and Tang 2016; Y. Zhang et al. 2018). Sc/snRNAseq involves the generation of single-cell or single-nucleus suspensions from tissues or cultured cells and the capture of RNA molecules from individual cells within small compartments. Within these compartments, the cells are lysed, and the RNA molecules are used to generate cDNA tagged with a DNA barcode. The barcode within each compartment is unique and identifies all cDNA generated from that compartment (Macosko et al. 2015; Nayak and Hasija 2021; Saliba et al. 2014; Tang et al. 2009; Zheng et al. 2017). This general theme has given rise to a variety of technologies that differ in the method used to create compartments and in the target RNA molecules that are captured. Droplet-based techniques such as Drop-seq generate compartments using microfluidics technology, with gel beads delivering barcoded oligos and reverse transcription reagents. 10X Genomics adapted this technology in its proprietary Chromium controller, which captures the 3’ ends of RNA molecules through oligo-dT-mediated reverse transcription (X. Zhang et al. 2019). SMART-seq (and the subsequent SMART-seq2) technology can use fluorescence-activated cell sorting (FACS) or the C1-Fluidigm platform to capture single cells in microwell plates, and it permits the detection of full-length transcripts (Picelli et al. 2014; Tang et al. 2009).
Massively parallel single-cell RNA-sequencing (MARS-seq) and Cell Expression by Linear amplification and Sequencing (CEL-seq) technologies also use FACS to capture individual cells in microwell plates (384-well plates) and provide cost-effective, customizable platforms for detecting RNA molecules (Keren-Shaul et al. 2019). The Seq-well technology employs a portable microwell platform with gel-bead-mediated delivery of barcoded oligos to capture single cells and polyA-tailed transcripts (Gierahn et al. 2017). Single-nucleus sequencing technologies detect transcripts from the nuclei of individual cells and are especially useful when the initial sample is formalin-fixed or paraffin-embedded tissue (Bakken et al. 2018; Habib et al. 2017, 2016; Zheng et al. 2017). Single-cell combinatorial indexing (sci-) sequencing technology delivers barcode labels in situ to fixed cells or isolated nuclei, and the samples are then isolated for second-strand synthesis using either FACS or dilution. Readers are referred to detailed review and benchmarking studies of these methodologies (Ding et al. 2020; Mereu et al. 2020).
1.1.2 Alignment of sequencing reads
Following the initial barcoding and cDNA preparation, the samples are subjected to cDNA library preparation, which in some of the single-cell technologies described above involves the addition of a ‘sequencing index’, and to quality assessment of the libraries (Web Page 2021a). High-throughput sequencing of these libraries yields millions of reads that carry the barcode and the associated RNA sequence information. These reads can be processed using specialized software such as the Cell Ranger pipeline from 10X Genomics (Zheng et al. 2017) or using generic alignment and enumeration tools such as Kallisto, Salmon-Alevin, Bowtie, or STARsolo (Brüning et al. 2021; Du et al. 2020).
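The enumeration step can be illustrated with a minimal sketch. It assumes that each read has already been aligned and reduced to a (cell barcode, UMI, gene) triple, which is a simplification: tools such as Cell Ranger or STARsolo additionally correct sequencing errors in barcodes and UMIs. The function name `count_umis` and the toy reads are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def count_umis(reads):
    """Collapse (barcode, umi, gene) reads into per-cell, per-gene molecule counts.

    Reads sharing the same barcode, UMI, and gene are treated as amplification
    duplicates of one original molecule and are counted once.
    """
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, umi, gene in reads:
        key = (barcode, umi, gene)
        if key in seen:
            continue  # duplicate of a molecule that was already counted
        seen.add(key)
        counts[barcode][gene] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {bc: dict(genes) for bc, genes in counts.items()}

# Toy reads (hypothetical barcodes, UMIs, and gene names).
reads = [
    ("AAAC", "TTG", "Actb"),
    ("AAAC", "TTG", "Actb"),   # duplicate read of the same molecule
    ("AAAC", "CCA", "Actb"),
    ("AAAC", "GGT", "Gapdh"),
    ("CCGT", "TTG", "Actb"),   # same UMI, different cell: a separate molecule
]
umi_counts = count_umis(reads)
print(umi_counts)
```

Keying the duplicate check on the full triple, rather than on the UMI alone, reflects that the same UMI sequence can legitimately occur in different compartments.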
1.1.3 Post alignment processing
Alignment and counting of reads usually result in a gene-by-barcode table holding the count for each gene in each cell (the gene-count matrix). Each alignment and enumeration platform reports this table in a different format. The 10X Genomics Cell Ranger pipeline produces three files: a sparse matrix of counts, a barcodes file, and a genes file; the latter two label the columns and rows of the matrix, respectively. Other methods produce either a dense matrix, which can be stored as comma-separated values (.csv) or tab-delimited text (.txt) files, or a sparse representation such as the dgCMatrix class used in R (Brüning et al. 2021; Du et al. 2020).
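The relationship between the sparse and dense forms of the gene-count table can be sketched as follows. The example assumes that the counts file follows the Matrix Market coordinate layout (a size line followed by 1-indexed "row column value" entries, with genes as rows and barcodes as columns); the file contents, gene names, and barcodes below are invented for illustration, and the helper names are hypothetical.

```python
def read_triplets(mtx_text):
    """Parse Matrix Market coordinate text into (n_genes, n_cells, entries)."""
    lines = [ln for ln in mtx_text.splitlines()
             if ln.strip() and not ln.startswith("%")]  # skip header/comments
    n_genes, n_cells, n_entries = map(int, lines[0].split())
    entries = {}
    for ln in lines[1:]:
        row, col, value = ln.split()
        # Matrix Market indices are 1-based; convert to 0-based.
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_genes, n_cells, entries

def to_dense(n_genes, n_cells, entries):
    """Expand sparse entries into a dense genes-by-cells list of lists."""
    dense = [[0] * n_cells for _ in range(n_genes)]
    for (gene_idx, cell_idx), value in entries.items():
        dense[gene_idx][cell_idx] = value
    return dense

# Toy stand-ins for the three files (contents are assumptions).
mtx_text = """%%MatrixMarket matrix coordinate integer general
3 2 4
1 1 5
2 1 1
1 2 2
3 2 7
"""
genes = ["Actb", "Gapdh", "mt-Co1"]   # row labels (genes file)
barcodes = ["AAAC", "CCGT"]           # column labels (barcodes file)

n_genes, n_cells, entries = read_triplets(mtx_text)
dense = to_dense(n_genes, n_cells, entries)
for gene, row in zip(genes, dense):
    print(gene, dict(zip(barcodes, row)))
```

Only the non-zero entries are stored in the sparse form, which matters because most genes have zero counts in most cells.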
Depending on the technology used, reads are identified by a unique molecular identifier (UMI) or by a read identity specific to the method (Mereu et al. 2020). The total number of UMIs or reads per cell helps identify outliers in the data. Outliers with unusually low totals are typically cells that were poorly lysed or that were partially lysed before cDNA capture, whereas two or more cells captured within one compartment (doublets) yield a substantially higher number of mRNA molecules from a single compartment (Luecken and Theis 2019; Mereu et al. 2020; Nayak and Hasija 2021).
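One way to flag such outliers is sketched below. The rule used here, flagging cells whose log10 total count lies more than three median absolute deviations (MADs) from the median, is a common heuristic and an assumption of this sketch, not a threshold prescribed by the cited studies. The function names and toy counts are hypothetical.

```python
import math
from statistics import median

def total_counts(counts_by_cell):
    """Sum the counts over all genes for each cell barcode."""
    return {bc: sum(genes.values()) for bc, genes in counts_by_cell.items()}

def flag_outliers(totals, n_mads=3.0):
    """Flag cells whose log10 total is far from the median (MAD-based rule).

    Returns a dict mapping flagged barcodes to "low" or "high".
    Cells with a total of zero are flagged "low" directly.
    """
    flags = {bc: "low" for bc, t in totals.items() if t == 0}
    logs = {bc: math.log10(t) for bc, t in totals.items() if t > 0}
    if not logs:
        return flags
    centre = median(logs.values())
    mad = median(abs(v - centre) for v in logs.values())
    for bc, v in logs.items():
        if v < centre - n_mads * mad:
            flags[bc] = "low"    # e.g. poorly or partially lysed cell
        elif v > centre + n_mads * mad:
            flags[bc] = "high"   # e.g. possible doublet
    return flags

# Toy per-cell gene counts (invented values).
counts_by_cell = {
    "c1": {"Actb": 600, "Gapdh": 400},
    "c2": {"Actb": 700, "Gapdh": 500},
    "c3": {"Actb": 500, "Gapdh": 400},
    "c4": {"Actb": 650, "Gapdh": 450},
    "c5": {"Actb": 30, "Gapdh": 20},
    "c6": {"Actb": 5000, "Gapdh": 4000},
}
totals = total_counts(counts_by_cell)
flags = flag_outliers(totals)
print(flags)
```

A high total count is only suggestive of a doublet; dedicated doublet-detection methods use additional information such as mixed expression profiles.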
Except in the case of single-nucleus sequencing, RNA molecules encoded by nuclear genes as well as by the genomes of cell organelles such as mitochondria and chloroplasts can be captured. A genome annotation that includes mitochondrial genes can be used to align and enumerate transcripts from both nuclear and mitochondrial genes. The abundance of mitochondrial or other organelle transcripts relative to that of nuclear-encoded transcripts is usually low across cell types, so an elevated fraction can be used as a quality control metric in the evaluation of cellular integrity before sequencing (Osorio and Cai 2020).
The typical output of a single-cell sequencing method ranges from hundreds to thousands of cells. Many single-cell technologies also allow the sequential and repeated capture of more cells from a single sample source, resulting in millions of cells captured and sequenced (Cao et al. 2020, 2017; Mereu et al. 2020). With the thousands of genes that can potentially be expressed in each cell type, the gene-count matrix produced by each of these technologies is a voluminous table of millions of data points. These data points place each cell in a high-dimensional space with one dimension per gene, and the data need to be projected onto a reduced-dimensional space to extract meaningful information (Becht et al. 2018). Before such a dimensional reduction, the counts for each cell need to be normalized and, in several cases, scaled across cells (Hafemeister and Satija 2019; McCarthy et al. 2017; Satija et al. 2015; Vieth et al. 2019).
Apart from the count information, many of these experiments require additional information describing the source of the samples, experimental conditions, and treatments. This ‘meta’ information helps in evaluating the processed data and in extracting results such as the differential expression of genes under different conditions.
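The mitochondrial quality control metric and a simple form of count normalization can be sketched as follows. Two assumptions are made: mitochondrial genes are recognized by an "mt-" name prefix (a naming convention that depends on the species and annotation), and normalization divides each count by the cell total, multiplies by a scale factor of 10,000, and applies a log(1 + x) transform, which is one widely used scheme rather than the only option. All names and values below are illustrative.

```python
import math

def mito_fraction(gene_counts, prefix="mt-"):
    """Fraction of a cell's counts that come from mitochondrial genes.

    The prefix match is case-insensitive; the prefix itself is an assumption
    about how the annotation names mitochondrial genes.
    """
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mito = sum(count for gene, count in gene_counts.items()
               if gene.lower().startswith(prefix.lower()))
    return mito / total

def log_normalize(gene_counts, scale_factor=10_000):
    """Library-size normalization followed by a log(1 + x) transform."""
    total = sum(gene_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in gene_counts}
    return {gene: math.log1p(count / total * scale_factor)
            for gene, count in gene_counts.items()}

# Toy counts for one cell (invented values).
cell = {"Actb": 60, "Gapdh": 30, "mt-Co1": 10}
frac = mito_fraction(cell)
norm = log_normalize(cell)
print(f"mitochondrial fraction: {frac:.2f}")
print({gene: round(value, 3) for gene, value in norm.items()})
```

Dividing by the cell total makes cells with different sequencing depths comparable, and the log transform reduces the influence of a few highly expressed genes on downstream dimensional reduction.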
1.2 Seurat and Monocle
Several analytical pipelines have been developed for processing gene-count tables into manageable and minable datasets. At the time of writing, the most widely used methods for processing post-alignment data, in terms of the number of publications, are Seurat and Monocle (Web Page 2021b; Vieth et al. 2019). Seurat is an R package that enables users to perform quality control, normalization, dimensionality reduction, and clustering of cells, among several other functionalities (Butler et al. 2018; Satija et al. 2015; Stuart et al. 2019; Vieth et al. 2019). These functionalities are implemented as versatile functions that are easy to learn and apply for bioinformaticians and computational biologists across life sciences disciplines. Monocle is an R package that has also been used in a large number of studies to process and analyze single-cell data. In addition, Monocle provides a tool for analyzing temporal changes in samples through pseudotime analysis (Qiu, Hill, et al. 2017; Qiu, Mao, et al. 2017; Trapnell et al. 2014). The steps involved in normalization, data scaling, principal component analysis, dimensional reduction, and pseudotime analysis are well documented in several publications and dedicated web portals (Web Page, n.d.a, n.d.b).
1.3 Need for Ryabhatta and Natian
For a large number of life sciences researchers, however, the analysis of single-cell sequencing data has been limited by the command-line nature of these tools. While the functions and steps used in Seurat and Monocle are easy to learn for researchers with an R background, the functions and data formats are complex enough to be a barrier to data exploration by non-computational scientists. Learning R and the tools associated with single-cell data analysis is the ultimate solution for researchers interested in long-term analysis of scRNAseq data, but exploring the use and scope of single-cell datasets should not be limited by computer literacy. To bring computational and non-computational scientists to a common table, where they can analyze the data and discuss the results of their analysis, we provide two graphical user interfaces (GUIs), ‘Ryabhatta’ and ‘Natian’. These GUIs are based on the Shiny package in R and are intended for users with no R background. Their installation and operation have been evaluated on Windows, macOS, and Linux operating systems. Including life sciences researchers who have extensive experience in molecular and cellular biology but limited R experience in single-cell data analysis should help generate novel biological insights from existing and future single-cell data sets.