Single Cell RNAseq Data Processing & Analysing Course


Day 2 - Processing scRNAseq reads generated from Nextflow pipeline


Recap from Day 1



  • Yesterday we covered:
    • How to use Unix.
    • How to get onto HAWK.
    • How to fetch the Github repository and download the resources.
    • How to execute the scrnaseq nextflow pipeline on Hawk.

Outputs from the nextflow pipeline



  • The output from the nextflow execution run is stored under results/cellranger_count_1k_mouse_kidney_CNIK_3pv3
  • Login to your Hawk account and access the folder:
cd /scratch/c.123456/scrnaseq-course/scrnaseq/results

cd cellranger_count_1k_mouse_kidney_CNIK_3pv3


Downloading data for analysis in Seurat (for analysis on RStudio on local system)



  • You can skip this step if you are working on RStudio Server.
  • Once the counts are generated using nextflow pipeline, download the data onto your personal computer.
  • For this, you will need to use Filezilla to copy the files from remote onto your local machine.
  • Follow the steps below:
  1. Type in the Host: hawklogin.cf.ac.uk
  2. Username:
  3. Password:
  4. Port: 22
  • Click on Quickconnect
  • Navigate to the folder scrnaseq-course/scrnaseq/results
  • Create a folder on your local machine, called scrnaseq-nextflow.
  • Create 4 folders within scrnaseq-nextflow folder:
    • bin
    • input
    • output
    • resources
  • Now, Drag and Drop the files from the results folder in the remote server to your input folder within scrnaseq-nextflow folder on your local machine.
  • Once the files are copied, these should be visible on your local machine under input folder.

gui1


Processing scrnaseq data (1k_mouse_kidney_CNIK_3pv3) in Seurat



  • Once the counts have been downloaded into the input folder on your local machine, Start RStudio on Mac/Windows.

gui1


  • Click on New File –> R Markdown... gui1

  • This is will open an Untiltled file on the top-left hand side.

gui1


  • Delete lines from 7 until 29 except first 5-6 lines.

gui1



Load libraries


  • Copy the line in your markdown document:
# load libraries
library(Seurat)
  • Press the green arrow key on line 7 which says Run Current Chunk to load the Seurat library.

gui1


Setup the Seurat object


  • We start by reading in the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix.
  • The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).
# Load the 1k mouse kidney dataset
dat2 <- Read10X(data.dir = "input/filtered_feature_bc_matrix/")
  • We next use the count matrix to create a Seurat object.
# Initialize the Seurat object with the raw (non-normalized data).
dat2Obj <- CreateSeuratObject(counts = dat2, project = "1KmouseKidney", min.cells = 3, min.features = 200)

dat2Obj
## An object of class Seurat 
## 18321 features across 1375 samples within 1 assay 
## Active assay: RNA (18321 features, 0 variable features)
##  1 layer present: counts

QC and filtering cells for further analysis


  • Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria.
  • These includes:
    • The number of unique genes detected in each cell.
      • Low-quality cells or empty droplets will often have very few genes
      • Cell doublets or multiplets may exhibit an aberrantly high gene count
    • The percentage of reads that map to the mitochondrial genome.
      • Low-quality / dying cells often exhibit extensive mitochondrial contamination
      • We calculate mitochondrial QC metrics with the PercentageFeatureSet() function
      • We use the set of all genes starting with MT- or mt- as a set of mitochondrial genes
# The [[ operator can add columns to object metadata.
dat2Obj[["percent.mt"]] <- PercentageFeatureSet(dat2Obj, pattern = "^mt-")
  • To visualize QC metrics, we use VlnPlot() function.
# Visualize QC metrics as a violin plot
VlnPlot(dat2Obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

gui1

Try using ^MT- as pattern. What do you get?
dat2Obj[["percent.mt"]] <- PercentageFeatureSet(dat2Obj, pattern = "^MT-")
  • Next, you can plot a FeatureScatter plot for visualising feature-feature relationships.
# Plot FeatureScatter plot
plot1 <- FeatureScatter(dat2Obj, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(dat2Obj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

gui1


  • Now, we will filter cells that have unique feature counts over 9000 or less than 200
  • We filter cells that have >5% mitochondrial counts
dat2FilteredObj <- subset(dat2Obj, subset = nFeature_RNA > 200 & nFeature_RNA < 9000 &  percent.mt < 5)

dat2FilteredObj
# An object of class Seurat 
# 18321 features across 1235 samples within 1 assay 
# Active assay: RNA (18321 features, 0 variable features)
#  1 layer present: counts

Lets break it down:

  • nFeature_RNA > 200: Keeps cells that have more than 200 detected genes (features). This removes low-quality cells or empty droplets.
  • Feature_RNA < 9000: Removes cells with too many detected genes, which might indicate doublets (two cells captured in one droplet).
  • percent.mt < 5: Filters out cells where mitochondrial gene percentage is greater than 5%. High mitochondrial content often indicates damaged or dying cells.

  • With this command, the number of cells reduces to 1235.

Normalizing the data


  • After removing unwanted cells from the dataset, the next step is to normalize the data.
  • By default, we employ a global-scaling normalization method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
  • Normalized values are stored in dat2NormalisedObj[["RNA"]]$data
dat2NormalisedObj <- NormalizeData(object = dat2FilteredObj, normalization.method = "LogNormalize", scale.factor = 10000)
  • The use of SCTransform replaces the need to run NormalizeData, FindVariableFeatures, or ScaleData.

Identification of highly variable features (feature selection)


  • We next calculate a subset of features (genes) that exhibit high cell-to-cell variation in the dataset.
  • These are the features those are highly expressed in some cells, and lowly expressed in others.
  • It has been found Brennecke et al. Nat Methods.2013 that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
dat2Normalised2Obj <- FindVariableFeatures(dat2NormalisedObj, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(dat2Normalised2Obj), 10)
plot3 <- VariableFeaturePlot(dat2Normalised2Obj)
plot4 <- LabelPoints(plot = plot3, points = top10, repel = TRUE)
CombinePlots(plots = list(plot3, plot4))

gui1


Scaling the data


  • The ScaleData function in Seurat is used to normalize and scale gene expression values across cells.
  • Additionally, it can regress out unwanted sources of variation, such as sequencing depth (nUMI) or mitochondrial gene percentage (percent.mito), helping to remove technical artifacts.
  • By default, only variable features are scaled.
  • But if you observe, in out script, we have used rownames(dat2Normalised2Obj) which applies scaling to all genes which is stored in all.genes object.
  • This is then applied to ScaleData function which uses all.genes as features for scaling.
all.genes <- rownames(dat2Normalised2Obj)
dat2ScaledObj <- ScaleData(object = dat2Normalised2Obj, features = all.genes)
How can I remove unwanted sources of variation - To remove unwanted sources of variation from a single-cell dataset, such as "percent.mt", use "vars.to.regress" to include these:
dat2ScaledObj <- ScaleData(object = dat2Normalised2Obj, features = all.genes, vars.to.regress = "percent.mt")

Perform linear dimensional reduction


  • Next we perform PCA on the scaled data.
  • By default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to choose a different subset.
dat2Scaled2Obj <- RunPCA(object = dat2ScaledObj, features = VariableFeatures(object = dat2ScaledObj), do.print = TRUE, ndims.print = 1:5, nfeatures.print = 5)
  • This command uses only the highly variable genes (VariableFeatures(object = dat2ScaledObj)) instead of all genes to reduce noise and improve clustering.
  • do.print = TRUE prints the top genes contributing to each principal component (PC).
  • pcs.print = 1:5 displays the results for PC1 to PC5 (i.e., the first five principal components).
  • PC1 (Principal Component 1) is the most important axes of variation.
  • PC2, PC3, PC4, etc. explain additional variation in the data.
  • Top Positive Genes (e.g., Ldb2, Ebf1, Prkg1, Meis2) are more highly expressed in one subset of cells.
  • Top Negative Genes (e.g., Slc27a2, Keg1, Sugct) are more highly expressed in another subset.
  • PC1 may capture a major biological signal, such as cell type differences.
  • If PC2 separates a different cell population, it might represent a different biological process, like a cell cycle state or differentiation.
PC_ 1 
Positive:  Ldb2, Ebf1, Dlc1, Prkg1, Meis2 
Negative:  Slc27a2, Keg1, Gramd1b, Gm42397, Sugct 
PC_ 2 
Positive:  Slit2, Prdm16, Rgs6, Slc16a7, Me3 
Negative:  Slc7a12, Aadat, Ldb2, Gm20400, Slc22a19 
PC_ 3 
Positive:  Sox5, Lama2, Kcnt2, Cfh, Svep1 
Negative:  Adgrl4, Ptprb, Flt1, Emcn, St6galnac3 
PC_ 4 
Positive:  Chrm3, Slc7a12, Cd36, Mettl7a2, Gm30551 
Negative:  Nox4, Bmp6, Bnc2, Slc34a1, Acsm2 
PC_ 5 
Positive:  Abca13, Fli1, Ldb2, Fbxl7, Sgms2 
Negative:  Nphs1, Nphs2, Ptpro, Wt1os, Nebl
  • DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses.
DimHeatmap(
    object = dat2Scaled2Obj, 
    dims = 1, 
    balanced = TRUE
)

gui1

  • Now, use dims 1 to 15 to plot 15 PCs to observe which PCs contribute to sources of variation in the dataset.
DimHeatmap(
    object = dat2Scaled2Obj, 
    dims = 1:15, 
    balanced = TRUE
)

gui1


Determining dimensionality


  • After running PCA using RunPCA(), you need to determine how many PCs to retain for clustering and downstream analysis.
  • One common method is using the Elbow Plot.
  • The top principal components represents highest variations in the dataset.
  • So, how many components should we choose to include? 10? 20? 100?
  • In ‘Elbow plot’ it ranks principle components based on the percentage of variance explained by each one.
ElbowPlot(dat2Scaled2Obj)

gui1

  • If we are considering going with the default settings with ndims = 20, then we can clearly see that at the elbow point is at PC 11 which accounts for most variation in the dataset.

  • However, if we choose to include ndims = 50, then the elbow plot changes and includes 50 PCs.

ElbowPlot(dat2Scaled2Obj, ndims = 50)

gui1

  • Now, this gives us a much broader picture and also shows more PCs are contributing to variation beyond PC 11.
  • With this plot, we can keep the value between PC 21-25.
  • In this tutorial, I have chosen PC 25 to stretch the limit and include smaller variance. But the value of 21 will also work fine.

Clustering


  • Seurat applies a graph-based clustering approach, called Shared Nearest Neighbors (SNN).
  • How the SNN Score is Computed:
    • If cell A’s neighbor list and cell B’s neighbor list share many cells, they are likely to be in the same cluster.
    • The SNN score (edge weight) is higher if two cells have more shared neighbors.
  • Example:
    • Cell A’s neighbors: {B, C, D, E}
    • Cell B’s neighbors: {A, C, D, F}
    • Cell A and Cell B share: {C, D}
    • SNN score = Number of shared neighbors / Minimum kNN list size
    • SNN score = 2 / 4 = 0.5
  • If two cells share many of their nearest neighbors, their SNN score is high, and they are more likely to belong to the same cluster.

gui1

  • So, the first step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 25 PCs).
  • To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM to iteratively group cells together.
  • The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters.

Run the following commands:

seuratFindNeighbors <- FindNeighbors(
    dat2Scaled2Obj,
    reduction = "pca",
    dims = 1:25,
    do.plot = FALSE,
    graph.name = NULL,
    k.param = 30
)

seuratFindClusters <- FindClusters(object = seuratFindNeighbors)

Non-linear dimensional reduction


  • Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets.
  • The goal of these algorithms is to learn underlying structure in the dataset, in order to place similar cells together in low-dimensional space.
  • Therefore, cells that are grouped together within graph-based clusters determined above should co-localize on these dimension reduction plots.
dat2UMAP <- RunUMAP(seuratFindClusters, reduction = "pca", dims = 1:25, n.neighbors = 30L, min.dist = 0.3, seed.use = 123456L)
  • If you see that we have used 25 PCs and n.neighbors as 30
  • n.neighbors defines the number of nearest neighbors considered when constructing the UMAP graph.
  • Larger values capture broader structures, while smaller values focus on finer details.
n.neighbors Value Effect on UMAP When to Use?
Low (5-15) Captures local structure, fine details Small datasets, cluster refinement
Default (30) Balanced between local and global structure General use case
High (50-100) Preserves global relationships, reduces fragmentation Large datasets, broader clusters

To generate UMAP, run the following command:

DimPlot(dat2UMAP, reduction = "umap")

gui1



Doublet detection and removal


  • In scRNA-seq experiments, doublets are artifactual libraries generated from two cells.
  • Doublets are defined when two or more cells are mistakenly treated as a single cell.
  • They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols involving thousands of cells.
  • In this workshop, we will be using DoubletFinder.
  • It is a tool used to detect and remove doublets (artificially merged cells) in scRNA-seq data.
  • Below is a step-by-step guide to doublet detection and removal in Seurat using DoubletFinder.

Step 1: Install and Load Required Packages

remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')

Once the package is installed, the load the library.

library(DoubletFinder)

Step 2: Parameter Sweeping for pK Selection

Sweep a range of parameters to find the optimal pK.

sweep.res.list <- paramSweep(dat2UMAP, PCs = 1:20, sct = TRUE)
sweep.stats <- summarizeSweep(sweep.res.list, GT = FALSE)
bcmvn <- find.pK(sweep.stats)
pk <- as.numeric(as.vector(bcmvn$pK)[which.max(bcmvn$BCmetric)])
  • paramSweep(): Tests multiple pK (parameter K) values using PCs 1 to 20.
  • summarizeSweep(): Summarizes results into a dataframe.
  • find.pK(): Finds the best pK using a “BCmetric” score.
  • which.max(bcmvn$BCmetric): Selects the pK value with the highest BCmetric, as this gives the best separation between doublets and singlets.

Why do we need pK?

  • pK is a clustering resolution parameter that affects doublet detection.
  • A higher BCmetric means better performance of a given pK.

Step 3: Estimate the Expected Number of Doublets

annotations <- dat2UMAP@active.ident
homotypic.prop <- modelHomotypic(annotations)
nExp_poi <- round(0.076 * length(dat2UMAP@active.ident))
nExp_poi.adj <- round(nExp_poi * (1 - homotypic.prop))
  • modelHomotypic(annotations): Estimates homotypic doublets (cells of the same type that are mistakenly identified as a doublet).
  • nExp_poi <- round(0.076 * length(p1tumorFilteredObj@active.ident)):
    • Uses 10X Genomics’ doublet rate formula: ~7.6% (0.076) × number of cells.
    • This gives the expected number of doublets (nExp_poi) in the dataset.
  • nExp_poi.adj <- round(nExp_poi * (1 - homotypic.prop))
    • Adjusts for homotypic doublets to avoid over-filtering.

Why adjust for homotypic proportion?

  • If cells in a cluster are mostly from the same cell type, they might be misclassified as doublets.
  • This adjustment prevents over-filtering of true biological clusters.

Step 4: Run DoubletFinder (First Pass)

seuratDoublet <- doubletFinder(dat2UMAP, PCs = 1:20, pN = 0.25, pK = pk, 
                               nExp = nExp_poi, reuse.pANN = FALSE, sct = TRUE)
  • This step assigns a probability score (pANN) to each cell, indicating how likely it is a doublet.

Step 5: Prepare for Second Pass

pANN_String <- paste("pANN_0.25_", pk, "_", list(nExp_poi), sep="")
  • Creates a string name for the pANN column, which stores doublet probability scores.

Step 6: Run DoubletFinder (Second Pass - Adjusted for Homotypic Doublets)

dat2UMAPDoublet <- doubletFinder(seuratDoublet, PCs = 1:10, pN = 0.25, pK = pk, 
                                        nExp = nExp_poi.adj, reuse.pANN = pANN_String)
  • The first pass identifies all doublets.
  • The second pass refines the classification by removing homotypic doublets.

Step 7: Append doublets column in metadata

# Use this format: DF.classifications_pN_pk_nExp_poi

# Find out pk value
print(pk)

# Find out nExp_poi value
print(nExp_poi)

# Replace the DF.classifications_pN_pk_nExp_poi with the above values

# For example, if pk is 0.24 and nExp_poi is 134, then execute the following command

classificationsCol <- dat2UMAPDoublet@meta.data["DF.classifications_0.25_0.24_134"]

dat2UMAPDoublet@meta.data[,"DF_hi.lo"] <- classificationsCol[1]

  • Now, construct a UMAP displaying the Doublets and Singlets.
  • You can also generate Violin plot displaying the same information.



Remove doublets and pre-process singlets


Remove the doublets and select Singlets by performing the following command:

dat2UMAPSinglets <- subset(dat2UMAPDoublet, subset = DF_hi.lo == "Singlet")

Once the doublets are removed and singlets are selected, then re-run the following steps as performed above:

singletsNormalisedObj <- NormalizeData(object = dat2UMAPSinglets, normalization.method = "LogNormalize", scale.factor = 10000)
singletsNormalised2Obj <- FindVariableFeatures(singletsNormalisedObj, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(singletsNormalised2Obj)
singletsScaledObj <- ScaleData(object = singletsNormalised2Obj, features = all.genes)
singletsScaled2Obj <- RunPCA(object = singletsScaledObj, features = VariableFeatures(object = singletsScaledObj), do.print = TRUE, ndims.print = 1:5, nfeatures.print = 5)
ElbowPlot(singletsScaled2Obj, ndims = 50)
seuratFindNeighbors <- FindNeighbors(
    singletsScaled2Obj,
    reduction = "pca",
    dims = 1:25,
    do.plot = FALSE,
    graph.name = NULL,
    k.param = 30
)
singletsFindClusters <- FindClusters(object = singletsFindNeighbors)
singletsUMAP <- RunUMAP(singletsFindClusters, reduction = "pca", dims = 1:25, n.neighbors = 30L, min.dist = 0.3, seed.use = 123456L)

Once the pre-processing steps are finished, construct the UMAP using the DimPlot function.

DimPlot(singletsUMAP, reduction = "umap")


Data Visualisation


  • There are several data visualisation methods and functions in Seurat which are used for visualising marker feature expression.
  • This can be achieved using:
    • FeaturePlot()
    • DotPlot()
    • VlnPlot()
    • RidgePlot()
    • DimPlot()
    • DoHeatmap()

FeaturePlot function


  • The FeaturePlot() function in Seurat is used to visualize the expression of one or more genes (features) on a UMAP.
  • It helps in identifying the spatial distribution of gene expression across clusters or cell populations.
FeaturePlot(dat2UMAP, features = c("Slc26a4", "Slc34a1", "Slc8a1", "Ptpro", "Csmd1", "Top2a"))

gui1


DotPlot function


  • The DotPlot() function in Seurat is used to visualize the expression levels of multiple genes across different clusters in a dot plot format.
  • It is particularly useful for comparing marker gene expression across cell clusters and assigning cell types.
genes = c("Slc26a4", "Slc34a1", "Slc8a1", "Ptpro", "Csmd1", "Top2a")
DotPlot(dat2UMAP, features = genes) + RotatedAxis()

gui1


VlnPlot function


  • The VlnPlot() function in Seurat creates violin plots to visualize the distribution of gene expression levels across different clusters or groups of cells.
  • It helps in identifying differences in gene expression between clusters.
genes = c("Slc26a4", "Slc34a1", "Slc8a1", "Ptpro", "Csmd1", "Top2a")
VlnPlot(dat2UMAP, features = genes)

gui1


RidgePlot function


  • The RidgePlot() function in Seurat generates ridge plots to visualize the distribution of gene expression across different clusters or groups of cells.
  • It is useful for comparing the expression levels of genes in different cell populations.
genes = c("Slc26a4", "Slc34a1", "Slc8a1", "Ptpro", "Csmd1", "Top2a")
RidgePlot(dat2UMAP, features = genes, ncol = 2)

gui1


DimPlot function


  • The DimPlot() function in Seurat is used to visualize cells in a reduced dimensional space (e.g., UMAP, t-SNE, or PCA) and color them based on their cluster identity or metadata attributes.
  • It is useful for identifying and interpreting cell populations in single-cell RNA-seq analysis.

Syntax:

DimPlot(object, reduction = "umap", group.by = "ident", label = TRUE, pt.size = 1)

For example-

DimPlot(dat2UMAP, reduction = "pca")

gui1


DoHeatmap function


  • The DoHeatmap() function in Seurat is used to generate a heatmap of gene expression across different cell clusters or groups.
  • It is particularly useful for visualizing the expression patterns of top marker genes across clusters.
DoHeatmap(dat2UMAP, features = VariableFeatures(dat2UMAP)[1:100], cells = 1:500, size = 4,
          angle = 90) + NoLegend()

gui1