3 Seurat Object

Metrics and QC

Published

August 18, 2025

suppressPackageStartupMessages({
    library(tidyverse)
    library(Seurat)
    # BiocManager::install('glmGamPoi') 
    library(glmGamPoi)
    library(future)
    library(corrplot)
    library(patchwork)
    })

### helper functions
source("20250725-helper_functions.R")

KK4_464 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/464/filtered_feature_bc_matrix.h5")
KK4_465 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/465/filtered_feature_bc_matrix.h5")
KK4_492 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/492/filtered_feature_bc_matrix.h5")
KK4_496 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/496/filtered_feature_bc_matrix.h5")
KK4_497 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/497/filtered_feature_bc_matrix.h5")
KK4_502 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/502/filtered_feature_bc_matrix.h5")
KK4_504 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/504/filtered_feature_bc_matrix.h5")
KK4_505 <- Read10X_h5("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/cellRanger_data/505/filtered_feature_bc_matrix.h5")


# List names of all dgCMatrix objects in the global environment
dgCMatrix_names <- ls() |> keep(\(x) inherits(get(x), "dgCMatrix"))

# Create a Seurat objects from the expression matrices
seurat_objects <- map(dgCMatrix_names,
                      \(x) CreateSeuratObject(counts = get(x), project = x, 
                                              min.cells = 10, min.features = 200))




# name the list elements
names(seurat_objects) <- dgCMatrix_names

################################################################################
col_data = read.csv("~/Dropbox/SyncBriefcase/LAB/UK/scRNA_Akhil/ARIA SC Prep Info.csv") |> 
    column_to_rownames("Mouse.ID")


# Add a metadata column  indicating sample type
seurat_objects <- imap(seurat_objects, \(x, ID) { x$type = col_data[ID, "Treatment.Group"]
                       x})

### REMOVE SAMPLES IN THE SALINE CONTROL GROUP ########################################
seurat_objects <- seurat_objects[!names(seurat_objects) %in% c("KK4_497", "KK4_505")] #
#######################################################################################

save(seurat_objects, file = "data/20200620-list_of_Seurat_objects.Rdata")

3.1 Merged Seurat

load ("data/20200620-list_of_Seurat_objects.Rdata")
## merge for QC

seurat_merged <- merge(
  x = seurat_objects[[1]],
  y = seurat_objects[-1],
  add.cell.ids = names(seurat_objects)
)

VlnPlot(seurat_merged, features = c("nFeature_RNA", "nCount_RNA"), 
        alpha = 0.05,ncol = 2)

3.1.1 Inflection Plots

Inflection plots are diagnostic tools used in droplet-based single-cell RNA-seq to distinguish real cells from empty droplets or background ambient RNA. They help identify a threshold between barcodes corresponding to real cells and those that are likely empty droplets or background noise.

What is an inflection plot?

An inflection plot usually shows:

The barcode rank (x-axis, in decreasing order of total UMI counts per barcode)
The total UMI count per barcode (y-axis)

Barcodes are ranked from highest to lowest UMI count. The shape of the curve is typically steep initially (real cells with high counts) and then flattens out (empty droplets or noise).

Key regions in the plot:

Left (steep drop): Real cells with high transcript counts.
Inflection point / knee: A natural drop in total UMI count. This point is used to define a threshold between real cells and background.
Right (flat tail): Low UMI barcodes, likely empty droplets or ambient RNA.

How it’s used:

Empty droplets: Barcodes after the knee point usually correspond to droplets that did not capture a cell, containing only ambient RNA.
Background/ambient RNA: These droplets have very low transcript counts and a high mitochondrial content.
Multiplets (doublets/triplets): These tend to lie before the inflection point, but with abnormally high counts. They’re not detected by the plot directly, but if a barcode has too many UMIs or genes, it may be flagged as a possible doublet. These are detected better with dedicated tools like DoubletFinder or Scrublet.

inflections <- CalculateBarcodeInflections(seurat_merged)
BarcodeInflectionsPlot(inflections) +
  scale_x_continuous(trans = "log10") +
  scale_color_viridis_d()

3.1.2 QC metrics

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community (Ilicic et al. 2016) include:

The number of unique genes detected in each cell.
- Low-quality cells or empty droplets will often have very few genes
- Cell doublets or multiplets may exhibit an aberrantly high gene count
Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
The percentage of reads that map to the mitochondrial genome
- Low-quality / dying cells often exhibit extensive mitochondrial contamination
- We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features
- We use the set of all genes starting with MT- as a set of mitochondrial genes

seurat_merged[["percent_mito"]] <- PercentageFeatureSet(seurat_merged, pattern = "^mt")
seurat_merged[["percent_ribo"]] <- PercentageFeatureSet(seurat_merged, pattern = "^Rp[ls]")
seurat_merged[["percent_hemoglobin"]] <- PercentageFeatureSet(seurat_merged, pattern = "^Hb[^PES]")
seurat_merged[["percent_plat"]] <- PercentageFeatureSet(seurat_merged,pattern = "Pecam1|Pf4")



md <- c("percent_mito", "percent_ribo","percent_hemoglobin", "percent_plat", "Malat1" )

VlnPlot(seurat_merged, assay = "RNA", log = F,
        group.by = "orig.ident", features = md, pt.size = 0.1,alpha = 0.01, ncol = 2)

3.1.3 Most expressed genes in the dataset

# Compute the proportion of counts of each gene per cell
# Use sparse matrix operations, if your dataset is large, doing matrix devisions the regular way will take a very long time.

C <- seurat_merged[["RNA"]]$counts
C@x <- C@x / rep.int(colSums(C), diff(C@p)) * 100
most_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]
boxplot(as.matrix(t(C[most_expressed, ])),
    cex = 0.1, las = 1, xlab = "Percent counts per cell",
    col = (scales::hue_pal())(20)[20:1], horizontal = TRUE
)

3.1.4 SCTransform on the merged Seurat object for QC.

# Normalize with SCTransform (regress out unwanted sources of variation if needed)
seurat_merged <- SCTransform(seurat_merged)

# Run PCA
seurat_merged <- RunPCA(seurat_merged)

# Run UMAP
seurat_merged <- RunUMAP(seurat_merged, dims = 1:20)

save (seurat_merged, file = "data/20250620-Seurat_merged.Rdata")

# visualize UMAP

load("data/20250620-Seurat_merged.Rdata")
FeaturePlot(seurat_merged, features = c("percent_ribo","percent_mito", "nFeature_RNA","Malat1"), 
            split.by = "type")

In the UMAP, regions with low nFeature_RNA (low gene detection per cell) tend to overlap with areas of high percent.ribo, indicating that cells with poor transcript diversity are enriched in ribosomal RNA, a typical marker of low-quality or dying cells. These same areas also show low expression of Malat1, a gene commonly used as a proxy for cell quality in scRNAseq data (Clarke and Bader 2024; Montserrat-Ayuso and Esteve-Codina 2024), further supporting that these clusters may represent transcriptionally inactive or degraded cells. The convergence of these three features, as shown in the correlation plot below, suggests that these regions likely contain “damaged” cells that should be considered for filtering during quality control.

#| layout-ncol: 2
#| fig-width:10
#| column: page

Malat1 <-LayerData(seurat_merged, layer = "scale.data", assay = "SCT")["Malat1", ]

mat <- bind_cols(seurat_merged[["percent_mito"]], seurat_merged[["percent_ribo"]], seurat_merged[["percent_hemoglobin"]], seurat_merged[["percent_plat"]], Malat1 = Malat1, seurat_merged[["nFeature_RNA"]]) 

corr_mat <- cor(mat, method = "p")
corrplot(corr_mat,
         title = "Correlation plot", type = "upper")

3.2 Problems in downstream analysis

When these Seurat objects were SCTransformed and integrated, regressing out mitochondrial content as a confounding factor, the UMAP shows two distinct blobs for each microglial and astrocyte populations after cell type annotation.

Notably, these differences persisted even when Malat1 expression was used as an additional marker for transcriptional activity or cell quality. Despite normalization, the two blobs within each cell type showed marked differences in ribosomal content and nFeature_RNA, indicating that transcript complexity and ribosomal bias remained as sources of heterogeneity. suggesting that the separation reflects underlying biological or technical variation not fully corrected.

3.3 Subsetting. Filter out cells with less than 2000 features

To avoid including poor-quality cells or technical artifacts, we subset the data to filter out cells with fewer than 2000 features. This step removes cells that likely belong to a lower-quality population, preventing them from biasing downstream analyses such as clustering or differential expression. The threshold of 2000 features is defined based on the distribution shown in the figure below.

features <- imap (seurat_objects, \(x, name) tibble(seurat_object = name,
                                             n_features = x$nFeature_RNA,
                                             ribo = PercentageFeatureSet(x, pattern = "^Rp[ls]"))) |> 
    list_rbind()

ggplot(features,aes(n_features,col=seurat_object))+
    labs(y="", title = "Density Plot", subtitle = "Number of features / cell")+
    geom_vline(xintercept = 2000)+
    density_decor()

ggplot(features,aes(ribo,col=seurat_object))+
    labs(y="", title = "Ribo content ")+
    scale_x_continuous( " ", limits = c(0,30),breaks = c(0,7,10,20,30),
                        labels = scales::label_number(scale = 1/1, suffix = "%"))+
     geom_vline(xintercept = 7)+
    density_decor()