Sometimes I wonder if my stability has anything to do with me; maybe the real constant is the way I’ve been prepared. Maybe an infinite number of different people, put through the same process, would all emerge the same. Have all emerged the same. I don’t know.
Greg Egan, The Infinite Assassin (from his short story collection Axiomatic)
Clustering is a critical piece of any single-cell analysis pipeline, from suspension to spatial, from transcript to protein. Thus, we want to make sure we get this step right before moving on. At this step, especially for tools that determine the number of clusters in the data for you (e.g. Louvain or Leiden clustering on a KNN graph, typical of a scRNA-seq workflow), the user might wonder how “sure” the tool is that the clusters it finds are indeed what is biologically there. How can we be sure that each of these is a meaningful cluster, and not some arbitrary partitioning (which you might get if you used k-means, for example)?
We address this problem using what we will call cluster stability. The way we go about it is as follows: take whichever clustering algorithm you are going to use, and run it many times (e.g. 10 or 20) with different random seeds. Then, look at how similar each clustering scheme is to the others. We do this by computing the cluster centroids of each clustering run and stringing them together into a gif that the user can watch.
Here, we are going to test the cluster stability of KNN graph-based (Seurat) clusters on the scRNA-seq PBMC 3k dataset. We are using this dataset because it is a flagship dataset that the community knows well, and because we know that the default clusters and resolution are more or less stable. We will test multiple “resolutions” aside from the default, leading to more clusters or fewer clusters in the output, and see how this affects stability dataset-wide and within cell subsets (e.g. are the clusters among the T cells generally less stable than the clusters within the myeloid compartment?).
Let’s get started. First, we are going to bring in the data and process it, as per Seurat’s best practices. Then, we will build up each piece of the infrastructure. Finally, we will take everything we’ve done and turn it into nice functions that you can use on your own data.
library(Seurat)
library(SeuratData)
library(tidyverse)
cells <- SeuratData::LoadData("pbmc3k")
## Warning: Assay RNA changing from Assay to Assay
## Warning: Assay RNA changing from Assay to Assay5
cells
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
## 2 layers present: counts, data
And now we process it.
cells <- NormalizeData(cells)
cells <- FindVariableFeatures(cells)
cells <- ScaleData(cells)
cells <- RunPCA(cells)
Now we will run the UMAP first, just to get that out of the way.
cells <- RunUMAP(cells, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
And now we are going to run three different clusterings: default resolution, low resolution, and high resolution. These will be the clustering schemes that we will run over and over in a loop. But we will first run them here in order to visualize what they look like on the UMAP, so we can orient ourselves.
The number of dimensions we are using is the default for the PBMC 3k dataset, as per the Seurat guided clustering tutorial. We will only be changing the resolution.
cells <- FindNeighbors(cells, dims = 1:10)
cells <- FindClusters(cells, resolution = 0.5, cluster.name = "res_0.5", verbose = FALSE)
cells <- FindClusters(cells, resolution = 0.1, cluster.name = "res_0.1", verbose = FALSE)
cells <- FindClusters(cells, resolution = 2, cluster.name = "res_2", verbose = FALSE)
And now we look at what we’ve got.
DimPlot(object = cells, group.by = "res_0.5")
DimPlot(object = cells, group.by = "res_0.1")
DimPlot(object = cells, group.by = "res_2")
And so you can see what the cell subsets actually are (important for subsequent analysis), we will plot the provided annotations here.
DimPlot(cells, group.by = "seurat_annotations")
Great. Now we have something to work with. What we are going to do now is get the cluster stability of each of these clusterings by computing the cluster centroids, so you can see how it’s done. Below, we pull the centroids from our default clustering resolution and plot them directly on the UMAP coordinates.
dimr <- cells@reductions$umap@cell.embeddings %>% as_tibble()
dimr
## # A tibble: 2,700 × 2
## umap_1 umap_2
## <dbl> <dbl>
## 1 -3.59 -4.23
## 2 -1.60 12.9
## 3 -2.31 -1.03
## 4 9.48 1.85
## 5 -9.31 -5.97
## 6 -1.28 -4.50
## 7 -5.85 -4.29
## 8 -5.83 -3.28
## 9 -5.36 -5.76
## 10 10.4 1.58
## # ℹ 2,690 more rows
clust <- cells@meta.data %>% as_tibble() %>% .[grepl("res_", names(.))]
clust
## # A tibble: 2,700 × 3
## res_0.5 res_0.1 res_2
## <fct> <fct> <fct>
## 1 0 0 7
## 2 3 3 11
## 3 2 0 1
## 4 5 1 8
## 5 6 2 12
## 6 2 0 13
## 7 4 2 3
## 8 4 2 3
## 9 4 0 6
## 10 5 1 8
## # ℹ 2,690 more rows
md <- bind_cols(dimr, clust)
md
## # A tibble: 2,700 × 5
## umap_1 umap_2 res_0.5 res_0.1 res_2
## <dbl> <dbl> <fct> <fct> <fct>
## 1 -3.59 -4.23 0 0 7
## 2 -1.60 12.9 3 3 11
## 3 -2.31 -1.03 2 0 1
## 4 9.48 1.85 5 1 8
## 5 -9.31 -5.97 6 2 12
## 6 -1.28 -4.50 2 0 13
## 7 -5.85 -4.29 4 2 3
## 8 -5.83 -3.28 4 2 3
## 9 -5.36 -5.76 4 0 6
## 10 10.4 1.58 5 1 8
## # ℹ 2,690 more rows
And now we can get to work.
test_cent <- lapply(unique(md$res_0.5), function(i) {
  result <- dplyr::filter(md, res_0.5 == i) %>%
    dplyr::select(umap_1, umap_2) %>%
    apply(., 2, mean)
  return(result)
}) %>% do.call(rbind, .) %>% as_tibble()
test_cent
## # A tibble: 9 × 2
## umap_1 umap_2
## <dbl> <dbl>
## 1 -1.85 -5.61
## 2 -1.73 14.6
## 3 -2.14 -2.55
## 4 9.59 0.564
## 5 -10.0 -5.20
## 6 -6.78 -4.20
## 7 9.71 4.08
## 8 6.10 3.32
## 9 10.5 -1.96
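As an aside, the same centroids can be computed with a group_by/summarise idiom. This is just an equivalent alternative, not what we use below (note the rows may come out in a different order, since group_by sorts by factor level rather than order of appearance):
test_cent_alt <- md %>%
  dplyr::group_by(res_0.5) %>%
  dplyr::summarise(umap_1 = mean(umap_1), umap_2 = mean(umap_2)) %>% # per-cluster means
  dplyr::select(umap_1, umap_2) # drop the cluster column to match test_cent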
Now we will overlay the centroids on the plot.
ggplot() +
  geom_point(data = dimr,
             aes(x = umap_1, y = umap_2),
             color = "black",
             alpha = 0.9) +
  geom_point(data = test_cent,
             aes(x = umap_1, y = umap_2),
             color = "yellow",
             size = 3)
And that is how we are going to run each individual instance of this.
Now we are going to run the above a large number of times, and convert that into a gif so we can actually see what happened. So we will set up a workflow for the default clustering.
We note that there is a fixed random seed in here that we have to kill, or every run would come out identical. We will do that by simply setting the random seed to i, whichever iteration we are on.
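As a quick optional sanity check (an aside, not part of the workflow): with the seed held fixed, FindClusters should be deterministic, so two identical calls ought to agree exactly. It is only by varying the seed that we probe stability.
# Optional sanity check: same seed in, same partition out (we expect TRUE)
chk1 <- Seurat::FindClusters(cells, random.seed = 1, resolution = 0.5, verbose = FALSE)$seurat_clusters
chk2 <- Seurat::FindClusters(cells, random.seed = 1, resolution = 0.5, verbose = FALSE)$seurat_clusters
identical(chk1, chk2)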
num_iter <- 10
clust_list <- lapply(seq(num_iter), function(i) {
  # Get the clusters
  cells <- Seurat::FindClusters(cells, random.seed = i, resolution = 0.5, cluster.name = "tmp_clusters", verbose = FALSE)
  clust <- cells@meta.data$tmp_clusters
  curr <- bind_cols(dimr, cluster = clust)
  # Get the centroids
  centroids <- lapply(unique(clust), function(j) {
    curr_cent <- dplyr::filter(curr, cluster == j) %>%
      dplyr::select(umap_1, umap_2) %>%
      apply(., 2, mean)
    return(curr_cent)
  }) %>% do.call(rbind, .) %>% as_tibble()
  return(centroids)
})
clust_list[[1]]
## # A tibble: 9 × 2
## umap_1 umap_2
## <dbl> <dbl>
## 1 -2.42 -2.75
## 2 -1.73 14.6
## 3 9.59 0.564
## 4 -10.0 -5.20
## 5 -6.72 -4.24
## 6 -1.60 -5.56
## 7 9.71 4.08
## 8 6.10 3.32
## 9 10.5 -1.96
So that’s what it’s going to look like. Now we have a list of centroid tables, one per run. What we can do is feed these into a plotting function.
CentPlot <- function(dimr, centroids) {
  p <- ggplot() +
    geom_point(data = dimr,
               aes(x = umap_1, y = umap_2),
               color = "black",
               alpha = 0.9) +
    geom_point(data = centroids,
               aes(x = umap_1, y = umap_2),
               color = "yellow",
               size = 3)
  return(p)
}
Now we make a plot to test out the function.
CentPlot(dimr = dimr, centroids = clust_list[[1]])
Perfect. Now we convert each of the clusterings into plots. Remember, they are organized in a nice list.
plot_list <- lapply(clust_list, function(i) {
  CentPlot(dimr = dimr, centroids = i)
})
Let’s print a couple.
plot_list[[1]]
plot_list[[2]]
Ok, good. Now we have to convert the list of images into a gif. Here is how you do that.
library(base64enc)
# Install magick if you haven't already
# install.packages("magick")
setwd(here::here("output", "seurat"))
library(magick)
# Set up a graphics device to capture the plots.
# You can adjust width, height, and resolution as needed.
img_list <- image_graph(width = 800, height = 600, res = 96)
# Loop over your list and print each plot.
for (plot in plot_list) {
  print(plot)
}
# Turn off the graphics device to record the plots.
dev.off()
## quartz_off_screen 
##                 2
# Animate the captured images (adjust fps for desired speed).
animation <- image_animate(img_list, fps = 5)
# Write the GIF to a raw vector in memory (without saving to disk)
gif_raw <- image_write(animation, format = "gif")
# Encode the raw GIF data into a Base64 string
gif_base64 <- base64encode(gif_raw)
# Output an HTML image tag with the embedded Base64 GIF
cat(sprintf('<img src="data:image/gif;base64,%s">', gif_base64))
# Save the animated GIF
image_write(animation, path = "res_0.5.gif")
Now that we have a workflow all set up, we can proceed with the experiment. We already see that the T cells are less stable than the rest of the clusters, as shown in the gif. But does this change as we change the resolution of the clustering?
Now we are going to streamline the workflow a bit.
Let’s organize so we have functions first, and then we can put everything through them. We are going to bring CentPlot in here, just for the sake of keeping everything together, despite it being a bit repetitive.
MakeCentroidList <- function(seurat_obj, num_iter = 10, clust_res = 0.5, dimr) {
  clust_list <- lapply(seq(num_iter), function(i) {
    # Get the clusters
    seurat_obj <- Seurat::FindClusters(seurat_obj, random.seed = i, resolution = clust_res, cluster.name = "tmp_clusters", verbose = FALSE)
    clust <- seurat_obj@meta.data$tmp_clusters
    curr <- bind_cols(dimr, cluster = clust)
    # Get the centroids
    centroids <- lapply(unique(clust), function(j) {
      curr_cent <- dplyr::filter(curr, cluster == j) %>%
        dplyr::select(umap_1, umap_2) %>%
        apply(., 2, mean)
      return(curr_cent)
    }) %>% do.call(rbind, .) %>% as_tibble()
    return(centroids)
  })
  return(clust_list)
}
CentPlot <- function(dimr, centroids) {
  p <- ggplot() +
    geom_point(data = dimr,
               aes(x = umap_1, y = umap_2),
               color = "black",
               alpha = 0.9) +
    geom_point(data = centroids,
               aes(x = umap_1, y = umap_2),
               color = "yellow",
               size = 3)
  return(p)
}
MakeGif <- function(plot_list, fps = 5, outfile) {
  # Set up a graphics device to capture the plots.
  # You can adjust width, height, and resolution as needed.
  img_list <- image_graph(width = 800, height = 600, res = 96)
  # Loop over your list and print each plot.
  for (plot in plot_list) {
    print(plot)
  }
  # Turn off the graphics device to record the plots.
  dev.off()
  # Animate the captured images (adjust fps for desired speed).
  animation <- image_animate(img_list, fps = fps)
  # Write the GIF to a raw vector in memory (without saving to disk)
  gif_raw <- image_write(animation, format = "gif")
  # Encode the raw GIF data into a Base64 string
  gif_base64 <- base64encode(gif_raw)
  # Output an HTML image tag with the embedded Base64 GIF
  cat(sprintf('<img src="data:image/gif;base64,%s">', gif_base64))
  # Save the animated GIF
  image_write(animation, path = outfile)
}
Ok, everything above is the code we already wrote, packaged into some nice functions to streamline everything. Now let’s get started with our comparisons.
First, resolution of 0.5 as our positive control.
setwd(here::here("output", "seurat"))
cents <- MakeCentroidList(seurat_obj = cells, num_iter = 20, clust_res = 0.5, dimr = dimr)
cent_plots <- lapply(cents, function(i) {
  CentPlot(dimr = dimr, centroids = i)
})
MakeGif(plot_list = cent_plots, fps = 5, outfile = "clust_res_0.5.gif")
Now we go lower.
setwd(here::here("output", "seurat"))
cents <- MakeCentroidList(seurat_obj = cells, num_iter = 20, clust_res = 0.1, dimr = dimr)
cent_plots <- lapply(cents, function(i) {
  CentPlot(dimr = dimr, centroids = i)
})
MakeGif(plot_list = cent_plots, fps = 5, outfile = "clust_res_0.1.gif")
Now we go higher.
setwd(here::here("output", "seurat"))
cents <- MakeCentroidList(seurat_obj = cells, num_iter = 20, clust_res = 2, dimr = dimr)
cent_plots <- lapply(cents, function(i) {
  CentPlot(dimr = dimr, centroids = i)
})
MakeGif(plot_list = cent_plots, fps = 5, outfile = "clust_res_2.gif")
And now we go much higher.
setwd(here::here("output", "seurat"))
cents <- MakeCentroidList(seurat_obj = cells, num_iter = 20, clust_res = 10, dimr = dimr)
cent_plots <- lapply(cents, function(i) {
  CentPlot(dimr = dimr, centroids = i)
})
MakeGif(plot_list = cent_plots, fps = 5, outfile = "clust_res_10.gif")
What we observe in this analysis is that the CD4 T cell clusters are the least stable. This observation makes sense, in that T cells in general are harder to subset in scRNA-seq data, as compared to having, for example, a flow/mass cytometry panel with a number of T cell markers. We note that the B cell clusters and the monocyte clusters are more stable, at least by eye. A future direction is of course to make this a quantitative metric, as well as a visual one.
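One plausible route, sketched here as an aside rather than something we run: the Adjusted Rand Index (ARI) between pairs of runs, via mclust::adjustedRandIndex. This assumes the mclust package is installed, and that you keep the raw per-cell cluster labels from each iteration (e.g. the clust vector inside MakeCentroidList, which the code above does not return). ARI near 1 means two runs produced nearly identical partitions.
# install.packages("mclust")
# Sketch: label_list is assumed to be a list of per-cell cluster label
# vectors, one per iteration (not something the code above returns).
PairwiseARI <- function(label_list) {
  pairs <- utils::combn(length(label_list), 2) # all pairs of runs
  apply(pairs, 2, function(p) {
    mclust::adjustedRandIndex(label_list[[p[1]]], label_list[[p[2]]])
  })
}
# Example: mean(PairwiseARI(label_list)) as a single stability score.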
If we are dealing with a novel dataset, having cluster stability as a metric will allow the user to determine how “sure” the clustering algorithm is that a particular population is what it is, with the borders it has. This is particularly important when you get to fine-grained clusterings that sit within “islands” of phenotypic space (as approximated by tools like UMAP, imperfect as they are).
We see that the two CD4 T cell clusters are in the same “island” and, while less stable, do cluster in a similar manner across multiple runs. This suggests that there is a level of intra-island organization that scRNA-seq can unearth, but not necessarily something that a human can uncover with a simple UMAP colored by cluster (something you more or less have to show in publications these days). Thus, cluster stability becomes a valuable way to get at any sort of intra-island organization.
We have already noted that this metric is not quantitative, but also that there are potential ways to make it quantitative. One of these is information theory: Shannon entropy is by definition a measure of expected “surprisal.” So we can start to simply ask how predictable it is that we will have a cluster centroid in a particular region across multiple runs.
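Here is a minimal sketch of what that could look like, under our own assumptions (the grid size and the metric itself are our choices, not an established standard): bin every centroid from every run onto a coarse grid over the UMAP, then compute the Shannon entropy of the bin occupancy. If centroids land in the same few bins run after run, entropy is low (predictable, stable); if they scatter, it is high. Note that more clusters occupy more bins, so this is best compared between runs at matched resolutions.
# Sketch: entropy of centroid placement across runs. cent_list is the list
# of centroid tibbles returned by MakeCentroidList; n_bins is an assumption.
CentroidEntropy <- function(cent_list, n_bins = 20) {
  all_cents <- dplyr::bind_rows(cent_list)
  x_bin <- cut(all_cents$umap_1, breaks = n_bins, labels = FALSE)
  y_bin <- cut(all_cents$umap_2, breaks = n_bins, labels = FALSE)
  p <- table(paste(x_bin, y_bin, sep = "_")) / nrow(all_cents)
  -sum(p * log2(p)) # Shannon entropy in bits; lower = more predictable
}
# Example: CentroidEntropy(cents), repeated for each resolution.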
Taken together, cluster stability is a good way to evaluate both novel datasets and novel clustering algorithms in terms of the “sureness” of cell subsets. Given that a lot of single-cell analysis, from flow/mass cytometry to single-cell sequencing (including spatial for both), relies on proper cell subsetting, cluster stability is a valuable tool across many modern bioinformatic workflows.