Cluster membership purity score for single cells

Home

I am not limited by this form. I am also the whole tree, and when I go back to the soil, I will continue to nourish the tree.

― Thich Nhat Hanh, Peace Is Every Step: The Path of Mindfulness in Everyday Life

Introduction

A backbone of any single-cell or spatial analysis pipeline is that of clustering. Here, you assign cells to specific discrete subsets that are in turn annotated based on differentially expressed features (be it genes or proteins).

However, this analysis has a built in assumption that we often gloss over: that the boundaries are final. That a CD4 T cell located right at the CD4/CD8 boundary in a clustering scheme is still a CD4 T cell. Maybe this is the case, maybe it is not. Either way, these boundary regions deserve scrutiny. How do we do that?

One simple way is to look at the local neighborhood of a given cell in terms of its cluster membership. If all of its neighbors are in the same cluster, then we can be more confident that it is what we say it is. If its local neighborhood consists of an even split of multiple clusters (spatially, think of the four corners region in the southwestern US), then that cell’s identity comes into question.

Here, we will take the flagship PBMC 3k dataset as an example. We will take each cell’s k-nearest neighbors (KNN) and look at the cell identity (pre-existing annotated cluster) of each neighborhood. We will then calculate a “purity” score using Shannon entropy, where the lower the score, the more pure the cluster. The higher the score, the more mixed the cluster. Importantly, we will then have a confidence metric around a cell’s “belongingness” to a cluster.

Data processing

First, we upload the PBMC 3k dataset. You can find this, annotated clusters and all, from the SeuratData package.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Seurat)

## Loading required package: SeuratObject
## Loading required package: sp
## 
## Attaching package: 'SeuratObject'
## 
## The following objects are masked from 'package:base':
## 
##     intersect, t

library(SeuratData)

## ── Installed datasets ──────────────────────────────── SeuratData v0.2.2.9002 ──
## ✔ cbmc     3.1.4                        ✔ pbmc3k   3.1.4
## ✔ hcabm40k 3.0.0                        ✔ pbmcref  1.0.0
## ✔ ifnb     3.0.0                        ✔ pbmcsca  3.0.0
## ✔ panc8    3.0.2                        ✔ stxBrain 0.1.1
## 
## ────────────────────────────────────── Key ─────────────────────────────────────
## ✔ Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## ❓ Unknown version of Seurat installed

cells <- SeuratData::LoadData("pbmc3k")

## Validating object structure

## Updating object slots

## Ensuring keys are in the proper structure

## Warning: Assay RNA changing from Assay to Assay

## Ensuring keys are in the proper structure

## Ensuring feature names don't have underscores or pipes

## Updating slots in RNA

## Validating object structure for Assay 'RNA'

## Object representation is consistent with the most current Seurat version

## Warning: Assay RNA changing from Assay to Assay5

cells

## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)
##  2 layers present: counts, data

Now let’s do the full workup to get the PCA, UMAP, and so forth. We note that the annotated clusters are built into the dataset itself, so we don’t need to do the clustering. I include the commands commented out just for the sake of showing what the full workflow would otherwise do. For further details, you can go to Seurat’s guided clustering tutorial here, or my version of that here, where I do each step independent of the use of Seurat so you can see the fundamentals under the hood.

cells <- NormalizeData(cells)

## Normalizing layer: counts

cells <- FindVariableFeatures(cells)

## Finding variable features for layer counts

cells <- ScaleData(cells)

## Centering and scaling data matrix

cells <- RunPCA(cells)

## PC_ 1 
## Positive:  MALAT1, LTB, IL32, CD2, ACAP1, STK17A, CTSW, CD247, CCL5, GIMAP5 
##     AQP3, GZMA, CST7, TRAF3IP3, MAL, HOPX, ITM2A, GZMK, MYC, BEX2 
##     GIMAP7, ETS1, LDLRAP1, ZAP70, LYAR, RIC3, TNFAIP8, KLRG1, SAMD3, NKG7 
## Negative:  CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, FTH1, S100A9, FCER1G 
##     TYMP, CFD, LGALS1, CTSS, S100A8, SERPINA1, LGALS2, SPI1, IFITM3, PSAP 
##     CFP, SAT1, IFI30, COTL1, S100A11, NPC2, LGALS3, GSTP1, PYCARD, NCF2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA, HLA-DQB1, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DPB1, HLA-DMA, HLA-DQA2, HLA-DRB5, HLA-DPA1, HLA-DMB, FCRLA, HVCN1, LTB, BLNK 
##     KIAA0125, P2RX5, IRF8, IGLL5, SWAP70, ARHGAP24, SMIM14, PPP1R14A, FCRL2, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, CTSW, GNLY, GZMH, SPON2 
##     CCL4, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX, CTSC 
##     TTC38, S100A4, ANXA1, IL32, IGFBP7, ID2, ACTB, XCL1, APOBEC3G, SAMD3 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, CD74, HLA-DPA1, MS4A1, HLA-DRB1, HLA-DRB5 
##     HLA-DRA, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, HVCN1, FCRLA, IRF8, BLNK 
##     KIAA0125, SMIM14, PLD4, IGLL5, P2RX5, TMSB10, SWAP70, LAT2, MALAT1, IGJ 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, CA2, PTCRA, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, RUFY1, SEPT5, MPP1, CMTM5, TSC22D1, MYL9, RP11-367G6.3, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79A, CD79B, HIST1H2AC, HLA-DQB1, PF4, MS4A1, SDPR, CD74, PPBP 
##     HLA-DPB1, GNG11, HLA-DQA2, SPARC, HLA-DRB1, HLA-DPA1, GP9, TCL1A, HLA-DRA, LINC00926 
##     NRGN, RGS18, HLA-DRB5, PTCRA, CD9, AP001189.4, CA2, CLU, TUBB1, ITGA2B 
## Negative:  VIM, S100A8, S100A6, S100A4, S100A9, TMSB10, IL32, GIMAP7, LGALS2, S100A10 
##     RBP7, FCN1, MAL, LYZ, S100A12, MS4A6A, CD2, FYB, S100A11, FOLR3 
##     GIMAP4, AQP3, ANXA1, AIF1, MALAT1, GIMAP5, IL8, IFI6, TRABD2A, TMSB4X 
## PC_ 5 
## Positive:  GZMB, FGFBP2, NKG7, GNLY, PRF1, CCL4, CST7, SPON2, GZMA, CLIC3 
##     GZMH, XCL2, CTSW, TTC38, AKR1C3, CCL5, IGFBP7, XCL1, CCL3, S100A8 
##     TYROBP, HOPX, CD160, HAVCR2, S100A9, FCER1G, PTGDR, LGALS2, RBP7, S100A12 
## Negative:  LTB, VIM, AQP3, PPA1, MAL, KIAA0101, CD2, CYTIP, CORO1B, FYB 
##     IL32, TRADD, ANXA5, TUBA1B, HN1, TYMS, PTGES3, ITM2A, COTL1, GPR183 
##     TNFAIP8, ACTG1, TRAF3IP3, ATP5C1, GIMAP4, ZWINT, PRDX1, LDLRAP1, ABRACL, NGFRAP1

# cells <- FindNeighbors(cells)
# cells <- FindClusters(cells, dims = 1:10)
cells <- RunUMAP(cells, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 10:28:48 UMAP embedding parameters a = 0.9922 b = 1.112

## 10:28:48 Read 2700 rows and found 10 numeric columns

## 10:28:48 Using Annoy for neighbor search, n_neighbors = 30

## 10:28:48 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 10:28:48 Writing NN index file to temp file /var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T//RtmpmsUucs/file31d169edbed9
## 10:28:48 Searching Annoy index using 1 thread, search_k = 3000
## 10:28:48 Annoy recall = 100%
## 10:28:48 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 10:28:49 Initializing from normalized Laplacian + noise (using RSpectra)
## 10:28:49 Commencing optimization for 500 epochs, with 107940 positive edges
## 10:28:52 Optimization finished

Now we take each cell’s k-nearest neighbors and store that. We are sticking with a low value of k, so we can accommodate the smallest subsets in the dataset.

library(RANN)
pcs <- cells@reductions$pca@cell.embeddings[,1:10]
nn <- RANN::nn2(pcs, k = 5)$nn.idx
head(nn) # column 1 is cell ID, column 2 is the first nearest neighbor, etc

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1 2512  740  531 1062
## [2,]    2   11 1225 2270 1535
## [3,]    3 1678 1188 1450 2368
## [4,]    4  812  983 1123 1377
## [5,]    5 1534  533  750 1697
## [6,]    6  771 1873 2177 2302

Now we pull out the annotated cluster ID from the Seurat object. As said before, we will use the built-in annotations.

clust <- cells$seurat_annotations

Now we will do a Shannon entropy calculation. Shannon entropy quantifies the average “surprise” when observing outcomes: more unpredictability yields higher entropy.

For example, a fair coin (50/50) has higher entropy than a biased coin (90/10), because its flips are less predictable.

In our single‐cell context, we compute entropy over each cell’s KNN. If all neighbors belong to the same cluster, entropy = 0 (no surprise). If neighbors are evenly spread across multiple clusters, entropy is maximized (maximum surprise).

So for the KNN in the PBMC 3k dataset, are they all the same cluster? As a side note, this is why we have to keep the k as small as the smallest cluster in the data. A larger k than the size of a cluster would lead to the false conclusion that this cluster is not pure.

Here is the Shannon entropy function:

#' Compute Shannon entropy of a vector
#'
#' @param x   A vector of categorical or discrete values
#' @param base The logarithm base (2 for bits, e for nats, 10 for bans)
#' @return Shannon entropy of x
ShannonEntropy <- function(x, base = 2) {
  # Drop missing values
  x <- x[!is.na(x)]
  # Frequency table
  freq <- table(x)
  # Probabilities
  p <- freq / sum(freq)
  # Shannon entropy
  H <- -sum(p * log(p, base = base))
  return(H)
}

And now we run it through our k-nearest neighbors table.

entropy <- apply(nn, 1, function(i) {
    curr <- clust[i] %>% as.character()
    return(ShannonEntropy(curr))
})

Now let’s visualize our UMAP in terms of that. The way we do that is by adding the per-cell entropy as a column in the Seurat object’s metadata.

cells@meta.data$entropy <- entropy

Results

And now we can use the built in plot functions, as done below. First, we look at the built in annotations themselves.

Seurat::DimPlot(cells, group.by = "seurat_annotations")

And now we color by our per-cell entropy.

FeaturePlot(cells, features = "entropy")

Looking at both of these, we can see “bands” of higher entropy in boundary regions, with the big example being between the CD4 T cell subsets. We note that T cells are known to be difficult to subset with single-cell sequencing data as compared to protein-based data (e.g. flow cytometry, CITE-seq). So this is expected. We also observe that we don’t see any spikes in entropy at the edges of “islands.” This suggests that the KNN (computed in the higher dimensional space, not the UMAP space) indeed reflect higher dimensional “islands” that are separate from each other.

Now, we will jump to the per-cluster level and look at the average per-cell entropy in a given cluster. This can give us a cluster purity score. We will visualize this in a bar plot.

BarPlotEntropy <- function(obj,
                           entropy_col = "entropy",
                           annot_col   = "seurat_annotations",
                           fill_colour = "steelblue") {
  stats <- obj[[]] %>%
    dplyr::group_by(.data[[annot_col]]) %>%
    dplyr::summarise(mean_ent = mean(.data[[entropy_col]], na.rm = TRUE),
                     se_ent   = sd(.data[[entropy_col]], na.rm = TRUE) /
                                sqrt(dplyr::n())) %>%
    dplyr::ungroup()
  
  ggplot(stats,
         aes(x = .data[[annot_col]],
             y = mean_ent)) +
    geom_col(fill = fill_colour, width = 0.7) +
    geom_errorbar(aes(ymin = mean_ent - se_ent,
                      ymax = mean_ent + se_ent),
                  width = 0.25) +
    labs(x = annot_col,
         y = "Mean neighbourhood entropy (bits)") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

BarPlotEntropy(cells)

We note that the least pure clusters are the T cell subsets. This is something that could be understood by looking at the UMAP. But we note that the Memory CD4 T cells seem to be the most problematic, and this is something that could NOT be understood by looking at the UMAP. This gives us a level of detail that was previously not visible.

We also note that the B cells have the highest purity. The UMAP has the B cells existing by themselves in a single island, so this does indeed make sense.

Discussion

While the PBMC 3k dataset is a flagship dataset, and PBMC-based single-cell datasets are quite common, this cluster purity work is likely quite useful in novel data types (novel organ from a novel organism), or data where trajectories are suspected. In the latter, a low cluster purity score could simply correspond to regions where the subsets are continuous developmental trajectories rather than discrete populations.

We also note that while this per-cell purity score is good for manual understanding of one’s data analysis, it might also be good for downstream algorithmic tools. For example, one could feed the purity score into supervised ML algorithms to lead to higher quality results down the line.

Another downstream algorithmic application might be the task of determining whether clusters should be merged or split. In non-trajectory data, it could be that a group of clusters with low purity scores right next to each other should be merged. On the other hand, it might be that a user splits clusters until the purity hits some “impure” threshold, in which the user stops splitting.

Taken together, clusters are not final, and should not be treated as such. We should get rid of this assumption, that is often baked into clustering tools, especially ones that tell you how many clusters there are by default. The boundary regions should be scrutinized per cell. Purity should be scrutinized per cluster. These fundamental shifts are doable with a simple k-nearest neighbors based analysis, as I have shown here, and could potentially be of great benefit to the single-cell and spatial communities.

Cluster membership purity score for single cells

Tyler J Burns, PhD

2025-05-12 - 2025-05-13

Introduction

Data processing

Results

Discussion