
Universal Cell Embeddings (UCE) takes a single-cell dataset as input and produces a fixed-size, high-dimensional embedding of each cell as output. These embeddings are meaningful in the sense that if we throw in two PBMC datasets, for example, the shared populations will line up with each other.

But how well does UCE preserve the neighborhoods of the original data? By "original data" we mean the nearest neighbors computed on the principal components, since the PCs are typically what feed into clustering and UMAP in a best-practices single-cell analysis pipeline.
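Concretely, the metric used throughout is, for each cell, the fraction of its k nearest neighbors that are shared between the two spaces. Here is a minimal sketch of that computation, given two cells-by-k matrices of neighbor indices (the full RANN-based version appears later):

# Per-cell neighborhood preservation: |NN_a(i) ∩ NN_b(i)| / k,
# where nn_a and nn_b are cells-by-k matrices of neighbor indices
neighborhood_overlap <- function(nn_a, nn_b, k) {
  sapply(seq_len(nrow(nn_a)), function(i) length(intersect(nn_a[i, ], nn_b[i, ])) / k)
}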

Data preparation

We have the PBMC 3k data and its UCE embedding. Let’s first load the PBMC 3k dataset.

library(tidyverse)
library(Seurat)
library(SeuratData)
library(here)
set.seed(42)
pbmc <- SeuratData::LoadData("pbmc3k")
## Warning: Assay RNA changing from Assay to Assay
## Warning: Assay RNA changing from Assay to Assay5
pbmc
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)
##  2 layers present: counts, data

Ok, now we load the UCE embedding, which we generated earlier (using the 33-layer model, the largest available) and converted from h5ad to CSV in a separate Python script.

uce <- read.csv(here("data", "pbmc3k_uce.csv"))

# Strip the "-1" suffix so the barcodes match the Seurat object's cell names
rownames(uce) <- uce$index %>% gsub("-1", "", .)
uce$index <- NULL

Ok, now we have everything we need. Let’s get started. First, we run the PBMC 3k dataset through the standard workup from Seurat’s guided clustering tutorial.

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
pbmc <- ScaleData(pbmc, features = rownames(pbmc))
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
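As a quick aside, if you want to eyeball how much structure those components carry before committing to 50 of them, Seurat’s ElbowPlot is one way to check:

# Standard deviation of each PC; RunPCA computes 50 by default
ElbowPlot(pbmc, ndims = 50)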

Great. Now let’s pull out the top 50 principal components. These are the data from which we will compute the nearest neighbors.

pc <- pbmc@reductions$pca@cell.embeddings
pc[1:5, 1:5]
##                       PC_1        PC_2       PC_3       PC_4       PC_5
## AAACATACAACCAC   4.6060466 -0.60371951 -0.6052429 -1.7231935 -0.7443433
## AAACATTGAGCTAC   0.1670809  4.54421712  6.4518867  6.8597974 -0.8011412
## AAACATTGATCAGC   2.6455614 -4.00971883 -0.3723479 -0.9960236 -4.9837032
## AAACCGTGCTTCCG -11.8569587  0.06340912  0.6226992 -0.2431955  0.2919980
## AAACCGTGTATGCG   3.0531940 -6.00216498  0.8234015  2.0463393  8.2465179
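As an aside, Seurat also provides an accessor for this, which avoids reaching into the object’s slots directly:

# Equivalent to the slot access above
pc <- Embeddings(pbmc, reduction = "pca")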

Do the row names line up? This is important, just in case the indexing got mixed up somehow when I was running UCE. You never know.

all.equal(rownames(uce), rownames(pc))
## [1] TRUE

So it’s fine. That’s good.
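If this ever gets rerun as a script, a check that fails loudly might be preferable:

# Hard stop if the barcodes ever get out of sync
stopifnot(identical(rownames(uce), rownames(pc)))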

Optional: one last preparation step. To control for the fact that we are comparing against a PCA of the original data, we apply the same treatment to the UCE embedding and keep its top 50 principal components.

# Optional: project the UCE embedding onto its own top 50 PCs
# (comment this out to compare against the raw embedding instead)
uce_pca <- prcomp(uce)
uce <- as.data.frame(uce_pca$x[, 1:50])
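As a quick check on how much of the embedding’s structure this truncation retains, we can look at the variance explained, using the uce_pca object from the block above:

# Fraction of total variance captured by the first 50 PCs of the UCE embedding
sum(uce_pca$sdev[1:50]^2) / sum(uce_pca$sdev^2)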

Nearest neighbor exploration, single k

Ok, now we take the k-nearest neighbors (KNN) in each space. We will start with an arbitrary value of k and build from there; for now I just want a rough idea of the percentage overlap. Remember, if we come out near zero, one possibility is that the cells were not indexed the same way between the Seurat and Scanpy environments.

library(RANN)

num_nn <- 300
# Ask for one extra neighbor and drop the first column, which is each point itself
nn_pc <- RANN::nn2(data = pc, k = num_nn + 1)$nn.idx[, -1]
nn_uce <- RANN::nn2(data = uce, k = num_nn + 1)$nn.idx[, -1]

And now we do the comparisons.

nn_sim <- lapply(seq(nrow(nn_pc)), function(i) {
  curr_pc <- nn_pc[i,]
  curr_uce <- nn_uce[i,]
  result <- length(intersect(curr_uce, curr_pc))/num_nn
  return(result)
}) %>% unlist()

And if we look at it.

hist(nn_sim)

And the answer is…for most cells, only a small fraction of nearest neighbors are shared. This is a bit of a shock. It suggests that the local neighborhoods get substantially jumbled between the two spaces.
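For context, it helps to know what chance level looks like. If the two neighbor sets were independent random draws of size k from the same cells, the expected overlap fraction would be roughly k divided by the number of cells:

# Chance-level overlap: two independent random k-subsets of n cells share
# about k/n of their members in expectation
num_nn / nrow(pc) # ~0.11 for k = 300 and 2700 cells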

Let’s visualize the similarities on a UMAP. This might help.

pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
dimr <- pbmc@reductions$umap@cell.embeddings %>% as_tibble()
names(dimr) <- c("umap1", "umap2")

Plotting our UMAP colored by nearest neighbor similarities, we get:

cytobank_colors <- colorRampPalette(c("blue", "cyan", "green", "yellow", "orange", "red"))
ggplot(dimr, aes(x = umap1, y = umap2, color = nn_sim)) + geom_point() + scale_color_gradientn(colors = cytobank_colors(100))

Not great. But it does seem like more of the signal is preserved in the monocytes (right island). We can also do a quick UMAP of the UCE space itself.

library(umap)
dimr_uce <- umap::umap(uce)$layout %>% as_tibble(.name_repair = "minimal")
names(dimr_uce) <- c("umap1", "umap2")

ggplot(dimr_uce, aes(x = umap1, y = umap2, color = nn_sim)) + geom_point() + scale_color_gradientn(colors = cytobank_colors(100))

We note that the subsets again appear to be separated into three high-level islands, as before, but the UMAP does not look as clean. We will keep this in mind for now, as we want to move on to more quantitative metrics.

This output suggests that UCE takes a sizable hit in within-dataset neighborhood quality, while perhaps doing better across datasets. That is an interesting finding.

Nearest neighbor exploration, multiple k

Anyway, let’s do a titration curve. We will repeat the previous section’s computation across a range of values for the number of nearest neighbors.

nn_vec <- c(10, 20, 50, 100, 200, 500, 1000, 2000)

nn_list <- lapply(nn_vec, function(n) {
  nn_pc <- RANN::nn2(data = pc, k = n + 1)$nn.idx[,-1]
  nn_uce <- RANN::nn2(data = uce, k = n + 1)$nn.idx[,-1]
  nn_sim <- lapply(seq(nrow(nn_pc)), function(i) {
    curr_pc <- nn_pc[i,]
    curr_uce <- nn_uce[i,]
    result <- length(intersect(curr_uce, curr_pc))/n
    return(result)
  }) %>% unlist()
  return(nn_sim)
})
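A side note on efficiency: nn2 returns neighbors sorted by distance, so the neighbor lists are nested across values of k. One could therefore query once at the largest k and slice prefixes, something like the sketch below:

# Sketch of a faster equivalent: one neighbor query at the largest k,
# then prefix-slice for each smaller k (nn2 output is distance-sorted)
max_k <- max(nn_vec)
nn_pc_all <- RANN::nn2(data = pc, k = max_k + 1)$nn.idx[, -1]
nn_uce_all <- RANN::nn2(data = uce, k = max_k + 1)$nn.idx[, -1]
nn_list_fast <- lapply(nn_vec, function(n) {
  sapply(seq_len(nrow(nn_pc_all)), function(i) {
    length(intersect(nn_pc_all[i, 1:n], nn_uce_all[i, 1:n])) / n
  })
})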

nn_dat <- do.call(cbind, nn_list) %>% as_tibble()
names(nn_dat) <- as.character(nn_vec)

And we need a control of course. So we will permute the rows of the principal components and run the same block of code again.

pc_permute <- pc[sample(nrow(pc)), ]

nn_permute <- lapply(nn_vec, function(n) {
  nn_pc <- RANN::nn2(data = pc_permute, k = n + 1)$nn.idx[,-1]
  nn_uce <- RANN::nn2(data = uce, k = n + 1)$nn.idx[,-1]
  nn_sim <- lapply(seq(nrow(nn_pc)), function(i) {
    curr_pc <- nn_pc[i,]
    curr_uce <- nn_uce[i,]
    result <- length(intersect(curr_uce, curr_pc))/n
    return(result)
  }) %>% unlist()
  return(nn_sim)
})

nn_dat_permute <- do.call(cbind, nn_permute) %>% as_tibble()
names(nn_dat_permute) <- as.character(nn_vec)
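As a sanity check on this control, the expected similarity for unrelated neighbor sets is again roughly k over the number of cells, so the permuted curve should track these values:

# Expected chance-level similarity at each k
tibble(nn = nn_vec, expected = nn_vec / nrow(pc))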

Now we will bring in a plotting function from some previous nearest-neighbor work I did.

#' @title Make plot
#' @description Wrapper around the plotting function to handle multiple
#' settings
#' @param dat tibble of nn performance, with columns method, nn, mean, sd. Note
#' that nn could also be the special case of "distant neighbors"
#' @param to_log10 boolean indicating whether to log10 scale
#' @param error_bars boolean indicating whether to print error bars corresponding
#' to the standard deviation
#' @param plot_title either NULL or the name of the title of the plot to be used
#' @return ggplot object to print or save
MakeLinePlot <- function(dat, to_log10 = TRUE, error_bars = FALSE, plot_title = NULL) {
  
  p <- ggplot(dat, aes(x = nn, y = mean, group = method, color = method)) + geom_line() +
    geom_point() + 
    lims(y = c(0, 1)) + 
    labs(x = "neighbors", y = "mean similarity") + 
    theme(text = element_text(size = 20))
  
  if(!is.null(plot_title)) {
    p <- p + ggtitle(plot_title)
  }
  
  if(to_log10) {
    p <- p + scale_x_log10()
  }
  
  if(error_bars) {
    p <- p + geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=.2,
                           position=position_dodge(0.05))  
  }
  
  return(p)
}

We prepare the data for the above function below. Note that we are including both the experimental data and the permuted data (which we label as “random”).

nn_summary <- tibble(method = "UCE", nn = nn_vec, mean = apply(nn_dat, 2, mean), sd = apply(nn_dat, 2, sd))
nn_permute_summary <- tibble(method = "Random", nn = nn_vec, mean = apply(nn_dat_permute, 2, mean), sd = apply(nn_dat_permute, 2, sd))

toplot <- rbind(nn_summary, nn_permute_summary)

And we plot it.

MakeLinePlot(dat = toplot, error_bars = TRUE)

We see that the KNN overlap between our PCA and UCE does indeed rise above the random lower bound, with a higher standard deviation as well, as denoted by the error bars. This suggests that UCE is preserving real neighborhood information, which is useful to know.

But where are the error bars coming from, exactly? To answer that, we are going to color each cell by its KNN similarity (one value per cell) on the UMAP, as we did in the previous section.

Accordingly, we will make a patchwork of UMAPs, one per value of k, each colored by the per-cell similarity at that k. Since we will need the same panel for the control below, we wrap the plotting code in a function.

# patchwork combines the plots (ggplot2 is already loaded via tidyverse)
library(patchwork)

# One UMAP per column of a per-cell similarity tibble, combined into a grid
MakeUmapGrid <- function(dimr, sim_dat) {
  data <- bind_cols(dimr, sim_dat)
  plot_list <- lapply(colnames(sim_dat), function(col_name) {
    ggplot(data, aes(x = umap1, y = umap2, color = .data[[col_name]])) +
      geom_point(size = 2, alpha = 0.8) +
      scale_color_gradientn(colors = cytobank_colors(100)) +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 10),   # Smaller title font
        axis.title = element_text(size = 8),    # Smaller axis title font
        axis.text = element_text(size = 6),     # Smaller axis tick font
        legend.title = element_text(size = 8),  # Smaller legend title font
        legend.text = element_text(size = 6)    # Smaller legend text font
      ) +
      guides(color = guide_colorbar(barwidth = 0.5, barheight = 4)) +
      labs(title = paste("UMAP colored by k =", col_name),
           x = "UMAP1", y = "UMAP2", color = "")
  })
  wrap_plots(plot_list, ncol = 3)  # Adjust ncol as needed
}

print(MakeUmapGrid(dimr, nn_dat))

We note that by k = 500 we have very high nearest neighbor preservation across the monocytes, while it’s still low for the B and T cells. For the T cells, preservation seems to peak around k = 1000. We also see a small population of myeloid cells (east island, bottom) that peaks around k = 10; I am guessing these are the platelets. Then we have another myeloid population just above it that peaks at k = 200; I am guessing these are the FCGR3A+ monocytes.

And now let’s do the same thing for the control.

# Same grid, but colored by the permuted control similarities
print(MakeUmapGrid(dimr, nn_dat_permute))

We can see there is no population-level variation: the control looks uniform across the map. So this gives us a nice sanity check that the population-level structure in the UCE comparison is real signal.

Discussion

What it looks like to me is that UCE preserves information at the cell subset level. The “peaks” you see in the UMAPs above (colored bright orange or red) appear where a given cell’s k is roughly the size of its cell subset. For example, if there is a small subset of monocytes numbering 200 cells in this dataset, then at k = 200 every cell in that subset will draw its neighbors from the same 200 cells in both UCE and the original data, and thus have a high KNN similarity.
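One hedged way to test this hypothesis is sketched below: cluster the cells with the tutorial defaults, then ask whether each cell’s best k (the k at which its similarity peaks) tracks the size of its cluster. The clustering parameters here are assumptions, not something tuned for this question.

# Sketch: does the k at which similarity peaks track cluster size?
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
best_k <- nn_vec[apply(nn_dat, 1, which.max)]
cluster_size <- as.numeric(table(Idents(pbmc))[as.character(Idents(pbmc))])
cor(best_k, cluster_size, method = "spearman")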

This would suggest that UCE is good for getting at cell subset-level information, but not necessarily at structure within subsets. Given that UCE was designed with multi-dataset integration in mind, judging from the way the paper reads, that is fine. But the work in this markdown raises questions about how well UCE will preserve trajectory information.

There are of course plenty of other questions around how well UCE preserves the original dataset’s neighbors compared to the embeddings of other foundation models, like Geneformer. Making these comparisons might provide some insight into what information these models are attempting to preserve. Accordingly, this is low-hanging fruit in terms of future directions.

Of course, looking at the k-nearest neighbors of concatenated datasets (e.g., PBMC 3k plus other PBMC datasets) might show that the UCE embedding has a layout in which the populations are already “integrated,” such that the KNN of a given T cell contains T cells from every dataset. Other work I have done (go here) seems to suggest this is the case. But follow-up work must be done before reaching any conclusions in this regard.
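For concreteness, here is a hedged sketch of that check. Note that uce_combined (a combined UCE embedding of several datasets) and batch (the dataset label for each cell) are hypothetical inputs, not objects built in this markdown:

# Hypothetical: fraction of each cell's neighbors that come from a
# different dataset; a high value suggests the populations are well mixed
nn_idx <- RANN::nn2(uce_combined, k = 31)$nn.idx[, -1] # 30 neighbors, drop self
cross_frac <- sapply(seq_len(nrow(nn_idx)), function(i) {
  mean(batch[nn_idx[i, ]] != batch[i])
})
summary(cross_frac)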

Overall, single-cell foundation models are growing in popularity and are likely to keep improving. Rigorous studies asking basic questions, such as how well nearest neighbors are preserved across these embeddings, are needed so we can maintain as clear an understanding as possible of what these models are actually outputting.