Running UMAP on scRNA seq data without PCA reveals importance of cosine distnace in high dimensional data

Introduction

What would happen if we ran the flagship PBMC 3k dataset through Seurat, but we skipped the PCA step entirely, and rather just clusterd and did our UMAP on the top 2000 most variable genes? I started out as a wet-lab biologist before I got into bioinformatics, so I’m always happy to do a knockout experiment.

Let’s get started with our control: the PBMC 3k dataset, run normally.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Seurat)

## Loading required package: SeuratObject
## Loading required package: sp
## 
## Attaching package: 'SeuratObject'
## 
## The following objects are masked from 'package:base':
## 
##     intersect, t

library(SeuratData)

## ── Installed datasets ──────────────────────────────── SeuratData v0.2.2.9001 ──
## ✔ cbmc     3.1.4                        ✔ pbmc3k   3.1.4
## ✔ ifnb     3.0.0                        ✔ stxBrain 0.1.1
## ✔ panc8    3.0.2                        
## 
## ────────────────────────────────────── Key ─────────────────────────────────────
## ✔ Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## ❓ Unknown version of Seurat installed

set.seed(42)

In another markdown, we manually annotated the PBMC 3k dataset using Seurat’s guided clustering tutorial plus some additional code on my end to make the annotation process a bit easier. You can find that here. So we read ths in now.

ctl <- readr::read_rds("annotated_pbmc3k.rds")

We have already run UMAP on it, and we do our visualization as a sanity check below.

DimPlot(ctl)

From here, we are going to omit PCA from the workflow, and run UMAP again. In other words, UMAP will be run on the top 2000 most variable genes, rather than the first ten principal components (the default from Seurat, for the PBMC 3k dataset).

Omit PCA and run UMAP within Seurat

Ok, and now we omit PCA and we run again. You do that using the “features” parameter.

ko <- RunUMAP(ctl, features = VariableFeatures(ctl), dims = NULL, graph = NULL, reduction = NULL)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 09:38:15 UMAP embedding parameters a = 0.9922 b = 1.112

## 09:38:15 Read 2700 rows and found 2000 numeric columns

## 09:38:15 Using Annoy for neighbor search, n_neighbors = 30

## 09:38:15 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 09:38:15 Writing NN index file to temp file /var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T//RtmpEQgSx7/file70665b996769
## 09:38:15 Searching Annoy index using 1 thread, search_k = 3000
## 09:38:24 Annoy recall = 100%
## 09:38:25 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 09:38:25 Initializing from normalized Laplacian + noise (using RSpectra)
## 09:38:25 Commencing optimization for 500 epochs, with 141974 positive edges
## 09:38:28 Optimization finished

DimPlot(ko)

It’s quite well resolved, given we are feeding it 2000 dimensions rather than 10. We note that the default metric is cosine distance, which of course is good for high dimensions. Let’s replicate this again with Euclidean distance.

ko <- RunUMAP(ctl, features = VariableFeatures(ctl), dims = NULL, graph = NULL, reduction = NULL, metric = "euclidean")

## 09:38:29 UMAP embedding parameters a = 0.9922 b = 1.112

## 09:38:29 Read 2700 rows and found 2000 numeric columns

## 09:38:29 Using FNN for neighbor search, n_neighbors = 30

## 09:38:40 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 09:38:40 Initializing from normalized Laplacian + noise (using RSpectra)
## 09:38:40 Commencing optimization for 500 epochs, with 146072 positive edges
## 09:38:44 Optimization finished

DimPlot(ko)

It’s different. Things are a bit farther away from each other in UMAP space. The populations in the T/NK compartment are a bit mixed, though still somewhat resolved. The myeloid island is well-resolved, and the B cells form their own separate island as before, when UMAP was run on the top 10 PCs.

Omit PCA and run UMAP using the umap package

We don’t know if Seurat is doing extra stuff under the hood. So let’s run it outside of Seurat and see what kinds of results we get accordingly.

Below, we run the UMAP on our own. We replicate the conditions from Seurat.

library(umap)

# Create a custom dimensionality reduction matrix using scaled data
# (This replaces PCA in the pipeline)
ko_scale <- as.matrix(ctl[["RNA"]]$scale.data[VariableFeatures(ctl), ]) %>% t() %>% as_data_frame()

## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
##   tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ko_scale

## # A tibble: 2,700 × 2,000
##      PPBP S100A9  IGLL5     LYZ   GNLY     FTL    PF4   FTH1 FCER1A  GNG11
##     <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 -0.143 -0.645 -0.184 -0.110  -0.407  0.0639 -0.110 -1.30  -0.136 -0.105
##  2 -0.143 -0.645 -0.184  0.0619 -0.407 -0.596  -0.110 -0.134 -0.136 -0.105
##  3 -0.143 -0.645 -0.184  0.0792  0.759 -0.434  -0.110 -0.127 -0.136 -0.105
##  4  2.91   1.47  -0.184  1.41   -0.407  1.42   -0.110  1.90  -0.136 -0.105
##  5 -0.143 -0.645 -0.184 -0.973   2.41   1.06   -0.110 -0.239 -0.136 -0.105
##  6 -0.143 -0.645 -0.184 -0.0624 -0.407  0.0826 -0.110 -0.892 -0.136 -0.105
##  7 -0.143 -0.645 -0.184 -0.0647  2.19  -0.508  -0.110 -0.896 -0.136 -0.105
##  8 -0.143 -0.645 -0.184 -0.0813 -0.407 -0.536  -0.110 -1.76  -0.136 -0.105
##  9 -0.143 -0.645 -0.184 -0.973   1.37  -2.94   -0.110  0.255 -0.136 -0.105
## 10 -0.143 -0.645 -0.184  0.787  -0.407  1.65   -0.110  2.09  -0.136 -0.105
## # ℹ 2,690 more rows
## # ℹ 1,990 more variables: S100A8 <dbl>, `HLA-DRA` <dbl>, CD74 <dbl>, CLU <dbl>,
## #   GZMB <dbl>, NKG7 <dbl>, C1QA <dbl>, CST3 <dbl>, CCL4 <dbl>,
## #   `HLA-DPB1` <dbl>, SDPR <dbl>, C1QB <dbl>, AL928768.3 <dbl>, TYMS <dbl>,
## #   TUBB1 <dbl>, RRM2 <dbl>, GUSB <dbl>, STMN1 <dbl>, MZB1 <dbl>,
## #   C10orf32 <dbl>, GP9 <dbl>, IGJ <dbl>, LYPD2 <dbl>, HAGH <dbl>, TK1 <dbl>,
## #   FUS <dbl>, DSCR3 <dbl>, LYAR <dbl>, SLA <dbl>, SMIM7 <dbl>, KIF5B <dbl>, …

# Run UMAP using the custom embedding
set.seed(123)
ko_umap <- umap::umap(
  ko_scale,
  n_neighbors = 30,
  metric = "cosine"
)$layout %>% 
  as.data.frame()

names(ko_umap) <- c("umap1", "umap2")

# Visualize UMAP results
toplot <- cbind(ko_umap, subset = ctl$seurat_annotations) %>% as_tibble()

toplot <- dplyr::filter(toplot, umap1 < 20) # get rid of weird outlier

ggplot(toplot, aes(x = umap1, y = umap2, color = subset)) +
  geom_point(alpha = 0.7) +  
  theme_minimal() +
  labs(title = "UMAP Visualization", x = "UMAP1", y = "UMAP2")

Not bad. Similar to the original issue within Seurat, the populations appear to be resolved, except for the CD4 and CD8 T cells.

As a final experiment, we run UMAP with Euclidean distance.

# Run UMAP using the custom embedding
set.seed(123)
ko_umap <- umap::umap(
  ko_scale,
  n_neighbors = 30,
  metric = "euclidean"
)$layout %>% 
  as.data.frame()

names(ko_umap) <- c("umap1", "umap2")

# Visualize UMAP results
toplot <- cbind(ko_umap, subset = ctl$seurat_annotations) %>% as_tibble()

toplot <- dplyr::filter(toplot, umap1 < 20) # get rid of weird outlier

ggplot(toplot, aes(x = umap1, y = umap2, color = subset)) +
  geom_point(alpha = 0.7) +  
  theme_minimal() +
  labs(title = "UMAP Visualization", x = "UMAP1", y = "UMAP2")

A bit of a surprise in comparison to what we saw with Seurat’s built in UMAP.

Here we note that the B cells have merged back in. We have basically a myeloid and lymphoid island. The monocytes are relatively resolved but the lymphocytes not. It is not clear as to why this looks so bad compared to the one with Seurat. But the results suggest that despite me setting so much stuff to NULL (to try to get rid of any extra bells and whistles Seurat might be doing), Seurat is nonetheless doing additional stuff under the hood.

Omit PCA and run UMAP using the uwot package (similar to Seurat)

Let’s try using the uwot package and see if this resolves the issues, given that this is what Seurat is using under the hood. In other words, will uwot’s results look more like Seurat’s results or that of the umap package?

library(uwot)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## 
## Attaching package: 'uwot'

## The following object is masked from 'package:umap':
## 
##     umap

# 2. For reproducibility, set a seed:
set.seed(123)

# 3. Run UMAP via uwot with parameters that approximate Seurat's defaults
# (values can vary slightly by Seurat version, but these are typical):
ko_umap_uwot <- uwot::umap(
  X = ko_scale,
  n_neighbors     = 30,
  metric          = "cosine",
  n_components    = 2,      # 2D UMAP
  fast_sgd        = TRUE,   # usually set by Seurat
  n_threads       = 1       # set to number of cores you want or 1 for reproducibility
)

# 4. Convert the result into a data frame for plotting
colnames(ko_umap_uwot) <- c("umap1", "umap2")
toplot <- data.frame(
  umap1  = ko_umap_uwot[, 1],
  umap2  = ko_umap_uwot[, 2],
  subset  = ctl$seurat_annotations
)

# (Optional) Remove outlier if desired
#df_umap <- df_umap %>% filter(UMAP_2 < 30)

# 5. Plot the UMAP
ggplot(toplot, aes(x = umap1, y = umap2, color = subset)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "UMAP via uwot (Seurat-like defaults), cosine distance",
       x = "UMAP 1", y = "UMAP 2")

We see that the islands are relatively well resolved, with the issue being primarily the CD4 T cells again.

Now we do Euclidean:

library(uwot)


# 2. For reproducibility, set a seed:
set.seed(123)

# 3. Run UMAP via uwot with parameters that approximate Seurat's defaults
# (values can vary slightly by Seurat version, but these are typical):
ko_umap_uwot <- uwot::umap(
  X = ko_scale,
  n_neighbors     = 30,
  metric          = "Euclidean",
  n_components    = 2,      # 2D UMAP
  fast_sgd        = TRUE,   # usually set by Seurat
  n_threads       = 1       # set to number of cores you want or 1 for reproducibility
)

# 4. Convert the result into a data frame for plotting
colnames(ko_umap_uwot) <- c("umap1", "umap2")
toplot <- data.frame(
  umap1  = ko_umap_uwot[, 1],
  umap2  = ko_umap_uwot[, 2],
  subset  = ctl$seurat_annotations
)

# (Optional) Remove outlier if desired
#df_umap <- df_umap %>% filter(UMAP_2 < 30)

# 5. Plot the UMAP
ggplot(toplot, aes(x = umap1, y = umap2, color = subset)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "UMAP via uwot (Seurat-like defaults), euclidean distance",
       x = "UMAP 1", y = "UMAP 2")

We are similar to that of the umap package, and not the Seurat package. This is good to know. So we have agreement between umap and uwot packages, suggesting further that Seurat has something going on under the hood.

But either way, we get a sense of the curse of dimensionality. That UMAP does a better job resolving principal components than 2000 dimensions, but importantly cosine distance does a better job than Euclidean distance when you have these 2000 dimensions as input.

Let’s see if that holds for PCA space again. In other words, when we’re looking at ten dimensions again, will Euclidean distance lead to dramatically poorer UMAP performance in comparison to cosine distance?

Bring PCA back in and run UMAP using the umap package

Ok, and the last thing we will do is a control, whereby we toggle the Euclidean and Cosine distances within PCA. We will use the default UMAP package again, now that we’ve got agreement between the umap and uwot packages.

# 1. Choose the number of PCs to use (e.g., the top 20)
num_pc <- 10

# 2. Extract the PCA embeddings from your Seurat object
#    (Make sure you actually have run PCA on 'ctl' already)
pca_embed <- Embeddings(ctl, "pca")[, 1:num_pc]
as_tibble(pca_embed)

## # A tibble: 2,700 × 10
##       PC_1    PC_2    PC_3   PC_4   PC_5   PC_6   PC_7    PC_8    PC_9  PC_10
##      <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
##  1   4.61  -0.604  -0.605  -1.72  -0.744  1.29   1.05  -0.952   0.969   0.878
##  2   0.167  4.54    6.45    6.86  -0.801  0.503 -2.56  -1.68   -0.101   1.71 
##  3   2.65  -4.01   -0.372  -0.996 -4.98  -0.251 -1.52   0.0246 -3.77    1.77 
##  4 -11.9    0.0634  0.623  -0.243  0.292 -1.51   3.39   0.145   0.540  -0.469
##  5   3.05  -6.00    0.823   2.05   8.25   1.15   0.749  3.06    0.229   1.68 
##  6   2.68   1.37   -0.587  -2.21  -2.53   0.260 -1.46   0.299  -0.0938 -0.657
##  7   5.09  -2.41   -0.171  -1.87   1.19   0.471 -0.833 -3.16    0.244  -0.835
##  8   4.41  -3.34   -0.289  -0.659 -0.828 -0.782  0.574 -3.59   -1.09    0.276
##  9   5.46   1.06   -0.0264 -0.271  4.14   0.597  3.07  -3.76    0.952  -1.08 
## 10  -6.57   1.25   -1.53   -2.25   1.42  -5.18   4.74   1.11    1.69    2.46 
## # ℹ 2,690 more rows

# 3. Run UMAP using the top N PCs as input
#    (Here we specify Euclidean distance; adjust as needed)
set.seed(123)  # for reproducible layout
ko_umap_pcs <- umap::umap(
  pca_embed,
  n_neighbors = 30,
  metric = "cosine"
)$layout %>%
  as.data.frame()

# 4. Rename the UMAP columns for clarity
colnames(ko_umap_pcs) <- c("umap1", "umap2")

# 5. Create a plotting data frame
#    (Add your cell annotations from the Seurat metadata)
toplot <- cbind(
  ko_umap_pcs,
  subset = ctl$seurat_annotations
) %>%
  as_tibble()

# 6. (Optional) Remove any outliers if needed
toplot <- dplyr::filter(toplot, umap1 < 20)

# 7. Plot the UMAP
ggplot(toplot, aes(x = umap1, y = umap2, color = subset)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(
    title = paste0("UMAP of Top ", num_pc, " PCs"),
    x = "UMAP1",
    y = "UMAP2"
  )

Note again that when PC space is used as input, the CD4 T cells are much better resolved, along with the remaining populations.

And again with Euclidean distance.

# 3. Run UMAP using the top N PCs as input
#    (Here we specify Euclidean distance; adjust as needed)
set.seed(123)  # for reproducible layout
ko_umap_pcs <- umap::umap(
  pca_embed,
  n_neighbors = 30,
  metric = "euclidean"
)$layout %>%
  as.data.frame()

# 4. Rename the UMAP columns for clarity
colnames(ko_umap_pcs) <- c("umap1", "umap2")

# 5. Create a plotting data frame
#    (Add your cell annotations from the Seurat metadata)
toplot <- cbind(
  ko_umap_pcs,
  subset = ctl$seurat_annotations
) %>%
  as_tibble()

# 6. (Optional) Remove any outliers if needed
toplot <- dplyr::filter(toplot, umap1 < 20)

# 7. Plot the UMAP
ggplot(toplot, aes(x = umap1, y = umap2, color = subset)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(
    title = paste0("UMAP of Top ", num_pc, " PCs"),
    x = "UMAP1",
    y = "UMAP2"
  )

Where we can see that there is no dramatic difference between cosine and Euclidean distance this time around. This includes the CD4 T cells. This is direct evidence that the use of cosine distance shows its benefits in higher dimensions (thousands) in compared to lower dimensions (tens).

Discussion

In this markdown, we initially showed that if you omit PCA from the standard single-cell sequencing workflow, that UMAP cannot resolve cell subsets as well. This is seen especially in the T cells.

We later found that there is a dramatic difference between running UMAP using Euclidean distance vs cosine distance, that only really shows itself when you are dealing with the 2000 dimensional space that we get when we omit PCA. It is uncertain at what point this difference between cosine and euclidean distance starts to show itself, and further experimentation is needed. It may also depend on the type of data.

But this suggests that cosine as a default is a good idea. In other words, when you run UMAP, check to see what the default distance metric is and change it to cosine if you can.

We found that on the technical side, there was inconsistency between Seurat’s UMAP and the UMAP run in both the umap and uwot package. Specifically, we did not see the dramatic difference between Euclidean and cosine distance when we ran UMAP within the Seurat package. Seurat’s code base is rather complicated, so we decided, after confirming consistency between the umap and uwot packages, to go with the umap package for the final experiment, where we were running cosine vs Euclidean distance based umap on PCA space.

Overall, the primary action that comes out of this markdown is to default to cosine distance if you can, but note that if you’re dealing with lower dimensional data (in the tens) then Euclidean distance is more likely to be ok: for a lot of the CyTOF work I did in the 2010s, Euclidean distance was the default (with the exception of SPADE, which was Manhattan distance, and X-Shift, which was cosine distance). CyTOF data ranges between 20 to 50 dimensions, to give the reader some context.

This said, even if you are dealing with lower dimensional data and you want to use Euclidean distance for whatever reason, I would run your UMAP with Euclidean distance and cosine distance side by side just to be sure they don’t give you differing performance. You never know.