CyTOF mutual nearest neighbors experiment

Introduction

The purpose of this markdown is to explore what mutual nearest neighbors are in the context of CyTOF data. I have a fair amount of experience dealing with CyTOF data in terms of k-nearest neighbors, so when I came across the concept of mutual nearest neighbors in scRNA seq analysis, I became curious. So let’s see what we see.

Data preprocessing

First, we read in and process CyTOF data. The data analyzed in this markdown is healthy PBMCs from a Stanford HIMC multicenter study, and can be found here.

I’m going to breeze through the data processing part, but if you want a more in-depth look at this, please go here.

library(flowCore)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

set.seed(42)
setwd(here::here("data"))

ff <- flowCore::read.FCS('Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs')

## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Nd142Di' channel exceed its $PnR value 1509 and will be truncated!
##  To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'

## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Yb176Di' channel exceed its $PnR value 1349 and will be truncated!
##  To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'

ff

## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
##         name   desc     range  minRange  maxRange
## $P1  Ce140Di beads1      2911         0         7
## $P2  Ce142Di beads2      1509         0         1
## $P3  Er167Di   CD27       781         0        17
## $P4  Er170Di    CD3      1664         0        27
## $P5  Eu151Di  CD123      2814         0        16
## ...      ...    ...       ...       ...       ...
## $P26 Tm169Di CD45RA     12000         0        23
## $P27 Yb172Di   CD38      5629         0        18
## $P28 Yb174Di  HLADR      6990         0        11
## $P29 Yb176Di   CD56      1349         0        12
## $P30    Time   Time         0         0         1
## 261 keywords are stored in the 'description' slot

mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc

# All NA values in marker names will revert to mass tags names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params

Here, I’m just going to bring the data into tibble format and do away with the flow frame entirely.

dat <- flowCore::exprs(ff) %>% as_tibble()

dat <- asinh(dat/5)

And we’re going to pull out the surface markers we care about.

params <- ff@parameters@data$desc

# Filter out markers we're not using
to_remove <- c("CD45", "CD66", "CD61", "CD235")
to_remove <- which(params %in% to_remove)

# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))

surface <- c(cd, ig, hla, ccr)
surface <- surface[!(surface %in% to_remove)]

dat <- dat[surface]

Now we’re going to subsample for the sake of speed of processing.

dat <- dat[sample(nrow(dat), 20000),]
dim(dat)

## [1] 20000    15

Make the KNN graph

Ok, first we need to make a KNN graph of our data. This is typically done in the context of Phenograph clustering, under the hood. But we’re going to bring it to the forefront here.

Note that I have a BioConductor package that does this, but I want to refactor it with some of the new methods and modern contexts, so I’ll show you the guts of everything instead.

Whatever we set our K to, note that “what is the k” is a heated question for this kind of analysis. The answer: it depends.

library(RANN)
nn <- RANN::nn2(dat, k = 20)

We have:

nn$nn.idx[1:10, 1:10]

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
##  [1,]    1  7965 11014 12069  4562 10890 13380 14545 19939  1718
##  [2,]    2 10871  2629   656 14074 16519  4064  6797 18292  5079
##  [3,]    3  2750 16884  1967  6403 17746   296 13159 16755  3300
##  [4,]    4 11963 19293 15578 18619  8802 18189  1528   858 19926
##  [5,]    5 11879  9807 18604 16489  2683 13684  9263 19516  7512
##  [6,]    6  9653 12931  1447 12568  5650 12217 10348 17936 17972
##  [7,]    7 12515 18753  4958  9232  6360  4499 16439  3609   988
##  [8,]    8  9390 13314  1923  3691 16345  7999 12887  9162  2805
##  [9,]    9  5911 17445  7653   494  2175 14403  6025 12459  6924
## [10,]   10  8300  8488 10582  5099 17036  5854  1476  9162 19692

Now we make an adjacency matrix (1 if 2 cells are connected, 0 if they’re not).

n <- nrow(dat)
adj_matrix <- matrix(0, n, n)
for (i in 1:n) {
  adj_matrix[i, nn$nn.idx[i, ]] <- 1
}

adj_matrix[1:10, 1:10]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    0    0    0    0    0    0    0    0     0
##  [2,]    0    1    0    0    0    0    0    0    0     0
##  [3,]    0    0    1    0    0    0    0    0    0     0
##  [4,]    0    0    0    1    0    0    0    0    0     0
##  [5,]    0    0    0    0    1    0    0    0    0     0
##  [6,]    0    0    0    0    0    1    0    0    0     0
##  [7,]    0    0    0    0    0    0    1    0    0     0
##  [8,]    0    0    0    0    0    0    0    1    0     0
##  [9,]    0    0    0    0    0    0    0    0    1     0
## [10,]    0    0    0    0    0    0    0    0    0     1

Now let’s make the graph. It’s a directed graph because I want to know whether 1 -> 2 AND 2 -> 1.

library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:lubridate':
## 
##     %--%, union

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:purrr':
## 
##     compose, simplify

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following object is masked from 'package:tibble':
## 
##     as_data_frame

## The following object is masked from 'package:flowCore':
## 
##     normalize

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

g <- graph.adjacency(adj_matrix, mode = "directed")
g

## IGRAPH 682e590 D--- 20000 4e+05 -- 
## + edges from 682e590:
##  [1] 1->    1 1-> 1718 1-> 3087 1-> 4562 1-> 6764 1-> 6860 1-> 7126 1-> 7346
##  [9] 1-> 7795 1-> 7800 1-> 7965 1->10666 1->10890 1->11014 1->11282 1->12069
## [17] 1->13380 1->14545 1->14593 1->19939 2->    2 2->  656 2-> 1020 2-> 2629
## [25] 2-> 4064 2-> 4669 2-> 5079 2-> 5492 2-> 5735 2-> 6797 2->10264 2->10871
## [33] 2->12670 2->12936 2->14074 2->16519 2->16635 2->18292 2->18690 2->19642
## [41] 3->    3 3->  170 3->  296 3->  727 3-> 1967 3-> 2750 3-> 3300 3-> 4412
## [49] 3-> 5825 3-> 6403 3-> 6832 3-> 9191 3->11799 3->12824 3->12847 3->13159
## [57] 3->16755 3->16884 3->17231 3->17746 4->    4 4->  858 4-> 1528 4-> 2257
## [65] 4-> 2513 4-> 8802 4->11251 4->11963 4->12070 4->12913 4->15129 4->15159
## + ... omitted several edges

Let’s have a look at this. The graph embedding should look something like a t-SNE or a UMAP.

plot(g)

It does!

Making the mutual nearest neighbor graph

Leveraging igraph, now we can find out who is mutual. Specifically are going to look at what relationships are mutual. Cell 1 and Cell 2 are mutual neighbors if Cell 1 -> Cell 2 and Cell 2 -> Cell 1. In other words, Cell 2 is a KNN of Cell 1, and Cell 1 is a KNN of Cell 2.

mnn_edges <- igraph::which_mutual(g)

# Is it mutual?
mnn_edges[1:5]

## [1]  TRUE FALSE FALSE FALSE FALSE

# What's the edge
E(g)[1:5]

## + 5/4e+05 edges from 682e590:
## [1] 1->   1 1->1718 1->3087 1->4562 1->6764

Ok, now we have to somehow turn this into node information that we can color. How are we going to do this?

Well, we have our K-neareest neighbors. Some of them are mutual. Some of them are not. So why don’t we go with what fraction of KNN are mutual. We can color a UMAP by that.

So let’s first make a subgraph of only mutual edges.

g_mutual <- igraph::subgraph.edges(g, E(g)[mnn_edges])
g_mutual

## IGRAPH 63bb269 D--- 20000 218508 -- 
## + edges from 63bb269:
##  [1] 1->    1 1-> 7965 1->10890 1->11014 1->12069 2->    2 2->  656 2-> 5079
##  [9] 2->10871 3->    3 3->  170 3->  296 3-> 1967 3-> 2750 3-> 3300 3-> 5825
## [17] 3-> 6403 3->11799 3->13159 3->16755 3->16884 3->17746 4->    4 4->  858
## [25] 4-> 1528 4->11963 4->15129 4->15578 4->16004 4->18086 4->18189 4->18619
## [33] 4->19293 5->    5 5-> 5090 5-> 7512 5-> 9263 5->11879 5->13684 5->14507
## [41] 5->18604 6->    6 6-> 5650 6-> 9653 6->12217 6->12568 6->12931 6->17936
## [49] 6->19525 6->19697 7->    7 7-> 3609 7-> 4499 7-> 4958 7-> 7585 7-> 8826
## [57] 7-> 9232 7->11362 7->12515 7->12844 7->15285 7->16439 7->18753 8->    8
## [65] 8->  905 8-> 1476 8-> 1923 8-> 2805 8-> 3691 8-> 7999 8-> 8272 8-> 8576
## + ... omitted several edges

Now let’s plot the graph again and see if it looks any different.

plot(g_mutual)

It looks quite different!

Looks like there are a handful of “hanging” cells that have very few if any mutual neighbors. Let’s remove them and see if that brings the structure back.

g_mutual2 <- igraph::subgraph(graph = g_mutual, vids = igraph::degree(g_mutual) > 20)

plot(g_mutual2)

Ok, yes we do. A bit more nuance in this one. Note that we got rid of a huge swath of cells to visualize this, suggesting that while MNN could reduce noise and outliers, one might lose a lot of signal as well. Again, remember the original question: what is the k?

To get some further intuition as to what these mutual nearest neighbors are, let’s now do something we’re familiar with. We’re going to make a UMAP, and we’re going to color the UMAP by the number of mutual nearest neighbors per cell. Remember, the highest value is the k we originally set.

First, let’s pull out the number of mutual nearest neighbors per cell as a single vector. We can do this by calculating the degree of each node in the subgraph. This works because this graph has retained all cells, so it’s going to be in the proper order.

igraph::degree(g_mutual)[1:5]

## [1] 10  8 26 22 16

Let’s compare that for a minute to the degree of the original graph.

igraph::degree(g)[1:5]

## [1] 25 25 41 40 33

Note that the degree values are much higher than k. This is because the degree of the original graph is going to depend not only on its KNN, but which other cells consider that cell to be its KNN. The upper bound of that is the total number of cells in the dataset minus one (that cell in question).

Confusing? That’s ok. I have a PhD thesis on KNN, and my intuition still gets jolted from time to time.

Now for that “fraction mutual” number. We are going to divide the degree of the original graph by the degree of the mutual graph, and multiply by 100 to express it as a percentage. That’s what we’re going to color UMAP by.

pct_mutual <- (igraph::degree(g_mutual)/igraph::degree(g))*100
pct_mutual[1:5]

## [1] 40.00000 32.00000 63.41463 55.00000 48.48485

Visualization

Let’s make a UMAP.

library(umap)

dat_umap <- umap::umap(dat)$layout %>% as_tibble()

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

names(dat_umap) <- c("umap1", "umap2")

And now to plot. We’ll go through the degree of the regular graph, then the mutual neighbor graph, and then the percent mutual.

library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = igraph::degree(g))) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal()

library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = igraph::degree(g_mutual))) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal()

library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = pct_mutual)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal()

It’s messy! I don’t see any signal here, and I’m not going to stare at it long enough to start hallucinating signal. Why is it so messy? One possibility is that we’re running into the limits of what UMAP can show. We have seen this before in terms of KNN, by visualizing each cell’s KNN calculated in the high-dimensional space, in my KNN Sleepwalk project.

If there’s anything here that we can actually see, I think it’s going to be in the structure of the mutual neighbor graph and how we can utilize that.

Clustering

Let’s try something different. We’re going to do phenograph clustering on both the regular graph and the mutual neraest neighbor graph and see if there are any differences.

clust <- igraph::cluster_louvain(graph = igraph::as.undirected(g), resolution = 0.5)$membership
clust_mut <- igraph::cluster_louvain(graph = igraph::as.undirected(g_mutual), resolution = 0.5)$membership

We check:

table(clust) %>% sort(decreasing = TRUE)

## clust
##    5    2    3    1    7    9    6    8   11   10    4 
## 3554 3498 3212 2840 1645 1496 1494 1210  694  301   56

table(clust_mut) %>% sort(decreasing = TRUE)

## clust_mut
##    3    2    5    1   14   11   10    7    8   15   12   16   13    9   17   18 
## 2825 2649 2562 2542 1632 1407 1237 1165  938  712  670  539  220  200  199  120 
##    6    4   80   35  132   20   21   31   36   37   49   52   58   89  119  144 
##   84   51    4    3    3    2    2    2    2    2    2    2    2    2    2    2 
##   19   22   23   24   25   26   27   28   29   30   32   33   34   38   39   40 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   41   42   43   44   45   46   47   48   50   51   53   54   55   56   57   59 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   60   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   76   77   78   79   81   82   83   84   85   86   87   88   90   91   92   93 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   94   95   96   97   98   99  100  101  102  103  104  105  106  107  108  109 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  110  111  112  113  114  115  116  117  118  120  121  122  123  124  125  126 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  127  128  129  130  131  133  134  135  136  137  138  139  140  141  142  143 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  241  242  243  244  245  246  247  248 
##    1    1    1    1    1    1    1    1

Ok, tons of clusters in the mutual graph that have one cell within them. Looks like there’s some optimization to do. But let’s remove them for now just to see what we see.

counts <- table(clust_mut)
# select the values that occur less than 5 times
values_to_na <- as.numeric(names(counts[counts < 20]))
# replace these values with NA in the original vector
clust_mut <- ifelse(clust_mut %in% values_to_na, NA, clust_mut)

Ok, let’s color our maps accordingly.

library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clust))) +
  geom_point() +
  theme_minimal()

library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clust_mut))) +
  geom_point() +
  theme_minimal()

It looks like at default, the mutual nearest neighbors provide a higher resolution. Whether these clusters are objectively better than the previous ones is a question that takes quite a lot of work.

Why don’t we re-cluster the original graph with higher resolution so we can do a comparison.

clust2 <- igraph::cluster_louvain(graph = igraph::as.undirected(g), resolution = 1)$membership

library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clust2))) +
  geom_point() +
  theme_minimal()

At first glance, it looks like the mutual neighbor graph is hashing out the smaller populations better. But a lot of work still has yet to be done here.

This said, I remain skeptical because the mutual neighbor graph, when visualized, did not look great.

Concluding remarks

In terms of KNN graphs, I encourage you to get your CyTOF data into an igraph KNN graph using the code I provide here. From there, you have access to all kinds of clustering tools that go beyond the standard Lovuain clustering that we see in Phenograph. Outside of clustering, through igraph you have a large swath of graph theory at your disposal. What is an example of this? Random walks on a KNN graph is what gives you trajectory detection tools like Wanderlust.

Remember that mutual nearest neighbors at default has its use in batch effect removal, so that might be one avenue for CyTOF datasets (though it didn’t work in my hands). But now hopefully you have a better understanding of what mutual nearest neighbors are, and also how to get your CyTOF data into an igraph format.