The purpose of this markdown is to explore what mutual nearest neighbors are in the context of CyTOF data. I have a fair amount of experience dealing with CyTOF data in terms of k-nearest neighbors, so when I came across the concept of mutual nearest neighbors in scRNA seq analysis, I became curious. So let’s see what we see.
First, we read in and process CyTOF data. The data analyzed in this markdown is healthy PBMCs from a Stanford HIMC multicenter study, and can be found here.
I’m going to breeze through the data processing part, but if you want a more in-depth look at this, please go here.
library(flowCore)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.2 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.0
## âś” ggplot2 3.4.2 âś” tibble 3.2.1
## âś” lubridate 1.9.2 âś” tidyr 1.3.0
## âś” purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks flowCore::filter(), stats::filter()
## âś– dplyr::lag() masks stats::lag()
## â„ą Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(42)
setwd(here::here("data"))
ff <- flowCore::read.FCS('Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs')
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Nd142Di' channel exceed its $PnR value 1509 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Yb176Di' channel exceed its $PnR value 1349 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
## name desc range minRange maxRange
## $P1 Ce140Di beads1 2911 0 7
## $P2 Ce142Di beads2 1509 0 1
## $P3 Er167Di CD27 781 0 17
## $P4 Er170Di CD3 1664 0 27
## $P5 Eu151Di CD123 2814 0 16
## ... ... ... ... ... ...
## $P26 Tm169Di CD45RA 12000 0 23
## $P27 Yb172Di CD38 5629 0 18
## $P28 Yb174Di HLADR 6990 0 11
## $P29 Yb176Di CD56 1349 0 12
## $P30 Time Time 0 0 1
## 261 keywords are stored in the 'description' slot
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc
# All NA values in marker names will revert to mass tags names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params
Here, I’m just going to bring the data into tibble format and do away with the flow frame entirely.
dat <- flowCore::exprs(ff) %>% as_tibble()
dat <- asinh(dat/5)
And we’re going to pull out the surface markers we care about.
params <- ff@parameters@data$desc
# Filter out markers we're not using
to_remove <- c("CD45", "CD66", "CD61", "CD235")
to_remove <- which(params %in% to_remove)
# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))
surface <- c(cd, ig, hla, ccr)
surface <- surface[!(surface %in% to_remove)]
dat <- dat[surface]
Now we’re going to subsample for the sake of speed of processing.
dat <- dat[sample(nrow(dat), 20000),]
dim(dat)
## [1] 20000 15
Ok, first we need to make a KNN graph of our data. This is typically done in the context of Phenograph clustering, under the hood. But we’re going to bring it to the forefront here.
Note that I have a BioConductor package that does this, but I want to refactor it with some of the new methods and modern contexts, so I’ll show you the guts of everything instead.
Whatever we set our K to, note that “what is the k” is a heated question for this kind of analysis. The answer: it depends.
library(RANN)
nn <- RANN::nn2(dat, k = 20)
We have:
nn$nn.idx[1:10, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 7965 11014 12069 4562 10890 13380 14545 19939 1718
## [2,] 2 10871 2629 656 14074 16519 4064 6797 18292 5079
## [3,] 3 2750 16884 1967 6403 17746 296 13159 16755 3300
## [4,] 4 11963 19293 15578 18619 8802 18189 1528 858 19926
## [5,] 5 11879 9807 18604 16489 2683 13684 9263 19516 7512
## [6,] 6 9653 12931 1447 12568 5650 12217 10348 17936 17972
## [7,] 7 12515 18753 4958 9232 6360 4499 16439 3609 988
## [8,] 8 9390 13314 1923 3691 16345 7999 12887 9162 2805
## [9,] 9 5911 17445 7653 494 2175 14403 6025 12459 6924
## [10,] 10 8300 8488 10582 5099 17036 5854 1476 9162 19692
Now we make an adjacency matrix (1 if 2 cells are connected, 0 if they’re not).
n <- nrow(dat)
adj_matrix <- matrix(0, n, n)
for (i in 1:n) {
adj_matrix[i, nn$nn.idx[i, ]] <- 1
}
adj_matrix[1:10, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 0 0 0 0 0 0 0 0 0
## [2,] 0 1 0 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0
## [6,] 0 0 0 0 0 1 0 0 0 0
## [7,] 0 0 0 0 0 0 1 0 0 0
## [8,] 0 0 0 0 0 0 0 1 0 0
## [9,] 0 0 0 0 0 0 0 0 1 0
## [10,] 0 0 0 0 0 0 0 0 0 1
Now let’s make the graph. It’s a directed graph because I want to know whether 1 -> 2 AND 2 -> 1.
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:lubridate':
##
## %--%, union
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following object is masked from 'package:flowCore':
##
## normalize
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g <- graph.adjacency(adj_matrix, mode = "directed")
g
## IGRAPH 682e590 D--- 20000 4e+05 --
## + edges from 682e590:
## [1] 1-> 1 1-> 1718 1-> 3087 1-> 4562 1-> 6764 1-> 6860 1-> 7126 1-> 7346
## [9] 1-> 7795 1-> 7800 1-> 7965 1->10666 1->10890 1->11014 1->11282 1->12069
## [17] 1->13380 1->14545 1->14593 1->19939 2-> 2 2-> 656 2-> 1020 2-> 2629
## [25] 2-> 4064 2-> 4669 2-> 5079 2-> 5492 2-> 5735 2-> 6797 2->10264 2->10871
## [33] 2->12670 2->12936 2->14074 2->16519 2->16635 2->18292 2->18690 2->19642
## [41] 3-> 3 3-> 170 3-> 296 3-> 727 3-> 1967 3-> 2750 3-> 3300 3-> 4412
## [49] 3-> 5825 3-> 6403 3-> 6832 3-> 9191 3->11799 3->12824 3->12847 3->13159
## [57] 3->16755 3->16884 3->17231 3->17746 4-> 4 4-> 858 4-> 1528 4-> 2257
## [65] 4-> 2513 4-> 8802 4->11251 4->11963 4->12070 4->12913 4->15129 4->15159
## + ... omitted several edges
Let’s have a look at this. The graph embedding should look something like a t-SNE or a UMAP.
plot(g)
It does!
Leveraging igraph, now we can find out who is mutual. Specifically are going to look at what relationships are mutual. Cell 1 and Cell 2 are mutual neighbors if Cell 1 -> Cell 2 and Cell 2 -> Cell 1. In other words, Cell 2 is a KNN of Cell 1, and Cell 1 is a KNN of Cell 2.
mnn_edges <- igraph::which_mutual(g)
# Is it mutual?
mnn_edges[1:5]
## [1] TRUE FALSE FALSE FALSE FALSE
# What's the edge
E(g)[1:5]
## + 5/4e+05 edges from 682e590:
## [1] 1-> 1 1->1718 1->3087 1->4562 1->6764
Ok, now we have to somehow turn this into node information that we can color. How are we going to do this?
Well, we have our K-neareest neighbors. Some of them are mutual. Some of them are not. So why don’t we go with what fraction of KNN are mutual. We can color a UMAP by that.
So let’s first make a subgraph of only mutual edges.
g_mutual <- igraph::subgraph.edges(g, E(g)[mnn_edges])
g_mutual
## IGRAPH 63bb269 D--- 20000 218508 --
## + edges from 63bb269:
## [1] 1-> 1 1-> 7965 1->10890 1->11014 1->12069 2-> 2 2-> 656 2-> 5079
## [9] 2->10871 3-> 3 3-> 170 3-> 296 3-> 1967 3-> 2750 3-> 3300 3-> 5825
## [17] 3-> 6403 3->11799 3->13159 3->16755 3->16884 3->17746 4-> 4 4-> 858
## [25] 4-> 1528 4->11963 4->15129 4->15578 4->16004 4->18086 4->18189 4->18619
## [33] 4->19293 5-> 5 5-> 5090 5-> 7512 5-> 9263 5->11879 5->13684 5->14507
## [41] 5->18604 6-> 6 6-> 5650 6-> 9653 6->12217 6->12568 6->12931 6->17936
## [49] 6->19525 6->19697 7-> 7 7-> 3609 7-> 4499 7-> 4958 7-> 7585 7-> 8826
## [57] 7-> 9232 7->11362 7->12515 7->12844 7->15285 7->16439 7->18753 8-> 8
## [65] 8-> 905 8-> 1476 8-> 1923 8-> 2805 8-> 3691 8-> 7999 8-> 8272 8-> 8576
## + ... omitted several edges
Now let’s plot the graph again and see if it looks any different.
plot(g_mutual)
It looks quite different!
Looks like there are a handful of “hanging” cells that have very few if any mutual neighbors. Let’s remove them and see if that brings the structure back.
g_mutual2 <- igraph::subgraph(graph = g_mutual, vids = igraph::degree(g_mutual) > 20)
plot(g_mutual2)
Ok, yes we do. A bit more nuance in this one. Note that we got rid of a huge swath of cells to visualize this, suggesting that while MNN could reduce noise and outliers, one might lose a lot of signal as well. Again, remember the original question: what is the k?
To get some further intuition as to what these mutual nearest neighbors are, let’s now do something we’re familiar with. We’re going to make a UMAP, and we’re going to color the UMAP by the number of mutual nearest neighbors per cell. Remember, the highest value is the k we originally set.
First, let’s pull out the number of mutual nearest neighbors per cell as a single vector. We can do this by calculating the degree of each node in the subgraph. This works because this graph has retained all cells, so it’s going to be in the proper order.
igraph::degree(g_mutual)[1:5]
## [1] 10 8 26 22 16
Let’s compare that for a minute to the degree of the original graph.
igraph::degree(g)[1:5]
## [1] 25 25 41 40 33
Note that the degree values are much higher than k. This is because the degree of the original graph is going to depend not only on its KNN, but which other cells consider that cell to be its KNN. The upper bound of that is the total number of cells in the dataset minus one (that cell in question).
Confusing? That’s ok. I have a PhD thesis on KNN, and my intuition still gets jolted from time to time.
Now for that “fraction mutual” number. We are going to divide the degree of the original graph by the degree of the mutual graph, and multiply by 100 to express it as a percentage. That’s what we’re going to color UMAP by.
pct_mutual <- (igraph::degree(g_mutual)/igraph::degree(g))*100
pct_mutual[1:5]
## [1] 40.00000 32.00000 63.41463 55.00000 48.48485
Let’s make a UMAP.
library(umap)
dat_umap <- umap::umap(dat)$layout %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## â„ą Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
names(dat_umap) <- c("umap1", "umap2")
And now to plot. We’ll go through the degree of the regular graph, then the mutual neighbor graph, and then the percent mutual.
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = igraph::degree(g))) +
geom_point() +
scale_color_viridis_c() +
theme_minimal()
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = igraph::degree(g_mutual))) +
geom_point() +
scale_color_viridis_c() +
theme_minimal()
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = pct_mutual)) +
geom_point() +
scale_color_viridis_c() +
theme_minimal()
It’s messy! I don’t see any signal here, and I’m not going to stare at it long enough to start hallucinating signal. Why is it so messy? One possibility is that we’re running into the limits of what UMAP can show. We have seen this before in terms of KNN, by visualizing each cell’s KNN calculated in the high-dimensional space, in my KNN Sleepwalk project.
If there’s anything here that we can actually see, I think it’s going to be in the structure of the mutual neighbor graph and how we can utilize that.
Let’s try something different. We’re going to do phenograph clustering on both the regular graph and the mutual neraest neighbor graph and see if there are any differences.
clust <- igraph::cluster_louvain(graph = igraph::as.undirected(g), resolution = 0.5)$membership
clust_mut <- igraph::cluster_louvain(graph = igraph::as.undirected(g_mutual), resolution = 0.5)$membership
We check:
table(clust) %>% sort(decreasing = TRUE)
## clust
## 5 2 3 1 7 9 6 8 11 10 4
## 3554 3498 3212 2840 1645 1496 1494 1210 694 301 56
table(clust_mut) %>% sort(decreasing = TRUE)
## clust_mut
## 3 2 5 1 14 11 10 7 8 15 12 16 13 9 17 18
## 2825 2649 2562 2542 1632 1407 1237 1165 938 712 670 539 220 200 199 120
## 6 4 80 35 132 20 21 31 36 37 49 52 58 89 119 144
## 84 51 4 3 3 2 2 2 2 2 2 2 2 2 2 2
## 19 22 23 24 25 26 27 28 29 30 32 33 34 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 50 51 53 54 55 56 57 59
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 76 77 78 79 81 82 83 84 85 86 87 88 90 91 92 93
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 110 111 112 113 114 115 116 117 118 120 121 122 123 124 125 126
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 127 128 129 130 131 133 134 135 136 137 138 139 140 141 142 143
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 241 242 243 244 245 246 247 248
## 1 1 1 1 1 1 1 1
Ok, tons of clusters in the mutual graph that have one cell within them. Looks like there’s some optimization to do. But let’s remove them for now just to see what we see.
counts <- table(clust_mut)
# select the values that occur less than 5 times
values_to_na <- as.numeric(names(counts[counts < 20]))
# replace these values with NA in the original vector
clust_mut <- ifelse(clust_mut %in% values_to_na, NA, clust_mut)
Ok, let’s color our maps accordingly.
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clust))) +
geom_point() +
theme_minimal()
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clust_mut))) +
geom_point() +
theme_minimal()
It looks like at default, the mutual nearest neighbors provide a higher resolution. Whether these clusters are objectively better than the previous ones is a question that takes quite a lot of work.
Why don’t we re-cluster the original graph with higher resolution so we can do a comparison.
clust2 <- igraph::cluster_louvain(graph = igraph::as.undirected(g), resolution = 1)$membership
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clust2))) +
geom_point() +
theme_minimal()
At first glance, it looks like the mutual neighbor graph is hashing out the smaller populations better. But a lot of work still has yet to be done here.
This said, I remain skeptical because the mutual neighbor graph, when visualized, did not look great.
In terms of KNN graphs, I encourage you to get your CyTOF data into an igraph KNN graph using the code I provide here. From there, you have access to all kinds of clustering tools that go beyond the standard Lovuain clustering that we see in Phenograph. Outside of clustering, through igraph you have a large swath of graph theory at your disposal. What is an example of this? Random walks on a KNN graph is what gives you trajectory detection tools like Wanderlust.
Remember that mutual nearest neighbors at default has its use in batch effect removal, so that might be one avenue for CyTOF datasets (though it didn’t work in my hands). But now hopefully you have a better understanding of what mutual nearest neighbors are, and also how to get your CyTOF data into an igraph format.