“You never climb the same mountain twice, not even in memory. Memory rebuilds the mountain, changes the weather, retells the jokes, remakes all the moves.”
The purpose of this document is to get a massively simplified and hyper-tailored version of X-shift working in R, mainly to show how it works and how one might go about writing something similar in R. In my opinion, the algorithm is underexplored, especially with respect to the newer high-dimensional data types that have emerged since the original paper was published.
X-shift is a clustering algorithm developed by Nikolay Samusik when he was doing his postdoc in the Nolan Lab. It is based on mean-shift, which you can think of as assigning clusters by determining which density “peak” a given cell belongs to if you “climb the hill” from where the cell sits.
Why does X-shift matter? Because in a landmark benchmarking study by Weber and Robinson in 2016, X-shift was found to be the best clustering algorithm for identifying rare cell subsets. Thus, if that is one of your goals at any point, then you should absolutely know how to use X-shift.
For various reasons, X-shift was written in Java. Some of this was because the algorithm is computationally expensive, and Dr. Samusik needed full control over the computer (that he also built by hand) to, among other things, make sure each core was doing the right thing at the right time.
This markdown will show you how X-shift works, through the expressive but less verbose language of R, with my writing filling in the gaps as needed.
We’re going to do the standard data import and processing. If you want more details on this step, please go to my markdown here.
library(flowCore)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(42)
setwd(here::here("data"))
ff <- flowCore::read.FCS('Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs')
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Nd142Di' channel exceed its $PnR value 1509 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Yb176Di' channel exceed its $PnR value 1349 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
## name desc range minRange maxRange
## $P1 Ce140Di beads1 2911 0 7
## $P2 Ce142Di beads2 1509 0 1
## $P3 Er167Di CD27 781 0 17
## $P4 Er170Di CD3 1664 0 27
## $P5 Eu151Di CD123 2814 0 16
## ... ... ... ... ... ...
## $P26 Tm169Di CD45RA 12000 0 23
## $P27 Yb172Di CD38 5629 0 18
## $P28 Yb174Di HLADR 6990 0 11
## $P29 Yb176Di CD56 1349 0 12
## $P30 Time Time 0 0 1
## 261 keywords are stored in the 'description' slot
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc
# Any NA marker names fall back to the corresponding mass tag names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params
Here, I’m just going to bring the data into tibble format and do away with the flow frame entirely.
dat <- flowCore::exprs(ff) %>% as_tibble()
dat <- asinh(dat/5) # arcsinh transform with cofactor 5, the usual choice for mass cytometry data
And we’re going to pull out the surface markers we care about. Note that the code below works for this particular dataset, with respect to what I want to show you here. But you will have to tailor your own analysis pipelines to distinguish between markers you want to cluster on (type), markers whose expression you want to measure within clusters (state), and markers that are not relevant for the analysis.
params <- ff@parameters@data$desc
# Filter out markers we're not using
to_remove <- c("CD45", "CD66", "CD61", "CD235")
to_remove <- which(params %in% to_remove)
# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))
surface <- c(cd, ig, hla, ccr)
surface <- surface[!(surface %in% to_remove)]
dat <- dat[surface]
Now we’re going to subsample for the sake of processing speed.
dat <- dat[sample(nrow(dat), 10000),]
dim(dat)
## [1] 10000 15
We’re going to do peak finding in higher-dimensional space. To do that, we need to know how dense each region is. To that end, we’re going to compute a density estimate for each cell from its distances to its k-nearest neighbors (KNN).
So let’s first compute the KNN. We’re not going to worry about the k for now, but if you were really implementing this for your data, you would try a handful of k’s and look for an elbow point in the number of clusters each k produces (see the sketch below). From what I’ve seen trying a number of k’s here: a smaller k leads to more clusters, but often in a messy way, in that clusters need to be merged downstream. A larger k is cleaner, in that there are fewer clusters, but that might come at the expense of rare subset discovery, which is something X-shift is good at.
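To make that concrete, here is a minimal sketch of what that k-sweep could look like. It assumes a hypothetical helper, run_xshift_simplified(dat, k), that wraps everything we build below (KNN, density estimation, graph construction, peak finding) and returns one cluster label per cell.
# Hypothetical helper: run_xshift_simplified(dat, k) wraps the steps developed
# below and returns a vector of cluster labels, one per cell
ks <- c(20, 40, 60, 80, 100, 150, 200)
num_clusters <- sapply(ks, function(k) {
  clusters_k <- run_xshift_simplified(dat, k = k)
  length(unique(clusters_k))
})
# Look for the elbow: the k where the cluster count stops dropping sharply
plot(ks, num_clusters, type = "b",
  xlab = "k (number of nearest neighbors)",
  ylab = "number of clusters")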
Of note, X-shift uses cosine distance. If you want more of a deep dive on distance metrics, please go to my poster here.
And now we compute the KNN.
library(FNN)
num_neighbors <- 100
# Pairwise cosine distance matrix for all cells (O(n^2) in memory, another reason we subsampled)
dist_matrix <- proxy::dist(x = dat, method = "cosine")
# FNN::get.knn only supports Euclidean search, so the neighbors here are found
# on the rows of the cosine distance matrix rather than by exact cosine KNN
nn <- FNN::get.knn(data = dist_matrix, k = num_neighbors)
This gives us both nearest neighbor IDs and nearest neighbor distances.
nn$nn.index[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8464 2419 7965 7800 252
## [2,] 656 5735 8679 3308 3260
## [3,] 5302 296 7981 1548 3598
## [4,] 5246 3046 5145 592 7012
## [5,] 8291 2970 2683 7471 5296
nn$nn.dist[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.519795 1.743521 1.772668 1.828717 1.877079
## [2,] 2.459011 2.677880 2.906997 2.926899 3.139863
## [3,] 1.816496 2.523467 2.740823 2.917900 2.966542
## [4,] 1.935983 2.013207 2.017506 2.033546 2.040030
## [5,] 1.893180 2.144126 2.368645 2.402282 2.428539
Great. So what is the per-cell density? We’ll estimate it from the rows of the second matrix above: for each cell, we sum the inverse distances to its nearest neighbors. Weighting by inverse distance means that a neighbor sitting far away (say, on another “island”) contributes much less to the estimate than a neighbor right next door. So let’s calculate that now.
dens <- apply(nn$nn.dist, 1, function(i) {
  curr <- i[-1] # drop the nearest neighbor's distance (get.knn already excludes the cell itself)
  result <- sum(1/curr) # sum of inverse distances as the density estimate
  return(result)
})
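A brief side note: the hill climbing we do later only compares densities between neighboring cells, so only the relative ordering of the density values matters, and other reasonable KNN-based estimates tend to land on similar peaks. A minimal alternative (not what we use below) is the inverse of the mean KNN distance.
# Alternative density estimate: inverse of the mean distance to the k nearest neighbors
dens_alt <- 1 / rowMeans(nn$nn.dist)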
Let’s take a brief interlude and plot the density estimate on a UMAP, so you can get some intuition around what exactly was computed.
First, we make the UMAP. To be consistent with X-shift, we are going to use cosine as the distance metric, rather than the default Euclidean distance.
library(umap)
dat_umap <- umap::umap(dat, metric = "cosine")$layout %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
names(dat_umap) <- c("umap1", "umap2")
Now, we plot.
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = dens)) +
geom_point() +
scale_color_viridis_c() +
theme_minimal()
We’ll color by the log of the density so we can see a bit more nuance.
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = log(dens))) +
geom_point() +
scale_color_viridis_c() +
theme_minimal()
Ok, we can start to see that even within the islands themselves, there are regions of high density and regions of low density that aren’t always obvious to the eye.
The next step is to make the KNN graph.
First we make an adjacency matrix (1 if 2 cells are connected, 0 if they’re not).
n <- nrow(dat)
adj_matrix <- matrix(0, n, n)
for (i in 1:n) {
adj_matrix[i, nn$nn.index[i, ]] <- 1
}
adj_matrix[1:10, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 1 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 1
## [6,] 1 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0 0
## [10,] 0 0 0 0 0 0 0 1 0 0
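As an aside, before we build the graph from this matrix: a dense n-by-n adjacency matrix is fine at 10,000 cells, but it grows quadratically with the number of cells. If that becomes a problem, the same directed graph can be built straight from the KNN index matrix, as in the sketch below (using igraph, which we load in the next step).
# Build the directed KNN graph straight from the neighbor indices,
# skipping the dense n x n adjacency matrix
edge_list <- cbind(rep(seq_len(n), each = num_neighbors),
  as.vector(t(nn$nn.index)))
g_alt <- igraph::graph_from_edgelist(edge_list, directed = TRUE)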
Now let’s make the graph. It’s a directed graph because I want to know whether 1 -> 2 AND 2 -> 1.
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:FNN':
##
## knn
## The following objects are masked from 'package:lubridate':
##
## %--%, union
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following object is masked from 'package:flowCore':
##
## normalize
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g <- graph_from_adjacency_matrix(adj_matrix, mode = "directed")
g
## IGRAPH 401ef98 D--- 10000 1e+06 --
## + edges from 401ef98:
## [1] 1-> 6 1-> 18 1-> 204 1-> 215 1-> 247 1-> 252 1-> 401 1-> 669 1-> 676
## [10] 1-> 853 1->1203 1->1299 1->1447 1->1464 1->1590 1->1660 1->1718 1->1748
## [19] 1->1799 1->1942 1->1992 1->2070 1->2328 1->2419 1->2483 1->2496 1->2584
## [28] 1->2760 1->3172 1->3234 1->3334 1->3614 1->3823 1->3860 1->3873 1->3955
## [37] 1->4024 1->4046 1->4048 1->4081 1->4182 1->4295 1->4330 1->4494 1->4562
## [46] 1->4986 1->5014 1->5027 1->5120 1->5150 1->5198 1->5241 1->5422 1->5594
## [55] 1->5625 1->5650 1->5810 1->5882 1->5939 1->5969 1->6224 1->6374 1->6526
## [64] 1->6546 1->6612 1->6663 1->6685 1->6714 1->6784 1->6833 1->6860 1->7071
## [73] 1->7126 1->7309 1->7330 1->7368 1->7430 1->7440 1->7538 1->7592 1->7673
## + ... omitted several edges
Great. Now we have it in an actual graph format. From here, we need to do peak detection. To do this, let’s take a single cell at random and walk it up the density landscape until it’s at a peak.
First, let’s add density as a node attribute of the graph.
V(g)$density <- dens
g
## IGRAPH 401ef98 D--- 10000 1e+06 --
## + attr: density (v/n)
## + edges from 401ef98:
## [1] 1-> 6 1-> 18 1-> 204 1-> 215 1-> 247 1-> 252 1-> 401 1-> 669 1-> 676
## [10] 1-> 853 1->1203 1->1299 1->1447 1->1464 1->1590 1->1660 1->1718 1->1748
## [19] 1->1799 1->1942 1->1992 1->2070 1->2328 1->2419 1->2483 1->2496 1->2584
## [28] 1->2760 1->3172 1->3234 1->3334 1->3614 1->3823 1->3860 1->3873 1->3955
## [37] 1->4024 1->4046 1->4048 1->4081 1->4182 1->4295 1->4330 1->4494 1->4562
## [46] 1->4986 1->5014 1->5027 1->5120 1->5150 1->5198 1->5241 1->5422 1->5594
## [55] 1->5625 1->5650 1->5810 1->5882 1->5939 1->5969 1->6224 1->6374 1->6526
## [64] 1->6546 1->6612 1->6663 1->6685 1->6714 1->6784 1->6833 1->6860 1->7071
## + ... omitted several edges
Now we set up a single instance of peak detection.
FindPeak <- function(g, start_node, verbose = FALSE) {
current_node <- start_node
neighbors <- igraph::neighbors(g, current_node)
# Check if there are any neighbors with higher density
while(any(V(g)$density[neighbors] > V(g)$density[current_node])) {
if(verbose) {
print(current_node)
}
# Find the neighbor with the highest density
next_node <- neighbors[which.max(V(g)$density[neighbors])]
# Move to the next node
current_node <- next_node
# Update the list of neighbors
neighbors <- neighbors(g, current_node)
}
return(current_node)
}
And now we test it.
# Call the function with the start node
peak_node <- FindPeak(g, start_node = 10, verbose = TRUE)
## [1] 10
## + 1/10000 vertex, from 401ef98:
## [1] 1476
## + 1/10000 vertex, from 401ef98:
## [1] 9663
peak_node
## + 1/10000 vertex, from 401ef98:
## [1] 7233
Ok, now you can see what it’s doing. We’re walking along the graph until we get to a peak. I note that with a lower k, the landscape is more rugged, so there are more peaks (which means more clusters downstream). This is worth considering.
Now we have to do this peak detection for every cell. The cluster ID is going to be the peak.
clusters <- lapply(V(g), function(i) {
FindPeak(g = g, start_node = i, verbose = FALSE)
}) %>% unlist()
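An aside on speed: calling FindPeak independently for every cell re-walks a lot of shared uphill paths. A sketch of an equivalent but faster shortcut (mine, not part of X-shift) computes each cell’s single best uphill step once, then follows those pointers until every cell has landed on its peak. We’ll stick with the FindPeak results below.
# One uphill step per cell: its densest neighbor, if that neighbor is denser
uphill <- sapply(seq_len(vcount(g)), function(i) {
  nbrs <- as.integer(neighbors(g, i))
  if (length(nbrs) == 0) return(i)
  best <- nbrs[which.max(V(g)$density[nbrs])]
  if (V(g)$density[best] > V(g)$density[i]) best else i
})
# Follow the pointers until every cell points at its own peak
clusters_fast <- uphill
repeat {
  nxt <- uphill[clusters_fast]
  if (all(nxt == clusters_fast)) break
  clusters_fast <- nxt
}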
Great. Let’s see what we have.
table(clusters) %>% sort(decreasing = TRUE)
## clusters
## 7233 1921 5625 6192 7527 9099 6172 6089 8270 7710 5023
## 1718 1549 1434 1041 1016 880 763 676 522 294 107
Alright, now let’s view this on the UMAP to see if it makes sense.
library(ggplot2)
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clusters))) +
geom_point() +
theme_minimal()
To hammer the point home, let’s color the map by the peaks themselves. First we make an is_peak boolean.
peaks <- unique(clusters)
is_peak <- seq_len(nrow(dat)) %in% peaks
And now we color the map and superimpose the peaks on top.
# then plot as before
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clusters))) +
geom_point() +
geom_point(data = dat_umap[is_peak,], color = "black") +
theme_minimal()
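Beyond the UMAP, one quick sanity check (my addition, not something X-shift itself does) is to look at the median expression of each marker within each cluster.
# Median marker expression per cluster, as a quick check on the labels
cluster_medians <- dat %>%
  mutate(cluster = clusters) %>%
  group_by(cluster) %>%
  summarise(across(everything(), median))
cluster_medians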
Ok, so we have clusters that make sense. Now, X-shift has many automated optimizations embedded within it, for example to choose k properly. But hopefully what I’ve done here gives you an idea of what X-shift is actually doing.
What is the difference between this simplified implementation of X-shift and the full-fledged software? A lot. I’ll list a few things here, some of which are from direct correspondence with Samusik after showing him the above.
For starters, X-shift is written in Java, not R, which gave Samusik much more control over the machine. A lot of X-shift was an engineering problem: although the overall concept is easy to understand, it is very computationally expensive to actually run, especially on multi-million cell datasets. Accordingly, a lot of clever low-level programming went into turning the concept into reality.
In the implementation above, I use a given k for density estimation, and k = 1 for hill climbing. The X-shift implementation does hill climbing across a separately computed k, with an embedded statistical test. There is also a cluster merging step at the end, based on a modified version of the Mahalanobis distance. I plan to implement these in later drafts if I have time, which may make my clusters look a bit less messy.
The software that runs X-shift, called VorteX, also contains visualization tools that Samusik developed. Among them is a force directed cell embedding that, while computationally intensive, produces output that is more comparable to UMAP than t-SNE (the primary dimension reduction method at the time of publication) in that trajectories can be inferred.
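If you want a rough feel for that kind of view without VorteX, one option is igraph’s stock Fruchterman-Reingold layout on our KNN graph, colored by the clusters we found above. This is not VorteX’s implementation, and it is slow at this size.
# Force-directed layout of the KNN graph as a rough stand-in for VorteX's
# embedding; expect this to take a while on 10,000 cells and ~1e6 edges
fr_layout <- layout_with_fr(g)
fr_df <- tibble(fr1 = fr_layout[, 1],
  fr2 = fr_layout[, 2],
  cluster = as.factor(clusters))
ggplot(fr_df, aes(x = fr1, y = fr2, color = cluster)) +
  geom_point(size = 0.5) +
  theme_minimal()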
Overall, I hope my simplified version of X-shift, and the explanation of the rest of it, gave you an idea of how X-shift works. Even if you don’t use the tool, I think it is important to understand the concept behind it, as this was the best method for rare cell subset detection in high-dimensional flow and mass cytometry data. That suggests there are unexplored uses in newer high-dimensional data types, like spatial transcriptomics, high-dimensional imaging (e.g. MIBI, CODEX), and single-cell sequencing.