

“You never climb the same mountain twice, not even in memory. Memory rebuilds the mountain, changes the weather, retells the jokes, remakes all the moves.”

Lito Tejada-Flores


Introduction

The purpose of this document is to get a massively simplified and hyper-tailored version of X-shift working in R, mainly to show how it works and how one might go about writing something similar in R. In my opinion, the algorithm is underexplored, especially with respect to the newer high-dimensional data types that have emerged since the original paper was published.

X-shift is a clustering algorithm developed by Nikolay Samusik when he was doing his postdoc in the Nolan Lab. It is based on mean-shift, which you can think of as assigning clusters by determining which density “peak” a given cell belongs to if you “climb the hill” from where the cell sits.
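
To make the hill climbing concrete before we touch real data, here is a toy one-dimensional sketch of my own (not X-shift itself): estimate a density curve, then walk each point uphill along the curve until it can’t go any higher. Points that end at the same peak share a cluster.

x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))
d <- density(x) # kernel density estimate over a grid

Climb <- function(xi) {
  idx <- which.min(abs(d$x - xi)) # nearest grid point to xi
  repeat {
    nbr <- max(idx - 1, 1):min(idx + 1, length(d$x))
    best <- nbr[which.max(d$y[nbr])] # step to the densest adjacent grid point
    if (best == idx) return(d$x[idx]) # local maximum reached
    idx <- best
  }
}

peaks <- sapply(x, Climb)
table(round(peaks)) # roughly two peaks -> two clusters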

Why does X-shift matter? Because in a landmark benchmarking study by Weber and Robinson in 2016, X-shift was found to be the best clustering algorithm for identifying rare cell subsets. So if rare subset discovery is one of your goals at any point, you should absolutely know how to use X-shift.

For various reasons, X-shift was written in Java. Part of this was because the algorithm is computationally expensive, and Dr. Samusik needed full control over the computer (which he also built by hand) to, among other things, make sure each core was doing the right thing at the right time.

This markdown will show you how X-shift works, through the expressive but less verbose language of R, with my writing filling in the gaps as needed.

Data pre-processing

We’re going to do the standard data import and processing. If you want more details on this step, please go to my markdown here.

library(flowCore)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(42)
setwd(here::here("data"))

ff <- flowCore::read.FCS('Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs')
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Nd142Di' channel exceed its $PnR value 1509 and will be truncated!
##  To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Yb176Di' channel exceed its $PnR value 1349 and will be truncated!
##  To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
##         name   desc     range  minRange  maxRange
## $P1  Ce140Di beads1      2911         0         7
## $P2  Ce142Di beads2      1509         0         1
## $P3  Er167Di   CD27       781         0        17
## $P4  Er170Di    CD3      1664         0        27
## $P5  Eu151Di  CD123      2814         0        16
## ...      ...    ...       ...       ...       ...
## $P26 Tm169Di CD45RA     12000         0        23
## $P27 Yb172Di   CD38      5629         0        18
## $P28 Yb174Di  HLADR      6990         0        11
## $P29 Yb176Di   CD56      1349         0        12
## $P30    Time   Time         0         0         1
## 261 keywords are stored in the 'description' slot
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc

# Any NA values in the marker names fall back to the mass tag names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params

Here, I’m just going to bring the data into tibble format and do away with the flow frame entirely.

dat <- flowCore::exprs(ff) %>% as_tibble()
dat <- asinh(dat/5) # arcsinh transform with the standard mass cytometry cofactor of 5

And we’re going to pull out the surface markers we care about. Note that the code below works for this particular dataset, with respect to what I want to show you here. But you will have to tailor your own analysis pipelines to distinguish between markers you want to cluster on (type), markers whose expression you want to measure within clusters (state), and markers that are not relevant for the analysis.

params <- ff@parameters@data$desc

# Filter out markers we're not using
to_remove <- c("CD45", "CD66", "CD61", "CD235")
to_remove <- which(params %in% to_remove)

# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))

surface <- c(cd, ig, hla, ccr)
surface <- surface[!(surface %in% to_remove)]
dat <- dat[surface]

Now we’re going to subsample for the sake of speed of processing.

dat <- dat[sample(nrow(dat), 10000),]
dim(dat)
## [1] 10000    15

K-nearest neighbor density estimation

We’re going to do peak finding in higher-dimensional space. To do that, we need to know the density of each region. To that end, we’re going to compute the k-nearest neighbors (KNN) of each cell and derive a per-cell density estimate from the neighbor distances.

So let’s first compute the KNN. We’re not going to worry about the choice of k for now, but if you were really implementing this for your data, you would try a handful of k’s and look for an elbow point in the number of clusters each k produces. From what I’ve seen trying a number of k’s here: a smaller k leads to more clusters, but often in a messy way, in that clusters need to be merged downstream. A larger k is cleaner, in that there are fewer clusters, but that may come at the expense of rare subset discovery, which is exactly what X-shift is good at.
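
To make the elbow search concrete, here is what such a scan might look like, assuming you have wrapped the pipeline below into a hypothetical run_xshift(dat, k) helper that returns a vector of cluster IDs:

ks <- c(20, 40, 60, 80, 100, 150)

# run_xshift() is a hypothetical wrapper around the steps in this markdown
n_clusters <- sapply(ks, function(k) length(unique(run_xshift(dat, k))))

plot(ks, n_clusters, type = "b",
     xlab = "k (number of nearest neighbors)",
     ylab = "number of clusters")
# the elbow is where the cluster count stops dropping steeply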

Of note, X-shift uses cosine distance. If you want more of a deep dive on distance metrics, please go to my poster here.
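
For quick intuition: cosine distance compares the direction of two expression vectors and ignores their overall magnitude. A minimal definition in R:

CosineDist <- function(x, y) {
    1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

CosineDist(c(1, 2), c(2, 4)) # parallel vectors: distance 0
CosineDist(c(1, 0), c(0, 1)) # orthogonal vectors: distance 1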

And now we compute the KNN.

library(FNN)

num_neighbors <- 100

# FNN computes Euclidean neighbors and has no cosine option; as a workaround,
# we hand get.knn the pairwise cosine distance matrix (it coerces this to a
# matrix and treats each cell's row of distances as its feature vector)
dist_matrix <- proxy::dist(x = dat, method = "cosine")
nn <- FNN::get.knn(data = dist_matrix, k = num_neighbors)

This gives us both nearest neighbor IDs and nearest neighbor distances.

nn$nn.index[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,] 8464 2419 7965 7800  252
## [2,]  656 5735 8679 3308 3260
## [3,] 5302  296 7981 1548 3598
## [4,] 5246 3046 5145  592 7012
## [5,] 8291 2970 2683 7471 5296
nn$nn.dist[1:5, 1:5]
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 1.519795 1.743521 1.772668 1.828717 1.877079
## [2,] 2.459011 2.677880 2.906997 2.926899 3.139863
## [3,] 1.816496 2.523467 2.740823 2.917900 2.966542
## [4,] 1.935983 2.013207 2.017506 2.033546 2.040030
## [5,] 1.893180 2.144126 2.368645 2.402282 2.428539

Great. So what is the per-cell density? We derive it from the rows of the second matrix: for each cell, we sum the inverses of its neighbor distances. So let’s calculate that now.

The inverse weighting means close neighbors dominate: a neighbor at distance 0.5 contributes 2 to the sum, while a neighbor at distance 10 (say, off on another “island”) contributes only 0.1.

dens <- apply(nn$nn.dist, 1, function(i) {
    curr <- i[-1] # drop the closest neighbor (get.knn already excludes the cell itself, so no entry here is 0)
    result <- sum(1/curr) # inverse-distance weighting
    return(result)
})

Let’s take a brief interlude and plot the density estimate on a UMAP, so you can get some intuition around what exactly was computed.

First, we make the UMAP. To be consistent with X-shift, we are going to use cosine as the distance metric, rather than the default Euclidean distance.

library(umap)

dat_umap <- umap::umap(dat, metric = "cosine")$layout %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
names(dat_umap) <- c("umap1", "umap2")

Now, we plot.

library(ggplot2)

ggplot(dat_umap, aes(x = umap1, y = umap2, color = dens)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal()

We’ll color by the log of the density so we can see a bit more nuance.

library(ggplot2)

ggplot(dat_umap, aes(x = umap1, y = umap2, color = log(dens))) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal()

Ok, we can start to see that even within the islands themselves, there are regions of high density and regions of low density that aren’t always obvious to the eye.

The next step is to make the KNN graph.

First we make an adjacency matrix (1 if two cells are connected, 0 if they’re not).

n <- nrow(dat)
adj_matrix <- matrix(0, n, n)

# Each cell gets a directed edge out to each of its k nearest neighbors
for (i in 1:n) {
  adj_matrix[i, nn$nn.index[i, ]] <- 1
}

adj_matrix[1:10, 1:10]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0    0    0    0    0    1    0    0    0     0
##  [2,]    0    0    0    0    0    0    0    0    0     0
##  [3,]    0    0    0    0    0    0    0    0    0     0
##  [4,]    0    0    0    0    0    0    0    0    0     0
##  [5,]    0    0    0    0    0    0    0    0    0     1
##  [6,]    1    0    0    0    0    0    0    0    0     0
##  [7,]    0    0    0    0    0    0    0    0    0     0
##  [8,]    0    0    0    0    0    0    0    0    0     0
##  [9,]    0    0    0    0    0    0    0    0    0     0
## [10,]    0    0    0    0    0    0    0    1    0     0
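
A quick aside: a dense 10000 x 10000 matrix of doubles takes roughly 800 MB, which will not scale to full-size datasets. A sparse matrix stores only the n * k nonzero entries; here is a sketch using the Matrix package (igraph accepts sparse adjacency matrices directly):

library(Matrix)

adj_sparse <- Matrix::sparseMatrix(
  i = rep(seq_len(n), each = num_neighbors), # each cell, repeated k times
  j = as.vector(t(nn$nn.index)),             # its k neighbor indices
  x = 1,
  dims = c(n, n)
)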

Now let’s make the graph. It’s a directed graph because I want to know whether 1 -> 2 AND 2 -> 1 (KNN relationships are not symmetric: cell 2 can be among cell 1’s nearest neighbors without the reverse being true).

library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:FNN':
## 
##     knn
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following object is masked from 'package:flowCore':
## 
##     normalize
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
g <- graph_from_adjacency_matrix(adj_matrix, mode = "directed")
g
## IGRAPH 401ef98 D--- 10000 1e+06 -- 
## + edges from 401ef98:
##  [1] 1->   6 1->  18 1-> 204 1-> 215 1-> 247 1-> 252 1-> 401 1-> 669 1-> 676
## [10] 1-> 853 1->1203 1->1299 1->1447 1->1464 1->1590 1->1660 1->1718 1->1748
## [19] 1->1799 1->1942 1->1992 1->2070 1->2328 1->2419 1->2483 1->2496 1->2584
## [28] 1->2760 1->3172 1->3234 1->3334 1->3614 1->3823 1->3860 1->3873 1->3955
## [37] 1->4024 1->4046 1->4048 1->4081 1->4182 1->4295 1->4330 1->4494 1->4562
## [46] 1->4986 1->5014 1->5027 1->5120 1->5150 1->5198 1->5241 1->5422 1->5594
## [55] 1->5625 1->5650 1->5810 1->5882 1->5939 1->5969 1->6224 1->6374 1->6526
## [64] 1->6546 1->6612 1->6663 1->6685 1->6714 1->6784 1->6833 1->6860 1->7071
## [73] 1->7126 1->7309 1->7330 1->7368 1->7430 1->7440 1->7538 1->7592 1->7673
## + ... omitted several edges
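
Because KNN relationships are asymmetric, only some of these edges are reciprocated. If you’re curious, igraph can report the fraction of edges that run in both directions:

mean(igraph::which_mutual(g)) # fraction of edges where 1 -> 2 AND 2 -> 1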

Great. Now we have it in an actual graph format. From here, we need to do peak detection. To do this, let’s take a single cell at random and walk it uphill in density until it’s at a peak.

First, let’s add density as a node attribute of the graph.

V(g)$density <- dens
g
## IGRAPH 401ef98 D--- 10000 1e+06 -- 
## + attr: density (v/n)
## + edges from 401ef98:
##  [1] 1->   6 1->  18 1-> 204 1-> 215 1-> 247 1-> 252 1-> 401 1-> 669 1-> 676
## [10] 1-> 853 1->1203 1->1299 1->1447 1->1464 1->1590 1->1660 1->1718 1->1748
## [19] 1->1799 1->1942 1->1992 1->2070 1->2328 1->2419 1->2483 1->2496 1->2584
## [28] 1->2760 1->3172 1->3234 1->3334 1->3614 1->3823 1->3860 1->3873 1->3955
## [37] 1->4024 1->4046 1->4048 1->4081 1->4182 1->4295 1->4330 1->4494 1->4562
## [46] 1->4986 1->5014 1->5027 1->5120 1->5150 1->5198 1->5241 1->5422 1->5594
## [55] 1->5625 1->5650 1->5810 1->5882 1->5939 1->5969 1->6224 1->6374 1->6526
## [64] 1->6546 1->6612 1->6663 1->6685 1->6714 1->6784 1->6833 1->6860 1->7071
## + ... omitted several edges

Now we set up a single instance of peak detection.

FindPeak <- function(g, start_node, verbose = FALSE) {
  current_node <- start_node
  neighbors <- igraph::neighbors(g, current_node)
  
  # Check if there are any neighbors with higher density
  while(any(V(g)$density[neighbors] > V(g)$density[current_node])) {
    
    if(verbose) {
        print(current_node)
    }
    
    # Find the neighbor with the highest density
    next_node <- neighbors[which.max(V(g)$density[neighbors])]
    
    # Move to the next node
    current_node <- next_node
    
    # Update the list of neighbors
    neighbors <- igraph::neighbors(g, current_node)
  }
  
  return(current_node)
}

And now we test it.

# Call the function with the start node
peak_node <- FindPeak(g, start_node = 10, verbose = TRUE)
## [1] 10
## + 1/10000 vertex, from 401ef98:
## [1] 1476
## + 1/10000 vertex, from 401ef98:
## [1] 9663
peak_node
## + 1/10000 vertex, from 401ef98:
## [1] 7233

Ok, now you can see what it’s doing. We’re walking along the graph until we get to a peak. I note that with a lower k, the landscape is more rugged, so there are more peaks (which means more clusters downstream). This is worth considering.

Now we have to do this peak detection for every cell. The cluster ID will be the peak each cell climbs to.

clusters <- lapply(V(g), function(i) {
    FindPeak(g = g, start_node = i, verbose = FALSE)
}) %>% unlist()
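
Calling FindPeak once per cell repeats a lot of work, because many cells share the same uphill path. A faster variant of my own (a sketch, not how X-shift does it) computes each cell’s single best uphill neighbor once, then follows those pointers with path halving until every cell lands on its peak. It should produce the same assignments as the loop above:

# Each cell points to its highest-density neighbor, or to itself if it is a peak
uphill <- sapply(seq_len(n), function(i) {
    nbrs <- nn$nn.index[i, ]
    best <- nbrs[which.max(dens[nbrs])]
    if (dens[best] > dens[i]) best else i
})

ptr <- uphill
while (!all(ptr == ptr[ptr])) {
    ptr <- ptr[ptr] # path halving: follow two pointer hops at a time
}
clusters_fast <- ptr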

Great. Let’s see what we have.

table(clusters) %>% sort(decreasing = TRUE)
## clusters
## 7233 1921 5625 6192 7527 9099 6172 6089 8270 7710 5023 
## 1718 1549 1434 1041 1016  880  763  676  522  294  107
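
Since rare subset detection is X-shift’s selling point, it’s worth reading this as frequencies too. The smallest cluster here, at 107 cells, is about 1% of the data, exactly the kind of population that coarser methods tend to absorb into a larger cluster.

table(clusters) %>% prop.table() %>% sort(decreasing = TRUE) %>% round(3)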

Alright, now let’s view this on the UMAP to see if it makes sense.

library(ggplot2)

ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clusters))) +
  geom_point() +
  theme_minimal()

To hammer the point home, let’s color the map by the peaks themselves. First we make an is_peak boolean.

peaks <- unique(clusters)
is_peak <- seq_len(nrow(dat)) %in% peaks # TRUE for cells that are themselves peaks

And now we color the map and superimpose the peaks on top.

# then plot as before
ggplot(dat_umap, aes(x = umap1, y = umap2, color = as.factor(clusters))) +
  geom_point() +
  geom_point(data = dat_umap[is_peak,], color = "black") +
  theme_minimal()

Ok, so we have clusters that make sense. The real X-shift has many automated optimizations embedded within it, for example to properly optimize k. But hopefully what I’ve done here gives you an idea of what X-shift is actually doing.

What full-fledged X-shift has

What is the difference between this simplified implementation of X-shift and the full-fledged software? A lot. I’ll list a few things here, some of which are from direct correspondence with Samusik after showing him the above.

For starters, X-shift is written in Java, not R, which gave Samusik much more control over the machine. A lot of X-shift was an engineering problem: although the overall concept is easy to understand, it is very computationally expensive to actually run, especially on multi-million-cell datasets. A lot of clever low-level programming went into turning the concept into reality.
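
You can get a small taste of that in R with the base parallel package. Below is a sketch of my own (a pale imitation of hand-tuned Java threads) that farms the per-cell hill climb out across cores:

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("g", "FindPeak")) # ship the graph and the function to the workers
invisible(clusterEvalQ(cl, library(igraph)))

clusters_par <- parSapply(cl, seq_len(igraph::vcount(g)), function(i) {
  as.integer(FindPeak(g = g, start_node = i))
})

stopCluster(cl)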

In the implementation above, I use a given k for density estimation, and in effect k = 1 for hill climbing, since each step moves to the single best neighbor. The X-shift implementation does hill climbing across a separately computed k, with an embedded statistical test. There is also a cluster merging step at the end, based on a modified version of the Mahalanobis distance. I plan to implement these in later drafts if I have time, which may make my clusters look a bit less messy.
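
X-shift’s actual merging criterion is more sophisticated than anything I’ll show here, but to give a flavor of the idea, here is a naive sketch (emphatically not X-shift’s test) that merges clusters whose centroids sit close together:

centroids <- dat %>%
  mutate(cluster = clusters) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))

# Hierarchically cluster the centroids and cut at a hand-picked height;
# clusters that fall in the same branch get merged
merge_tree <- hclust(dist(select(centroids, -cluster)))
merged <- cutree(merge_tree, h = 1) # the threshold h = 1 is arbitrary
clusters_merged <- merged[match(clusters, centroids$cluster)]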

The software that runs X-shift, called VorteX, also contains visualization tools that Samusik developed. Among them is a force-directed cell embedding that, while computationally intensive, produces output more comparable to UMAP than to t-SNE (the dominant dimension reduction method at the time of publication), in that trajectories can be inferred.
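
If you want a rough feel for a force-directed embedding without VorteX, igraph ships a Fruchterman-Reingold layout. It is slow at this size and it is not Samusik’s implementation, but it is the same family of idea:

fd <- igraph::layout_with_fr(g) %>% as_tibble() # can take a while at 10000 cells
names(fd) <- c("fd1", "fd2")

ggplot(fd, aes(x = fd1, y = fd2, color = as.factor(clusters))) +
  geom_point() +
  theme_minimal()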

Overall, I hope my simplified version of X-shift, and the explanation of the rest of it, gave you an idea of how X-shift works. Even if you don’t use the tool, I think it is important to understand the concept behind it, as this was the best method for rare cell subset detection in high-dimensional flow and mass cytometry data. That suggests there are unexplored uses in newer high-dimensional data types, like spatial transcriptomics, high-dimensional imaging (e.g., MIBI, CODEX), and single-cell sequencing.