The purpose of this markdown is to determine what the effective dimensionality of a CyTOF dataset. We’re going to use some functionality within the Seurat package, that is typically used for single-cell sequencing.
Let’s get started by reading in our file. You’ll recognize this as my analysis pipeline for one file.
The data analyzed in this markdown is healthy PBMCs from the Stanford HIMC, and can be found here. Note the PDF at the bottom, which explains more about the dataset.
Here, we read in the fcs file and pull out the expression matrix.
library(flowCore)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
set.seed(1)
setwd(here::here("data"))
ff <- flowCore::read.FCS('Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs')
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Nd142Di' channel exceed its $PnR value 1509 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Yb176Di' channel exceed its $PnR value 1349 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
## name desc range minRange maxRange
## $P1 Ce140Di beads1 2911 0 7
## $P2 Ce142Di beads2 1509 0 1
## $P3 Er167Di CD27 781 0 17
## $P4 Er170Di CD3 1664 0 27
## $P5 Eu151Di CD123 2814 0 16
## ... ... ... ... ... ...
## $P26 Tm169Di CD45RA 12000 0 23
## $P27 Yb172Di CD38 5629 0 18
## $P28 Yb174Di HLADR 6990 0 11
## $P29 Yb176Di CD56 1349 0 12
## $P30 Time Time 0 0 1
## 261 keywords are stored in the 'description' slot
This is to get the actual marker names rather than the marker channels, and have those be the column names on the expression matrix. The default expression matrix column names are the marker channels.
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc
# All NA values in marker names will revert to mass tags names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params
And here’s the expression matrix.
exp <- flowCore::exprs(ff)
as_tibble(exp)
## # A tibble: 93,816 × 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event_le…¹ CD33 CD14
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 1.74 2.35 2.09 0 24.3 3.41 11.7
## 2 0 0 0 56.1 3.39 16.7 0 25.2 0.686 8.97
## 3 0 0 0.437 2.93 6.23 228. 0 17.7 2.23 6.48
## 4 0 0 108. 288. 5.61 91.3 0 15.9 2.84 2.30
## 5 2.48 0 9.69 7.18 3.35 22.2 0 20.5 5.29 15.9
## 6 1.49 0 27.6 191. 2.44 2.25 0 24.3 0 0.0995
## 7 1.57 0 26.0 160. 2.30 46.6 0 27.1 0.894 1.97
## 8 0 0 1.29 42.2 2.56 22.9 0 23.3 0 11.9
## 9 0 0 4.12 9.30 0 14.1 0 19.6 8.39 22.4
## 10 0 0 0.715 0 3.28 1.21 0 19.6 70.8 18.2
## # … with 93,806 more rows, 20 more variables: CD61 <dbl>, cells1 <dbl>,
## # cells2 <dbl>, beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>,
## # CD8A <dbl>, CD16 <dbl>, CD235 <dbl>, livedead <dbl>, CD20 <dbl>,
## # CD66 <dbl>, CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>,
## # CD56 <dbl>, Time <dbl>, and abbreviated variable name ¹Event_length
Let’s go ahead and do the asinh transform before we put the data into Seurat. We’ll do the asinh on the whole thing.
exp <- asinh(exp/5)
as_tibble(exp)
## # A tibble: 93,816 × 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event…¹ CD33 CD14 CD61
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.342 0.454 0.407 0 2.28 0.638 1.59 0.199
## 2 0 0 0 3.11 0.635 1.92 0 2.32 0.137 1.35 0.0704
## 3 0 0 0.0872 0.557 1.05 4.51 0 1.98 0.432 1.08 1.03
## 4 0 0 3.76 4.75 0.965 3.60 0 1.87 0.542 0.446 0.527
## 5 0.478 0 1.42 1.16 0.627 2.19 0 2.12 0.922 1.88 0.594
## 6 0.294 0 2.41 4.34 0.470 0.435 0 2.28 0 0.0199 0.716
## 7 0.308 0 2.35 4.16 0.444 2.93 0 2.39 0.178 0.384 0.662
## 8 0 0 0.255 2.83 0.492 2.22 0 2.24 0 1.60 0.573
## 9 0 0 0.752 1.38 0 1.76 0 2.07 1.29 2.20 0
## 10 0 0 0.142 0 0.617 0.240 0 2.07 3.34 2.01 1.34
## # … with 93,806 more rows, 19 more variables: cells1 <dbl>, cells2 <dbl>,
## # beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>,
## # CD235 <dbl>, livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>,
## # CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>,
## # and abbreviated variable name ¹Event_length
Let’s only keep the features we care about. The surface markers.
# Filter out markers we're not using
to_remove <- c("CD45", "CD66", "CD61", "CD235")
to_remove <- which(params %in% to_remove)
# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))
surface <- c(cd, ig, hla, ccr)
surface <- surface[!(surface %in% to_remove)]
exp <- exp[,surface]
Ok, now it’s time to read it into Seurat. We have to transpose the matrix, because Seurat’s convention is that the cells are columns and the features are rows.
library(Seurat)
## Attaching SeuratObject
# Some subsampling first
exp <- exp[sample(nrow(exp), 10000),]
rownames(exp) <- 1:nrow(exp)
exp <- t(exp)
dat <- Seurat::CreateSeuratObject(counts = exp)
dat
## An object of class Seurat
## 15 features across 10000 samples within 1 assay
## Active assay: RNA (15 features, 0 variable features)
Ok, let’s get to work. We have to do a couple of functions that pretend that this is RNA.
We trick Seurat into finding the top n variable features of n parameters (in other words, do nothing). This is because we don’t need to find the most variable features. We’re going to use all of them.
We typically don’t scale our CyTOF data, but Seurat requires it. Thus, we run ScaleData, but we actually say in the code to not scale the data, so the function does nothing. Again, we have to trick Seurat into thinking this is single-cell sequencing data.
dat <- Seurat::FindVariableFeatures(dat, nfeatures = nrow(dat))
## Warning in sqrt(sum.squares/one.delta): NaNs produced
dat <- ScaleData(object = dat, do.scale = FALSE, do.center = FALSE) # Do nothing.
We perform PCA, which will help us determine the effective dimensionality later, something that is very important for single-cell sequencing analysis, as you’re dealing with thousands of dimensions. To deal with that many dimensions, Seurat figures out how many principal components describe most of the variance, and then does clustering and dimension reduction on those principal components themselves.
This is not extremely relevant for CyTOF data, aside from the question we have about effective dimensionality, but we’ll explore PCA anyway.
# Perform PCA
dat <- RunPCA(dat, npcs = 15)
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## Warning: Requested number is larger than the number of available items (15).
## Setting to 15.
## Warning: Requested number is larger than the number of available items (15).
## Setting to 15.
## Warning: Requested number is larger than the number of available items (15).
## Setting to 15.
## Warning: Requested number is larger than the number of available items (15).
## Setting to 15.
## Warning: Requested number is larger than the number of available items (15).
## Setting to 15.
## PC_ 1
## Positive: CD45RA, CD3, CD8A, CD16, CD38, CD20, CD4, CD27
## Negative: CD19, CD123, CD33, CD11C, CD56, CD14, HLADR
## PC_ 2
## Positive: CD3, CD27, CD4, CD8A, CD123, CD14, CD20, CD16
## Negative: HLADR, CD11C, CD56, CD33, CD38, CD19, CD45RA
## PC_ 3
## Positive: HLADR, CD33, CD11C, CD4, CD27, CD38, CD3, CD14
## Negative: CD45RA, CD8A, CD56, CD16, CD20, CD123, CD19
## PC_ 4
## Positive: CD19, CD45RA, HLADR, CD20, CD27, CD38, CD3, CD4
## Negative: CD11C, CD56, CD8A, CD16, CD14, CD33, CD123
## PC_ 5
## Positive: CD8A, HLADR, CD20, CD19, CD33, CD14, CD123, CD3
## Negative: CD38, CD56, CD45RA, CD27, CD4, CD11C, CD16
The positive and negative loadings for each PC are the markers that have respective positive and negative covariances with the given PC. The magnitude of the loading describes the strength of the relationship between the variable and the given PC.
# Examine and visualize PCA results a few different ways
print(dat[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CD45RA, CD3, CD8A, CD16, CD38
## Negative: CD19, CD123, CD33, CD11C, CD56
## PC_ 2
## Positive: CD3, CD27, CD4, CD8A, CD123
## Negative: HLADR, CD11C, CD56, CD33, CD38
## PC_ 3
## Positive: HLADR, CD33, CD11C, CD4, CD27
## Negative: CD45RA, CD8A, CD56, CD16, CD20
## PC_ 4
## Positive: CD19, CD45RA, HLADR, CD20, CD27
## Negative: CD11C, CD56, CD8A, CD16, CD14
## PC_ 5
## Positive: CD8A, HLADR, CD20, CD19, CD33
## Negative: CD38, CD56, CD45RA, CD27, CD4
VizDimLoadings(dat, dims = 1:2, reduction = "pca")
## Warning: Requested number is larger than the number of available items (15).
## Setting to 15.
## Warning: Requested number is larger than the number of available items (15).
## Setting to 15.
DimPlot(dat, reduction = "pca")
The plot above demonstrates why we use t-SNE or UMAP rather than the first two principal components to visualize the “map” of our CyTOF data.
Now for one of the big points of this markdown. What is the effective dimensionality of CyTOF data. Think of this like information content. What is the minimum number of columns you would need to squeeze most of the dataset’s information into? For example, single-cell RNA sequencing PBMC data can be upwards of 20,000 dimensions, from which we choose 2000 most variable features, from which the effective dimensionality is within the CyTOF range of 25-50, or less.
We would expect the CyTOF PBMC effective dimensionality to be comparable to the number of surface markers in the data (in other words, no less than the data). This is because the markers we choose are carefully curated and titrated. We do feature selection before the experiment even starts, as opposed to after the experiment on the dataset.
The JackStraw plot is not working for whatever reason (hyp: Seurat is not used to datasets with this few dimensions), so we’re going to use a simple PCA elbow plot instead.
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
# dat <- JackStraw(dat, num.replicate = 10)
# dat <- ScoreJackStraw(dat, dims = 1:10)
# JackStrawPlot(dat, dims = 1:10)
For the plot below, each point is a PC. We should see an elbow point, which in single-cell sequencing data, would correspond to the effective dimensionality. If you wanted to quantify it, you could ask how many principal components make up 95% of the variance.
ElbowPlot(dat)
## Warning in ElbowPlot(dat): The object only has information for 14 reductions
# How many PCs explain 95% of the variance?
dat_variance <- dat@reductions$pca@stdev**2
dat_variance <- cumsum(dat_variance)/sum(dat_variance)
dat_variance
## [1] 0.7205933 0.8151199 0.8608261 0.9012838 0.9330023 0.9520762 0.9646187
## [8] 0.9738233 0.9813223 0.9863756 0.9910232 0.9945194 0.9973815 1.0000000
which(dat_variance > 0.95)[1]
## [1] 6
Kinda weird. This suggests that the majority of signal is in the first 5 PCs, with a little more signal in the next 5 PCs. Nothing after 10. This is of 15 dimensions in total. Specifically, 6 PCs explain 95% of the variance. This is very counter-intuitive.
Let’s do a quick sanity check to make sure our CyTOF data is still what it was going in. First, we’re going to pull out the data and do a simple biaxial plot. This will also show you how to pull the data out of Seurat.
test <- dat@assays$RNA@scale.data %>% t() %>% as_tibble
all(test == t(exp))
## [1] TRUE
ggplot(test, aes(x = CD3, y = CD33)) + geom_point()
Ok, that looks fine. Now we’ll go back to Seurat and do the rest of the work-up. The following is basically PhenoGraph.
dat <- FindNeighbors(dat, dims = 1:14)
## Computing nearest neighbor graph
## Computing SNN
dat <- FindClusters(dat, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10000
## Number of edges: 347927
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 15
## Elapsed time: 0 seconds
Seurat also has a nice wrapper around UMAP, and respective visualizations.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
dat <- RunUMAP(dat, dims = 1:14)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 21:12:47 UMAP embedding parameters a = 0.9922 b = 1.112
## 21:12:47 Read 10000 rows and found 14 numeric columns
## 21:12:47 Using Annoy for neighbor search, n_neighbors = 30
## 21:12:47 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 21:12:48 Writing NN index file to temp file /var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T//Rtmpc19d2W/file1f2e59150ec
## 21:12:48 Searching Annoy index using 1 thread, search_k = 3000
## 21:12:50 Annoy recall = 100%
## 21:12:50 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 21:12:50 Initializing from normalized Laplacian + noise (using irlba)
## 21:12:50 Commencing optimization for 500 epochs, with 428936 positive edges
## 21:13:01 Optimization finished
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(dat, reduction = "umap")
The table below is pretty cool, and a nice example of something that could be imported over to a CyTOF pipeline. What markers statistically define a cluster relative to other markers. Usually, I have to eyeball a heatmap or color UMAPs by each marker. This is a more systematic way to do it.
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(dat, only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25,
test.use = "wilcox")
## Calculating cluster 0
## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC) %>%
print(n = nrow(pbmc.markers))
## # A tibble: 28 × 7
## # Groups: cluster [15]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 0 0.843 1 0.982 0 0 CD8A
## 2 6.61e- 78 0.377 1 0.902 9.92e- 77 0 CD45RA
## 3 0 0.871 1 0.836 0 1 CD27
## 4 0 0.586 0.999 0.938 0 1 CD3
## 5 1.96e-290 0.734 0.999 0.837 2.94e-289 2 CD27
## 6 0 0.575 1 0.971 0 2 CD4
## 7 0 1.38 0.997 0.862 0 3 CD56
## 8 0 0.745 0.999 0.995 0 3 CD16
## 9 7.55e-274 0.738 1 0.984 1.13e-272 4 CD8A
## 10 9.16e- 29 0.277 0.957 0.849 1.37e- 27 4 CD27
## 11 0 2.00 1 0.579 0 5 CD19
## 12 0 1.30 1 0.866 0 5 HLADR
## 13 0 1.62 1 0.753 0 6 CD33
## 14 0 1.42 1 0.732 0 6 CD11C
## 15 1.05e-129 0.426 0.997 0.974 1.58e-128 7 CD4
## 16 7.42e-252 0.820 1 0.984 1.11e-250 8 CD8A
## 17 1.92e-173 0.770 1 0.85 2.88e-172 8 CD27
## 18 4.80e-201 1.48 1 0.741 7.20e-200 9 CD11C
## 19 1.11e-144 1.10 1 0.872 1.66e-143 9 HLADR
## 20 2.10e- 56 0.449 1 0.945 3.16e- 55 10 CD3
## 21 3.30e- 48 1.14 1 0.879 4.94e- 47 11 CD56
## 22 1.14e- 50 0.692 1 0.996 1.71e- 49 11 CD16
## 23 7.13e- 45 1.73 1 0.608 1.07e- 43 12 CD19
## 24 1.29e- 36 1.19 1 0.876 1.93e- 35 12 HLADR
## 25 3.12e- 47 1.63 1 0.879 4.68e- 46 13 CD56
## 26 2.25e- 27 0.969 1 0.748 3.37e- 26 13 CD11C
## 27 4.86e- 16 1.60 1 0.916 7.28e- 15 14 CD123
## 28 4.95e- 13 1.26 1 0.876 7.42e- 12 14 HLADR
Let’s see what their per-cell heatmap looks like. It’s built to look at a subset of cells, in order for us to see dropout, but this is not relevant for CyTOF data.
pbmc.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(dat, features = top10$gene) + NoLegend()
Interestingly, we do see instances where clusters have individual cells with varying levels of a marker (the stripey looking boxes). If this were a single-cell sequencing dataset, I’d interpret that as dropout. Maybe this is just natural variance of expression, and we need a better coloring scheme.