The following is a data analysis pipeline for a single fcs file. It gives the user the ability to import an fcs file, turn it into an expression matrix, cluster it, make a frequency table, perform dimensionality reduction, and produce various visualizations. This is a small piece of the CyTOF analysis scripts I have optimized over the past several years, and it is grounded in best practices in the field.
This pipeline is intended to be used after the input file has been pre-gated, typically on live/dead and on DNA by event_length (analogous to the scatter gates in flow cytometry).
Now let’s get acquainted with how CyTOF data are imported and processed. The data are stored as fcs files, a file format that was used for flow cytometry long before CyTOF was developed. We are going to use the flowCore package to import the fcs file into a data structure called a flowFrame. In reality, all we need is the expression matrix, which has each cell as a row and the markers as columns. Here is how you do that.
First, we set the working directory to the folder named “data” located alongside the R script. The working directory is where R looks for files and saves outputs. The here::here() function, from the here package, builds paths relative to the project root, which makes it easier to find files and folders. There’s only one file in there. The data analyzed in this markdown are healthy PBMCs from the Stanford HIMC, and can be found here. Note the PDF at the bottom, which explains more about the dataset.
We use the read.FCS() function from the flowCore package. It returns an object of class flowFrame, which contains the data and metadata of the fcs file.
library(flowCore)
library(tidyverse)
set.seed(1)
setwd(here::here("data", "mike_LISLinnegCD45pos-101917-relabeled-CORRECT"))
ff <- flowCore::read.FCS("Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs")
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
## name desc range minRange maxRange
## $P1 Ce140Di beads1 2911 0 7
## $P2 Ce142Di beads2 1509 0 1
## $P3 Er167Di CD27 781 0 17
## $P4 Er170Di CD3 1664 0 27
## $P5 Eu151Di CD123 2814 0 16
## ... ... ... ... ... ...
## $P26 Tm169Di CD45RA 12000 0 23
## $P27 Yb172Di CD38 5629 0 18
## $P28 Yb174Di HLADR 6990 0 11
## $P29 Yb176Di CD56 1349 0 12
## $P30 Time Time 0 0 1
## 261 keywords are stored in the 'description' slot
There’s a little bit we have to add here so that the column names of the expression matrix are the marker names and not the mass channel names.
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc
# Any NA marker names revert to the mass tag names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params
We need to transform the data. In general, single-cell data are transformed either logarithmically or log-like. For CyTOF, the best practice is the asinh transform with a scale argument of 5. This means that you divide all the values of the matrix by 5 and then perform an asinh transform on the result. This is how CyTOF data have always been scaled from the beginning, though it doesn’t look radically different from log(x + 1), which is used for single-cell sequencing. The goal is to make the data look as flow cytometry-like as possible (e.g. you can gate it).
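For intuition, here is a minimal, self-contained sketch (toy values, not drawn from the dataset in this markdown) comparing the asinh/5 transform with log(x + 1):

```r
# Toy raw intensities spanning a few orders of magnitude
x <- c(0, 1, 5, 50, 500, 5000)

# Standard CyTOF transform: divide by the cofactor (5), then asinh
asinh(x / 5)

# Log-like transform common in single-cell sequencing, for comparison
log(x + 1)
```

Both compress large values roughly logarithmically; asinh has the nice property of being defined at (and nearly linear around) zero, which matters given CyTOF’s many true-zero counts.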
Below, I show you what the expression matrix looks like before and after the transform, and then I set the new expression matrix values to that which is stored in the flow frame.
exp <- flowCore::exprs(ff)
as_tibble(exp)
## # A tibble: 93,816 Ă— 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event_le…¹ CD33 CD14
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 1.74 2.35 2.09 0 24.3 3.41 11.7
## 2 0 0 0 56.1 3.39 16.7 0 25.2 0.686 8.97
## 3 0 0 0.437 2.93 6.23 228. 0 17.7 2.23 6.48
## 4 0 0 108. 288. 5.61 91.3 0 15.9 2.84 2.30
## 5 2.48 0 9.69 7.18 3.35 22.2 0 20.5 5.29 15.9
## 6 1.49 0 27.6 191. 2.44 2.25 0 24.3 0 0.0995
## 7 1.57 0 26.0 160. 2.30 46.6 0 27.1 0.894 1.97
## 8 0 0 1.29 42.2 2.56 22.9 0 23.3 0 11.9
## 9 0 0 4.12 9.30 0 14.1 0 19.6 8.39 22.4
## 10 0 0 0.715 0 3.28 1.21 0 19.6 70.8 18.2
## # … with 93,806 more rows, 20 more variables: CD61 <dbl>, cells1 <dbl>,
## # cells2 <dbl>, beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>,
## # CD8A <dbl>, CD16 <dbl>, CD235 <dbl>, livedead <dbl>, CD20 <dbl>,
## # CD66 <dbl>, CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>,
## # CD56 <dbl>, Time <dbl>, and abbreviated variable name ¹​Event_length
exp <- asinh(exp/5)
as_tibble(exp)
## # A tibble: 93,816 Ă— 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event…¹ CD33 CD14 CD61
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.342 0.454 0.407 0 2.28 0.638 1.59 0.199
## 2 0 0 0 3.11 0.635 1.92 0 2.32 0.137 1.35 0.0704
## 3 0 0 0.0872 0.557 1.05 4.51 0 1.98 0.432 1.08 1.03
## 4 0 0 3.76 4.75 0.965 3.60 0 1.87 0.542 0.446 0.527
## 5 0.478 0 1.42 1.16 0.627 2.19 0 2.12 0.922 1.88 0.594
## 6 0.294 0 2.41 4.34 0.470 0.435 0 2.28 0 0.0199 0.716
## 7 0.308 0 2.35 4.16 0.444 2.93 0 2.39 0.178 0.384 0.662
## 8 0 0 0.255 2.83 0.492 2.22 0 2.24 0 1.60 0.573
## 9 0 0 0.752 1.38 0 1.76 0 2.07 1.29 2.20 0
## 10 0 0 0.142 0 0.617 0.240 0 2.07 3.34 2.01 1.34
## # … with 93,806 more rows, 19 more variables: cells1 <dbl>, cells2 <dbl>,
## # beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>,
## # CD235 <dbl>, livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>,
## # CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>,
## # and abbreviated variable name ¹​Event_length
flowCore::exprs(ff) <- exp
The next step is to cluster the data. For that, we’re going to use FlowSOM, which is generally the best practice for clustering in the CyTOF field, both in terms of run time and accuracy in comparison to manual gating. For rare subset discovery, consider using X-shift. But for the projects I have done, especially in pharma, FlowSOM has gotten the job done.
FlowSOM builds what is called a self-organizing map with a default of 100 clusters evenly spaced in your data. Then similar clusters are merged until you have a specific pre-set number of clusters. How do you know how many clusters are in your data? The spoiler alert is you’re going to have to run this tool a number of times. I would think about how many subsets you expect given your panel, and how many islands you see on your UMAP (stay tuned). It’s better that you overcluster than undercluster.
We use the high-level FlowSOM command, which performs the whole process.
We need to specify which columns to use. I’m going to take a shortcut and simply select markers by name patterns (CD, Ig, HLA, CCR/CXC). The commands below search for the quoted patterns in the vector of marker names.
params <- ff@parameters@data$desc
# Filter out markers we're not using
to_remove <- c("CD45", "CD66", "CD61", "CD235")
to_remove <- which(params %in% to_remove)
# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))
surface <- c(cd, ig, hla, ccr)
surface <- surface[!(surface %in% to_remove)]
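To make the grep logic concrete, here is a tiny self-contained illustration on a made-up panel (the marker names below are hypothetical, not the panel above):

```r
# Hypothetical mini-panel
params_demo <- c("CD3", "CD45", "HLADR", "CCR7", "beads1", "livedead", "IgD")
to_remove_demo <- which(params_demo %in% c("CD45"))

# Same pattern-matching approach as above
keep <- c(grep("CD", params_demo), grep("Ig", params_demo),
          grep("HLA", params_demo), grep("CCR", params_demo),
          grep("CXC", params_demo))
keep <- keep[!(keep %in% to_remove_demo)]
params_demo[keep]  # "CD3" "IgD" "HLADR" "CCR7"
```

One caveat: grep("CD", ...) matches any name containing "CD", so in larger panels double-check that nothing unintended (e.g. a barcoding channel) slips in.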
The arguments to pay attention to are xdim and ydim, which specify the size of the SOM. Right now I have it set at 10x10, which gives you 100 clusters to start with. You can go higher, but the runtime goes up. Sofie Van Gassen (the author) has the default at 7x7. The other one is colsToUse, which specifies the columns of the input flowFrame that will be used for clustering. Which do you want to cluster with? In general, surface markers rather than intracellular markers.
What is the final number of clusters? That’s the nClus argument. The algorithm will produce 100 clusters, and then merge clusters until we have that number of clusters, which needs to be less than xdim*ydim. We’ll be able to see what this looks like later.
library(FlowSOM)
fsom <- FlowSOM(ff,
# Input options:
compensate = FALSE,
transform = FALSE,
scale = FALSE,
# SOM options:
colsToUse = surface, xdim = 10, ydim = 10,
# Metaclustering options:
nClus = 22)
fsom
## FlowSOM model trained on 93816 cells and 15 markers,
## using a 10x10 grid (100 clusters) and 22 metaclusters.
##
## Markers used: CD27, CD3, CD123, CD33, CD14, CD19, CD4, CD8A, CD16, CD20, CD11C, CD45RA, CD38, CD56, HLADR
##
## Metacluster cell count:
## MC 1 MC 2 MC 3 MC 4 MC 5 MC 6 MC 7 MC 8 MC 9 MC 10 MC 11 MC 12 MC 13
## 18057 5913 269 2438 876 7312 151 6799 12408 1858 252 20851 347
## MC 14 MC 15 MC 16 MC 17 MC 18 MC 19 MC 20 MC 21 MC 22
## 653 342 487 161 1019 12047 121 787 668
##
## 717 cells (0.76%) are further than 4 MAD from their cluster center.
The next question you’re going to have when you cluster the data is what clusters express what markers, and what clusters appear at what frequencies. There are great tools for this, and in later markdowns, I’m going to show you how to use them. What I’m going to do here is show you how to do it from scratch.
First, let’s look at percent of cells per cluster. To do that, we have to better understand what is inside the fsom object we just made.
We have the data here.
cells <- fsom$data %>% as_tibble()
cells
## # A tibble: 93,816 Ă— 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event…¹ CD33 CD14 CD61
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.342 0.454 0.407 0 2.28 0.638 1.59 0.199
## 2 0 0 0 3.11 0.635 1.92 0 2.32 0.137 1.35 0.0704
## 3 0 0 0.0872 0.557 1.05 4.51 0 1.98 0.432 1.08 1.03
## 4 0 0 3.76 4.75 0.965 3.60 0 1.87 0.542 0.446 0.527
## 5 0.478 0 1.42 1.16 0.627 2.19 0 2.12 0.922 1.88 0.594
## 6 0.294 0 2.41 4.34 0.470 0.435 0 2.28 0 0.0199 0.716
## 7 0.308 0 2.35 4.16 0.444 2.93 0 2.39 0.178 0.384 0.662
## 8 0 0 0.255 2.83 0.492 2.22 0 2.24 0 1.60 0.573
## 9 0 0 0.752 1.38 0 1.76 0 2.07 1.29 2.20 0
## 10 0 0 0.142 0 0.617 0.240 0 2.07 3.34 2.01 1.34
## # … with 93,806 more rows, 19 more variables: cells1 <dbl>, cells2 <dbl>,
## # beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>,
## # CD235 <dbl>, livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>,
## # CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>,
## # and abbreviated variable name ¹​Event_length
We have the cluster and metacluster ID per cell here.
clusters <- FlowSOM::GetClusters(fsom)
clusters[1:10]
## [1] 40 46 78 41 100 2 4 5 99 10
metaclusters <- FlowSOM::GetMetaclusters(fsom) # Note that this is a vector of factors
metaclusters[1:10]
## [1] 8 6 19 9 19 1 1 1 19 2
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
cells$cluster <- clusters
cells$metacluster <- metaclusters
Now we tabulate the cell frequencies per cluster. Note that the spread in percent of cells per cluster widens for the metaclusters: the metaclusters approximate actual cell subsets in our data, whose frequencies vary widely. We’re expressing these as percentages.
table(clusters) %>% sort() %>% `/`(nrow(ff)) %>% `*`(100)
## clusters
## 28 20 38 80 60 69 21
## 0.05755948 0.06395498 0.10339388 0.12897587 0.17161252 0.18013985 0.25155624
## 61 44 18 58 98 59 50
## 0.25262215 0.26861090 0.28673147 0.28673147 0.31764305 0.33896137 0.36454336
## 48 7 97 53 12 17 74
## 0.36987294 0.38053211 0.39438902 0.40611410 0.40824593 0.41890509 0.41890509
## 63 77 36 72 94 68 34
## 0.48179415 0.50524431 0.54148546 0.55427646 0.57666070 0.58092436 0.58731986
## 86 24 96 85 31 67 49
## 0.61290185 0.63422018 0.64487934 0.67898866 0.68218641 0.68218641 0.69604332
## 22 25 33 76 35 83 15
## 0.70350473 0.71416390 0.72482306 0.74081180 0.76319604 0.76319604 0.77705295
## 55 75 27 57 54 19 64
## 0.77918479 0.79623945 0.79943720 0.80689861 0.80903044 0.81436002 0.83248060
## 95 16 43 45 73 1 46
## 0.83887610 0.84527160 0.84953526 0.88897416 0.90389699 0.92201757 0.92521532
## 81 93 87 29 47 90 66
## 0.92947898 0.98170888 0.99769762 0.99982945 1.00302720 1.01155453 1.04140019
## 39 84 14 37 23 56 26
## 1.04992752 1.09043234 1.18316705 1.20981496 1.22793553 1.23539695 1.25031977
## 78 71 88 13 92 32 10
## 1.25031977 1.28123135 1.36224098 1.38036156 1.40381172 1.46883261 1.49334868
## 42 89 79 11 65 3 41
## 1.55303999 1.55836957 1.56050141 1.59461073 1.64364288 1.66815895 1.68094994
## 82 6 5 91 4 70 99
## 1.68094994 1.69160911 1.73211393 1.77261874 1.77688241 1.78434382 1.81845314
## 2 100 51 62 8 40 52
## 1.90266053 2.02417498 2.07000938 2.11371195 2.13716210 2.24481965 2.25121514
## 9 30
## 2.60829709 2.95258804
table(metaclusters) %>% sort() %>% `/`(nrow(ff)) %>% `*`(100)
## metaclusters
## 20 7 17 11 3 15 13
## 0.1289759 0.1609534 0.1716125 0.2686109 0.2867315 0.3645434 0.3698729
## 16 14 22 21 5 18 10
## 0.5191012 0.6960433 0.7120321 0.8388761 0.9337426 1.0861687 1.9804724
## 4 2 8 6 19 9 1
## 2.5987038 6.3027629 7.2471647 7.7939797 12.8410932 13.2258890 19.2472499
## 12
## 22.2254200
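The same percentages can also be computed with dplyr verbs, which scales more naturally once you have many files. Here is a self-contained sketch on a toy metacluster vector (a stand-in for the metaclusters factor above):

```r
library(tidyverse)

# Toy assignments: 6 cells across 3 metaclusters
mc_demo <- factor(c(1, 1, 2, 2, 2, 3))

tibble(metacluster = mc_demo) %>%
  count(metacluster, name = "n_cells") %>%
  mutate(pct = 100 * n_cells / sum(n_cells)) %>%
  arrange(pct)
```

On the real data, swapping mc_demo for the metaclusters vector should reproduce the numbers from the table() pipeline above.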
In later markdowns, this will take the form of a matrix, where each column is a different fcs file.
Now, let’s look at average expression. We’re going to create a matrix of markers by metacluster ID and view it as a heatmap. This will allow us to determine which metaclusters express which markers. Again, we’re going to do this from scratch.
exp_mat <- lapply(sort(unique(metaclusters)), function(i) {
  # Filter on the metacluster column we added to cells above
  result <- dplyr::filter(cells, metacluster == i)[,surface]
  result <- apply(result, 2, mean)
  return(result) # You don't have to return, but I do it to be explicit
}) %>% dplyr::bind_rows()
exp_mat
## # A tibble: 22 Ă— 15
## CD27 CD3 CD123 CD33 CD14 CD19 CD4 CD8A CD16 CD20 CD11C CD45RA
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.03 3.50 0.418 0.333 0.727 0.145 2.48 1.28 1.53 1.28 0.289 0.534
## 2 0.153 0.488 0.783 3.81 2.44 0.207 1.05 1.14 1.50 1.21 4.00 1.09
## 3 0.288 0.644 4.07 0.689 0.670 0.168 1.54 1.08 1.39 1.21 0.382 4.57
## 4 0.285 0.698 1.50 1.98 1.27 0.299 1.39 1.85 3.89 1.92 4.31 4.21
## 5 3.15 4.00 0.644 0.450 1.10 0.197 2.83 1.70 2.01 1.76 0.386 1.14
## 6 1.75 3.04 0.682 0.456 1.12 0.280 1.23 4.28 2.27 1.90 0.388 0.765
## 7 1.36 2.93 0.540 0.522 0.886 2.06 1.51 2.69 1.89 2.48 0.583 2.35
## 8 0.364 0.532 0.493 0.758 0.676 4.09 0.556 1.03 1.59 3.59 0.338 4.30
## 9 3.05 4.16 0.528 0.371 0.855 0.186 2.52 1.35 1.68 1.44 0.316 3.98
## 10 1.37 3.43 0.347 0.287 0.558 0.144 1.85 1.04 1.35 1.04 0.320 3.01
## # … with 12 more rows, and 3 more variables: CD38 <dbl>, CD56 <dbl>,
## # HLADR <dbl>
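The lapply/bind_rows approach works, but the same per-metacluster means can be written with group_by() and across(). Here is a self-contained sketch on toy data (the tibble below is hypothetical):

```r
library(tidyverse)

# Hypothetical per-cell tibble with a metacluster column
demo <- tibble(metacluster = factor(c(1, 1, 2)),
               CD3  = c(4.0, 2.0, 0.2),
               CD19 = c(0.1, 0.3, 3.5))

# One row per metacluster, one column per marker
demo %>%
  group_by(metacluster) %>%
  summarise(across(c(CD3, CD19), mean))
```

On the real data, `cells %>% group_by(metacluster) %>% summarise(across(all_of(names(cells)[surface]), mean))` should reproduce exp_mat.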
Now let’s visualize it. I’ve been using the pheatmap package, which has worked well for me over the years.
library(pheatmap)
pheatmap::pheatmap(exp_mat)
With the heatmap, you can figure out which clusters express which markers, and therefore what cell type each one is (or perhaps you have found a novel cell type). This heatmap matters when you’re doing per-cluster statistics on multiple fcs files. When you find out that the frequency of cluster 2 changes in the control vs the treatment condition, you want to be able to quickly know what cluster 2 is.
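Two pheatmap arguments worth knowing (both are standard pheatmap parameters; the call below assumes the exp_mat tibble built above): labels_row, to name the rows by metacluster ID, and scale = "column", to z-score each marker so that brightly expressed markers don’t dominate the color scale.

```r
pheatmap::pheatmap(as.matrix(exp_mat),
                   # Name rows by metacluster ID instead of bare integers
                   labels_row = paste0("MC ", seq_len(nrow(exp_mat))),
                   # Z-score each marker (column) so color reflects relative expression
                   scale = "column")
```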
From here, we’re going to do dimensionality reduction. We’re going to use UMAP, a non-linear dimensionality reduction algorithm similar to t-SNE but with better global structure preservation (note that it is by no means perfect). It takes your original marker space and compresses it down to two dimensions, so we can visualize the data on an XY plot. Then we can get an idea of what subsets are there at what frequencies. We don’t want to over-interpret our map beyond that. See my work here for more.
First, we’re going to subsample our data so it doesn’t take UMAP too long to compute. We don’t need the whole dataset to get the insights we need.
to_sample <- 20000
cells_sub <- cells[sample(seq(nrow(cells)), to_sample),]
Now we run the R implementation of UMAP via the umap package.
library(umap)
dimr <- umap::umap(d = cells_sub[,surface])$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")
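The umap package exposes its parameters through a config object, umap.defaults. Two knobs worth experimenting with are n_neighbors (local vs. global structure) and min_dist (how tightly points pack); the values below are illustrative, not a recommendation. This assumes the cells_sub and surface objects from above.

```r
# Copy the default config and tweak a couple of parameters
cfg <- umap::umap.defaults
cfg$n_neighbors <- 30    # default 15; larger emphasizes global structure
cfg$min_dist <- 0.25     # default 0.1; larger spreads the islands out

dimr_alt <- umap::umap(d = cells_sub[, surface], config = cfg)$layout %>% as_tibble()
names(dimr_alt) <- c("umap1", "umap2")
```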
Now we can color the map by various markers. We’re going to use a helper function I wrote to do that easily.
#' @title Visualize dimension reduction plots
#' @description Plot dimension reduction map of interest colored by dimension
#' reduction comparison object of interest, with a blue to red color gradient.
#' Color palette is an attempt at the Cytobank rainbow palette.
#' @param cells Tibble of cells by features that the dimr was computed from
#' @param dimr Tibble of dimension reduction axis 1 and axis 2
#' @param marker vector of values for a particular marker
#' @param lim_min Minimum value to be colored
#' @param lim_max Maximum value to be colored
#' @return ggplot2 object
PlotViz <- function(cells, dimr, marker, lim_min = NULL, lim_max = NULL) {
  p <- ggplot(data = cells, aes(x = dimr[[1]],
                                y = dimr[[2]],
                                color = .data[[marker]])) +
    geom_point(shape = ".") +
    xlab(names(dimr)[1]) +
    ylab(names(dimr)[2]) +
    theme(legend.title = element_blank()) +
    ggtitle(marker)
  if(is.factor(cells[[marker]])) {
    # Use larger points so the discrete cluster colors are visible
    p <- p + geom_point(size = 1.2)
    return(p)
  }
  pal <- c("blue", "cyan", "springgreen", "yellow", "orange", "red", "red4")
  if(!is.null(lim_min)) {
    p <- p + scale_color_gradientn(colors = pal, limits = c(lim_min, lim_max))
  } else {
    p <- p + scale_color_gradientn(colors = pal)
  }
  return(p)
}
Let’s start by plotting some markers.
PlotViz(cells = cells_sub, dimr = dimr, marker = "CD33") # monocytes
PlotViz(cells = cells_sub, dimr = dimr, marker = "CD3") # T cells
Let’s group these plots into a grid, taking the first nine surface markers.
library(patchwork)
p_list <- lapply(names(cells_sub[,surface][1:9]), function(i) {
PlotViz(cells = cells_sub, dimr = dimr, marker = i)
})
(p_list[[1]] + p_list[[2]] + p_list[[3]]) / (p_list[[4]] + p_list[[5]] + p_list[[6]]) / (p_list[[7]] + p_list[[8]] + p_list[[9]])
Now let’s look at our clusters. Notice how the metaclusters begin to approximate the islands on the map. The islands on the map correspond to the most obvious cell subsets. In manual gating terms, these are the subsets that are obvious “blobs” on a biaxial plot.
PlotViz(cells = cells_sub, dimr = dimr, marker = "cluster")
PlotViz(cells = cells_sub, dimr = dimr, marker = "metacluster")
Remember, it is better to overcluster than to undercluster the data. I will leave it as an exercise to the reader to tinker with the number of clusters to determine the proper number. Note that this will involve a first-principles understanding of the expected number of subsets given the panel, plus the output you see on the heatmap and the dimension reduction map. Don’t rely on any one tool.
For more information about CyTOF analysis, please look at CyTOF workflow and SPECTRE, both of which aggregate tools into high-level functions for data analysis.