The following is a data analysis pipeline for a single fcs file. It gives the user the ability to import an fcs file, turn it into an expression matrix, cluster it, make a frequency table, perform dimensionality reduction, and produce various visualizations. This is a small piece of the CyTOF analysis scripts I have optimized over the past several years, and it is grounded in best practices in the field.
This pipeline is intended to be used after the input file has been pre-gated, typically on live/dead and on DNA by event_length (analogous to the scatter gates in flow cytometry).
Now let’s get acquainted with how CyTOF data are imported and processed. The data are stored as fcs files, a file format that was used for flow cytometry long before CyTOF was developed. We are going to use the flowCore package to import the fcs file into a data structure called a flowFrame. In reality, all we need is the expression matrix, which has each cell as a row and the markers as columns. Here is how you do that.
First, we set the working directory to the folder named “data” located alongside the R script. The working directory is where R looks for files and saves outputs. The here::here() function, from the here package, builds paths relative to the project root, which makes it easier to find files and folders. There’s only one file in there. The data analyzed in this markdown are healthy PBMCs from the Stanford HIMC, and can be found here. Note the PDF at the bottom, which explains more about the dataset.
We use the read.FCS() function from the flowCore package. It returns an object of class flowFrame, which contains the data and metadata of the fcs file.
library(flowCore)
library(tidyverse)
set.seed(1)
setwd(here::here("data", "mike_LISLinnegCD45pos-101917-relabeled-CORRECT"))
ff <- flowCore::read.FCS("Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs")
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
## name desc range minRange maxRange
## $P1 Ce140Di beads1 2911 0 7
## $P2 Ce142Di beads2 1509 0 1
## $P3 Er167Di CD27 781 0 17
## $P4 Er170Di CD3 1664 0 27
## $P5 Eu151Di CD123 2814 0 16
## ... ... ... ... ... ...
## $P26 Tm169Di CD45RA 12000 0 23
## $P27 Yb172Di CD38 5629 0 18
## $P28 Yb174Di HLADR 6990 0 11
## $P29 Yb176Di CD56 1349 0 12
## $P30 Time Time 0 0 1
## 261 keywords are stored in the 'description' slot
There’s a little bit we have to add here so that the column names of the expression matrix are the marker names and not the mass channel names.
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc
# Any NA marker names revert to the mass tag names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params
We need to transform the data. In general, single-cell data are transformed either logarithmically or log-like. For CyTOF, the best practice is the asinh transform with a scale argument of 5. This means that you divide all the values of the matrix by 5 and then perform an asinh transform on the result. This is how CyTOF data have always been scaled from the beginning, though it doesn’t look radically different from log(x + 1), which is used for single-cell sequencing. The goal is to make the data look as flow cytometry-like as possible (e.g. you can gate it).
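For intuition, here is a minimal, self-contained sketch (toy values, not drawn from the dataset in this markdown) comparing the asinh/5 transform with log(x + 1):

```r
# Toy raw intensities spanning a few orders of magnitude
x <- c(0, 1, 5, 50, 500, 5000)

# Standard CyTOF transform: divide by the cofactor (5), then asinh
asinh(x / 5)

# Log-like transform common in single-cell sequencing, for comparison
log(x + 1)
```

Both compress large values roughly logarithmically; asinh has the nice property of being defined at (and nearly linear around) zero, which matters given CyTOF’s many true-zero counts.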
Below, I show you what the expression matrix looks like before and after the transform, and then I set the new expression matrix values to that which is stored in the flow frame.
exp <- flowCore::exprs(ff)
as_tibble(exp)
## # A tibble: 93,816 Ă— 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event_le…¹ CD33 CD14
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 1.74 2.35 2.09 0 24.3 3.41 11.7
## 2 0 0 0 56.1 3.39 16.7 0 25.2 0.686 8.97
## 3 0 0 0.437 2.93 6.23 228. 0 17.7 2.23 6.48
## 4 0 0 108. 288. 5.61 91.3 0 15.9 2.84 2.30
## 5 2.48 0 9.69 7.18 3.35 22.2 0 20.5 5.29 15.9
## 6 1.49 0 27.6 191. 2.44 2.25 0 24.3 0 0.0995
## 7 1.57 0 26.0 160. 2.30 46.6 0 27.1 0.894 1.97
## 8 0 0 1.29 42.2 2.56 22.9 0 23.3 0 11.9
## 9 0 0 4.12 9.30 0 14.1 0 19.6 8.39 22.4
## 10 0 0 0.715 0 3.28 1.21 0 19.6 70.8 18.2
## # … with 93,806 more rows, 20 more variables: CD61 <dbl>, cells1 <dbl>,
## # cells2 <dbl>, beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>,
## # CD8A <dbl>, CD16 <dbl>, CD235 <dbl>, livedead <dbl>, CD20 <dbl>,
## # CD66 <dbl>, CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>,
## # CD56 <dbl>, Time <dbl>, and abbreviated variable name ¹​Event_length
exp <- asinh(exp/5)
as_tibble(exp)
## # A tibble: 93,816 Ă— 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event…¹ CD33 CD14 CD61
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.342 0.454 0.407 0 2.28 0.638 1.59 0.199
## 2 0 0 0 3.11 0.635 1.92 0 2.32 0.137 1.35 0.0704
## 3 0 0 0.0872 0.557 1.05 4.51 0 1.98 0.432 1.08 1.03
## 4 0 0 3.76 4.75 0.965 3.60 0 1.87 0.542 0.446 0.527
## 5 0.478 0 1.42 1.16 0.627 2.19 0 2.12 0.922 1.88 0.594
## 6 0.294 0 2.41 4.34 0.470 0.435 0 2.28 0 0.0199 0.716
## 7 0.308 0 2.35 4.16 0.444 2.93 0 2.39 0.178 0.384 0.662
## 8 0 0 0.255 2.83 0.492 2.22 0 2.24 0 1.60 0.573
## 9 0 0 0.752 1.38 0 1.76 0 2.07 1.29 2.20 0
## 10 0 0 0.142 0 0.617 0.240 0 2.07 3.34 2.01 1.34
## # … with 93,806 more rows, 19 more variables: cells1 <dbl>, cells2 <dbl>,
## # beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>,
## # CD235 <dbl>, livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>,
## # CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>,
## # and abbreviated variable name ¹​Event_length
flowCore::exprs(ff) <- exp
The next step is to cluster the data. For that, we’re going to use FlowSOM, which is generally the best practice for clustering in the CyTOF field, both in terms of run time and accuracy in comparison to manual gating. For rare subset discovery, consider using X-shift. But for the projects I have done, especially in pharma, FlowSOM has gotten the job done.
FlowSOM builds what is called a self-organizing map with a default of 100 clusters evenly spaced in your data. Then similar clusters are merged until you have a specific pre-set number of clusters. How do you know how many clusters are in your data? The spoiler alert is you’re going to have to run this tool a number of times. I would think about how many subsets you expect given your panel, and how many islands you see on your UMAP (stay tuned). It’s better that you overcluster than undercluster.
We use the high-level FlowSOM command, which performs the whole process.
We need to specify which columns to use. I’m going to take a shortcut and simply select markers by name patterns (CD, Ig, HLA, CCR/CXC). The commands below search for the quoted patterns in the vector of marker names.
params <- ff@parameters@data$desc
# Filter out markers we're not using
to_remove <- c("CD45", "CD66", "CD61", "CD235")
to_remove <- which(params %in% to_remove)
# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))
surface <- c(cd, ig, hla, ccr)
surface <- surface[!(surface %in% to_remove)]
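To make the grep logic concrete, here is a tiny self-contained illustration on a made-up panel (the marker names below are hypothetical, not the panel above):

```r
# Hypothetical mini-panel
params_demo <- c("CD3", "CD45", "HLADR", "CCR7", "beads1", "livedead", "IgD")
to_remove_demo <- which(params_demo %in% c("CD45"))

# Same pattern-matching approach as above
keep <- c(grep("CD", params_demo), grep("Ig", params_demo),
          grep("HLA", params_demo), grep("CCR", params_demo),
          grep("CXC", params_demo))
keep <- keep[!(keep %in% to_remove_demo)]
params_demo[keep]  # "CD3" "IgD" "HLADR" "CCR7"
```

One caveat: grep("CD", ...) matches any name containing "CD", so in larger panels double-check that nothing unintended (e.g. a barcoding channel) slips in.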
The arguments to pay attention to are xdim and ydim, which specify the size of the SOM. Right now I have it set at 10x10, which gives you 100 clusters to start with. You can go higher, but the runtime goes up. Sofie Van Gassen (the author) has the default at 7x7. The other one is colsToUse, which specifies the columns of the input flowFrame that will be used for clustering. Which do you want to cluster with? In general, surface markers rather than intracellular markers.
What is the final number of clusters? That’s the nClus argument. The algorithm will produce 100 clusters, and then merge clusters until we have that number of clusters, which needs to be less than xdim*ydim. We’ll be able to see what this looks like later.
library(FlowSOM)
fsom <- FlowSOM(ff,
# Input options:
compensate = FALSE,
transform = FALSE,
scale = FALSE,
# SOM options:
colsToUse = surface, xdim = 10, ydim = 10,
# Metaclustering options:
nClus = 22)
fsom
## FlowSOM model trained on 93816 cells and 15 markers,
## using a 10x10 grid (100 clusters) and 22 metaclusters.
##
## Markers used: CD27, CD3, CD123, CD33, CD14, CD19, CD4, CD8A, CD16, CD20, CD11C, CD45RA, CD38, CD56, HLADR
##
## Metacluster cell count:
## MC 1 MC 2 MC 3 MC 4 MC 5 MC 6 MC 7 MC 8 MC 9 MC 10 MC 11 MC 12 MC 13
## 18057 5913 269 2438 876 7312 151 6799 12408 1858 252 20851 347
## MC 14 MC 15 MC 16 MC 17 MC 18 MC 19 MC 20 MC 21 MC 22
## 653 342 487 161 1019 12047 121 787 668
##
## 717 cells (0.76%) are further than 4 MAD from their cluster center.
The next question you’re going to have when you cluster the data is what clusters express what markers, and what clusters appear at what frequencies. There are great tools for this, and in later markdowns, I’m going to show you how to use them. What I’m going to do here is show you how to do it from scratch.
First, let’s look at percent of cells per cluster. To do that, we have to better understand what is inside the fsom object we just made.
We have the data here.
cells <- fsom$data %>% as_tibble()
cells
## # A tibble: 93,816 Ă— 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event…¹ CD33 CD14 CD61
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.342 0.454 0.407 0 2.28 0.638 1.59 0.199
## 2 0 0 0 3.11 0.635 1.92 0 2.32 0.137 1.35 0.0704
## 3 0 0 0.0872 0.557 1.05 4.51 0 1.98 0.432 1.08 1.03
## 4 0 0 3.76 4.75 0.965 3.60 0 1.87 0.542 0.446 0.527
## 5 0.478 0 1.42 1.16 0.627 2.19 0 2.12 0.922 1.88 0.594
## 6 0.294 0 2.41 4.34 0.470 0.435 0 2.28 0 0.0199 0.716
## 7 0.308 0 2.35 4.16 0.444 2.93 0 2.39 0.178 0.384 0.662
## 8 0 0 0.255 2.83 0.492 2.22 0 2.24 0 1.60 0.573
## 9 0 0 0.752 1.38 0 1.76 0 2.07 1.29 2.20 0
## 10 0 0 0.142 0 0.617 0.240 0 2.07 3.34 2.01 1.34
## # … with 93,806 more rows, 19 more variables: cells1 <dbl>, cells2 <dbl>,
## # beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>,
## # CD235 <dbl>, livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>,
## # CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>,
## # and abbreviated variable name ¹​Event_length
We have the cluster and metacluster ID per cell here.
clusters <- FlowSOM::GetClusters(fsom)
clusters[1:10]
## [1] 40 46 78 41 100 2 4 5 99 10
metaclusters <- FlowSOM::GetMetaclusters(fsom) # Note that this is a vector of factors
metaclusters[1:10]
## [1] 8 6 19 9 19 1 1 1 19 2
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
cells$cluster <- clusters
cells$metacluster <- metaclusters
Now we tabulate the cell frequencies per cluster. Note that the spread in percent of cells per cluster widens for the metaclusters: the metaclusters approximate actual cell subsets in our data, whose frequencies vary widely. We’re expressing these as percentages.
table(clusters) %>% sort() %>% `/`(nrow(ff)) %>% `*`(100)
## clusters
## 28 20 38 80 60 69 21
## 0.05755948 0.06395498 0.10339388 0.12897587 0.17161252 0.18013985 0.25155624
## 61 44 18 58 98 59 50
## 0.25262215 0.26861090 0.28673147 0.28673147 0.31764305 0.33896137 0.36454336
## 48 7 97 53 12 17 74
## 0.36987294 0.38053211 0.39438902 0.40611410 0.40824593 0.41890509 0.41890509
## 63 77 36 72 94 68 34
## 0.48179415 0.50524431 0.54148546 0.55427646 0.57666070 0.58092436 0.58731986
## 86 24 96 85 31 67 49
## 0.61290185 0.63422018 0.64487934 0.67898866 0.68218641 0.68218641 0.69604332
## 22 25 33 76 35 83 15
## 0.70350473 0.71416390 0.72482306 0.74081180 0.76319604 0.76319604 0.77705295
## 55 75 27 57 54 19 64
## 0.77918479 0.79623945 0.79943720 0.80689861 0.80903044 0.81436002 0.83248060
## 95 16 43 45 73 1 46
## 0.83887610 0.84527160 0.84953526 0.88897416 0.90389699 0.92201757 0.92521532
## 81 93 87 29 47 90 66
## 0.92947898 0.98170888 0.99769762 0.99982945 1.00302720 1.01155453 1.04140019
## 39 84 14 37 23 56 26
## 1.04992752 1.09043234 1.18316705 1.20981496 1.22793553 1.23539695 1.25031977
## 78 71 88 13 92 32 10
## 1.25031977 1.28123135 1.36224098 1.38036156 1.40381172 1.46883261 1.49334868
## 42 89 79 11 65 3 41
## 1.55303999 1.55836957 1.56050141 1.59461073 1.64364288 1.66815895 1.68094994
## 82 6 5 91 4 70 99
## 1.68094994 1.69160911 1.73211393 1.77261874 1.77688241 1.78434382 1.81845314
## 2 100 51 62 8 40 52
## 1.90266053 2.02417498 2.07000938 2.11371195 2.13716210 2.24481965 2.25121514
## 9 30
## 2.60829709 2.95258804
table(metaclusters) %>% sort() %>% `/`(nrow(ff)) %>% `*`(100)
## metaclusters
## 20 7 17 11 3 15 13
## 0.1289759 0.1609534 0.1716125 0.2686109 0.2867315 0.3645434 0.3698729
## 16 14 22 21 5 18 10
## 0.5191012 0.6960433 0.7120321 0.8388761 0.9337426 1.0861687 1.9804724
## 4 2 8 6 19 9 1
## 2.5987038 6.3027629 7.2471647 7.7939797 12.8410932 13.2258890 19.2472499
## 12
## 22.2254200
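The same percentages can also be computed with dplyr verbs, which scales more naturally once you have many files. Here is a self-contained sketch on a toy metacluster vector (a stand-in for the metaclusters factor above):

```r
library(tidyverse)

# Toy assignments: 6 cells across 3 metaclusters
mc_demo <- factor(c(1, 1, 2, 2, 2, 3))

tibble(metacluster = mc_demo) %>%
  count(metacluster, name = "n_cells") %>%
  mutate(pct = 100 * n_cells / sum(n_cells)) %>%
  arrange(pct)
```

On the real data, swapping mc_demo for the metaclusters vector should reproduce the numbers from the table() pipeline above.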
In later markdowns, this will take the form of a matrix, where each column is a different fcs file.
Now, let’s look at average expression. We’re going to create a matrix of markers by metacluster ID and view it as a heatmap. This will allow us to determine which metaclusters express which markers. Again, we’re going to do this from scratch.
exp_mat <- lapply(sort(unique(metaclusters)), function(i) {
  # Filter on the metacluster column we added to cells above
  result <- dplyr::filter(cells, metacluster == i)[,surface]
  result <- apply(result, 2, mean)
  return(result) # You don't have to return, but I do it to be explicit
}) %>% dplyr::bind_rows()
exp_mat
## # A tibble: 22 Ă— 15
## CD27 CD3 CD123 CD33 CD14 CD19 CD4 CD8A CD16 CD20 CD11C CD45RA
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.03 3.50 0.418 0.333 0.727 0.145 2.48 1.28 1.53 1.28 0.289 0.534
## 2 0.153 0.488 0.783 3.81 2.44 0.207 1.05 1.14 1.50 1.21 4.00 1.09
## 3 0.288 0.644 4.07 0.689 0.670 0.168 1.54 1.08 1.39 1.21 0.382 4.57
## 4 0.285 0.698 1.50 1.98 1.27 0.299 1.39 1.85 3.89 1.92 4.31 4.21
## 5 3.15 4.00 0.644 0.450 1.10 0.197 2.83 1.70 2.01 1.76 0.386 1.14
## 6 1.75 3.04 0.682 0.456 1.12 0.280 1.23 4.28 2.27 1.90 0.388 0.765
## 7 1.36 2.93 0.540 0.522 0.886 2.06 1.51 2.69 1.89 2.48 0.583 2.35
## 8 0.364 0.532 0.493 0.758 0.676 4.09 0.556 1.03 1.59 3.59 0.338 4.30
## 9 3.05 4.16 0.528 0.371 0.855 0.186 2.52 1.35 1.68 1.44 0.316 3.98
## 10 1.37 3.43 0.347 0.287 0.558 0.144 1.85 1.04 1.35 1.04 0.320 3.01
## # … with 12 more rows, and 3 more variables: CD38 <dbl>, CD56 <dbl>,
## # HLADR <dbl>
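The lapply/bind_rows approach works, but the same per-metacluster means can be written with group_by() and across(). Here is a self-contained sketch on toy data (the tibble below is hypothetical):

```r
library(tidyverse)

# Hypothetical per-cell tibble with a metacluster column
demo <- tibble(metacluster = factor(c(1, 1, 2)),
               CD3  = c(4.0, 2.0, 0.2),
               CD19 = c(0.1, 0.3, 3.5))

# One row per metacluster, one column per marker
demo %>%
  group_by(metacluster) %>%
  summarise(across(c(CD3, CD19), mean))
```

On the real data, `cells %>% group_by(metacluster) %>% summarise(across(all_of(names(cells)[surface]), mean))` should reproduce exp_mat.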
Now let’s visualize it. I’ve been using the pheatmap package, which has worked well for me over the years.
library(pheatmap)
pheatmap::pheatmap(exp_mat)
With the heatmap, you can figure out which clusters express which markers, and therefore what cell type each one is (or perhaps you have found a novel cell type). This heatmap matters when you’re doing per-cluster statistics on multiple fcs files. When you find out that the frequency of cluster 2 changes in the control vs the treatment condition, you want to be able to quickly know what cluster 2 is.
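Two pheatmap arguments worth knowing (both are standard pheatmap parameters; the call below assumes the exp_mat tibble built above): labels_row, to name the rows by metacluster ID, and scale = "column", to z-score each marker so that brightly expressed markers don’t dominate the color scale.

```r
pheatmap::pheatmap(as.matrix(exp_mat),
                   # Name rows by metacluster ID instead of bare integers
                   labels_row = paste0("MC ", seq_len(nrow(exp_mat))),
                   # Z-score each marker (column) so color reflects relative expression
                   scale = "column")
```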
From here, we’re going to do dimensionality reduction. We’re going to use UMAP, a non-linear dimensionality reduction algorithm similar to t-SNE but with better global structure preservation (note that it is by no means perfect). It takes your original marker space and compresses it down to two dimensions, so we can visualize the data on an XY plot. Then we can get an idea of what subsets are there at what frequencies. We don’t want to over-interpret our map beyond that. See my work here for more.
First, we’re going to subsample our data so it doesn’t take UMAP too long to compute. We don’t need the whole dataset to get the insights we need.
to_sample <- 20000
cells_sub <- cells[sample(seq(nrow(cells)), to_sample),]
Now we run the R implementation of UMAP via the umap package.
library(umap)
dimr <- umap::umap(d = cells_sub[,surface])$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")
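The umap package exposes its parameters through a config object, umap.defaults. Two knobs worth experimenting with are n_neighbors (local vs. global structure) and min_dist (how tightly points pack); the values below are illustrative, not a recommendation. This assumes the cells_sub and surface objects from above.

```r
# Copy the default config and tweak a couple of parameters
cfg <- umap::umap.defaults
cfg$n_neighbors <- 30    # default 15; larger emphasizes global structure
cfg$min_dist <- 0.25     # default 0.1; larger spreads the islands out

dimr_alt <- umap::umap(d = cells_sub[, surface], config = cfg)$layout %>% as_tibble()
names(dimr_alt) <- c("umap1", "umap2")
```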
Now we can color the map by various markers. We’re going to use a helper function I wrote to do that easily.
#' @title Visualize dimension reduction plots
#' @description Plot dimension reduction map of interest colored by dimension
#' reduction comparison object of interest, with a blue to red color gradient.
#' Color palette is an attempt at the Cytobank rainbow palette.
#' @param cells Tibble of cells by features that the dimr was computed from
#' @param dimr Tibble of dimension reduction axis 1 and axis 2
#' @param marker vector of values for a particular marker
#' @param lim_min Minimum value to be colored
#' @param lim_max Maximum value to be colored
#' @return ggplot2 object
PlotViz <- function(cells, dimr, marker, lim_min = NULL, lim_max = NULL) {
  p <- ggplot(data = cells, aes(x = dimr[[1]],
                                y = dimr[[2]],
                                color = .data[[marker]])) +
    geom_point(shape = ".") +
    xlab(names(dimr)[1]) +
    ylab(names(dimr)[2]) +
    theme(legend.title = element_blank()) +
    ggtitle(marker)
  if(is.factor(cells[[marker]])) {
    # Use larger points so the discrete cluster colors are visible
    p <- p + geom_point(size = 1.2)
    return(p)
  }
  pal <- c("blue", "cyan", "springgreen", "yellow", "orange", "red", "red4")
  if(!is.null(lim_min)) {
    p <- p + scale_color_gradientn(colors = pal, limits = c(lim_min, lim_max))
  } else {
    p <- p + scale_color_gradientn(colors = pal)
  }
  return(p)
}
Let’s start by plotting some markers.
PlotViz(cells = cells_sub, dimr = dimr, marker = "CD33") # monocytes
PlotViz(cells = cells_sub, dimr = dimr, marker = "CD3") # T cells
Let’s group these plots into a grid, taking the first nine surface markers.
library(patchwork)
p_list <- lapply(names(cells_sub[,surface][1:9]), function(i) {
PlotViz(cells = cells_sub, dimr = dimr, marker = i)
})
(p_list[[1]] + p_list[[2]] + p_list[[3]]) / (p_list[[4]] + p_list[[5]] + p_list[[6]]) / (p_list[[7]] + p_list[[8]] + p_list[[9]])
Now let’s look at our clusters. Notice how the metaclusters begin to approximate the islands on the map. The islands on the map correspond to the most obvious cell subsets. In manual gating terms, these are the subsets that are obvious “blobs” on a biaxial plot.
PlotViz(cells = cells_sub, dimr = dimr, marker = "cluster")
PlotViz(cells = cells_sub, dimr = dimr, marker = "metacluster")
Remember, it is better to overcluster than to undercluster the data. I will leave it as an exercise to the reader to tinker with the number of clusters to determine the proper number. Note that this will involve a first-principles understanding of the expected number of subsets given the panel, plus the output you see on the heatmap and the dimension reduction map. Don’t rely on any one tool.
For more information about CyTOF analysis, please look at CyTOF workflow and SPECTRE, both of which aggregate tools into high-level functions for data analysis.