CyTOF pipeline, multiple files

Home

The purpose of this document is to provide a practical guide to the analysis of CyTOF data when we’re dealing with multiple fcs files, comparing between two groups. My hope is that this markdown can help users understand the best practices of CyTOF analysis, as well as utilize some of the code here that has proven beneficial to me over the years. The dataset we are importing comes from Cross et al, Fronteirs in Immunology 2020. The fcs files can be found on Flow Repository here.

Load data

First, we load some libraries we will be using.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(diffcyt)
library(flowCore)
library(SummarizedExperiment)

## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## 
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## 
## Attaching package: 'MatrixGenerics'
## 
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## 
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## 
## The following object is masked from 'package:flowCore':
## 
##     normalize
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## 
## The following objects are masked from 'package:lubridate':
## 
##     second, second<-
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## 
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## 
## The following object is masked from 'package:lubridate':
## 
##     %within%
## 
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## 
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## Attaching package: 'Biobase'
## 
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## 
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

library(here)

## here() starts at /Users/tylerburns/Documents/projects/cytof_pipeline_many_files

Next, we upload our patient metadata, which we will use when we’re distinguishing between control and treatment. Note that I’m removing one of the groups below (Dneg28). This is because for the purpose of this markdown, I want to focus on a comparison between two groups rather than three.

setwd(here::here("data", "FlowRepository_FR-FCM-Z2GG_files", "metadata"))

# There's a weird file read bug, where I can't read a handful of files
# Read in the metadata
md <- readr::read_csv("experiment_info.csv")

## Rows: 93 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): file_name, plate, vaccine, day, diagnosis, gating_status, sample_id
## dbl (1): patient_id
## lgl (1): read_error
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

md <- dplyr::filter(md, read_error == FALSE)

# Filter by vaccine
md <- dplyr::filter(md, vaccine == "TT") 

# Removing one of the time points so we can focus just on two
md <- md[!grepl("Dneg28", md$day),]

# Set day as group_id
names(md)[which(names(md) == "day")] <- "group_id"

# Set sample ID (for diffcyt)
md$sample_id <- paste0(md$group_id, "_", seq(nrow(md)))

The first thing we’re going to do is read in the fcs files of interest as a list of flow frames. If you want a detailed primer of the components of reading in a single, fcs file, please visit my tutorial here.

setwd(here::here("data", "FlowRepository_FR-FCM-Z2GG_files"))
set.seed(1234)

ff_list <- lapply(md$file_name, function(i) {
    # Get the flow frame object
    ff <- flowCore::read.FCS(i)
    
    # Random subsample (note how we can treat the flow frame like a dataframe)
    ff <- ff[sample(nrow(ff), 100000),]
    return(ff)
})

names(ff_list) <- md$file_name

ff_list[1:3]

## $`Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs`
## flowFrame object 'Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs'
## with 100000 cells and 52 observables:
##          name        desc     range  minRange  maxRange
## $P1   Bi209Di  209Bi_CD16      3888         0      3888
## $P2   Ce140Di 140Ce_Beads      6236         0      6236
## $P3    Center      Center      3611         0      3611
## $P4   Dy161Di  161Dy_CD26      2397         0      2397
## $P5   Dy162Di  162Dy_CD8a      1695         0      1695
## ...       ...         ...       ...       ...       ...
## $P48  Yb173Di  173Yb_a4b7       505         0       505
## $P49  Yb174Di  174Yb_PD-1      2698         0      2698
## $P50  Yb176Di   176Yb_CLA      9942         0      9942
## $P51 beadDist    beadDist       237         0       237
## $P52     Time        Time      8557         0      8557
## 498 keywords are stored in the 'description' slot
## 
## $`Plate10_8602_TT_Dneg21_nTD_Live Intact Singlets.fcs`
## flowFrame object 'Plate10_8602_TT_Dneg21_nTD_Live Intact Singlets.fcs'
## with 100000 cells and 52 observables:
##          name        desc     range  minRange  maxRange
## $P1   Bi209Di  209Bi_CD16     15162         0     15162
## $P2   Ce140Di 140Ce_Beads      6798         0      6798
## $P3    Center      Center      3371         0      3371
## $P4   Dy161Di  161Dy_CD26      2118         0      2118
## $P5   Dy162Di  162Dy_CD8a      1408         0      1408
## ...       ...         ...       ...       ...       ...
## $P48  Yb173Di  173Yb_a4b7      2126         0      2126
## $P49  Yb174Di  174Yb_PD-1       522         0       522
## $P50  Yb176Di   176Yb_CLA     10401         0     10401
## $P51 beadDist    beadDist       257         0       257
## $P52     Time        Time      8557         0      8557
## 498 keywords are stored in the 'description' slot
## 
## $`Plate11_8842_TT_D0_nTD_Live Intact Singlets.fcs`
## flowFrame object 'Plate11_8842_TT_D0_nTD_Live Intact Singlets.fcs'
## with 100000 cells and 52 observables:
##          name        desc     range  minRange  maxRange
## $P1   Bi209Di  209Bi_CD16      1735         0      1735
## $P2   Ce140Di 140Ce_Beads      6656         0      6656
## $P3    Center      Center      6510         0      6510
## $P4   Dy161Di  161Dy_CD26      1944         0      1944
## $P5   Dy162Di  162Dy_CD8a      1651         0      1651
## ...       ...         ...       ...       ...       ...
## $P48  Yb173Di  173Yb_a4b7       210         0       210
## $P49  Yb174Di  174Yb_PD-1       416         0       416
## $P50  Yb176Di   176Yb_CLA      8220         0      8220
## $P51 beadDist    beadDist       231         0       231
## $P52     Time        Time     19777         0     19777
## 496 keywords are stored in the 'description' slot

Now we load the marker data.

setwd(here::here("data", "FlowRepository_FR-FCM-Z2GG_files", "metadata"))
marker_info <- readr::read_csv("markers_edited.csv")

## Rows: 52 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): channel_name, marker_name, marker_class
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now, we’re going to get our data into diffcyt, which will perform clustering and statistics for us.

d_se <- diffcyt::prepareData(d_input = ff_list, experiment_info = as.data.frame(md), marker_info = as.data.frame(marker_info))
d_se

## class: SummarizedExperiment 
## dim: 2000000 52 
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(9): file_name plate ... sample_id read_error
## colnames(52): 209Bi_CD16 140Ce_Beads ... beadDist Time
## colData names(3): channel_name marker_name marker_class

Now we transform the data directly within diffcyt.

d_se <- diffcyt::transformData(d_se, cofactor = 5)

FlowSOM clustering

Now we perform FlowSOM clustering directly within diffcyt. We’re going to do a 10x10 grid (100 clusters) with a smaller number of metaclusters. We have already defined what markers we’re going to cluster in marker_info. They are the ones labeled “type.” We will print these here:

dplyr::filter(marker_info, marker_class == "type")$marker_name

##  [1] "209Bi_CD16"   "161Dy_CD26"   "162Dy_CD8a"   "163Dy_CD33"   "164Dy_CD161" 
##  [6] "166Er_CCR10"  "167Er_CCR7"   "168Er_CCR9"   "170Er_CD3"    "151Eu_ICOS"  
## [11] "153Eu_CD45RA" "155Gd_CD27"   "156Gd_CXCR3"  "158Gd_NKG2A"  "160Gd_CD14"  
## [16] "165Ho_CD127"  "113In_CD57"   "115In_HLA-DR" "175Lu_CD62L"  "142Nd_CD19"  
## [21] "144Nd_IgG"    "145Nd_CD4"    "146Nd_IgD"    "148Nd_IgA"    "150Nd_CD86"  
## [26] "141Pr_CCR6"   "147Sm_CD20"   "149Sm_CD56"   "152Sm_TCRgd"  "154Sm_CD123" 
## [31] "159Tb_CD11c"  "169Tm_CD25"   "171Yb_CXCR5"  "172Yb_CD38"   "173Yb_a4b7"  
## [36] "174Yb_PD-1"   "176Yb_CLA"

# Generate clusters
# note: include random seed for reproducible clustering
d_se <- generateClusters(d_se, seed_clustering = 123, 
                         xdim = 10, 
                         ydim = 10, 
                         meta_clustering = TRUE, 
                         meta_k = 15)

## FlowSOM clustering completed in 91.8 seconds

Now let’s find our clusters. It’s in rowData. Look at the end, for cluster_id.

rowData(d_se)

## DataFrame with 2000000 rows and 10 columns
##                                                  file_name    plate patient_id
##                                                   <factor> <factor>   <factor>
## 1          Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs  Plate10       8602
## 2          Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs  Plate10       8602
## 3          Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs  Plate10       8602
## 4          Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs  Plate10       8602
## 5          Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs  Plate10       8602
## ...                                                    ...      ...        ...
## 1999996 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs  Plate19       8016
## 1999997 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs  Plate19       8016
## 1999998 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs  Plate19       8016
## 1999999 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs  Plate19       8016
## 2000000 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs  Plate19       8016
##          vaccine group_id diagnosis            gating_status sample_id
##         <factor> <factor>  <factor>                 <factor>  <factor>
## 1             TT       D0       nTD Live Intact Singlets.fcs      D0_1
## 2             TT       D0       nTD Live Intact Singlets.fcs      D0_1
## 3             TT       D0       nTD Live Intact Singlets.fcs      D0_1
## 4             TT       D0       nTD Live Intact Singlets.fcs      D0_1
## 5             TT       D0       nTD Live Intact Singlets.fcs      D0_1
## ...          ...      ...       ...                      ...       ...
## 1999996       TT   Dneg21        TD Live Intact Singlets.fcs Dneg21_20
## 1999997       TT   Dneg21        TD Live Intact Singlets.fcs Dneg21_20
## 1999998       TT   Dneg21        TD Live Intact Singlets.fcs Dneg21_20
## 1999999       TT   Dneg21        TD Live Intact Singlets.fcs Dneg21_20
## 2000000       TT   Dneg21        TD Live Intact Singlets.fcs Dneg21_20
##         read_error cluster_id
##           <factor>   <factor>
## 1            FALSE         13
## 2            FALSE         13
## 3            FALSE         7 
## 4            FALSE         7 
## 5            FALSE         4 
## ...            ...        ...
## 1999996      FALSE         12
## 1999997      FALSE         12
## 1999998      FALSE         7 
## 1999999      FALSE         12
## 2000000      FALSE         3

Let’s get a feel for the cluster frequency distribution. In the table below, cluster ID is the name and the element is the number of cells in that cluster.

rowData(d_se)$cluster_id %>% table() %>% sort()

## .
##      5     11      6     10      9      2      8     15     12     13      3 
##    721   2446   4405   8328  15760  20170  34046  46853  71970 164485 171308 
##      4     14      1      7 
## 187487 241512 281237 749272

Frequency and expression tables

Now we are going to make our frequency and expression tables for statistics and visualization. Diffcyt has built in functionality for that.

We will start with the counts table. This is the number of cells per cluster per sample. The results are a new SummarizedExperiment object, so I have to use the “assay” command within the SummarizedExperiment framework to access it. If you’re confused, don’t worry. It took me a while to figure these things out too.

d_counts <- diffcyt::calcCounts(d_se = d_se)
SummarizedExperiment::assay(d_counts)[1:10, 1:10]

##     D0_1 Dneg21_2  D0_3 Dneg21_4  D0_5 Dneg21_6  D0_7 Dneg21_8  D0_9 Dneg21_10
## 1  14448    11149  5467     6254 19902    19401 14655    26473 14432     21418
## 2   1411     1582   363      449   619     1057  1376     2278   761      1391
## 3   5042     5244  1353     1457  9178    10273  9123     9880  4525      5526
## 4  10022     7259  5815     5678 14141    10475  9106    24310 10014      9199
## 5     28       42    39        9    84       47    40       18    27        61
## 6    642      564    44       37   155       85   231       14   279        28
## 7  36825    37075 62636    63714 30405    29311 53953    10618 33049     24695
## 8    971     1125  2525     2700  1946     1965  1237     3316  2279      3731
## 9   1192      706   378      476   873      410   733     1184   705       457
## 10    89     1095    77      684   122     1000   114     1752    66       167

Ok, let’s convert these to frequencies, expressed as a percentage of total cells for the file.

# The frequency table itself, embedded within the d_counts SummarizedExperiment 
# Note that the metadata stored in the SummarizedExperiment file is the same order
d_freqs <- SummarizedExperiment::assay(d_counts) %>% apply(., 2, function(i) {
    result <- i/sum(i)
    return(result*100)
})

d_freqs[1:10, 1:10]

##      D0_1 Dneg21_2   D0_3 Dneg21_4   D0_5 Dneg21_6   D0_7 Dneg21_8   D0_9
## 1  14.448   11.149  5.467    6.254 19.902   19.401 14.655   26.473 14.432
## 2   1.411    1.582  0.363    0.449  0.619    1.057  1.376    2.278  0.761
## 3   5.042    5.244  1.353    1.457  9.178   10.273  9.123    9.880  4.525
## 4  10.022    7.259  5.815    5.678 14.141   10.475  9.106   24.310 10.014
## 5   0.028    0.042  0.039    0.009  0.084    0.047  0.040    0.018  0.027
## 6   0.642    0.564  0.044    0.037  0.155    0.085  0.231    0.014  0.279
## 7  36.825   37.075 62.636   63.714 30.405   29.311 53.953   10.618 33.049
## 8   0.971    1.125  2.525    2.700  1.946    1.965  1.237    3.316  2.279
## 9   1.192    0.706  0.378    0.476  0.873    0.410  0.733    1.184  0.705
## 10  0.089    1.095  0.077    0.684  0.122    1.000  0.114    1.752  0.066
##    Dneg21_10
## 1     21.418
## 2      1.391
## 3      5.526
## 4      9.199
## 5      0.061
## 6      0.028
## 7     24.695
## 8      3.731
## 9      0.457
## 10     0.167

Great. Now let’s look at marker expression per cluster, so we can know what cluster expresses what, and therefore annotate the clusters.

# Expression heatmap
cluster_exp <- diffcyt::calcMediansByClusterMarker(d_se) %>% 
    SummarizedExperiment::assay()

cluster_exp[1:10, 1:10]

##    209Bi_CD16 161Dy_CD26 162Dy_CD8a 163Dy_CD33 164Dy_CD161 166Er_CCR10
## 1    0.000000 0.43597161 0.46041100 4.39457615  0.32049829   2.2342883
## 2    3.140056 0.01549244 0.00000000 1.95843842  0.00000000   0.0000000
## 3    5.017952 0.00000000 0.09018378 0.09013180  3.14408704   0.0000000
## 4    0.000000 0.00000000 0.08877896 0.01528608  0.00000000   0.3031054
## 5    0.000000 2.25675482 0.80250410 3.74086862  0.39427732   1.6774147
## 6    0.000000 2.78207652 0.37146481 2.20336332  0.01313306   0.0000000
## 7    0.000000 2.58985497 0.08170316 0.00000000  0.00000000   0.6258491
## 8    0.000000 2.58480716 0.43751613 0.26018413  0.13605166   3.6969467
## 9    0.000000 0.95637654 0.00000000 0.26293038  0.00000000   0.0000000
## 10   0.000000 0.00000000 0.12264340 0.02304905  0.01821971   2.5493985
##    167Er_CCR7 168Er_CCR9 170Er_CD3 151Eu_ICOS
## 1   0.3305481 0.00000000 1.7772263 0.00000000
## 2   0.0000000 0.00000000 0.2443670 0.06368639
## 3   0.0000000 0.00000000 0.5818328 0.01601792
## 4   3.0851052 0.08584389 0.2407394 0.15491648
## 5   3.4882836 0.54234434 5.4409847 0.52329368
## 6   0.0895010 0.00000000 1.2134195 0.00000000
## 7   4.1672967 1.12878605 5.7767146 0.70985307
## 8   1.9618513 0.50304565 5.3912394 0.88943397
## 9   0.4989065 0.00000000 0.7481135 0.47584247
## 10  0.1794202 0.00000000 1.2114854 0.31928028

pheatmap::pheatmap(cluster_exp, labels_row = 1:nrow(cluster_exp), fontsize = 7)

Frequency statistics via EdgeR

If this section is too much for the reader (requires working knowledge of supervised learning, and EdgeR’s vocabulary around it), then skip to the next section that has simple statistical tests between conditions.

We’re going to do some statistics on the frequency tables, and then do some visualizations. To do this, we’re going to use EdgeR. This is a linear model framework within diffcyt. Just think of it as a t-test, but more rigorous. We’re not using a linear model to make a prediction. We’re using it to make a statistical test.

First, you have to make what is called a design matrix. Before you go any further, please go here, and quickly skim Figure 1 (on categorical variables), Figure 2 (design and contrast matrix), and section 4.2 and 4.4. This should help you understand what we’re doing with respect to linear models and the design matrix.

If you want a nice video-based explanation of the design matrix, watch this.

The design matrix below has rows as samples, and columns are the model parameters.

design <- diffcyt::createDesignMatrix(experiment_info = md, cols_design = 'group_id')

# This makes the names acceptable for input into downstream functions by removing
#   "unacceptable" characters. Comes from base R.
colnames(design) <- make.names(colnames(design))
design[1:10,]

##    X.Intercept. group_idDneg21
## 1             1              0
## 2             1              1
## 3             1              0
## 4             1              1
## 5             1              0
## 6             1              1
## 7             1              0
## 8             1              1
## 9             1              0
## 10            1              1

From here, we create a contrast matrix. For the sake of this comparison with this design matrix with two conditions, it’s literally going to just be a 0 and a 1.

In the contrast matrix, rows are the model parameters (the columns of the design matrix, which was intercept and the experiment group), and columns are a contrast of interest. In this case, there is only one test, controls vs Day 7, so there is only one column.

A thread that helped me understand the contrast matrix specifically when I first encountered it can be found here.

contrast <- diffcyt::createContrast(contrast = c(0, 1))
contrast

##      [,1]
## [1,]    0
## [2,]    1

Finally, we feed these into the built-in EdgeR statistics, within diffcyt. Note that it outputs another SummarizedExperiment object.

# GLM from EdgeR (options to do other types of modeling here)
res_DA <- diffcyt::testDA_edgeR(d_counts = d_counts, design =  design, contrast = contrast)
res_DA

## class: SummarizedExperiment 
## dim: 15 20 
## metadata(0):
## assays(1): counts
## rownames(15): 1 2 ... 14 15
## rowData names(6): cluster_id logFC ... p_val p_adj
## colnames(20): D0_1 Dneg21_2 ... D0_19 Dneg21_20
## colData names(9): file_name plate ... sample_id read_error

Rather than tear open the SummarizedExperiment object, we can use a built-in function called topTable to see what is statistically significant and what is not, between the control and experiment group.

# Quick visualization
edger_table <- diffcyt::topTable(res_DA, top_n = nrow(res_DA@elementMetadata)) %>% as_tibble()
edger_table[1:10,]

## # A tibble: 10 × 3
##    cluster_id      p_val     p_adj
##    <fct>           <dbl>     <dbl>
##  1 10         0.00000143 0.0000214
##  2 12         0.0000637  0.000478 
##  3 9          0.0911     0.342    
##  4 15         0.0790     0.342    
##  5 6          0.130      0.391    
##  6 11         0.212      0.530    
##  7 5          0.294      0.586    
##  8 8          0.313      0.586    
##  9 4          0.581      0.792    
## 10 7          0.533      0.792

Frequency statistics via Mann-Whitney U test

What if you don’t trust the design and contrast matrices above and you need a simple sanity check? What if you don’t want to touch regression-based statistics and you want something simpler? Then you just use a Mann-Whitney U test. I prefer this test in place of a t-test because you don’t have to assume anything about the underlying distributions.

Let’s look again at the freq table and see what info we need.

d_freqs[1:10, 1:10]

##      D0_1 Dneg21_2   D0_3 Dneg21_4   D0_5 Dneg21_6   D0_7 Dneg21_8   D0_9
## 1  14.448   11.149  5.467    6.254 19.902   19.401 14.655   26.473 14.432
## 2   1.411    1.582  0.363    0.449  0.619    1.057  1.376    2.278  0.761
## 3   5.042    5.244  1.353    1.457  9.178   10.273  9.123    9.880  4.525
## 4  10.022    7.259  5.815    5.678 14.141   10.475  9.106   24.310 10.014
## 5   0.028    0.042  0.039    0.009  0.084    0.047  0.040    0.018  0.027
## 6   0.642    0.564  0.044    0.037  0.155    0.085  0.231    0.014  0.279
## 7  36.825   37.075 62.636   63.714 30.405   29.311 53.953   10.618 33.049
## 8   0.971    1.125  2.525    2.700  1.946    1.965  1.237    3.316  2.279
## 9   1.192    0.706  0.378    0.476  0.873    0.410  0.733    1.184  0.705
## 10  0.089    1.095  0.077    0.684  0.122    1.000  0.114    1.752  0.066
##    Dneg21_10
## 1     21.418
## 2      1.391
## 3      5.526
## 4      9.199
## 5      0.061
## 6      0.028
## 7     24.695
## 8      3.731
## 9      0.457
## 10     0.167

Each p-value will come from a test on each cluster. The cluster is the row. The conditions are in the columns, before the underscore. So we have all the information we need. Let’s run the test on the first column, so you can see what’s going on.

Note that the function is called “wilcox.test.” From the descriptin of the function: “Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as ‘Mann-Whitney’ test.”

test <- d_freqs[1,]
test

##      D0_1  Dneg21_2      D0_3  Dneg21_4      D0_5  Dneg21_6      D0_7  Dneg21_8 
##    14.448    11.149     5.467     6.254    19.902    19.401    14.655    26.473 
##      D0_9 Dneg21_10     D0_11 Dneg21_12     D0_13 Dneg21_14     D0_15 Dneg21_16 
##    14.432    21.418     9.278    15.277    18.582    16.296    15.347    12.843 
##     D0_17 Dneg21_18     D0_19 Dneg21_20 
##     6.212     4.775    25.291     3.737

c1 <- test[grep("D0", names(test))]
c2 <- test[grep("Dneg21", names(test))]

wilcox.test(c1, c2)

## 
##  Wilcoxon rank sum exact test
## 
## data:  c1 and c2
## W = 52, p-value = 0.9118
## alternative hypothesis: true location shift is not equal to 0

Ok, now we just have to do that for every other cluster.

p_mwu <- lapply(seq(nrow(d_freqs)), function(i) {
    curr <- d_freqs[i,]
    c1 <- curr[grep("Dneg21", names(curr))]
    c2 <- curr[grep("D0", names(curr))]
    wilcox.test(c1, c2)$p.value
}) %>% unlist()

## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties

## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties

## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties

## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties

## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties

And now we do the FDR adjustment to account for comparisons that were statistically significant out of randomness.

p_mwu_adj <- p.adjust(p_mwu, method = "fdr")
p_mwu_adj

##  [1] 0.9117972 0.7330827 0.7330827 0.7330827 0.7330827 0.2621295 0.7330827
##  [8] 0.7330827 0.2621295 0.2621295 0.7330827 0.7330827 0.7330827 0.7330827
## [15] 0.7330827

Now we can make a statistics table.

mwu_table <- tibble(cluster = seq(nrow(d_freqs)), p_value = p_mwu, p_adj = p_mwu_adj) %>%
    arrange(p_mwu_adj)
mwu_table

## # A tibble: 15 × 3
##    cluster p_value p_adj
##      <int>   <dbl> <dbl>
##  1       6  0.0433 0.262
##  2       9  0.0524 0.262
##  3      10  0.0283 0.262
##  4       2  0.684  0.733
##  5       3  0.684  0.733
##  6       4  0.218  0.733
##  7       5  0.364  0.733
##  8       7  0.529  0.733
##  9       8  0.257  0.733
## 10      11  0.345  0.733
## 11      12  0.520  0.733
## 12      13  0.684  0.733
## 13      14  0.579  0.733
## 14      15  0.579  0.733
## 15       1  0.912  0.912

I would say that a good test for whether your EdgeR statistics are done correctly is whether they are at least similar to what you see with the Mann-Whitney U test here. The spoiler alert is that cluster 12 is flagged as significant by EdgeR and not by the Mann-Whitney U test. Plots below will show why: the presence of an outlier. When you see extreme outliers like this, it’s a good idea to look at the offending file and see if you can figure out why it is such an outlier, and whether it should be removed from analysis.

Boxplots of statistical test results

Now, we’re going to use a plotting function I wrote to visualize the clusters that are statistically significant, and see what we see. This took me a bit of time to optimize, but it gives you boxplots (in the ggplot2 style), with the statistics and *’s to indicate signifcance, as you would see in a standard publication.

I advise always looking at the boxplots, as you can start to see what significant results are relevant, and which are the result of things like outliers, or something I commonly see, clusters that simply have very low frequencies. Accordingly, when you look at the box plots below, always look at the y axis, so you know what frequency range you’re dealing with (eg. 10-20 percent versus 0.1-1 percent).

Let’s introduce the function. It’s complex, so don’t worry if it looks like a bit much. It’s optimized for both differential abundance, and differential expression. In this markdown, we’re only dealing with the former, so if you’re a bioinformatician, you can look at the function, see what I did, and simplify it to your liking. Note that it is also optimized for multiple clusters as input. The loop is built in. Again, if you would rather make the loop yourself, then just pull out the components within “lapply” as needed.

Note that the statistics are the Mann-Whitney U test again.

library(ggpubr)

#' @title Wrapper for plotting
#' @description Produces per cluster plots of control vs experimental condition
#' @param clusters Total Cluster IDs
#' @param Mat cluster frequency table
#' @param test Test that will be used if including statistical testing within
#' the boxplots
#' @param Add_p Whether to add a p-value to the plot itself
#' @param to_save whether to save the plots
#' @param Comp_conds the names of the conditions used in the experiment
#' @param Control_cond the name of the condition that is the control. It will
#' be placed on the left side
#' @param plot_type either da or ds
#' @param marker string indicating what marker is used if DS. Indicates 
#' population frequencies if DA.
#' @param md A tibble of sample metadata, in the same order as the colnames of Mat
#' @return a list of ggplot objects
PlotWrapper <- function(clusters, 
                        Mat, 
                        test = "wilcox.test", 
                        Add_p = TRUE, 
                        to_save = FALSE, 
                        Comp_conds, 
                        Control_cond,
                        plot_type = "da",
                        marker = "frequencies",
                        md) {

    # The plotting loop
    total_plots <- lapply(clusters, function(i) {
        # Get the information in plottable format
        percents <- Mat[i,] %>% unlist()
        if(sum(percents) == 0) { # Otherwise will print jittered garbage around zero
            return()
        }
        toplot <- tibble(immune = percents)
        
        # This will change depending on how you do it
        # Note that you can't do as.factor here
        # toplot$group <- factor(names(percents) %>%
        #                            sub("_[^_]+$", "", .), levels = Comp_conds)
        toplot$group <- factor(md$group_id, levels = Comp_conds)
        
        #group_levels <- comp_conds 
        #toplot$group <- factor(toplot$group, levels = group_levels)
        
        # The actual plotting
        p <- ggplot(toplot, aes(group, immune)) + 
            geom_boxplot(outlier.shape = NA) +
            geom_jitter()
        
            
        p <- p + theme(text = element_text(size=18))
        p <- p + theme(axis.text.x=element_text(angle=30, hjust=1))
        
        if(plot_type == "da") {
            p <- p + labs(y = "percent all cells") 
            p <- p + ggtitle(paste("Cluster", i, sep = " "))
        } else if(plot_type == "ds") {
            p <- p + labs(y = "marker expression") 
            p <- p + ggtitle(paste("Cluster", i, marker, sep = " "))
        }
        
        # Adding error bars
        if(Add_p) {
            # Re-ordering so control goes first
            my_comparisons <- Comp_conds %>%
                .[-which(comp_conds %in% control_cond)]
            
            my_comparisons <- lapply(my_comparisons, function(i) {
                c(Control_cond, i)
            })
            
            p <- p + ggpubr::stat_compare_means(method = test, 
                                        aes(label = ..p.signif..),
                                        comparisons = my_comparisons,
                                        hide.ns = FALSE) 
            
            p$layers[[gginnards::which_layers(p, "GeomSignif")]]$aes_params$textsize <- 6
        }
        
        if(to_save) {
            ggsave(width = 7, height = 7, filename = paste("cluster", 
                                                           i, 
                                                           test, 
                                                           marker,
                                                           "jpg", 
                                                           sep = "."))
        }
        
        message(paste("cluster", i, "plot produced"))
        
        return(p)
    })
    
    return(total_plots)
}

Ok, now let’s select a subset of clusters that have the lowest adjusted p-values. We’ll go with the Mann Whitney U test to be consistent. Always look at the y axis, so you can see the frequency range of the cluster of interest. Sometimes you’ll get a significant result for a cluster that is less than one percent in frequency, in which case you should double check and see if that makes sense.

cluster_id <- edger_table[1:5,]$cluster_id
comp_conds <- unique(md$group_id)
control_cond <- "D0" 

PlotWrapper(
    clusters = cluster_id,
    Mat = d_freqs,
    to_save = FALSE,
    Comp_conds = comp_conds,
    Control_cond = control_cond,
    md = md
)

## cluster 10 plot produced

## cluster 12 plot produced

## cluster 9 plot produced

## cluster 15 plot produced

## cluster 6 plot produced

## [[1]]

## Warning in wilcox.test.default(c(0.089, 0.077, 0.122, 0.114, 0.066, 0.191, :
## cannot compute exact p-value with ties

## 
## [[2]]

## Warning in wilcox.test.default(c(0.022, 0.034, 0.058, 0.051, 0.011, 0.046, :
## cannot compute exact p-value with ties

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

Dimension reduction

Ok, the next thing we’re going to do is dimension reduction. With this, you can get a feel for how many cell subsets your data might have and how big each subset is. Don’t read into it any more than that.

We are going to use UMAP, as we did in the one fcs file example. It is relatively simple to run. Here is how you do it with the data, in its current format.

library(tidyverse)
library(umap)
set.seed(1234)

# Subsample the data and filter by surface markers
dimr_sample <- 10000
d_sub <- d_se[sample(nrow(d_se@elementMetadata), dimr_sample),]
surface <- d_se@colData$marker_name[d_se@colData$marker_class == "type"]

# Get the map in tibble form
dimr <- umap::umap(d = d_sub@assays@data$exprs[,surface])$layout %>% as_tibble()

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

names(dimr) <- c("umap1", "umap2")

Now, we use a dimension reduction visualization function to visualize the map and color it by what we want to see.

#' @title Visualize dimension reduction plots 
#' @description Plot dimension reduction map of interest colored by dimension 
#' reduction comparison object of interest, with a blue to red color gradient.
#' Color palette is an attempt at the Cytobank rainbow palette.
#' @param cells Tibble of cells by features that the dimr was computed from
#' @param dimr Tibble of dimension reduction axis 1 and axis 2
#' @param marker vector of values for a particular marker
#' @param lim_min Minimum value to be colored
#' @param lim_max Maximum value to be colored
#' @return ggplot2 object
PlotViz <- function(cells, dimr, marker, lim_min = NULL, lim_max = NULL) {
    
    p <- qplot(x = dimr[[1]], 
               y = dimr[[2]], 
               color = cells[[marker]], 
               xlab = names(dimr[1]), 
               ylab = names(dimr[2])) + 
        theme(legend.title = element_blank()) +
        geom_point(shape = ".") +
        ggtitle(marker)
    
    if(is.factor(cells[[marker]])) {
        return(p)
    }
    
    if(!is.null(lim_min)) {
        p <- p + scale_color_gradientn(colors = c("blue", "cyan", "springgreen", "yellow", "orange", "red", "red4"), limits = c(lim_min, lim_max)) 
    } else {
        p <- p + scale_color_gradientn(colors = c("blue", "cyan", "springgreen", "yellow", "orange", "red", "red4")) 
    }
    
    return(p)
}

Now let’s visualize a few markers as before.

# Get the data out of the SummarizedExperiment class
out_cells <- as_tibble(d_sub@assays@data$exprs)
PlotViz(cells = out_cells, dimr = dimr, marker = '170Er_CD3')

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

PlotViz(cells = out_cells, dimr = dimr, marker = '142Nd_CD19')

library(patchwork)

p_list <- lapply(names(out_cells[,surface][1:9]), function(i) {
    PlotViz(cells = out_cells, dimr = dimr, marker = i)
})

(p_list[[1]] + p_list[[2]] + p_list[[3]]) / (p_list[[4]] + p_list[[5]] + p_list[[6]]) / (p_list[[7]] + p_list[[8]] + p_list[[9]])

Ok, now let’s do some plotting of our metadata. To do that, we need the metadata label cell by cell. Here is what that looks like.

md_percell <- d_sub@elementMetadata
md_percell

## DataFrame with 10000 rows and 10 columns
##                                                 file_name    plate patient_id
##                                                  <factor> <factor>   <factor>
## 1     Plate19_8016_TT_D0_TD_Live Intact Singlets.fcs       Plate19       8016
## 2     Plate18_8763_TT_D0_nTD_Live Intact Singlets.fcs      Plate18       8763
## 3     Plate11_8842_TT_Dneg21_nTD_Live Intact Singlets.fcs  Plate11       8842
## 4     Plate14_8527_TT_Dneg21_TD_Live Intact Singlets.fcs   Plate14       8527
## 5     Plate11_8842_TT_D0_nTD_Live Intact Singlets.fcs      Plate11       8842
## ...                                                   ...      ...        ...
## 9996  Plate19_8016_TT_D0_TD_Live Intact Singlets.fcs       Plate19       8016
## 9997  Plate15_8144_TT_D0_nTD_Live Intact Singlets.fcs      Plate15       8144
## 9998  Plate19_8016_TT_D0_TD_Live Intact Singlets.fcs       Plate19       8016
## 9999  Plate16_8733_TT_Dneg21_nTD_Live Intact Singlets.fcs  Plate16       8733
## 10000 Plate16_8733_TT_Dneg21_nTD_Live Intact Singlets.fcs  Plate16       8733
##        vaccine group_id diagnosis            gating_status sample_id read_error
##       <factor> <factor>  <factor>                 <factor>  <factor>   <factor>
## 1           TT   D0           TD  Live Intact Singlets.fcs D0_19          FALSE
## 2           TT   D0           nTD Live Intact Singlets.fcs D0_17          FALSE
## 3           TT   Dneg21       nTD Live Intact Singlets.fcs Dneg21_4       FALSE
## 4           TT   Dneg21       TD  Live Intact Singlets.fcs Dneg21_10      FALSE
## 5           TT   D0           nTD Live Intact Singlets.fcs D0_3           FALSE
## ...        ...      ...       ...                      ...       ...        ...
## 9996        TT   D0           TD  Live Intact Singlets.fcs D0_19          FALSE
## 9997        TT   D0           nTD Live Intact Singlets.fcs D0_11          FALSE
## 9998        TT   D0           TD  Live Intact Singlets.fcs D0_19          FALSE
## 9999        TT   Dneg21       nTD Live Intact Singlets.fcs Dneg21_14      FALSE
## 10000       TT   Dneg21       nTD Live Intact Singlets.fcs Dneg21_14      FALSE
##       cluster_id
##         <factor>
## 1             3 
## 2             7 
## 3             7 
## 4             14
## 5             1 
## ...          ...
## 9996          1 
## 9997          14
## 9998          1 
## 9999          7 
## 10000         1

We’ll start with that. Anyway, let’s look a at cluster ID and group_id.

out_cells <- bind_cols(out_cells, as_tibble(md_percell))
PlotViz(cells = out_cells, dimr = dimr, marker = "group_id")

PlotViz(cells = out_cells, dimr = dimr, marker = "cluster_id")

The group ID appears to be well-mixed, with some exceptions. But if you really wanted to examine this, you would simply look at every cell’s k-nearest neighbors in terms of the percentage of the two groups. You can read more about it in my pre-print here. Look at figure 2.

What if we want to visualize the map by p-values? Rather than relying on any package to have the element metadata served up for us, we need a way assign the per-cluster statistics to each cell. Then, we would have to take the negative log10 of the p values (0.01 -> 2, 0.001 -> 3, etc) so they can fit on a color scheme that would otherwise be smashed to zero.

We will do that now. We’re going to iterate through the cluster ID column of our data, and for each cluster, we’re going to do a lookup of the corresponding p-values of interest. We’ll use the diffcyt p-values for now, but you can switch out the following block of code with whatever table of p-values you please.

To refresh your memory, here’s the EdgeR table that was previously calculated. It has p-values for every cluster.

# We're using the whole thing. This is just a visualization.
edger_table[1:10,]

## # A tibble: 10 × 3
##    cluster_id      p_val     p_adj
##    <fct>           <dbl>     <dbl>
##  1 10         0.00000143 0.0000214
##  2 12         0.0000637  0.000478 
##  3 9          0.0911     0.342    
##  4 15         0.0790     0.342    
##  5 6          0.130      0.391    
##  6 11         0.212      0.530    
##  7 5          0.294      0.586    
##  8 8          0.313      0.586    
##  9 4          0.581      0.792    
## 10 7          0.533      0.792

neg_log10_p_adj <- lapply(out_cells$cluster_id, function(i) {
    result <- dplyr::filter(edger_table, cluster_id == i)$p_adj
}) %>% unlist()

neg_log10_p_adj <- -log10(neg_log10_p_adj)

Great. Now we’re going to color the map by the p-value itself so you can see what regions of the map are statistically significant.

out_cells$neg_log10_p_adj <- neg_log10_p_adj
PlotViz(cells = out_cells, dimr = dimr, marker = "neg_log10_p_adj")

Ok, and that will do it for now. I hope this helps you in putting together a CyTOF pipeline for your data. Additional tools and visualizations are available with CyTOF workflow, here. There are other rabbit holes we can go down around QC, but that will be for another time.