The purpose of this document is to provide a practical guide to the analysis of CyTOF data when we’re dealing with multiple fcs files, comparing between two groups. My hope is that this markdown helps users understand best practices for CyTOF analysis, and lets them reuse some of the code here that has proven beneficial to me over the years. The dataset we are importing comes from Cross et al., Frontiers in Immunology 2020. The fcs files can be found on Flow Repository here.
First, we load some libraries we will be using.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(diffcyt)
library(flowCore)
library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
##
## The following object is masked from 'package:dplyr':
##
## count
##
##
## Attaching package: 'MatrixGenerics'
##
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
##
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
##
## The following object is masked from 'package:flowCore':
##
## normalize
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
##
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
##
## The following objects are masked from 'package:lubridate':
##
## second, second<-
##
## The following objects are masked from 'package:dplyr':
##
## first, rename
##
## The following object is masked from 'package:tidyr':
##
## expand
##
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
##
## The following object is masked from 'package:lubridate':
##
## %within%
##
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
##
## The following object is masked from 'package:purrr':
##
## reduce
##
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
##
## Attaching package: 'Biobase'
##
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
##
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
library(here)
## here() starts at /Users/tylerburns/Documents/projects/cytof_pipeline_many_files
Next, we load our patient metadata, which we will use to distinguish between control and treatment. Note that I’m removing one of the groups below (Dneg28): for the purposes of this markdown, I want to focus on a comparison between two groups rather than three.
setwd(here::here("data", "FlowRepository_FR-FCM-Z2GG_files", "metadata"))
# There's a weird file-read bug where a handful of files can't be read;
# these are flagged in the read_error column and filtered out below
# Read in the metadata
md <- readr::read_csv("experiment_info.csv")
## Rows: 93 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): file_name, plate, vaccine, day, diagnosis, gating_status, sample_id
## dbl (1): patient_id
## lgl (1): read_error
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
md <- dplyr::filter(md, read_error == FALSE)
# Filter by vaccine
md <- dplyr::filter(md, vaccine == "TT")
# Removing one of the time points so we can focus just on two
md <- md[!grepl("Dneg28", md$day),]
# Set day as group_id
names(md)[which(names(md) == "day")] <- "group_id"
# Set sample ID (for diffcyt)
md$sample_id <- paste0(md$group_id, "_", seq(nrow(md)))
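Before reading any files in, it’s worth a quick sanity check on the metadata. A minimal sketch, just counting samples per group and per patient:

table(md$group_id)
dplyr::count(md, patient_id, group_id)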
The first thing we’re going to do is read in the fcs files of interest as a list of flow frames. If you want a detailed primer on the components of reading in a single fcs file, please visit my tutorial here.
setwd(here::here("data", "FlowRepository_FR-FCM-Z2GG_files"))
set.seed(1234)
ff_list <- lapply(md$file_name, function(i) {
  # Get the flow frame object
  ff <- flowCore::read.FCS(i)
  # Random subsample (note how we can treat the flow frame like a dataframe;
  # this assumes every file contains at least 100,000 cells)
  ff <- ff[sample(nrow(ff), 100000),]
  return(ff)
})
names(ff_list) <- md$file_name
ff_list[1:3]
## $`Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs`
## flowFrame object 'Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs'
## with 100000 cells and 52 observables:
## name desc range minRange maxRange
## $P1 Bi209Di 209Bi_CD16 3888 0 3888
## $P2 Ce140Di 140Ce_Beads 6236 0 6236
## $P3 Center Center 3611 0 3611
## $P4 Dy161Di 161Dy_CD26 2397 0 2397
## $P5 Dy162Di 162Dy_CD8a 1695 0 1695
## ... ... ... ... ... ...
## $P48 Yb173Di 173Yb_a4b7 505 0 505
## $P49 Yb174Di 174Yb_PD-1 2698 0 2698
## $P50 Yb176Di 176Yb_CLA 9942 0 9942
## $P51 beadDist beadDist 237 0 237
## $P52 Time Time 8557 0 8557
## 498 keywords are stored in the 'description' slot
##
## $`Plate10_8602_TT_Dneg21_nTD_Live Intact Singlets.fcs`
## flowFrame object 'Plate10_8602_TT_Dneg21_nTD_Live Intact Singlets.fcs'
## with 100000 cells and 52 observables:
## name desc range minRange maxRange
## $P1 Bi209Di 209Bi_CD16 15162 0 15162
## $P2 Ce140Di 140Ce_Beads 6798 0 6798
## $P3 Center Center 3371 0 3371
## $P4 Dy161Di 161Dy_CD26 2118 0 2118
## $P5 Dy162Di 162Dy_CD8a 1408 0 1408
## ... ... ... ... ... ...
## $P48 Yb173Di 173Yb_a4b7 2126 0 2126
## $P49 Yb174Di 174Yb_PD-1 522 0 522
## $P50 Yb176Di 176Yb_CLA 10401 0 10401
## $P51 beadDist beadDist 257 0 257
## $P52 Time Time 8557 0 8557
## 498 keywords are stored in the 'description' slot
##
## $`Plate11_8842_TT_D0_nTD_Live Intact Singlets.fcs`
## flowFrame object 'Plate11_8842_TT_D0_nTD_Live Intact Singlets.fcs'
## with 100000 cells and 52 observables:
## name desc range minRange maxRange
## $P1 Bi209Di 209Bi_CD16 1735 0 1735
## $P2 Ce140Di 140Ce_Beads 6656 0 6656
## $P3 Center Center 6510 0 6510
## $P4 Dy161Di 161Dy_CD26 1944 0 1944
## $P5 Dy162Di 162Dy_CD8a 1651 0 1651
## ... ... ... ... ... ...
## $P48 Yb173Di 173Yb_a4b7 210 0 210
## $P49 Yb174Di 174Yb_PD-1 416 0 416
## $P50 Yb176Di 176Yb_CLA 8220 0 8220
## $P51 beadDist beadDist 231 0 231
## $P52 Time Time 19777 0 19777
## 496 keywords are stored in the 'description' slot
Now we load the marker data.
setwd(here::here("data", "FlowRepository_FR-FCM-Z2GG_files", "metadata"))
marker_info <- readr::read_csv("markers_edited.csv")
## Rows: 52 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): channel_name, marker_name, marker_class
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
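One thing worth checking before we build the diffcyt object: the rows of marker_info should correspond, in order, to the columns of the flow frames. A minimal sketch of that check (colnames() on a flow frame returns its channel names):

# The channel names in the first flow frame should match
# marker_info$channel_name, in the same order
all(colnames(ff_list[[1]]) == marker_info$channel_name)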
Now, we’re going to get our data into diffcyt, which will perform clustering and statistics for us.
d_se <- diffcyt::prepareData(d_input = ff_list, experiment_info = as.data.frame(md), marker_info = as.data.frame(marker_info))
d_se
## class: SummarizedExperiment
## dim: 2000000 52
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(9): file_name plate ... sample_id read_error
## colnames(52): 209Bi_CD16 140Ce_Beads ... beadDist Time
## colData names(3): channel_name marker_name marker_class
Now we transform the data directly within diffcyt.
d_se <- diffcyt::transformData(d_se, cofactor = 5)
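Under the hood this is the standard CyTOF arcsinh transform, asinh(x / cofactor). If you want to see what it does to raw values, here’s the equivalent by hand (illustration only, on made-up values):

# arcsinh with cofactor 5: roughly linear near zero, log-like for large values
raw <- c(0, 1, 10, 100, 1000)
asinh(raw / 5)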
Now we perform FlowSOM clustering directly within diffcyt. We’re going to do a 10x10 grid (100 clusters), collapsed into a smaller number of metaclusters. We have already defined which markers we’re going to cluster on in marker_info: they are the ones labeled “type.” We will print these here:
dplyr::filter(marker_info, marker_class == "type")$marker_name
## [1] "209Bi_CD16" "161Dy_CD26" "162Dy_CD8a" "163Dy_CD33" "164Dy_CD161"
## [6] "166Er_CCR10" "167Er_CCR7" "168Er_CCR9" "170Er_CD3" "151Eu_ICOS"
## [11] "153Eu_CD45RA" "155Gd_CD27" "156Gd_CXCR3" "158Gd_NKG2A" "160Gd_CD14"
## [16] "165Ho_CD127" "113In_CD57" "115In_HLA-DR" "175Lu_CD62L" "142Nd_CD19"
## [21] "144Nd_IgG" "145Nd_CD4" "146Nd_IgD" "148Nd_IgA" "150Nd_CD86"
## [26] "141Pr_CCR6" "147Sm_CD20" "149Sm_CD56" "152Sm_TCRgd" "154Sm_CD123"
## [31] "159Tb_CD11c" "169Tm_CD25" "171Yb_CXCR5" "172Yb_CD38" "173Yb_a4b7"
## [36] "174Yb_PD-1" "176Yb_CLA"
# Generate clusters
# note: include random seed for reproducible clustering
d_se <- diffcyt::generateClusters(d_se, seed_clustering = 123,
                                  xdim = 10,
                                  ydim = 10,
                                  meta_clustering = TRUE,
                                  meta_k = 15)
## FlowSOM clustering completed in 91.8 seconds
Now let’s find our cluster assignments. They live in rowData; look at the last column, cluster_id.
rowData(d_se)
## DataFrame with 2000000 rows and 10 columns
## file_name plate patient_id
## <factor> <factor> <factor>
## 1 Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs Plate10 8602
## 2 Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs Plate10 8602
## 3 Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs Plate10 8602
## 4 Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs Plate10 8602
## 5 Plate10_8602_TT_D0_nTD_Live Intact Singlets.fcs Plate10 8602
## ... ... ... ...
## 1999996 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs Plate19 8016
## 1999997 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs Plate19 8016
## 1999998 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs Plate19 8016
## 1999999 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs Plate19 8016
## 2000000 Plate19_8016_TT_Dneg21_TD_Live Intact Singlets.fcs Plate19 8016
## vaccine group_id diagnosis gating_status sample_id
## <factor> <factor> <factor> <factor> <factor>
## 1 TT D0 nTD Live Intact Singlets.fcs D0_1
## 2 TT D0 nTD Live Intact Singlets.fcs D0_1
## 3 TT D0 nTD Live Intact Singlets.fcs D0_1
## 4 TT D0 nTD Live Intact Singlets.fcs D0_1
## 5 TT D0 nTD Live Intact Singlets.fcs D0_1
## ... ... ... ... ... ...
## 1999996 TT Dneg21 TD Live Intact Singlets.fcs Dneg21_20
## 1999997 TT Dneg21 TD Live Intact Singlets.fcs Dneg21_20
## 1999998 TT Dneg21 TD Live Intact Singlets.fcs Dneg21_20
## 1999999 TT Dneg21 TD Live Intact Singlets.fcs Dneg21_20
## 2000000 TT Dneg21 TD Live Intact Singlets.fcs Dneg21_20
## read_error cluster_id
## <factor> <factor>
## 1 FALSE 13
## 2 FALSE 13
## 3 FALSE 7
## 4 FALSE 7
## 5 FALSE 4
## ... ... ...
## 1999996 FALSE 12
## 1999997 FALSE 12
## 1999998 FALSE 7
## 1999999 FALSE 12
## 2000000 FALSE 3
Let’s get a feel for the cluster frequency distribution. In the table below, the names are the cluster IDs and the elements are the numbers of cells in each cluster.
rowData(d_se)$cluster_id %>% table() %>% sort()
## .
## 5 11 6 10 9 2 8 15 12 13 3
## 721 2446 4405 8328 15760 20170 34046 46853 71970 164485 171308
## 4 14 1 7
## 187487 241512 281237 749272
Now we are going to make our frequency and expression tables for statistics and visualization. Diffcyt has built-in functionality for that.
We will start with the counts table: the number of cells per cluster per sample. The result is a new SummarizedExperiment object, so I have to use the “assay” accessor within the SummarizedExperiment framework to get at it. If you’re confused, don’t worry. It took me a while to figure these things out too.
d_counts <- diffcyt::calcCounts(d_se = d_se)
SummarizedExperiment::assay(d_counts)[1:10, 1:10]
## D0_1 Dneg21_2 D0_3 Dneg21_4 D0_5 Dneg21_6 D0_7 Dneg21_8 D0_9 Dneg21_10
## 1 14448 11149 5467 6254 19902 19401 14655 26473 14432 21418
## 2 1411 1582 363 449 619 1057 1376 2278 761 1391
## 3 5042 5244 1353 1457 9178 10273 9123 9880 4525 5526
## 4 10022 7259 5815 5678 14141 10475 9106 24310 10014 9199
## 5 28 42 39 9 84 47 40 18 27 61
## 6 642 564 44 37 155 85 231 14 279 28
## 7 36825 37075 62636 63714 30405 29311 53953 10618 33049 24695
## 8 971 1125 2525 2700 1946 1965 1237 3316 2279 3731
## 9 1192 706 378 476 873 410 733 1184 705 457
## 10 89 1095 77 684 122 1000 114 1752 66 167
Ok, let’s convert these to frequencies, expressed as a percentage of total cells for the file.
# The frequency table itself, derived from the counts in the d_counts SummarizedExperiment
# Note that the sample metadata stored in the SummarizedExperiment is in the
# same column order as this table
d_freqs <- SummarizedExperiment::assay(d_counts) %>% apply(., 2, function(i) {
  result <- i/sum(i)
  return(result*100)
})
d_freqs[1:10, 1:10]
## D0_1 Dneg21_2 D0_3 Dneg21_4 D0_5 Dneg21_6 D0_7 Dneg21_8 D0_9
## 1 14.448 11.149 5.467 6.254 19.902 19.401 14.655 26.473 14.432
## 2 1.411 1.582 0.363 0.449 0.619 1.057 1.376 2.278 0.761
## 3 5.042 5.244 1.353 1.457 9.178 10.273 9.123 9.880 4.525
## 4 10.022 7.259 5.815 5.678 14.141 10.475 9.106 24.310 10.014
## 5 0.028 0.042 0.039 0.009 0.084 0.047 0.040 0.018 0.027
## 6 0.642 0.564 0.044 0.037 0.155 0.085 0.231 0.014 0.279
## 7 36.825 37.075 62.636 63.714 30.405 29.311 53.953 10.618 33.049
## 8 0.971 1.125 2.525 2.700 1.946 1.965 1.237 3.316 2.279
## 9 1.192 0.706 0.378 0.476 0.873 0.410 0.733 1.184 0.705
## 10 0.089 1.095 0.077 0.684 0.122 1.000 0.114 1.752 0.066
## Dneg21_10
## 1 21.418
## 2 1.391
## 3 5.526
## 4 9.199
## 5 0.061
## 6 0.028
## 7 24.695
## 8 3.731
## 9 0.457
## 10 0.167
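As an aside, you can get the same table with base R’s prop.table if you prefer; a one-line sketch:

# Column-wise proportions from the counts, scaled to percentages
d_freqs_alt <- prop.table(SummarizedExperiment::assay(d_counts), margin = 2) * 100
all.equal(d_freqs, d_freqs_alt)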
Great. Now let’s look at marker expression per cluster, so we know what each cluster expresses and can annotate the clusters accordingly.
# Expression heatmap
cluster_exp <- diffcyt::calcMediansByClusterMarker(d_se) %>%
SummarizedExperiment::assay()
cluster_exp[1:10, 1:10]
## 209Bi_CD16 161Dy_CD26 162Dy_CD8a 163Dy_CD33 164Dy_CD161 166Er_CCR10
## 1 0.000000 0.43597161 0.46041100 4.39457615 0.32049829 2.2342883
## 2 3.140056 0.01549244 0.00000000 1.95843842 0.00000000 0.0000000
## 3 5.017952 0.00000000 0.09018378 0.09013180 3.14408704 0.0000000
## 4 0.000000 0.00000000 0.08877896 0.01528608 0.00000000 0.3031054
## 5 0.000000 2.25675482 0.80250410 3.74086862 0.39427732 1.6774147
## 6 0.000000 2.78207652 0.37146481 2.20336332 0.01313306 0.0000000
## 7 0.000000 2.58985497 0.08170316 0.00000000 0.00000000 0.6258491
## 8 0.000000 2.58480716 0.43751613 0.26018413 0.13605166 3.6969467
## 9 0.000000 0.95637654 0.00000000 0.26293038 0.00000000 0.0000000
## 10 0.000000 0.00000000 0.12264340 0.02304905 0.01821971 2.5493985
## 167Er_CCR7 168Er_CCR9 170Er_CD3 151Eu_ICOS
## 1 0.3305481 0.00000000 1.7772263 0.00000000
## 2 0.0000000 0.00000000 0.2443670 0.06368639
## 3 0.0000000 0.00000000 0.5818328 0.01601792
## 4 3.0851052 0.08584389 0.2407394 0.15491648
## 5 3.4882836 0.54234434 5.4409847 0.52329368
## 6 0.0895010 0.00000000 1.2134195 0.00000000
## 7 4.1672967 1.12878605 5.7767146 0.70985307
## 8 1.9618513 0.50304565 5.3912394 0.88943397
## 9 0.4989065 0.00000000 0.7481135 0.47584247
## 10 0.1794202 0.00000000 1.2114854 0.31928028
pheatmap::pheatmap(cluster_exp, labels_row = 1:nrow(cluster_exp), fontsize = 7)
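Because the markers sit on very different scales, it can also help to z-score each marker before clustering the heatmap. pheatmap can do this with its scale argument; a sketch:

# Z-score each marker (column) so expression is relative within a marker
pheatmap::pheatmap(cluster_exp, scale = "column",
                   labels_row = 1:nrow(cluster_exp), fontsize = 7)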
If this section is too much for the reader (it requires a working knowledge of regression modeling and edgeR’s vocabulary around it), then skip to the next section, which uses simple statistical tests between conditions.
We’re going to do some statistics on the cluster counts, and then do some visualizations. To do this, we’re going to use edgeR, a generalized linear model framework that diffcyt wraps. Just think of it as a t-test, but more rigorous. We’re not using a linear model to make a prediction; we’re using it to perform a statistical test.
First, you have to make what is called a design matrix. Before you go any further, please go here, and quickly skim Figure 1 (on categorical variables), Figure 2 (design and contrast matrix), and section 4.2 and 4.4. This should help you understand what we’re doing with respect to linear models and the design matrix.
If you want a nice video-based explanation of the design matrix, watch this.
The design matrix below has rows as samples, and columns are the model parameters.
design <- diffcyt::createDesignMatrix(experiment_info = md, cols_design = 'group_id')
# This makes the names acceptable for input into downstream functions by removing
# "unacceptable" characters. Comes from base R.
colnames(design) <- make.names(colnames(design))
design[1:10,]
## X.Intercept. group_idDneg21
## 1 1 0
## 2 1 1
## 3 1 0
## 4 1 1
## 5 1 0
## 6 1 1
## 7 1 0
## 8 1 1
## 9 1 0
## 10 1 1
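As an aside: the same patients appear at both time points here, so if you wanted a paired analysis, you could block on patient by adding it to the design. A sketch (not run here; the contrast vector below would then need one entry per design column):

# Paired design: adjust for patient alongside group
design_paired <- diffcyt::createDesignMatrix(
  experiment_info = md,
  cols_design = c("group_id", "patient_id")
)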
From here, we create a contrast matrix. For this comparison, with a design matrix of two conditions, it’s literally just going to be a 0 and a 1.
In the contrast matrix, rows are the model parameters (the columns of the design matrix: the intercept and the experimental group), and columns are contrasts of interest. In this case there is only one test, D0 vs. Dneg21, so there is only one column.
A thread that helped me understand the contrast matrix specifically when I first encountered it can be found here.
contrast <- diffcyt::createContrast(contrast = c(0, 1))
contrast
## [,1]
## [1,] 0
## [2,] 1
Finally, we feed these into the built-in edgeR statistics within diffcyt. Note that it outputs another SummarizedExperiment object.
# GLM from edgeR (there are options to do other types of modeling here)
res_DA <- diffcyt::testDA_edgeR(d_counts = d_counts, design = design, contrast = contrast)
res_DA
## class: SummarizedExperiment
## dim: 15 20
## metadata(0):
## assays(1): counts
## rownames(15): 1 2 ... 14 15
## rowData names(6): cluster_id logFC ... p_val p_adj
## colnames(20): D0_1 Dneg21_2 ... D0_19 Dneg21_20
## colData names(9): file_name plate ... sample_id read_error
Rather than tear open the SummarizedExperiment object, we can use a built-in function called topTable to see what is statistically significant and what is not between the control and experimental groups.
# Quick look at the results, ranked by adjusted p-value
edger_table <- diffcyt::topTable(res_DA, top_n = nrow(res_DA)) %>% as_tibble()
edger_table[1:10,]
## # A tibble: 10 × 3
## cluster_id p_val p_adj
## <fct> <dbl> <dbl>
## 1 10 0.00000143 0.0000214
## 2 12 0.0000637 0.000478
## 3 9 0.0911 0.342
## 4 15 0.0790 0.342
## 5 6 0.130 0.391
## 6 11 0.212 0.530
## 7 5 0.294 0.586
## 8 8 0.313 0.586
## 9 4 0.581 0.792
## 10 7 0.533 0.792
What if you don’t trust the design and contrast matrices above and need a simple sanity check? What if you don’t want to touch regression-based statistics and want something simpler? Then you can just use a Mann-Whitney U test. I prefer it over a t-test because it is non-parametric: you don’t have to assume anything about the shape of the underlying distributions.
Let’s look again at the freq table and see what info we need.
d_freqs[1:10, 1:10]
## D0_1 Dneg21_2 D0_3 Dneg21_4 D0_5 Dneg21_6 D0_7 Dneg21_8 D0_9
## 1 14.448 11.149 5.467 6.254 19.902 19.401 14.655 26.473 14.432
## 2 1.411 1.582 0.363 0.449 0.619 1.057 1.376 2.278 0.761
## 3 5.042 5.244 1.353 1.457 9.178 10.273 9.123 9.880 4.525
## 4 10.022 7.259 5.815 5.678 14.141 10.475 9.106 24.310 10.014
## 5 0.028 0.042 0.039 0.009 0.084 0.047 0.040 0.018 0.027
## 6 0.642 0.564 0.044 0.037 0.155 0.085 0.231 0.014 0.279
## 7 36.825 37.075 62.636 63.714 30.405 29.311 53.953 10.618 33.049
## 8 0.971 1.125 2.525 2.700 1.946 1.965 1.237 3.316 2.279
## 9 1.192 0.706 0.378 0.476 0.873 0.410 0.733 1.184 0.705
## 10 0.089 1.095 0.077 0.684 0.122 1.000 0.114 1.752 0.066
## Dneg21_10
## 1 21.418
## 2 1.391
## 3 5.526
## 4 9.199
## 5 0.061
## 6 0.028
## 7 24.695
## 8 3.731
## 9 0.457
## 10 0.167
Each p-value will come from a test on each cluster. The cluster is the row. The conditions are in the column names, before the underscore. So we have all the information we need. Let’s run the test on the first row, so you can see what’s going on.
Note that the function is called “wilcox.test.” From the description of the function: “Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as ‘Mann-Whitney’ test.”
test <- d_freqs[1,]
test
## D0_1 Dneg21_2 D0_3 Dneg21_4 D0_5 Dneg21_6 D0_7 Dneg21_8
## 14.448 11.149 5.467 6.254 19.902 19.401 14.655 26.473
## D0_9 Dneg21_10 D0_11 Dneg21_12 D0_13 Dneg21_14 D0_15 Dneg21_16
## 14.432 21.418 9.278 15.277 18.582 16.296 15.347 12.843
## D0_17 Dneg21_18 D0_19 Dneg21_20
## 6.212 4.775 25.291 3.737
c1 <- test[grep("D0", names(test))]
c2 <- test[grep("Dneg21", names(test))]
wilcox.test(c1, c2)
##
## Wilcoxon rank sum exact test
##
## data: c1 and c2
## W = 52, p-value = 0.9118
## alternative hypothesis: true location shift is not equal to 0
Ok, now we just have to do that for every other cluster.
p_mwu <- lapply(seq(nrow(d_freqs)), function(i) {
  curr <- d_freqs[i,]
  c1 <- curr[grep("Dneg21", names(curr))]
  c2 <- curr[grep("D0", names(curr))]
  wilcox.test(c1, c2)$p.value
}) %>% unlist()
## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties
## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties
## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties
## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties
## Warning in wilcox.test.default(c1, c2): cannot compute exact p-value with ties
And now we do the FDR adjustment to account for the multiple comparisons: with 15 tests, some will look statistically significant by chance alone.
p_mwu_adj <- p.adjust(p_mwu, method = "fdr")
p_mwu_adj
## [1] 0.9117972 0.7330827 0.7330827 0.7330827 0.7330827 0.2621295 0.7330827
## [8] 0.7330827 0.2621295 0.2621295 0.7330827 0.7330827 0.7330827 0.7330827
## [15] 0.7330827
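If you’re curious what method = "fdr" (Benjamini-Hochberg) actually does, it’s a short computation. A sketch that mirrors what stats::p.adjust does internally:

# Benjamini-Hochberg by hand: scale each p-value by n/rank, then enforce
# monotonicity from the largest p-value downward
bh_adjust <- function(p) {
  n <- length(p)
  o <- order(p, decreasing = TRUE)
  pmin(1, cummin(n / (n:1) * p[o]))[order(o)]
}
all.equal(bh_adjust(p_mwu), p_mwu_adj)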
Now we can make a statistics table.
mwu_table <- tibble(cluster = seq(nrow(d_freqs)), p_value = p_mwu, p_adj = p_mwu_adj) %>%
  arrange(p_adj)
mwu_table
## # A tibble: 15 × 3
## cluster p_value p_adj
## <int> <dbl> <dbl>
## 1 6 0.0433 0.262
## 2 9 0.0524 0.262
## 3 10 0.0283 0.262
## 4 2 0.684 0.733
## 5 3 0.684 0.733
## 6 4 0.218 0.733
## 7 5 0.364 0.733
## 8 7 0.529 0.733
## 9 8 0.257 0.733
## 10 11 0.345 0.733
## 11 12 0.520 0.733
## 12 13 0.684 0.733
## 13 14 0.579 0.733
## 14 15 0.579 0.733
## 15 1 0.912 0.912
I would say that a good test of whether your edgeR statistics are done correctly is whether they are at least similar to what you see with the Mann-Whitney U test here. Spoiler alert: cluster 12 is flagged as significant by edgeR but not by the Mann-Whitney U test. The plots below will show why: the presence of an outlier. When you see extreme outliers like this, it’s a good idea to look at the offending file, figure out why it is such an outlier, and decide whether it should be removed from the analysis.
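A quick way to find the offending sample is the standard boxplot outlier rule on that cluster’s frequencies; a sketch:

# Frequencies for cluster 12 across samples
f12 <- d_freqs[12, ]
# Flag anything beyond 1.5 * IQR above the third quartile
f12[f12 > quantile(f12, 0.75) + 1.5 * IQR(f12)]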
Now, we’re going to use a plotting function I wrote to visualize the clusters that are statistically significant. This took me a bit of time to optimize, but it gives you boxplots (in the ggplot2 style) with the statistics and *’s to indicate significance, as you would see in a standard publication.
I advise always looking at the boxplots, as you can start to see which significant results are relevant and which are the result of things like outliers or, something I commonly see, clusters that simply have very low frequencies. Accordingly, when you look at the box plots below, always look at the y axis so you know what frequency range you’re dealing with (e.g., 10-20 percent versus 0.1-1 percent).
Let’s introduce the function. It’s complex, so don’t worry if it looks like a bit much. It’s optimized for both differential abundance and differential expression. In this markdown we’re only dealing with the former, so if you’re a bioinformatician, you can look at the function, see what I did, and simplify it to your liking. Note that it is also built for multiple clusters as input; the loop is built in. Again, if you would rather write the loop yourself, just pull out the components within lapply as needed.
Note that the statistics are the Mann-Whitney U test again.
library(ggpubr)
#' @title Wrapper for plotting
#' @description Produces per cluster plots of control vs experimental condition
#' @param clusters Total Cluster IDs
#' @param Mat cluster frequency table
#' @param test Test that will be used if including statistical testing within
#' the boxplots
#' @param Add_p Whether to add a p-value to the plot itself
#' @param to_save whether to save the plots
#' @param Comp_conds the names of the conditions used in the experiment
#' @param Control_cond the name of the condition that is the control. It will
#' be placed on the left side
#' @param plot_type either da or ds
#' @param marker string indicating what marker is used if DS. Indicates
#' population frequencies if DA.
#' @param md A tibble of sample metadata, in the same order as the colnames of Mat
#' @return a list of ggplot objects
PlotWrapper <- function(clusters,
                        Mat,
                        test = "wilcox.test",
                        Add_p = TRUE,
                        to_save = FALSE,
                        Comp_conds,
                        Control_cond,
                        plot_type = "da",
                        marker = "frequencies",
                        md) {
  # The plotting loop
  total_plots <- lapply(clusters, function(i) {
    # Get the information in plottable format
    percents <- Mat[i,] %>% unlist()
    if(sum(percents) == 0) { # Otherwise will print jittered garbage around zero
      return()
    }
    toplot <- tibble(immune = percents)
    # Group labels come from the sample metadata, which is in the same order
    # as the columns of Mat (note that you can't use as.factor here, because
    # we need to control the level order)
    toplot$group <- factor(md$group_id, levels = Comp_conds)
    # The actual plotting
    p <- ggplot(toplot, aes(group, immune)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter()
    p <- p + theme(text = element_text(size = 18))
    p <- p + theme(axis.text.x = element_text(angle = 30, hjust = 1))
    if(plot_type == "da") {
      p <- p + labs(y = "percent all cells")
      p <- p + ggtitle(paste("Cluster", i, sep = " "))
    } else if(plot_type == "ds") {
      p <- p + labs(y = "marker expression")
      p <- p + ggtitle(paste("Cluster", i, marker, sep = " "))
    }
    # Adding significance bars
    if(Add_p) {
      # Pair the control with each remaining condition, control first
      my_comparisons <- Comp_conds %>%
        .[-which(Comp_conds %in% Control_cond)]
      my_comparisons <- lapply(my_comparisons, function(i) {
        c(Control_cond, i)
      })
      p <- p + ggpubr::stat_compare_means(method = test,
                                          aes(label = ..p.signif..),
                                          comparisons = my_comparisons,
                                          hide.ns = FALSE)
      p$layers[[gginnards::which_layers(p, "GeomSignif")]]$aes_params$textsize <- 6
    }
    if(to_save) {
      ggsave(width = 7, height = 7, filename = paste("cluster",
                                                     i,
                                                     test,
                                                     marker,
                                                     "jpg",
                                                     sep = "."))
    }
    message(paste("cluster", i, "plot produced"))
    return(p)
  })
  return(total_plots)
}
Ok, now let’s select a subset of clusters that have the lowest adjusted p-values. We’ll take the top five from the edgeR table above; the statistics drawn on the plots themselves are the Mann-Whitney U test. Always look at the y axis, so you can see the frequency range of the cluster of interest. Sometimes you’ll get a significant result for a cluster that is less than one percent in frequency, in which case you should double check whether that makes sense.
cluster_id <- edger_table[1:5,]$cluster_id
comp_conds <- unique(md$group_id)
control_cond <- "D0"
PlotWrapper(
  clusters = cluster_id,
  Mat = d_freqs,
  to_save = FALSE,
  Comp_conds = comp_conds,
  Control_cond = control_cond,
  md = md
)
## cluster 10 plot produced
## cluster 12 plot produced
## cluster 9 plot produced
## cluster 15 plot produced
## cluster 6 plot produced
## [[1]]
## Warning in wilcox.test.default(c(0.089, 0.077, 0.122, 0.114, 0.066, 0.191, :
## cannot compute exact p-value with ties
##
## [[2]]
## Warning in wilcox.test.default(c(0.022, 0.034, 0.058, 0.051, 0.011, 0.046, :
## cannot compute exact p-value with ties
##
## [[3]]
##
## [[4]]
##
## [[5]]
Ok, the next thing we’re going to do is dimension reduction. With this, you can get a feel for how many cell subsets your data might have and how big each subset is. Don’t read into it any more than that.
We are going to use UMAP, as we did in the one-fcs-file example. It is relatively simple to run. Here is how you do it with the data in its current format.
library(tidyverse)
library(umap)
set.seed(1234)
# Subsample the data and filter by surface markers
dimr_sample <- 10000
d_sub <- d_se[sample(nrow(d_se@elementMetadata), dimr_sample),]
surface <- d_se@colData$marker_name[d_se@colData$marker_class == "type"]
# Get the map in tibble form
dimr <- umap::umap(d = d_sub@assays@data$exprs[,surface])$layout %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
names(dimr) <- c("umap1", "umap2")
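If you want to tune UMAP (say, a larger neighborhood for smoother global structure), the umap package takes a config object copied from its defaults. A sketch, not run here:

# Copy the package defaults and adjust the neighborhood size
cfg <- umap::umap.defaults
cfg$n_neighbors <- 30
# Then call umap::umap(..., config = cfg) as above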
Now, we use a dimension reduction visualization function to visualize the map and color it by what we want to see.
#' @title Visualize dimension reduction plots
#' @description Plot dimension reduction map of interest colored by dimension
#' reduction comparison object of interest, with a blue to red color gradient.
#' Color palette is an attempt at the Cytobank rainbow palette.
#' @param cells Tibble of cells by features that the dimr was computed from
#' @param dimr Tibble of dimension reduction axis 1 and axis 2
#' @param marker vector of values for a particular marker
#' @param lim_min Minimum value to be colored
#' @param lim_max Maximum value to be colored
#' @return ggplot2 object
PlotViz <- function(cells, dimr, marker, lim_min = NULL, lim_max = NULL) {
  p <- qplot(x = dimr[[1]],
             y = dimr[[2]],
             color = cells[[marker]],
             xlab = names(dimr[1]),
             ylab = names(dimr[2])) +
    theme(legend.title = element_blank()) +
    geom_point(shape = ".") +
    ggtitle(marker)
  # Factor colorings (eg. cluster_id or group_id) keep the default discrete palette
  if(is.factor(cells[[marker]])) {
    return(p)
  }
  if(!is.null(lim_min)) {
    p <- p + scale_color_gradientn(colors = c("blue", "cyan", "springgreen", "yellow", "orange", "red", "red4"), limits = c(lim_min, lim_max))
  } else {
    p <- p + scale_color_gradientn(colors = c("blue", "cyan", "springgreen", "yellow", "orange", "red", "red4"))
  }
  return(p)
}
Now let’s visualize a few markers as before.
# Get the data out of the SummarizedExperiment class
out_cells <- as_tibble(d_sub@assays@data$exprs)
PlotViz(cells = out_cells, dimr = dimr, marker = '170Er_CD3')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
PlotViz(cells = out_cells, dimr = dimr, marker = '142Nd_CD19')
library(patchwork)
p_list <- lapply(names(out_cells[,surface][1:9]), function(i) {
  PlotViz(cells = out_cells, dimr = dimr, marker = i)
})
(p_list[[1]] + p_list[[2]] + p_list[[3]]) / (p_list[[4]] + p_list[[5]] + p_list[[6]]) / (p_list[[7]] + p_list[[8]] + p_list[[9]])
Ok, now let’s do some plotting of our metadata. To do that, we need the metadata labels cell by cell. Here is what that looks like.
md_percell <- d_sub@elementMetadata
md_percell
## DataFrame with 10000 rows and 10 columns
## file_name plate patient_id
## <factor> <factor> <factor>
## 1 Plate19_8016_TT_D0_TD_Live Intact Singlets.fcs Plate19 8016
## 2 Plate18_8763_TT_D0_nTD_Live Intact Singlets.fcs Plate18 8763
## 3 Plate11_8842_TT_Dneg21_nTD_Live Intact Singlets.fcs Plate11 8842
## 4 Plate14_8527_TT_Dneg21_TD_Live Intact Singlets.fcs Plate14 8527
## 5 Plate11_8842_TT_D0_nTD_Live Intact Singlets.fcs Plate11 8842
## ... ... ... ...
## 9996 Plate19_8016_TT_D0_TD_Live Intact Singlets.fcs Plate19 8016
## 9997 Plate15_8144_TT_D0_nTD_Live Intact Singlets.fcs Plate15 8144
## 9998 Plate19_8016_TT_D0_TD_Live Intact Singlets.fcs Plate19 8016
## 9999 Plate16_8733_TT_Dneg21_nTD_Live Intact Singlets.fcs Plate16 8733
## 10000 Plate16_8733_TT_Dneg21_nTD_Live Intact Singlets.fcs Plate16 8733
## vaccine group_id diagnosis gating_status sample_id read_error
## <factor> <factor> <factor> <factor> <factor> <factor>
## 1 TT D0 TD Live Intact Singlets.fcs D0_19 FALSE
## 2 TT D0 nTD Live Intact Singlets.fcs D0_17 FALSE
## 3 TT Dneg21 nTD Live Intact Singlets.fcs Dneg21_4 FALSE
## 4 TT Dneg21 TD Live Intact Singlets.fcs Dneg21_10 FALSE
## 5 TT D0 nTD Live Intact Singlets.fcs D0_3 FALSE
## ... ... ... ... ... ... ...
## 9996 TT D0 TD Live Intact Singlets.fcs D0_19 FALSE
## 9997 TT D0 nTD Live Intact Singlets.fcs D0_11 FALSE
## 9998 TT D0 TD Live Intact Singlets.fcs D0_19 FALSE
## 9999 TT Dneg21 nTD Live Intact Singlets.fcs Dneg21_14 FALSE
## 10000 TT Dneg21 nTD Live Intact Singlets.fcs Dneg21_14 FALSE
## cluster_id
## <factor>
## 1 3
## 2 7
## 3 7
## 4 14
## 5 1
## ... ...
## 9996 1
## 9997 14
## 9998 1
## 9999 7
## 10000 1
We’ll start with that. Let’s look at cluster_id and group_id.
out_cells <- bind_cols(out_cells, as_tibble(md_percell))
PlotViz(cells = out_cells, dimr = dimr, marker = "group_id")
PlotViz(cells = out_cells, dimr = dimr, marker = "cluster_id")
The group ID appears to be well-mixed, with some exceptions. But if you really wanted to examine this, you would look at every cell’s k-nearest neighbors and ask what percentage of each group they contain. You can read more about it in my pre-print here. Look at figure 2.
What if we want to visualize the map by p-values? Rather than relying on any package to serve the element metadata up for us, we need a way to assign the per-cluster statistics to each cell. Then we take the negative log10 of the p-values (0.01 -> 2, 0.001 -> 3, etc.) so they fit on a color scale that would otherwise be squashed down to zero.
We will do that now. We’re going to iterate through the cluster ID column of our data, and for each cluster, we’re going to do a lookup of the corresponding p-values of interest. We’ll use the diffcyt p-values for now, but you can switch out the following block of code with whatever table of p-values you please.
To refresh your memory, here’s the edgeR table that was previously calculated. It has p-values for every cluster.
# We're using the whole thing. This is just a visualization.
edger_table[1:10,]
## # A tibble: 10 × 3
## cluster_id p_val p_adj
## <fct> <dbl> <dbl>
## 1 10 0.00000143 0.0000214
## 2 12 0.0000637 0.000478
## 3 9 0.0911 0.342
## 4 15 0.0790 0.342
## 5 6 0.130 0.391
## 6 11 0.212 0.530
## 7 5 0.294 0.586
## 8 8 0.313 0.586
## 9 4 0.581 0.792
## 10 7 0.533 0.792
neg_log10_p_adj <- lapply(out_cells$cluster_id, function(i) {
  dplyr::filter(edger_table, cluster_id == i)$p_adj
}) %>% unlist()
neg_log10_p_adj <- -log10(neg_log10_p_adj)
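An equivalent, vectorized alternative to the lapply above is a named-vector lookup; a sketch that gives the same result:

# Map each cell's cluster ID to its -log10 adjusted p-value in one step
p_lookup <- setNames(-log10(edger_table$p_adj), as.character(edger_table$cluster_id))
neg_log10_p_adj_alt <- unname(p_lookup[as.character(out_cells$cluster_id)])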
Great. Now we’re going to color the map by the p-value itself so you can see what regions of the map are statistically significant.
out_cells$neg_log10_p_adj <- neg_log10_p_adj
PlotViz(cells = out_cells, dimr = dimr, marker = "neg_log10_p_adj")
Ok, and that will do it for now. I hope this helps you in putting together a CyTOF pipeline for your data. Additional tools and visualizations are available in the CyTOF workflow, here. There are other rabbit holes we could go down around QC, but that will be for another time.