The purpose of this markdown is to explore the topic of data transformations for CyTOF data. We will look at how the asinh and logarithmic data transformations are related to each other. Then, we will examine t-SNE maps of untransformed versus transformed data.
The dataset we’re going to use is a PBMC dataset generated by Stanford’s HIMC in 2016, found here.
library(flowCore)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(here)
## here() starts at /Users/tylerburns/Documents/projects/drfz/dimr_quality_umbrella/cytof_data_transformations
setwd(here::here("data", "himc_dataset"))
ff <- flowCore::read.FCS("081216-Mike-HIMC ctrls-001_01_normalized.fcs",
transformation=FALSE,
truncate_max_range=FALSE)
## uneven number of tokens: 525
## The last keyword is dropped.
## uneven number of tokens: 525
## The last keyword is dropped.
ff
## flowFrame object '1.fcs'
## with 250000 cells and 49 observables:
## name desc range minRange maxRange
## $P1 Time NA 1129870 0 1129869
## $P2 Event_length NA 68 0 67
## $P3 In113Di 113In_CD57 7887 0 7886
## $P4 In115Di 115In_Dead 7527 0 7526
## $P5 Sn120Di 120Sn 16821 0 16820
## ... ... ... ... ... ...
## $P45 Pb208Di 208Pb 30923 0 30922
## $P46 Center NA 4621 0 4620
## $P47 Offset NA 512 0 511
## $P48 Width NA 229 0 228
## $P49 Residual NA 580 0 579
## 262 keywords are stored in the 'description' slot
We’ve read in the fcs file, and now we have a flowFrame object. You can see that there are 250,000 cells. We’re not going to need nearly that many for the dimension reduction analysis we’re going to get to, so we’re going to subsample the flow frame to 5,000 cells. How do you do that? While the flowFrame object is complex, it turns out you can treat it like a matrix, in that you can pull out “rows,” as shown below for 10 of them. For completeness, I also show a subsampling of columns, and the use of dim() to get at the dimensions of the matrix that the flowFrame object contains.
ff[1:10,]
## flowFrame object '1.fcs'
## with 10 cells and 49 observables:
## name desc range minRange maxRange
## $P1 Time NA 1129870 0 1129869
## $P2 Event_length NA 68 0 67
## $P3 In113Di 113In_CD57 7887 0 7886
## $P4 In115Di 115In_Dead 7527 0 7526
## $P5 Sn120Di 120Sn 16821 0 16820
## ... ... ... ... ... ...
## $P45 Pb208Di 208Pb 30923 0 30922
## $P46 Center NA 4621 0 4620
## $P47 Offset NA 512 0 511
## $P48 Width NA 229 0 228
## $P49 Residual NA 580 0 579
## 262 keywords are stored in the 'description' slot
ff[,1:10]
## flowFrame object '1.fcs'
## with 250000 cells and 10 observables:
## name desc range minRange maxRange
## $P1 Time NA 1129870 0 1129869
## $P2 Event_length NA 68 0 67
## $P3 In113Di 113In_CD57 7887 0 7886
## $P4 In115Di 115In_Dead 7527 0 7526
## $P5 Sn120Di 120Sn 16821 0 16820
## $P6 I127Di 127I 10781 0 10780
## $P7 Xe131Di 131Xe 665 0 664
## $P8 Ba138Di 138Ba 18370 0 18369
## $P9 Ce140Di 140Ce_Bead 8547 0 8546
## $P10 Nd142Di 142Nd_CD19 3094 0 3093
## 262 keywords are stored in the 'description' slot
dim(ff)
## events parameters
## 250000 49
Given this notation, a random subsample to a smaller number is straightforward. We run it below. Part of the rationale is simply that a smaller dataset makes the plots easier to visualize.
n_cells <- 5000
ff <- ff[sample(x = nrow(ff), size = n_cells, replace = FALSE),]
ff
## flowFrame object '1.fcs'
## with 5000 cells and 49 observables:
## name desc range minRange maxRange
## $P1 Time NA 1129870 0 1129869
## $P2 Event_length NA 68 0 67
## $P3 In113Di 113In_CD57 7887 0 7886
## $P4 In115Di 115In_Dead 7527 0 7526
## $P5 Sn120Di 120Sn 16821 0 16820
## ... ... ... ... ... ...
## $P45 Pb208Di 208Pb 30923 0 30922
## $P46 Center NA 4621 0 4620
## $P47 Offset NA 512 0 511
## $P48 Width NA 229 0 228
## $P49 Residual NA 580 0 579
## 262 keywords are stored in the 'description' slot
You can see that now we’re dealing with a flow frame that has 5,000 cells rather than the much larger original number. Now we have to pull the expression matrix out of the flow frame, as that’s where we’re going to do our work. Luckily, there’s a function called exprs() that does this. Let’s look at what happens when we run it.
In most files, the column names of the matrix are the isotope names. One can easily swap these for marker names using the “name” and “desc” parameters: the rows of the parameter table correspond, in order, to the columns of the expression matrix.
cells <- exprs(ff) %>% as_tibble()
cells
## # A tibble: 5,000 × 49
## Time Event…¹ In113Di In115Di Sn120Di I127Di Xe131Di Ba138Di Ce140Di Nd142Di
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.91e5 19 552. 31.1 0 4.11 0 11.5 0 3.08
## 2 3.80e5 16 0 0 0 10.5 0 106. 0.373 0.287
## 3 8.30e5 14 42.3 0.0865 0 1.32 2.54 62.5 0.385 0
## 4 8.40e5 18 0 0.528 0 9.60 2.30 116. 0 81.6
## 5 9.39e5 16 0 0.783 0 4.08 3.97 80.1 0 1.52
## 6 5.57e5 20 339. 23.7 0 20.3 0 59.4 0 0.294
## 7 9.77e5 20 0 0 0 95.6 4.51 45.5 0 0.746
## 8 8.20e5 15 0 2.25 0 17.3 0 104. 2.25 0.928
## 9 2.41e4 14 609. 30.4 0 1.01 0 30.6 0 0
## 10 5.49e5 15 0 1.59 0 6.06 5.08 39.1 0.0970 74.2
## # … with 4,990 more rows, 39 more variables: Nd143Di <dbl>, Nd144Di <dbl>,
## # Nd146Di <dbl>, Sm147Di <dbl>, Nd148Di <dbl>, Sm149Di <dbl>, Nd150Di <dbl>,
## # Eu151Di <dbl>, Sm152Di <dbl>, Eu153Di <dbl>, Sm154Di <dbl>, Gd155Di <dbl>,
## # Gd156Di <dbl>, Gd157Di <dbl>, Gd158Di <dbl>, Tb159Di <dbl>, Gd160Di <dbl>,
## # Dy162Di <dbl>, Dy164Di <dbl>, Ho165Di <dbl>, Er166Di <dbl>, Er167Di <dbl>,
## # Er168Di <dbl>, Tm169Di <dbl>, Er170Di <dbl>, Yb171Di <dbl>, Yb172Di <dbl>,
## # Yb173Di <dbl>, Yb174Di <dbl>, Lu175Di <dbl>, Yb176Di <dbl>, …
ff@parameters@data %>% as_tibble()
## # A tibble: 49 × 5
## name desc range minRange maxRange
## <I<chr>> <I<chr>> <dbl> <dbl> <dbl>
## 1 Time <NA> 1129870 0 1129869
## 2 Event_length <NA> 68 0 67
## 3 In113Di 113In_CD57 7887 0 7886
## 4 In115Di 115In_Dead 7527 0 7526
## 5 Sn120Di 120Sn 16821 0 16820
## 6 I127Di 127I 10781 0 10780
## 7 Xe131Di 131Xe 665 0 664
## 8 Ba138Di 138Ba 18370 0 18369
## 9 Ce140Di 140Ce_Bead 8547 0 8546
## 10 Nd142Di 142Nd_CD19 3094 0 3093
## # … with 39 more rows
# swap the isotope column names for the marker names stored in "desc"
params <- ff@parameters@data$desc
colnames(cells) <- params
# drop channels with no marker name (e.g. Time, Event_length, Gaussian parameters)
cells <- cells[!is.na(names(cells))]
From here, we do a data transformation. For CyTOF, we typically take the asinh of each channel value divided by 5 (call this a scale argument of 5). The transform is log-like, but unlike the log it handles zero and negative numbers as well. But let’s look at different scale arguments below to see whether 5 is special.
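First, a reminder of why asinh is “log-like” (this is just the identity for asinh, nothing CyTOF-specific): asinh(x) = log(x + sqrt(x^2 + 1)), which is defined everywhere, behaves like log(2x) for large x, and is approximately linear near zero.

x <- c(-5, 0, 0.1, 1, 10, 1000)
# asinh(x) is exactly log(x + sqrt(x^2 + 1)), so zeros and negatives are fine
all.equal(asinh(x), log(x + sqrt(x^2 + 1)))
# for large x it behaves like log(2x)
asinh(1000) - log(2 * 1000)
# and near zero it is approximately linear
asinh(0.1)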
biaxial <- dplyr::select(cells, "166Er_CD33", "150Nd_CD3")
MakePlot <- function(dat) {
qplot(dat[[1]], dat[[2]]) + labs(x = names(dat)[1], y = names(dat)[2])
}
scale_arguments <- c(-1, 1, 5, 10, 50, 100, 500)
plot_list <- lapply(scale_arguments, function(i) {
if(i == -1) {
result <- MakePlot(biaxial) + ggtitle("untransformed")
} else {
result <- MakePlot(asinh(biaxial/i)) + ggtitle(paste0("asinh(x/", i, ")"))
}
return(result)
})
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
cowplot::plot_grid(plotlist = plot_list[1:4], align = "v", nrow = 2)
cowplot::plot_grid(plotlist = plot_list[5:length(plot_list)], align = "v", nrow = 2)
The purpose of the data transformation for CyTOF is to separate the signal from the noise and, from a visual perspective, to make the data “gateable.” This means we want it to look as similar as possible to transformed flow cytometry data (the best-practice reference for this type of data). See figure 1 in Bendall, Simonds, et al. (2011). Please note the co-first authorship: many refer to this paper as Bendall et al., 2011, but Erin Simonds contributed equally.
We note here that, in terms of the scale argument, there is not much difference between, for example, 5 and 10. We do see a difference with no scale argument at all (dividing by 1), or with a very high scale argument, in which case the data start to take the shape of the untransformed values.
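This makes sense analytically: for values well above the scale argument c, asinh(x/c) ≈ log(2x/c) = log(x) + log(2/c), so moving from c = 5 to c = 10 only shifts the high end by a constant, log(2). A quick numeric check:

x <- c(100, 1000, 10000)
# the difference between scale arguments 5 and 10 approaches log(2) as x grows
asinh(x / 5) - asinh(x / 10)
log(2)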
Since that paper was written, single-cell sequencing has taken off, largely as a result of Macosko et al., 2015, also known as the Drop-seq paper. A common analysis pipeline for these data is Seurat, from the laboratory of Rahul Satija. There, the data transformation is the natural log of the data plus 1 (in R, this is log1p). Let’s see what this does to the CyTOF data.
p1 <- MakePlot(biaxial) + ggtitle("untransformed")
p2 <- MakePlot(asinh(biaxial/5)) + ggtitle("asinh(x/5)")
p3 <- MakePlot(log1p(biaxial)) + ggtitle("log(x + 1)")
p4 <- MakePlot(log(biaxial)) + ggtitle("log(x)")
cowplot::plot_grid(plotlist = list(p1, p2, p3, p4), align = "v", nrow = 2)
We can see that log(x + 1) does not look much different from asinh with a scale argument of 5. Regardless of rationale, this shows that slightly changing the data transformation does not make or break the data. The log(x) transform, however, does produce problematic data: zeros map to -Inf, leaving a large gap between the lowest values and the rest of the data. In flow cytometry, there are many data transformations that get the job done. It’s probably the same for CyTOF, despite the fact that asinh(x/5) is universal.
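The gap comes from how each transform treats the zeros that are everywhere in CyTOF data:

log(0)        # -Inf: zero-valued events fall off the plot
log1p(0)      # 0
asinh(0 / 5)  # 0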
We can make this a bit more concrete by plotting the matrices against each other and seeing where they differ.
b_asinh <- asinh(biaxial/5)
b_log1p <- log1p(biaxial)
p1 <- qplot(biaxial$`166Er_CD33`, b_asinh$`166Er_CD33`) + ggtitle("untransformed vs asinh(x/5)")
p2 <- qplot(biaxial$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("untransformed vs log(x + 1)")
p3 <- qplot(b_asinh$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("asinh(x/5) vs log(x + 1)")
cowplot::plot_grid(plotlist = list(p1, p2, p3), align = "v", nrow = 2)
In the first two cases, plotting the original data against the transformed data, the transformed values fall off more rapidly as the untransformed values approach zero. That is, the transformed data are more sensitive to changes in values near zero.
We can see that asinh and log1p are largely correlated, but with a subtle version of the trend we saw between the untransformed and transformed data: log1p decreases more rapidly than asinh as the values approach zero. In other words, for values very close to zero, log1p is just a bit more sensitive than asinh.

However, for CyTOF data, values this close to zero are typically treated as the “negative” population, as you can see in the biaxial plots. Furthermore (this is anecdotal, from a conversation with Sean Bendall, a mass spec chemist by training, around 2012), the variance of near-zero values, in terms of ions hitting the CyTOF detector, is meaningless for all practical purposes. Thus, asinh with a scale argument of 5 does a good job of “squishing” the values near zero compared to a log transformation. I show here that this is true, but it’s more subtle than I had appreciated.
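The slope at zero makes this difference in sensitivity concrete: the derivative of asinh(x/5) at x = 0 is 1/5, versus 1 for log1p(x), so asinh(x/5) compresses near-zero differences roughly five-fold more.

eps <- 0.01
# ratio of transformed value to raw value near zero
asinh(eps / 5) / eps  # ~0.2
log1p(eps) / eps      # ~1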
But let’s do one more comparison. Does the asinh by itself “squash” the values near zero? Is it the scale argument? Is it both?
b_asinh <- asinh(biaxial)
b_asinh5 <- asinh(biaxial/5)
b_log1p <- log1p(biaxial)
b_log1p5 <- log1p(biaxial/5)
p1 <- qplot(b_asinh$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("asinh(x) vs log(x + 1)")
p2 <- qplot(b_asinh5$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("asinh(x/5) vs log(x + 1)")
p3 <- qplot(b_log1p5$`166Er_CD33`, b_asinh$`166Er_CD33`) + ggtitle("log(x/5 + 1) vs asinh(x)")
p4 <- qplot(b_asinh5$`166Er_CD33`, b_log1p5$`166Er_CD33`) + ggtitle("asinh(x/5) vs log(x/5 + 1)")
cowplot::plot_grid(plotlist = list(p1, p2, p3, p4), align = "v", nrow = 2)
What we find is that it’s the scale argument that brings the values near zero closer to zero, not log versus asinh per se. However, in the asinh case we see the influence of the scale argument at lower values than in the log case. In other words, the “curve” in the plots starts at lower values for asinh than for log.
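For a mathematical handle on where each curve starts to bend: the series asinh(u) = u - u^3/6 + … deviates from linearity only at the third-order term, while log(1 + u) = u - u^2/2 + … bends at the second-order term, so asinh stays close to linear over a wider range of its argument.

u <- 0.5
# relative deviation from linearity at u = 0.5
1 - asinh(u) / u  # ~0.04, asinh is still nearly linear
1 - log1p(u) / u  # ~0.19, log1p bends away sooner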
If a goal of the data transformation and scale argument is to bring near-zero values closer to zero, then you would choose the asinh transformation over the log transformation when you want this to happen at values closer to zero than the log transformation can provide.
We’ve now seen that data transformations matter in terms of how a user interprets which cells belong to which subsets. Next, we’ll see whether transformation matters for dimension reduction, focusing on t-SNE for now. Below, we run t-SNE on untransformed and transformed data and visually examine the output of each.
library(Rtsne)
# keep only the CD (marker) channels for dimension reduction
cells <- cells[grep("CD", names(cells))]
tsne_orig <- Rtsne::Rtsne(cells, check_duplicates = FALSE)$Y %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
tsne_asinh <- Rtsne::Rtsne(asinh(cells/5), check_duplicates = FALSE)$Y %>% as_tibble()
tsne_log1p <- Rtsne::Rtsne(log1p(cells), check_duplicates = FALSE)$Y %>% as_tibble()
qplot(tsne_orig[[1]], tsne_orig[[2]], color = cells$`150Nd_CD3`)
qplot(tsne_asinh[[1]], tsne_asinh[[2]], color = asinh(cells$`150Nd_CD3`/5))
qplot(tsne_log1p[[1]], tsne_log1p[[2]], color = log1p(cells$`150Nd_CD3`))
One observation is that the t-SNE map of the untransformed data looks different, but not like complete chaos (e.g., a single ball). There appear to be distinct subsets discerned by the algorithm despite the data being untransformed.
That said, the map looks much more like a continuum than a series of nicely separated subsets. We color by CD3 to show that many more “subsets” are visible when the data are transformed, whether by asinh(x/5) or log1p(x). These transformations can change the interpretation of our data. Transformation also spreads marker expression across the color scale, making the coloring itself more informative.