The purpose of this markdown is to explore the topic of data transformations for CyTOF data. We will look at how the asinh and logarithmic data transformations are related to each other. Then, we will examine t-SNE maps of untransformed versus transformed data.

Import data

The dataset we’re going to use is a PBMC dataset generated by Stanford’s HIMC (Human Immune Monitoring Center) in 2016.

library(flowCore)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /Users/tylerburns/Documents/projects/drfz/dimr_quality_umbrella/cytof_data_transformations
setwd(here::here("data", "himc_dataset"))
ff <- flowCore::read.FCS("081216-Mike-HIMC ctrls-001_01_normalized.fcs",
                            transformation=FALSE, 
                            truncate_max_range=FALSE)
## uneven number of tokens: 525
## The last keyword is dropped.
## uneven number of tokens: 525
## The last keyword is dropped.
ff
## flowFrame object '1.fcs'
## with 250000 cells and 49 observables:
##              name        desc     range  minRange  maxRange
## $P1          Time          NA   1129870         0   1129869
## $P2  Event_length          NA        68         0        67
## $P3       In113Di  113In_CD57      7887         0      7886
## $P4       In115Di  115In_Dead      7527         0      7526
## $P5       Sn120Di       120Sn     16821         0     16820
## ...           ...         ...       ...       ...       ...
## $P45      Pb208Di       208Pb     30923         0     30922
## $P46       Center          NA      4621         0      4620
## $P47       Offset          NA       512         0       511
## $P48        Width          NA       229         0       228
## $P49     Residual          NA       580         0       579
## 262 keywords are stored in the 'description' slot

We’ve read in the fcs file and now we have a flowFrame object. You can see that there are 250,000 cells. We’re not going to need nearly that many for the dimension reduction analysis we’ll get to later, so we’re going to subsample the flow frame to 5,000 cells. How do you do that? While the flowFrame object is complex, it turns out you can treat it like a matrix, in that you can pull out “rows” as shown below for 10 of them. For completeness, I also show a subsetting of columns, and the use of dim() to get at the dimensions of the matrix that the flowFrame object contains.

ff[1:10,]
## flowFrame object '1.fcs'
## with 10 cells and 49 observables:
##              name        desc     range  minRange  maxRange
## $P1          Time          NA   1129870         0   1129869
## $P2  Event_length          NA        68         0        67
## $P3       In113Di  113In_CD57      7887         0      7886
## $P4       In115Di  115In_Dead      7527         0      7526
## $P5       Sn120Di       120Sn     16821         0     16820
## ...           ...         ...       ...       ...       ...
## $P45      Pb208Di       208Pb     30923         0     30922
## $P46       Center          NA      4621         0      4620
## $P47       Offset          NA       512         0       511
## $P48        Width          NA       229         0       228
## $P49     Residual          NA       580         0       579
## 262 keywords are stored in the 'description' slot
ff[,1:10]
## flowFrame object '1.fcs'
## with 250000 cells and 10 observables:
##              name        desc     range  minRange  maxRange
## $P1          Time          NA   1129870         0   1129869
## $P2  Event_length          NA        68         0        67
## $P3       In113Di  113In_CD57      7887         0      7886
## $P4       In115Di  115In_Dead      7527         0      7526
## $P5       Sn120Di       120Sn     16821         0     16820
## $P6        I127Di        127I     10781         0     10780
## $P7       Xe131Di       131Xe       665         0       664
## $P8       Ba138Di       138Ba     18370         0     18369
## $P9       Ce140Di  140Ce_Bead      8547         0      8546
## $P10      Nd142Di  142Nd_CD19      3094         0      3093
## 262 keywords are stored in the 'description' slot
dim(ff)
##     events parameters 
##     250000         49

Given this notation, a random subsample down to a smaller number of cells is straightforward. We run it below. Part of the rationale behind this is simply that a smaller dataset makes the plots easier to visualize.

n_cells <- 5000
ff <- ff[sample(x = nrow(ff), size = n_cells, replace = FALSE),]
ff
## flowFrame object '1.fcs'
## with 5000 cells and 49 observables:
##              name        desc     range  minRange  maxRange
## $P1          Time          NA   1129870         0   1129869
## $P2  Event_length          NA        68         0        67
## $P3       In113Di  113In_CD57      7887         0      7886
## $P4       In115Di  115In_Dead      7527         0      7526
## $P5       Sn120Di       120Sn     16821         0     16820
## ...           ...         ...       ...       ...       ...
## $P45      Pb208Di       208Pb     30923         0     30922
## $P46       Center          NA      4621         0      4620
## $P47       Offset          NA       512         0       511
## $P48        Width          NA       229         0       228
## $P49     Residual          NA       580         0       579
## 262 keywords are stored in the 'description' slot

You can see that now we’re dealing with a flow frame that has 5,000 cells rather than the much larger original number. Now we have to pull out the matrix inside the flow frame, as that’s where we’re going to do our work. Luckily, there’s a function called exprs() that does this. Let’s look at what happens when we run it.

In most files, the column names of the matrix are actually the isotope names. One can easily change these using the “name” and “desc” columns of the parameter table, whose rows are in the same order as the columns of the expression matrix.

cells <- exprs(ff) %>% as_tibble()
cells
## # A tibble: 5,000 × 49
##      Time Event…¹ In113Di In115Di Sn120Di I127Di Xe131Di Ba138Di Ce140Di Nd142Di
##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 8.91e5      19   552.  31.1          0   4.11    0       11.5  0        3.08 
##  2 3.80e5      16     0    0            0  10.5     0      106.   0.373    0.287
##  3 8.30e5      14    42.3  0.0865       0   1.32    2.54    62.5  0.385    0    
##  4 8.40e5      18     0    0.528        0   9.60    2.30   116.   0       81.6  
##  5 9.39e5      16     0    0.783        0   4.08    3.97    80.1  0        1.52 
##  6 5.57e5      20   339.  23.7          0  20.3     0       59.4  0        0.294
##  7 9.77e5      20     0    0            0  95.6     4.51    45.5  0        0.746
##  8 8.20e5      15     0    2.25         0  17.3     0      104.   2.25     0.928
##  9 2.41e4      14   609.  30.4          0   1.01    0       30.6  0        0    
## 10 5.49e5      15     0    1.59         0   6.06    5.08    39.1  0.0970  74.2  
## # … with 4,990 more rows, 39 more variables: Nd143Di <dbl>, Nd144Di <dbl>,
## #   Nd146Di <dbl>, Sm147Di <dbl>, Nd148Di <dbl>, Sm149Di <dbl>, Nd150Di <dbl>,
## #   Eu151Di <dbl>, Sm152Di <dbl>, Eu153Di <dbl>, Sm154Di <dbl>, Gd155Di <dbl>,
## #   Gd156Di <dbl>, Gd157Di <dbl>, Gd158Di <dbl>, Tb159Di <dbl>, Gd160Di <dbl>,
## #   Dy162Di <dbl>, Dy164Di <dbl>, Ho165Di <dbl>, Er166Di <dbl>, Er167Di <dbl>,
## #   Er168Di <dbl>, Tm169Di <dbl>, Er170Di <dbl>, Yb171Di <dbl>, Yb172Di <dbl>,
## #   Yb173Di <dbl>, Yb174Di <dbl>, Lu175Di <dbl>, Yb176Di <dbl>, …
ff@parameters@data %>% as_tibble()
## # A tibble: 49 × 5
##    name         desc         range minRange maxRange
##    <I<chr>>     <I<chr>>     <dbl>    <dbl>    <dbl>
##  1 Time         <NA>       1129870        0  1129869
##  2 Event_length <NA>            68        0       67
##  3 In113Di      113In_CD57    7887        0     7886
##  4 In115Di      115In_Dead    7527        0     7526
##  5 Sn120Di      120Sn        16821        0    16820
##  6 I127Di       127I         10781        0    10780
##  7 Xe131Di      131Xe          665        0      664
##  8 Ba138Di      138Ba        18370        0    18369
##  9 Ce140Di      140Ce_Bead    8547        0     8546
## 10 Nd142Di      142Nd_CD19    3094        0     3093
## # … with 39 more rows
params <- ff@parameters@data$desc 
colnames(cells) <- params

# drop columns whose description was NA (Time, Event_length, Gaussian parameters)
cells <- cells[!is.na(names(cells))]

Data transformation

Asinh with different scale arguments

From here, we do a data transformation. For CyTOF, we typically take the asinh of the given channel value divided by 5 (call 5 the “scale argument”). The transform is log-like, but unlike the log it is also defined for zero and negative numbers. But let’s look at different scale arguments to see if 5 is special.
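
As a quick sanity check of the claim that asinh tolerates zeros and negative values while log does not, here is a minimal base R illustration (the input values are arbitrary):

x <- c(-10, -1, 0, 1, 10)
asinh(x/5) # finite for every input, including zero and negatives
log(x)     # NaN for negatives (with a warning), -Inf for zero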

biaxial <- dplyr::select(cells, "166Er_CD33", "150Nd_CD3")

MakePlot <- function(dat) {
    qplot(dat[[1]], dat[[2]]) + labs(x = names(dat)[1], y = names(dat)[2]) 
}

scale_arguments <- c(-1, 1, 5, 10, 50, 100, 500)
plot_list <- lapply(scale_arguments, function(i) {
    if(i == -1) {
        result <- MakePlot(biaxial) + ggtitle("untransformed")
    } else {
        result <- MakePlot(asinh(biaxial/i)) + ggtitle(paste0("asinh(x/", i, ")"))
    }
    return(result)
})
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
n_col <- floor(sqrt(length(plot_list)))

cowplot::plot_grid(plotlist = plot_list[1:4], align = "v", nrow = 2)

cowplot::plot_grid(plotlist = plot_list[5:length(plot_list)], align = "v", nrow = 2)

The purpose of the data transformation for CyTOF is to separate the signal from the noise and, from a visual perspective, to make the data “gateable.” This means we want it to look as similar as possible to transformed flow cytometry data, for which best practices are well established. See figure 1 in Bendall, Simonds, et al (2011). Please note the co-first authorship: many refer to this paper as Bendall et al, 2011, but Erin Simonds contributed equally.

We note that, in terms of the scale argument, there is not much of a difference between, for example, 5 and 10. We do see a difference when there is no scale argument at all, or when the scale argument is set very high, in which case the data start to take the shape of the untransformed values.
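
One way to see why neighboring scale arguments look so similar: for large x, asinh(x/c) is approximately log(2x/c), so going from a scale argument of 5 to 10 just shifts the transformed values by the constant log(2). A quick numeric check of this (values chosen arbitrarily):

x <- 10^(1:4)
asinh(x/5) - log(2 * x / 5)  # approaches 0 as x grows
asinh(x/5) - asinh(x/10)     # approaches log(2), about 0.69: a constant shift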

Asinh vs log transformation

Since that paper was written, single-cell sequencing has taken off, largely as a result of Macosko et al, 2015, also known as the Drop-seq paper. A common analysis pipeline for these data is Seurat, from the laboratory of Rahul Satija. There, the data transformation is the natural log of the data plus 1 (in R, this is log1p). Let’s see what this does to the CyTOF data.

p1 <- MakePlot(biaxial) + ggtitle("untransformed")
p2 <- MakePlot(asinh(biaxial/5)) + ggtitle("asinh(x/5)")
p3 <- MakePlot(log1p(biaxial)) + ggtitle("log(x + 1)")
p4 <- MakePlot(log(biaxial)) + ggtitle("log(x)")
cowplot::plot_grid(plotlist = list(p1, p2, p3, p4), align = "v", nrow = 2)

We can see that log(x + 1) does not look much different from asinh with a scale argument of 5. Regardless of rationale, this shows that slightly changing the data transformation does not make or break the data. We do see that the plain log(x) transform produces problematic data, in that there is a large gap between the lowest values and the rest of the data (log(0) is negative infinity, so zero-valued events fall off the scale entirely). In flow cytometry, there are many data transformations that get the job done. It’s probably the same for CyTOF, despite the fact that asinh(x/5) is near-universal.
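
The gap comes from how each transform treats zero, which we can verify directly in base R:

vals <- c(0, 0.1, 1, 10)
log(vals)     # -Inf at zero: zero-valued events fall off the bottom of the scale
log1p(vals)   # 0 at zero: zeros stay anchored at the origin
asinh(vals/5) # 0 at zero as well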

Relationship between asinh and log transformations

We can make this a bit more concrete by plotting the transformed values against the untransformed values, and against each other, and seeing what the differences are.

b_asinh <- asinh(biaxial/5)
b_log1p <- log1p(biaxial)

p1 <- qplot(biaxial$`166Er_CD33`, b_asinh$`166Er_CD33`) + ggtitle("untransformed vs asinh(x/5)")
p2 <- qplot(biaxial$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("untransformed vs log(x + 1)")
p3 <- qplot(b_asinh$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("asinh(x/5) vs log(x + 1)")

cowplot::plot_grid(plotlist = list(p1, p2, p3), align = "v", nrow = 2)

In the first two cases, plotting the original data against the transformed data, the transformed values drop off more and more rapidly as the untransformed values approach zero. This is to say that the transformed data are more sensitive to changes in values near zero.

We can see that asinh and log1p are largely correlated, but there is a subtle version of the same trend we saw between the untransformed and transformed data: log1p decreases more rapidly than asinh as the values approach zero. This is to say that for values very close to zero, log1p is just a bit more sensitive than asinh. However, for CyTOF data, values this close to zero typically belong to what we consider the “negative” population. You can see this in the biaxial plots. Furthermore (this is anecdotal, from a conversation with Sean Bendall, a mass spec chemist by training, around 2012), the variance of values near zero, in terms of the ions hitting the CyTOF detector, is meaningless for all practical purposes. Thus, asinh with a scale argument of 5 does a good job of “squishing” the values near zero, as opposed to a log transformation. I show here that yes, this is true, but it’s more subtle than I had appreciated.
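
To make “more sensitive near zero” concrete, we can compare the slopes of the two transforms. The derivative of log1p(x) is 1/(1 + x), and the derivative of asinh(x/5) is 1/(5 * sqrt(1 + (x/5)^2)), so at x = 0 the slopes are 1 and 1/5 respectively. A small sketch (the derivative formulas are standard calculus, nothing CyTOF-specific):

slope_log1p  <- function(x) 1 / (1 + x)
slope_asinh5 <- function(x) 1 / (5 * sqrt(1 + (x/5)^2))
x <- c(0, 1, 5, 25, 100)
slope_log1p(x)  # slope 1 at zero
slope_asinh5(x) # slope 1/5 at zero; the two converge as x grows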

But let’s do one more comparison. Does the asinh by itself “squash” the values near zero? Is it the scale argument? Is it both?

b_asinh <- asinh(biaxial)
b_asinh5 <- asinh(biaxial/5)
b_log1p <- log1p(biaxial)
b_log1p5 <- log1p(biaxial/5)

p1 <- qplot(b_asinh$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("asinh(x) vs log(x + 1)")
p2 <- qplot(b_asinh5$`166Er_CD33`, b_log1p$`166Er_CD33`) + ggtitle("asinh(x/5) vs log(x + 1)")
p3 <- qplot(b_log1p5$`166Er_CD33`, b_asinh$`166Er_CD33`) + ggtitle("log(x/5 + 1) vs asinh(x)")
p4 <- qplot(b_asinh5$`166Er_CD33`, b_log1p5$`166Er_CD33`) + ggtitle("asinh(x/5) vs log(x/5 + 1)")

cowplot::plot_grid(plotlist = list(p1, p2, p3, p4), align = "v", nrow = 2)

What we find is that it’s the scale argument that brings the values near zero closer to zero, not log versus asinh per se. However, in the asinh case we see the influence of the scale argument at lower values than in the log case. In other words, the “curve” in the plots starts at lower values for asinh than for log.

If a goal of the data transformation and scale argument is to bring near-zero values even closer to zero, then you would choose the asinh transformation over the log transformation when you want this effect to kick in at values closer to zero than the log transformation can provide.
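
A series-expansion view of the same point: for small x, both log1p(x) ≈ x and asinh(x) ≈ x, so near zero each transform is roughly the identity, and it is the division by the scale argument beforehand that actually shrinks those values. A quick check with arbitrary small values:

x <- c(0.01, 0.1, 0.5)
asinh(x)   # roughly x
log1p(x)   # roughly x
asinh(x/5) # roughly x/5: the squishing comes from the scale argument
log1p(x/5) # roughly x/5 as well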

t-SNE and data transformation

Now we’ve seen that data transformations matter in terms of how a user interprets how the cells should be subsetted. Next, we’ll see whether the transformation matters for dimension reduction. We’ll focus on t-SNE for now. Below, we run t-SNE on untransformed and transformed data, and visually examine the output of both.

library(Rtsne)

cells <- cells[grep("CD", names(cells))]

tsne_orig <- Rtsne::Rtsne(cells, check_duplicates = FALSE)$Y %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
tsne_asinh <- Rtsne::Rtsne(asinh(cells/5), check_duplicates = FALSE)$Y %>% as_tibble()
tsne_log1p <- Rtsne::Rtsne(log1p(cells), check_duplicates = FALSE)$Y %>% as_tibble()

qplot(tsne_orig[[1]], tsne_orig[[2]], color = cells$`150Nd_CD3`)

qplot(tsne_asinh[[1]], tsne_asinh[[2]],  color = asinh(cells$`150Nd_CD3`/5))

qplot(tsne_log1p[[1]], tsne_log1p[[2]],  color = log1p(cells$`150Nd_CD3`))

One observation is that the t-SNE map for the untransformed data looks different, but it doesn’t look like complete chaos (e.g. a single ball). There appear to be distinct subsets discerned by the algorithm despite the fact that the data are untransformed.

That being said, the untransformed map looks much more like a continuum than a series of nicely separated subsets. We color by CD3 to show that there are many more visible “subsets” when the data are transformed, whether by asinh(x/5) or log1p(x). One can see that these transformations can change the interpretation of our data. Transforming the marker used for coloring also gives a much more informative color scale.