The following examines the difference between mean(asinh(x/5)) and asinh(mean(x/5)). For the scope of this markdown, when I say “mean” I mean “arithmetic mean.”
First, we make some toy data. I take 10,000 instances of 100 normally distributed data points (think 10,000 clusters of 100 data points each). To make the data CyTOF-like (more log-normal-like), I apply the inverse of the standard asinh(x/5) transformation for CyTOF data, which is sinh(x)*5.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
rand_list <- lapply(1:10000, function(i) rnorm(100, mean = 5, sd = 2))
rand_list <- lapply(rand_list, function(i) sinh(i)*5)
rand_list[1:3]
## [[1]]
## [1] 1746.0757536 420.2970402 281.1666214 524.9709878 56.5013780
## [6] 403.5724889 15.9439362 58095.4671170 26432.2422277 72.4213880
## [11] 959.8826441 357.0029326 525.9660021 528.3548157 34.4592987
## [16] 3152.5255679 252.3495527 553.5552746 1730.5754956 324.1488566
## [21] 440.0371066 5.5486396 165.6549788 64.8685427 215.5503969
## [26] 113.9300958 1978.7456439 6031.5067267 54.7363505 38.9923806
## [31] 5874.7021037 1838.2045899 467.4419724 1557.5076455 1119.8460847
## [36] 662.5377975 608.8868651 1658.1743909 8608.0498159 2896.1920858
## [41] 2496.8366735 434.2369688 11988.7139791 69.0725255 891.7047822
## [46] 243.6850210 66.4701989 89906.7764213 3912.7853885 92.9448296
## [51] 34.6368336 24.3939101 71.4034468 22.1461634 608.6268410
## [56] 1792.0376446 10495.7521424 3.5288467 4323.1397068 142.2619409
## [61] 5.1177573 369.9936541 17.2541253 299.6243381 884.3247160
## [66] 69.0902181 6.9665513 141.8996084 7.0812145 272.3181236
## [71] 1026.4574609 -0.2955320 11756.1558043 1829.9141896 841.1556645
## [76] 2665.1922702 537.1055199 700.6955314 126.5913209 4012.8312896
## [81] 3151.6674662 2540.8256633 1009.3098028 210.6880843 5967.7375487
## [86] 1085.0677963 1168.5115907 196.9926250 1246.7262322 659.7143056
## [91] -0.6613423 688.8745440 285.1679390 20.3297343 4539.8770921
## [96] 52.5400801 228.6629801 1118.0288285 260.2348837 7.8060900
##
## [[2]]
## [1] 4.034610e+03 1.026494e+03 1.267857e+03 2.815649e+02 4.685218e+02
## [6] 9.661845e+01 1.409353e+05 1.223276e+02 4.132786e+01 1.824882e+01
## [11] 7.443437e+02 5.528934e+01 1.208500e+03 -1.080373e+00 7.446284e+03
## [16] 1.925238e+02 1.009399e+04 2.212393e+02 3.956889e+04 1.682333e+03
## [21] 1.342092e+04 1.965190e+01 2.763727e+02 1.647481e+03 7.979056e+03
## [26] 1.378196e+02 5.540994e+02 5.403612e+02 1.738488e+03 3.142428e+02
## [31] 9.816026e+02 1.030701e+03 1.503352e+04 6.664515e-01 2.163169e+04
## [36] 2.428208e+01 2.385866e+02 7.884476e+02 2.363507e+02 6.922780e+02
## [41] 4.410700e+03 7.260006e+02 1.523496e+02 2.245870e+01 1.514329e+02
## [46] 7.839947e+02 9.747094e+02 2.014186e+03 1.479641e+04 4.222060e+03
## [51] 1.196505e+03 1.166676e+01 1.327936e+03 3.125151e+02 3.347027e+00
## [56] 3.978845e+03 3.257598e+02 1.178646e+03 3.502583e+02 3.768050e+02
## [61] 2.731557e+03 1.100080e+04 6.915140e+02 6.432045e+00 4.614079e+01
## [66] 1.477315e+04 9.864926e+01 4.957144e+01 2.060374e+02 1.040545e+03
## [71] 2.794849e+02 1.356731e+03 4.557137e+02 6.632460e+01 8.262151e+02
## [76] 7.935307e+02 1.327505e+02 4.377748e+02 1.675814e+02 1.168413e+04
## [81] 3.630991e+03 1.728686e+02 2.614279e+02 4.852752e+03 2.066405e+02
## [86] 4.104074e+02 1.872247e+02 3.762643e+00 6.830458e+01 1.427025e+03
## [91] 1.281973e+04 7.094996e+02 1.588157e+02 8.791537e+01 9.768902e+02
## [96] 5.802986e+01 1.284005e+02 1.406308e+02 4.150838e+01 1.079575e+03
##
## [[3]]
## [1] 6698.184248 1682.786190 46.651683 3424.912227 1253.914548
## [6] 1234.313410 22.495870 3729.535549 541.428455 138.097208
## [11] 3477.768851 18.431418 47.858522 407.849859 2005.460822
## [16] 948.941279 965.034427 11.233440 8.342860 669.297166
## [21] 487.946981 2497.213690 462.505537 106.331585 228.718348
## [26] 25.381312 269.986701 655.667984 359.847718 540.279603
## [31] 15.872637 769.943474 390.807837 85.168689 1259.111581
## [36] 862.100613 5109.344980 1.280160 2989.122642 18.930802
## [41] 164.279317 38.611415 1826.346633 1110.679437 418.879626
## [46] 1401.438280 725.106501 246.381576 3.108472 3597.138827
## [51] 168.287476 58.795861 2481.573731 17.649912 1712.431904
## [56] 199.268567 13020.332453 973.659362 191.519682 253.199132
## [61] 80.963353 1197.101318 53.943140 72.194229 162.216039
## [66] 182.608605 1918.659805 3785.434006 14767.769142 34.887089
## [71] 2783.749274 121.351197 2233.075484 1.710268 14.208843
## [76] 244.671776 265.615435 92.862496 133.398605 766.161737
## [81] 43.536302 948.388762 102.816503 1719.262350 895.646418
## [86] 7052.741434 713.759463 815.036367 31.815229 30.129238
## [91] 154.929203 34.265568 33.465824 56.357590 20.677419
## [96] 216.916177 765.120463 123.054501 3.968968 17.384131
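As a quick sanity check on the toy data, the sketch below confirms that asinh(x/5) undoes sinh(x)*5, and that the raw-scale values are right-skewed (mean well above median). The seed and the variable names `z`, `raw`, and `back` are illustrative assumptions, not part of the original analysis.

```r
# Illustrative check only; the seed is arbitrary.
set.seed(1)
z <- rnorm(100, mean = 5, sd = 2)   # values on the asinh scale
raw <- sinh(z) * 5                  # inverse of asinh(x/5): skewed, CyTOF-like scale
back <- asinh(raw / 5)              # the standard transform should recover z

# TRUE if the round trip agrees up to floating-point error
isTRUE(all.equal(z, back))

# On the raw scale the mean sits far above the median, a sign of right skew
c(mean_raw = mean(raw), median_raw = median(raw))
```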
We then compute asinh(mean(x/5)) and mean(asinh(x/5)) for each cluster, storing the results in the vectors x and y, respectively.
x <- lapply(rand_list, function(i) asinh(mean(i)/5)) %>% unlist()
x[1:10]
## [1] 7.118008 7.348267 6.137893 6.836156 6.772216 6.562271 6.530395 6.623748
## [9] 7.323337 6.884612
y <- lapply(rand_list, function(i) mean(asinh(i/5))) %>% unlist()
y[1:10]
## [1] 5.054472 5.230885 4.667892 5.251546 5.073105 4.910224 5.020283 4.914269
## [9] 5.131838 5.092842
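The gap between the two vectors (x near 7, y near 5) can be anticipated from the toy-data parameters. If z is normal with mean mu and standard deviation sigma, then E[sinh(z)] = exp(sigma^2/2) * sinh(mu), so the asinh of the raw mean should settle near asinh(exp(2) * sinh(5)). The mean of the asinh-transformed data is just the mean of z. The sketch below checks both with one large simulated sample; the seed and sample size are arbitrary assumptions.

```r
# Illustrative check; mu and sigma match the toy data, the seed is arbitrary.
mu <- 5
sigma <- 2

# Large-sample value of asinh(mean(raw) / 5), using
# E[sinh(z)] = exp(sigma^2 / 2) * sinh(mu) for z ~ Normal(mu, sigma)
expected_asinh_of_mean <- asinh(exp(sigma^2 / 2) * sinh(mu))
expected_asinh_of_mean  # roughly 7

# Simulate one large sample and compare both summaries
set.seed(42)
z <- rnorm(1e6, mean = mu, sd = sigma)
raw <- sinh(z) * 5
c(asinh_of_mean = asinh(mean(raw) / 5),  # close to the value above
  mean_of_asinh = mean(asinh(raw / 5)))  # close to mu
```

With clusters of only 100 points, the raw mean is dominated by a few large values, so individual x values scatter widely around this large-sample figure, while y stays tightly around 5.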
Next, we check the Spearman correlation between the two summaries, and make a simple biaxial plot.
library(ggplot2)
scor <- cor(x, y, method = "spearman")
qplot(x, y) + ggtitle(paste("spearman cor = ", scor))
Next, we view each vector separately. Note that the means of the asinh-transformed data form a bell curve, whereas the asinh of the means of the raw data forms a distribution with a tail to the right.
hist(x, breaks = 100) # asinh of the mean
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.599 6.578 6.830 6.879 7.130 9.608
hist(y, breaks = 100) # mean of the asinh (this is what omiq exports by default)
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.245 4.862 4.997 4.999 5.138 5.741
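The difference in shape between the two histograms can also be put into a single number. The sketch below uses a hypothetical `skewness()` helper (the third standardized moment, since base R has no built-in) on a smaller re-simulation of the toy data. The seed, the 2,000-cluster count, and all names are assumptions made for illustration.

```r
# Hypothetical helper: sample skewness as the third standardized moment.
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Smaller re-simulation of the toy data (2,000 clusters of 100 points)
set.seed(123)  # arbitrary seed
clusters <- lapply(1:2000, function(i) sinh(rnorm(100, mean = 5, sd = 2)) * 5)
asinh_of_mean <- sapply(clusters, function(cl) asinh(mean(cl) / 5))
mean_of_asinh <- sapply(clusters, function(cl) mean(asinh(cl / 5)))

# Near zero means roughly symmetric; clearly positive means a right tail
c(asinh_of_mean = skewness(asinh_of_mean),
  mean_of_asinh = skewness(mean_of_asinh))
```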
Finally, what is the verbal explanation for why you should take the mean of the asinh-transformed data, and not the asinh of the mean of the raw data? The raw data are skewed, and the asinh transformation makes them more normal-like. The mean is a more intuitive summary of a normal distribution than of a skewed one: in the skewed case, outliers on the right pull the mean upward. For skewed distributions and for outliers in general, there are other operations, like the geometric mean, that may be more appropriate. But that’s outside of the scope of this markdown.
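To make the outlier argument concrete, the sketch below adds one extreme point to a single toy cluster and measures how far each summary moves. The outlier (100 times the cluster's largest value), the seed, and the variable names are hypothetical choices for illustration.

```r
# Illustrative only: one toy cluster plus a single hypothetical outlier.
set.seed(7)  # arbitrary seed
raw <- sinh(rnorm(100, mean = 5, sd = 2)) * 5
raw_out <- c(raw, 100 * max(raw))  # append one extreme value

# How far does each summary move when the outlier is added?
shift_asinh_of_mean <- asinh(mean(raw_out) / 5) - asinh(mean(raw) / 5)
shift_mean_of_asinh <- mean(asinh(raw_out / 5)) - mean(asinh(raw / 5))

c(asinh_of_mean = shift_asinh_of_mean, mean_of_asinh = shift_mean_of_asinh)
```

The asinh of the raw mean moves much more than the mean of the asinh-transformed values, because the outlier dominates the raw-scale sum but is only one of 101 points on the asinh scale.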