“To imagine a language is to imagine a form of life.”
Ludwig Wittgenstein, Philosophical Investigations
The purpose of this markdown is to pass CyTOF data through a number of useful programming languages within a single R Markdown file, leveraging best practices as we go.
We will start with R, where we will read in the data normally. Then we’ll take the data that has been processed in R and pass it into other languages directly within the R Markdown. Knowing how to do this will allow CyTOF users to become agnostic to the programming language(s) that will get the job done.
I do most of my CyTOF analysis in R (and especially R Markdown). But it’s nice to know that I can move my data into other languages at any point depending on need.
library(flowCore)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(1)
setwd(here::here("data"))
ff <- flowCore::read.FCS('Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs')
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Nd142Di' channel exceed its $PnR value 1509 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Yb176Di' channel exceed its $PnR value 1349 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
## name desc range minRange maxRange
## $P1 Ce140Di beads1 2911 0 7
## $P2 Ce142Di beads2 1509 0 1
## $P3 Er167Di CD27 781 0 17
## $P4 Er170Di CD3 1664 0 27
## $P5 Eu151Di CD123 2814 0 16
## ... ... ... ... ... ...
## $P26 Tm169Di CD45RA 12000 0 23
## $P27 Yb172Di CD38 5629 0 18
## $P28 Yb174Di HLADR 6990 0 11
## $P29 Yb176Di CD56 1349 0 12
## $P30 Time Time 0 0 1
## 261 keywords are stored in the 'description' slot
Let’s tidy the data.
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc
# Any NA values in the marker descriptions revert to their mass tag names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params
Now we do the asinh transform, with the CyTOF-standard cofactor of 5.
exp <- flowCore::exprs(ff)
as_tibble(exp)
## # A tibble: 93,816 × 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event_length CD33
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 1.74 2.35 2.09 0 24.3 3.41
## 2 0 0 0 56.1 3.39 16.7 0 25.2 0.686
## 3 0 0 0.437 2.93 6.23 228. 0 17.7 2.23
## 4 0 0 108. 288. 5.61 91.3 0 15.9 2.84
## 5 2.48 0 9.69 7.18 3.35 22.2 0 20.5 5.29
## 6 1.49 0 27.6 191. 2.44 2.25 0 24.3 0
## 7 1.57 0 26.0 160. 2.30 46.6 0 27.1 0.894
## 8 0 0 1.29 42.2 2.56 22.9 0 23.3 0
## 9 0 0 4.12 9.30 0 14.1 0 19.6 8.39
## 10 0 0 0.715 0 3.28 1.21 0 19.6 70.8
## # ℹ 93,806 more rows
## # ℹ 21 more variables: CD14 <dbl>, CD61 <dbl>, cells1 <dbl>, cells2 <dbl>,
## # beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>,
## # CD235 <dbl>, livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>,
## # CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>
exp <- asinh(exp/5)
as_tibble(exp)
## # A tibble: 93,816 × 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event_length CD33 CD14
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.342 0.454 0.407 0 2.28 0.638 1.59
## 2 0 0 0 3.11 0.635 1.92 0 2.32 0.137 1.35
## 3 0 0 0.0872 0.557 1.05 4.51 0 1.98 0.432 1.08
## 4 0 0 3.76 4.75 0.965 3.60 0 1.87 0.542 0.446
## 5 0.478 0 1.42 1.16 0.627 2.19 0 2.12 0.922 1.88
## 6 0.294 0 2.41 4.34 0.470 0.435 0 2.28 0 0.0199
## 7 0.308 0 2.35 4.16 0.444 2.93 0 2.39 0.178 0.384
## 8 0 0 0.255 2.83 0.492 2.22 0 2.24 0 1.60
## 9 0 0 0.752 1.38 0 1.76 0 2.07 1.29 2.20
## 10 0 0 0.142 0 0.617 0.240 0 2.07 3.34 2.01
## # ℹ 93,806 more rows
## # ℹ 20 more variables: CD61 <dbl>, cells1 <dbl>, cells2 <dbl>, beads4 <dbl>,
## # beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>, CD235 <dbl>,
## # livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>, CD11C <dbl>,
## # CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>
flowCore::exprs(ff) <- exp
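The asinh transform with a cofactor of 5 is a one-liner in any language, which is handy once the data starts moving between environments. Here is a minimal Python sketch with numpy on hypothetical values, including the round trip back to raw counts:

```python
import numpy as np

# Hypothetical raw counts spanning a few orders of magnitude
raw = np.array([0.0, 5.0, 100.0, 5000.0])

# CyTOF-standard asinh transform with cofactor 5
transformed = np.arcsinh(raw / 5)

# The transform is invertible: sinh undoes arcsinh
recovered = np.sinh(transformed) * 5
print(np.allclose(recovered, raw))  # → True
```

The cofactor compresses the near-zero noise band while keeping high signal roughly log-like, which is why it is the default choice for mass cytometry.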
We get the surface markers of interest.
params <- ff@parameters@data$desc
# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))
surface <- c(cd, ig, hla, ccr)
surface
## [1] 3 4 5 9 10 11 16 17 18 19 20 22 23 24 25 26 27 29 28
dat <- flowCore::exprs(ff)[,surface] %>% as_tibble()
dat <- dat[sample(nrow(dat), 10000),]
dat
## # A tibble: 10,000 × 19
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4 CD8A CD16 CD235 CD20
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.893 1.07 0.995 0.382 1.23 0.104 0.0445 1.33 1.72 4.69 2.42 2.27
## 2 2.90 4.36 0.436 0.0821 0.200 0.692 0.298 2.31 1.40 1.41 1.16 0.941
## 3 1.45 3.48 0.176 0.490 1.53 0.333 0.302 1.11 3.52 1.69 1.45 1.74
## 4 0 0 0.504 4.09 2.69 0.685 0.0548 1.90 2.03 2.44 0.687 1.95
## 5 0.414 0.833 1.05 4.41 3.62 1.42 0.158 0.0475 0.947 1.31 1.79 0.827
## 6 0.479 0 0.113 4.05 3.28 1.26 0 0.990 1.43 1.32 1.16 1.02
## 7 2.47 3.93 0.227 0.712 1.23 0.622 0 2.83 1.33 1.79 1.90 2.18
## 8 0.315 3.36 0.0257 0 0.795 0 0.385 0.653 0.766 3.17 1.22 1.68
## 9 0.838 0.111 1.04 1.85 2.17 0 2.30 1.69 2.53 4.90 3.01 2.68
## 10 2.86 3.65 0 0.628 0.316 0 0 0.332 1.42 1.24 1.33 0.730
## # ℹ 9,990 more rows
## # ℹ 7 more variables: CD66 <dbl>, CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>,
## # CD38 <dbl>, CD56 <dbl>, HLADR <dbl>
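The grep-based marker selection above translates directly to pandas once the data is on the Python side; here is a sketch on a hypothetical toy frame using `DataFrame.filter` with the same patterns:

```python
import pandas as pd

# Hypothetical toy frame mixing marker and non-marker columns
df = pd.DataFrame({
    "CD3": [1.0, 2.0], "CD8A": [0.5, 0.7], "HLADR": [0.1, 0.2],
    "beads1": [0.0, 0.0], "Time": [1.0, 2.0],
})

# Keep columns whose names match any of the marker patterns
surface = df.filter(regex="CD|Ig|HLA|CCR|CXC")
print(list(surface.columns))  # → ['CD3', 'CD8A', 'HLADR']
```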
In Python, we want to start by creating a virtual environment. How do you do that directly in an R Markdown? With the reticulate package, as follows:
library(reticulate)
reticulate::virtualenv_create("myenv", python = "python3", packages = c("pandas"))
## virtualenv: myenv
reticulate::use_virtualenv("myenv", required = TRUE)
Ok, now let’s pull the data frame into Python and have a look at it.
import pandas as pd
dat_python = r.dat
dat_python
## CD27 CD3 CD123 ... CD38 CD56 HLADR
## 0 0.893362 1.065713 0.994633 ... 4.482064 4.346219 1.432408
## 1 2.896564 4.362652 0.436426 ... 3.510879 0.222827 0.250180
## 2 1.454037 3.482632 0.176360 ... 0.937635 0.449524 0.920010
## 3 0.000000 0.000000 0.503708 ... 3.116092 0.211046 2.345862
## 4 0.414235 0.832575 1.050114 ... 3.470916 0.083655 2.277326
## ... ... ... ... ... ... ... ...
## 9995 3.292594 4.659797 0.084277 ... 2.030386 0.647847 0.240177
## 9996 0.961703 3.510068 0.000000 ... 0.351932 0.240576 0.000000
## 9997 0.983802 1.316247 1.352664 ... 2.339488 3.500288 0.467577
## 9998 0.324605 0.577332 0.000000 ... 3.207215 3.685156 1.372337
## 9999 0.449814 0.661805 0.710178 ... 4.701716 3.662713 1.170080
##
## [10000 rows x 19 columns]
And now let’s put it back into R.
dat_r <- py$dat_python %>% as_tibble()
dat_r
## # A tibble: 10,000 × 19
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4 CD8A CD16 CD235 CD20
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.893 1.07 0.995 0.382 1.23 0.104 0.0445 1.33 1.72 4.69 2.42 2.27
## 2 2.90 4.36 0.436 0.0821 0.200 0.692 0.298 2.31 1.40 1.41 1.16 0.941
## 3 1.45 3.48 0.176 0.490 1.53 0.333 0.302 1.11 3.52 1.69 1.45 1.74
## 4 0 0 0.504 4.09 2.69 0.685 0.0548 1.90 2.03 2.44 0.687 1.95
## 5 0.414 0.833 1.05 4.41 3.62 1.42 0.158 0.0475 0.947 1.31 1.79 0.827
## 6 0.479 0 0.113 4.05 3.28 1.26 0 0.990 1.43 1.32 1.16 1.02
## 7 2.47 3.93 0.227 0.712 1.23 0.622 0 2.83 1.33 1.79 1.90 2.18
## 8 0.315 3.36 0.0257 0 0.795 0 0.385 0.653 0.766 3.17 1.22 1.68
## 9 0.838 0.111 1.04 1.85 2.17 0 2.30 1.69 2.53 4.90 3.01 2.68
## 10 2.86 3.65 0 0.628 0.316 0 0 0.332 1.42 1.24 1.33 0.730
## # ℹ 9,990 more rows
## # ℹ 7 more variables: CD66 <dbl>, CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>,
## # CD38 <dbl>, CD56 <dbl>, HLADR <dbl>
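Conceptually, this round trip works because a data frame is just a named set of columns. A rough Python sketch of the idea, using pandas on toy values (this is an illustration of the concept, not reticulate’s actual mechanism):

```python
import pandas as pd

# Hypothetical two-marker frame
df = pd.DataFrame({"CD3": [0.3, 3.1], "CD8A": [1.7, 1.4]})

# Serialize to a plain dict of columns, then rebuild the frame
as_columns = df.to_dict(orient="list")
rebuilt = pd.DataFrame(as_columns)

print(rebuilt.equals(df))  # → True
```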
In later markdowns, we will explore what can be done with CyTOF data in Python, which is more of a general-purpose language than R.
Julia is a fast and efficient language that is a good alternative to R and Python if you need to run something faster, but with the syntax of a higher-level language. I’ve explored the Julia language before with respect to its implementation of UMAP, which I showed was faster than R’s implementation. For more on that, go here
Let’s pull the CyTOF data into Julia so you can see what that looks like. We’re going to use the JuliaConnectoR and JuliaCall packages. The latter will import the RCall package into Julia, which allows us to move data around between R and Julia.
library(JuliaConnectoR)
library(JuliaCall)
JuliaCall::julia_setup(JULIA_HOME = "/Applications/Julia-1.8.app/Contents/Resources/julia/bin/")
## Julia version 1.8.5 at location /Applications/Julia-1.8.app/Contents/Resources/julia/bin will be used.
## Loading setup script for JuliaCall...
## Finish loading setup script for JuliaCall.
JuliaCall::julia_command("using Pkg; Pkg.add(\"RCall\")")
One quirk you need to know about operating in Julia, at least if you’re going to feed your data into its UMAP implementation, is that the matrices you import must be transposed (markers as rows, cells as columns). I do that below.
dat_julia <- t(dat)
dat_julia[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## CD27 0.8933623 2.89656390 1.4540374 0.0000000 0.4142347
## CD3 1.0657127 4.36265231 3.4826318 0.0000000 0.8325753
## CD123 0.9946332 0.43642573 0.1763605 0.5037082 1.0501137
## CD33 0.3823967 0.08214474 0.4902337 4.0885739 4.4137900
## CD14 1.2311487 0.20033205 1.5299325 2.6906949 3.6192467
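This orientation issue is not unique to Julia: some libraries want a features × observations matrix while data frames are stored observations × features. A quick numpy sketch of the shape flip, on a toy matrix of the same dimensions as our data:

```python
import numpy as np

# Hypothetical cells-by-markers matrix: 10,000 cells, 19 markers
dat = np.zeros((10000, 19))

# A features-by-observations consumer wants markers-by-cells, so transpose
dat_julia = dat.T
print(dat.shape, dat_julia.shape)  # → (10000, 19) (19, 10000)
```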
Note that when we pull the data directly into Julia, it automatically prints what you have.
@rget dat_julia
## 19×10000 Matrix{Float64}:
## 0.893362 2.89656 1.45404 0.0 … 0.983802 0.324605 0.449814
## 1.06571 4.36265 3.48263 0.0 1.31625 0.577332 0.661805
## 0.994633 0.436426 0.17636 0.503708 1.35266 0.0 0.710178
## 0.382397 0.0821447 0.490234 4.08857 0.276608 0.10213 0.99656
## 1.23115 0.200332 1.52993 2.69069 0.849813 0.940879 0.947871
## 0.103708 0.69206 0.33269 0.685085 … 0.474183 0.0827106 0.486418
## 0.0444569 0.297682 0.302101 0.0548172 0.0 0.624719 0.953517
## 1.33058 2.31387 1.10763 1.89811 0.96037 1.25523 1.38368
## 1.72375 1.40033 3.51539 2.02925 2.10112 1.75524 2.30779
## 4.69467 1.40753 1.69445 2.44133 4.19829 4.25289 4.96401
## 2.41594 1.16012 1.45371 0.686639 … 2.18343 1.7767 2.50077
## 2.27385 0.941243 1.74271 1.95093 1.53315 2.33247 2.75057
## 1.48265 0.901397 0.910939 1.49271 1.56159 0.753175 1.71092
## 3.79011 4.56578 4.81816 5.09114 4.0469 4.31683 4.83699
## 2.70277 0.0 1.2289 3.71234 1.42651 0.879218 2.17822
## 2.98534 4.00644 1.18134 0.067049 … 3.19505 3.12861 4.59856
## 4.48206 3.51088 0.937635 3.11609 2.33949 3.20722 4.70172
## 4.34622 0.222827 0.449524 0.211046 3.50029 3.68516 3.66271
## 1.43241 0.25018 0.92001 2.34586 0.467577 1.37234 1.17008
Now dat_julia can be manipulated as you wish within Julia code blocks. Before we move it back into R, let’s add 1 to the whole matrix so you’re convinced that it was indeed in Julia. Note that Julia uses the broadcast operator .+ to add 1 to every element of the matrix.
dat_1 = dat_julia .+ 1
## 19×10000 Matrix{Float64}:
## 1.89336 3.89656 2.45404 1.0 … 1.9617 1.9838 1.32461 1.44981
## 2.06571 5.36265 4.48263 1.0 4.51007 2.31625 1.57733 1.66181
## 1.99463 1.43643 1.17636 1.50371 1.0 2.35266 1.0 1.71018
## 1.3824 1.08214 1.49023 5.08857 1.49436 1.27661 1.10213 1.99656
## 2.23115 1.20033 2.52993 3.69069 1.70067 1.84981 1.94088 1.94787
## 1.10371 1.69206 1.33269 1.68508 … 1.0 1.47418 1.08271 1.48642
## 1.04446 1.29768 1.3021 1.05482 1.14217 1.0 1.62472 1.95352
## 2.33058 3.31387 2.10763 2.89811 3.54422 1.96037 2.25523 2.38368
## 2.72375 2.40033 4.51539 3.02925 2.11413 3.10112 2.75524 3.30779
## 5.69467 2.40753 2.69445 3.44133 2.88941 5.19829 5.25289 5.96401
## 3.41594 2.16012 2.45371 1.68664 … 2.33553 3.18343 2.7767 3.50077
## 3.27385 1.94124 2.74271 2.95093 2.22709 2.53315 3.33247 3.75057
## 2.48265 1.9014 1.91094 2.49271 2.52791 2.56159 1.75317 2.71092
## 4.79011 5.56578 5.81816 6.09114 5.63253 5.0469 5.31683 5.83699
## 3.70277 1.0 2.2289 4.71234 1.18129 2.42651 1.87922 3.17822
## 3.98534 5.00644 2.18134 1.06705 … 3.93722 4.19505 4.12861 5.59856
## 5.48206 4.51088 1.93763 4.11609 1.35193 3.33949 4.20722 5.70172
## 5.34622 1.22283 1.44952 1.21105 1.24058 4.50029 4.68516 4.66271
## 2.43241 1.25018 1.92001 3.34586 1.0 1.46758 2.37234 2.17008
Now we will place this new matrix back into R.
@rput dat_1
## 19×10000 Matrix{Float64}:
## 1.89336 3.89656 2.45404 1.0 … 1.9617 1.9838 1.32461 1.44981
## 2.06571 5.36265 4.48263 1.0 4.51007 2.31625 1.57733 1.66181
## 1.99463 1.43643 1.17636 1.50371 1.0 2.35266 1.0 1.71018
## 1.3824 1.08214 1.49023 5.08857 1.49436 1.27661 1.10213 1.99656
## 2.23115 1.20033 2.52993 3.69069 1.70067 1.84981 1.94088 1.94787
## 1.10371 1.69206 1.33269 1.68508 … 1.0 1.47418 1.08271 1.48642
## 1.04446 1.29768 1.3021 1.05482 1.14217 1.0 1.62472 1.95352
## 2.33058 3.31387 2.10763 2.89811 3.54422 1.96037 2.25523 2.38368
## 2.72375 2.40033 4.51539 3.02925 2.11413 3.10112 2.75524 3.30779
## 5.69467 2.40753 2.69445 3.44133 2.88941 5.19829 5.25289 5.96401
## 3.41594 2.16012 2.45371 1.68664 … 2.33553 3.18343 2.7767 3.50077
## 3.27385 1.94124 2.74271 2.95093 2.22709 2.53315 3.33247 3.75057
## 2.48265 1.9014 1.91094 2.49271 2.52791 2.56159 1.75317 2.71092
## 4.79011 5.56578 5.81816 6.09114 5.63253 5.0469 5.31683 5.83699
## 3.70277 1.0 2.2289 4.71234 1.18129 2.42651 1.87922 3.17822
## 3.98534 5.00644 2.18134 1.06705 … 3.93722 4.19505 4.12861 5.59856
## 5.48206 4.51088 1.93763 4.11609 1.35193 3.33949 4.20722 5.70172
## 5.34622 1.22283 1.44952 1.21105 1.24058 4.50029 4.68516 4.66271
## 2.43241 1.25018 1.92001 3.34586 1.0 1.46758 2.37234 2.17008
Now back in R. Note again that we have to transpose the matrix. We lost the column names when we imported it into Julia, so we’re putting them back here. There are ways to keep the column names, but we can talk about that some other time. Right now, I’m just operating in base Julia so you can see how to move data around in the two environments.
dat_r1 <- dat_1 %>% t() %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
names(dat_r1) <- names(dat)
dat_r1
## # A tibble: 10,000 × 19
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4 CD8A CD16 CD235 CD20 CD66
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.89 2.07 1.99 1.38 2.23 1.10 1.04 2.33 2.72 5.69 3.42 3.27 2.48
## 2 3.90 5.36 1.44 1.08 1.20 1.69 1.30 3.31 2.40 2.41 2.16 1.94 1.90
## 3 2.45 4.48 1.18 1.49 2.53 1.33 1.30 2.11 4.52 2.69 2.45 2.74 1.91
## 4 1 1 1.50 5.09 3.69 1.69 1.05 2.90 3.03 3.44 1.69 2.95 2.49
## 5 1.41 1.83 2.05 5.41 4.62 2.42 1.16 1.05 1.95 2.31 2.79 1.83 2.47
## 6 1.48 1 1.11 5.05 4.28 2.26 1 1.99 2.43 2.32 2.16 2.02 2.34
## 7 3.47 4.93 1.23 1.71 2.23 1.62 1 3.83 2.33 2.79 2.90 3.18 2.53
## 8 1.32 4.36 1.03 1 1.79 1 1.38 1.65 1.77 4.17 2.22 2.68 2.22
## 9 1.84 1.11 2.04 2.85 3.17 1 3.30 2.69 3.53 5.90 4.01 3.68 2.72
## 10 3.86 4.65 1 1.63 1.32 1 1 1.33 2.42 2.24 2.33 1.73 1.98
## # ℹ 9,990 more rows
## # ℹ 6 more variables: CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>,
## # CD56 <dbl>, HLADR <dbl>
C++ is a low-level language that is very fast. It is much older than the other languages here, but still very much in use. It is one of the first programming languages I learned, and I come back to it every once in a while, especially when I’m doing imaging work. R has an interesting interface to C++ called Rcpp. We’re going to use that here.
We’re going to write a function in C++ and then use it in R: it will print the first 5 rows and columns of our CyTOF data. The general point is that if you want to do something in C++, you write the function directly in C++, and then call it in R.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void printDataFrame(DataFrame df) {
    int numCols = df.size();
    int numRows = df.nrows();

    // Limit to first 5 columns
    numCols = numCols < 5 ? numCols : 5;

    for (int i = 0; i < numCols; i++) {
        Rcpp::Rcout << "Column " << i + 1 << ": ";
        NumericVector col = df[i];

        // Print the first 5 rows of the column
        for (int j = 0; j < numRows && j < 5; j++) {
            Rcpp::Rcout << col[j] << " ";
        }
        Rcpp::Rcout << "\n";
    }
}
# This is R code. We're back in R.
printDataFrame(dat)
## Column 1: 0.893362 2.89656 1.45404 0 0.414235
## Column 2: 1.06571 4.36265 3.48263 0 0.832575
## Column 3: 0.994633 0.436426 0.17636 0.503708 1.05011
## Column 4: 0.382397 0.0821447 0.490234 4.08857 4.41379
## Column 5: 1.23115 0.200332 1.52993 2.69069 3.61925
It sure looks like a lot of code just to print a data frame. What I can tell you is that my primary use case for C++ (via Rcpp) is when I’m dealing with imaging data. But the big point is that you can integrate C++ code into your R Markdown files rather than maintaining separate scripts.
SQL is a language you use to query databases. You can also use it to query CyTOF data. Now, we can use SQL directly within R Markdown, but that would be more useful if we had CyTOF data stored externally in a relational database. That’s for another time. But to at least show you the syntax, we can use the sqldf package.
So, let’s find some T cells. Note that this is similar to dplyr::filter(), but SQL can be used to make very complicated queries, so it’s good to know how to properly call it.
Below, the translation of the SQL is “select all the rows from the dat object where the column called CD3 has values greater than some number, which we’ll call x.” In dplyr terms, it’s equivalent to dplyr::filter(dat, CD3 > x). Pick your poison.
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
result <- sqldf("SELECT * FROM dat WHERE CD3 > 5.3")
print(result)
## CD27 CD3 CD123 CD33 CD14 CD61 CD19
## 1 3.1264681 5.370645 1.0021870 0.0000000 1.5759916 0.11341906 0.00000000
## 2 0.7103834 5.907235 1.2547949 0.3177335 1.8698172 0.04869975 0.00000000
## 3 2.9358062 5.308189 1.4780226 0.6108917 0.7769289 0.00000000 0.05806983
## 4 3.0865904 5.351840 0.8511912 0.1929881 2.2991696 0.00000000 0.00000000
## 5 3.4145370 5.346575 1.7870004 1.6294845 1.9240186 0.76869842 0.87898998
## 6 4.0330824 5.302419 1.2703423 0.0000000 1.6397655 0.51174505 0.00000000
## 7 3.3052528 5.359280 0.6599960 0.3097051 1.3083458 0.14727819 0.00000000
## CD4 CD8A CD16 CD235 CD20 CD66 CD45 CD11C
## 1 3.120505 1.974160 2.565491 2.008806 1.956707 1.245664 5.519997 1.0732701
## 2 1.775879 2.758985 3.391104 2.197680 2.198716 2.282003 6.094073 0.6973995
## 3 3.295261 4.242156 2.819916 2.289628 2.330265 1.858125 6.125014 2.8182823
## 4 2.869384 2.195726 2.665933 1.965704 2.264631 2.394309 5.685564 1.6070940
## 5 2.871258 4.278365 3.158603 2.988147 2.130051 1.747186 5.737861 1.6370463
## 6 3.395235 2.172157 2.884782 1.880756 2.533725 1.932959 5.753517 0.0000000
## 7 2.866402 1.797314 2.241456 1.695375 2.085744 1.241980 5.382069 0.0000000
## CD45RA CD38 CD56 HLADR
## 1 5.293691 2.857672 0.4629084 0.06160644
## 2 5.871059 1.526704 1.1450537 0.14262863
## 3 5.848043 3.578526 0.7595501 1.24007899
## 4 5.852604 3.842490 0.4825257 0.80452049
## 5 5.576879 4.148261 3.4024757 1.12120975
## 6 6.091201 3.417513 0.8275273 0.23711468
## 7 5.344108 1.254631 0.3267140 0.22950380
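To see where SQL pulls ahead of a simple filter, here is a sketch using Python’s built-in sqlite3 module on a hypothetical toy table, combining two marker thresholds in one WHERE clause. Since sqldf uses SQLite by default, the same SELECT string would work there too:

```python
import sqlite3

# Hypothetical toy table of transformed marker values
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dat (CD3 REAL, CD8A REAL, CD4 REAL)")
con.executemany(
    "INSERT INTO dat VALUES (?, ?, ?)",
    [(5.5, 4.2, 0.3), (5.4, 0.2, 3.1), (0.1, 0.0, 0.2)],
)

# CD8 T cells: high CD3 AND high CD8A
# dplyr equivalent: filter(dat, CD3 > 5.3, CD8A > 3)
rows = con.execute("SELECT * FROM dat WHERE CD3 > 5.3 AND CD8A > 3").fetchall()
print(rows)  # → [(5.5, 4.2, 0.3)]
```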
A programmer friend of mine recommended that I look into Rust, so here we are. Rust is a relatively new low-level programming language that emphasizes performance and memory safety. Thus, Rust is a nice alternative to consider for anything in R that you were otherwise going to offload to C++. Here is an article saying that Rust is going to become a thing in data science, so I’m keeping my eyes open.
Let’s see how we can process our CyTOF data using Rust, in the R Markdown.
We’re going to use this tutorial.
First, we install the package rextendr. Let’s make sure Rust actually works. Here’s a simple example from the tutorial. Note that we’re still using R, but we’re calling a Rust function from within R.
library(rextendr)
# create a Rust function
rextendr::rust_function("fn add(a:f64, b:f64) -> f64 { a + b }")
## ℹ build directory: '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32'
## ✔ Writing '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32/target/extendr_wrappers.R'
# call it from R
add(2.5, 4.7)
## [1] 7.2
And as another way of doing it, let’s use a direct Rust code block. You can’t see this in the rendered output, but I’m using {extendr} as the chunk engine name. Again, this block is directly from the tutorial.
rprintln!("Hello from Rust!");
let x = 5;
let y = 7;
let z = x*y;
z
## Hello from Rust!
## [1] 35
Ok, now let’s do something with our CyTOF data. Say I want to do something to each column. Here is a function rust_get_sum from the tutorial that takes the sum of either integers or doubles (essentially numbers with decimal points).
I tried to get the whole CyTOF matrix into Rust, but that turned out to be more trouble than it was worth, so the easier path is to just leverage Rust in the things you otherwise were going to do in R.
Anyway, here’s the function, and this is an R code block.
rust_function(
"fn rust_get_sum(x : Either<Integers, Doubles>) -> Either<Rint, Rfloat> {
match x {
Either::Left(x) => Either::Left(x.iter().sum()),
Either::Right(x) => Either::Right(x.iter().sum()),
}
}",
use_dev_extendr = TRUE, # Use development version of extendr from GitHub
features = "either", # Enable support for Either crate
extendr_fn_options = list(use_try_from = TRUE) # Enable advanced type conversion
)
## ℹ build directory: '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32'
## ✔ Writing '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32/target/extendr_wrappers.R'
So here, we get the sums of each column of our dataset.
dat_sums <- purrr::map(dat, rust_get_sum) %>% unlist()
dat_sums
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4
## 14423.692 26452.568 6687.240 7560.231 11253.725 3224.887 5598.934 16612.703
## CD8A CD16 CD235 CD20 CD66 CD45 CD11C CD45RA
## 23923.903 22982.456 18212.965 18450.185 14112.381 48206.362 10295.189 29021.748
## CD38 CD56 HLADR
## 21722.920 10319.074 11859.311
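For comparison, the same map-a-summary-over-columns pattern looks like this in Python with pandas (toy values, hypothetical frame):

```python
import pandas as pd

# Hypothetical two-marker frame
df = pd.DataFrame({"CD3": [1.0, 2.0, 3.0], "CD8A": [0.5, 0.5, 1.0]})

# Map a sum over each column, analogous to purrr::map(dat, sum)
col_sums = {col: float(df[col].sum()) for col in df.columns}
print(col_sums)  # → {'CD3': 6.0, 'CD8A': 2.0}
```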
So out of curiosity, given I haven’t worked with Rust before, let’s compare how long the Rust and base R versions take.
start_time <- Sys.time()
dat_sums <- purrr::map(dat, rust_get_sum) %>% unlist()
end_time <- Sys.time()
time_taken <- end_time - start_time
time_taken
## Time difference of 0.01321387 secs
start_time <- Sys.time()
dat_sums <- purrr::map(dat, sum)
end_time <- Sys.time()
time_taken <- end_time - start_time
time_taken
## Time difference of 0.001747847 secs
Ok, it looks like the base R implementation is faster here. That’s not too surprising: R’s sum() is a primitive implemented in C, and each call into the Rust function pays a small cross-language overhead. (For a fairer comparison, a benchmarking package like microbenchmark would average over many runs rather than relying on a single Sys.time() measurement.)
As I learn Rust, I’ll update the document as I find better use cases for using Rust for CyTOF and related data.
There are a few other things you can do within R Markdown, like run bash scripts. I tried getting R Markdown to run other languages like Fortran and Lisp, just for fun. What I found was that you can execute external scripts within R Markdown, but you can’t make code chunks specifically for those languages.
There are other literate programming environments, like Jupyter notebooks and Emacs Org Mode, that allow for a much more robust selection of languages. But that is for another time. In my experience, most of my CyTOF analysis is done in R, with the occasional Python, and increasingly Julia to speed up code. C++ has been used only for imaging datasets so far, but it’s nice to know that I have it in the toolkit. I have not used SQL yet in my CyTOF projects, but I have worked with large databases, and it’s nice to know that it’s possible to use those commands as needed. Rust is an interesting case that could become a thing for data science down the line, so I’m paying attention to it.