“To imagine a language is to imagine a form of life.”
Ludwig Wittgenstein, Philosophical Investigations
The purpose of this markdown is to pass CyTOF data through a number of useful programming languages within a single R Markdown file, leveraging best practices as we go.
We will start with R, where we will read in the data normally. Then we’ll take the data that has been processed in R and pass it into other languages directly within the R Markdown. Knowing how to do this will allow CyTOF users to become agnostic to the programming language(s) that will get the job done.
I do most of my CyTOF analysis in R (and especially R Markdown). But it’s nice to know that I can move my data into other languages at any point depending on need.
library(flowCore)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks flowCore::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(1)
setwd(here::here("data"))
ff <- flowCore::read.FCS('Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs')
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Nd142Di' channel exceed its $PnR value 1509 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
## Warning in readFCSdata(con, offsets, txt, transformation, which.lines, scale, : Some data values of 'Yb176Di' channel exceed its $PnR value 1349 and will be truncated!
## To avoid truncation, either fix $PnR before generating FCS or set 'truncate_max_range = FALSE'
ff
## flowFrame object 'Center_1_A_1_MATLAB_LiveIntactSing_Linneg_CD45pos.fcs'
## with 93816 cells and 30 observables:
## name desc range minRange maxRange
## $P1 Ce140Di beads1 2911 0 7
## $P2 Ce142Di beads2 1509 0 1
## $P3 Er167Di CD27 781 0 17
## $P4 Er170Di CD3 1664 0 27
## $P5 Eu151Di CD123 2814 0 16
## ... ... ... ... ... ...
## $P26 Tm169Di CD45RA 12000 0 23
## $P27 Yb172Di CD38 5629 0 18
## $P28 Yb174Di HLADR 6990 0 11
## $P29 Yb176Di CD56 1349 0 12
## $P30 Time Time 0 0 1
## 261 keywords are stored in the 'description' slot
Let’s tidy the data.
mass_tags <- ff@parameters@data$name
params <- ff@parameters@data$desc
# Any NA values in the marker descriptions revert to their mass tag names
params <- ifelse(is.na(params), mass_tags, params)
colnames(ff) <- params
Now we do the asinh transform, with the CyTOF-standard cofactor of 5.
exp <- flowCore::exprs(ff)
as_tibble(exp)
## # A tibble: 93,816 × 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event_length CD33
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 1.74 2.35 2.09 0 24.3 3.41
## 2 0 0 0 56.1 3.39 16.7 0 25.2 0.686
## 3 0 0 0.437 2.93 6.23 228. 0 17.7 2.23
## 4 0 0 108. 288. 5.61 91.3 0 15.9 2.84
## 5 2.48 0 9.69 7.18 3.35 22.2 0 20.5 5.29
## 6 1.49 0 27.6 191. 2.44 2.25 0 24.3 0
## 7 1.57 0 26.0 160. 2.30 46.6 0 27.1 0.894
## 8 0 0 1.29 42.2 2.56 22.9 0 23.3 0
## 9 0 0 4.12 9.30 0 14.1 0 19.6 8.39
## 10 0 0 0.715 0 3.28 1.21 0 19.6 70.8
## # ℹ 93,806 more rows
## # ℹ 21 more variables: CD14 <dbl>, CD61 <dbl>, cells1 <dbl>, cells2 <dbl>,
## # beads4 <dbl>, beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>,
## # CD235 <dbl>, livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>,
## # CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>
exp <- asinh(exp/5)
as_tibble(exp)
## # A tibble: 93,816 × 30
## beads1 beads2 CD27 CD3 CD123 beads3 `Event #` Event_length CD33 CD14
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.342 0.454 0.407 0 2.28 0.638 1.59
## 2 0 0 0 3.11 0.635 1.92 0 2.32 0.137 1.35
## 3 0 0 0.0872 0.557 1.05 4.51 0 1.98 0.432 1.08
## 4 0 0 3.76 4.75 0.965 3.60 0 1.87 0.542 0.446
## 5 0.478 0 1.42 1.16 0.627 2.19 0 2.12 0.922 1.88
## 6 0.294 0 2.41 4.34 0.470 0.435 0 2.28 0 0.0199
## 7 0.308 0 2.35 4.16 0.444 2.93 0 2.39 0.178 0.384
## 8 0 0 0.255 2.83 0.492 2.22 0 2.24 0 1.60
## 9 0 0 0.752 1.38 0 1.76 0 2.07 1.29 2.20
## 10 0 0 0.142 0 0.617 0.240 0 2.07 3.34 2.01
## # ℹ 93,806 more rows
## # ℹ 20 more variables: CD61 <dbl>, cells1 <dbl>, cells2 <dbl>, beads4 <dbl>,
## # beads5 <dbl>, CD19 <dbl>, CD4 <dbl>, CD8A <dbl>, CD16 <dbl>, CD235 <dbl>,
## # livedead <dbl>, CD20 <dbl>, CD66 <dbl>, CD45 <dbl>, CD11C <dbl>,
## # CD45RA <dbl>, CD38 <dbl>, HLADR <dbl>, CD56 <dbl>, Time <dbl>
flowCore::exprs(ff) <- exp
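The asinh transform with a cofactor of 5 is a one-liner in any language, which is handy once the data starts moving between environments. Here is a minimal Python sketch with numpy on hypothetical values, including the round trip back to raw counts:

```python
import numpy as np

# Hypothetical raw counts spanning a few orders of magnitude
raw = np.array([0.0, 5.0, 100.0, 5000.0])

# CyTOF-standard asinh transform with cofactor 5
transformed = np.arcsinh(raw / 5)

# The transform is invertible: sinh undoes arcsinh
recovered = np.sinh(transformed) * 5
print(np.allclose(recovered, raw))  # → True
```

The cofactor compresses the near-zero noise band while keeping high signal roughly log-like, which is why it is the default choice for mass cytometry.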
We get the surface markers of interest.
params <- ff@parameters@data$desc
# Markers to use
cd <- grep("CD", params)
ig <- grep("Ig", params)
hla <- grep("HLA", params)
ccr <- c(grep("CCR", params), grep("CXC", params))
surface <- c(cd, ig, hla, ccr)
surface
## [1] 3 4 5 9 10 11 16 17 18 19 20 22 23 24 25 26 27 29 28
dat <- flowCore::exprs(ff)[,surface] %>% as_tibble()
dat <- dat[sample(nrow(dat), 10000),]
dat
## # A tibble: 10,000 × 19
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4 CD8A CD16 CD235 CD20
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.893 1.07 0.995 0.382 1.23 0.104 0.0445 1.33 1.72 4.69 2.42 2.27
## 2 2.90 4.36 0.436 0.0821 0.200 0.692 0.298 2.31 1.40 1.41 1.16 0.941
## 3 1.45 3.48 0.176 0.490 1.53 0.333 0.302 1.11 3.52 1.69 1.45 1.74
## 4 0 0 0.504 4.09 2.69 0.685 0.0548 1.90 2.03 2.44 0.687 1.95
## 5 0.414 0.833 1.05 4.41 3.62 1.42 0.158 0.0475 0.947 1.31 1.79 0.827
## 6 0.479 0 0.113 4.05 3.28 1.26 0 0.990 1.43 1.32 1.16 1.02
## 7 2.47 3.93 0.227 0.712 1.23 0.622 0 2.83 1.33 1.79 1.90 2.18
## 8 0.315 3.36 0.0257 0 0.795 0 0.385 0.653 0.766 3.17 1.22 1.68
## 9 0.838 0.111 1.04 1.85 2.17 0 2.30 1.69 2.53 4.90 3.01 2.68
## 10 2.86 3.65 0 0.628 0.316 0 0 0.332 1.42 1.24 1.33 0.730
## # ℹ 9,990 more rows
## # ℹ 7 more variables: CD66 <dbl>, CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>,
## # CD38 <dbl>, CD56 <dbl>, HLADR <dbl>
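The grep-based marker selection above translates directly to pandas once the data is on the Python side; here is a sketch on a hypothetical toy frame using `DataFrame.filter` with the same patterns:

```python
import pandas as pd

# Hypothetical toy frame mixing marker and non-marker columns
df = pd.DataFrame({
    "CD3": [1.0, 2.0], "CD8A": [0.5, 0.7], "HLADR": [0.1, 0.2],
    "beads1": [0.0, 0.0], "Time": [1.0, 2.0],
})

# Keep columns whose names match any of the marker patterns
surface = df.filter(regex="CD|Ig|HLA|CCR|CXC")
print(list(surface.columns))  # → ['CD3', 'CD8A', 'HLADR']
```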
In Python, we want to start by creating a virtual environment. How do you do that directly in an R Markdown? With the reticulate package, as follows:
library(reticulate)
reticulate::virtualenv_create("myenv", python = "python3", packages = c("pandas"))
## virtualenv: myenv
reticulate::use_virtualenv("myenv", required = TRUE)
Ok, now let’s pull the data frame into Python and have a look at it.
import pandas as pd
dat_python = r.dat
dat_python
## CD27 CD3 CD123 ... CD38 CD56 HLADR
## 0 0.893362 1.065713 0.994633 ... 4.482064 4.346219 1.432408
## 1 2.896564 4.362652 0.436426 ... 3.510879 0.222827 0.250180
## 2 1.454037 3.482632 0.176360 ... 0.937635 0.449524 0.920010
## 3 0.000000 0.000000 0.503708 ... 3.116092 0.211046 2.345862
## 4 0.414235 0.832575 1.050114 ... 3.470916 0.083655 2.277326
## ... ... ... ... ... ... ... ...
## 9995 3.292594 4.659797 0.084277 ... 2.030386 0.647847 0.240177
## 9996 0.961703 3.510068 0.000000 ... 0.351932 0.240576 0.000000
## 9997 0.983802 1.316247 1.352664 ... 2.339488 3.500288 0.467577
## 9998 0.324605 0.577332 0.000000 ... 3.207215 3.685156 1.372337
## 9999 0.449814 0.661805 0.710178 ... 4.701716 3.662713 1.170080
##
## [10000 rows x 19 columns]
And now let’s put it back into R.
dat_r <- py$dat_python %>% as_tibble()
dat_r
## # A tibble: 10,000 × 19
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4 CD8A CD16 CD235 CD20
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.893 1.07 0.995 0.382 1.23 0.104 0.0445 1.33 1.72 4.69 2.42 2.27
## 2 2.90 4.36 0.436 0.0821 0.200 0.692 0.298 2.31 1.40 1.41 1.16 0.941
## 3 1.45 3.48 0.176 0.490 1.53 0.333 0.302 1.11 3.52 1.69 1.45 1.74
## 4 0 0 0.504 4.09 2.69 0.685 0.0548 1.90 2.03 2.44 0.687 1.95
## 5 0.414 0.833 1.05 4.41 3.62 1.42 0.158 0.0475 0.947 1.31 1.79 0.827
## 6 0.479 0 0.113 4.05 3.28 1.26 0 0.990 1.43 1.32 1.16 1.02
## 7 2.47 3.93 0.227 0.712 1.23 0.622 0 2.83 1.33 1.79 1.90 2.18
## 8 0.315 3.36 0.0257 0 0.795 0 0.385 0.653 0.766 3.17 1.22 1.68
## 9 0.838 0.111 1.04 1.85 2.17 0 2.30 1.69 2.53 4.90 3.01 2.68
## 10 2.86 3.65 0 0.628 0.316 0 0 0.332 1.42 1.24 1.33 0.730
## # ℹ 9,990 more rows
## # ℹ 7 more variables: CD66 <dbl>, CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>,
## # CD38 <dbl>, CD56 <dbl>, HLADR <dbl>
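Conceptually, this round trip works because a data frame is just a named set of columns. A rough Python sketch of the idea, using pandas on toy values (this is an illustration of the concept, not reticulate’s actual mechanism):

```python
import pandas as pd

# Hypothetical two-marker frame
df = pd.DataFrame({"CD3": [0.3, 3.1], "CD8A": [1.7, 1.4]})

# Serialize to a plain dict of columns, then rebuild the frame
as_columns = df.to_dict(orient="list")
rebuilt = pd.DataFrame(as_columns)

print(rebuilt.equals(df))  # → True
```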
In later markdowns, we will explore what can be done with CyTOF data in Python, which is more of a general-purpose language than R.
Julia is a fast and efficient language that is a good alternative to R and Python if you need to run something faster, but with the syntax of a higher-level language. I’ve explored the Julia language before with respect to its implementation of UMAP, which I showed was faster than R’s implementation. For more on that, go here
Let’s pull the CyTOF data into Julia so you can see what that looks like. We’re going to use the JuliaConnectoR and JuliaCall packages. The latter will import the RCall package into Julia, which allows us to move data around between R and Julia.
library(JuliaConnectoR)
library(JuliaCall)
JuliaCall::julia_setup(JULIA_HOME = "/Applications/Julia-1.8.app/Contents/Resources/julia/bin/")
## Julia version 1.8.5 at location /Applications/Julia-1.8.app/Contents/Resources/julia/bin will be used.
## Loading setup script for JuliaCall...
## Finish loading setup script for JuliaCall.
JuliaCall::julia_command("using Pkg; Pkg.add(\"RCall\")")
One quirk you need to know about operating in Julia, at least if you’re going to feed your data into its UMAP implementation, is that the matrices you import must be transposed (markers as rows, cells as columns). I do that below.
dat_julia <- t(dat)
dat_julia[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## CD27 0.8933623 2.89656390 1.4540374 0.0000000 0.4142347
## CD3 1.0657127 4.36265231 3.4826318 0.0000000 0.8325753
## CD123 0.9946332 0.43642573 0.1763605 0.5037082 1.0501137
## CD33 0.3823967 0.08214474 0.4902337 4.0885739 4.4137900
## CD14 1.2311487 0.20033205 1.5299325 2.6906949 3.6192467
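This orientation issue is not unique to Julia: some libraries want a features × observations matrix while data frames are stored observations × features. A quick numpy sketch of the shape flip, on a toy matrix of the same dimensions as our data:

```python
import numpy as np

# Hypothetical cells-by-markers matrix: 10,000 cells, 19 markers
dat = np.zeros((10000, 19))

# A features-by-observations consumer wants markers-by-cells, so transpose
dat_julia = dat.T
print(dat.shape, dat_julia.shape)  # → (10000, 19) (19, 10000)
```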
Note that when we pull the data directly into Julia, it automatically prints what you have.
@rget dat_julia
## 19×10000 Matrix{Float64}:
## 0.893362 2.89656 1.45404 0.0 … 0.983802 0.324605 0.449814
## 1.06571 4.36265 3.48263 0.0 1.31625 0.577332 0.661805
## 0.994633 0.436426 0.17636 0.503708 1.35266 0.0 0.710178
## 0.382397 0.0821447 0.490234 4.08857 0.276608 0.10213 0.99656
## 1.23115 0.200332 1.52993 2.69069 0.849813 0.940879 0.947871
## 0.103708 0.69206 0.33269 0.685085 … 0.474183 0.0827106 0.486418
## 0.0444569 0.297682 0.302101 0.0548172 0.0 0.624719 0.953517
## 1.33058 2.31387 1.10763 1.89811 0.96037 1.25523 1.38368
## 1.72375 1.40033 3.51539 2.02925 2.10112 1.75524 2.30779
## 4.69467 1.40753 1.69445 2.44133 4.19829 4.25289 4.96401
## 2.41594 1.16012 1.45371 0.686639 … 2.18343 1.7767 2.50077
## 2.27385 0.941243 1.74271 1.95093 1.53315 2.33247 2.75057
## 1.48265 0.901397 0.910939 1.49271 1.56159 0.753175 1.71092
## 3.79011 4.56578 4.81816 5.09114 4.0469 4.31683 4.83699
## 2.70277 0.0 1.2289 3.71234 1.42651 0.879218 2.17822
## 2.98534 4.00644 1.18134 0.067049 … 3.19505 3.12861 4.59856
## 4.48206 3.51088 0.937635 3.11609 2.33949 3.20722 4.70172
## 4.34622 0.222827 0.449524 0.211046 3.50029 3.68516 3.66271
## 1.43241 0.25018 0.92001 2.34586 0.467577 1.37234 1.17008
Now dat_julia can be manipulated as you wish within Julia code blocks. Before we move it back into R, let’s add 1 to the whole matrix so you’re convinced that it was indeed in Julia. Note that Julia uses the broadcast operator .+ to add 1 to every element of the matrix.
dat_1 = dat_julia .+ 1
## 19×10000 Matrix{Float64}:
## 1.89336 3.89656 2.45404 1.0 … 1.9617 1.9838 1.32461 1.44981
## 2.06571 5.36265 4.48263 1.0 4.51007 2.31625 1.57733 1.66181
## 1.99463 1.43643 1.17636 1.50371 1.0 2.35266 1.0 1.71018
## 1.3824 1.08214 1.49023 5.08857 1.49436 1.27661 1.10213 1.99656
## 2.23115 1.20033 2.52993 3.69069 1.70067 1.84981 1.94088 1.94787
## 1.10371 1.69206 1.33269 1.68508 … 1.0 1.47418 1.08271 1.48642
## 1.04446 1.29768 1.3021 1.05482 1.14217 1.0 1.62472 1.95352
## 2.33058 3.31387 2.10763 2.89811 3.54422 1.96037 2.25523 2.38368
## 2.72375 2.40033 4.51539 3.02925 2.11413 3.10112 2.75524 3.30779
## 5.69467 2.40753 2.69445 3.44133 2.88941 5.19829 5.25289 5.96401
## 3.41594 2.16012 2.45371 1.68664 … 2.33553 3.18343 2.7767 3.50077
## 3.27385 1.94124 2.74271 2.95093 2.22709 2.53315 3.33247 3.75057
## 2.48265 1.9014 1.91094 2.49271 2.52791 2.56159 1.75317 2.71092
## 4.79011 5.56578 5.81816 6.09114 5.63253 5.0469 5.31683 5.83699
## 3.70277 1.0 2.2289 4.71234 1.18129 2.42651 1.87922 3.17822
## 3.98534 5.00644 2.18134 1.06705 … 3.93722 4.19505 4.12861 5.59856
## 5.48206 4.51088 1.93763 4.11609 1.35193 3.33949 4.20722 5.70172
## 5.34622 1.22283 1.44952 1.21105 1.24058 4.50029 4.68516 4.66271
## 2.43241 1.25018 1.92001 3.34586 1.0 1.46758 2.37234 2.17008
Now we will place this new matrix back into R.
@rput dat_1
## 19×10000 Matrix{Float64}:
## 1.89336 3.89656 2.45404 1.0 … 1.9617 1.9838 1.32461 1.44981
## 2.06571 5.36265 4.48263 1.0 4.51007 2.31625 1.57733 1.66181
## 1.99463 1.43643 1.17636 1.50371 1.0 2.35266 1.0 1.71018
## 1.3824 1.08214 1.49023 5.08857 1.49436 1.27661 1.10213 1.99656
## 2.23115 1.20033 2.52993 3.69069 1.70067 1.84981 1.94088 1.94787
## 1.10371 1.69206 1.33269 1.68508 … 1.0 1.47418 1.08271 1.48642
## 1.04446 1.29768 1.3021 1.05482 1.14217 1.0 1.62472 1.95352
## 2.33058 3.31387 2.10763 2.89811 3.54422 1.96037 2.25523 2.38368
## 2.72375 2.40033 4.51539 3.02925 2.11413 3.10112 2.75524 3.30779
## 5.69467 2.40753 2.69445 3.44133 2.88941 5.19829 5.25289 5.96401
## 3.41594 2.16012 2.45371 1.68664 … 2.33553 3.18343 2.7767 3.50077
## 3.27385 1.94124 2.74271 2.95093 2.22709 2.53315 3.33247 3.75057
## 2.48265 1.9014 1.91094 2.49271 2.52791 2.56159 1.75317 2.71092
## 4.79011 5.56578 5.81816 6.09114 5.63253 5.0469 5.31683 5.83699
## 3.70277 1.0 2.2289 4.71234 1.18129 2.42651 1.87922 3.17822
## 3.98534 5.00644 2.18134 1.06705 … 3.93722 4.19505 4.12861 5.59856
## 5.48206 4.51088 1.93763 4.11609 1.35193 3.33949 4.20722 5.70172
## 5.34622 1.22283 1.44952 1.21105 1.24058 4.50029 4.68516 4.66271
## 2.43241 1.25018 1.92001 3.34586 1.0 1.46758 2.37234 2.17008
Now back in R. Note again that we have to transpose the matrix. We lost the column names when we imported it into Julia, so we’re putting them back here. There are ways to keep the column names, but we can talk about that some other time. Right now, I’m just operating in base Julia so you can see how to move data around in the two environments.
dat_r1 <- dat_1 %>% t() %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
names(dat_r1) <- names(dat)
dat_r1
## # A tibble: 10,000 × 19
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4 CD8A CD16 CD235 CD20 CD66
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.89 2.07 1.99 1.38 2.23 1.10 1.04 2.33 2.72 5.69 3.42 3.27 2.48
## 2 3.90 5.36 1.44 1.08 1.20 1.69 1.30 3.31 2.40 2.41 2.16 1.94 1.90
## 3 2.45 4.48 1.18 1.49 2.53 1.33 1.30 2.11 4.52 2.69 2.45 2.74 1.91
## 4 1 1 1.50 5.09 3.69 1.69 1.05 2.90 3.03 3.44 1.69 2.95 2.49
## 5 1.41 1.83 2.05 5.41 4.62 2.42 1.16 1.05 1.95 2.31 2.79 1.83 2.47
## 6 1.48 1 1.11 5.05 4.28 2.26 1 1.99 2.43 2.32 2.16 2.02 2.34
## 7 3.47 4.93 1.23 1.71 2.23 1.62 1 3.83 2.33 2.79 2.90 3.18 2.53
## 8 1.32 4.36 1.03 1 1.79 1 1.38 1.65 1.77 4.17 2.22 2.68 2.22
## 9 1.84 1.11 2.04 2.85 3.17 1 3.30 2.69 3.53 5.90 4.01 3.68 2.72
## 10 3.86 4.65 1 1.63 1.32 1 1 1.33 2.42 2.24 2.33 1.73 1.98
## # ℹ 9,990 more rows
## # ℹ 6 more variables: CD45 <dbl>, CD11C <dbl>, CD45RA <dbl>, CD38 <dbl>,
## # CD56 <dbl>, HLADR <dbl>
C++ is a low-level language that is very fast. It is much older than the other languages here, but still very much in use. It is one of the first programming languages I learned, and I come back to it every once in a while, especially when I’m doing imaging work. R has an interesting interface to C++ called Rcpp. We’re going to use that here.
We’re going to write a function in C++ and then use it in R: it will print the first 5 rows and columns of our CyTOF data. The general point is that if you want to do something in C++, you write the function directly in C++, and then call it in R.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void printDataFrame(DataFrame df) {
    int numCols = df.size();
    int numRows = df.nrows();

    // Limit to first 5 columns
    numCols = numCols < 5 ? numCols : 5;

    for (int i = 0; i < numCols; i++) {
        Rcpp::Rcout << "Column " << i + 1 << ": ";
        NumericVector col = df[i];

        // Print the first 5 rows of the column
        for (int j = 0; j < numRows && j < 5; j++) {
            Rcpp::Rcout << col[j] << " ";
        }
        Rcpp::Rcout << "\n";
    }
}
# This is R code. We're back in R.
printDataFrame(dat)
## Column 1: 0.893362 2.89656 1.45404 0 0.414235
## Column 2: 1.06571 4.36265 3.48263 0 0.832575
## Column 3: 0.994633 0.436426 0.17636 0.503708 1.05011
## Column 4: 0.382397 0.0821447 0.490234 4.08857 4.41379
## Column 5: 1.23115 0.200332 1.52993 2.69069 3.61925
It sure looks like a lot of code just to print a data frame. What I can tell you is that my primary use case for C++ (via Rcpp) is when I’m dealing with imaging data. But the big point is that you can integrate C++ code into your R Markdown files rather than maintaining separate scripts.
SQL is a language you use to query databases. You can also use it to query CyTOF data. Now, we can use SQL directly within R Markdown, but that would be more useful if we had CyTOF data stored externally in a relational database. That’s for another time. But to at least show you the syntax, we can use the sqldf package.
So, let’s find some T cells. Note that this is similar to dplyr::filter(), but SQL can be used to make very complicated queries, so it’s good to know how to properly call it.
Below, the translation of the SQL is “select all the rows from the dat object where the column called CD3 has values greater than some number, which we’ll call x.” In dplyr terms, it’s equivalent to dplyr::filter(dat, CD3 > x). Pick your poison.
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
result <- sqldf("SELECT * FROM dat WHERE CD3 > 5.3")
print(result)
## CD27 CD3 CD123 CD33 CD14 CD61 CD19
## 1 3.1264681 5.370645 1.0021870 0.0000000 1.5759916 0.11341906 0.00000000
## 2 0.7103834 5.907235 1.2547949 0.3177335 1.8698172 0.04869975 0.00000000
## 3 2.9358062 5.308189 1.4780226 0.6108917 0.7769289 0.00000000 0.05806983
## 4 3.0865904 5.351840 0.8511912 0.1929881 2.2991696 0.00000000 0.00000000
## 5 3.4145370 5.346575 1.7870004 1.6294845 1.9240186 0.76869842 0.87898998
## 6 4.0330824 5.302419 1.2703423 0.0000000 1.6397655 0.51174505 0.00000000
## 7 3.3052528 5.359280 0.6599960 0.3097051 1.3083458 0.14727819 0.00000000
## CD4 CD8A CD16 CD235 CD20 CD66 CD45 CD11C
## 1 3.120505 1.974160 2.565491 2.008806 1.956707 1.245664 5.519997 1.0732701
## 2 1.775879 2.758985 3.391104 2.197680 2.198716 2.282003 6.094073 0.6973995
## 3 3.295261 4.242156 2.819916 2.289628 2.330265 1.858125 6.125014 2.8182823
## 4 2.869384 2.195726 2.665933 1.965704 2.264631 2.394309 5.685564 1.6070940
## 5 2.871258 4.278365 3.158603 2.988147 2.130051 1.747186 5.737861 1.6370463
## 6 3.395235 2.172157 2.884782 1.880756 2.533725 1.932959 5.753517 0.0000000
## 7 2.866402 1.797314 2.241456 1.695375 2.085744 1.241980 5.382069 0.0000000
## CD45RA CD38 CD56 HLADR
## 1 5.293691 2.857672 0.4629084 0.06160644
## 2 5.871059 1.526704 1.1450537 0.14262863
## 3 5.848043 3.578526 0.7595501 1.24007899
## 4 5.852604 3.842490 0.4825257 0.80452049
## 5 5.576879 4.148261 3.4024757 1.12120975
## 6 6.091201 3.417513 0.8275273 0.23711468
## 7 5.344108 1.254631 0.3267140 0.22950380
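To see where SQL pulls ahead of a simple filter, here is a sketch using Python’s built-in sqlite3 module on a hypothetical toy table, combining two marker thresholds in one WHERE clause. Since sqldf uses SQLite by default, the same SELECT string would work there too:

```python
import sqlite3

# Hypothetical toy table of transformed marker values
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dat (CD3 REAL, CD8A REAL, CD4 REAL)")
con.executemany(
    "INSERT INTO dat VALUES (?, ?, ?)",
    [(5.5, 4.2, 0.3), (5.4, 0.2, 3.1), (0.1, 0.0, 0.2)],
)

# CD8 T cells: high CD3 AND high CD8A
# dplyr equivalent: filter(dat, CD3 > 5.3, CD8A > 3)
rows = con.execute("SELECT * FROM dat WHERE CD3 > 5.3 AND CD8A > 3").fetchall()
print(rows)  # → [(5.5, 4.2, 0.3)]
```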
A programmer friend of mine recommended that I look into Rust, so here we are. Rust is a relatively new low-level programming language that emphasizes performance and memory safety. Thus, Rust is a nice alternative to consider for anything in R that you were otherwise going to offload to C++. Here is an article saying that Rust is going to become a thing in data science, so I’m keeping my eyes open.
Let’s see how we can process our CyTOF data using Rust, in the R Markdown.
We’re going to use this tutorial.
First, we install the package rextendr. Let’s make sure Rust actually works. Here’s a simple example from the tutorial. Note that we’re still using R, but we’re calling a Rust function from within R.
library(rextendr)
# create a Rust function
rextendr::rust_function("fn add(a:f64, b:f64) -> f64 { a + b }")
## ℹ build directory: '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32'
## ✔ Writing '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32/target/extendr_wrappers.R'
# call it from R
add(2.5, 4.7)
## [1] 7.2
And as another way of doing it, let’s use a direct Rust code block. You can’t see this in the rendered output, but I’m using {extendr} as the chunk engine name. Again, this block is directly from the tutorial.
rprintln!("Hello from Rust!");
let x = 5;
let y = 7;
let z = x*y;
z
## Hello from Rust!
## [1] 35
Ok, now let’s do something with our CyTOF data. Say I want to do something to each column. Here is a function rust_get_sum from the tutorial that takes the sum of either integers or doubles (essentially numbers with decimal points).
I tried to get the whole CyTOF matrix into Rust, but that turned out to be more trouble than it was worth, so the easier path is to just leverage Rust in the things you otherwise were going to do in R.
Anyway, here’s the function, and this is an R code block.
rust_function(
"fn rust_get_sum(x : Either<Integers, Doubles>) -> Either<Rint, Rfloat> {
match x {
Either::Left(x) => Either::Left(x.iter().sum()),
Either::Right(x) => Either::Right(x.iter().sum()),
}
}",
use_dev_extendr = TRUE, # Use development version of extendr from GitHub
features = "either", # Enable support for Either crate
extendr_fn_options = list(use_try_from = TRUE) # Enable advanced type conversion
)
## ℹ build directory: '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32'
## ✔ Writing '/private/var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T/Rtmp1enwYP/file45e46ab0dc32/target/extendr_wrappers.R'
So here, we get the sums of each column of our dataset.
dat_sums <- purrr::map(dat, rust_get_sum) %>% unlist()
dat_sums
## CD27 CD3 CD123 CD33 CD14 CD61 CD19 CD4
## 14423.692 26452.568 6687.240 7560.231 11253.725 3224.887 5598.934 16612.703
## CD8A CD16 CD235 CD20 CD66 CD45 CD11C CD45RA
## 23923.903 22982.456 18212.965 18450.185 14112.381 48206.362 10295.189 29021.748
## CD38 CD56 HLADR
## 21722.920 10319.074 11859.311
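For comparison, the same map-a-summary-over-columns pattern looks like this in Python with pandas (toy values, hypothetical frame):

```python
import pandas as pd

# Hypothetical two-marker frame
df = pd.DataFrame({"CD3": [1.0, 2.0, 3.0], "CD8A": [0.5, 0.5, 1.0]})

# Map a sum over each column, analogous to purrr::map(dat, sum)
col_sums = {col: float(df[col].sum()) for col in df.columns}
print(col_sums)  # → {'CD3': 6.0, 'CD8A': 2.0}
```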
So out of curiosity, given I haven’t worked with Rust before, let’s compare how long the Rust and base R versions take.
start_time <- Sys.time()
dat_sums <- purrr::map(dat, rust_get_sum) %>% unlist()
end_time <- Sys.time()
time_taken <- end_time - start_time
time_taken
## Time difference of 0.01321387 secs
start_time <- Sys.time()
dat_sums <- purrr::map(dat, sum)
end_time <- Sys.time()
time_taken <- end_time - start_time
time_taken
## Time difference of 0.001747847 secs
Ok, it looks like the base R implementation is faster here. That’s not too surprising: R’s sum() is a primitive implemented in C, and each call into the Rust function pays a small cross-language overhead. (For a fairer comparison, a benchmarking package like microbenchmark would average over many runs rather than relying on a single Sys.time() measurement.)
As I learn Rust, I’ll update the document as I find better use cases for using Rust for CyTOF and related data.
There are a few other things you can do within R Markdown, like run bash scripts. I tried getting R Markdown to run other languages like Fortran and Lisp, just for fun. What I found was that you can execute external scripts within R Markdown, but you can’t make code chunks specifically for those languages.
There are other literate programming environments, like Jupyter notebooks and Emacs Org Mode, that allow for a much more robust selection of languages. But that is for another time. In my experience, most of my CyTOF analysis is done in R, with the occasional Python, and increasingly Julia to speed up code. C++ has been used only for imaging datasets so far, but it’s nice to know that I have it in the toolkit. I have not used SQL yet in my CyTOF projects, but I have worked with large databases, and it’s nice to know that it’s possible to use those commands as needed. Rust is an interesting case that could become a thing for data science down the line, so I’m paying attention to it.