The R Rabbit Hole
I almost wish I hadn't gone down that rabbit-hole—and yet—and yet—it's rather curious, you know, this sort of life!
Lewis Carroll, Alice's Adventures in Wonderland
From R to S
What happens under the hood when we do simple arithmetic in R? Let's begin with a simple expression. Adding two numbers together.
5 + 2
[1] 7
Now what we are going to do is ask the simple question: what is actually happening under the hood here? In my Advanced R class that I took around 2016, I learned that R uses an old concept called S-expressions, or symbolic expressions. The way this works is you have a list whereby the first element is a function, followed by the function arguments.
We know of this in the form: function(arg1, arg2, …)
But here with "+" we can actually express this in the same format and we will get the same result. Directly in R. You just have to put back ticks around the operator. Try it yourself.
`+`(5, 2)
[1] 7
It turns out that R will internally parse 5 + 2 into the S-expression above. That is how 5 + 2 is actually represented behind the scenes in R.
S-expressions are a hallmark of an even older programming language called LISP (something Emacs users will still be famililar with today). If we want to see what S-expressions look like in this regard, consider the following LISP code below:
(+ 5 2)
7
You can see that it's an S-expression, like the one we made in R, but using parentheses. S-expressions are a completely different notation than we are used to, so they can be difficult to read at first. Let's say I wanted to do something like this:
2 * (5 + 2)
Which equals 14.
What that looks like as an S-expression is something like this, in LISP:
(* 2 (+ 5 2))
14
And lets say we wanted to divide the resulting answer by 7, which would give us 2. We have, in LISP:
(/ (* 2 (+ 5 2)) 7)
2
So you can imagine if you want to in turn take this product and do something else with it, and so forth, you will end up with parentheses wrapped in parentheses, and so forth. LISP programmers are used to it, and it is a bit of a running joke.
But here's the rub: R operates like this too. Let me show you. Below is R, not LISP. Again, try this yourself:
`/`(`*`(2, `+`(5, 2)), 7)
[1] 2
So now you can see objectively that R's arithmetic is using S-expressions, the way you would otherwise do in LISP.
From S to C
Ok, this is all well and good, but how do I know that R specifically has S-expressions in mind? In order to answer that, we can download the source code of the R programming language and have a look at it. We can see if any S-expressions are explicitly declared here. Below is the source code from R 4.4.0.
The directory looks like this:
drwxr-xr-x 27 tylerburns staff 864 Apr 24 2024 doc drwxr-xr-x 8 tylerburns staff 256 Apr 24 2024 etc drwxr-xr-x 19 tylerburns staff 608 Apr 24 2024 m4 drwxr-xr-x 7 tylerburns staff 224 Mar 27 2024 po drwxr-xr-x 12 tylerburns staff 384 Apr 24 2024 share drwxr-xr-x 13 tylerburns staff 416 Apr 24 2024 src drwxr-xr-x 129 tylerburns staff 4.1K Apr 24 2024 tests drwxr-xr-x 24 tylerburns staff 768 Apr 24 2024 tools -rw-r--r-- 1 tylerburns staff 18K Sep 25 2018 COPYING -rw-r--r-- 1 tylerburns staff 210 Sep 25 2018 ChangeLog -rw-r--r-- 1 tylerburns staff 1.8K Sep 25 2018 INSTALL -rw-r--r-- 1 tylerburns staff 5.4K Mar 24 2023 Makeconf.in -rw-r--r-- 1 tylerburns staff 7.1K Mar 25 2022 Makefile.fw -rw-r--r-- 1 tylerburns staff 11K Mar 27 2024 Makefile.in -rw-r--r-- 1 tylerburns staff 4.1K Sep 25 2018 README -rw-r--r-- 1 tylerburns staff 46 Apr 24 2024 SVN-REVISION -rw-r--r-- 1 tylerburns staff 6 Apr 24 2024 VERSION -rw-r--r-- 1 tylerburns staff 10 Apr 10 2024 VERSION-NICK -rw-r--r-- 1 tylerburns staff 19K Mar 27 2024 config.site -rwxr-xr-x 1 tylerburns staff 1.7M Apr 24 2024 configure -rw-r--r-- 1 tylerburns staff 104K Mar 27 2024 configure.ac -rw-r--r-- 1 tylerburns staff 356 Mar 24 2023 configure.patch
From here, you're going to go into src. And then main. The first observation is that a whole lot of this is written in C. Here is a slice of the "main" directory:
-rw-r--r-- 1 tylerburns staff 8.0K Mar 27 2024 CommandLineArgs.c -rw-r--r-- 1 tylerburns staff 8.4K Mar 27 2024 Makefile.in -rw-r--r-- 1 tylerburns staff 2.8K Mar 27 2024 Makefile.win -rw-r--r-- 1 tylerburns staff 1.3K Sep 25 2018 RBufferUtils.h -rw-r--r-- 1 tylerburns staff 27K Mar 27 2024 RNG.c -rw-r--r-- 1 tylerburns staff 1.9K Mar 29 2019 Rcomplex.h -rw-r--r-- 1 tylerburns staff 51K Mar 27 2024 Rdynload.c -rw-r--r-- 1 tylerburns staff 13K Mar 27 2024 Renviron.c -rw-r--r-- 1 tylerburns staff 1.3K Sep 25 2018 Rmain.c -rw-r--r-- 1 tylerburns staff 34K Mar 24 2023 Rstrptime.h -rw-r--r-- 1 tylerburns staff 25K Mar 27 2024 agrep.c -rw-r--r-- 1 tylerburns staff 14K Sep 25 2018 alloca.c -rw-r--r-- 1 tylerburns staff 59K Mar 27 2024 altclasses.c -rw-r--r-- 1 tylerburns staff 34K Mar 24 2023 altrep.c -rw-r--r-- 1 tylerburns staff 14K Mar 27 2024 apply.c -rw-r--r-- 1 tylerburns staff 65K Mar 27 2024 arithmetic.c -rw-r--r-- 1 tylerburns staff 3.5K Mar 25 2022 arithmetic.h -rw-r--r-- 1 tylerburns staff 64K Mar 27 2024 array.c -rw-r--r-- 1 tylerburns staff 58K Mar 27 2024 attrib.c -rw-r--r-- 1 tylerburns staff 1.1K Sep 25 2018 basedecl.h
Notice the .c and .h files. These are both associated with the C programming language. We note here that C is a lower level language. It is much more verbose than what you see in R (or python for that matter), and you have to worry about a lot of things that get swept under the rug in R (like memory management) but it runs much faster.
If we then go into the file arithmetic.c, we find the place the addition operator is defined. The code below is C:
attribute_hidden SEXP do_arith(SEXP call, SEXP op, SEXP args, SEXP env)
Where this function takes care of many of the arithmetic operations. The SEXP means S-expression. The "op" is the operator of interest, which is +, -, *, or /.
Within this function, we can find a switch statement (think of this as an if statement across multiple cases), where the operators and actions are defined [1]. Again, the code below is C:
switch (PRIMVAL(op)) { case PLUSOP: SET_SCALAR_DVAL(ans, x1 + x2); return ans; case MINUSOP: SET_SCALAR_DVAL(ans, x1 - x2); return ans; case TIMESOP: SET_SCALAR_DVAL(ans, x1 * x2); return ans; case DIVOP: SET_SCALAR_DVAL(ans, x1 / x2); return ans; }
Where you can see that "+" is PLUSOP, where x1 + x2 is defined.
From C to Assembly
So we started with 5 + 2 and now we are looking at an obscure switch statement in C. But if it is written in C, that means it is indirectly written in something deeper: machine code.
C code compiles down to what is known as machine code, which is the zeros and ones that allow the program to execute. There is a human-readable representation of machine code called Assembly that allows one to actually read and understand what is going on, with proper training.
In order for us to understand what is going on there, let's go ahead and return to 5 + 2, but write it in C. I note that below I am using a literate programming environment to run C, and the standard way to do it is to make a c script (eg. addtwonumbers.c) and then compile it using a tool like gcc or clang, depending on what kind of computer you have. But anyway, here is the C code and its execution:
#include <stdio.h> int main() { printf("%d", 5 + 2); return 0; }
7
We did a simple arithmetic operation. Add two numbers together and print it. "int main" is similar to "def main" in python. "printf" is the C way of printing, where you declare what data type you are printing ("%d" is going to be an integer), and then what you're printing. Here, C adds 5 and 2 and then prints it.
Then it returns 0, which is a way of explicitly saying "exit the program." Exiting with 0 indicates that the program ran successfully, by convention. R does this too under the hood, and this becomes important if you are running R from the command line. The function quit(), which you can read about in the documentation here, allows you to explicitly exit in R, also retruning an exit code with the same convention of 0 meaning no errors.
Anyway, we have looked at C, but what is actually happening in Assembly?
To answer that question, we will go ahead and print out the Assembly code associated with the function, and have a look at it. It is going to be really complicated looking but we will just look at a few pieces of it so you can have a feel for what is going on here.
We do this in the literate programming environment we have here by running the following shell script, which takes the code we wrote above and prints out the assembly.
echo '#include <stdio.h> int main() { printf("%d", 5 + 2); return 0; }' > temp.c gcc -S -o temp.s -fno-asynchronous-unwind-tables -fno-verbose-asm -O2 temp.c cat temp.s
Which we then put into another code block to make it a bit less ugly. This is what Assembly looks like. This is the kind of thing being swept under the rug when you do 5 + 2 and print it out. This is going to look messy and you don't have to fully understand it. I just want you to see what it looks like. Note also that the code below is ARM Assembly, because this was made on a MacBook Pro with a M1 architecture. Those who have another architecture, like x86, will have Assembly that looks similar in concept, but has slightly different syntax.
Anyway, here's the Assembly from the C code above:
.section __TEXT,__text,regular,pure_instructions .build_version macos, 14, 0 sdk_version 14, 2 .globl _main .p2align 2 _main: .cfi_startproc sub sp, sp, #32 .cfi_def_cfa_offset 32 stp x29, x30, [sp, #16] add x29, sp, #16 .cfi_def_cfa w29, 16 .cfi_offset w30, -8 .cfi_offset w29, -16 mov w8, #7 str x8, [sp] Lloh0: adrp x0, l_.str@PAGE Lloh1: add x0, x0, l_.str@PAGEOFF bl _printf mov w0, #0 ldp x29, x30, [sp, #16] add sp, sp, #32 ret .loh AdrpAdd Lloh0, Lloh1 .cfi_endproc .section __TEXT,__cstring,cstring_literals l_.str: .asciz "%d" .subsections_via_symbols
Yes, this is kindof a lot. I get it. Again, you don't have to fully understand it. I just want to zoom in on a few things. The first is that it's basically a recipe. There is a sort of step by step execution taking place. A command followed by some symbols. Similar to the format of the S-expression from earlier:
(function arg1 arg2 …)
Now let's find 5 + 2. We note that it is pre-computed by the compiler as an optimization, so the remaining assembly instructions start with the number 7 (I actually didn't expect this…call it learn by doing). Many modern compilers do this, and it's known as constant folding.
There are other instances (beyond the scope of this document) where a function called "add" is called directly within Assembly. Below is what happens with the number 7 in our Assembly code:
mov w8, #7 ; Equivalent to x <- 5 + 2 in R (precomputed by the compiler) str x8, [sp] ; Store value (7) on the stack
As a quick primer, ARM64 (my computer architecture) has 64-bit memory registers x0 to x30. We can refer to them directly, but if we only want to refer to the lower 32 bits of the registers, we can use w0 to w30 to refer to each of the registers, where the numbers denote the same physical register. So x8 and w8 both are referring to the same physical register. At the time of writing
, I have not done enough Assembly to tell you when you would refer to "w" versus "x" for a given physical register.Anyway in this chunk, you have "mov" which is a function that moves the second argument, 7, into a memory register called w8. From there, the value gets stored on the stack, referring to x8, which is valid as we have explained above.
The stack is a region in the computer's memory where local variables can be stored. It is analogous to the "local environment" in R/Python.
From here, the main task is to print it out. Where does that take place? Here:
bl _printf ; Call `printf` (similar to calling `print` in R)
And then there is that "return 0" from C that we had seen before. Again, this is an interesting concept whereby you have to explicitly tell the program to stop running, and indicate whether the program ran successfully. A return of 0 indicates that it ran successfully. This is something we don't worry about in R, but it is something that happens under the hood. Below is what is actually happening in Assembly:
mov w0, #0 ; Set return value to 0 ldp x29, x30, [sp] ; Restore the frame pointer and return address add sp, sp, #32 ; Clean up the stack ret ; Return to the caller (like `return()`
Notice the "mov" function again, where here we are moving the number 0 into another memory register called w0. Again, the exit code of zero is a convention that typically means "no errors."
Ok, wrapping up this section, we can see a few important things. The first is that the lowest level of code, well underneath R, is all a big recipe, where instructions are computed line by line, as explicit as possible, all the way to telling the computer to exit the program and that there was no errors.
Another important thing is that within this recipe, while there are a number of commands, these generally boil down to simple actions, like moving things into registers, running functions like printf, managing memory (the stack in our example), and returning with an exit code.
Conclusion
Ok, let's go back to familiarity now.
5 + 2
[1] 7
And you can now see just how much is swept under the rug. We started with 5 + 2. We then showed that we could make the equivalent statement as an S-expression and have it successfully run in R. This is because R will parse 5 + 7 into an S-expression internally. We then showed that R's built-in arithmetic functions are written in C. We then looked at the C code behind the S-expression that defines arithmetic operations in R. And from there, we looked at the Assembly code that underlies C.
To summarize as a list:
- R code (5 + 2) -> internally becomes an S-expression (`+`(5,2)).
- Arithmetic in R -> implemented in C code.
- C code -> compiled to machine code, which we can view as Assembly.
- Assembly -> direct instructions for the CPU: move data into registers, call functions, manage the stack, exit.
What is the point?
What has helped me solve the truly difficult problems, where no one has written the book on the thing, is understanding first principles. This means understanding the concepts that underlie whatever I'm doing. This is similar to how as a biologist, you have to learn chemistry and physics as part of the curriculum.
Similar to here, if you started in R and Python, you should have familiarity with the languages that are lower level, like C (even if it's literally at the level of very basic operations like printing something or writing a simple function). And you should at least know what Assembly looks like and what it generally does.
Be familiar with the full stack. Whatever it is. Go down the rabbit hole, as far as you can. And you will be a better problem solver, able to see the high-level, and able to reason from first principles.
And of course, we can be grateful for the sheer volume of things we don't have to worry about now because of the coders before us, who innovated so we didn't have to worry about analyzing single-cell sequencing data in Assembly.
We truly stand on the shoulders of giants before us.
[1] We note that R has a switch as well, which you can read about in the documentation here. Note that in R, switch is a function, and not a statement.