The R Rabbit Hole


I almost wish I hadn't gone down that rabbit-hole—and yet—and yet—it's rather curious, you know, this sort of life!

Lewis Carroll, Alice's Adventures in Wonderland


From R to S

What happens under the hood when we do simple arithmetic in R? Let's begin with a simple expression: adding two numbers together.

5 + 2
[1] 7

The question is simple: what is actually happening under the hood here? In the Advanced R class I took around 2016, I learned that R uses an old concept called S-expressions, or symbolic expressions. An S-expression is a list whose first element is a function and whose remaining elements are that function's arguments.

We know of this in the form: function(arg1, arg2, …)

We can write "+" in the same format, directly in R, and get the same result. You just have to put backticks around the operator. Try it yourself.

`+`(5, 2)
[1] 7

It turns out that R will internally parse 5 + 2 into the S-expression above. That is how 5 + 2 is actually represented behind the scenes in R.

S-expressions are a hallmark of an even older programming language called LISP (something Emacs users will still be familiar with today). To see what S-expressions look like there, consider the following LISP code:

(+ 5 2)
7

You can see that it's an S-expression, like the one we made in R, but using parentheses. S-expressions are a completely different notation than we are used to, so they can be difficult to read at first. Let's say I wanted to do something like this:

2 * (5 + 2)

Which equals 14.

What that looks like as an S-expression is something like this, in LISP:

(* 2 (+ 5 2))
14

And let's say we wanted to divide the resulting answer by 7, which would give us 2. We have, in LISP:

(/ (* 2 (+ 5 2)) 7)
2

So you can imagine that if you take this result and do something else with it, and then something else again, you end up with parentheses wrapped in parentheses. LISP programmers are used to it, and it is a bit of a running joke.

But here's the rub: R operates like this too. Let me show you. Below is R, not LISP. Again, try this yourself:

`/`(`*`(2, `+`(5, 2)), 7)
[1] 2

So now you can see for yourself that R's arithmetic can be written as S-expressions, the same way you would write it in LISP.

From S to C

Ok, this is all well and good, but how do I know that R specifically has S-expressions in mind? In order to answer that, we can download the source code of the R programming language and have a look at it. We can see if any S-expressions are explicitly declared here. Below is the source code from R 4.4.0.

The directory looks like this:

drwxr-xr-x  27 tylerburns staff  864 Apr 24  2024 doc
drwxr-xr-x   8 tylerburns staff  256 Apr 24  2024 etc
drwxr-xr-x  19 tylerburns staff  608 Apr 24  2024 m4
drwxr-xr-x   7 tylerburns staff  224 Mar 27  2024 po
drwxr-xr-x  12 tylerburns staff  384 Apr 24  2024 share
drwxr-xr-x  13 tylerburns staff  416 Apr 24  2024 src
drwxr-xr-x 129 tylerburns staff 4.1K Apr 24  2024 tests
drwxr-xr-x  24 tylerburns staff  768 Apr 24  2024 tools
-rw-r--r--   1 tylerburns staff  18K Sep 25  2018 COPYING
-rw-r--r--   1 tylerburns staff  210 Sep 25  2018 ChangeLog
-rw-r--r--   1 tylerburns staff 1.8K Sep 25  2018 INSTALL
-rw-r--r--   1 tylerburns staff 5.4K Mar 24  2023 Makeconf.in
-rw-r--r--   1 tylerburns staff 7.1K Mar 25  2022 Makefile.fw
-rw-r--r--   1 tylerburns staff  11K Mar 27  2024 Makefile.in
-rw-r--r--   1 tylerburns staff 4.1K Sep 25  2018 README
-rw-r--r--   1 tylerburns staff   46 Apr 24  2024 SVN-REVISION
-rw-r--r--   1 tylerburns staff    6 Apr 24  2024 VERSION
-rw-r--r--   1 tylerburns staff   10 Apr 10  2024 VERSION-NICK
-rw-r--r--   1 tylerburns staff  19K Mar 27  2024 config.site
-rwxr-xr-x   1 tylerburns staff 1.7M Apr 24  2024 configure
-rw-r--r--   1 tylerburns staff 104K Mar 27  2024 configure.ac
-rw-r--r--   1 tylerburns staff  356 Mar 24  2023 configure.patch

From here, go into src, and then main. The first observation is that a whole lot of this is written in C. Here is a slice of the "main" directory:

-rw-r--r--   1 tylerburns staff 8.0K Mar 27  2024 CommandLineArgs.c
-rw-r--r--   1 tylerburns staff 8.4K Mar 27  2024 Makefile.in
-rw-r--r--   1 tylerburns staff 2.8K Mar 27  2024 Makefile.win
-rw-r--r--   1 tylerburns staff 1.3K Sep 25  2018 RBufferUtils.h
-rw-r--r--   1 tylerburns staff  27K Mar 27  2024 RNG.c
-rw-r--r--   1 tylerburns staff 1.9K Mar 29  2019 Rcomplex.h
-rw-r--r--   1 tylerburns staff  51K Mar 27  2024 Rdynload.c
-rw-r--r--   1 tylerburns staff  13K Mar 27  2024 Renviron.c
-rw-r--r--   1 tylerburns staff 1.3K Sep 25  2018 Rmain.c
-rw-r--r--   1 tylerburns staff  34K Mar 24  2023 Rstrptime.h
-rw-r--r--   1 tylerburns staff  25K Mar 27  2024 agrep.c
-rw-r--r--   1 tylerburns staff  14K Sep 25  2018 alloca.c
-rw-r--r--   1 tylerburns staff  59K Mar 27  2024 altclasses.c
-rw-r--r--   1 tylerburns staff  34K Mar 24  2023 altrep.c
-rw-r--r--   1 tylerburns staff  14K Mar 27  2024 apply.c
-rw-r--r--   1 tylerburns staff  65K Mar 27  2024 arithmetic.c
-rw-r--r--   1 tylerburns staff 3.5K Mar 25  2022 arithmetic.h
-rw-r--r--   1 tylerburns staff  64K Mar 27  2024 array.c
-rw-r--r--   1 tylerburns staff  58K Mar 27  2024 attrib.c
-rw-r--r--   1 tylerburns staff 1.1K Sep 25  2018 basedecl.h

Notice the .c and .h files. Both are associated with the C programming language. C is a lower-level language than R: it is much more verbose than R (or Python, for that matter), and you have to worry about a lot of things that get swept under the rug in R (like memory management), but compiled C code typically runs much faster.

If we then go into the file arithmetic.c, we find the place where the addition operator is defined. The code below is C:

attribute_hidden SEXP do_arith(SEXP call, SEXP op, SEXP args, SEXP env)

This function takes care of many of the arithmetic operations. SEXP stands for S-expression. The "op" argument identifies the operator of interest, such as +, -, *, or /.

Within this function, we can find a switch statement (think of this as an if statement across multiple cases), where the operators and actions are defined [1]. Again, the code below is C:

switch (PRIMVAL(op)) {
        case PLUSOP: SET_SCALAR_DVAL(ans, x1 + x2); return ans;
        case MINUSOP: SET_SCALAR_DVAL(ans, x1 - x2); return ans;
        case TIMESOP: SET_SCALAR_DVAL(ans, x1 * x2); return ans;
        case DIVOP: SET_SCALAR_DVAL(ans, x1 / x2); return ans;
}

Here you can see that "+" corresponds to PLUSOP, the case where x1 + x2 is computed.

From C to Assembly

So we started with 5 + 2 and now we are looking at an obscure switch statement in C. But if it is written in C, that means it is indirectly written in something deeper: machine code.

C code compiles down to what is known as machine code, which is the zeros and ones that allow the program to execute. There is a human-readable representation of machine code called Assembly that allows one to actually read and understand what is going on, with proper training.

To understand what is going on there, let's return to 5 + 2, but write it in C. Note that below I am using a literate programming environment to run C. The standard way is to write a C source file (e.g. addtwonumbers.c) and then compile it using a tool like gcc or clang, depending on what kind of computer you have. Here is the C code and its execution:

#include <stdio.h>

int main() {
    printf("%d", 5 + 2);
    return 0;
}
7

We did a simple arithmetic operation: add two numbers together and print the result. "int main" is similar to "def main" in Python. "printf" is the C way of printing, where you declare what data type you are printing ("%d" means an integer), and then what you're printing. Here, C adds 5 and 2 and then prints the sum.

Then it returns 0, which is a way of explicitly saying "exit the program." By convention, exiting with 0 indicates that the program ran successfully. R does this too under the hood, and this becomes important if you are running R from the command line. The function quit(), which you can read about in its documentation (?quit), allows you to explicitly exit in R, also returning an exit code with the same convention of 0 meaning no errors.

Anyway, we have looked at C, but what is actually happening in Assembly?

To answer that question, we will go ahead and print out the Assembly code associated with the function, and have a look at it. It is going to be really complicated looking but we will just look at a few pieces of it so you can have a feel for what is going on here.

We do this in the literate programming environment we have here by running the following shell script, which takes the code we wrote above and prints out the assembly.

echo '#include <stdio.h>
int main() {
    printf("%d", 5 + 2);
    return 0;
}' > temp.c
gcc -S -o temp.s -fno-asynchronous-unwind-tables -fno-verbose-asm -O2 temp.c
cat temp.s

We then put the output into another code block to make it a bit less ugly. This is what Assembly looks like, and it is the kind of thing being swept under the rug when you compute 5 + 2 and print it. It is going to look messy, and you don't have to fully understand it. I just want you to see what it looks like. Note also that the code below is ARM Assembly, because it was generated on a MacBook Pro with an M1 chip. On another architecture, like x86, the Assembly will be similar in concept but will use a different instruction set and syntax.

Anyway, here's the Assembly from the C code above:

        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0     sdk_version 14, 2
        .globl  _main
        .p2align        2
_main:
        .cfi_startproc
        sub     sp, sp, #32
        .cfi_def_cfa_offset 32
        stp     x29, x30, [sp, #16]
        add     x29, sp, #16
        .cfi_def_cfa w29, 16
        .cfi_offset w30, -8
        .cfi_offset w29, -16
        mov     w8, #7
        str     x8, [sp]
Lloh0:
        adrp    x0, l_.str@PAGE
Lloh1:
        add     x0, x0, l_.str@PAGEOFF
        bl      _printf
        mov     w0, #0
        ldp     x29, x30, [sp, #16]
        add     sp, sp, #32
        ret
        .loh AdrpAdd    Lloh0, Lloh1
        .cfi_endproc

        .section        __TEXT,__cstring,cstring_literals
l_.str:
        .asciz  "%d"

.subsections_via_symbols

Yes, this is kind of a lot. I get it. Again, you don't have to fully understand it. I just want to zoom in on a few things. The first is that it's basically a recipe: a step-by-step execution, where each line is an instruction followed by its operands. This is similar to the format of the S-expression from earlier:

(function arg1 arg2 …)

Now let's find 5 + 2. It turns out the compiler pre-computes it as an optimization, so the Assembly never performs the addition and simply starts from the number 7 (I actually didn't expect this…call it learning by doing). Many modern compilers do this, and it's known as constant folding.

There are other instances (beyond the scope of this document) where the addition itself is carried out at run time by an instruction called "add" in the Assembly. Below is what happens with the number 7 in our Assembly code:

mov w8, #7             ; Equivalent to x <- 5 + 2 in R (precomputed by the compiler)
str x8, [sp]           ; Store value (7) on the stack

As a quick primer, ARM64 (my computer architecture) has 64-bit general-purpose registers x0 to x30. We can refer to them directly, but if we only want the lower 32 bits of a register, we can use the names w0 to w30, where the number denotes the same physical register. So x8 and w8 both refer to the same physical register. At the time of writing [2025-01-19 Sun], I have not done enough Assembly to tell you when you would refer to "w" versus "x" for a given physical register.

Anyway, in the two-line Assembly chunk above, you have "mov", which is an instruction that moves its second operand, 7, into a register called w8. From there, the value gets stored on the stack using the name x8, which is valid because, as explained above, w8 and x8 are the same physical register.

The stack is a region in the computer's memory where local variables can be stored. It is analogous to the "local environment" in R/Python.

From here, the main task is to print the 7. Where does that take place? Here:

bl _printf ; Call `printf` (similar to calling `print` in R)

And then there is that "return 0" from C that we had seen before. Again, this is an interesting concept whereby you have to explicitly tell the program to stop running, and indicate whether the program ran successfully. A return of 0 indicates that it ran successfully. This is something we don't worry about in R, but it is something that happens under the hood. Below is what is actually happening in Assembly:

mov w0, #0               ; Set return value to 0
ldp x29, x30, [sp, #16]  ; Restore the frame pointer and return address
add sp, sp, #32          ; Clean up the stack
ret                      ; Return to the caller (like `return()`)

Notice the "mov" instruction again, where here we are moving the number 0 into another register called w0. Again, the exit code of zero is a convention that typically means "no errors."

Ok, wrapping up this section, we can see a few important things. The first is that the lowest level of code, well underneath R, is all a big recipe, where instructions are executed one by one, as explicitly as possible, all the way to telling the computer to exit the program and that there were no errors.

Another important thing is that within this recipe, while there are a number of commands, these generally boil down to simple actions, like moving things into registers, calling functions like printf, managing memory (the stack in our example), and returning with an exit code.

Conclusion

Ok, let's go back to familiarity now.

5 + 2
[1] 7

And you can now see just how much is swept under the rug. We started with 5 + 2. We then showed that we could make the equivalent statement as an S-expression and have it successfully run in R. This is because R will parse 5 + 2 into an S-expression internally. We then showed that R's built-in arithmetic functions are written in C, and looked at the C code that defines those arithmetic operations. And from there, we looked at the Assembly code that underlies C.

To summarize as a list:

  1. R code (5 + 2) -> internally becomes an S-expression (`+`(5,2)).
  2. Arithmetic in R -> implemented in C code.
  3. C code -> compiled to machine code, which we can view as Assembly.
  4. Assembly -> direct instructions for the CPU: move data into registers, call functions, manage the stack, exit.

What is the point?

What has helped me solve the truly difficult problems, where no one has written the book on the thing, is understanding first principles. This means understanding the concepts that underlie whatever I'm doing. This is similar to how as a biologist, you have to learn chemistry and physics as part of the curriculum.

Similarly, if you started in R or Python, you should have some familiarity with lower-level languages like C (even if only at the level of very basic operations like printing something or writing a simple function). And you should at least know what Assembly looks like and what it generally does.

Be familiar with the full stack. Whatever it is. Go down the rabbit hole, as far as you can. And you will be a better problem solver, able to see the high-level, and able to reason from first principles.

And of course, we can be grateful for the sheer volume of things we don't have to worry about now because of the coders before us, who innovated so we didn't have to worry about analyzing single-cell sequencing data in Assembly.

We truly stand on the shoulders of giants before us.


[1] Note that R has a switch as well, which you can read about in its documentation (?switch). In R, switch is a function, not a statement.

Date: January 19, 2025 - January 20, 2025