Introduction


“Whenever you’re called on to make up your mind,
and you’re hampered by not having any,
the best way to solve the dilemma, you’ll find,
is simply by spinning a penny.

No - not so that chance shall decide the affair
while you’re passively standing there moping;
but the moment the penny is up in the air,
you suddenly know what you’re hoping.”

Piet Hein, Grooks


One way I developed intuition around basic probability and statistics quickly was by modeling simple random processes on a personal computer. In this markdown, I will start by simulating coin tosses, something I did in grad school when I was learning bioinformatics as a pure wet lab biologist. This allows us to ask questions about the behavior of millions of coin tosses, which would otherwise take a very long time to do by hand (if we flip one coin every 10 seconds and record the results, a million tosses takes about 3.8 months). By flipping coins in R, I independently ran headfirst into the law of large numbers, the central limit theorem, the concept of Bernoulli processes and Bernoulli trials, and the binomial distribution with its basic characteristics. This might seem overwhelming to the reader, but the act of discovering these concepts independently radically changes your relationship with them and helps you retain and truly understand them. It was this computational intuition that slowly allowed me to understand the math behind the processes I was modeling.
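
As a quick sanity check on that back-of-the-envelope number (a minimal sketch, assuming an average month of roughly 30.4 days), the arithmetic in R looks like this:

# One flip every 10 seconds, one million flips
total_seconds <- 1e6 * 10
total_days <- total_seconds / (60 * 60 * 24)  # roughly 115.7 days
total_days / 30.44                            # roughly 3.8 months, with the ~30.4-day month assumption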

Coin toss simulation


“Platon Karataev knew nothing by heart except his prayers. When he began to speak he seemed not to know how he would conclude. Sometimes Pierre, struck by the meaning of his words, would ask him to repeat them, but Platon could never recall what he had said a moment before, just as he never could repeat to Pierre the words of his favorite song.”

Leo Tolstoy, War and Peace


We will start by having the computer flip a coin. While you could easily just get out a coin and start flipping it, doing it in code allows for millions of coin tosses and subsequent analysis of the results. Let’s begin by flipping a coin a few times. The output is going to be 0 or 1, where 0 is tails and 1 is heads (or the other way around). The coin will be a fair coin.

Discovering the law of large numbers

What I’m telling R to do below: take a bag that has 0 and 1 in it. Pull one item out of the bag, record it, and put it back. Repeat the process 20 times. After this, we sum up the twenty 0s and 1s, which gives us the number of heads, and divide by the number of flips to get the fraction of heads, which should be around (though not necessarily exactly) 50%.

library(tidyverse)
set.seed(1)

# In case you haven't seen this before
c(0, 1)
## [1] 0 1
# Sample from 0 and 1, twenty times, with replacement.
flips <- sample(x = c(0, 1), size = 20, replace = TRUE)
flips
##  [1] 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 1 1 0
sum(flips)/length(flips)
## [1] 0.4

As we can see above, the fraction of heads is not exactly 50%. This is a random process, so we don’t expect it to be perfect. Let’s really utilize the power of R and see what happens when we flip the coin more times. Does the fraction of heads get closer to 50%? To do this, we’re going to modify the code to add a loop, using the lapply function, which is very similar to a basic for loop. We’re going to flip a coin various numbers of times, between 10 and 1 million, and hypothesize that the fraction of heads gets closer to 50% as the number of flips goes up. Even if this seems trivially intuitive, what might not be intuitive is how quickly, and by how much, the result closes in on 50% as a function of the number of tosses.

set.seed(1)

# Numbers of flips to try, from 10 up to 1 million.
flip_number <- c(10, 100, 1000, 10000, 100000, 1000000) 

num_heads <- lapply(flip_number, function(i) {
    result <- sample(x = c(0, 1), size = i, replace = TRUE)
    result <- sum(result)/length(result)
    return(result)
}) %>% unlist()
    
# Fraction of heads for each number of flips
num_heads
## [1] 0.400000 0.510000 0.504000 0.499000 0.499670 0.499543
# Absolute value (distance from zero) of the fraction of heads minus 0.5. 
# In other words, how close to one half was the fraction of heads? 
plot(abs(num_heads - 0.5))

We can see that as we go up in the number of flips, the fraction of heads gets closer and closer to 50%. This is the aforementioned Law of Large Numbers. If we consider a coin flip to be a “trial” of a random process, it states that the average value of the trials gets closer to the expected value (the average you would compute directly from the probabilities) as the number of trials increases. But we can also see that the march to 50% is not linear. It takes on an L-shape.
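
Stated a little more formally (this is just one standard way of writing it): if \(X_1, X_2, \ldots, X_n\) are the individual flips, each 0 or 1, then the fraction of heads

\[ \frac{1}{n}\sum_{i=1}^{n} X_i \longrightarrow E[X] = \frac{1}{2} \] as \(n\) grows, which is exactly the convergence toward 50% that the plot above shows.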

Discovering the binomial distribution and the central limit theorem

If we want to be precise about how the law of large numbers drives the coin tosses to 50%, particularly about the shape of the curve above, we need to account for the natural variation in the results of n coin tosses. Ten tosses might give 5 heads one time, 4 heads the next, and 7 heads the time after that. To make the curve more exact, we first have to characterize this variation.

So we have to ask the following question: if I toss a coin 10 times, record the number of heads, toss it another 10 times, record the number of heads again, and repeat this over and over until I have a nice histogram of the results, what exactly will that histogram look like, and how will it change if I do the same thing for sets of 100 tosses, or of a million tosses? Below, we ask this question and visualize the histogram for the “sets of 100 tosses” case.

n_sessions <- 100000

seq(n_sessions)[1:10] # in case you haven't seen this command before
##  [1]  1  2  3  4  5  6  7  8  9 10
num_heads <- lapply(seq(n_sessions), function(i) {
    result <- sample(x = c(0, 1), size = 100, replace = TRUE)
    result <- sum(result)/length(result)
    return(result)
}) %>% unlist()

# Example of output
num_heads[1:5]
## [1] 0.51 0.50 0.48 0.48 0.47
# Histogram
hist(num_heads, breaks = 50, xlim = c(0, 1))

# Basic statistics
mean(num_heads)
## [1] 0.4999155
sd(num_heads)
## [1] 0.04991654

It forms a bell curve. You have probably seen this shape before. It is a consequence of what is called the Central Limit Theorem, which states that if you take a large number of independent random samples and compute the mean of each sample, those sample means will be approximately normally distributed (they form a bell curve like the one above). This is fundamental to understanding statistics, and we arrived at it with simple coin tosses. A large amount of statistics involves classifying the families of distributions that particular types of data or processes fall into. That goes well beyond the scope of this post, but hopefully you now have some intuition around the process of “discovering” and characterizing a distribution from scratch.
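
To see that this has little to do with coins specifically, here is a small illustrative sketch (not part of the original coin simulation) that applies the same “take many samples, record each sample mean” recipe to rolls of a fair six-sided die, whose individual outcomes are certainly not bell-shaped:

set.seed(1)

# 100,000 samples of 100 die rolls each; record the mean of each sample
die_means <- lapply(seq(100000), function(i) {
    mean(sample(x = 1:6, size = 100, replace = TRUE))
}) %>% unlist()

# The individual rolls are uniform on 1 through 6, but the sample
# means pile up in a bell curve around the expected value of 3.5
hist(die_means, breaks = 50)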

Coming back to the coin tosses, what information can we obtain from their distribution? The mean is 50% heads (as expected), with a standard deviation of 5%. The exact name for this “coin toss” distribution is the binomial distribution mentioned in the introduction; it and its variations are widely used in bioinformatics. The generalized version of a single coin toss is called a Bernoulli trial, and a set of coin tosses (or Bernoulli trials), such as our sets of 100, is called a Bernoulli process. For the remainder of this markdown, I will continue to use the term “coin toss,” but I will call sets of coin tosses Bernoulli processes.
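
As an aside, R ships with the binomial distribution built in, so the counting-heads part of a Bernoulli process doesn’t always have to be simulated by hand with sample() as we did above; the snippet below is just a sketch of that shortcut:

set.seed(1)

# Each draw from rbinom() summarizes a whole Bernoulli process at once:
# the number of heads in 100 tosses of a fair coin
# (size = number of trials per process, prob = probability of heads)
rbinom(n = 5, size = 100, prob = 0.5)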

The binomial distribution as combinatorics: computational and mathematical approach


“Most people, mathematicians and others, will agree that mathematics is not an empirical science, or at least that it is practiced in a manner which differs in several decisive respects from the techniques of the empirical sciences. And, yet, its development is very closely linked with the natural sciences.”

John von Neumann, The Mathematician


What is the probability mass function of the binomial distribution? It turns out to be a combinatorics exercise, something we will “discover” in future markdowns. But as a primer, how do you find the probability of getting exactly 50 heads in 100 tosses of a fair coin? Or 48 heads? Or only 10?

How would we go about finding the answer to that question? Let’s take 40 heads as a concrete example. We’d have to count all possible combinations of 100 tosses where 40 of them are heads. Mathematically, you would say 100 choose 40. We will get into how that count is derived in later posts; for now, we’ll just look at the notation and how to calculate it. In mathematical notation, it looks like this:

\[ \binom{100}{40} \] In R, you compute it like this:

choose(100, 40)
## [1] 1.374623e+28

Abstractly, this gives us the number of binary strings of length 100 whose entries sum to 40. We now have to add the information that these zeros and ones are tosses of a fair coin, which means the probabilities of heads and of tails are both 0.5. The probability of getting any one particular sequence of heads and tails in 100 tosses is:

\[ (1/2)^{100} \] Computationally, we get (a very small number):

(1/2)^100
## [1] 7.888609e-31

If we multiply this number by the number of combinations of 100 tosses where 40 of them are heads, we get the sum total of the probabilities of all sequences leading to an outcome of exactly 40 heads. This looks like:

\[ \binom{100}{40}(1/2)^{100} \] If we compute it in R, we get:

choose(100, 40) * (1/2)^100
## [1] 0.01084387

So about 1%. Let’s check our work with a brute-force simulation.

# Using the equation we derived
math_sol <- choose(100, 40) * 0.5^100

# Using our computational solution
n_sessions <- 100000
num_heads <- lapply(seq(n_sessions), function(i) {
    result <- sample(x = c(0, 1), size = 100, replace = TRUE)
    result <- sum(result)
    return(result)
}) %>% unlist()
comp_sol <- sum(num_heads == 40)/length(num_heads)

# We compare. Should be almost equal but not quite.
math_sol
## [1] 0.01084387
comp_sol
## [1] 0.01131

And we turn out to be right. The computational solution successfully predicted the probability, and we’d have been able to do this even if we didn’t know the math behind the binomial distribution. It’s good to have both tools in the toolbox. If you know a formula for a given aspect of probability, or can derive it, then great! If not, and you’re quick with code, you can run a numerical simulation and get at the answer. There are plenty of cases, like the three-body problem (which, as an aside, comes with a science fiction novel recommendation), where there is no easy mathematical solution and you’re left with numerical simulation to get at the answer.

Now, we’re going to abstract the formula we just calculated for a fair coin, and see if we can get to it from the actual formula for the binomial distribution’s probability mass function. If we let n be the number of coin tosses (in our case 100) and k be the number of heads (in our case 40), then:

\[ \binom{100}{40}(1/2)^{100} \] becomes:

\[ \binom{n}{k}(1/2)^n \]

Now we add a new variable: we let p be the probability of getting heads. In our case it’s 1/2 because we’re dealing with a fair coin. If we allow for unfair coins in the formula, we get the probability mass function for the binomial distribution in full, below.

\[ \binom{n}{k} p^k (1-p)^{n-k} \]

Now, let’s try to work back to our original formula. We can simplify by assuming the coin is fair, setting p to 1/2. Our equation becomes:

\[ \binom{n}{k} (1/2)^k (1/2)^{n-k} \]

We can combine the two 1/2’s by adding the exponents: k + (n - k) = n, so the k’s cancel and only n remains. We get:

\[ \binom{n}{k}(1/2)^n \]

Which gives us the formula we derived initially. We now have some intuition around the probability mass function of the binomial distribution. We have used both our mathematical and computational toolkits to solve a coin toss probability problem. This exercise is very similar to how I approach data analysis problems day to day: I chisel away at them computationally, then mathematically, then computationally again.
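
As a quick sanity check, R’s built-in dbinom() implements this same probability mass function, so it should reproduce the fair-coin number we computed earlier:

n <- 100  # number of tosses
k <- 40   # number of heads
p <- 0.5  # probability of heads for a fair coin

# The probability mass function written out by hand
choose(n, k) * p^k * (1 - p)^(n - k)
## [1] 0.01084387
# R's built-in binomial probability mass function; should agree
dbinom(k, size = n, prob = p)
## [1] 0.01084387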

The standard deviation of the binomial distribution

Computational solution

Armed with this new terminology and knowledge that we arrived at from the bottom up, let’s connect it with the law of large numbers and our original distance-from-50%-heads coin toss simulation. Based on our initial results, we can hypothesize that if we repeat the Bernoulli process generation above with a larger number of coin tosses, 1000 instead of 100, the distribution will be “thinner,” corresponding to each Bernoulli process landing closer to 50% heads on average. We test this hypothesis by modifying our previous code to use a 1000-toss Bernoulli process rather than a 100-toss Bernoulli process, and we visualize the histogram below.

n_sessions <- 100000

seq(n_sessions)[1:10] # in case you haven't seen this command before
##  [1]  1  2  3  4  5  6  7  8  9 10
num_heads <- lapply(seq(n_sessions), function(i) {
    result <- sample(x = c(0, 1), size = 1000, replace = TRUE)
    result <- sum(result)/length(result)
    return(result)
}) %>% unlist()

# Example of output
num_heads[1:5]
## [1] 0.501 0.485 0.517 0.510 0.505
# Histogram
hist(num_heads, breaks = 50, xlim = c(0, 1))

# Basic statistics
mean(num_heads)
## [1] 0.5000197
sd(num_heads)
## [1] 0.0157715

We see above that the distribution is indeed thinner. We have a mean of nearly 50% and a standard deviation of around 1.6%, compared to a standard deviation of 5% for the 100-toss Bernoulli process. Finally, let’s look at the relationship between the number of coin tosses per process and the standard deviation. We can hypothesize that the curve will be a rounded L, as we saw in our initial plot, but this time we will get a much more precise picture of it.

n_sessions <- 100000

seq(n_sessions)[1:10] # in case you haven't seen this command before
##  [1]  1  2  3  4  5  6  7  8  9 10
flips_per_session <- c(10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000)
binomial_sd <- lapply(flips_per_session, function(i) {
    num_heads <- lapply(seq(n_sessions), function(j) {
        result <- sample(x = c(0, 1),
                         size = i,
                         replace = TRUE)
        result <- sum(result) / length(result)
        return(result)
    }) %>% unlist() %>% sd()
    return(num_heads)
}) %>% unlist()
names(binomial_sd) <- flips_per_session

plot(flips_per_session, binomial_sd)

Mathematical solution: the De Moivre-Laplace Theorem

We do indeed see an L-shaped curve, which starts to flatten out before we reach groups of 1000 coin tosses, though it still approaches zero as we go up. The formula for the curve above comes from what is called the De Moivre-Laplace Theorem, published in 1738, which in the case of a fair coin states that for n flips, the binomial distribution of the number of heads can be approximated by a normal distribution whose standard deviation is \(\sqrt{0.25n}\). Since we are plotting the fraction of heads rather than the count, we divide the result by n. In other words, the formula for the curve above is \(y = \sqrt{0.25n}/n\). Below, we check our results against this formula.

# SD calculation by the De Moivre-Laplace Theorem
dem_lap_sd <- lapply(flips_per_session, function(i) {
  result <- sqrt(i/4)
  result <- result/i # Normalize to between 0 and 1
  return(result)
}) %>% unlist()

# Compare to our calculation
sd_compare = tibble(dem_lap_sd = dem_lap_sd, binomial_sd = binomial_sd)
sd_compare
## # A tibble: 10 × 2
##    dem_lap_sd binomial_sd
##         <dbl>       <dbl>
##  1    0.158       0.158  
##  2    0.112       0.112  
##  3    0.0707      0.0704 
##  4    0.05        0.0502 
##  5    0.0354      0.0354 
##  6    0.0224      0.0223 
##  7    0.0158      0.0158 
##  8    0.0112      0.0112 
##  9    0.00707     0.00707
## 10    0.005       0.00501

You can see that the results from the De Moivre-Laplace Theorem and our calculations are almost the same. The very slight variation comes from the fact that we use a finite number of Bernoulli processes (100,000), while the theorem gives us what the standard deviation would be if the number of Bernoulli processes approached infinity. Unlike in 1738, when the theorem was published, we can get a good approximation of these standard deviations with a few lines of code. To wrap this section up: we started with simple coin toss simulations, and from that we ran into the Law of Large Numbers, the Central Limit Theorem, Bernoulli processes and Bernoulli trials, the binomial distribution, and the De Moivre-Laplace Theorem.