library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The random walk

We will now look at a special case of coin tosses. Here, we’ll start with the number 0. We’ll flip a coin. If it’s heads, we’ll add 1. If it’s tails, we’ll subtract 1. We’ll then track how the number changes over time. This will resemble something like a drunk person walking around.

set.seed(1)
tosses <- sample(x = c(-1, 1),
                 size = 1000,
                 replace = TRUE)
walk <- lapply(seq(length(tosses)), function(j) {
    sum(tosses[1:j])
}) %>% unlist()

plot(walk)

This is technically called a random walk. Being able to run this, and having intuition around is is extremely useful in many fields. This might remind you of stock market indices. It turns out that random walks are used by some to model this type of behavior. Random walks are also used often in signal transduction pathway modeling, in terms of walkers traversing models of networks, and are relatively easy to implement if you know what you’re doing. Google’s original algorithm, Page Rank, takes uses the same concept with the network being websites and their hyperlinks.

It’s nice that the concept of a random walk arose naturally from simple playing around with coin toss simulations. Now, we’ve seen two distributions so far: the binomial and the extreme value distributions. Let’s do a similar analysis here with a number of simulated random walks. We’ll start by asking if the final value of the random walk, if repeated thousands of times, fits some sort of distribution. We’ll hypothesize that it’s a bell curve that looks similar to the binomial distribution.

We simple take the code in the above block and encase it in a loop, and iterate ten thousand times, making ten thousand random walks for further analysis.

walk_list <- lapply(seq(10000), function(i) {
    tosses <- sample(x = c(-1, 1),
                     size = 1000,
                     replace = TRUE)
    walk <- lapply(seq(length(tosses)), function(j) {
        sum(tosses[1:j])
    }) %>% unlist()
    return(walk)
})

We then pull out what I’ll call the final-value distribution.

final_values <- lapply(walk_list, function(i) {
    i[length(i)]
}) %>% unlist()

hist(final_values, breaks = 100)

sd(final_values)
## [1] 31.93866
summary(final_values)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -110.0000  -22.0000    0.0000   -0.2076   22.0000  128.0000

We see a binomial-like distribution, similar to our coin toss examples above, which speaks again to the Central Limit Theorem. The mean is at zero, which means that the final value of the random walker is most often zero. We however that the walker can end up as far away from zero as 100 or -100, in a walk length of 1000, with the standard deviation at 31.9, or about a third of the max. It’s nice to see that these types of distributions show up in multiple contexts.

Maximum distance from zero

Now, we’re going to check the max or min value of each of the random walks, and we’re going to hypothesize that this is tailed as we saw from the “runs of luck” extreme value distribution before. Specifically, we’re going to look at the farthest away any given random walk goes from zero, whether it’s positive or negative.

max_values <- lapply(walk_list, function(i) {
    max(abs(i))
}) %>% unlist()

hist(max_values, breaks = 100)

summary(max_values)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   27.00   36.00   39.19   48.00  131.00

We see the fat-tailed distribution that we saw with our run of luck analysis. It’s nice to see that both cases of so-called extreme values are distributed the same way.

Maximum distance from zero: when?

We will now see if there is any pattern as to when the random walkers reach their maximum distance from zero. We will hypothesize that this happens more often than not toward the end of the walk, but even if we get this right, we need to gain some intuition around the respective distribution. Notice that I simply take the code from the previous block and add the “which” command along with making the max(abs(i)) command a boolean. This gives you the index associated with the TRUE value or values. In this case, we will get only one number.

max_when <- lapply(walk_list, function(i) {
    mean(which(abs(i) == max(abs(i))))
}) %>% unlist()

hist(max_when, breaks = 100)

summary(max_when)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    31.0   518.0   765.0   707.3   937.0  1000.0

We can see that we have a long tail to the left, with a substantial increase in frequency as we appraoch the maximum length of the walk at the right. The majority of walkers reach their max value at the end of the walk.

Crossing zero

We will now ask a basic question about the volatility of the random walker. How often does it cross zero? We will then make a distribution of the result. We will use the walk list generated above, to get the distribution for a fixed number of walks. We will simply look at the number of times the number zero shows up in each entry of the list.

num_crossing_zero <- lapply(walk_list, function(i) {
    result <- sum(i == 0)
    return(result)
}) %>% unlist()

summary(num_crossing_zero)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   21.00   24.18   35.00  117.00
hist(num_crossing_zero, breaks = 100)

We see a distribution that has a single tail, highest at zero, and substantially higher than the rest of the distribution at zero. In other words, the majority of walkers do not cross zero at all. There is a tail to the right. Based on the summary statistics, the max value is roughly five times higher than the mean. Contrast this distribution with the max values in the previous section, where it was tailed to the right as well, but with a clear tail on the left side as well.

We now check the relationship between the total number of zero crossings and the max value obtained in the walk, with the hypothesis that there will a negative correlation. If a walker is crossing zero more often, it’s spending more time at zero and less time moving toward a maximum distance from zero. We check the exact relationship below.

cor(num_crossing_zero, max_values, method = "spearman")
## [1] -0.5707451
qplot(num_crossing_zero, max_values)

qplot(num_crossing_zero, max_values) + geom_bin2d(bins = 40)

We see an inverse relationship that is not linear. It seems to flatten as the total number of crossings increase.

Crossing zero: earlier vs later

We will now drill into the walks again and look at the specific indices in which zero is crossed. We hypothesize that the majority of these crossing take place earlier in the walk as opposed to later.

num_crossing_zero <- lapply(walk_list, function(i) {
    result <- list(mean_cross = mean(which(i == 0)), total_cross = sum(i == 0))
    return(result)
}) %>% bind_rows()

qplot(num_crossing_zero$mean_cross, num_crossing_zero$total_cross) + 
    labs(x = "Mean index of all crossings", y = "Total number of crossings")
## Warning: Removed 256 rows containing missing values (geom_point).

The plot above shows that if the total number of crossings is higher, the average index of crossings is in the middle. This means that the crossings were not skewed to the left or right in this case. In a way, this speaks to the law of large numbers, in that the more zero crossings you have, the more likely they are distributed in such a way that the average index is half the total number of walks.

Random walk in 2-D

I’ll leave you with an extension of random walk analysis, in 2D. We’ll now have the random walker move around on a 2-D grid rather than a 1-D line, and we will track where the given walker has been.

You’ll see that this is actually very easy to implement in R, as an extension of how we implemented the 1d random walk.

set.seed(1)
tosses1 <- sample(x = c(-1, 1),
                 size = 10000,
                 replace = TRUE)
tosses2 <- sample(x = c(-1, 1),
                 size = 10000,
                 replace = TRUE)

walk <- lapply(seq(length(tosses1)), function(j) {
    result <- list(x = sum(tosses1[1:j]), y = sum(tosses2[1:j]))
}) %>% bind_rows()
    
plot(walk[[1]], walk[[2]])

And if the above reminds you of brownian motion, you’ll be happy to know that this exact type of analysis has been used to model brownian motion. It’s interesting to see that a 2-d random walk is really just two separate random walks, with one placed on the x axis and one placed on the y axis. You can do a similar approach to do random walks in 3 or even more dimensions.