scRNA seq pipeline with one file

Home

“He (my father) knew the difference between knowing the name of something and knowing something…which I learned very early”

Richard Feynman, Interview “On knowing the name of something”

The purpose of this markdown is to take the Seurat PBMC 3K tutorial go into depth on each piece. Specifically, we’re going to take some of the core high-level Seurat functions in the pipeline, and get to the guts of what is happening at each transformation.

Load data

So let’s start by reading in the Seurat object. This piece is pasted directly in from the vignette.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Seurat)

## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0

## Attaching SeuratObject

library(patchwork)
library(here)

## here() starts at /Users/tylerburns/Documents/projects/scrna_seq_pipeline_one_file

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.2     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

set.seed(42)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = here::here('data', 'pbmc_3k', 'filtered_gene_bc_matrices', 'hg19'))
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

First, let’s look at what pbmc.data is. It’s a matrix, but it’s compressed. Because single-cell sequencing data is very large (20k dimensions, compared to 40 for CyTOF), it helps to compress it. You can explit the fact that single-cell sequencing data is very sparse. There are a lot of zeros. This is a major contrast to flow and CyTOF data.

I’m showing you a section that has at least one non-zero value, so you can see that its a count matrix.

pbmc.data[90:95, 1:5]

## 6 x 5 sparse Matrix of class "dgCMatrix"
##              AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## SLC35E2B                    .                .                .
## RP11-345P4.7                .                .                .
## CDK11A                      .                .                .
## SLC35E2                     .                .                .
## NADK                        .                .                .
## GNB1                        .                .                .
##              AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## SLC35E2B                    .                .
## RP11-345P4.7                .                .
## CDK11A                      .                .
## SLC35E2                     .                .
## NADK                        1                .
## GNB1                        .                .

Another thing worth noticing, especially for flow and CyTOF users, is that the dimensions are rows and the cells are columns. The format of the matrix is transposed, in comparison to how we’re used to viewing it. Note that the columns have cell ID names. This is something we don’t have for flow and CyTOF data.

Now let’s look at the Seurat object that is created from this.

pbmc

## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

Ok, but what is this exactly? Well, for starts, it can be subsetted like a matrix. Note that the “matrix” is in the same format as the first one. Cells are columns and genes are rows. Let’s just take the first five genes and keep all the cells.

pbmc[1:5,]

## An object of class Seurat 
## 5 features across 2700 samples within 1 assay 
## Active assay: RNA (5 features, 0 variable features)

Now we’ll take the first five cells and all of the genes.

pbmc[,1:5]

## An object of class Seurat 
## 13714 features across 5 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

Ok, great. But what’s in here? How can I just look at my data?

pbmc@assays$RNA@data[90:95, 1:5]

## 6 x 5 sparse Matrix of class "dgCMatrix"
##         AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## CAMTA1                 .                1                .                .
## VAMP3                  .                .                .                .
## PER3                   .                .                .                .
## UTS2                   .                .                .                .
## TNFRSF9                .                .                .                .
## PARK7                  .                .                2                .
##         AAACCGTGTATGCG-1
## CAMTA1                 .
## VAMP3                  .
## PER3                   .
## UTS2                   .
## TNFRSF9                .
## PARK7                  .

Ok, there’s our matrix. It still appears to be compressed. We note that it’s in a different order now.

There are some other things in the object. One worth looking at is the meta data.

pbmc@meta.data[1:5,]

##                  orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1     pbmc3k       2419          779
## AAACATTGAGCTAC-1     pbmc3k       4903         1352
## AAACATTGATCAGC-1     pbmc3k       3147         1129
## AAACCGTGCTTCCG-1     pbmc3k       2639          960
## AAACCGTGTATGCG-1     pbmc3k        980          521

We have the cell ID as a row now (not a column as before…kinda confusing). Then we have a couple of features. Now, this meta.data matrix can get pretty big in more complex studies and contain anything from treatment condition to cluster IDs.

Quality control

Ok, the first major thing we have to do is filter by percent mitochondrial genes. Here’s what that looks like.

# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

Ok, so this high-level function created a new meta data feature, called “percent.mt” that we can filter on.

pbmc@meta.data[1:5,]

##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898

Let’s look at it.

summary(pbmc@meta.data$percent.mt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.537   2.031   2.217   2.643  22.569

Ok, so general very low.

hist(pbmc@meta.data$percent.mt, breaks = 100)

Now we set a threshold for this, as well as the as the other meta data columns. Let’s look at the Seurat code for this.

First some built-in visualizations.

# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

Ok, fine. Now we use the function “subset” to subset on multiple things at a time. Here are the default values.

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Note that subsetting the matrix is more complicated than just filtering on the metadata.

Filtering on the metadata gives us the metadata column, not the filtered Seurat object. Welcome to object-oriented programming.

tmp <- dplyr::filter(pbmc@meta.data, percent.mt < 5)
tmp[1:5,]

##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898

Normalization

Ok, so now we go into normalizing the data. Seurat has a high-level function for it. First, we’re going to keep the raw PBMC object so we can make comparisons, while I reverse engineer what’s going on here.

pbmc_raw <- pbmc

Ok, now for the high-level Seurat function.

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

What did that do?

pbmc@assays$RNA@data[1:20, 1:5]

## 20 x 5 sparse Matrix of class "dgCMatrix"
##               AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1                   .         .                .       
## AP006222.2                   .         .                .       
## RP11-206L10.2                .         .                .       
## RP11-206L10.9                .         .                .       
## LINC00115                    .         .                .       
## NOC2L                        .         .                .       
## KLHL17                       .         .                .       
## PLEKHN1                      .         .                .       
## RP11-54O7.17                 .         .                .       
## HES4                         .         .                .       
## RP11-54O7.11                 .         .                .       
## ISG15                        .         .                1.429744
## AGRN                         .         .                .       
## C1orf159                     .         .                .       
## TNFRSF18                     .         1.625141         .       
## TNFRSF4                      .         .                .       
## SDF4                         .         .                1.429744
## B3GALT6                      .         .                .       
## FAM132A                      .         .                .       
## UBE2J2                       .         .                .       
##               AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1             .                      .
## AP006222.2             .                      .
## RP11-206L10.2          .                      .
## RP11-206L10.9          .                      .
## LINC00115              .                      .
## NOC2L                  .                      .
## KLHL17                 .                      .
## PLEKHN1                .                      .
## RP11-54O7.17           .                      .
## HES4                   .                      .
## RP11-54O7.11           .                      .
## ISG15                  3.55831                .
## AGRN                   .                      .
## C1orf159               .                      .
## TNFRSF18               .                      .
## TNFRSF4                .                      .
## SDF4                   .                      .
## B3GALT6                .                      .
## FAM132A                .                      .
## UBE2J2                 .                      .

Notice that the values are now real numbers as opposed to integers (counts). What exactly happened here? For flow and CyTOF data alike, there is some sort of log-like transformation that takes place. It’s no different here. With this one, it’s log(x + 1), or log1p.

But it’s slightly more complicated than that. Let me show you.

Let’s get out our raw matrix.

tmp_raw <- pbmc_raw@assays$RNA@data %>% as.data.frame()
tmp_raw[1:20, 1:5]

##               AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1                   0                0                0
## AP006222.2                   0                0                0
## RP11-206L10.2                0                0                0
## RP11-206L10.9                0                0                0
## LINC00115                    0                0                0
## NOC2L                        0                0                0
## KLHL17                       0                0                0
## PLEKHN1                      0                0                0
## RP11-54O7.17                 0                0                0
## HES4                         0                0                0
## RP11-54O7.11                 0                0                0
## ISG15                        0                0                1
## AGRN                         0                0                0
## C1orf159                     0                0                0
## TNFRSF18                     0                2                0
## TNFRSF4                      0                0                0
## SDF4                         0                0                1
## B3GALT6                      0                0                0
## FAM132A                      0                0                0
## UBE2J2                       0                0                0
##               AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1                   0                0
## AP006222.2                   0                0
## RP11-206L10.2                0                0
## RP11-206L10.9                0                0
## LINC00115                    0                0
## NOC2L                        0                0
## KLHL17                       0                0
## PLEKHN1                      0                0
## RP11-54O7.17                 0                0
## HES4                         0                0
## RP11-54O7.11                 0                0
## ISG15                        9                0
## AGRN                         0                0
## C1orf159                     0                0
## TNFRSF18                     0                0
## TNFRSF4                      0                0
## SDF4                         0                0
## B3GALT6                      0                0
## FAM132A                      0                0
## UBE2J2                       0                0

Ok, now we have to get the total counts per cell.

total_counts <- colSums(tmp_raw)
total_counts[1:5]

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##             2419             4903             3147             2639 
## AAACCGTGTATGCG-1 
##              980

Ok, and now we make our per-cell scaling factor. Default is 10,000.

scaling_factors <- 10000/total_counts
scaling_factors[1:5]

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##         4.133940         2.039568         3.177629         3.789314 
## AAACCGTGTATGCG-1 
##        10.204082

Ok, now we normalize the matrix, given the per-cell scaling factors.

# Define a function to normalize a single column
normalize_column <- function(column, scale_factor) {
  total_counts <- sum(column)
  scaling_factor <- scale_factor / total_counts
  normalized_column <- column * scaling_factor
  return(normalized_column)
}

# Apply the function to each column of the count matrix
tmp_norm <- apply(tmp_raw, 2, normalize_column, scale_factor = 10000) %>% as.data.frame()
tmp_norm[1:5, 1:5]

##               AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1                   0                0                0
## AP006222.2                   0                0                0
## RP11-206L10.2                0                0                0
## RP11-206L10.9                0                0                0
## LINC00115                    0                0                0
##               AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1                   0                0
## AP006222.2                   0                0
## RP11-206L10.2                0                0
## RP11-206L10.9                0                0
## LINC00115                    0                0

Ok, and now we do the log1p transformation on the whole matrix.

tmp_norm <- log1p(tmp_norm)
tmp_norm[1:5, 1:5]

##               AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1                   0                0                0
## AP006222.2                   0                0                0
## RP11-206L10.2                0                0                0
## RP11-206L10.9                0                0                0
## LINC00115                    0                0                0
##               AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1                   0                0
## AP006222.2                   0                0
## RP11-206L10.2                0                0
## RP11-206L10.9                0                0
## LINC00115                    0                0

And now we compare it. Let’s pick on ISG15.

isg15_tyler <- tmp_norm[rownames(tmp_norm) == "ISG15",]
isg15_tyler[1:15]

##       AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## ISG15                0                0         1.429744          3.55831
##       AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 AAACGCTGACCAGT-1 AAACGCTGGTTCTT-1
## ISG15                0         1.726902                0                0
##       AAACGCTGTAGCCA-1 AAACGCTGTTTCTG-1 AAACTTGAAAAACG-1 AAACTTGATCCAGA-1
## ISG15                0         3.339271                0                0
##       AAAGAGACGAGATA-1 AAAGAGACGCGAGA-1 AAAGAGACGGACTT-1
## ISG15         1.638876         2.861362                0

isg15_seurat <- pbmc@assays$RNA@data %>% as.data.frame() %>% .[rownames(.) == "ISG15",]
isg15_seurat[1:15]

##       AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## ISG15                0                0         1.429744          3.55831
##       AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 AAACGCTGACCAGT-1 AAACGCTGGTTCTT-1
## ISG15                0         1.726902                0                0
##       AAACGCTGTAGCCA-1 AAACGCTGTTTCTG-1 AAACTTGAAAAACG-1 AAACTTGATCCAGA-1
## ISG15                0         3.339271                0                0
##       AAAGAGACGAGATA-1 AAAGAGACGCGAGA-1 AAAGAGACGGACTT-1
## ISG15         1.638876         2.861362                0

And now we check equality, with tolerance to 10^-13, because sometimes the precision of floating point numbers in R versus C++ (which some of Seurat’s code is one in) will differ by a very small amount.

all.equal(isg15_seurat, isg15_tyler, tolerance = 10e-13)

## [1] TRUE

Bottom line is it worked. You now know how data normalization in Seurat works.

Find most variable genes

Ok, now let’s go into a very important piece. We have to find the genes that are the most variable. This narrows our dataset from 10,000+ genes to around 2000.

The function is a single line of code below.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

We can find our variable features here. Here are the first ten of them.

pbmc@assays$RNA@var.features[1:10]

##  [1] "PPBP"   "LYZ"    "S100A9" "IGLL5"  "GNLY"   "FTL"    "PF4"    "FTH1"  
##  [9] "GNG11"  "S100A8"

The vst method normalizes the variances using a negative binomial regression (standard for counts data). It’s relatively complicated. Let’s see what happens if we go by variance alone of the count data.

gene_var <- apply(pbmc@assays$RNA@counts, 1, var)
gene_var[1:5]

##    AL627309.1    AP006222.2 RP11-206L10.2 RP11-206L10.9     LINC00115 
##   0.003401325   0.001136363   0.001892500   0.001136363   0.006779363

exp_genes <- sort(gene_var, decreasing = TRUE)[1:2000] %>% names()
length(which(exp_genes %in% pbmc@assays$RNA@var.features))/2000

## [1] 0.5905

Two thirds of genes. Not bad.

Ok, now we apply the variance stabilizing transform (vst) itself. This comes from the sctransform package in R. It’s a negative binomial regression that stabilizes the variance as a function of the expression level across genes. The function itself is actually quite complicated (type sctransform::vst in R and you can have a look at the code).

Here is the function, and some of its output.

vst_var <- sctransform::vst(pbmc@assays$RNA@counts)

## Calculating cell attributes from input UMI matrix: log_umi

## Variance stabilizing transformation of count matrix of size 12519 by 2638

## Model formula is y ~ log_umi

## Get Negative Binomial regression parameters per gene

## Using 2000 genes, 2638 cells

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |======================================================================| 100%

## Found 138 outliers - those will be ignored in fitting/regularization step

## Second step: Get residuals using fitted parameters for 12519 genes

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%

## Calculating gene attributes

## Wall clock passed: Time difference of 12.48259 secs

vst_var$gene_attr[1:5,]

##               detection_rate       gmean       amean    variance residual_mean
## AL627309.1       0.003411676 0.002367592 0.003411676 0.003401325   0.003785778
## RP11-206L10.2    0.001895375 0.001314637 0.001895375 0.001892500  -0.010932678
## LINC00115        0.006823351 0.004740789 0.006823351 0.006779363  -0.001366130
## NOC2L            0.096285064 0.072739766 0.107278241 0.159514698  -0.009376245
## KLHL17           0.003032600 0.002104249 0.003032600 0.003024550   0.001541438
##               residual_variance
## AL627309.1            1.0041101
## RP11-206L10.2         0.6320025
## LINC00115             0.9102364
## NOC2L                 1.1328134
## KLHL17                0.9725364

The transformed values for variance calculation is stored in y.

vst_var$y[1:5, 1:5]

##               AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1         -0.05613019      -0.07437089      -0.06236262
## RP11-206L10.2      -0.04587112      -0.06208717      -0.05137991
## LINC00115          -0.08231837      -0.10975224      -0.09171929
## NOC2L              -0.30607566      -0.40334948      -0.34105449
## KLHL17             -0.05322630      -0.07084666      -0.05923795
##               AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1         -0.05812327      -0.03897974
## RP11-206L10.2      -0.04762861      -0.03094908
## LINC00115          -0.08532661      -0.05643377
## NOC2L              -0.31743311      -0.20378746
## KLHL17             -0.05514767      -0.03674340

gene_variance <- apply(vst_var$y, 1, var)
gene_variance[1:5]

##    AL627309.1 RP11-206L10.2     LINC00115         NOC2L        KLHL17 
##     1.0041101     0.6320025     0.9102364     1.1328134     0.9725364

final_genes <- sort(gene_variance, decreasing = TRUE)[1:2000]
final_genes[1:5]

##   S100A9     GNLY      LYZ   S100A8     NKG7 
## 74.97440 65.80410 56.64264 48.98671 42.34337

length(which(names(final_genes) %in% pbmc@assays$RNA@var.features))/2000

## [1] 0.8365

Ok, a little better. We’re going to step here. There’s a bit more going on under the hood here, that you can read about in the documentation. But hopefully you get the idea of how to find the variable genes.

Scaling

Ok, now we have to scale the data. This is something that is standard practice before PCA, which is used as pre-processing for clustering and nonlinear dimensionality reduction (eg. UMAP). We’re trying to compress the data from the current 2000 dimensions to something closer to 10-50 dimensions. Comparable to flow and CyTOF data.

We have:

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes, do.center = TRUE, do.scale = TRUE)

## Centering and scaling data matrix

Which creates a new slot:

pbmc@assays$RNA@scale.data[1:5, 1:5]

##               AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1         -0.05812316      -0.05812316      -0.05812316
## AP006222.2         -0.03357571      -0.03357571      -0.03357571
## RP11-206L10.2      -0.04166819      -0.04166819      -0.04166819
## RP11-206L10.9      -0.03364562      -0.03364562      -0.03364562
## LINC00115          -0.08223981      -0.08223981      -0.08223981
##               AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1         -0.05812316      -0.05812316
## AP006222.2         -0.03357571      -0.03357571
## RP11-206L10.2      -0.04166819      -0.04166819
## RP11-206L10.9      -0.03364562      -0.03364562
## LINC00115          -0.08223981      -0.08223981

But what is actually happening here? The data are being centered and scaled. Let’s go through that. Remember, we already have the data matrix stored as tmp_norm.

There’s a function called “scale” in base R that does this. The way it’s set up, if center and scale are true, then what we go row by row, subtracting every value by the mean of the row, and dividing by the row’s standard deviation. This produces the z-score. The way this works is the numbers get converted to how many standard deviations from the mean (positive or negative) they are.

Note that in the code below, I include some logic that caps the max value at 10, which is what is done in Seurat.

# Scale the cleaned data
tmp_scale <- apply(tmp_norm, 1, function(i) {
    if(sd(i) == 0) return(rep(0, length(i)))
    result <- scale(i, center = TRUE, scale = TRUE)
    result <- ifelse(result > 10, 10, result)
}) %>% t()
colnames(tmp_scale) <- colnames(tmp_norm)

And now we check for equality. Note that we have to deal with a very small tolerance because Seurat’s ScaleData function is actually in C++ (via Rcpp), so there are going to be small differences in the precision of floating point numbers between my function and Seurat’s.

all.equal(tmp_scale, pbmc@assays$RNA@scale.data, tolerance = 1e-13)

## [1] TRUE

PCA

Ok, now it’s time for PCA. For clustering and dimension reduction, we don’t want to use all 2000 “most variable” genes. We want to compress that information into as few dimensions as possible, which essentially make the data more CyTOF-like, in terms of dimensionality.

So first, the Seurat function.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), weight.by.var = TRUE)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, P2RX5, IGLL5, LAT2, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HIST1H2AC, HLA-DPB1, PF4, SDPR 
##     TCL1A, HLA-DRB1, HLA-DPA1, HLA-DQA2, PPBP, HLA-DRA, LINC00926, GNG11, SPARC, HLA-DRB5 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, CLU, TUBB1, GZMB 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, RBP7 
##     CCL5, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     LILRB2, PTGES3, MAL, CD27, HN1, CD2, GDI2, CORO1B, ANXA5, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB

This gives us:

pbmc@reductions$pca@cell.embeddings[1:5, 1:5]

##                        PC_1       PC_2       PC_3       PC_4        PC_5
## AAACATACAACCAC-1 -4.7298963 -0.5182652 -0.7809100 -2.3089783 -0.07158848
## AAACATTGAGCTAC-1 -0.5176254  4.5923068  5.9605692  6.8677364 -1.96144350
## AAACATTGATCAGC-1 -3.1892555 -3.4694432 -0.8469782 -1.9956538 -5.10640863
## AAACCGTGCTTCCG-1 12.7931387  0.1008253  0.6292662 -0.3737225  0.21943186
## AAACCGTGTATGCG-1 -3.1290640 -6.3478829  1.2656002  3.0147877  7.84529636

And of course, we can plot these.

plot(pbmc@reductions$pca@cell.embeddings[,1], pbmc@reductions$pca@cell.embeddings[,2])

Or with the built-in Seurat function.

DimPlot(pbmc)

How might we do this in base R? Let’s take our scaled matrix and do PCA. Note that I have to transpose the matrix, because this works on the columns.

tmp_pc <- prcomp(t(tmp_scale), center = FALSE, scale = FALSE)

Which gives us:

tmp_pc$x[1:5, 1:5]

##                        PC1         PC2         PC3        PC4        PC5
## AAACATACAACCAC-1 -8.064775  -0.5646571   0.5684279 -3.4358954  1.7446948
## AAACATTGAGCTAC-1 -0.118376   1.3762525   3.6784549 17.3781755 -0.5604802
## AAACATTGATCAGC-1 -5.126320  -8.8074650   6.7152749 -2.5082868 -4.4440677
## AAACCGTGCTTCCG-1 18.846209   2.3536153   2.2374983 -0.7606574  0.1080687
## AAACCGTGTATGCG-1  3.121668 -10.7846640 -19.8882184  0.4245099  7.3525490

Note that the values are higher, by maybe 2 fold. Let’s plot it.

plot(tmp_pc$x[,1], tmp_pc$x[,2])

Note that the image is similar looking but the scale is different, and there are subtle differences in the shape of the respective images (like how “straight” the left-most island is). This could be for a number of reasons. Seurat is doing things a little different under the hood, as its PCA method (Irlba) is optimized for large sparse matrices typical of scRNA seq data. It’s definitely much faster than the “prcomp” method that I use, which technically is standard singular value decomposition (SVD). Taken together, this should suffice for now. I tried a number of things to get my method and Seurat’s method to be exact, but it became a rabbit hole that this vignette does not need. Just know that in general, there is more than one method to do PCA.

Clustering

Now let’s do our clustering. Clustering is done by producing a K-Nearest Neighbor graph followed by Louvain community detection. There is a random component to this, so any clusters I make outside of Seurat using this method won’t necessarily overlap exactly. This said, let’s start with Seurat’s way of doing it.

pbmc <- FindNeighbors(pbmc, dims = 1:10, k.param = 20)

## Computing nearest neighbor graph

## Computing SNN

pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95927
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds

Let’s see what that looks like. Here is where to find it.

pbmc@meta.data$seurat_clusters[1:10]

##  [1] 2 3 2 1 6 2 4 4 4 5
## Levels: 0 1 2 3 4 5 6 7 8

And let’s tabulate.

table(pbmc@meta.data$seurat_clusters)

## 
##   0   1   2   3   4   5   6   7   8 
## 684 481 476 344 291 162 155  32  13

Great. Now how would we cluster this ourselves? In this example we took the first 10 PCs, so let’s make that an accessible data frame here.

pbmc_pc <- pbmc@reductions$pca@cell.embeddings[,1:10]
pbmc_pc[1:10, 1:5]

##                        PC_1       PC_2        PC_3       PC_4        PC_5
## AAACATACAACCAC-1 -4.7298963 -0.5182652 -0.78090997 -2.3089783 -0.07158848
## AAACATTGAGCTAC-1 -0.5176254  4.5923068  5.96056923  6.8677364 -1.96144350
## AAACATTGATCAGC-1 -3.1892555 -3.4694432 -0.84697819 -1.9956538 -5.10640863
## AAACCGTGCTTCCG-1 12.7931387  0.1008253  0.62926618 -0.3737225  0.21943186
## AAACCGTGTATGCG-1 -3.1290640 -6.3478829  1.26560015  3.0147877  7.84529636
## AAACGCACTGGTAC-1 -3.1090537  0.9263107 -0.66516745 -2.3198198 -2.00491810
## AAACGCTGACCAGT-1 -5.0240458 -2.1163888 -0.19651307 -2.1072840  1.11581208
## AAACGCTGGTTCTT-1 -4.9134545 -2.9156280 -0.04381404 -0.8202407 -0.97434022
## AAACGCTGTAGCCA-1 -5.2666534  0.8731879  0.39419513  0.4214284  4.23482294
## AAACGCTGTTTCTG-1  6.2658545  0.2416898 -1.37704748 -1.4789961  0.44699151

Great. Now for the clustering. The first thing we have to do is make a KNN graph. How do we do that? Well, we have to find the K-nearest neighbors of each cell. I’m quite familiar with this, as it was a big part of my thesis work. Notice that the cells are now rows, closer to CyTOF data.

We choose a K of 20, to be consistent with what Seurat does.

library(RANN)
nn <- RANN::nn2(pbmc_pc, k = 20)

dim(nn$nn.idx)

## [1] 2638   20

nn$nn.idx[1:10, 1:10]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1 1505  964  619 2080 1265  512 2318 1992  1574
##  [2,]    2 2221 2409   11 1204 1501 2445  258 1811  2561
##  [3,]    3  178  315 1419  575  738 1889 2428 1167   467
##  [4,]    4 2441  966 1479 1049 2079 2464  320 1388  1336
##  [5,]    5 1661 1500 1849  520  158  143 2319  288   977
##  [6,]    6  756 2129  261  343  348  769  875  961   547
##  [7,]    7 2084  627 1828 1149 1692 2080   41 1109  2209
##  [8,]    8  380 1203 2618 2542  308 1058  204 2421  1391
##  [9,]    9  879 2331 2229  287 2488  332  674   31  1658
## [10,]   10  571 2389  381 2457 2590 2546 2572 1102   256

Note that the first column is the cell number, and each subsequent column is the n-1th nearest neighbor. So the second column is the first nearest neighbor. And so on. Technically, each row is 1 cell and its 19 nearest neighbors, together making a neighborhood of 20.

Ok, now we have to turn this into an actual graph. To do that, we have to build an adjacency matrix. This can in turn be converted into a graph, using the igraph library, which I’ve been using for about five years now.

Let me show you how to build an adjacency matrix and what that actually looks like.

n <- nrow(pbmc_pc)
adj_matrix <- matrix(0, n, n)
for (i in 1:n) {
  adj_matrix[i, nn$nn.idx[i, ]] <- 1
}

adj_matrix[1:10, 1:10]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    0    0    0    0    0    0    0    0     0
##  [2,]    0    1    0    0    0    0    0    0    0     0
##  [3,]    0    0    1    0    0    0    0    0    0     0
##  [4,]    0    0    0    1    0    0    0    0    0     0
##  [5,]    0    0    0    0    1    0    0    0    0     0
##  [6,]    0    0    0    0    0    1    0    0    0     0
##  [7,]    1    0    0    0    0    0    1    0    0     0
##  [8,]    0    0    0    0    0    0    0    1    0     0
##  [9,]    0    0    0    0    0    0    0    0    1     0
## [10,]    0    0    0    0    0    0    0    0    0     1

As you can see, the cell ID is repeated on the rows and the columns. 0 indicates that they’re not connected. 1 indicates that they are. 1 is always connected to 1, 2 to 2, etc, so we expect the diagnol to be 1. But we also see that 1 is connected to 7. But alas, 7 is not connected to 1. This is an important point. Nearest neighbors are not necessarily mutual.

Now for the graph. Here’s what that looks like. Notice it simply is represented as a list of (some cell)–(some other cell).

library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:lubridate':
## 
##     %--%, union

## The following objects are masked from 'package:purrr':
## 
##     compose, simplify

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following object is masked from 'package:tibble':
## 
##     as_data_frame

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

# Create a graph
g <- graph.adjacency(adj_matrix, mode = "undirected")
g

## IGRAPH 037ca1f U--- 2638 38490 -- 
## + edges from 037ca1f:
##  [1] 1--   1 1--   7 1-- 103 1-- 204 1-- 214 1-- 293 1-- 422 1-- 452 1-- 512
## [10] 1-- 548 1-- 619 1-- 652 1-- 768 1-- 964 1-- 979 1--1006 1--1044 1--1149
## [19] 1--1157 1--1265 1--1332 1--1338 1--1339 1--1386 1--1505 1--1555 1--1574
## [28] 1--1621 1--1696 1--1852 1--1905 1--1992 1--2015 1--2080 1--2137 1--2188
## [37] 1--2297 1--2318 1--2365 1--2548 1--2602 2--   2 2--  11 2--  19 2-- 108
## [46] 2-- 161 2-- 258 2-- 273 2-- 423 2-- 562 2-- 632 2-- 765 2-- 927 2--1050
## [55] 2--1204 2--1277 2--1501 2--1811 2--1988 2--2010 2--2038 2--2221 2--2409
## [64] 2--2413 2--2445 2--2561 2--2587 2--2635 3--   3 3-- 178 3-- 315 3-- 462
## [73] 3-- 467 3-- 474 3-- 575 3-- 738 3-- 776 3-- 844 3-- 917 3--1073 3--1134
## + ... omitted several edges

Ok, now what if we were to visualize this graph. Why not?

plot(g)

Not bad! We can already start to see what looks somewhat like a UMAP. And this was a simple graph embedding, based on who was connected to who.

Ok, now we can cluster!

# Detect communities using the Louvain algorithm
communities <- cluster_louvain(g, resolution = 0.5)
table(communities$membership) %>% sort(decreasing = TRUE)

## 
##   6   1   3   2   5   7   4   8 
## 692 503 486 344 252 173 158  30

OK, now let’s look at them side by side. We’ll do that by pasting the respective clusters together, and see if they line up. We will find out, for example, if cluster 1 from Seurat lines up with cluster 8 from igraph every time.

clust_compare <- paste(pbmc$seurat_clusters, communities$membership)
clust_compare[1:10]

##  [1] "2 1" "3 2" "2 1" "1 3" "6 4" "2 1" "4 5" "4 1" "4 6" "5 7"

And now we tabulate.

table(clust_compare) %>% sort(decreasing = TRUE)

## clust_compare
## 0 6 1 3 2 1 3 2 4 5 5 7 6 4 7 8 0 1 4 6 8 7 2 6 4 1 5 3 1 7 4 4 7 3 4 7 
## 656 478 466 344 252 156 155  30  28  26  13  10   9   6   3   3   2   1

If they didn’t line up, this table would look a lot messier. The interpretation here is that for the most part, they line up, with a few disagreements. This is to say that what we did outside of Seurat lines up quite well with what was done in Seurat.

But for those who need a visual, let’s do our dimension reduction and color by cluster ID.

Dimension reduction

So dimension reduction is pretty straightforward at this point, but I’ll go ahead and run it once in Seurat and once outside of Seurat.

pbmc <- Seurat::RunUMAP(pbmc, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 17:39:03 UMAP embedding parameters a = 0.9922 b = 1.112

## 17:39:03 Read 2638 rows and found 10 numeric columns

## 17:39:03 Using Annoy for neighbor search, n_neighbors = 30

## 17:39:03 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 17:39:03 Writing NN index file to temp file /var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T//RtmpT54tFx/file500b615764e1
## 17:39:03 Searching Annoy index using 1 thread, search_k = 3000
## 17:39:04 Annoy recall = 100%
## 17:39:04 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 17:39:04 Initializing from normalized Laplacian + noise (using irlba)
## 17:39:04 Commencing optimization for 500 epochs, with 105140 positive edges
## 17:39:07 Optimization finished

Ok, now let’s look at Seurat’s internal plotting method:

Seurat::DimPlot(pbmc, reduction = "umap")

Cool. Note again how much this lines up with the graph embedding from igraph.

Ok, but we need to use our own plotting function so we can compare our clusters. So we’ll pull the data out accordingly.

umap_seurat <- pbmc@reductions$umap@cell.embeddings %>% as.data.frame()
head(umap_seurat)

##                     UMAP_1    UMAP_2
## AAACATACAACCAC-1  3.072963 -4.274904
## AAACATTGAGCTAC-1  3.020997 10.119626
## AAACATTGATCAGC-1  4.827689 -7.220714
## AAACCGTGCTTCCG-1 -7.453662  5.243771
## AAACCGTGTATGCG-1  5.481780  1.383574
## AAACGCACTGGTAC-1  2.536327 -6.211906

Great. Now let’s plot by our Seurat clusters and then our igraph clusters.

tmp <- dplyr::bind_cols(umap_seurat, igraph_clust = communities$membership, seurat_clust = pbmc$seurat_clusters)
head(tmp)

##                     UMAP_1    UMAP_2 igraph_clust seurat_clust
## AAACATACAACCAC-1  3.072963 -4.274904            1            2
## AAACATTGAGCTAC-1  3.020997 10.119626            2            3
## AAACATTGATCAGC-1  4.827689 -7.220714            1            2
## AAACCGTGCTTCCG-1 -7.453662  5.243771            3            1
## AAACCGTGTATGCG-1  5.481780  1.383574            4            6
## AAACGCACTGGTAC-1  2.536327 -6.211906            1            2

Now let’s plot our Seurat clusters.

ggplot(tmp, aes(x = UMAP_1, y = UMAP_2, color = seurat_clust)) + geom_point()

And now the igrpah clusters.

ggplot(tmp, aes(x = UMAP_1, y = UMAP_2, color = as.factor(igraph_clust))) + geom_point()

Ok, now we can see that they’re quite similar.

Finally, here’s how to run umap on our own. We simply use the umap package on the 10PCs that we used as input for the Louvain clustering.

Note that Seurat’s UMAP implementation uses cosine distance as the default metric. The default metric on the umap::umap implementation is Euclidean. I change the metric accordingly below. If you want a bit more depth on distance metrics, please visit my poster here.

umap_tyler <- umap::umap(pbmc_pc, metric = "cosine")$layout %>% as.data.frame()
names(umap_tyler) <- c("UMAP_1", "UMAP_2")
head(umap_tyler)

##                       UMAP_1    UMAP_2
## AAACATACAACCAC-1   3.7099847 -3.511420
## AAACATTGAGCTAC-1 -10.8369808 -3.956857
## AAACATTGATCAGC-1   1.8413225 -1.519281
## AAACCGTGCTTCCG-1  -2.8130010  8.957300
## AAACCGTGTATGCG-1  -0.6156864 -8.492318
## AAACGCACTGGTAC-1   4.0047025 -1.675029

And now we repeat what we did.

tmp <- dplyr::bind_cols(umap_tyler, igraph_clust = communities$membership, seurat_clust = pbmc$seurat_clusters)
head(tmp)

##                       UMAP_1    UMAP_2 igraph_clust seurat_clust
## AAACATACAACCAC-1   3.7099847 -3.511420            1            2
## AAACATTGAGCTAC-1 -10.8369808 -3.956857            2            3
## AAACATTGATCAGC-1   1.8413225 -1.519281            1            2
## AAACCGTGCTTCCG-1  -2.8130010  8.957300            3            1
## AAACCGTGTATGCG-1  -0.6156864 -8.492318            4            6
## AAACGCACTGGTAC-1   4.0047025 -1.675029            1            2

ggplot(tmp, aes(x = UMAP_1, y = UMAP_2, color = as.factor(seurat_clust))) + geom_point()

ggplot(tmp, aes(x = UMAP_1, y = UMAP_2, color = as.factor(igraph_clust))) + geom_point()

Ok, we’ll wrap it up here. There are plenty of nice visuals that Seurat gives you, but they are relatively straightforward, and not the point of this markdown. This point of this markdown is to give you a better feel for what is under the hood of a Seurat pipeline. When I have guts-level understanding, I generally feel more empowered when I’m doing the analysis at hand. Hopefully this markdown did that for you too.