Home


Rocky: I see three of him!
Duke: Right! Hit the one in the middle.

Rocky 4, boxing match versus Drago.


Flow cytometry and CyTOF users cluster on markers that define type (eg. cell surface markers), and never do so on markers that define state (eg. phosphoproteins). You can’t make this distinguishment in single-cell RNA seq data. Accordingly, you have to integrate the datasets between conditions. There are many ways to do this, but you can think of it as computationally identifying genes that don’t change between the two conditions (comparable to surface markers in flow/CyTOF), and then using those to align the two conditions.

The following markdown shows what a particular single-cell dataset looks like when integrated versus un-integrated. The data are PBMCs either at rest of stimulated with IFNb. I show that un-integrated data, when clustered and viewed on a UMAP, looks like there are two sub-clusters for each cell subset, one corresponding to resting and IFNb stimulated cells.

Failure to integrate the data and understand this issue could lead research teams to falsely conclude that some experimental condition produced novel cell subsets, as opposed to just changed the gene expression patterns of the same subsets. You’ll see in the UMAPs below. Thus, this markdown needs to be understood by bioinformaticians, biologists, and leaders alike.

This markdown is an extension of this Seurat vignette on data integration, where I show that the analysis of the un-integrated data look like.

First, we import the data. Let’s get started with the relevant libraries. Notice we’re not importing data from files. Rather, we’re using the SeuratData package.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr     1.1.2     âś” readr     2.1.4
## âś” forcats   1.0.0     âś” stringr   1.5.0
## âś” ggplot2   3.4.2     âś” tibble    3.2.1
## âś” lubridate 1.9.2     âś” tidyr     1.3.0
## âś” purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## â„ą Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Seurat)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
## Attaching SeuratObject
library(SeuratData)
## ── Installed datasets ───────────────────────────────────── SeuratData v0.2.2 ──
## âś” ifnb   3.0.0                          âś” pbmc3k 3.1.4
## 
## ────────────────────────────────────── Key ─────────────────────────────────────
## âś” Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## âť“ Unknown version of Seurat installed

There were some issues with getting the “ifnb” dataset of interest from the SeuratData package as specified in the vignette. This was discussed here, where the following solution was found.

install.packages("https://seurat.nygenome.org/src/contrib/ifnb.SeuratData_3.0.0.tar.gz", repos = NULL, type = "source") 

library(ifnb.SeuratData)

And let’s pre-process it. Note that the normalization and the finding of variable genes happened in each stimulation condition separately. When I operate with the un-integrated data later, I have to find the variable genes again for the combined dataset.

# load dataset
data("ifnb")

# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by = "stim")

# normalize and identify variable features for each dataset independently
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = ifnb.list)

Now we integrate the data. This topic is a rabbit hole, but the built-in integration tool for Seurat is sufficient to prove my point.

immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 16393 anchors
## Filtering anchors
##  Retained 6756 anchors
# this command creates an 'integrated' data assay
immune.combined <- IntegrateData(anchorset = immune.anchors)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data

Now we progress to the rest of the pipeline, from the clustering to UMAP.

# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.combined) <- "integrated"

# Run the standard workflow for visualization and clustering
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 10:19:52 UMAP embedding parameters a = 0.9922 b = 1.112
## 10:19:52 Read 13999 rows and found 30 numeric columns
## 10:19:52 Using Annoy for neighbor search, n_neighbors = 30
## 10:19:52 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 10:19:53 Writing NN index file to temp file /var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T//RtmpI68ehB/filee30a659a0b45
## 10:19:53 Searching Annoy index using 1 thread, search_k = 3000
## 10:19:56 Annoy recall = 100%
## 10:19:56 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 10:19:56 Initializing from normalized Laplacian + noise (using irlba)
## 10:19:57 Commencing optimization for 200 epochs, with 617778 positive edges
## 10:20:02 Optimization finished
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 13999
## Number of edges: 569377
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9055
## Number of communities: 15
## Elapsed time: 2 seconds

And we visualize.

# Visualization
p1 <- DimPlot(immune.combined, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined, reduction = "umap", label = TRUE, repel = TRUE)
p1 + p2

Notice that the unstimulated and stimulated conditions are sitting on top of each other. Thus, we can rightly go into each cell subset and do our differential gene expression analysis.

As a side note, if you wanted a little more understanding of the percent of cells corresponding to one condition or the other at each little region on the map, please look at my KNN solution here, and examine this pre-print.

And now let’s repeat the pipeline for un-integrated data.

# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.combined) <- "RNA"

# Run the standard workflow for visualization and clustering
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- FindVariableFeatures(immune.combined, selection.method = "vst", nfeatures = 2000) # Had to add this
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
## 10:20:19 UMAP embedding parameters a = 0.9922 b = 1.112
## 10:20:19 Read 13999 rows and found 30 numeric columns
## 10:20:19 Using Annoy for neighbor search, n_neighbors = 30
## 10:20:19 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 10:20:20 Writing NN index file to temp file /var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T//RtmpI68ehB/filee30a5a3e3d60
## 10:20:20 Searching Annoy index using 1 thread, search_k = 3000
## 10:20:23 Annoy recall = 100%
## 10:20:23 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 10:20:23 Initializing from normalized Laplacian + noise (using irlba)
## 10:20:23 Commencing optimization for 200 epochs, with 614408 positive edges
## 10:20:29 Optimization finished
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 13999
## Number of edges: 558036
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9257
## Number of communities: 16
## Elapsed time: 1 seconds

And we visualize.

# Visualization
p1 <- DimPlot(immune.combined, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined, reduction = "umap", label = TRUE, repel = TRUE)
p1 + p2

Notice that there are two sub-clusters for each major cell subset now. This would potentially lead to faulty biological interpretations downstream, just as clustering and making a UMAP on phospho-proteins potentially would for flow and CyTOF data.

Thus, for all single-cell sequencing datasets that involve comparing stimulation conditions, or pre and post drug treatment on patients, or different batches, please take data integration seriously. If you’re a biologist and/or leader who works with a bioinformatics team, please look critically at the visualizations on this markdown, so you can spot technical artifacts from un-integrated data when they are presented to you down the line. Doing so could save research teams lots of time and effort otherwise chasing false conclusions.