Annotate PBMC 3k

Let’s run Seurat’s guided clustering tutorial on the PBMC 3k dataset so we can manually annotate it for later.

library(tidyverse)
library(Seurat)
library(SeuratData)
set.seed(42)

We will load the unannotated Seurat object using the SeuratData package.

pbmc <- SeuratData::LoadData("pbmc3k")

## Warning: Assay RNA changing from Assay to Assay

## Warning: Assay RNA changing from Assay to Assay5

Given that this dataset has been run and validated already in the guided clustering tutorial, we will go ahead and run all the commands here with the default values from the tutorial. We note that what will change, relevant to us, is the clustering scheme. Thus, it is possible that their hard-coded annotations might not work for us (spoiler alert: it turns out that they do).

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
pbmc <- ScaleData(pbmc, features = rownames(pbmc))
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2700
## Number of edges: 97892
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8719
## Number of communities: 9
## Elapsed time: 0 seconds

pbmc <- RunUMAP(pbmc, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

Now we run Seurat’s FindAllMarkers function, which finds the top differentially expressed genes between cluster X and cluster not-X, for each cluster.

# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
pbmc.markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1)

## # A tibble: 7,244 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
##  1 3.04e-104       1.11 0.895 0.592 4.16e-100 0       LDHB     
##  2 1.10e- 81       2.35 0.432 0.111 1.51e- 77 0       CCR7     
##  3 4.20e- 79       1.10 0.848 0.407 5.75e- 75 0       CD3D     
##  4 2.53e- 48       2.12 0.336 0.108 3.46e- 44 0       PRKCQ-AS1
##  5 1.40e- 47       1.21 0.625 0.358 1.91e- 43 0       NOSIP    
##  6 3.99e- 41       1.93 0.316 0.109 5.47e- 37 0       LEF1     
##  7 5.75e- 37       1.37 0.42  0.188 7.89e- 33 0       PIK3IP1  
##  8 1.12e- 32       2.42 0.185 0.045 1.53e- 28 0       FHIT     
##  9 3.67e- 32       1.88 0.259 0.087 5.03e- 28 0       MAL      
## 10 2.99e- 30       2.23 0.155 0.033 4.11e- 26 0       NELL2    
## # ℹ 7,234 more rows

We can visualize this in a heatmap, as is done in Seurat’s tutorial.

pbmc.markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = 20) %>%
    ungroup() -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

The guided clustering tutorial presents us with this table, in text form, which shows the clusters from their run, and cell types chosen based on hallmark markers.

Cluster ID Markers Cell Type 0 IL7R, CCR7 Naive CD4+ T 1 CD14, LYZ CD14+ Mono 2 IL7R, S100A4 Memory CD4+ 3 MS4A1 B 4 CD8A CD8+ T 5 FCGR3A, MS4A7 FCGR3A+ Mono 6 GNLY, NKG7 NK 7 FCER1A, CST3 DC 8 PPBP Platelet

Here, we take that table and convert it into a data frame that can be used in R.

reference <- data.frame(
  old_cluster_id = c(0, 1, 2, 3, 4, 5, 6, 7, 8),
  markers = I(list( c("IL7R","CCR7"),
                    c("CD14","LYZ"),
                    c("IL7R","S100A4"),
                    c("MS4A1"),
                    c("CD8A"),
                    c("FCGR3A","MS4A7"),
                    c("GNLY","NKG7"),
                    c("FCER1A","CST3"),
                    c("PPBP")
                 )),
  cell_type = c("Naive CD4+ T","CD14+ Mono","Memory CD4+ T","B","CD8+ T",
                "FCGR3A+ Mono","NK","DC","Platelet")
)

Now, we loop each cluster through this table. This is not in the tutorial. I added it to make things a bit more convenient for us.

This produces a list of data frames, where for each of the cell subsets in their table above, we can see what hallmark gene is associated with which of our clusters by what p-value. This can in turn allow us to assign populations to our clustering scheme.

new_clusters <- lapply(seq(nrow(reference)), function(i) {
  curr <- reference[i,]
  curr_markers <- curr$markers[[1]]
  new_clust <- dplyr::filter(pbmc.markers, gene %in% curr_markers) %>% arrange(p_val_adj)
  return(new_clust)
})

names(new_clusters) <- reference$cell_type

And now we visualize each of these data frames.

new_clusters

## $`Naive CD4+ T`
##              p_val avg_log2FC pct.1 pct.2    p_val_adj cluster gene
## CCR7  1.101828e-81  2.3474252 0.432 0.111 1.511047e-77       0 CCR7
## IL7R1 5.971448e-58  1.3300602 0.732 0.333 8.189244e-54       2 IL7R
## IL7R  6.439169e-41  0.9685285 0.611 0.329 8.830677e-37       0 IL7R
## IL7R2 3.196768e-06  0.6344525 0.478 0.391 4.384047e-02       4 IL7R
## CCR71 4.542487e-04  0.1845648 0.266 0.179 1.000000e+00       2 CCR7
## 
## $`CD14+ Mono`
##              p_val avg_log2FC pct.1 pct.2     p_val_adj cluster gene
## CD14 4.401366e-294   5.795209 0.660 0.029 6.036033e-290       1 CD14
## LYZ  9.727454e-271   4.671834 0.998 0.517 1.334023e-266       1  LYZ
## LYZ1  1.487422e-13   1.800079 0.972 0.599  2.039851e-09       7  LYZ
## 
## $`Memory CD4+ T`
##                 p_val avg_log2FC pct.1 pct.2     p_val_adj cluster   gene
## S100A4  1.271386e-183  1.7073387 1.000 0.771 1.743579e-179       1 S100A4
## S100A41  1.734493e-60  1.5250490 1.000 0.801  2.378684e-56       5 S100A4
## IL7R1    5.971448e-58  1.3300602 0.732 0.333  8.189244e-54       2   IL7R
## IL7R     6.439169e-41  0.9685285 0.611 0.329  8.830677e-37       0   IL7R
## IL7R2    3.196768e-06  0.6344525 0.478 0.391  4.384047e-02       4   IL7R
## 
## $B
##       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster  gene
## MS4A1     0   5.737242  0.86 0.052         0       3 MS4A1
## 
## $`CD8+ T`
##              p_val avg_log2FC pct.1 pct.2     p_val_adj cluster gene
## CD8A 1.402222e-106   3.376768 0.469 0.069 1.923008e-102       4 CD8A
## 
## $`FCGR3A+ Mono`
##                 p_val avg_log2FC pct.1 pct.2     p_val_adj cluster   gene
## MS4A71  2.798733e-193   4.422412 0.824 0.072 3.838183e-189       5  MS4A7
## FCGR3A  5.511127e-184   4.091754 0.975 0.135 7.557959e-180       5 FCGR3A
## FCGR3A1 2.084178e-113   2.801622 0.858 0.145 2.858241e-109       6 FCGR3A
## MS4A7    1.358770e-25   1.022746 0.259 0.085  1.863418e-21       1  MS4A7
## 
## $NK
##               p_val avg_log2FC pct.1 pct.2     p_val_adj cluster gene
## NKG7  4.071978e-200   2.496894 0.947 0.210 5.584311e-196       4 NKG7
## GNLY1 1.512309e-170   5.272130 0.939 0.135 2.073980e-166       6 GNLY
## NKG71 4.875700e-127   4.111003 0.986 0.263 6.686535e-123       6 NKG7
## GNLY   3.992225e-15   1.261836 0.319 0.159  5.474938e-11       4 GNLY
## 
## $DC
##                p_val avg_log2FC pct.1 pct.2     p_val_adj cluster   gene
## CST3   2.267524e-275   3.210536 0.992 0.266 3.109683e-271       1   CST3
## FCER1A 1.428583e-267   8.042772 0.833 0.009 1.959158e-263       7 FCER1A
## CST31   1.144484e-72   1.936773 1.000 0.360  1.569545e-68       5   CST3
## CST32   6.655884e-24   2.585181 1.000 0.390  9.127880e-20       7   CST3
## 
## $Platelet
##              p_val avg_log2FC pct.1 pct.2     p_val_adj cluster gene
## PPBP 4.238974e-116   11.25641     1 0.024 5.813329e-112       8 PPBP

And from here we assign the clusters.

They all line up, but the Memory CD4 T cells are a bit ambiguous. We keep this in mind. For the reader, it might help to try a couple of different seeds to see if you can get a better split between the Naive and Memory CD4 T’s. For the purposes of my current project, this will do.

How the cluster IDs are assigned is below, and this is taken straight out of the tutorial. And then we visualize the UMAP.

new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

We note from looking at the final annotated UMAP image in the guided clustering tutorial that they are comparable. You can see that if we go back to the split between the Naive and Memory CD4 T cells, the split looks a bit arbitrary. Again, it might be worthwhile to try a couple of different random seeds to see how the split changes. It might also be worthwhile to visualize some CD4 T cell markers and see how they lay out on that “island.”

Now, we will save the annotated Seurat object as a RDS file for later use.

saveRDS(pbmc, "annotated_pbmc3k.rds")

Annotate PBMC 3k

2025-01-21