You can know the name of a bird in all the languages of the
world, but when you’re finished, you’ll know absolutely nothing whatever
about the bird… So let’s look at the bird and see what it’s doing —
that’s what counts. I learned very early the difference between knowing
the name of something and knowing something.
Richard Feynman, “What Do You Care What Other People Think?”: Further
Adventures of a Curious Character
A typical single-cell analysis workflow (this includes flow cytometry and imaging) involves clustering the cells and then annotating the clusters. Annotation typically means figuring out which markers are specific to a given cluster relative to the other clusters, and then using prior biological knowledge to assign a cell type.
The advent of increasingly sophisticated LLMs has opened up the possibility of using them to help in the annotation process. The idea is that you take the genes and proteins that mark a particular cluster, feed them into an LLM, and ask it to make a guess as to what cell population the cluster is. You can pre-prompt it with as much biological context as you want (eg. this is a PBMC dataset for single-cell sequencing, with these conditions, and so forth). The biologist would then have an LLM “best guess” as to what cell subset the given cluster is.
The use of APIs to these models, or the prospect of running them locally as smaller models become more capable, brings forth the possibility of integrating the LLM calls directly into data analysis pipelines. Here is where we are going to focus.
In this markdown, we will annotate the PBMC 3k dataset with the help of an LLM called directly from R, such that the LLM’s output is captured as an R object and the annotation can take place in a completely automated fashion. After annotating the data, we will run the annotation tool a few more times to check the stability of the annotations.
We find that the LLM annotates the PBMCs accurately in comparison to ground truth, but we note that PBMC data are relatively simple, and it would be ideal if this type of workflow were battle tested across more difficult datasets. The main purpose of this markdown, at the time of writing, is to enable users worldwide both to utilize what is here and to battle test it on their more complex “real world” data.
Let’s get started by loading the PBMC 3k dataset and running it through Seurat’s guided clustering tutorial, which can be found here. If you want to see how you can do the analysis without Seurat, or understand what Seurat is doing under the hood, please go to my markdown here.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
##
## The following objects are masked from 'package:base':
##
## intersect, t
library(SeuratData)
## ── Installed datasets ──────────────────────────────── SeuratData v0.2.2.9001 ──
## ✔ cbmc 3.1.4 ✔ pbmc3k 3.1.4
## ✔ ifnb 3.0.0 ✔ stxBrain 0.1.1
## ✔ panc8 3.0.2
##
## ────────────────────────────────────── Key ─────────────────────────────────────
## ✔ Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## ❓ Unknown version of Seurat installed
set.seed(42)
pbmc <- SeuratData::LoadData("pbmc3k")
## Validating object structure
## Updating object slots
## Ensuring keys are in the proper structure
## Warning: Assay RNA changing from Assay to Assay
## Ensuring keys are in the proper structure
## Ensuring feature names don't have underscores or pipes
## Updating slots in RNA
## Validating object structure for Assay 'RNA'
## Object representation is consistent with the most current Seurat version
## Warning: Assay RNA changing from Assay to Assay5
And now we will run it through the standard pipeline.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
## Normalizing layer: counts
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
## Finding variable features for layer counts
pbmc <- ScaleData(pbmc, features = rownames(pbmc))
## Centering and scaling data matrix
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
## PC_ 1
## Positive: MALAT1, LTB, IL32, CD2, ACAP1, STK17A, CTSW, CD247, CCL5, GIMAP5
## AQP3, GZMA, CST7, TRAF3IP3, MAL, HOPX, ITM2A, GZMK, MYC, BEX2
## GIMAP7, ETS1, LDLRAP1, ZAP70, LYAR, RIC3, TNFAIP8, KLRG1, SAMD3, NKG7
## Negative: CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, FTH1, S100A9, FCER1G
## TYMP, CFD, LGALS1, CTSS, S100A8, SERPINA1, LGALS2, SPI1, IFITM3, PSAP
## CFP, SAT1, IFI30, COTL1, S100A11, NPC2, LGALS3, GSTP1, PYCARD, NCF2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA, HLA-DQB1, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DPB1, HLA-DMA, HLA-DQA2, HLA-DRB5, HLA-DPA1, HLA-DMB, FCRLA, HVCN1, LTB, BLNK
## KIAA0125, P2RX5, IRF8, IGLL5, SWAP70, ARHGAP24, SMIM14, PPP1R14A, FCRL2, C16orf74
## Negative: NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, CTSW, GNLY, GZMH, SPON2
## CCL4, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX, CTSC
## TTC38, S100A4, ANXA1, IL32, IGFBP7, ID2, ACTB, XCL1, APOBEC3G, SAMD3
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, CD74, HLA-DPA1, MS4A1, HLA-DRB1, HLA-DRB5
## HLA-DRA, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, HVCN1, FCRLA, IRF8, BLNK
## KIAA0125, SMIM14, PLD4, IGLL5, P2RX5, TMSB10, SWAP70, LAT2, MALAT1, IGJ
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, CA2, PTCRA, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, RUFY1, SEPT5, MPP1, CMTM5, TSC22D1, MYL9, RP11-367G6.3, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79A, CD79B, HIST1H2AC, HLA-DQB1, PF4, MS4A1, SDPR, CD74, PPBP
## HLA-DPB1, GNG11, HLA-DQA2, SPARC, HLA-DRB1, HLA-DPA1, GP9, TCL1A, HLA-DRA, LINC00926
## NRGN, RGS18, HLA-DRB5, PTCRA, CD9, AP001189.4, CA2, CLU, TUBB1, ITGA2B
## Negative: VIM, S100A8, S100A6, S100A4, S100A9, TMSB10, IL32, GIMAP7, LGALS2, S100A10
## RBP7, FCN1, MAL, LYZ, S100A12, MS4A6A, CD2, FYB, S100A11, FOLR3
## GIMAP4, AQP3, ANXA1, AIF1, MALAT1, GIMAP5, IL8, IFI6, TRABD2A, TMSB4X
## PC_ 5
## Positive: GZMB, FGFBP2, NKG7, GNLY, PRF1, CCL4, CST7, SPON2, GZMA, CLIC3
## GZMH, XCL2, CTSW, TTC38, AKR1C3, CCL5, IGFBP7, XCL1, CCL3, S100A8
## TYROBP, HOPX, CD160, HAVCR2, S100A9, FCER1G, PTGDR, LGALS2, RBP7, S100A12
## Negative: LTB, VIM, AQP3, PPA1, MAL, KIAA0101, CD2, CYTIP, CORO1B, FYB
## IL32, TRADD, ANXA5, TUBA1B, HN1, TYMS, PTGES3, ITM2A, COTL1, GPR183
## TNFAIP8, ACTG1, TRAF3IP3, ATP5C1, GIMAP4, ZWINT, PRDX1, LDLRAP1, ABRACL, NGFRAP1
pbmc <- FindNeighbors(pbmc, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2700
## Number of edges: 97892
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8719
## Number of communities: 9
## Elapsed time: 0 seconds
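Before moving on, it can be helpful to glance at how many cells landed in each cluster. This is not part of the tutorial, just a quick check: FindClusters() has set the object’s identities to the cluster IDs, so tabulating Idents() gives the cluster sizes.
# Quick check: number of cells per cluster
table(Idents(pbmc))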
We will then run UMAP to set up some visualizations:
pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 09:12:30 UMAP embedding parameters a = 0.9922 b = 1.112
## 09:12:30 Read 2700 rows and found 10 numeric columns
## 09:12:30 Using Annoy for neighbor search, n_neighbors = 30
## 09:12:30 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 09:12:30 Writing NN index file to temp file /var/folders/6y/qgvrbn_j2v35vfy946hlhh9c0000gn/T//RtmpR5ktcC/file171e042431dbe
## 09:12:30 Searching Annoy index using 1 thread, search_k = 3000
## 09:12:31 Annoy recall = 100%
## 09:12:31 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 09:12:31 Initializing from normalized Laplacian + noise (using RSpectra)
## 09:12:31 Commencing optimization for 500 epochs, with 107940 positive edges
## 09:12:34 Optimization finished
What we could do is feed the 2000-dimensional feature vectors into an LLM for annotation as an experiment, but for the sake of minimizing token use (and thus cost) we are going to utilize Seurat’s marker finder, which finds the markers in a given cluster that are significantly enriched in comparison to everything outside that cluster.
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
pbmc.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 1)
## # A tibble: 7,244 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 3.04e-104 1.11 0.895 0.592 4.16e-100 0 LDHB
## 2 1.10e- 81 2.35 0.432 0.111 1.51e- 77 0 CCR7
## 3 4.20e- 79 1.10 0.848 0.407 5.75e- 75 0 CD3D
## 4 2.53e- 48 2.12 0.336 0.108 3.46e- 44 0 PRKCQ-AS1
## 5 1.40e- 47 1.21 0.625 0.358 1.91e- 43 0 NOSIP
## 6 3.99e- 41 1.93 0.316 0.109 5.47e- 37 0 LEF1
## 7 5.75e- 37 1.37 0.42 0.188 7.89e- 33 0 PIK3IP1
## 8 1.12e- 32 2.42 0.185 0.045 1.53e- 28 0 FHIT
## 9 3.67e- 32 1.88 0.259 0.087 5.03e- 28 0 MAL
## 10 2.99e- 30 2.23 0.155 0.033 4.11e- 26 0 NELL2
## # ℹ 7,234 more rows
And from here, to set up our sanity checks, we are going to view Seurat’s built-in heatmap.
pbmc.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 1) %>%
  slice_head(n = 20) %>%
  ungroup() -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
And now we are ready for the LLM assisted annotation.
What we are going to do here is related to my command line chatbot project that I wrote about here. We could in theory take this command line chatbot and use R’s system() command to run it. I did indeed do this in an earlier draft of this markdown. However, a much more user-friendly way is to have the command for calling the LLM API directly in the workflow itself. I’ll show you how to do this later.
First let’s pull out the top genes for each cluster, which were stored earlier in the top10 object from pbmc.markers (despite the name, the code above kept up to 20 genes per cluster).
top10
## # A tibble: 180 × 7
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 3.04e-104 1.11 0.895 0.592 4.16e-100 0 LDHB
## 2 1.10e- 81 2.35 0.432 0.111 1.51e- 77 0 CCR7
## 3 4.20e- 79 1.10 0.848 0.407 5.75e- 75 0 CD3D
## 4 2.53e- 48 2.12 0.336 0.108 3.46e- 44 0 PRKCQ-AS1
## 5 1.40e- 47 1.21 0.625 0.358 1.91e- 43 0 NOSIP
## 6 3.99e- 41 1.93 0.316 0.109 5.47e- 37 0 LEF1
## 7 5.75e- 37 1.37 0.42 0.188 7.89e- 33 0 PIK3IP1
## 8 1.12e- 32 2.42 0.185 0.045 1.53e- 28 0 FHIT
## 9 3.67e- 32 1.88 0.259 0.087 5.03e- 28 0 MAL
## 10 2.99e- 30 2.23 0.155 0.033 4.11e- 26 0 NELL2
## # ℹ 170 more rows
As you can see, this gives us the top genes that were statistically significant in terms of marking a cluster in comparison to the cells outside that cluster. What we did below for the proof of concept is take the gene name and the average log2 fold change and use those for the annotation. But you can use whichever columns you want.
Ok, now we can feed this into an LLM, which we are going to show below. We set up a wrapper so we can run it multiple times, but you don’t have to do it like that. Note that the API call below takes a prompt as input. Note also that part of this involves subsetting the “top 10” tibble above by cluster, selecting the columns of interest, and then converting the whole thing into a string that can be fed into the LLM as a prompt.
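To make that string conversion concrete, here is a sketch of what gets built for a single cluster (cluster 0 is just an example), mirroring the steps inside the AnnotationWrapper() function defined further below:
# Mirror the string-building steps from AnnotationWrapper() for one cluster
ex <- dplyr::filter(top10, cluster == 0) %>%
  dplyr::select(avg_log2FC, gene) %>%
  as.data.frame()
ex <- format(ex, row.names = FALSE)
cat(paste(ex, collapse = "\n"))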
The prompt below works for the purpose of the proof of concept. But additional prompt engineering might help, especially for datasets that go beyond healthy PBMCs.
Below is the function that allows you to access the LLMs from OpenRouter. It takes as input the model choice and the prompt and returns the answer as output. Note that you need to sign up for an OpenRouter API key first. You can do that by going here.
user_api_key <- "your_api_key_here"
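If you prefer not to hard-code the key in your script, a safer pattern (a sketch, assuming you have added a line like OPENROUTER_API_KEY=... to your ~/.Renviron) is to read it from an environment variable:
# Sys.getenv() returns "" if the variable is unset, so check before making API calls
user_api_key <- Sys.getenv("OPENROUTER_API_KEY")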
# Load necessary libraries
library(httr)
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
# Define the function Chatbot that takes a model choice and a prompt as input.
Chatbot <- function(model_choice = "gpt3", user_question, api_key = user_api_key) {
  # Map the model choice to the corresponding OpenRouter model identifier
  if (model_choice == "gpt3") {
    model <- "openai/gpt-3.5-turbo"
  } else if (model_choice == "gpt4") {
    model <- "openai/gpt-4o-2024-11-20"
  } else if (model_choice == "deepseek") {
    model <- "deepseek/deepseek-r1"
  } else if (model_choice == "claude") {
    model <- "anthropic/claude-3.5-sonnet-20240620"
  } else {
    stop("Invalid model choice. Please choose 'gpt3', 'gpt4', 'deepseek', or 'claude'.")
  }
  # Define the API endpoint
  url <- "https://openrouter.ai/api/v1/chat/completions"
  # Create the JSON payload for the POST request
  payload <- list(
    model = model,
    messages = list(
      list(
        role = "user",
        content = user_question
      )
    )
  )
  # Convert the payload to JSON
  payload_json <- toJSON(payload, auto_unbox = TRUE)
  # Send the POST request to the API
  response <- POST(
    url,
    add_headers(
      Authorization = paste("Bearer", api_key),
      "Content-Type" = "application/json"
    ),
    body = payload_json
  )
  # Check for errors in the API response
  if (status_code(response) != 200) {
    stop("Error: Received status code ", status_code(response), "\n", content(response, "text"))
  }
  # Parse the JSON response
  result <- content(response, "parsed")
  # Return the chatbot's reply
  return(result$choices[[1]]$message$content)
}
# Example usage of the function:
answer <- Chatbot("claude", "What is better, Seurat or Scanpy?")
cat("Answer:", answer)
## Answer: It's difficult to say definitively which is "better" between Seurat and Scanpy, as they are both powerful and widely-used tools for single-cell RNA sequencing (scRNA-seq) analysis. The choice between them often depends on your specific needs, familiarity with programming languages, and the particulars of your project. Here are some points to consider:
##
## Seurat:
## 1. Written in R
## 2. Extensive documentation and large user community
## 3. Great for visualization and data integration
## 4. User-friendly for those familiar with R
## 5. Regularly updated with new features
##
## Scanpy:
## 1. Written in Python
## 2. Fast performance, especially for large datasets
## 3. Integrates well with machine learning libraries (e.g., sklearn, tensorflow)
## 4. Good for reproducibility and scalability
## 5. Part of the larger AnnData ecosystem
##
## Considerations:
## 1. Programming language preference: If you're more comfortable with R, Seurat might be easier. If you prefer Python, go with Scanpy.
##
## 2. Dataset size: Scanpy tends to handle very large datasets more efficiently.
##
## 3. Integration with other tools: Consider which other tools or libraries you might need to use in your analysis.
##
## 4. Specific analysis needs: Both have strengths in different areas of analysis.
##
## 5. Community and support: Both have active communities, but Seurat might have a slight edge in terms of resources and tutorials available.
##
## In practice, many researchers use both tools, depending on the specific task at hand. It's also worth noting that there are efforts to make these tools more interoperable, allowing users to switch between them more easily.
##
## Ultimately, the "better" tool is the one that best fits your specific needs, workflow, and expertise. It might be worth trying both to see which you prefer.
You’ll note that the LLM used below is Claude 3.5 Sonnet, which gives you very good bang for your buck at the time of writing. However, later models may be better, either in terms of overall ability to give accurate answers, or because they are more tailored to the life sciences. The chatbot allows for flexibility in this regard.
AnnotationWrapper <- function() {
  llm_annotations <- lapply(unique(top10$cluster), function(i) {
    # Note: despite its name, top10 holds up to 20 genes per cluster
    # (see the slice_head(n = 20) call above).
    curr_terms <- dplyr::filter(top10, cluster == i) %>%
      dplyr::select(avg_log2FC, gene) %>%
      as.data.frame()
    # Make the output into a string that can be fed into an LLM
    curr_terms <- format(curr_terms, row.names = FALSE)
    curr_terms <- paste(curr_terms, collapse = "\n")
    # Build the prompt. Feel free to adjust the wording as needed.
    prompt <- paste(
      "Given the following top 10 differentially expressed genes for a given cluster and their average log2 fold change in comparison to their expression in not that cluster, please provide your best guess as to what the cell type is. Don't provide details. Just respond with the name. Nothing else. If you can be more specific (eg. CD4 T cells as opposed to T cells) then please do. The data are as follows:",
      curr_terms
    )
    # shQuote() is a holdover from the command-line draft of this workflow;
    # it is harmless here but unnecessary for a pure API call.
    result <- Chatbot("claude", shQuote(prompt))
    return(result)
  }) %>% unlist()
  return(llm_annotations)
}
llm_annotations <- AnnotationWrapper()
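One practical note: API calls can fail transiently (rate limits, dropped connections), which is annoying in the middle of a loop over clusters. Below is a minimal retry sketch; ChatbotWithRetry is a name I am introducing here, and the tries and pause defaults are illustrative rather than tuned.
# Retry the API call a few times, pausing between attempts
ChatbotWithRetry <- function(model_choice, user_question, tries = 3, pause = 2) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(Chatbot(model_choice, user_question), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    Sys.sleep(pause)
  }
  stop("All ", tries, " attempts failed. Last error: ", conditionMessage(result))
}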
We will leverage the fact that the clustering we do here lines up with the clustering and annotations in the guided clustering tutorial, perhaps because the PBMC 3k dataset is sufficiently simple. Anyway, below are the annotations from the tutorial:
reference <- data.frame(
  old_cluster_id = c(0, 1, 2, 3, 4, 5, 6, 7, 8),
  markers = I(list(
    c("IL7R", "CCR7"),
    c("CD14", "LYZ"),
    c("IL7R", "S100A4"),
    c("MS4A1"),
    c("CD8A"),
    c("FCGR3A", "MS4A7"),
    c("GNLY", "NKG7"),
    c("FCER1A", "CST3"),
    c("PPBP")
  )),
  cell_type = c("Naive CD4+ T", "CD14+ Mono", "Memory CD4+ T", "B", "CD8+ T",
                "FCGR3A+ Mono", "NK", "DC", "Platelet")
)
reference$llm_cell_type <- llm_annotations
reference
## old_cluster_id markers cell_type llm_cell_type
## 1 0 IL7R, CCR7 Naive CD4+ T CD4 T cells
## 2 1 CD14, LYZ CD14+ Mono Monocytes
## 3 2 IL7R, S100A4 Memory CD4+ T CD4+ T cells
## 4 3 MS4A1 B B cells
## 5 4 CD8A CD8+ T CD8+ T cells
## 6 5 FCGR3A, .... FCGR3A+ Mono Monocytes
## 7 6 GNLY, NKG7 NK NK cells
## 8 7 FCER1A, CST3 DC Dendritic cells
## 9 8 PPBP Platelet Platelets
So it actually works! I’ve confirmed in a previous markdown that the clustering is sufficiently stable that the cell types from the guided clustering tutorial remain essentially the same, despite not having the exact random starting point from the tutorial.
But LLMs can return slightly different answers each time. So it’s important to look at the stability of the annotations. Let’s run it a bunch of times and see if we can find variation.
annotation_compare <- lapply(1:3, function(i) {
  result <- AnnotationWrapper()
  return(result)
}) %>% do.call(cbind, .) %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
And now we can visualize and compare.
annotation_compare
## # A tibble: 9 × 3
## V1 V2 V3
## <chr> <chr> <chr>
## 1 CD4+ T cells CD4+ T cells CD4 T cells
## 2 Monocytes Monocytes Monocytes
## 3 CD4+ T cells CD4+ T cells CD4 T cells
## 4 B cells B cells B cells
## 5 CD8+ T cells CD8+ T cells CD8+ T cells
## 6 Monocytes Monocytes Monocytes
## 7 NK cells NK cells NK cells
## 8 Plasmacytoid dendritic cells Dendritic cells Dendritic cells
## 9 Platelets Platelets Platelets
In some runs of this markdown, the LLM produced the same results for each of the three annotation runs above. In other runs, there were subtle differences, but no blatant inaccuracies. I would suggest that the LLM-based annotation be run multiple times, with the human being the final arbiter as to what the subset is.
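One way to automate that check is to flag clusters where the runs disagree after normalizing away cosmetic differences (eg. the “+” in “CD4+ T cells”). This is a sketch; the normalization rule is my assumption about which differences count as cosmetic.
# Treat labels as equal up to case and punctuation, then flag disagreements
normalize_label <- function(x) gsub("[^a-z0-9 ]", "", tolower(x))
annotation_compare$consistent <- apply(annotation_compare, 1, function(run_labels) {
  length(unique(normalize_label(run_labels))) == 1
})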
Now let’s visualize our annotations on a UMAP. First, we will add our annotations to the cell-level metadata of the Seurat object.
# Indexing by the seurat_clusters factor uses its underlying integer codes
# (1 through 9), so cluster 0 maps to row 1 of the reference table, and so on.
pbmc@meta.data$llm_annotation <- reference$llm_cell_type[pbmc@meta.data$seurat_clusters]
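As a quick sanity check that the mapping landed where we expect, cross-tabulating the clusters against the new annotation column should put all of each cluster’s cells in a single column:
# Each row (cluster) should have exactly one nonzero entry
table(pbmc@meta.data$seurat_clusters, pbmc@meta.data$llm_annotation)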
And now we do our plots. First by cluster number.
DimPlot(pbmc, group.by = "seurat_clusters")
Let’s look at Seurat’s built in annotations now:
DimPlot(pbmc, group.by = "seurat_annotations")
And we compare that to the LLM annotations:
DimPlot(pbmc, group.by = "llm_annotation")
We note that from a communication standpoint, it would be nice if the colors of the LLM annotation plot lined up with those of the Seurat annotation plot. So we will do that now.
First, we pull out the hex codes for the colors that are associated with the Seurat annotations.
p <- DimPlot(pbmc, group.by = "seurat_annotations")
p_build <- ggplot_build(p)
# This prints out details including scale information:
print(p_build$plot$scales$scales[[1]]$palette.cache)
## [1] "#F8766D" "#D39200" "#93AA00" "#00BA38" "#00C19F" "#00B9E3" "#619CFF"
## [8] "#DB72FB" "#FF61C3"
Then we create a custom color palette. This was done by eyeballing the Seurat annotation plot and matching the hex codes of the color palette above, which R conveniently colors for you.
# Define a custom color palette for the unique cell types
custom_colors <- c(
"CD4 T cells" = "#F8766D",
"Monocytes" = "#93AA00",
"CD4+ T cells" = "#F8766D",
"B cells" = "#00BA38",
"CD8+ T cells" = "#00C19F",
"NK cells" = "#619CFF",
"Dendritic cells"= "#DB72FB",
"Platelets" = "#FF61C3"
)
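If you would rather not eyeball the colors, the mapping can also be derived programmatically, since ggplot2’s default discrete palette is scales::hue_pal() applied in factor-level order. The sketch below assumes that seurat_annotations is a factor whose level order matches the legend, and that each LLM label pairs with the Seurat label of the same cluster in the reference table:
# Rebuild ggplot2's default palette for the Seurat annotation levels,
# then carry each cluster's color over to its LLM label
seurat_levels <- levels(pbmc@meta.data$seurat_annotations)
seurat_cols <- scales::hue_pal()(length(seurat_levels))
names(seurat_cols) <- seurat_levels
auto_colors <- seurat_cols[as.character(reference$cell_type)]
names(auto_colors) <- reference$llm_cell_type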
And now we make the plot of the LLM annotations that are colored by the custom color palette that we made.
DimPlot(pbmc, group.by = "llm_annotation", cols = custom_colors)
The proof of concept experiment here suggests that, at least for the PBMC 3k dataset, Claude 3.5 Sonnet is sufficient to annotate clusters, so long as it is given the uniquely expressed genes that mark a given cluster and their average log2 fold change. The annotations are largely stable across multiple runs.
The PBMC 3k dataset is relatively simple: healthy PBMCs. It might be that more complex datasets, where there is less “training data” for the LLMs (eg. cancer), could prove challenging for this type of annotation. This has yet to be seen. Nonetheless, having an LLM give you a “first guess” as to what a subset might be is a useful start.
If imperfections are found when this workflow is battle tested in more complex datasets, we note that the user can adjust the prompt to provide additional prior info (eg. what type of dataset is it, corresponding to what patients, and what cell types do you know are in there). In other words, there might be a prompt engineering exercise that must be done with more complex “real world” data.
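In code, that adjustment is just a matter of prepending context to the per-cluster prompt built inside AnnotationWrapper(). A sketch, with wording that is purely illustrative:
# Dataset-level context to prepend to the prompt built in AnnotationWrapper()
context <- paste(
  "This is a single-cell RNA-seq dataset of tumor-infiltrating immune cells",
  "from treated melanoma patients. Expected populations include T cells,",
  "NK cells, myeloid cells, and malignant cells."
)
prompt <- paste(context, prompt)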
Furthermore, if you do set up the OpenRouter API and use the code above, you will be able to test multiple models. At the time of writing, Deepseek R1, a reasoning model, is also available to try. Later models available could very well be much better. This workflow is robust to new developments, such that the user can try new chatbots that are available through the API as they come. What does not work now could very well work in a few months or years.
For the reader, I would like you to try this on your data. The above example is PBMCs, and as far as single-cell RNA sequencing data goes, this is a relatively simple dataset. If we move into datasets with developmental trajectories, or cancer data, and so forth, it is likely that the LLMs will trip up, in which case we might have to experiment with some additional prompt engineering or more fine-tuned models. Either way, I would like to know specifically where the LLM makes mistakes, so we have an idea of where to put in work.
Zooming out, the use of LLMs in workflows directly in R is, at the time of writing, relatively unexplored. But I see a future where packages will show up that have LLM-leveraged biological interpretation as an option directly within R and python packages. So the workflow I present here could very well be a harbinger of things to come (this sentence was written on February 7, 2025).