Feeding PCA loadings into LLM for biological interpretations

Home

Images and sounds, symbols and equations, flooded through the orphan’s classifying networks, leaving behind, not the fine details – not the spacesuited figure standing on grey-and-white rock against a pitch black sky; not the calm, naked figure disintegrating beneath a grey swarm of nanomachines – but an imprint of the simplest regularities, the most common associations. The networks discovered the circle/sphere: in images of the sun and planets, in iris and pupil, in fallen fruit, in a thousand different artworks, artifacts, and mathematical diagrams. They discovered the linear word for “person”, and bound it tentatively both to the regularities which defined the gestalt icon for “citizen”, and to the features they found in common among the many images of fleshers and gleisner robots.

Greg Egan, Diaspora

Introduction

Principal component analysis (PCA) is a linear dimensionality reduction technique, crucial for bioinformatics and single-cell RNA sequencing data (scRNA-seq) because of its ability to condense a huge gene space into a small set of meaningful axes (principal components), which capture the maximum amount of variance in the data.

While you can do PCA in a number of ways, the principal components are the eigenvectors of the covariance matrix. The data points (in our case, cells) from the data matrix are then projected onto the new PC space, and this reduced space is what is typically used for clustering, nonlinear dimensionality reduction (e.g. UMAP), and so forth.

PCA loadings are the weights (or coefficients) that define each PC as a linear combination of all of the original variables, which in the case of scRNA-seq data, would be the genes. Each gene has a loading, or coefficient, attached to it, and when sorted by magnitude, we can determine which genes contribute most to the variance captured by a given PC.

Plotting the PCs against each other (PC1 vs. PC2 or PC2 vs. PC3, for example) is a powerful way to visualize the main axes of variance across the samples, and may reveal different cell types, states, subpopulations, or batch effects.

In this report, we intend to cover two things that we think are missing in the first. The first is that the advent of LLMs now make it easier to biologically interpret the genes associated with the PC loadings in a single-cell/spatial dataset. We will show how to do this. The second is that while PC1 and PC2 are typically visualized, additional PCs carry meaningful information too, especially when paired with the bioloical interpretation of the loadings.

We hypothesize that there is crucial information to be found hidden within the loadings that may help determine the variance captured by a PC. Aided by the compute power and time efficiency of LLMs, it can take almost no time at all to analyze trends among the genes that have the highest magnitude coefficients for a given PC to check for shared functions or co-regulation patterns. These insights will hopefully help better understand the variance captured by any given PC.

Data uploading and processing

First, let’s get the libraries and random seed up.

library(jsonlite)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Seurat)

## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.5.0 but the current version is
## 4.5.1; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 
## Attaching package: 'SeuratObject'
## 
## The following objects are masked from 'package:base':
## 
##     intersect, t

library(SeuratData)

## ── Installed datasets ──────────────────────────────── SeuratData v0.2.2.9002 ──
## ✔ pbmc3k 3.1.4                          
## 
## ────────────────────────────────────── Key ─────────────────────────────────────
## ✔ Dataset loaded successfully
## ❯ Dataset built with a newer version of Seurat than installed
## ❓ Unknown version of Seurat installed

set.seed(42)

Installing and initializing data

SeuratData::InstallData("pbmc3k")

## Warning: The following packages are already installed and will not be
## reinstalled: pbmc3k

pbmc <- LoadData("pbmc3k")

## Validating object structure

## Updating object slots

## Ensuring keys are in the proper structure

## Warning: Assay RNA changing from Assay to Assay

## Ensuring keys are in the proper structure

## Ensuring feature names don't have underscores or pipes

## Updating slots in RNA

## Validating object structure for Assay 'RNA'

## Object representation is consistent with the most current Seurat version

## Warning: Assay RNA changing from Assay to Assay5

Normalizing, scaling, and removing variable features

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

## Normalizing layer: counts

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

## Finding variable features for layer counts

pbmc <- ScaleData(pbmc, features = rownames(pbmc))

## Centering and scaling data matrix

Running PCA

# Set PCs variable to desired number of PCs
PCs <- 20
pbmc <- RunPCA(pbmc, npcs = PCs, features = VariableFeatures(object = pbmc))

## PC_ 1 
## Positive:  MALAT1, LTB, IL32, CD2, ACAP1, STK17A, CTSW, CD247, CCL5, GIMAP5 
##     AQP3, GZMA, CST7, TRAF3IP3, MAL, HOPX, ITM2A, GZMK, MYC, BEX2 
##     GIMAP7, ETS1, LDLRAP1, ZAP70, LYAR, RIC3, TNFAIP8, KLRG1, SAMD3, NKG7 
## Negative:  CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, FTH1, S100A9, FCER1G 
##     TYMP, CFD, LGALS1, CTSS, S100A8, SERPINA1, LGALS2, SPI1, IFITM3, PSAP 
##     CFP, SAT1, IFI30, COTL1, S100A11, NPC2, LGALS3, GSTP1, PYCARD, NCF2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA, HLA-DQB1, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DPB1, HLA-DMA, HLA-DQA2, HLA-DRB5, HLA-DPA1, HLA-DMB, FCRLA, HVCN1, LTB, BLNK 
##     KIAA0125, P2RX5, IRF8, IGLL5, SWAP70, ARHGAP24, SMIM14, PPP1R14A, FCRL2, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, CTSW, GNLY, GZMH, SPON2 
##     CCL4, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX, CTSC 
##     TTC38, S100A4, ANXA1, IL32, IGFBP7, ID2, ACTB, XCL1, APOBEC3G, SAMD3 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, CD74, HLA-DPA1, MS4A1, HLA-DRB1, HLA-DRB5 
##     HLA-DRA, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, HVCN1, FCRLA, IRF8, BLNK 
##     KIAA0125, SMIM14, PLD4, IGLL5, P2RX5, TMSB10, SWAP70, LAT2, MALAT1, IGJ 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, CA2, PTCRA, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, RUFY1, SEPT5, MPP1, CMTM5, TSC22D1, MYL9, RP11-367G6.3, GP1BA 
## PC_ 4 
## Positive:  VIM, S100A8, S100A6, S100A4, S100A9, TMSB10, IL32, GIMAP7, LGALS2, S100A10 
##     RBP7, FCN1, MAL, LYZ, S100A12, MS4A6A, CD2, FYB, S100A11, FOLR3 
##     GIMAP4, AQP3, ANXA1, AIF1, MALAT1, GIMAP5, IL8, IFI6, TRABD2A, TMSB4X 
## Negative:  HLA-DQA1, CD79A, CD79B, HIST1H2AC, HLA-DQB1, PF4, MS4A1, SDPR, CD74, PPBP 
##     HLA-DPB1, GNG11, HLA-DQA2, SPARC, HLA-DRB1, HLA-DPA1, GP9, TCL1A, HLA-DRA, LINC00926 
##     NRGN, RGS18, HLA-DRB5, PTCRA, CD9, AP001189.4, CA2, CLU, TUBB1, ITGA2B 
## PC_ 5 
## Positive:  LTB, VIM, AQP3, PPA1, MAL, KIAA0101, CD2, CYTIP, CORO1B, FYB 
##     IL32, TRADD, ANXA5, TUBA1B, HN1, TYMS, PTGES3, ITM2A, COTL1, GPR183 
##     TNFAIP8, ACTG1, TRAF3IP3, ATP5C1, GIMAP4, ZWINT, PRDX1, LDLRAP1, ABRACL, NGFRAP1 
## Negative:  GZMB, FGFBP2, NKG7, GNLY, PRF1, CCL4, CST7, SPON2, GZMA, CLIC3 
##     GZMH, XCL2, CTSW, TTC38, AKR1C3, CCL5, IGFBP7, XCL1, CCL3, S100A8 
##     TYROBP, HOPX, CD160, HAVCR2, S100A9, FCER1G, PTGDR, LGALS2, RBP7, S100A12

LLM query function call

We are writing a function which takes in an API key and a search query and returns the LLM response to the given query.

# Retrieving saved API key from .Renviron 
user_api_key <- Sys.getenv("OPENAI_API_KEY")

call_chat <- function(query, api_key) {
  body <- toJSON(list(
    model = "o4-mini-2025-04-16", # change to desired model name
    messages = list(
      list(role = "user", content = query)
    )
  ), auto_unbox = TRUE)
  
  # Write request body to a temporary JSON file
  json_file <- tempfile(fileext = ".json")
  writeLines(body, json_file)
  auth_header <- paste0("-H \"Authorization: Bearer ", api_key, "\"")
  query_input <- paste0("-d @", shQuote(json_file))

  # Construct the cURL command
  curl_cmd <- paste(
    "curl -s",
    "https://api.openai.com/v1/chat/completions",
    "-H \"Content-Type: application/json\"",
    auth_header,
    query_input
  )
  
  response_raw <- system(curl_cmd, intern = TRUE)
  response_str <- paste(response_raw, collapse = "\n")
  response_parsed <- fromJSON(response_str)
  return(response_parsed[5])
}

Gathering LLM responses

We are now going to extract the top 30 loadings from each PC, plug them into a search query, and feed it to an LLM, asking for a biological interpretation of the PC. We save all of the responses to an output file.

# general prompt structure for each query call
prompt <- "I ran a PCA on single cell RNA expression data. %s. Given the top %.0f genes with the highest magnitude loadings from PC%.0f, please give me a biological interpretation of this PC. Here are the top genes with their corresponding weights: %s"

# if there is any important study context the LLM would benefit from knowing, insert here
additional_info = "It's scRNA-seq data from PBMCs"

# saving an output file for the LLM responses to be recorded to 
output_file <- "PCA_LLM_response_30_genes.txt"
write("", file = output_file)

# printing out the LLM responses for the first 3 PCs
for (i in 1:3) {
  gene_num = 30
  top_genes <- names(sort(abs(Loadings(pbmc[["pca"]])[,i]), decreasing = TRUE)[1:gene_num])
  top_genes <- Loadings(pbmc[["pca"]])[,1][top_genes]

  gene_list <- ""
  for (j in 1:gene_num) {
    gene_list <- paste(gene_list, names(top_genes[j]), ": ", top_genes[j], ", ", sep = "")
  }
  gene_list <- substr(gene_list, 1, nchar(gene_list) - 2)
  
  query <- sprintf(prompt, additional_info, gene_num, i, gene_list)
  
  answer <- call_chat(query, user_api_key)
  print(answer)
  cat(sprintf("PC%d:\n%s\n\n", i, answer), file = output_file, append = TRUE)
}

## $choices
##   index message.role
## 1     0    assistant
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              message.content
## 1 Across PBMC single‐cell data, PC1 is overwhelmingly driven by classic myeloid/monocyte markers and innate‐immune–activation genes.  Nearly every gene with a large (negative) loading on PC1—CST3, TYROBP, LST1, AIF1, FCN1, LYZ, S100A8/A9, FTL/FTH1, CTSS, FCER1G, CFP/CFD, SPI1, PYCARD, etc.—is either (a) highly and specifically expressed in monocytes (especially CD14+ “classical” monocytes), (b) involved in phagocytosis/antigen processing (e.g. CTSS, LYZ), (c) an acute‐phase or complement component (FTL/FTH1, CFD, CFP), or (d) part of the inflammatory/S100 alarmin response (S100A8/A9).  In contrast, MALAT1 is the only one of the top 30 with a positive loading, and it is more ubiquitously expressed (often higher in lymphoid cells).  \n\nBiological interpretation: PC1 captures a myeloid versus lymphoid (or nonactivated) axis in the PBMC pool.  Cells scoring strongly negative on PC1 are monocytes or monocyte‐like cells with high innate‐immune activation, whereas cells scoring positive are largely non‐myeloid (e.g. T, B, NK) and lack this inflammatory/monocyte signature.  
##   message.refusal message.annotations finish_reason
## 1              NA                NULL          stop
## 
## $choices
##   index message.role
## 1     0    assistant
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     message.content
## 1 Here the strongest loadings split neatly into two camps:\n\n1. Positive weights – cytotoxic/lymphoid genes  \n   • NKG7, PRF1, GNLY, GZMA, GZMB, GZMH, CST7, CTSW → core perforin/granzyme machinery of NK cells and CD8 T-cells  \n   • CD247 (CD3ζ), CCL4, CCL5, XCL2 → T-cell/NK chemokines and receptor complex  \n   • FGFBP2, SPON2, HOPX, etc. → additional NK/T-cell–enriched transcripts  \n   • MS4A1, CD79A/B, TCL1A → B-cell markers  \n\n2. Negative weights – myeloid/antigen-presenting cell genes  \n   • HLA-DRA, HLA-DQA1, HLA-DQB1 → MHC-II antigen-presentation  \n   • FCGR3A (CD16), S100A4, SRGN, CTSC → monocyte/DC/granulocyte–associated  \n\nBiological interpretation of PC2  \n• PC2 captures a lymphoid vs. myeloid axis.  Cells with high PC2 scores express cytotoxic-lymphocyte (NK/CD8 T) and B-cell programs, whereas cells with low PC2 scores express monocyte/DC/MHC-II programs.  \n• In practice, PC2 is separating cytotoxic effector cells (and B-cells to a lesser extent) from antigen-presenting myeloid cells in your PBMC dataset.
##   message.refusal message.annotations finish_reason
## 1              NA                NULL          stop
## 
## $choices
##   index message.role
## 1     0    assistant
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           message.content
## 1 Here is one way to see what PC-3 is “doing” biologically.  If you look at the sign and the identity of the top loading genes you can split them into three rough modules:\n\n  1.  A strong platelet/megakaryocyte signature (all negative loadings):  \n     •  PPBP, PF4, GP9, ITGA2B, TUBB1, RGS18, SDPR, SPARC, GNG11, TREML1  \n     These are canonical platelet‐granule or megakaryocyte markers.  \n\n  2.  A canonical MHC‐II / professional antigen‐presentation signature (also negative):  \n     •  HLA-DPA1, HLA-DPB1, HLA-DRB1, HLA-DQA1, HLA-DQB1, CD74  \n     (and to a lesser extent CLU, CA2, PTCRA…).  \n\n  3.  A classic B-cell signature (positive loadings):  \n     •  CD79A, MS4A1 (CD20), CD79B  \n     •  NGFRAP1, NRGN (these last two are less “textbook” but track with the B-cell cluster in many PBMC sets)  \n\nBecause the sign of a PC is arbitrary, what really matters is that cells which highly express the platelet or MHC-II modules will score strongly in one direction on PC-3, while cells that express the B-cell receptor complex (CD79A/B, MS4A1) will score in the opposite direction.  In other words, PC-3 is capturing a three‐way split in your PBMC population:\n\n  –  Platelet‐like cells (high PPBP/PF4, etc.)  \n  –  Professional APCs / monocytes/dendritic cells (high HLA-II, CD74)  \n  –  B lymphocytes (high CD79A/B, MS4A1)  \n\nViewed more simply, PC-3 is largely a “B-cell vs non–B-cell” axis, with the non–B-cell end further enriched for both platelets and classical APCs.  This is a very common pattern in unsorted PBMC scRNA-seq: one of the top PCs ends up being the cell‐type axis that contrasts lymphoid subtypes against the myeloid/platelet side of the fence.
##   message.refusal message.annotations finish_reason
## 1              NA                NULL          stop

# iterating through each principle component and creating a list of the top 30 genes to put into the LLM query
for (i in 3:PCs) {
  gene_num = 30
  top_genes <- names(sort(abs(Loadings(pbmc[["pca"]])[,i]), decreasing = TRUE)[1:gene_num])
  top_genes <- Loadings(pbmc[["pca"]])[,1][top_genes]

  gene_list <- ""
  for (j in 1:gene_num) {
    gene_list <- paste(gene_list, names(top_genes[j]), ": ", top_genes[j], ", ", sep = "")
  }
  gene_list <- substr(gene_list, 1, nchar(gene_list) - 2)
  
  query <- sprintf(prompt, additional_info, gene_num, i, gene_list)
  
  answer <- call_chat(query, user_api_key)
  cat(sprintf("PC%d:\n%s\n\n", i, answer), file = output_file, append = TRUE)
}

We are going to compare the results of inputting the top 10, 15, 30, and 50 genes

# repeating, but with 15 top genes, for the first 5 PCs

# creating a new output file
output_file <- "PCA_LLM_response_15_genes.txt"
write("", file = output_file)

for (i in 1:5) {
  gene_num = 15
  top_genes <- names(sort(abs(Loadings(pbmc[["pca"]])[,i]), decreasing = TRUE)[1:gene_num])
  top_genes <- Loadings(pbmc[["pca"]])[,1][top_genes]
  gene_list <- ""
  for (j in 1:gene_num) {
    gene_list <- paste(gene_list, names(top_genes[j]), ": ", top_genes[j], ", ", sep = "")
  }
  gene_list <- substr(gene_list, 1, nchar(gene_list) - 2)
  
  query <- sprintf(prompt, additional_info, gene_num, i, gene_list)
  
  answer <- call_chat(query, user_api_key)
  cat(sprintf("PC%d:\n%s\n\n", i, answer), file = output_file, append = TRUE)
}

# repeating, but with 10 top genes, for the first 5 PCs

# creating a new output file
output_file <- "PCA_LLM_response_10_genes.txt"
write("", file = output_file)

for (i in 1:5) {
  gene_num = 10
  top_genes <- names(sort(abs(Loadings(pbmc[["pca"]])[,i]), decreasing = TRUE)[1:gene_num])
  top_genes <- Loadings(pbmc[["pca"]])[,1][top_genes]
  gene_list <- ""
  for (j in 1:gene_num) {
    gene_list <- paste(gene_list, names(top_genes[j]), ": ", top_genes[j], ", ", sep = "")
  }
  gene_list <- substr(gene_list, 1, nchar(gene_list) - 2)
  
  query <- sprintf(prompt, additional_info, gene_num, i, gene_list)
  
  answer <- call_chat(query, user_api_key)
  cat(sprintf("PC%d:\n%s\n\n", i, answer), file = output_file, append = TRUE)
}

# repeating, but with 50 top genes, for the first 5 PCs

# creating a new output file
output_file <- "PCA_LLM_response_50_genes.txt"
write("", file = output_file)

for (i in 1:5) {
  gene_num = 50
  genes <- names(sort(abs(Loadings(pbmc[["pca"]])[,i]), decreasing = TRUE)[1:gene_num])
  gene_list <- ""
  for (j in 1:gene_num) {
    gene_list <- paste(gene_list, names(top_genes[j]), ": ", top_genes[j], ", ", sep = "")
  }
  gene_list <- substr(gene_list, 1, nchar(gene_list) - 2)
  
  query <- sprintf(prompt, additional_info, gene_num, i, gene_list)
  
  answer <- call_chat(query, user_api_key)
  cat(sprintf("PC%d:\n%s\n\n", i, answer), file = output_file, append = TRUE)
}

For the purpose of testing the reliability of our LLM response, we will be taking the loadings from PC1 (which captures the highest amount of variance) and PC20 (which captures some a lot less variance) and calling the LLM with the same query 10 times to check for variability among the biological interpretations

# PC1 and PC20 and run them through the LLM 10 times to see how the responses differ
output_file_comp <- "PCA_LLM_response_comparison.txt"
write("", file = output_file_comp)

gene_num = 30

top_genes_PC1 <- names(sort(abs(Loadings(pbmc[["pca"]])[,1]), decreasing = TRUE)[1:gene_num])
top_genes_PC1 <- Loadings(pbmc[["pca"]])[,1][top_genes_PC1]

gene_list_PC1 <- ""
for (j in 1:gene_num) {
  gene_list_PC1 <- paste(gene_list_PC1, names(top_genes_PC1[j]), ": ", top_genes_PC1[j], ", ", sep = "")
}
gene_list_PC1 <- substr(gene_list_PC1, 1, nchar(gene_list_PC1) - 2)

top_genes_PC20 <- names(sort(abs(Loadings(pbmc[["pca"]])[,20]), decreasing = TRUE)[1:gene_num])
top_genes_PC20 <- Loadings(pbmc[["pca"]])[,1][top_genes_PC20]

gene_list_PC20 <- ""
for (j in 1:gene_num) {
  gene_list_PC20 <- paste(gene_list_PC20, names(top_genes_PC20[j]), ": ", top_genes_PC20[j], ", ", sep = "")
}
gene_list_PC20 <- substr(gene_list_PC20, 1, nchar(gene_list_PC20) - 2)

for (i in 1:10) {
    query_PC1 <- sprintf(prompt, additional_info, gene_num, 1, gene_list_PC1)
  
    answer_PC1 <- call_chat(query_PC1, user_api_key)
  cat(sprintf("PC1 Run%d:\n%s\n\n", i, answer_PC1), file = output_file_comp, append = TRUE)
  
      query_PC20 <- sprintf(prompt, additional_info, gene_num, 20, gene_list_PC20)
      
  answer_PC20 <- call_chat(query_PC20, user_api_key)
  cat(sprintf("PC 20 Run%d:\n%s\n\n", i, answer_PC20), file = output_file_comp, append = TRUE)
}

Results

Now, we are going to plot the first 20 PCs against PC1 and color each sample by the provided cell type annotations in order to visualize the variance captured by each PC along the y axis.

pcs <- pbmc@reductions$pca@cell.embeddings
anno <- pbmc@meta.data$seurat_annotations

toplot <- bind_cols(pcs, cluster = anno)

for (i in 2:10) {
  pc_name <- paste0("PC ", i)
  pc_col <- toplot[[paste0("PC_", i)]]
  p <- ggplot(toplot, aes(x = PC_1, y = pc_col, color = cluster)) + 
    geom_point() +
    labs(title = paste("PC 1 vs.", pc_name), x = "PC 1", y = pc_name)
  
  print(p)
}

Discussion

In this R markdown file, we have given the loadings from a PCA analysis on the PBMC3k dataset to an LLM model, in this case, OpenAI’s o4-mini (and it remains as an exercise for the reader to try this with other models, like o3, or whatever comes next).

We asked the model to provide a biological interpretation of each PC based on the genes with the highest magnitude loadings. Whether we input 10, 15, 30, or 50 genes, the o4-mini model was able to derive a clear biological interpretation for at least the first 5 PCs.

We also tested for consistency between LLM responses given the same prompt. The consistency of the across the board was notable. When comparing the 10 different outputs of the LLM for PC 1, we expected that the interpretation would be the same each time, because the first PC captures the most variance and would likely have the most clear biological relevance.

We were also pleasantly surprised to see that the model’s interpretations for PC 20, which captures significantly less variance, still remained relatively consistent each separate call to the model (each time the model suggested that PC 20 captured a rare subpopulation of B cells that are supporting massive antibody production).

It is important to note that during first iterations of this experiment, when using the GPT-4o mini, the consistency was significantly worse, especially for PC 20 results. This suggests that at the minimum, the user should test more than one model in their individual pipelines.

For the 30 gene input responses, we compared the biological interpretations for each PC to a plot in which it’s plotted again PC 1, with a color key for cell-type annotations, in order to get a feel for how the biological interpretations could be captured in the scatter of the cell populations along the y axis.

We observed that any variation captured by the PC on the cell-population level was also captured in the PC plots. However, it was difficult to verify the PC-captured variation suggested by the LLM if it had to do with cell-intrinsic traits that may transcend cell type, as the color key only differentiated by cell type.

Overall, we were able to show that there is in fact valuable information contained in the PCs past 1 and 2, and their respective PC loadings. The interpretations of some of the PCs beyond PC2 suggest that subtle, inter-population state and secretory changes can be caught by a PC that would otherwise may have been hidden by hard cluster boundaries, and the standard cluster vs not-cluster DEG analysis (e.g. FindAllMarkers).

The LLM responses using a good reasoning model were also quite consistent, which alleviates doubt surrounding hallucinations (though the user should remain vigilent).

Taken together, this “PC loadings to LLM” pipeline could be a beneficial step to add to the bioinformatician’s toolkit to encounter previously hidden insights in single cell data.