I saw the populous sea, saw dawn and dusk, saw the multitudes of the Americas, saw a silvery spiderweb at the center of a black pyramid, saw a broken labyrinth (it was London), saw endless eyes, all very close, studying themselves in me as though in a mirror, saw all mirrors on the planet (and none of them reflecting me), saw in a rear courtyard on Calle Soler the same tiles I’d seen twenty years before in the entryway of a house in Fray Bentos, saw clusters of grapes, snow, veins of metal, water vapor, saw convex equatorial deserts and their every grain of sand…
Jorge Luis Borges, The Aleph
In RNA-seq, one often ends up with a long list of differentially expressed genes. The next question is often “what do we do with all of this?” And the answer is typically some sort of functional enrichment analysis. What you get out of this is a set of terms that carry biological context, be they GO terms or pathways.
The problem is that this list itself can be overwhelming. So what happens when we do functional enrichment analysis on a deluge of DEGs and get back a deluge of terms? Here, we will use the lesser-known “embedding” feature of some LLMs to group these terms thematically on a visual map.
In this markdown, we are going to run gprofiler on example data. We will then make a “map” of the results by placing the output into a contextual embedding, running a UMAP on that, and then making an interactive visualization using plotly. Again, the purpose of this is to make a deluge of gprofiler output (or the output of any analysis that produces GO terms, pathways, etc.) a bit easier to understand.
Specifically, this map will group terms based on context. For example, if there are a number of terms that are around brain development, then those will be grouped near to each other on the map.
We first need to get an example gene list. So let’s get started.
For this analysis, we will use g:Profiler. We will first run it using the gprofiler2 package in R. This will take in differentially expressed genes (DEGs) as input, and return statistically enriched annotation terms, from GO terms to pathways.
library(tidyverse)
library(gprofiler2)
set.seed(42)
Ok, and let’s get some data. The sample genes below were taken from the g:Profiler website, by clicking “random example” at the link.
sample_genes <- c(
"ENSG00000158055", "ENSG00000116539", "ENSG00000180340", "ENSG00000100600",
"ENSG00000136518", "ENSG00000141646", "ENSG00000129173", "ENSG00000073146",
"ENSG00000166337", "ENSG00000158445", "ENSG00000197046", "ENSG00000134250",
"ENSG00000106078", "ENSG00000121454", "ENSG00000018408", "ENSG00000106689",
"ENSG00000135439", "ENSG00000164082", "ENSG00000163913", "ENSG00000113456",
"ENSG00000138080", "ENSG00000152661", "ENSG00000145864", "ENSG00000055917",
"ENSG00000125398", "ENSG00000170175", "ENSG00000159082", "ENSG00000186090",
"ENSG00000152583", "ENSG00000164171", "ENSG00000101958", "ENSG00000259207",
"ENSG00000069849", "ENSG00000145147", "ENSG00000109158", "ENSG00000141655",
"ENSG00000099337", "ENSG00000154736", "ENSG00000125965", "ENSG00000126583"
)
gost_out <- gprofiler2::gost(query = sample_genes, organism = "hsapiens", evcodes = TRUE)
gostres <- gost_out$result %>% as_tibble()
Now we are going to look at our results, so you can see what the best practices are for visualizing them.
First, let’s look at what you are given:
gostres
## # A tibble: 205 × 16
## query significant p_value term_size query_size intersection_size precision
## <chr> <lgl> <dbl> <int> <int> <int> <dbl>
## 1 query_1 TRUE 1.50e- 2 3 13 2 0.154
## 2 query_1 TRUE 1.50e- 2 3 13 2 0.154
## 3 query_1 TRUE 6.34e-11 3047 40 27 0.675
## 4 query_1 TRUE 6.42e-10 7234 40 36 0.9
## 5 query_1 TRUE 5.01e- 9 3985 40 28 0.7
## 6 query_1 TRUE 5.73e- 9 613 40 14 0.35
## 7 query_1 TRUE 2.57e- 8 5924 40 32 0.8
## 8 query_1 TRUE 3.60e- 8 6478 40 33 0.825
## 9 query_1 TRUE 4.31e- 8 1229 40 17 0.425
## 10 query_1 TRUE 1.10e- 7 617 40 13 0.325
## # ℹ 195 more rows
## # ℹ 9 more variables: recall <dbl>, term_id <chr>, source <chr>,
## # term_name <chr>, effective_domain_size <int>, source_order <int>,
## # parents <list>, evidence_codes <chr>, intersection <chr>
names(gostres)
## [1] "query" "significant" "p_value"
## [4] "term_size" "query_size" "intersection_size"
## [7] "precision" "recall" "term_id"
## [10] "source" "term_name" "effective_domain_size"
## [13] "source_order" "parents" "evidence_codes"
## [16] "intersection"
So you get a data frame where each row is a term. Importantly, you get a p-value, and (beyond the scope of this markdown, but I use it) if you set evcodes to TRUE, you get a column called “intersection”, which gives you the genes in your data that fall under the given term.
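As a quick aside, here is a minimal sketch of how you might pull those intersection genes out for a single term of interest (here, the most significant one); the column names come straight from the gost() output above:
# Sketch: inspect the overlapping genes for the most significant term
top_term <- gostres %>%
  dplyr::arrange(p_value) %>%
  dplyr::slice(1)
top_term$term_name
# The intersection column is a comma-separated string of gene IDs
strsplit(top_term$intersection, ",")[[1]]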
Ok, and are there any good built-in visualizations for this? Yes. Here is the one that I use.
gostplot(gost_out)
You can see that this is an interactive plot (probably plotly or something similar running in the back end), where if you hover over a point, it gives you the term (grouped by source on the x axis) and its p-value (also shown on the y axis after a -log10 transform).
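As an aside, if you need a static version of this plot (for a figure panel, say), gostplot() takes an interactive argument; to my understanding, setting it to FALSE returns a regular ggplot object rather than a plotly widget:
# Static (ggplot) version of the same overview plot
static_p <- gostplot(gost_out, capped = TRUE, interactive = FALSE)
static_p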
However, to my eyes, this is still a bit much. I’m glad to see the terms grouped by category (GO versus KEGG, etc.). But I’d like to see the data points grouped by context, regardless of category. So we need to make another visualization. This is where we can leverage the rapid development of LLMs, and specifically the lesser-known application of creating embeddings rather than chatbots.
So now it’s time to set up our contextual embedding step. We are going to use a model similar to BERT. Unlike the GPTs, which are built around next-token prediction, our model produces a high-dimensional embedding. We note that the GPT models can produce embeddings as well, but that is not their primary use case at the time of writing.
In short, this is going to take in our results above and produce a contextual embedding that we will eventually be able to visualize.
To get access to our model, we can use the sentence_transformers python package. But this is an R Markdown, so while I typically use python directly for this type of work, here I am going to use the reticulate package to do it.
I have already set up a virtual environment, which I activate below.
library(reticulate)
use_virtualenv("your_venv_here", required = TRUE)
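If you have not built the environment yet, a one-time setup might look like the sketch below (the environment name is a placeholder; swap in your own):
# One-time setup (sketch): create the virtual environment and install what we need
reticulate::virtualenv_create("your_venv_here")
reticulate::virtualenv_install("your_venv_here",
                               packages = c("sentence-transformers", "numpy"))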
We test python code here:
py_run_string("print('hello world')")
## hello world
And test python code with imports below:
np <- import("numpy")
x <- np$array(c(1, 2, 3, 4, 5))
print(x)
## [1] 1 2 3 4 5
print(np$mean(x))
## [1] 3
And now we get down to business. Below, we import the sentence_transformers package, where we use all-mpnet-base-v2 as the model. It will take the gprofiler output that we made and convert each term into a 768-dimensional vector, such that terms that are contextually similar end up near each other in that 768-dimensional space.
When you run the code below for the first time, it will take a few minutes, because the model has to download.
# Import the necessary modules
st <- import("sentence_transformers")
# Load the model
model <- st$SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# Create a function to encode sentences
encode_sentences <- function(sentences) {
# Convert sentences to a Python list
py_sentences <- r_to_py(sentences)
# Encode the sentences
embeddings <- model$encode(py_sentences)
# Convert the result back to an R matrix
return(py_to_r(embeddings))
}
# Example usage
embeddings <- encode_sentences(gostres$term_name)
# Print the shape of the embeddings
print(dim(embeddings))
## [1] 205 768
Ok, it looks like it worked. Let’s look at what the model output looks like:
as_tibble(embeddings)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 205 × 768
## V1 V2 V3 V4 V5 V6 V7 V8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0403 -0.0777 0.0356 0.0185 0.00150 0.0220 0.0337 -0.0575
## 2 0.0490 -0.0856 0.0346 0.0226 -0.0000313 0.0234 0.0331 -0.0566
## 3 -0.00649 0.0371 -0.00633 -0.0310 0.0373 0.00806 0.0660 -0.0400
## 4 -0.0214 -0.0380 0.00458 -0.0374 -0.0148 -0.00459 0.0334 -0.0257
## 5 0.0280 -0.0105 -0.0462 0.00284 0.00378 -0.000652 0.0232 -0.0536
## 6 -0.0351 -0.00546 0.00257 -0.0401 0.00800 -0.0188 0.0291 -0.0282
## 7 -0.0149 0.0348 -0.0404 0.0204 0.0174 0.00283 0.0723 -0.0114
## 8 -0.00593 -0.000992 -0.0213 -0.0147 -0.0110 -0.00728 0.00382 -0.00579
## 9 -0.0598 -0.0300 -0.0169 -0.0153 0.0383 -0.00233 0.0911 -0.0670
## 10 -0.0548 0.00839 -0.0163 -0.0192 -0.00201 -0.0166 0.0463 -0.0332
## # ℹ 195 more rows
## # ℹ 760 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## # V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## # V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
## # V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
## # V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>,
## # V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, …
You can see it’s 768 columns of numbers that don’t really have any meaning to us, but they do group the rows contextually.
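As a quick sanity check on the word “contextually”, we can compute the cosine similarity between a couple of term embeddings; a minimal sketch using base R (related terms should score higher than unrelated ones):
# Cosine similarity between two embedding rows (base R, no extra packages)
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
gostres$term_name[1:3]
cosine_sim(embeddings[1, ], embeddings[2, ])
cosine_sim(embeddings[1, ], embeddings[3, ])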
But how can we utilize this, given that we don’t operate in 768 dimensional space? We can use dimensionality reduction tools to compress the information down to 2 dimensions.
To this end, we will run UMAP below on the embeddings.
library(umap)
dimr <- umap::umap(embeddings)$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")
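We are using the package defaults here. If you want to tune the layout, the umap package lets you override its configuration object; a sketch (the values below are illustrative, not recommendations):
# Sketch: overriding the default UMAP settings
custom_config <- umap::umap.defaults
custom_config$n_neighbors <- 15
custom_config$min_dist <- 0.1
custom_config$random_state <- 42
dimr_custom <- umap::umap(embeddings, config = custom_config)$layout %>% as_tibble()
names(dimr_custom) <- c("umap1", "umap2")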
And now we will do a quick ggplot of this as a sanity check.
ggplot(dimr, aes(x = umap1, y = umap2)) + geom_point()
So there are our points. But this means nothing to us right now, and adding the labels directly to the map above makes it very crowded.
So we are going to do what gprofiler did, and make an interactive plot out of this. We will use plotly to this end. I typically use plotly within python, but there is an R package for it, which we use below.
The syntax is a bit dicey, but this is where the LLMs come in handy: giving you the proper syntax for a syntax-heavy package. Anyway, the code below gets the job done.
library(plotly)
# Calculate negative log10 of p-value
dimr$neg_log10_p <- -log10(gostres$p_value)
dimr$category <- gostres$source
dimr$name <- gostres$term_name
# Create a function to scale the sizes
scale_sizes <- function(x, min_size = 5, max_size = 30) {
(x - min(x)) / (max(x) - min(x)) * (max_size - min_size) + min_size
}
# Create the plot
p <- plot_ly(dimr,
x = ~umap1,
y = ~umap2,
type = "scatter",
mode = "markers",
marker = list(
size = ~scale_sizes(neg_log10_p),
sizemode = "area",
sizeref = 0.05,
sizemin = 4
),
color = ~category, # Assuming 'category' is the column with GO/REAC info
hoverinfo = "text",
text = ~paste("Name: ", name,
"<br>Source: ", category,
"<br>-log10(p-value): ", round(neg_log10_p, 2))) %>%
layout(
title = "Contextual embeddings of terms",
legend = list(title = list(text = "Category")),
showlegend = TRUE
)
# Add size legend
p <- p %>% add_trace(
x = c(),
y = c(),
type = "scatter",
mode = "markers",
marker = list(
size = ~scale_sizes(neg_log10_p),
color = "rgba(0,0,0,0)"
),
showlegend = TRUE,
legendgroup = "size",
name = "Size (-log10 p-value)",
hoverinfo = "none"
)
p
And now we have an interactive plot that visualizes our results. You can hover over a point to get the p-value and term name. It is colored by category (e.g. GO, KEGG, etc.).
Hover over it for a minute to convince yourself that the points are grouped thematically, and you’ll see the utility of having a visualization like this in your post-gprofiler analysis.
This is a nice supplement to the built-in gostplot, in that points that are contextually similar to each other are placed near each other on the map, giving you another lens through which you can look at an overwhelming number of terms.
Now we are going to make the “themes” explicit in the GO and pathway terms. We are going to do this by clustering the high-dimensional embedding. Then, within each cluster, we are going to use an LLM chatbot I built (I describe how to do that here), which runs directly in an R Markdown (or any literate programming environment), to develop cluster themes in real time and assign them to the clusters. We will then re-visualize the UMAP with this additional information.
So the first thing we are going to do is the clustering. We are going to do Louvain community detection on a KNN graph. We are going to cluster the high-dimensional embedding, not the UMAP. This is because I have found inaccuracies in the resolution of UMAP, which lead me to cluster the higher-dimensional space when I can.
You can learn more about that in my webinar here or get a quick understanding of it by looking at the gif on the README of my KnnSleepwalk package, here (scroll to the bottom).
So let’s get started by making the graph.
library(RANN)
# Set the number of neighbors: use k = 11 to get 10 neighbors (excluding the self-match)
k <- 11
# Compute nearest neighbors using nn2. The function returns a list with indices and distances.
nn_res <- RANN::nn2(embeddings, k = k)
# Remove the self-neighbor (first column) from the results
knn_idx <- nn_res$nn.idx[, -1]
knn_dists <- nn_res$nn.dists[, -1]
Now we build the KNN graph itself. This involves first making an edge list, and then feeding that into the igraph package. Note that we are weighting each edge.
edge_list <- map_dfr(1:nrow(embeddings), function(i) {
tibble(
from = i,
# Ensure indices are integers
to = as.integer(knn_idx[i,]),
weight = 1 / (1 + knn_dists[i,])
)
})
And now we turn this into a graph.
library(igraph)
# Create the graph. Vertices are numbered from 1 to nrow(embeddings_matrix)
g <- graph_from_data_frame(edge_list, directed = FALSE, vertices = 1:nrow(embeddings))
# Simplify the graph to remove self-loops and combine multiple edges (using the mean weight)
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE, edge.attr.comb = "mean")
g
## IGRAPH ba3b3af UNW- 205 1454 --
## + attr: name (v/c), weight (e/n)
## + edges from ba3b3af (vertex names):
## [1] 1--2 1--145 1--155 1--157 1--160 1--163 1--168 1--169 1--171 1--176
## [11] 2--145 2--155 2--157 2--160 2--163 2--168 2--169 2--171 2--176 3--7
## [21] 3--10 3--11 3--12 3--13 3--15 3--16 3--20 3--29 3--31 3--34
## [31] 3--35 3--46 3--48 3--49 3--50 3--51 3--53 3--72 3--81 3--89
## [41] 3--90 3--92 3--99 3--103 3--104 3--120 3--130 3--134 4--12 4--24
## [51] 4--40 4--56 4--64 4--84 4--97 4--103 4--106 4--114 4--122 4--133
## [61] 5--8 5--35 5--36 5--49 5--51 5--82 5--99 5--114 5--129 5--143
## [71] 5--162 6--10 6--11 6--13 6--14 6--20 6--23 6--27 6--32 6--40
## + ... omitted several edges
From there, we run clustering, or in graph theory terms, community detection. One of the more popular ways of doing this (of many) is Louvain. I’ll leave it as an exercise for the reader to try additional methods (e.g. Leiden; a sketch appears after the cluster sizes below).
louvain_result <- igraph::cluster_louvain(g, weights = E(g)$weight)
clusters <- membership(louvain_result)
# Add cluster labels to the embeddings tibble (as a factor for easy plotting)
dimr$cluster <- as.factor(clusters)
# Optional: Print cluster sizes
print(table(dimr$cluster))
##
## 1 2 3 4 5 6 7
## 25 41 21 19 33 40 26
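As a sketch of the Leiden alternative mentioned above (it requires a reasonably recent igraph, and the parameters here are just a starting point):
# Sketch: Leiden community detection on the same KNN graph
leiden_result <- igraph::cluster_leiden(g,
                                        objective_function = "modularity",
                                        weights = E(g)$weight)
table(membership(leiden_result))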
Great. And now we are going to make themes for the clusters. We are going to call an LLM chatbot directly within the code to help us with this. Again, I detail how to set up the LLM call in this article here.
theme <- lapply(unique(clusters), function(i) {
curr_terms <- dplyr::filter(dimr, cluster == i)$name %>% as.character()
curr_str <- paste(curr_terms, collapse = ", ")
# Build the prompt. Feel free to adjust the wording as needed.
prompt <- paste(
"Given the following gprofiler terms, please give me a theme that each of them sit under.",
"For example, it might be something like brain development.",
"Please respond with just the theme. Nothing else.",
"The terms are as follows:",
curr_str
)
cmd <- sprintf('source ~/.zshrc && chatbot "claude" %s', shQuote(prompt))
result <- system(cmd, intern = TRUE)
return(result)
}) %>% unlist()
theme
## [1] "Membrane transport and signaling"
## [2] "Embryonic and organ development"
## [3] "Cellular and tissue development processes"
## [4] "Sensory system and neurological development"
## [5] "Skeletal system development and morphogenesis"
## [6] "Cellular regulation and response"
## [7] "Neuronal signaling and synaptic communication"
Ok, from here, we add the themes to the clusters.
clust_theme <- lapply(dimr$cluster, function(i) {
result <- theme[as.numeric(i)] # It's a factor right now
return(result)
}) %>% unlist()
dimr$theme <- clust_theme
dimr
## # A tibble: 205 × 7
## umap1 umap2 neg_log10_p category name cluster theme
## <dbl> <dbl> <dbl> <chr> <chr> <fct> <chr>
## 1 0.272 -4.04 1.82 CORUM SMAD2-SMAD4-WWTR1 complex 1 Memb…
## 2 0.387 -4.06 1.82 CORUM SMAD3-SMAD4-WWTR1 complex 1 Memb…
## 3 -0.265 2.38 10.2 GO:BP animal organ development 2 Embr…
## 4 0.774 0.0746 9.19 GO:BP multicellular organismal p… 3 Cell…
## 5 0.927 1.34 8.30 GO:BP system development 4 Sens…
## 6 -0.373 2.67 8.24 GO:BP tissue morphogenesis 5 Skel…
## 7 0.300 2.66 7.59 GO:BP anatomical structure devel… 4 Sens…
## 8 0.470 1.27 7.44 GO:BP developmental process 4 Sens…
## 9 -1.37 3.03 7.37 GO:BP epithelium development 2 Embr…
## 10 -0.596 2.76 6.96 GO:BP embryonic morphogenesis 2 Embr…
## # ℹ 195 more rows
Ok, and from here we are going to re-print the map and add the theme to it. We will use plotly once again, below.
# Create the plot
p <- plot_ly(dimr,
x = ~umap1,
y = ~umap2,
type = "scatter",
mode = "markers",
marker = list(
size = ~scale_sizes(neg_log10_p),
sizemode = "area",
sizeref = 0.05,
sizemin = 4
),
color = ~category, # Assuming 'category' is the column with GO/REAC info
hoverinfo = "text",
text = ~paste("Name: ", name,
"<br>Source: ", category,
"<br>-log10(p-value): ", round(neg_log10_p, 2),
"<br>Theme: ", theme)) %>%
layout(
title = "Contextual embeddings of terms",
legend = list(title = list(text = "Category")),
showlegend = TRUE
)
# Add size legend
p <- p %>% add_trace(
x = c(),
y = c(),
type = "scatter",
mode = "markers",
marker = list(
size = ~scale_sizes(neg_log10_p),
color = "rgba(0,0,0,0)"
),
showlegend = TRUE,
legendgroup = "size",
name = "Size (-log10 p-value)",
hoverinfo = "none"
)
p
If you hover over the text, you can see the themes show up and you can do sanity checks in real time. It’s not perfect, but I think adjusting the cluster resolution, adjusting the prompt, and adjusting the model will help in this regard.
I’ll note that I’ve run this a number of times, and the themes differ a little bit each time. This is an important limitation. You don’t exactly know what the chatbot is going to give you.
Now, let’s run this chatbot three times to see if it gives us different themes each time, or if they are similar.
theme_list <- lapply(1:3, function(run) {
theme <- lapply(unique(clusters), function(i) {
curr_terms <- dplyr::filter(dimr, cluster == i)$name %>% as.character()
curr_str <- paste(curr_terms, collapse = ", ")
# Build the prompt. Feel free to adjust the wording as needed.
prompt <- paste(
"Given the following gprofiler terms, please give me a theme that each of them sit under.",
"For example, it might be something like brain development.",
"Please respond with just the theme. Nothing else.",
"The terms are as follows:",
curr_str
)
cmd <- sprintf('source ~/.zshrc && chatbot "claude" %s', shQuote(prompt))
result <- system(cmd, intern = TRUE)
return(result)
}) %>% unlist()
return(theme)
})
theme_list
## [[1]]
## [1] "Membrane transport and signaling complexes"
## [2] "Embryonic and organ development"
## [3] "Cellular and tissue development"
## [4] "Neurological and sensory system development"
## [5] "Skeletal development and morphogenesis"
## [6] "Cellular regulation and response processes"
## [7] "Neural signaling and synaptic communication"
##
## [[2]]
## [1] "Membrane transport and signaling complexes"
## [2] "Embryonic and organ development"
## [3] "Cellular and tissue development"
## [4] "Developmental neurobiology"
## [5] "Skeletal and tissue morphogenesis"
## [6] "Cellular Regulation and Response"
## [7] "Neural signaling and synaptic communication"
##
## [[3]]
## [1] "Membrane transport and signaling"
## [2] "Embryonic and Organ Development"
## [3] "Cellular and tissue development"
## [4] "Organismal Development and Sensory System Formation"
## [5] "Skeletal and tissue morphogenesis"
## [6] "Cellular regulation and response to environmental stimuli"
## [7] "Neuronal signaling and synaptic communication"
We can see that the results are not exact each time, but they are similar. I would give exact commentary here, but knitting this markdown will produce a slightly different set of results compared to the one I’m seeing right now.
But as an example, in the current run, one of the clusters has the following LLM annotation across three runs: 1) “Ion channel and membrane transport signaling”, 2) “Cell signaling and membrane transport”, and 3) “Membrane transport and signaling”.
So you can see that perhaps ion channels matter here, but that’s only mentioned in one of the annotation runs. This suggests that perhaps this cluster is too broad, and we need to cluster again with higher resolution.
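As a sketch of what that re-clustering might look like: newer versions of igraph let you pass a resolution argument to cluster_louvain(), where values above 1 favor more, smaller clusters (the 1.5 below is an arbitrary starting point):
# Sketch: re-cluster at a higher resolution to break up broad clusters
# (the resolution argument requires a recent igraph; 1.5 is arbitrary)
louvain_hires <- igraph::cluster_louvain(g, weights = E(g)$weight, resolution = 1.5)
table(membership(louvain_hires))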
The problem I solved above is a subset of a much bigger problem that really started with the beginning of the -omics era: what do I do with all this data?
It used to be that the more data you had, the better. I remember this to be the case in the first biology lab I worked in, where experiments were primarily western blots and 2-color immunofluorescence (not including DAPI). But now, every experiment gives you a firehose of data.
The problem of what to do with DEGs has been around since the beginnings of bulk RNA-seq (though it is still an issue for single-cell). I remember first seeing a talk where GO terms were used and still being overwhelmed by them.
This markdown presents two solutions that are now possible with the advent of AI, and particularly the transformer.
The first of these solutions is producing an embedding of gprofiler terms and visualizing the embedding on an interactive UMAP (or whatever dimension reduction method is popular at the time the reader sees this). We were able to do this here using the sentence-transformers python package to produce the embeddings, and plotly to view the UMAP. We note that an important piece here was the use of the reticulate package to run python within R.
The second of these solutions is using LLM chatbots to annotate the clusters in terms of a theme that underlies the terms in a given cluster. An important piece here is the development of a command-line chatbot that makes API calls to an LLM (which, again, I talk about here), and in turn calling that directly within R and saving the results as strings. This automates the LLM calls so the user doesn’t have to stop the pipeline, run the LLM prompt, copy/paste the answer, and so forth.
I will note here that a lot of the work on single-cell foundation models involves the development of embeddings as well, so there is a theme here that the reader should be aware of. For more on that, have a look at my work on the Universal Cell Embeddings (UCE) model, here.
Zooming out, this markdown provides solutions for making large amounts of textual output readable: using a model that converts the output into embeddings, running python within R, calling an LLM directly within R and storing its output as an R object without user intervention, and making interactive plots to visualize all of the above.
Hopefully, this markdown will help the user solve the continual “I’m overwhelmed by all of these data” problem, and more quickly convert their data into actionable insights.