In Seurat or Scanpy, a standard scRNA-seq analysis uses only the top ~2,000 most variable genes (selected by a combination of mean expression and dispersion).
Feeding only these genes into large foundation models, rather than the whole dataset, could speed up the process...if the model can produce similar results from a dataset that consists only of the top variable genes.
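For reference, here is a minimal sketch of that selection step in Scanpy; the file name is a placeholder and the exact flavor and cutoffs vary between workflows:
import scanpy as sc
# Standard preprocessing before picking highly variable genes
adata = sc.read_h5ad("pbmc3k_raw.h5ad")  # placeholder path
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Keep roughly the top 2000 genes ranked by mean expression and dispersion (Seurat-style)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)
adata = adata[:, adata.var["highly_variable"]]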
Which brings us to the topic of this notebook.
The following is an experiment in which I took the most variable genes from the PBMC 3k dataset, as I would in a standard analysis, and ran them through the Universal Cell Embeddings (UCE) foundation model (paper here), which expects the full dataset. The question was whether it would (aside from actually working at all) produce comparable output.
The UCE model takes each cell's raw RNA counts and converts them into a 1,280-dimensional vector that carries biological meaning when compared to the vectors of other cells. The UMAPs produced previously suggest that the biological structure found by UCE is comparable to that of the standard analysis pipeline. For more on this, see my previous markdown here.
Specifically, this experiment has three groups:

1. A dataset containing only the most variable genes (experimental).
2. The full dataset (positive control).
3. A dataset containing randomly selected genes rather than the most variable genes (negative control).

Note that although I selected just under 2,000 genes for groups 1 and 3 (in line with standard practice), a processing step in the UCE workflow filtered out additional genes. This does not affect the point of the experiment.
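For illustration, here is a minimal sketch of how the variable-gene and random-gene inputs could be built before running UCE; the selection parameters are assumptions, not a record of the actual preprocessing:
import scanpy as sc
import numpy as np
raw = sc.read_h5ad("../data/pbmc3k_raw_uce_adata.h5ad")  # raw counts, as UCE expects
# Experimental group: pick variable genes on a normalized/log copy,
# then subset the raw counts so UCE still sees raw values
norm = raw.copy()
sc.pp.normalize_total(norm, target_sum=1e4)
sc.pp.log1p(norm)
sc.pp.highly_variable_genes(norm, flavor="seurat", n_top_genes=2000)
cut = raw[:, norm.var["highly_variable"]].copy()
# Negative control: the same number of randomly chosen genes
rng = np.random.default_rng(0)
rand = raw[:, rng.choice(raw.var_names, size=2000, replace=False)].copy()
cut.write_h5ad("../output/pbmc3k_raw_filtered_uce_adata.h5ad")
rand.write_h5ad("../output/pbmc3k_random_filter_uce_adata.h5ad")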
First, we import the data and check it. The processed dataset, loaded after the three experimental groups, carries the cell type annotations.
import scanpy as sc
import warnings
warnings.filterwarnings('ignore')
full = sc.read_h5ad("../data/pbmc3k_raw_uce_adata.h5ad")
cut = sc.read_h5ad("../output/pbmc3k_raw_filtered_uce_adata.h5ad")
random = sc.read_h5ad("../output/pbmc3k_random_filter_uce_adata.h5ad")
proc = sc.read_h5ad("../data/pbmc3k_processed.h5ad")
print(full) # Positive control
print(cut) # Experimental
print(random) # Negative control
print(proc) # Where we get the cell type annotations
AnnData object with n_obs × n_vars = 2700 × 9714
    obs: 'n_genes'
    var: 'gene_ids', 'n_cells'
    obsm: 'X_uce'
AnnData object with n_obs × n_vars = 2700 × 1509
    obs: 'n_genes'
    var: 'gene_ids', 'n_cells'
    obsm: 'X_uce'
AnnData object with n_obs × n_vars = 2392 × 528
    obs: 'n_genes'
    var: 'gene_ids', 'n_cells'
    obsm: 'X_uce'
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
# Note the total number of genes used
print(full.shape)
print(cut.shape)
print(random.shape)
print(proc.shape)
(2700, 9714)
(2700, 1509)
(2392, 528)
(2638, 1838)
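As a quick sanity check on the embeddings themselves (assuming UCE produced the 1,280-dimensional vectors described above):
# Each dataset stores one UCE embedding row per cell in obsm["X_uce"]
print(full.obsm["X_uce"].shape)  # expected to be (2700, 1280)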
# Make dataset names in the metadata for each
full.obs["dataset"] = "full"
cut.obs["dataset"] = "cut"
random.obs["dataset"] = "random"
# This works because pandas aligns on the cell barcodes (the obs index)
full.obs["louvain"] = proc.obs["louvain"]
cut.obs["louvain"] = proc.obs["louvain"]
random.obs["louvain"] = proc.obs["louvain"]
# But given that proc is smaller than full and cut, we get a few NaNs
# So we remove them here
full = full[full.obs["louvain"].notnull()]
cut = cut[cut.obs["louvain"].notnull()]
random = random[random.obs["louvain"].notnull()]
Now that we have read in and pre-processed the data (added the cell type annotations to each dataset), we will concatenate the three datasets into one.
# Concatenate the three datasets
adata = full.concatenate(cut, random)
adata.obs
| index | n_genes | dataset | louvain | batch |
|---|---|---|---|---|
| AAACATACAACCAC-1-0 | 778 | full | CD4 T cells | 0 |
| AAACATTGAGCTAC-1-0 | 1346 | full | B cells | 0 |
| AAACATTGATCAGC-1-0 | 1126 | full | CD4 T cells | 0 |
| AAACCGTGCTTCCG-1-0 | 953 | full | CD14+ Monocytes | 0 |
| AAACCGTGTATGCG-1-0 | 520 | full | NK cells | 0 |
| ... | ... | ... | ... | ... |
| TTTCGAACACCTGA-1-2 | 64 | random | Dendritic cells | 2 |
| TTTCGAACTCTCAT-1-2 | 52 | random | CD14+ Monocytes | 2 |
| TTTCTACTGAGGCA-1-2 | 58 | random | B cells | 2 |
| TTTCTACTTCCTCG-1-2 | 29 | random | B cells | 2 |
| TTTGCATGCCTCAC-1-2 | 33 | random | CD4 T cells | 2 |

7644 rows × 4 columns
Now we visualize the data via UMAP. In the previous experiment here, we observed that UMAP distorts distances between islands, at least for UCE. But here the purpose is simply to look at cell type and island separation within each dataset, not between them, and for that purpose UMAP is fine.
import umap.umap_ as umap
# Make a UMAP of the concatenated datasets, colored by dataset
# Make the umap outside of scanpy
reducer = umap.UMAP(random_state=1)
embedding = reducer.fit_transform(adata.obsm["X_uce"])
adata.obsm["X_umap"] = embedding
# Plot the umap, which is in the obsm["X_umap"] slot
# using sc.pl.umap
sc.pl.umap(adata, color="dataset")
sc.pl.umap(adata, color="louvain")
We observe that the full dataset has the best separation. The random dataset is packed into a single island, though it still shows some evidence of cell type separation within it. The variable-genes dataset shows clear island separation, though the islands sit closer together, suggesting that somewhat less separation is detected.
But given my general skepticism around UMAP, we will also look at PCA space, which proved insightful in my previous notebook on this topic.
# Run pca from X_uce, outside of scanpy
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
pca.fit(adata.obsm["X_uce"])
pca_embedding = pca.transform(adata.obsm["X_uce"])
# Put the pca embedding into the obsm slot
adata.obsm["X_pca"] = pca_embedding
# Plot the x_PCA slot in obsm colored by dataset
sc.pl.pca(adata, color="dataset")
sc.pl.pca(adata, color="louvain")
In PCA space, we again see the random dataset bunched into a single island. The full dataset has the best separation, though the CD8 T cells and the B cells appear somewhat bunched together.
In the variable-genes dataset, however, there is no separation at all between the CD4 T cells, CD8 T cells, and B cells, though that combined island is separated from the monocytes.
Now we will quantify these results by looking at cell type separation within each dataset, in terms of the heterogeneity of each cell's k-nearest neighborhood, using Shannon entropy as the metric.
# Run KNN in UCE space of the full, cut, and random datasets
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Find the k-nearest neighbors of each cell in X_uce, within each individual dataset
def uce_knn_indices(adata, n_neighbors=10):
    # Fit on the UCE embedding and return neighbor indices (each cell's own index is excluded)
    knn = NearestNeighbors(n_neighbors=n_neighbors)
    knn.fit(adata.obsm["X_uce"])
    return knn.kneighbors(return_distance=False)
full_knn = uce_knn_indices(full)
cut_knn = uce_knn_indices(cut)
random_knn = uce_knn_indices(random)
# For each KNN, find the cell types of the neighbors (obs["louvain"])
full_knn_subsets = [full.obs["louvain"].iloc[i].tolist() for i in full_knn]
cut_knn_subsets = [cut.obs["louvain"].iloc[i].tolist() for i in cut_knn]
random_knn_subsets = [random.obs["louvain"].iloc[i].tolist() for i in random_knn]
# What this looks like
random_knn_subsets[:2]
[['CD4 T cells', 'CD4 T cells', 'CD4 T cells', 'CD4 T cells', 'CD8 T cells', 'CD4 T cells', 'CD4 T cells', 'CD14+ Monocytes', 'CD4 T cells', 'CD4 T cells'], ['B cells', 'CD4 T cells', 'CD4 T cells', 'B cells', 'CD4 T cells', 'FCGR3A+ Monocytes', 'CD4 T cells', 'CD4 T cells', 'B cells', 'CD4 T cells']]
Now we will take the subsets above and calculate the Shannon entropy of each. Think of this as the average amount of surprisal. A fair coin has higher entropy than a biased coin, because a fair coin is less predictable. So the more evenly mixed the cell types in a neighborhood, the higher the entropy; if there is only one cell type in a neighborhood, the entropy is zero.
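As a quick illustration of the coin analogy (not part of the analysis itself):
from scipy.stats import entropy
# A fair coin is less predictable than a biased one, so its entropy is higher
print(entropy([0.5, 0.5]))  # ~0.693, the maximum for two outcomes (in nats)
print(entropy([0.9, 0.1]))  # ~0.325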
# Calculate the Shannon entropy of the KNN cell types
from scipy.stats import entropy
import numpy as np
# Assuming full_knn_subsets, cut_knn_subsets, and random_knn_subsets are defined
full_knn_entropy = [entropy(np.unique(i, return_counts=True)[1] / len(i)) for i in full_knn_subsets]
cut_knn_entropy = [entropy(np.unique(i, return_counts=True)[1] / len(i)) for i in cut_knn_subsets]
random_knn_entropy = [entropy(np.unique(i, return_counts=True)[1] / len(i)) for i in random_knn_subsets]
# Compare with neighborhood composition in the previous code block
print(random_knn_entropy[:2])
# Example with min (0) entropy: all values are the same
values_min = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
probs_min = np.bincount(values_min)[1:] / len(values_min) # Probability distribution
print(probs_min)
print(entropy(probs_min)) # Entropy should be 0
# Example with max entropy: all values are equally likely
values_max = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
probs_max = np.bincount(values_max)[1:] / len(values_max) # Probability distribution
print(probs_max)
print(entropy(probs_max)) # Higher entropy
[0.639031859650177, 0.8979457248567798]
[1.]
0.0
[0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]
2.3025850929940455
import numpy as np
import matplotlib.pyplot as plt
# Assuming full_knn_entropy, cut_knn_entropy, random_knn_entropy are defined
# Calculate means
means = [np.mean(full_knn_entropy), np.mean(cut_knn_entropy), np.mean(random_knn_entropy)]
# Calculate standard deviations for error bars
errors = [np.std(full_knn_entropy), np.std(cut_knn_entropy), np.std(random_knn_entropy)]
# Create the bar plot with error bars
plt.bar(["full", "cut", "random"], means, yerr=errors, capsize=5, color=['blue', 'green', 'orange'])
# Add labels and title
plt.ylabel('Mean Entropy')
plt.title('Mean Entropy of KNN Subsets with Error Bars')
# Show plot
plt.show()
import matplotlib.pyplot as plt
# Plot the entropy distributions as histograms with edge colors and improved settings
plt.hist(full_knn_entropy, bins=20, alpha=0.5, label="full", edgecolor='black')
plt.hist(cut_knn_entropy, bins=20, alpha=0.5, label="cut", edgecolor='blue')
plt.hist(random_knn_entropy, bins=20, alpha=0.5, label="random", edgecolor='green')
# Set labels and title for clarity
plt.xlabel("Entropy")
plt.ylabel("Frequency")
plt.title("KNN Entropy Distributions")
# Add a legend so the three distributions can be distinguished
plt.legend(loc='upper right')
# Display the plot
plt.show()
These results give us a bit more intuition about the UCE foundation model. It appears to draw meaningful information from the entire gene set (or at least to use it when generating meaningful output), since restricting the input to variable genes degrades cell type separation.
This means that for UCE, users should indeed feed in the whole dataset, not just the variable genes.
Whether this carries over to other foundation models remains to be seen.