Tyler Burns
September 28 - September 29, 2024
Here, we take a look at the output of a transformer-based single-cell foundation model called Universal Cell Embeddings. It is a 1280-dimensional embedding of around 30,000 single cells. Because it is the output of a black-box model, we ask questions about the geometry of the embedding to see whether that can tell us anything about what the model is doing.
We find that there is a distinct "center" of the embedding and a distinct "outer edge." There are positive associations between center-ness, the frequency of a cell subset, and the density of cells in a given subset. We also find that UMAP does not properly represent the center-ness of the data, so you should not treat the center of the UMAP as the literal center of the dataset until you check it yourself (with the code provided).
This notebook examines a human central nervous system dataset as embedded by the Universal Cell Embeddings (UCE) transformer model. The paper describing the model can be found here.
UCE and a number of other transformer-based models embed single cells into a high-dimensional vector space. New data can be passed through the model and placed in the same vector space. As such, these models allow for things like per-cell annotation.
The Chan Zuckerberg Initiative (CZI) has done a fair amount of work in single-cell, bringing together many databases along with visualization tools in what they call CELLxGENE. Accordingly, CZI has some of these models in an accessible format for users. These CZI "census models" can be found on their website here.
At the time of writing (September 29, 2024), these models are fairly new. Accordingly, this Jupyter notebook is a first pass at understanding the properties of the high-dimensional embeddings that these models output.
To save time (it takes 20 minutes to pull the AnnData object on my 16 GB MacBook Pro), I ran the code in this first block separately, saved the object, and I read it in below. The first code block is not executed in this notebook. I am showing it so you can run it on your end.
import cellxgene_census
import os
# Set the working directory to "data"
os.chdir('data')
print("setting connection")
census = cellxgene_census.open_soma(census_version="2023-12-15")
# Human UCE
print("getting human data")
adata = cellxgene_census.get_anndata(
    census,
    organism="homo_sapiens",
    measurement_name="RNA",
    obs_value_filter="tissue_general == 'central nervous system'",
    obs_embeddings=["uce"],
)
adata.write("human_uce_cns.h5ad")
import random
import scanpy as sc
random.seed(42)
import warnings
warnings.filterwarnings('ignore')
# Read in the pre-made anndata file, reading in h5ad
adata = sc.read("data/human_uce_cns.h5ad")
So now we have an AnnData object. This format was originally built for the scanpy package (for single-cell sequencing analysis: R users use Seurat, Python users use scanpy). You can read more about AnnData here.
Now let's look at the shape of the data.
adata.obsm['uce'].shape
(31780, 1280)
Here, we have 31,780 cells and 1280 "features." These features are the dimensions of the embedding that was output by the transformer. While not exact, it is comparable to the output from NLP models like BERT, which I have heavily used in the past in projects like this, and the content behind my TED talk here.
In short, cells that are similar to each other by whatever context the transformer was able to find will be grouped physically near each other in this embedding. The more cells the model was trained on, and especially the more diverse the training set, the more powerful the model is likely to be.
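To make "near each other" concrete, here is a minimal sketch of a nearest-neighbor lookup in an embedding. It uses a small synthetic array as a stand-in for `adata.obsm['uce']`, and it assumes Euclidean distance is a reasonable notion of closeness here. The function name `nearest_neighbors` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def nearest_neighbors(X, query_idx, k=5):
    """Return the indices of the k rows of X closest (Euclidean) to row query_idx."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X - X[query_idx], axis=1)
    dists[query_idx] = np.inf  # exclude the query cell itself
    return np.argsort(dists)[:k]

# Toy stand-in for an embedding: two well-separated groups of 10 "cells" each
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.0, scale=0.05, size=(10, 16))
group_b = rng.normal(loc=-1.0, scale=0.05, size=(10, 16))
toy = np.vstack([group_a, group_b])

# Neighbors of cell 0 should all come from its own group (rows 0 to 9)
neighbors = nearest_neighbors(toy, query_idx=0, k=5)
print(neighbors)
```

On the real data, the same call would take `adata.obsm['uce']` as `X`, and the returned positions could be looked up in `adata.obs['cell_type']` to see whether a cell's neighbors share its annotation.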
Below, we will visualize these 1280 dimensions by means of compressing them to 2 dimensions using UMAP, a nonlinear dimensionality reduction tool that is commonly used at this point. While there is plenty about it to critique, it is sufficiently good for our purposes below.
import umap
import matplotlib.pyplot as plt
reducer = umap.UMAP(random_state=42)  # seed UMAP itself; random.seed() above does not affect it
embedding = reducer.fit_transform(adata.obsm['uce'])
plt.scatter(embedding[:,0], embedding[:,1])
<matplotlib.collections.PathCollection at 0x2d144fef0>
We see that the data do fall into distinct "islands." This is fairly typical of what we would see if we simply ran UMAP on a compressed version (e.g. the top principal components) of the top 2000 highly variable genes. For more on a typical scRNA-seq analysis workup, go here.
But this only tells us that there are distinct islands. Our AnnData object has information about cell subset. Let's color the UMAP by those subsets and see where they fall. Below, we make a function that allows us to loop through all the metadata columns and color the UMAP by each of them.
We are going to do a data dump of UMAPs colored by various pieces of metadata that might be of interest to some readers (e.g. gender). If you just want to cut to the chase, you can skip this section, as I will show the subsets plot in the subsequent section.
The fourth UMAP down is our cell subsets.
import pandas as pd
# Put umap into adata obsm
adata.obsm['X_umap'] = embedding
# Plot UMAP colored by each categorical or low-cardinality column in adata.obs
for column in adata.obs.columns:
    is_categorical = isinstance(adata.obs[column].dtype, pd.CategoricalDtype)
    if is_categorical or adata.obs[column].nunique() < 20:
        sc.pl.umap(adata, color=column)
If we look just at the cell subsets, we see that there are a number of CNS populations, which serves as a good sanity check. The largest of these subsets are cerebellar granule cells and oligodendrocytes. Specifically, we are dealing with white matter of the cerebellum, so the former makes sense in that regard.
We now look at the metadata below, stored in the obs slot.
# Get metadata
adata.obs
index | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | suspension_type | tissue | tissue_ontology_term_id | tissue_general | tissue_general_ontology_term_id | raw_sum | nnz | raw_mean_nnz | raw_variance_nnz | n_measured_vars |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8752190 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | ependymal cell | CL:0000065 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 2374.0 | 1623 | 1.462723 | 4.991057 | 24817 |
1 | 8752191 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 917.0 | 640 | 1.432813 | 1.638671 | 24817 |
2 | 8752192 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 2241.0 | 1184 | 1.892736 | 5.527792 | 24817 |
3 | 8752193 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 1255.0 | 887 | 1.414882 | 4.256573 | 24817 |
4 | 8752194 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 1491.0 | 816 | 1.827206 | 5.340658 | 24817 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
31775 | 8831489 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 23435.0 | 5149 | 4.551369 | 134.862017 | 28059 |
31776 | 8831490 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 3134.0 | 1343 | 2.333582 | 15.222471 | 28059 |
31777 | 8831491 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 3692.0 | 1632 | 2.262255 | 21.365883 | 28059 |
31778 | 8831492 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 11394.0 | 3554 | 3.205965 | 58.032433 | 28059 |
31779 | 8831493 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | nucleus | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7072.0 | 2502 | 2.826539 | 31.751586 | 28059 |
31780 rows × 26 columns
With something like a high dimensional embedding that comes from a black box model, we as skeptical biologists have no idea whether and how much to trust the model. This may change down the line as the field of explainable artificial intelligence matures. But until then, we have to ask very simple questions to get at the characteristics of the model and what it encoded.
Thus, we are going to ask a very simple set of questions. Given that we are dealing with a 1280-dimensional point cloud, which cells are at the very center of the point cloud, and which cells are at the outer edge?
I am guessing that the center of the model's output will be things that the model determined to be "central" to everything else. For example, there might be cells that share gene expression programmes with the rest of the cells in the model, or perhaps in some developmental datasets, cells that are precursors to a large fraction of the cells in the model.
Furthermore, I am guessing that cells on the outer edges would be least like the others. In other words, cells out here would be "outliers," either by being biologically different (e.g. contamination from a different organ system) or by being technical artifacts.
So let's look at the center and the outside of the embedding, below. We do this by first finding the center of the data, and we approximate that by finding the mean value of each of the 1280 dimensions.
# Find the mean of all coordinates in the manifold
center = adata.obsm['uce'].mean(axis=0)
center[0:10]
array([-0.00088816, 0.02052302, 0.00202639, 0.00415612, 0.02179049, 0.00748633, 0.00377401, -0.00308924, 0.01350688, -0.00047351], dtype=float32)
Next, we compute the distance from each cell to this "center" coordinate we just found.
import numpy as np
# Euclidean distance from each cell to the center that we just computed
distances = np.linalg.norm(adata.obsm['uce'] - center, axis=1)
distances[0:10]
len(distances)
31780
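As a sanity check on the two steps above (mean of each dimension, then Euclidean distance to that mean), here is a minimal sketch on a toy point cloud where the answer is known by construction. The helper name `distance_from_center` and the toy array are hypothetical. The cast to float64 is my own assumption about numerical hygiene, not something the notebook does.

```python
import numpy as np

def distance_from_center(X):
    """Euclidean distance of each row of X from the per-dimension mean of X."""
    X = np.asarray(X, dtype=float)  # float64 copy; the real embedding is float32
    center = X.mean(axis=0)
    return np.linalg.norm(X - center, axis=1)

# Toy cloud: four corners of a square plus its midpoint
toy = np.array([[0, 0], [2, 0], [0, 2], [2, 2], [1, 1]], dtype=float)
toy_distances = distance_from_center(toy)

# The midpoint (row 4) sits exactly on the mean, so it sorts first
order = np.argsort(toy_distances)
print(toy_distances)
print(order[0])
```

The four corners are each at distance sqrt(2) from the mean and the midpoint is at distance 0, which is the ordering we would want "center-ness" to capture.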
From there, we add the distances to the metadata matrix that we have already seen, so we can sort by them.
# Add distances to the metadata
adata.obs['distance_from_center'] = distances
# Sort cells by distance from center, with the closest cells first
adata_sorted = adata[adata.obs['distance_from_center'].sort_values().index]
And now we have a look.
adata_sorted.obs
index | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | tissue | tissue_ontology_term_id | tissue_general | tissue_general_ontology_term_id | raw_sum | nnz | raw_mean_nnz | raw_variance_nnz | n_measured_vars | distance_from_center |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18977 | 8805083 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7387.0 | 3177 | 2.325150 | 14.233350 | 28144 | 0.504095 |
11231 | 8775344 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 74-year-old human stage | HsapDv:0000168 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 15343.0 | 4819 | 3.183856 | 50.794335 | 30436 | 0.506040 |
8151 | 8772264 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7387.0 | 3177 | 2.325150 | 14.233350 | 30436 | 0.507273 |
19521 | 8805627 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7089.0 | 3456 | 2.051215 | 7.678129 | 28144 | 0.516124 |
9304 | 8773417 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 3559.0 | 2100 | 1.694762 | 4.584727 | 30436 | 0.516689 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
24909 | 8817504 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 694.0 | 373 | 1.860590 | 5.899867 | 25787 | 1.109984 |
25053 | 8817648 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 567.0 | 440 | 1.288636 | 0.902832 | 25787 | 1.114052 |
24128 | 8816723 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 39-year-old human stage | HsapDv:0000133 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 700.0 | 458 | 1.528384 | 1.842737 | 25787 | 1.117524 |
24387 | 8816982 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 563.0 | 340 | 1.655882 | 2.739641 | 25787 | 1.126728 |
6719 | 8770832 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 563.0 | 340 | 1.655882 | 2.739641 | 30436 | 1.129278 |
31780 rows × 27 columns
The closest cells are cerebellar granule cells, and the farthest are endothelial cells of the artery. Let's check this along a longer list before we come to any conclusions.
# Top 20 cells closest to the center
adata_sorted.obs.head(20)
index | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | tissue | tissue_ontology_term_id | tissue_general | tissue_general_ontology_term_id | raw_sum | nnz | raw_mean_nnz | raw_variance_nnz | n_measured_vars | distance_from_center |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18977 | 8805083 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7387.0 | 3177 | 2.325150 | 14.233350 | 28144 | 0.504095 |
11231 | 8775344 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 74-year-old human stage | HsapDv:0000168 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 15343.0 | 4819 | 3.183856 | 50.794335 | 30436 | 0.506040 |
8151 | 8772264 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7387.0 | 3177 | 2.325150 | 14.233350 | 30436 | 0.507273 |
19521 | 8805627 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7089.0 | 3456 | 2.051215 | 7.678129 | 28144 | 0.516124 |
9304 | 8773417 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 3559.0 | 2100 | 1.694762 | 4.584727 | 30436 | 0.516689 |
21980 | 8808086 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 6840.0 | 2943 | 2.324159 | 16.665110 | 28144 | 0.518092 |
22083 | 8808189 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 6048.0 | 2621 | 2.307516 | 16.059596 | 28144 | 0.518990 |
18419 | 8804525 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 4508.0 | 2077 | 2.170438 | 15.156871 | 28144 | 0.519240 |
18420 | 8804526 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 11328.0 | 4152 | 2.728324 | 31.684787 | 28144 | 0.519303 |
17671 | 8803777 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 20291.0 | 5117 | 3.965409 | 95.366082 | 28144 | 0.519756 |
3862 | 8767975 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 6676.0 | 2431 | 2.746195 | 26.775474 | 30436 | 0.521073 |
9351 | 8773464 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7089.0 | 3456 | 2.051215 | 7.678129 | 30436 | 0.523767 |
15352 | 8779465 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 7537.0 | 3108 | 2.425032 | 16.320416 | 30436 | 0.524051 |
19484 | 8805590 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 10069.0 | 4056 | 2.482495 | 20.787856 | 28144 | 0.524355 |
19907 | 8806013 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 74-year-old human stage | HsapDv:0000168 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 15343.0 | 4819 | 3.183856 | 50.794335 | 28144 | 0.525085 |
6952 | 8771065 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 11328.0 | 4152 | 2.728324 | 31.684787 | 30436 | 0.525178 |
9268 | 8773381 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 63-year-old human stage | HsapDv:0000157 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 10069.0 | 4056 | 2.482495 | 20.787856 | 30436 | 0.525478 |
19942 | 8806048 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 74-year-old human stage | HsapDv:0000168 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 17061.0 | 4816 | 3.542566 | 64.065477 | 28144 | 0.525521 |
19921 | 8806027 | 894573ad-498f-47ee-9bec-ad0880147eea | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 74-year-old human stage | HsapDv:0000168 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 16412.0 | 4935 | 3.325633 | 53.238287 | 28144 | 0.526828 |
11270 | 8775383 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | cerebellar granule cell | CL:0001031 | 74-year-old human stage | HsapDv:0000168 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 16412.0 | 4935 | 3.325633 | 53.238287 | 30436 | 0.528585 |
20 rows × 27 columns
# Cells farthest from the center
adata_sorted.obs.tail(20)
index | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | tissue | tissue_ontology_term_id | tissue_general | tissue_general_ontology_term_id | raw_sum | nnz | raw_mean_nnz | raw_variance_nnz | n_measured_vars | distance_from_center |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25070 | 8817665 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 789.0 | 577 | 1.367418 | 1.267548 | 25787 | 1.085520 |
1857 | 8765970 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 542.0 | 377 | 1.437666 | 2.401024 | 30436 | 1.086705 |
16058 | 8780171 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 594.0 | 434 | 1.368664 | 1.752919 | 30436 | 1.087188 |
2777 | 8766890 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 39-year-old human stage | HsapDv:0000133 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 554.0 | 418 | 1.325359 | 0.939451 | 30436 | 1.090194 |
3547 | 8767660 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 39-year-old human stage | HsapDv:0000133 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 700.0 | 458 | 1.528384 | 1.842737 | 30436 | 1.091174 |
25060 | 8817655 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 1027.0 | 695 | 1.477698 | 3.970323 | 25787 | 1.091251 |
16133 | 8780246 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 1027.0 | 695 | 1.477698 | 3.970323 | 30436 | 1.092581 |
25054 | 8817649 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 1233.0 | 709 | 1.739069 | 7.859785 | 25787 | 1.092610 |
25106 | 8817701 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 1800.0 | 949 | 1.896733 | 8.685527 | 25787 | 1.097101 |
23781 | 8813093 | 84242d25-f656-4ca6-8e8d-f3d2beeba11f | 10x 3' v3 | EFO:0009922 | central nervous system macrophage | CL:0000878 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 717.0 | 466 | 1.538627 | 2.649042 | 23987 | 1.099800 |
24075 | 8816670 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 39-year-old human stage | HsapDv:0000133 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 554.0 | 418 | 1.325359 | 0.939451 | 25787 | 1.102289 |
25045 | 8817640 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 594.0 | 434 | 1.368664 | 1.752919 | 25787 | 1.105064 |
16103 | 8780216 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 567.0 | 440 | 1.288636 | 0.902832 | 30436 | 1.107093 |
14944 | 8779057 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 694.0 | 373 | 1.860590 | 5.899867 | 30436 | 1.108619 |
16472 | 8780585 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 1800.0 | 949 | 1.896733 | 8.685527 | 30436 | 1.109879 |
24909 | 8817504 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 694.0 | 373 | 1.860590 | 5.899867 | 25787 | 1.109984 |
25053 | 8817648 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 567.0 | 440 | 1.288636 | 0.902832 | 25787 | 1.114052 |
24128 | 8816723 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 39-year-old human stage | HsapDv:0000133 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 700.0 | 458 | 1.528384 | 1.842737 | 25787 | 1.117524 |
24387 | 8816982 | 3d044b52-140a-4528-bf0d-a2dbef9e1f40 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 563.0 | 340 | 1.655882 | 2.739641 | 25787 | 1.126728 |
6719 | 8770832 | c05e6940-729c-47bd-a2a6-6ce3730c4919 | 10x 3' v3 | EFO:0009922 | endothelial cell of artery | CL:1000413 | 71-year-old human stage | HsapDv:0000165 | normal | PATO:0000461 | ... | white matter of cerebellum | UBERON:0002317 | central nervous system | UBERON:0001017 | 563.0 | 340 | 1.655882 | 2.739641 | 30436 | 1.129278 |
20 rows × 27 columns
We see that cerebellar granule cells, in data that come from the white matter of the cerebellum, are at the center of the embedding. This is somewhat expected. We note that there are oligodendrocyte precursor cells in the data as well; per my original hypothesis, I expected these precursor cells to be more central than the rest, but the 20 closest cells are all cerebellar granule cells.
Blood vessel and blood-related cells seem to be the farthest out. This fits my hypothesis that the outer edge holds outliers. If the typical cell in this region of the model's embedding (which encompasses much more than the CNS, so we can only say so much here) is a CNS/cerebellum-related cell, then an arterial endothelial cell is likely to be "farther away" from the CNS/cerebellum-specific cells.
Now we are going to look again at the UMAP colored by cell type, because we want to see where the actual center and outer edges of the embedding fall with respect to the UMAP coordinates.
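Rather than reading the head and tail tables by eye, one could tally cell types among the n closest and n farthest cells. The following is a minimal sketch of that idea on invented toy values (the distances and labels below are made up for illustration and are not taken from the dataset). The helper name `label_counts_at_extremes` is hypothetical.

```python
from collections import Counter
import numpy as np

def label_counts_at_extremes(distances, labels, n):
    """Count labels among the n smallest and the n largest distances."""
    distances = np.asarray(distances)
    labels = np.asarray(labels)
    order = np.argsort(distances)  # closest first
    closest = Counter(labels[order[:n]].tolist())
    farthest = Counter(labels[order[-n:]].tolist())
    return closest, farthest

# Invented toy values, loosely mimicking the pattern in the tables above
toy_distances = np.array([0.50, 0.51, 0.52, 0.80, 0.90, 1.10, 1.11, 1.12])
toy_labels = np.array([
    "granule", "granule", "granule", "astrocyte",
    "oligodendrocyte", "endothelial", "macrophage", "endothelial",
])

closest, farthest = label_counts_at_extremes(toy_distances, toy_labels, n=3)
print(closest.most_common())
print(farthest.most_common())
```

With the real data, `distances` would be `adata.obs['distance_from_center']` and `labels` would be `adata.obs['cell_type']`, and n could be raised well beyond 20 to see how quickly other cell types start to appear.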
# Plot the UMAP colored by cell type
sc.pl.umap(adata, color='cell_type', title='UMAP colored by cell type')
And now we color by distance from the center. Will the center of the UMAP have the lowest "distance from center"?
sc.pl.umap(adata, color='distance_from_center', title='UMAP colored by distance from center')
No!
We can already see that the "centerness" is not reflected on the UMAP. It appears that the center of the embedding sits at the north and south ends of the UMAP. In other words, if we're asking questions about the "centerness" of a model, we cannot rely on a UMAP to tell us.
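One way to put a number on this mismatch is to compare each cell's distance from the centroid in the embedding with its distance from the centroid of the 2D layout. The sketch below does that with a Spearman correlation. It assumes that "center" means the Euclidean centroid (the mean vector), which may differ from how `distance_from_center` was computed earlier. `centerness_agreement` is a hypothetical helper, and the random arrays only stand in for the real embedding and UMAP coordinates.

```python
import numpy as np
import pandas as pd

def distance_from_centroid(points):
    """Euclidean distance of each row from the mean of all rows (assumed definition of 'center')."""
    points = np.asarray(points, dtype=float)
    return np.linalg.norm(points - points.mean(axis=0), axis=1)

def centerness_agreement(embedding, layout_2d):
    """Spearman correlation (Pearson on ranks) between the two sets of center distances."""
    d_embed = pd.Series(distance_from_centroid(embedding)).rank()
    d_layout = pd.Series(distance_from_centroid(layout_2d)).rank()
    return d_embed.corr(d_layout)

# Stand-in data: independent random arrays, so agreement should be near zero
rng = np.random.default_rng(0)
fake_embedding = rng.normal(size=(500, 50))
fake_layout = rng.normal(size=(500, 2))
agreement = centerness_agreement(fake_embedding, fake_layout)

# Sanity check: a layout compared with itself agrees perfectly
self_agreement = centerness_agreement(fake_layout, fake_layout)
```

On the real object, the call would presumably look like `centerness_agreement(adata.obsm['uce'], adata.obsm['X_umap'])`. A value near 1 would mean the UMAP preserves center-ness, and a value near 0 would mean it does not.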
Let's make this a bit more explicit by thresholding the center and the outer edges. The first of the UMAPs below will light up the 2000 cells closest to the center, and the second will light up the 2000 cells closest to the outer edge.
# Color by only top n cells from the center
top_n = adata_sorted.obs.head(2000).index
top_n_mask = adata.obs.index.isin(top_n)
adata.obs['top_n_from_center'] = top_n_mask
sc.pl.umap(adata, color="top_n_from_center", title='UMAP colored by top n from center')
# Color by only top n cells from the outer edge
top_n = adata_sorted.obs.tail(2000).index
top_n_mask = adata.obs.index.isin(top_n)
adata.obs['top_n_from_outside'] = top_n_mask
sc.pl.umap(adata, color="top_n_from_outside", title='UMAP colored by top n from outside')
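The two cells above rely on `adata_sorted` being sorted by distance from the center in ascending order. As a hedged alternative, the same two masks can be built directly from the distance column, with no pre-sorted copy. `flag_extremes` is a hypothetical helper and the toy series stands in for `adata.obs['distance_from_center']`.

```python
import pandas as pd

def flag_extremes(distance, n):
    """Return boolean Series marking the n smallest and the n largest values."""
    closest = distance.index.isin(distance.nsmallest(n).index)
    farthest = distance.index.isin(distance.nlargest(n).index)
    return (
        pd.Series(closest, index=distance.index),
        pd.Series(farthest, index=distance.index),
    )

# Toy stand-in for the per-cell distance column
toy = pd.Series([0.9, 0.7, 0.95, 0.8, 0.99], index=list("abcde"))
near_mask, far_mask = flag_extremes(toy, n=2)
```

Note that this assumes the index labels are unique, as they are for cells in an AnnData object.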
It's possible that the center simply reflects weighting by cell type frequency. The center seems to hold the cell types with the highest frequency, and the outside seems to hold the cell types with the lowest frequency, like some sort of gravity well.
Luckily this is testable. We will do that by making a new data frame that simply gives us the cell type and average distance from center. We note that there might be substantial variance in some of these. But we will start here.
# Take the metadata, and make a new data frame that has the cell type and the average distance from the center
cluster_means = adata.obs[['cell_type', 'distance_from_center']]
# Groupby cell types, take the mean, but we also need a frequency column
cluster_means = cluster_means.groupby('cell_type').agg(
mean_distance=('distance_from_center', 'mean'),
frequency=('cell_type', 'count')
)
# Sort by mean distance
cluster_means.sort_values('mean_distance')
cell_type | mean_distance | frequency |
---|---|---|
oligodendrocyte | 0.686349 | 10924 |
cerebellar granule cell | 0.749444 | 8678 |
oligodendrocyte precursor cell | 0.859274 | 2036 |
differentiation-committed oligodendrocyte precursor | 0.862943 | 306 |
neuron | 0.888968 | 52 |
microglial cell | 0.900753 | 2562 |
GABAergic neuron | 0.910608 | 1744 |
astrocyte | 0.914900 | 1557 |
vascular associated smooth muscle cell | 0.916619 | 158 |
glutamatergic neuron | 0.927685 | 996 |
mural cell | 0.929756 | 1076 |
capillary endothelial cell | 0.936451 | 1072 |
leukocyte | 0.939225 | 146 |
central nervous system macrophage | 0.940465 | 110 |
ependymal cell | 0.967948 | 27 |
endothelial cell of artery | 0.990752 | 336 |
There looks to be a rough trend, but it is far from perfect. Let's plot this to get a bit more clarity.
# Make a scatterplot of the mean distance from the center by frequency
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['frequency'], cluster_means['mean_distance'], s=10)
plt.xlabel('Frequency')
plt.ylabel('Mean Distance from Center')
plt.title('Mean Distance from Center by Frequency')
plt.show()
# Make the same plot but log transform frequency
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['frequency'], cluster_means['mean_distance'], s=10)
plt.xscale('log')
plt.xlabel('Frequency')
plt.ylabel('Mean Distance from Center')
plt.title('Mean Distance from Center by Frequency (log scale)')
plt.show()
So broadly speaking, the cell types with higher frequency are closer to the center of the embedding. But as soon as you get past the two most frequent cell types, the association is not as close.
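To go beyond reading the scatterplots by eye, the rank correlation between frequency and mean distance can be computed from the table above. The sketch below copies those 16 rows by hand and computes the Spearman correlation twice: once for all cell types and once after dropping the two most frequent ones. With only 16 points, either number should be read as a rough indication.

```python
import pandas as pd

# (frequency, mean_distance) pairs copied from the table above
rows = [
    (10924, 0.686349), (8678, 0.749444), (2036, 0.859274), (306, 0.862943),
    (52, 0.888968), (2562, 0.900753), (1744, 0.910608), (1557, 0.914900),
    (158, 0.916619), (996, 0.927685), (1076, 0.929756), (1072, 0.936451),
    (146, 0.939225), (110, 0.940465), (27, 0.967948), (336, 0.990752),
]
table = pd.DataFrame(rows, columns=["frequency", "mean_distance"])

# Spearman correlation across all 16 cell types
rho_all = table.corr(method="spearman").loc["frequency", "mean_distance"]

# Same statistic after dropping the two most frequent cell types
rest = table.sort_values("frequency", ascending=False).iloc[2:]
rho_rest = rest.corr(method="spearman").loc["frequency", "mean_distance"]

print(f"all types: {rho_all:.3f}, without top two: {rho_rest:.3f}")
```

In the notebook itself, the same call on `cluster_means[['frequency', 'mean_distance']]` would avoid the hand-copied values.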
We also note that some cells within a given subset are closer to the center than others. Recall that only a portion of the cerebellar granule cells were really at the center. The pattern is not literally cell type by cell type.
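The within-subset spread can be summarized alongside the mean. The sketch below uses a small toy frame in place of `adata.obs[['cell_type', 'distance_from_center']]` and reports the minimum, median, maximum, and standard deviation of the distance for each cell type. The toy labels and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs[['cell_type', 'distance_from_center']]
obs = pd.DataFrame({
    "cell_type": ["A", "A", "A", "B", "B", "B"],
    "distance_from_center": [0.60, 0.70, 0.80, 0.90, 0.92, 0.94],
})

# Per-type summary of how widely the distances are spread
spread = obs.groupby("cell_type")["distance_from_center"].agg(
    ["min", "median", "max", "std"]
)
# Width of the band of distances each type covers
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

A type with a large range relative to its neighbors in the table would be one where only part of the subset sits near the center.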
Let's look at adata again.
adata.obs
soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | tissue_general | tissue_general_ontology_term_id | raw_sum | nnz | raw_mean_nnz | raw_variance_nnz | n_measured_vars | distance_from_center | top_n_from_center | top_n_from_outside | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8752190 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | ependymal cell | CL:0000065 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 2374.0 | 1623 | 1.462723 | 4.991057 | 24817 | 0.992051 | False | True |
1 | 8752191 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 917.0 | 640 | 1.432813 | 1.638671 | 24817 | 0.933798 | False | False |
2 | 8752192 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 2241.0 | 1184 | 1.892736 | 5.527792 | 24817 | 0.960067 | False | True |
3 | 8752193 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 1255.0 | 887 | 1.414882 | 4.256573 | 24817 | 0.919938 | False | False |
4 | 8752194 | c8f83821-a242-4ed7-86e9-7da077f5d348 | 10x 3' v3 | EFO:0009922 | astrocyte | CL:0000127 | 34-year-old human stage | HsapDv:0000128 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 1491.0 | 816 | 1.827206 | 5.340658 | 24817 | 0.946847 | False | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
31775 | 8831489 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 23435.0 | 5149 | 4.551369 | 134.862017 | 28059 | 0.695828 | False | False |
31776 | 8831490 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 3134.0 | 1343 | 2.333582 | 15.222471 | 28059 | 0.745088 | False | False |
31777 | 8831491 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 3692.0 | 1632 | 2.262255 | 21.365883 | 28059 | 0.737275 | False | False |
31778 | 8831492 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 11394.0 | 3554 | 3.205965 | 58.032433 | 28059 | 0.691257 | False | False |
31779 | 8831493 | 12194ced-8086-458e-84a8-e2ab935d8db1 | 10x 3' v3 | EFO:0009922 | oligodendrocyte | CL:0000128 | 73-year-old human stage | HsapDv:0000167 | normal | PATO:0000461 | ... | central nervous system | UBERON:0001017 | 7072.0 | 2502 | 2.826539 | 31.751586 | 28059 | 0.693035 | False | False |
31780 rows × 29 columns
So we looked at frequency of subset in terms of how close the cells are to the center of the embedding. Now let's go back to cell-by-cell and look at density.
Are the cells closer to the center more densely packed? Is there some sort of "model gravity"? We note again that we are only looking at a piece of the model (CNS), due to availability, but there still might be local areas of high density that go across subsets at the level of, for example, organ system or species. This would be analogous to galactic superclusters in astronomy.
Given the dimensionality of the data, we are going to use a KNN density estimator, as Nikolay Samusik did for the CyTOF clustering algorithm X-shift. I'll note that I tried a kernel density estimator, but it took way too long.
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Set the number of neighbors
k = 5
# Fit the NearestNeighbors model
nbrs = NearestNeighbors(n_neighbors=k, algorithm='auto').fit(adata.obsm['uce'])
# Compute the distances to the k-nearest neighbors
distances, indices = nbrs.kneighbors(adata.obsm['uce'])
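# Estimate density as the inverse of the distance to the last returned neighbor.
# Note: we query the same points the model was fit on, so each cell's first
# neighbor is itself (distance 0) and the last column is the (k-1)th other cell.
density = 1 / (distances[:, -1] + 1e-10) # Add a small constant to avoid division by zero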
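Because the block above fits and queries on the same points, each cell counts as its own nearest neighbor. A variant that excludes the cell itself is sketched below. It is not the notebook's method, and it would shift the absolute density values reported later. `knn_density` is a hypothetical helper, and the toy points stand in for `adata.obsm['uce']`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(points, k=5):
    """Inverse distance to the k-th nearest *other* point."""
    nn = NearestNeighbors(n_neighbors=k).fit(points)
    # Calling kneighbors() with no argument queries the fitted points
    # and does not count a point as its own neighbor.
    distances, _ = nn.kneighbors()
    return 1.0 / (distances[:, -1] + 1e-10)

# Toy example: three evenly spaced points and one isolated point
toy_points = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_density = knn_density(toy_points, k=1)
```

In the toy example the isolated point at 10 gets the lowest density, since its nearest other point is 8 units away.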
Now we have to color the UMAP by density to see if we notice any patterns.
# Add density to adata
adata.obs['density'] = density
# Plot
sc.pl.umap(adata, color='density', title='UMAP colored by density')
We do see what might be a correlation between density and whether you're at the center of the manifold. But let's check that real quick with another correlation plot like before.
# Add density to the metadata
adata.obs['density'] = density
import matplotlib.patches as mpatches
# Plot density vs distance from center and color by cell type. Label the axes and add a colorbar
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
adata.obs['distance_from_center'], adata.obs['density'],
c=adata.obs['cell_type'].cat.codes, cmap='tab20', s=10
)
# Label the axes
plt.xlabel('Distance from Center')
plt.ylabel('Density')
# Create legend handles
unique_categories = adata.obs['cell_type'].cat.categories
# plt.get_cmap is used here because plt.cm.get_cmap was removed in newer matplotlib releases
colors = plt.get_cmap('tab20', len(unique_categories))(np.arange(len(unique_categories)))
handles = [mpatches.Patch(color=colors[i], label=unique_categories[i]) for i in range(len(unique_categories))]
# Add the legend
plt.legend(handles=handles, title="Cell Type", bbox_to_anchor=(1.05, 1), loc='upper left')
# Show the plot
plt.show()
There appears to be a slight trend, though the shape of the plot is interesting.
It appears as if each cell subset occupies a specific distance from the center. It does not look like multiple subsets are "in orbit" at the same distance around the center (otherwise you would see a fair amount of overlap between subsets here).
It is also interesting that there is a wide distribution of density within each cell subset. I am going to guess that the high density is the center of a given subset and the low density is the outside. We note that the highest densities do in fact come from oligodendrocytes and cerebellar granule cells, which have the highest frequencies in the dataset.
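The guess that high density marks the middle of a subset can be checked by correlating, within each subset, a cell's density with its distance from that subset's own centroid. The sketch below is one way to do it, assuming a Euclidean centroid per subset. `density_vs_subset_center` is a hypothetical helper, and the single random blob with a brute-force KNN density is only a stand-in for the real embedding, labels, and density column.

```python
import numpy as np
import pandas as pd

def density_vs_subset_center(embedding, labels, density):
    """Per-subset Spearman correlation between density and distance to the subset's own centroid."""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    density = np.asarray(density, dtype=float)
    result = {}
    for name in pd.unique(labels):
        pos = np.flatnonzero(labels == name)
        members = embedding[pos]
        dist = pd.Series(np.linalg.norm(members - members.mean(axis=0), axis=1))
        dens = pd.Series(density[pos])
        # Spearman correlation computed as Pearson correlation on ranks
        result[name] = dist.rank().corr(dens.rank())
    return pd.Series(result)

# Stand-in data: one Gaussian blob, density from a brute-force KNN estimate
rng = np.random.default_rng(0)
blob = rng.normal(size=(300, 2))
pairwise = np.linalg.norm(blob[:, None, :] - blob[None, :, :], axis=2)
# Column 0 of the sorted distances is the point itself, so column 5 is the 5th other point
toy_density = 1.0 / (np.sort(pairwise, axis=1)[:, 5] + 1e-10)
toy_rho = density_vs_subset_center(blob, ["blob"] * 300, toy_density)
```

A clearly negative value for a subset would support the guess for that subset: denser cells sit nearer its centroid.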
Back to the slight trend. Let's just check correlation.
# Check spearman correlation between density and distance from center
adata.obs[['density', 'distance_from_center']].corr(method='spearman')
density | distance_from_center | |
---|---|---|
density | 1.000000 | -0.332414 |
distance_from_center | -0.332414 | 1.000000 |
So it's a weak negative correlation. Cells in denser regions tend to be closer to the center, and cells in sparser regions tend to be farther from it.
Thus, there is a sort of local density for each of the clusters that represents cluster-ness, but then there is perhaps a sort of global density for the model that, along with the distance from the center, represents model-ness. Or it might be that if there are more cells in a given cluster, there is going to be a higher density, regardless of whether the cells are at the center or not.
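One way to separate these two explanations would be to downsample every cell type to the same number of cells and re-estimate density on the downsampled set. If frequent types are still denser, cell count alone does not explain it. The sketch below covers only the downsampling step. `equal_size_subsample` is a hypothetical helper, and the toy labels stand in for `adata.obs['cell_type']`.

```python
import numpy as np
import pandas as pd

def equal_size_subsample(labels, n_per_group, seed=0):
    """Return sorted row positions with at most n_per_group rows per label."""
    labels = pd.Series(np.asarray(labels))
    rng = np.random.default_rng(seed)
    keep = []
    # .indices maps each label to the integer positions of its rows
    for _, positions in labels.groupby(labels).indices.items():
        take = min(n_per_group, len(positions))
        keep.extend(rng.choice(positions, size=take, replace=False))
    return np.sort(np.array(keep))

# Toy labels with very unequal group sizes
toy_labels = ["a"] * 100 + ["b"] * 10 + ["c"] * 3
kept = equal_size_subsample(toy_labels, n_per_group=5)
kept_counts = pd.Series(toy_labels).iloc[kept].value_counts()
```

The returned positions could then be used to slice the embedding (for example `adata.obsm['uce'][kept]`) before running the density estimate again. Types with fewer cells than `n_per_group` are kept whole.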
So the last thing we will do is jump up to the subset level as before and look at average density, and frequency. Let's make the table.
# Make a table of cell subset, frequency, mean distance from center, and mean density
cluster_means = adata.obs[['cell_type', 'distance_from_center', 'density']]
cluster_means = cluster_means.groupby('cell_type').agg(
frequency=('cell_type', 'count'),
mean_distance=('distance_from_center', 'mean'),
mean_density=('density', 'mean')
)
cluster_means.sort_values('mean_distance')
cell_type | frequency | mean_distance | mean_density |
---|---|---|---|
oligodendrocyte | 10924 | 0.686349 | 6.679530 |
cerebellar granule cell | 8678 | 0.749444 | 8.729607 |
oligodendrocyte precursor cell | 2036 | 0.859274 | 7.116778 |
differentiation-committed oligodendrocyte precursor | 306 | 0.862943 | 3.618157 |
neuron | 52 | 0.888968 | 3.080095 |
microglial cell | 2562 | 0.900753 | 4.082637 |
GABAergic neuron | 1744 | 0.910608 | 4.177983 |
astrocyte | 1557 | 0.914900 | 4.842773 |
vascular associated smooth muscle cell | 158 | 0.916619 | 2.908818 |
glutamatergic neuron | 996 | 0.927685 | 3.246564 |
mural cell | 1076 | 0.929756 | 4.775011 |
capillary endothelial cell | 1072 | 0.936451 | 4.658878 |
leukocyte | 146 | 0.939225 | 2.559924 |
central nervous system macrophage | 110 | 0.940465 | 2.511820 |
ependymal cell | 27 | 0.967948 | 4.161853 |
endothelial cell of artery | 336 | 0.990752 | 2.671052 |
And now we will make some plots.
# Plot mean density vs mean distance from center
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['mean_distance'], cluster_means['mean_density'], s=10)
plt.xlabel('Mean Distance from Center')
plt.ylabel('Mean Density')
plt.title('Mean Density vs Mean Distance from Center')
plt.show()
# The spearman correlation
cluster_means[['mean_distance', 'mean_density']].corr(method='spearman')
mean_distance | mean_density | |
---|---|---|
mean_distance | 1.000000 | -0.608824 |
mean_density | -0.608824 | 1.000000 |
We see a fairly strong negative correlation at the subset level. This means that subsets with a higher mean density tend to be closer to the center of the dataset. While there could be some sort of "global density" or "global gravity" associated with this model, we have to check frequency first, given that it appears to be associated with density as well.
Ok, let's look at density by frequency.
# Plot mean density vs frequency
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['frequency'], cluster_means['mean_density'], s=10)
plt.xscale('log')
plt.xlabel('Frequency')
plt.ylabel('Mean Density')
plt.title('Mean Density vs Frequency (log scale)')
plt.show()
# The spearman correlation
cluster_means[['frequency', 'mean_density']].corr(method='spearman')
frequency | mean_density | |
---|---|---|
frequency | 1.000000 | 0.758824 |
mean_density | 0.758824 | 1.000000 |
We have an even stronger positive correlation between density and frequency. This suggests that the more cells of a specific type there are, the higher their mean density tends to be in this high-dimensional embedding.
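Since frequency is associated with both density and distance from the center, one hedged follow-up is a partial rank correlation: the association between mean distance and mean density after accounting for frequency. The sketch below uses the standard first-order partial correlation formula on Spearman coefficients, with the 16 rows copied by hand from the table above. `partial_spearman` is a hypothetical helper, and with so few points the result is only suggestive.

```python
import numpy as np
import pandas as pd

# (frequency, mean_distance, mean_density) copied from the table above
rows = [
    (10924, 0.686349, 6.679530), (8678, 0.749444, 8.729607),
    (2036, 0.859274, 7.116778), (306, 0.862943, 3.618157),
    (52, 0.888968, 3.080095), (2562, 0.900753, 4.082637),
    (1744, 0.910608, 4.177983), (1557, 0.914900, 4.842773),
    (158, 0.916619, 2.908818), (996, 0.927685, 3.246564),
    (1076, 0.929756, 4.775011), (1072, 0.936451, 4.658878),
    (146, 0.939225, 2.559924), (110, 0.940465, 2.511820),
    (27, 0.967948, 4.161853), (336, 0.990752, 2.671052),
]
table = pd.DataFrame(rows, columns=["frequency", "mean_distance", "mean_density"])

def partial_spearman(df, x, y, control):
    """Spearman correlation of x and y with the rank-linear effect of control removed."""
    r = df[[x, y, control]].corr(method="spearman")
    r_xy, r_xz, r_yz = r.loc[x, y], r.loc[x, control], r.loc[y, control]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

raw = table.corr(method="spearman").loc["mean_distance", "mean_density"]
partial = partial_spearman(table, "mean_distance", "mean_density", "frequency")
print(f"raw: {raw:.3f}, controlling for frequency: {partial:.3f}")
```

A partial value much closer to zero than the raw one would suggest that frequency accounts for a good part of the density and distance relationship at the subset level.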
Center-ness appears to be a worthwhile feature to look at in the context of these embeddings that come from the single-cell foundation models. It is important to note that this feature must be computed in the embedding space, and not the UMAP space. We have shown here that UMAP does not properly capture the actual center and outer edges of the embedding.
Distance from the center is associated with both frequency and density. My best guess is that the cells closest to the center are the most representative of the CNS section of the embedding, so that is where we are going to see them. As we move out from that, we move in the direction of other organ systems, so cells that are less CNS-like but are nonetheless in the CNS-labeled piece (e.g. arterial endothelial cells) will be sitting there.
We note that it would be much more interesting to do this "center-ness" analysis on the entire model (60M cells), because this would tell us a lot about which cells in biology in general were considered more "globally central" by the model, and which were considered to be "outliers" (where again there could be technical artifacts).
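At that scale the embedding would likely not fit in memory at once, so the centroid and the distances would have to be computed in chunks. The sketch below shows a two-pass version, assuming the Euclidean centroid as the "center" and assuming the embedding can be read as a sequence of arrays. Both function names are hypothetical, and the random array split into pieces stands in for chunks read from disk.

```python
import numpy as np

def streaming_centroid(chunks):
    """Mean vector over an iterable of (n_i, d) arrays, one chunk at a time."""
    total, count = None, 0
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        chunk_sum = chunk.sum(axis=0)
        total = chunk_sum if total is None else total + chunk_sum
        count += chunk.shape[0]
    return total / count

def streaming_distances(chunks, centroid):
    """Yield the Euclidean distances from the centroid for each chunk."""
    for chunk in chunks:
        yield np.linalg.norm(np.asarray(chunk, dtype=np.float64) - centroid, axis=1)

# Stand-in data: a small array split into uneven pieces
rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 8))
pieces = np.array_split(full, 7)

centroid = streaming_centroid(pieces)                                  # first pass
dists = np.concatenate(list(streaming_distances(pieces, centroid)))    # second pass
```

The first pass accumulates sums to get the centroid, and the second pass computes per-cell distances, so only one chunk needs to be in memory at a time.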
In other words, there is still a lot to be done in this direction, but for those exploring these models, now you have a few unique questions you can ask about these embeddings (this goes for NLP too), and the code to get the answers.
The better you know the embedding, the better you can use the model. So explore it as much as you can. Ask and answer as many questions as you can. Look for good use-cases. Look for imperfections. Be both visionary and nitpicky. And of course, enjoy the process.