
UMAP does not capture the proper center and outer edges of the human CNS portion of the Universal Cell Embeddings (UCE) transformer foundation model¶

Tyler Burns
September 28 - September 29, 2024

Table of contents¶

  • Abstract
  • Introduction
  • Data collection
  • UMAP visualization of metadata
  • The center and the outer edges of the embedding
  • UMAP does not capture center-ness
  • Center-ness is associated with frequency
  • Density is associated with frequency and center-ness
  • Discussion and future directions

Abstract¶

Here, we take a look at the output of a transformer-based single-cell foundation model called Universal Cell Embeddings (UCE). The output is a 1280-dimensional embedding of around 30,000 single cells. Because it is the output of a black box model, we ask questions about the geometry of the embedding to see whether that can tell us anything about what the model is doing.

We find that there is a distinct "center" of the embedding and a distinct "outer edge." There are positive associations between center-ness, the frequency with which a cell subset occurs, and the density of cells in a given subset. We also find that UMAP does not properly represent the center-ness of the data, suggesting that you should not treat the center of the UMAP as the literal center of the dataset until you check it yourself (with the code provided).

Introduction¶

This notebook looks at a human central nervous system dataset as embedded by the Universal Cell Embeddings (UCE) transformer model. The paper describing the model can be found here.

UCE and a number of other transformer-based models embed single cells into a high-dimensional vector space. Additional data can be added to the model, and placed in the same vector space. As such, these models allow for things like per-cell annotation.

The Chan Zuckerberg Initiative (CZI) has done a fair amount of work in single-cell biology, bringing together many databases along with visualization tools in what they call CELLxGENE. Accordingly, CZI has some of these models in an accessible format for users. These CZI "census models" can be found on their website here.

At the time of writing (September 29, 2024), these models are fairly new. Accordingly, this Jupyter notebook is a first pass at understanding the properties of the high-dimensional embeddings that these models output.

Data collection¶

To save time (it takes 20 minutes to pull the AnnData object on my 16 GB MacBook Pro), I ran the code in this first block separately, saved the object, and read it back in below. The first code block is not executed in this notebook. It is shown for display purposes, so that you can run it on your end.

import cellxgene_census
import os

# Set the working directory to "data"
os.chdir('data')

print("setting connection")
census = cellxgene_census.open_soma(census_version="2023-12-15")

# Human UCE
print("getting human data")
adata = cellxgene_census.get_anndata(
    census,
    organism = "homo_sapiens",
    measurement_name = "RNA",
    obs_value_filter = "tissue_general == 'central nervous system'",
    obs_embeddings = ["uce"]
)

adata.write("human_uce_cns.h5ad")
In [227]:
import random
import scanpy as sc
random.seed(42)

import warnings
warnings.filterwarnings('ignore')

# Read in the pre-made anndata file, reading in h5ad
adata = sc.read("data/human_uce_cns.h5ad")

So now we have an AnnData object. This format was originally built for the scanpy package (for single-cell sequencing analysis: R users use Seurat, Python users use scanpy). You can read more about AnnData here.

Now let's look at the shape of the data.

In [228]:
adata.obsm['uce'].shape
Out[228]:
(31780, 1280)

Here, we have about 31k cells and 1280 "features." These features are the dimensions of the embedding output by the transformer. While not an exact analogy, this is comparable to the output from NLP models like BERT, which I have used heavily in the past in projects like this, and the content behind my TED talk here.

In short, cells that are similar to each other by whatever context the transformer was able to find will be grouped physically near each other in this embedding. The more cells the model was trained on, and especially the more diverse the training set, the more powerful the model is likely to be.
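
To make "near each other" concrete, here is a minimal sketch on made-up data, not on the UCE embedding itself. It assumes Euclidean distance as the notion of closeness (cosine similarity is another common choice). The names `nearest_neighbors` and `toy_embedding` are hypothetical and introduced only for illustration.

```python
import numpy as np

def nearest_neighbors(embedding, query_index, k=3):
    """Return the row positions of the k cells closest (Euclidean) to the query cell."""
    dists = np.linalg.norm(embedding - embedding[query_index], axis=1)
    dists[query_index] = np.inf  # exclude the query cell itself
    return np.argsort(dists)[:k]

# Toy data (an assumption for illustration): two well-separated groups of
# 10 "cells" each, in 5 dimensions instead of the real 1280.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(10, 5))
group_b = rng.normal(loc=5.0, scale=0.1, size=(10, 5))
toy_embedding = np.vstack([group_a, group_b])

# The neighbors of a cell from group A should all be other group A cells,
# i.e. row positions below 10.
neighbors_of_first = nearest_neighbors(toy_embedding, query_index=0, k=3)
print(neighbors_of_first)
```

In this toy case, similar "cells" are by construction close together, so a neighbor lookup recovers group membership. Whether the same holds for a real embedding depends on what the model has learned.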

UMAP visualization of metadata¶

Below, we will visualize these 1280 dimensions by means of compressing them to 2 dimensions using UMAP, a nonlinear dimensionality reduction tool that is commonly used at this point. While there is plenty about it to critique, it is sufficiently good for our purposes below.

In [229]:
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP()
embedding = reducer.fit_transform(adata.obsm['uce'])
plt.scatter(embedding[:,0], embedding[:,1])
Out[229]:
<matplotlib.collections.PathCollection at 0x2d144fef0>

We see that the data do fall into distinct "islands." This is fairly typical of what we would see if we simply ran UMAP on a compressed version of the top 2000 differentially expressed genes per cell. For more on a typical scRNA-seq analysis workup, go here.

But this only tells us that there are distinct islands. Our AnnData object has information about cell subset. Let's color the UMAP by those subsets and see where they fall. Below, we loop through all the metadata columns and color the UMAP by each categorical (or low-cardinality) one.

We are going to do a data dump of UMAPs colored by various pieces of metadata that might be of interest to some readers (e.g. gender). If you just want to cut to the chase, you can skip this section, as I will show the subsets plot in the subsequent section.

The fourth UMAP down is our cell subsets.

In [230]:
import pandas as pd

# Put umap into adata obsm
adata.obsm['X_umap'] = embedding

# Plot UMAP colored by each categorical column in adata.obs using scanpy
for column in adata.obs.columns:
    if isinstance(adata.obs[column].dtype, pd.CategoricalDtype) or adata.obs[column].nunique() < 20:  # Categorical, or a small number of unique values
        sc.pl.umap(adata, color=column)

The center and the outer edges of the embedding¶

If we look just at the cell subsets, we see a number of CNS populations, which serves as a good sanity check. The largest of these subsets are cerebellar granule cells and oligodendrocytes. The tissue here is specifically white matter of the cerebellum, so the former makes sense in that regard.

We now look at the metadata below, stored in the obs slot.

In [231]:
# Get metadata
adata.obs
Out[231]:
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... suspension_type tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars
0 8752190 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 ependymal cell CL:0000065 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 2374.0 1623 1.462723 4.991057 24817
1 8752191 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 917.0 640 1.432813 1.638671 24817
2 8752192 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 2241.0 1184 1.892736 5.527792 24817
3 8752193 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 1255.0 887 1.414882 4.256573 24817
4 8752194 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 1491.0 816 1.827206 5.340658 24817
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31775 8831489 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 23435.0 5149 4.551369 134.862017 28059
31776 8831490 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 3134.0 1343 2.333582 15.222471 28059
31777 8831491 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 3692.0 1632 2.262255 21.365883 28059
31778 8831492 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 11394.0 3554 3.205965 58.032433 28059
31779 8831493 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... nucleus white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7072.0 2502 2.826539 31.751586 28059

31780 rows × 26 columns

With something like a high dimensional embedding that comes from a black box model, we as skeptical biologists have no idea whether and how much to trust the model. This may change down the line as the field of explainable artificial intelligence matures. But until then, we have to ask very simple questions to get at the characteristics of the model and what it encoded.

Thus, we are going to ask a very simple set of questions. Given that we are dealing with a 1280-dimensional point cloud, which cells are at the very center of the point cloud, and which cells are at the outer edge?

I am guessing that the center of the model's output will be things that the model determined to be "central" to everything else. For example, there might be cells that share gene expression programmes with the rest of the cells in the model, or perhaps in some developmental datasets, cells that are precursors to a large fraction of the cells in the model.

Furthermore, I am guessing that cells on the outer edges would be least like the others. In other words, cells out here would be "outliers," either by being biologically different (e.g. contamination from a different organ system) or by being technical artifacts.

So let's look at the center and the outside of the embedding, below. We do this by first finding the center of the data, and we approximate that by finding the mean value of each of the 1280 dimensions.

In [232]:
# Find the mean of all coordinates in the manifold
center = adata.obsm['uce'].mean(axis=0)
center[0:10]
Out[232]:
array([-0.00088816,  0.02052302,  0.00202639,  0.00415612,  0.02179049,
        0.00748633,  0.00377401, -0.00308924,  0.01350688, -0.00047351],
      dtype=float32)

Next, we compute the distance from each cell to this "center" coordinate we just found.

In [233]:
import numpy as np

# Euclidean distance from each cell to the center that we just computed
distances = np.linalg.norm(adata.obsm['uce'] - center, axis=1)
distances[1:10]
len(distances)
Out[233]:
31780
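
The two steps above (per-dimension mean, then distance of each row from that mean) can be checked by hand on a tiny example. The sketch below uses four made-up points in 2 dimensions, an assumption chosen only so the answer is easy to verify.

```python
import numpy as np

# Four made-up "cells" at the corners of a square (the real data has 1280 dimensions).
points = np.array([
    [0.0, 0.0],
    [2.0, 0.0],
    [0.0, 2.0],
    [2.0, 2.0],
])

# axis=0 averages down each column, giving one value per dimension.
centroid = points.mean(axis=0)

# Broadcasting subtracts the centroid from every row; axis=1 then reduces
# each row to a single Euclidean distance.
toy_distances = np.linalg.norm(points - centroid, axis=1)

print(centroid)       # [1. 1.]
print(toy_distances)  # every corner is sqrt(2) from the centroid
```

Note that the mean is only one possible definition of "center." It is sensitive to how many cells of each type are present, which becomes relevant later in this notebook.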

From there, we add the distances to the metadata matrix that we have already seen, so we can sort by them.

In [234]:
# Add distances to the metadata
adata.obs['distance_from_center'] = distances

# Sort cells by distance from center, with the closest cells first
adata_sorted = adata[adata.obs['distance_from_center'].sort_values().index]

And now we have a look.

In [235]:
adata_sorted.obs
Out[235]:
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars distance_from_center
18977 8805083 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7387.0 3177 2.325150 14.233350 28144 0.504095
11231 8775344 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 74-year-old human stage HsapDv:0000168 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 15343.0 4819 3.183856 50.794335 30436 0.506040
8151 8772264 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7387.0 3177 2.325150 14.233350 30436 0.507273
19521 8805627 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7089.0 3456 2.051215 7.678129 28144 0.516124
9304 8773417 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 3559.0 2100 1.694762 4.584727 30436 0.516689
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24909 8817504 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 694.0 373 1.860590 5.899867 25787 1.109984
25053 8817648 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 567.0 440 1.288636 0.902832 25787 1.114052
24128 8816723 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 39-year-old human stage HsapDv:0000133 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 700.0 458 1.528384 1.842737 25787 1.117524
24387 8816982 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 563.0 340 1.655882 2.739641 25787 1.126728
6719 8770832 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 563.0 340 1.655882 2.739641 30436 1.129278

31780 rows × 27 columns

The cells closest to the center are cerebellar granule cells, and the cells farthest out are endothelial cells of the artery. Let's check this against a longer list before we come to any conclusions.

In [236]:
# Top 20 cells closest to the center
adata_sorted.obs.head(20)
Out[236]:
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars distance_from_center
18977 8805083 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7387.0 3177 2.325150 14.233350 28144 0.504095
11231 8775344 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 74-year-old human stage HsapDv:0000168 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 15343.0 4819 3.183856 50.794335 30436 0.506040
8151 8772264 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7387.0 3177 2.325150 14.233350 30436 0.507273
19521 8805627 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7089.0 3456 2.051215 7.678129 28144 0.516124
9304 8773417 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 3559.0 2100 1.694762 4.584727 30436 0.516689
21980 8808086 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 6840.0 2943 2.324159 16.665110 28144 0.518092
22083 8808189 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 6048.0 2621 2.307516 16.059596 28144 0.518990
18419 8804525 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 4508.0 2077 2.170438 15.156871 28144 0.519240
18420 8804526 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 11328.0 4152 2.728324 31.684787 28144 0.519303
17671 8803777 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 20291.0 5117 3.965409 95.366082 28144 0.519756
3862 8767975 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 6676.0 2431 2.746195 26.775474 30436 0.521073
9351 8773464 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7089.0 3456 2.051215 7.678129 30436 0.523767
15352 8779465 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 7537.0 3108 2.425032 16.320416 30436 0.524051
19484 8805590 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 10069.0 4056 2.482495 20.787856 28144 0.524355
19907 8806013 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 74-year-old human stage HsapDv:0000168 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 15343.0 4819 3.183856 50.794335 28144 0.525085
6952 8771065 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 11328.0 4152 2.728324 31.684787 30436 0.525178
9268 8773381 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 63-year-old human stage HsapDv:0000157 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 10069.0 4056 2.482495 20.787856 30436 0.525478
19942 8806048 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 74-year-old human stage HsapDv:0000168 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 17061.0 4816 3.542566 64.065477 28144 0.525521
19921 8806027 894573ad-498f-47ee-9bec-ad0880147eea 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 74-year-old human stage HsapDv:0000168 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 16412.0 4935 3.325633 53.238287 28144 0.526828
11270 8775383 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 cerebellar granule cell CL:0001031 74-year-old human stage HsapDv:0000168 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 16412.0 4935 3.325633 53.238287 30436 0.528585

20 rows × 27 columns

In [237]:
# Cells farthest from the center
adata_sorted.obs.tail(20)
Out[237]:
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars distance_from_center
25070 8817665 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 789.0 577 1.367418 1.267548 25787 1.085520
1857 8765970 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 542.0 377 1.437666 2.401024 30436 1.086705
16058 8780171 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 594.0 434 1.368664 1.752919 30436 1.087188
2777 8766890 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 39-year-old human stage HsapDv:0000133 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 554.0 418 1.325359 0.939451 30436 1.090194
3547 8767660 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 39-year-old human stage HsapDv:0000133 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 700.0 458 1.528384 1.842737 30436 1.091174
25060 8817655 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 1027.0 695 1.477698 3.970323 25787 1.091251
16133 8780246 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 1027.0 695 1.477698 3.970323 30436 1.092581
25054 8817649 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 1233.0 709 1.739069 7.859785 25787 1.092610
25106 8817701 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 1800.0 949 1.896733 8.685527 25787 1.097101
23781 8813093 84242d25-f656-4ca6-8e8d-f3d2beeba11f 10x 3' v3 EFO:0009922 central nervous system macrophage CL:0000878 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 717.0 466 1.538627 2.649042 23987 1.099800
24075 8816670 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 39-year-old human stage HsapDv:0000133 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 554.0 418 1.325359 0.939451 25787 1.102289
25045 8817640 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 594.0 434 1.368664 1.752919 25787 1.105064
16103 8780216 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 567.0 440 1.288636 0.902832 30436 1.107093
14944 8779057 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 694.0 373 1.860590 5.899867 30436 1.108619
16472 8780585 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 1800.0 949 1.896733 8.685527 30436 1.109879
24909 8817504 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 694.0 373 1.860590 5.899867 25787 1.109984
25053 8817648 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 567.0 440 1.288636 0.902832 25787 1.114052
24128 8816723 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 39-year-old human stage HsapDv:0000133 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 700.0 458 1.528384 1.842737 25787 1.117524
24387 8816982 3d044b52-140a-4528-bf0d-a2dbef9e1f40 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 563.0 340 1.655882 2.739641 25787 1.126728
6719 8770832 c05e6940-729c-47bd-a2a6-6ce3730c4919 10x 3' v3 EFO:0009922 endothelial cell of artery CL:1000413 71-year-old human stage HsapDv:0000165 normal PATO:0000461 ... white matter of cerebellum UBERON:0002317 central nervous system UBERON:0001017 563.0 340 1.655882 2.739641 30436 1.129278

20 rows × 27 columns

We see that cerebellar granule cells, in data that come from the white matter of the cerebellum, are at the center of the embedding. This is somewhat expected. We note that there are oligodendrocyte precursor cells in the data as well and, as per my original hypothesis, I had expected these precursor cells to be more central than the rest.

Blood vessel and blood-related cells seem to be the farthest out. This makes sense as per my hypothesis that we are dealing with outliers. If the typical cell in this region of the model's embedding (which encompasses much more than the CNS, so we can only say so much here) is a CNS/cerebellum-related cell, and we now have an arterial endothelial cell, it is likely to be "farther away" from the CNS/cerebellum-specific cells.

UMAP does not capture center-ness¶

Now we are going to look again at the UMAP colored by cell type, so that we can see where the actual center and outer edges of the embedding fall with respect to the UMAP coordinates.

In [238]:
# Plot the UMAP colored by cell type
sc.pl.umap(adata, color='cell_type', title='UMAP colored by cell type')

And now we color by distance from the center. Will the center of the UMAP have the lowest "distance from center?"

In [239]:
sc.pl.umap(adata, color='distance_from_center', title='UMAP colored by distance from center')

No!

We can already see that the "centerness" is not reflected on the UMAP. It appears that the center of the embedding is on the north and south end of the UMAP. In other words, if we're asking questions about the "centerness" of a model, we cannot rely on a UMAP to tell us.
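
One way to put a number on this, beyond looking at the plot, is to compare each cell's distance from the center in the original space with its distance from the center of the 2D layout. The sketch below is an assumption-laden illustration on synthetic data, not a result from this notebook: it uses a rank correlation (Pearson correlation of ranks, without tie handling), and `distance_from_centroid`, `rank_correlation`, and both synthetic layouts are hypothetical names introduced here.

```python
import numpy as np

def distance_from_centroid(points):
    """Euclidean distance of every row from the per-dimension mean."""
    return np.linalg.norm(points - points.mean(axis=0), axis=1)

def rank_correlation(x, y):
    """Pearson correlation of ranks. Ties are not averaged, which is
    acceptable for continuous distances where ties are unlikely."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Synthetic stand-in for a high-dimensional embedding.
rng = np.random.default_rng(0)
high_dim = rng.normal(size=(500, 10))
radius = distance_from_centroid(high_dim)

# A 2D layout that keeps each point's radius but assigns a random angle,
# so center-ness is preserved by construction.
angles = rng.uniform(0.0, 2.0 * np.pi, size=500)
preserving_layout = np.column_stack([radius * np.cos(angles),
                                     radius * np.sin(angles)])

# A 2D layout unrelated to the high-dimensional data.
scrambled_layout = rng.normal(size=(500, 2))

rho_preserving = rank_correlation(radius, distance_from_centroid(preserving_layout))
rho_scrambled = rank_correlation(radius, distance_from_centroid(scrambled_layout))
print(rho_preserving, rho_scrambled)
```

On the notebook's data, the analogous call would be something like `rank_correlation(adata.obs['distance_from_center'].to_numpy(), distance_from_centroid(adata.obsm['X_umap']))`. No value is reported here because this sketch was not run on the real data.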

Let's make this a bit more explicit by thresholding the center and the outer edges. The first of the UMAPs below will light up the 2000 cells closest to the center, and the second will light up the 2000 cells farthest from it.

In [240]:
# Color by only top n cells from the center
top_n = adata_sorted.obs.head(2000).index
top_n_mask = adata.obs.index.isin(top_n)

adata.obs['top_n_from_center'] = top_n_mask

sc.pl.umap(adata, color="top_n_from_center", title='UMAP colored by top n from center')
In [241]:
# Color by only top n cells from the outer edge
top_n = adata_sorted.obs.tail(2000).index
top_n_mask = adata.obs.index.isin(top_n)

adata.obs['top_n_from_outside'] = top_n_mask
sc.pl.umap(adata, color="top_n_from_outside", title='UMAP colored by top n from outside')
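
The two cells above differ only in taking the head or the tail of the sorted table. As a sketch of the same thresholding step on plain arrays, the helper below builds a boolean mask for the n smallest or n largest distances. The name `extreme_mask` and the toy values are hypothetical; the notebook itself goes through index labels and `isin`.

```python
import numpy as np

def extreme_mask(distances, n, closest=True):
    """Boolean mask marking the n smallest (closest=True) or n largest distances."""
    distances = np.asarray(distances)
    n = max(0, min(n, len(distances)))  # clamp n to a valid range
    order = np.argsort(distances)       # positions sorted from nearest to farthest
    chosen = order[:n] if closest else order[len(order) - n:]
    mask = np.zeros(len(distances), dtype=bool)
    mask[chosen] = True
    return mask

# Toy distances for five made-up cells.
toy = np.array([0.9, 0.5, 1.1, 0.6, 0.7])
center_mask = extreme_mask(toy, n=2, closest=True)
outer_mask = extreme_mask(toy, n=2, closest=False)
print(center_mask)  # marks the cells at 0.5 and 0.6
print(outer_mask)   # marks the cells at 0.9 and 1.1
```

Because the mask is positional, it can be assigned straight to a new `adata.obs` column of the same length, assuming the distances are in the same row order as `adata.obs`.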

Center-ness is associated with frequency¶

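It's possible that the center simply reflects weighting by cell type frequency. The center seems to be occupied by the cell types with the highest frequency, and the outside by the cell types with the lowest frequency, as if the abundant cell types formed some sort of gravity well. This would be plausible because we defined the center as the mean over all cells, so the most numerous cell types pull that mean toward themselves.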

Luckily this is testable. We will do that by making a new data frame that simply gives us the cell type and average distance from center. We note that there might be substantial variance in some of these. But we will start here.

In [242]:
# Take the metadata, and make a new data frame that has the cell type and the average distance from the center
cluster_means = adata.obs[['cell_type', 'distance_from_center']]

# Groupby cell types, take the mean, but we also need a frequency column
cluster_means = cluster_means.groupby('cell_type').agg(
    mean_distance=('distance_from_center', 'mean'),
    frequency=('cell_type', 'count')
)

# Sort by mean distance
cluster_means.sort_values('mean_distance')
Out[242]:
mean_distance frequency
cell_type
oligodendrocyte 0.686349 10924
cerebellar granule cell 0.749444 8678
oligodendrocyte precursor cell 0.859274 2036
differentiation-committed oligodendrocyte precursor 0.862943 306
neuron 0.888968 52
microglial cell 0.900753 2562
GABAergic neuron 0.910608 1744
astrocyte 0.914900 1557
vascular associated smooth muscle cell 0.916619 158
glutamatergic neuron 0.927685 996
mural cell 0.929756 1076
capillary endothelial cell 0.936451 1072
leukocyte 0.939225 146
central nervous system macrophage 0.940465 110
ependymal cell 0.967948 27
endothelial cell of artery 0.990752 336

There looks to be a rough trend, but it is far from perfect. Let's plot this to get a bit more clarity.

In [243]:
# Make a scatterplot of the mean distance from the center by frequency
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['frequency'], cluster_means['mean_distance'], s=10)
plt.xlabel('Frequency')
plt.ylabel('Mean Distance from Center')
plt.title('Mean Distance from Center by Frequency')
plt.show()
In [244]:
# Make the same plot but log transform frequency
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['frequency'], cluster_means['mean_distance'], s=10)
plt.xscale('log')
plt.xlabel('Frequency')
plt.ylabel('Mean Distance from Center')
plt.title('Mean Distance from Center by Frequency (log scale)')
plt.show()

So broadly speaking, the cell types with higher frequency are closer to the center of the embedding. But as soon as you get past the two most frequent cell types, the association is not as close.
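
To go beyond reading the trend off the scatterplots, a rank correlation can be computed directly from the table above. This is a minimal sketch, not part of the original analysis: the two arrays are copied by hand from the `cluster_means` output, and `spearman` is a hypothetical helper that assumes there are no tied values (true for these 16 rows).

```python
import numpy as np

# Values copied from the cluster_means table above (rows sorted by mean_distance)
frequency = np.array([10924, 8678, 2036, 306, 52, 2562, 1744, 1557,
                      158, 996, 1076, 1072, 146, 110, 27, 336])
mean_distance = np.array([0.686349, 0.749444, 0.859274, 0.862943, 0.888968,
                          0.900753, 0.910608, 0.914900, 0.916619, 0.927685,
                          0.929756, 0.936451, 0.939225, 0.940465, 0.967948,
                          0.990752])

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

rho = spearman(frequency, mean_distance)
print(f"Spearman rho (frequency vs mean distance): {rho:.3f}")
```

Inside the notebook itself, the same quantity should be available from `cluster_means[['frequency', 'mean_distance']].corr(method='spearman')`.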

We also note that within a given subset, some cells are closer to the center than others. Recall that only a portion of the cerebellar granule cells were really at the center. So the pattern is not strictly cell type by cell type.
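
The within-subset variation mentioned here can be summarized alongside the mean. The following is a sketch on a toy data frame: `obs` is a hypothetical stand-in for `adata.obs`, and the two cell types and their values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy stand-in for adata.obs (values are illustrative only)
obs = pd.DataFrame({
    "cell_type": ["a", "a", "a", "b", "b", "b"],
    "distance_from_center": [0.60, 0.70, 0.80, 0.90, 0.92, 0.94],
})

# Per-subset mean plus spread (standard deviation, min, max)
spread = obs.groupby("cell_type")["distance_from_center"].agg(
    ["mean", "std", "min", "max"]
)
print(spread)
```

A subset with a large `std`, or a `min` far below its `mean`, would be one where only part of the population sits near the center.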

Let's look at adata again.

In [245]:
adata.obs
Out[245]:
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars distance_from_center top_n_from_center top_n_from_outside
0 8752190 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 ependymal cell CL:0000065 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... central nervous system UBERON:0001017 2374.0 1623 1.462723 4.991057 24817 0.992051 False True
1 8752191 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... central nervous system UBERON:0001017 917.0 640 1.432813 1.638671 24817 0.933798 False False
2 8752192 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... central nervous system UBERON:0001017 2241.0 1184 1.892736 5.527792 24817 0.960067 False True
3 8752193 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... central nervous system UBERON:0001017 1255.0 887 1.414882 4.256573 24817 0.919938 False False
4 8752194 c8f83821-a242-4ed7-86e9-7da077f5d348 10x 3' v3 EFO:0009922 astrocyte CL:0000127 34-year-old human stage HsapDv:0000128 normal PATO:0000461 ... central nervous system UBERON:0001017 1491.0 816 1.827206 5.340658 24817 0.946847 False True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31775 8831489 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... central nervous system UBERON:0001017 23435.0 5149 4.551369 134.862017 28059 0.695828 False False
31776 8831490 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... central nervous system UBERON:0001017 3134.0 1343 2.333582 15.222471 28059 0.745088 False False
31777 8831491 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... central nervous system UBERON:0001017 3692.0 1632 2.262255 21.365883 28059 0.737275 False False
31778 8831492 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... central nervous system UBERON:0001017 11394.0 3554 3.205965 58.032433 28059 0.691257 False False
31779 8831493 12194ced-8086-458e-84a8-e2ab935d8db1 10x 3' v3 EFO:0009922 oligodendrocyte CL:0000128 73-year-old human stage HsapDv:0000167 normal PATO:0000461 ... central nervous system UBERON:0001017 7072.0 2502 2.826539 31.751586 28059 0.693035 False False

31780 rows × 29 columns

Density is associated with frequency and center-ness¶

So we looked at frequency of subset in terms of how close the cells are to the center of the embedding. Now let's go back to cell-by-cell and look at density.

Are the cells closer to the center more densely packed? Is there some sort of "model gravity?" We note that again we are only looking at a piece of the model (CNS), due to availability, but there still might be local areas of high density that go across subsets at the level of, for example, organ system or species. This would be analogous to galactic superclusters in astronomy.

Given the dimensionality of the data, we are going to use a KNN density estimator, as Nikolay Samusik did for the CyTOF clustering algorithm X-shift. I'll note that I tried a kernel density estimator, but it took way too long.

In [246]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Set the number of neighbors
k = 5

# Fit the NearestNeighbors model
nbrs = NearestNeighbors(n_neighbors=k, algorithm='auto').fit(adata.obsm['uce'])

# Compute the distances to the k-nearest neighbors
distances, indices = nbrs.kneighbors(adata.obsm['uce'])

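# Estimate density as the inverse of the distance in the last column.
# Note: each cell is returned as its own nearest neighbor (distance 0) here,
# so the last column is the distance to the (k - 1)th other cell.
density = 1 / (distances[:, -1] + 1e-10)  # Add a small constant to avoid division by zero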
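
To make the estimator concrete, here is a brute-force version on a tiny toy dataset. It is a sketch for intuition only: `knn_density` is a hypothetical helper, it builds the full pairwise distance matrix (so it would not scale to the roughly 30,000 cells used here), and, unlike the cell above, it explicitly skips the point itself when counting neighbors.

```python
import numpy as np

def knn_density(points, k, eps=1e-10):
    """Inverse distance to the k-th nearest *other* point (brute force)."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest other point.
    kth = np.sort(pairwise, axis=1)[:, k]
    return 1.0 / (kth + eps)

# Four points on a line: the isolated point at 7 should get the lowest density
toy = np.array([[0.0], [1.0], [3.0], [7.0]])
toy_density = knn_density(toy, k=1)
print(toy_density)
```

The choice of whether the query point counts as its own neighbor only shifts `k` by one, but it is worth keeping in mind when comparing density values across analyses.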

Now we have to color the UMAP by density to see if we notice any patterns.

In [247]:
# Add density to adata
adata.obs['density'] = density

# Plot
sc.pl.umap(adata, color='density', title='UMAP colored by density')

We do see what might be a correlation between density and whether you're at the center of the manifold. But let's check that real quick with another correlation plot like before.

In [248]:
# Add density to the metadata
adata.obs['density'] = density
In [249]:
import matplotlib.patches as mpatches

# Plot density vs distance from center and color by cell type. Label the axes and add a legend
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    adata.obs['distance_from_center'], adata.obs['density'], 
    c=adata.obs['cell_type'].cat.codes, cmap='tab20', s=10
)

# Label the axes
plt.xlabel('Distance from Center')
plt.ylabel('Density')

# Create legend handles
unique_categories = adata.obs['cell_type'].cat.categories
# plt.get_cmap is used because matplotlib.cm.get_cmap is unavailable in newer Matplotlib releases
colors = plt.get_cmap('tab20', len(unique_categories))(np.arange(len(unique_categories)))
handles = [mpatches.Patch(color=colors[i], label=unique_categories[i]) for i in range(len(unique_categories))]

# Add the legend
plt.legend(handles=handles, title="Cell Type", bbox_to_anchor=(1.05, 1), loc='upper left')

# Show the plot
plt.show()

There appears to be a slight trend, though the shape of the plot is interesting.

It appears as if each cell subset occupies its own specific range of distances from the center. It's not as if multiple subsets share the same "orbit" around the center (otherwise you would see a fair amount of overlap between subsets here).

It is also interesting that there is a wide spread of densities within each cell subset. I am going to guess that the high-density cells are at the center of a given subset and the low-density cells are at its outside. We note that the highest densities do in fact come from oligodendrocytes and cerebellar granule cells, which have the highest frequencies in the dataset.

Back to the slight trend. Let's just check correlation.

In [250]:
# Check spearman correlation between density and distance from center
adata.obs[['density', 'distance_from_center']].corr(method='spearman')
Out[250]:
density distance_from_center
density 1.000000 -0.332414
distance_from_center -0.332414 1.000000

So it's a weak negative correlation. Cells in denser regions are closer to the center. Cells in sparser regions are farther from the center.

Thus, there is a sort of local density for each of the clusters, that represents cluster-ness, but then there is perhaps a sort of global density for the model that, along with the distance from the center, represents model-ness. Or it might be that if there are more cells in a given cluster, there is going to be a higher density, regardless of whether the cells are at the center or not.
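
One way to tease these two explanations apart, which this notebook does not do, would be to downsample every cell type to the same number of cells and then re-estimate density on the balanced subset. The sketch below shows only the sampling step. `balanced_subsample` and the toy labels are hypothetical, and the group size would be limited by the rarest subset (27 ependymal cells in this dataset).

```python
import numpy as np

def balanced_subsample(labels, n_per_group, seed=0):
    """Return sorted row indices with at most n_per_group rows per label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        size = min(n_per_group, len(members))
        # Sample without replacement so no cell is picked twice
        keep.append(rng.choice(members, size=size, replace=False))
    return np.sort(np.concatenate(keep))

# Toy labels with very uneven group sizes (illustrative only)
toy_labels = np.array(["oligo"] * 50 + ["ependymal"] * 5 + ["astro"] * 20)
idx = balanced_subsample(toy_labels, n_per_group=5)
```

With such indices in hand, the KNN density cell could be rerun on `adata.obsm['uce'][idx]`. If the frequent subsets were still denser after balancing, that would argue against density being a pure side effect of cell counts.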

So the last thing we will do is jump up to the subset level as before and look at average density, and frequency. Let's make the table.

In [251]:
# Make a table of cell subset, frequency, mean distance from center, and mean density
cluster_means = adata.obs[['cell_type', 'distance_from_center', 'density']]
cluster_means = cluster_means.groupby('cell_type').agg(
    frequency=('cell_type', 'count'),
    mean_distance=('distance_from_center', 'mean'),
    mean_density=('density', 'mean')
)
cluster_means.sort_values('mean_distance')
Out[251]:
frequency mean_distance mean_density
cell_type
oligodendrocyte 10924 0.686349 6.679530
cerebellar granule cell 8678 0.749444 8.729607
oligodendrocyte precursor cell 2036 0.859274 7.116778
differentiation-committed oligodendrocyte precursor 306 0.862943 3.618157
neuron 52 0.888968 3.080095
microglial cell 2562 0.900753 4.082637
GABAergic neuron 1744 0.910608 4.177983
astrocyte 1557 0.914900 4.842773
vascular associated smooth muscle cell 158 0.916619 2.908818
glutamatergic neuron 996 0.927685 3.246564
mural cell 1076 0.929756 4.775011
capillary endothelial cell 1072 0.936451 4.658878
leukocyte 146 0.939225 2.559924
central nervous system macrophage 110 0.940465 2.511820
ependymal cell 27 0.967948 4.161853
endothelial cell of artery 336 0.990752 2.671052

And now we will make some plots.

In [252]:
# Plot mean density vs mean distance from center
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['mean_distance'], cluster_means['mean_density'], s=10)
plt.xlabel('Mean Distance from Center')
plt.ylabel('Mean Density')
plt.title('Mean Density vs Mean Distance from Center')
plt.show()

# The spearman correlation
cluster_means[['mean_distance', 'mean_density']].corr(method='spearman')
Out[252]:
mean_distance mean_density
mean_distance 1.000000 -0.608824
mean_density -0.608824 1.000000

We see a moderately strong negative correlation at the subset level (Spearman correlation of about -0.61 across 16 subsets). This means that subsets with a higher mean density tend to be closer to the center of the dataset. While there could be some sort of "global density" or "global gravity" associated with this model, we have to check frequency first, given that it appears to be associated with density as well.

Ok, let's look at density by frequency.

In [253]:
# Plot mean density vs frequency
plt.figure(figsize=(8, 6))
plt.scatter(cluster_means['frequency'], cluster_means['mean_density'], s=10)
plt.xscale('log')
plt.xlabel('Frequency')
plt.ylabel('Mean Density')
plt.title('Mean Density vs Frequency (log scale)')
plt.show()

# The spearman correlation
cluster_means[['frequency', 'mean_density']].corr(method='spearman')
Out[253]:
frequency mean_density
frequency 1.000000 0.758824
mean_density 0.758824 1.000000

We have a stronger positive correlation between density and frequency. This suggests that if there are more cells of a specific type, they are going to have a higher mean density in this high-dimensional embedding.
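
Since frequency is associated with both density and distance from the center, one rough way to ask how much of the density and distance relationship remains once frequency is accounted for is a first-order partial correlation on the ranks. This is a sketch that goes beyond what the notebook ran: the arrays are copied by hand from the table above, `spearman` and `partial_corr` are hypothetical helpers, and with only 16 subsets the result should be read as a rough guide rather than a firm estimate.

```python
import numpy as np

# Columns copied from the cluster_means table above, in the same row order
frequency = np.array([10924, 8678, 2036, 306, 52, 2562, 1744, 1557,
                      158, 996, 1076, 1072, 146, 110, 27, 336])
mean_distance = np.array([0.686349, 0.749444, 0.859274, 0.862943, 0.888968,
                          0.900753, 0.910608, 0.914900, 0.916619, 0.927685,
                          0.929756, 0.936451, 0.939225, 0.940465, 0.967948,
                          0.990752])
mean_density = np.array([6.679530, 8.729607, 7.116778, 3.618157, 3.080095,
                         4.082637, 4.177983, 4.842773, 2.908818, 3.246564,
                         4.775011, 4.658878, 2.559924, 2.511820, 4.161853,
                         2.671052])

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

r_dist_dens = spearman(mean_distance, mean_density)
r_dist_freq = spearman(mean_distance, frequency)
r_dens_freq = spearman(mean_density, frequency)

# Density vs distance, with the rank of frequency held fixed
partial = partial_corr(r_dist_dens, r_dist_freq, r_dens_freq)
print(f"density vs distance, controlling for frequency: {partial:.3f}")
```

If the partial correlation is much closer to zero than the raw correlation, frequency would account for a large share of the apparent link between density and center-ness.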

Discussion and future directions¶

Center-ness appears to be a worthwhile feature to look at in the context of these embeddings that come from the single-cell foundation models. It is important to note that this feature must be computed in the embedding space, and not the UMAP space. We have shown here that UMAP does not properly capture the actual center and outer edges of the embedding.

Distance from the center is associated with both frequency and density. My best guess is that the cells closest to the center are the most representative of the CNS section of the embedding. As we move outward, we move in the direction of other organ systems, so cells that are less CNS-like but nonetheless carry the CNS label (e.g. arterial endothelial cells) end up sitting near the outer edge.

We note that it would be much more interesting to do this "center-ness" analysis on the entire model (60M cells), because this would tell us a lot about which cells in biology in general the model considers more "globally central," and which it considers "outliers" (where again there could be technical artifacts).

In other words, there is still a lot to be done in this direction, but for those exploring these models, now you have a few unique questions you can ask about these embeddings (this goes for NLP too), and the code to get the answers.

The better you know the embedding, the better you can use the model. So explore it as much as you can. Ask and answer as many questions as you can. Look for good use cases. Look for imperfections. Be both visionary and nitpicky. And of course, enjoy the process.