

Imagine a universe entirely without structure, without shape, without connections. A cloud of microscopic events, like fragments of space-time … except that there is no space or time. What characterizes one point in space, for one instant? Just the values of the fundamental particle fields, just a handful of numbers. Now, take away all notions of position, arrangement, order, and what’s left? A cloud of random numbers.

But if the pattern that is me could pick itself out from all the other events taking place on this planet, why shouldn’t the pattern we think of as ‘the universe’ assemble itself, find itself, in exactly the same way? If I can piece together my own coherent space and time from data scattered so widely that it might as well be part of some giant cloud of random numbers, then what makes you think that you’re not doing the very same thing?

Greg Egan, Permutation City


0.1 Introduction

Here, we are going to take the Samusik dataset and the LLM-generated data, merge them together, process them, and make our UMAP. We will then color the UMAP by which dataset each cell came from.

So we can be clear about how the LLM-generated data came about, here is exactly how it was done. The LLM used was Claude Sonnet 4.6, a frontier model at the time of the experiment. It was accessed through OpenRouter, where the prompt of simulating one cell from the Samusik dataset was run over and over again for a set number of iterations.

Below is the script (which was not run directly as part of this markdown). Note that in order to use this script, I set up a “chatbot” tool that is accessible through the command line. I show you how to do this here.

num_runs=1000; # However many you want
prompt="Please simulate one cell in the flagship Samusik CyTOF dataset. Make the data raw. Use floating points as you'd do in a fcs file. I'll do the asinh transform later. I would like you to output as follows.

Here are the names of the parameters: Time, Cell_length, BC1, BC2, BC3, BC4, BC5, BC6, Ter119, CD45.2, Ly6G, IgD, CD11c, F480, CD3, NKp46, CD23, CD34, CD115, CD19, 120g8, CD8, Ly6C, CD4, CD11b, CD27, CD16_32, SiglecF, Foxp3, B220, CD5, FceR1a, TCRgd, CCR7, Sca1, CD49b, cKit, CD150, CD25, TCRb, CD43, CD64, CD138, CD103, IgM, CD44, MHCII, DNA1, DNA2, Cisplatin, beadDist

You will create the numbers for these parameters. Your output will just be the numbers comma separated. No names, but the numbers will be outputted in the order of the markers that we specified above, from Time to beadDist.

Just output this. No other commentary";

for i in $(seq 1 $num_runs);
do
    chatbot "claude" "$prompt" >> output.txt;
done
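The chatbot wrapper itself is not shown here. For the curious, below is a minimal sketch of what such a wrapper could look like against the OpenRouter API; the model slug and the response field paths are assumptions on my part, not the actual tool.

```shell
#!/usr/bin/env bash
# Hypothetical "chatbot"-style wrapper: $1 is a model nickname (unused in this
# sketch), $2 is the prompt. Assumes OPENROUTER_API_KEY is set and jq is installed.
prompt="$2"
# Build the JSON body with jq so quotes and newlines in the prompt survive
payload=$(jq -n --arg p "$prompt" \
  '{model: "anthropic/claude-sonnet-4.5", messages: [{role: "user", content: $p}]}')
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$payload" \
  | jq -r '.choices[0].message.content'
```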

0.2 Process data

library(tidyverse)
suppressPackageStartupMessages(library(here))
setwd(here("..", "data"))

num_cells <- 10000
sam <- readr::read_rds("samusik_01.rds") %>% as_tibble()
sam <- sam[sample(nrow(sam), num_cells),]

sam$dataset <- "samusik"

Now we read in our LLM results.

setwd(here("..", "local_data"))

# Read in the file
llm <- readr::read_lines("output.txt")

# Tidy
llm <- lapply(llm, function(i) stringr::str_split(i, ",")[[1]] %>% as.numeric()) %>%
  do.call(rbind, .) %>%
  as_tibble()

names(llm) <- names(sam)
llm$dataset <- "llm"
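One caveat worth flagging with this parsing step: nothing guarantees that every LLM reply parses into exactly the expected number of numeric fields. A hedged guard, on toy stand-in data (the real pipeline would use the read_lines() output and the length of the marker panel), could drop malformed replies before the rbind:

```r
# Toy stand-in for the read_lines() output; "oops" and "4,5" are malformed
raw_lines <- c("1.5,2,3", "oops", "4,5", "6,7.2,8")
expected_n <- 3   # would be 51 for the real Time-to-beadDist panel

# Parse each line, then keep only rows with the right field count and no NAs
parsed <- lapply(strsplit(raw_lines, ","), function(x) suppressWarnings(as.numeric(x)))
ok <- vapply(parsed, function(x) length(x) == expected_n && !anyNA(x), logical(1))
clean <- do.call(rbind, parsed[ok])
sum(ok)  # 2 well-formed replies survive
```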

And now for the mishmash:

cells <- bind_rows(sam, llm)
cells
## # A tibble: 11,058 × 52
##        Time Cell_length   BC1   BC2   BC3   BC4     BC5    BC6  Ter119 CD45.2   Ly6G    IgD   CD11c   F480
##       <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
##  1 10918876          43  173. 200.   210. 12.0   6.56    2.57   0.903    4.37  1.10  -0.270 -0.258   1.32 
##  2 13201089          21  504. 281.   337.  5.03  7.93    5.85  -0.597   23.3   2.30  -0.433  0.856   9.22 
##  3  6335806          17  241. 190.   297. 16.1   4.76    3.75  -0.655    2.89  0.300  1.57  -0.432   0.342
##  4 14895295          19  152.  91.9  122.  6.80  0.717  -0.306 -0.378   13.5  -0.168 64.6   -0.0998 -0.548
##  5 14834515          18  331. 251.   440.  9.18  2.64    0.465 -0.184    8.87  1.01  -0.266 -0.682   9.20 
##  6   563248          19  243. 196.   186.  2.23  3.36    0.133 -0.367   25.2   0.686 -0.207 -0.210   5.14 
##  7  9141112          24  100. 118.   169. 17.1   1.12    0.271 -0.0644   7.10  0.421  0.536 -0.233   5.94 
##  8 13914301          18  228. 214.   308. 23.2  -0.0301  3.36   2.41    12.6   2.36  -0.395 -0.648   2.39 
##  9  3131507          25  226. 170.   233.  8.45  7.82   10.2   -0.652   27.2   2.03  -0.315  1.27   42.4  
## 10  8605530          27  275. 178.   316. 25.5   8.82    2.98  -0.433   12.9   0.882 -0.333 -0.310  -0.643
## # ℹ 11,048 more rows
## # ℹ 38 more variables: CD3 <dbl>, NKp46 <dbl>, CD23 <dbl>, CD34 <dbl>, CD115 <dbl>, CD19 <dbl>,
## #   `120g8` <dbl>, CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>, CD16_32 <dbl>,
## #   SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>, CCR7 <dbl>, Sca1 <dbl>,
## #   CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>, CD64 <dbl>, CD138 <dbl>,
## #   CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>, DNA1 <dbl>, DNA2 <dbl>, Cisplatin <dbl>,
## #   beadDist <dbl>, dataset <chr>

From here, we do our pre-processing. Do we remember what the surface markers are for the Samusik dataset?

setwd(here("..", "local_data"))
md <- readr::read_rds("marker_types.rds")
type_markers <- dplyr::filter(md, marker_class == "type")$marker_name 

surface <- cells[type_markers]
surface <- asinh(surface/5) %>% as_tibble()
surface
## # A tibble: 11,058 × 39
##     Ter119 CD45.2    Ly6G     IgD   CD11c    F480     CD3    NKp46     CD23     CD34    CD115    CD19
##      <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
##  1  0.180   0.789  0.218  -0.0539 -0.0517  0.262  -0.120  -0.148    0.00429  0.946   -0.00555 -0.101 
##  2 -0.119   2.24   0.446  -0.0865  0.170   1.37   -0.0261 -0.0619  -0.0452   0.337   -0.0923  -0.0732
##  3 -0.131   0.551  0.0600  0.310  -0.0862  0.0683  0.340   0.00179 -0.0294   1.45     0.731   -0.0558
##  4 -0.0756  1.72  -0.0336  3.25   -0.0200 -0.109  -0.155   0.115    0.227   -0.00814 -0.0119   1.71  
##  5 -0.0367  1.34   0.200  -0.0532 -0.136   1.37    0.0372  0.730    0.0470   0.845    1.81    -0.150 
##  6 -0.0734  2.32   0.137  -0.0413 -0.0421  0.901   0.0716 -0.137   -0.112    1.37     1.77     0.139 
##  7 -0.0129  1.15   0.0840  0.107  -0.0466  1.01    0.119  -0.157   -0.0433  -0.108    0.387   -0.0885
##  8  0.464   1.65   0.457  -0.0790 -0.129   0.462  -0.0627 -0.140   -0.00825  0.399   -0.0699   0.0409
##  9 -0.130   2.40   0.395  -0.0630  0.251   2.83    0.302   1.16    -0.0799   0.306    0.594    0.0980
## 10 -0.0865  1.68   0.175  -0.0666 -0.0619 -0.128  -0.0132 -0.0161  -0.0732   1.58     0.536   -0.0219
## # ℹ 11,048 more rows
## # ℹ 27 more variables: `120g8` <dbl>, CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>,
## #   CD16_32 <dbl>, SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>,
## #   CCR7 <dbl>, Sca1 <dbl>, CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>,
## #   CD64 <dbl>, CD138 <dbl>, CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>
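Before moving on, a quick note on what asinh(x/5) (the standard CyTOF transform, with cofactor 5) is doing: it is roughly linear near zero, so small negative values survive, and logarithmic for large values, so each raw decade becomes an equal-sized step.

```r
# asinh(x/5): linear-ish near zero, log-like for large x
asinh(0 / 5)                      # 0: zero maps to zero
asinh(-1 / 5)                     # ~ -0.199: small negatives stay small and negative
asinh(5000 / 5) - asinh(500 / 5)  # ~ 2.30, i.e. log(10): a 10x raw change
                                  # becomes a fixed-size transformed step
```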

0.3 Visualize results

And from here, we are going to make a UMAP.

library(umap)

dimr <- umap(surface)$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")

And from here, we are going to plot it as such:

library(ggplot2)
ggplot(dimr, aes(x = umap1, y = umap2, color = cells$dataset)) +
  geom_point()

0.4 Visualize details

We are now going to go a bit more in depth and determine whether the “island” of LLM-generated cells has any sort of biological plausibility at all. To this end, we are going to start with some simple plots, per marker.

ggplot(surface, aes(x = CD45.2, color = cells$dataset)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is already a problem. The LLM data has CD45 sitting very high. Let’s look at a biaxial now. A standard one: CD3 by CD11b.

ggplot(surface, aes(x = CD3, y = CD11b, color = cells$dataset)) + geom_point()

We can see that this is absolute garbage. So this suggests the data that was pulled out of the LLM makes absolutely no sense. Let’s try another one to see if the pattern holds.

ggplot(surface, aes(x = CD4, y = CD8, color = cells$dataset)) + geom_point()

Yes, the pattern still holds. We see things that make no sense, like CD4+ CD8+ cells. What this might suggest is a sort of next-token prediction taken to the extreme, where the model was just outputting values more or less at random.

We note that both the value distributions and the combinatorics of the markers are off: the distributions are much tighter, and there are marker combinations that make no sense.

Let’s do some more summary statistics. We just said that the value distributions are tighter. We can look at that a bit more closely. First, we will split our processed surface marker data.

surface_sam <- dplyr::filter(surface, cells$dataset == "samusik")
surface_llm <- dplyr::filter(surface, cells$dataset == "llm")

From here, we will run our summary statistics.

apply(surface_sam, 2, summary)
##              Ter119    CD45.2         Ly6G         IgD       CD11c         F480         CD3       NKp46
## Min.    -0.19016366 0.1373186 -0.187688925 -0.18889759 -0.19011366 -0.189593166 -0.18824457 -0.18799585
## 1st Qu. -0.11603891 0.8751696 -0.006833073 -0.09822101 -0.10952191 -0.005818719 -0.10232228 -0.09045308
## Median  -0.06781758 1.4206448  0.204757288 -0.03072763 -0.05042880  0.329297839 -0.03403289 -0.01504077
## Mean    -0.02420201 1.4536391  0.310679707  0.32568406  0.06942321  0.636488142  0.16442833  0.12334237
## 3rd Qu. -0.01606623 1.9679153  0.535478280  0.17221381  0.05045427  1.018653927  0.12733671  0.19236035
## Max.     1.40240393 4.4354540  2.633664261  4.47037564  4.29823038  5.992107183  4.27247637  2.38599891
##                CD23        CD34       CD115        CD19       120g8         CD8        Ly6C         CD4
## Min.    -0.18988526 -0.18841122 -0.18740831 -0.18903033 -0.18780069 -0.18904360 -0.18710816 -0.19060301
## 1st Qu. -0.11179613 -0.04098163  0.03501379 -0.09769185 -0.10470486 -0.10284890  0.06080606 -0.06966148
## Median  -0.05482977  0.15893521  0.32545574 -0.02172942 -0.04220942 -0.03813746  1.98258249  0.07327919
## Mean     0.01912709  0.46696830  0.49978284  0.21548033  0.17983092  0.13230566  1.98359971  0.36153713
## 3rd Qu.  0.01786339  0.64528997  0.81276852  0.27012358  0.09609460  0.11658728  3.58530069  0.46398813
## Max.     2.48313964  4.79682270  3.81789678  3.14682837  5.17723352  5.02384261  6.40423000  6.11056252
##               CD11b       CD27    CD16_32     SiglecF       Foxp3        B220         CD5     FceR1a
## Min.    -0.18787495 -0.1894748 -0.1841988 -0.18623274 -0.18983913 -0.18860330 -0.18764791 -0.1891613
## 1st Qu.  0.03832147 -0.0622791  1.6342462  0.03510659 -0.08682806 -0.07000977  0.03637336 -0.0720568
## Median   1.39813860  0.1065137  2.8887704  0.37118737 -0.00253725  0.07642104  0.29984983  0.0520543
## Mean     1.80935542  0.4823741  2.6192100  0.67652048  0.11904613  0.82988936  0.56677586  0.1741945
## 3rd Qu.  3.47878162  0.5780958  3.6649569  0.98288575  0.20543632  1.26447642  0.76579921  0.2894772
## Max.     6.03959576  4.9172335  6.7695805  5.31040054  2.67291918  5.30962116  6.42188386  3.6151468
##                TCRgd        CCR7        Sca1       CD49b         cKit       CD150        CD25        TCRb
## Min.    -0.190200662 -0.18927588 -0.18829775 -0.19064167 -0.189021321 -0.18877257 -0.18903614 -0.19083474
## 1st Qu. -0.111388009 -0.07606703 -0.03771335 -0.10166505 -0.086742817 -0.09849456 -0.10466090 -0.09806690
## Median  -0.056599576  0.03788346  0.16816522 -0.03532581 -0.009417906 -0.02771282 -0.03735943 -0.03026985
## Mean    -0.002960151  0.16375119  0.79517759  0.18190695  0.403733439  0.07549126  0.04755663  0.14730894
## 3rd Qu.  0.001927813  0.27507162  0.82555206  0.14380906  0.380707161  0.13319113  0.09678068  0.12761044
## Max.     2.273246265  2.64955647  7.12668585  4.72589475  5.616765855  2.57341739  4.94917915  4.56986693
##                CD43        CD64       CD138       CD103         IgM       CD44      MHCII
## Min.    -0.18834818 -0.18854107 -0.18619332 -0.18792006 -0.18889136 -0.1817913 -0.1856824
## 1st Qu.  0.07755234  0.08630538 -0.07793945 -0.10672799 -0.04143494  2.1828865  0.3753040
## Median   0.69137412  0.48202782  0.02170501 -0.04552825  0.15510626  3.7133102  0.9573889
## Mean     1.17521918  0.72992300  0.14329976  0.01775892  0.80894544  3.2423301  1.6343682
## 3rd Qu.  2.07871909  1.15570223  0.24189563  0.05656577  0.84892095  4.3657561  2.1325015
## Max.     5.87229564  5.70470558  3.27610534  3.44751803  8.15718185  6.8124887  7.6840793

And now we will do the same on the LLM data.

apply(surface_llm, 2, summary)
##              Ter119   CD45.2        Ly6G         IgD       CD11c        F480         CD3       NKp46
## Min.    -0.24358407 5.043725 -0.15142070 -0.20848635 -0.20261092 -0.20065089 -0.18494388 -0.16721960
## 1st Qu.  0.05996406 6.842333  0.03998934  0.01999867  0.03998934  0.03449315  0.01999867  0.01999867
## Median   0.07991491 7.048010  0.05996406  0.03998934  0.07991491  0.05996406  0.03998934  0.05397379
## Mean     0.10579625 7.056548  0.08124992  0.09519684  0.11134510  0.10511200  0.74695211  0.08275429
## 3rd Qu.  0.08788682 7.167594  0.07592703  0.05796753  0.09983408  0.07991491  0.06595218  0.05996406
## Max.     1.83289343 8.906865  1.54934538  7.05321994  2.24975640  7.05313694  7.59835948  1.54853220
##                CD23        CD34       CD115     CD19       120g8         CD8        Ly6C         CD4
## Min.    -0.18494388 -0.18494388 -0.18100993 0.000000 -0.18100993 -0.17510381 -0.17707320 -0.13756570
## 1st Qu.  0.01999867  0.01999867  0.01999867 6.476020  0.01999867  0.01999867  0.01999867  0.03998934
## Median   0.03998934  0.03998934  0.05996406 6.912342  0.03998934  0.04598379  0.05996406  0.05996406
## Mean     0.20607827  0.07109850  0.08853507 5.396984  0.10447629  0.45501336  0.40201223  2.32965535
## 3rd Qu.  0.07991491  0.06595218  0.07991491 7.130835  0.06595218  0.07991491  0.07991491  6.62082137
## Max.     7.22165227  1.21128984  1.77453898 8.894791  7.05317152  7.54579164  7.15067070  8.04832054
##               CD11b        CD27     CD16_32     SiglecF       Foxp3     B220         CD5      FceR1a
## Min.    -0.18690981 -0.14944312 -0.18494388 -0.17904190 -0.18297726 0.000000 -0.17116302 -0.15734989
## 1st Qu.  0.02999550  0.04598379  0.01999867  0.01999867  0.01999867 6.701628  0.03799086  0.01999867
## Median   0.05996406  6.10786936  0.03998934  0.03998934  0.03998934 7.119455  0.05996406  0.03998934
## Mean     0.52548637  3.68449444  0.30579851  0.07359914  0.13550081 6.128205  1.89078788  0.08089356
## 3rd Qu.  0.10132649  6.99729787  0.07991491  0.05996406  0.05996406 7.231808  6.19923476  0.05996406
## Max.     7.34637185  8.68753666  7.35284143  1.32688962  7.52259520 8.978333  7.53941024  6.62087466
##               TCRgd        CCR7        Sca1       CD49b        cKit       CD150        CD25        TCRb
## Min.    -0.18297726 -0.19083950 -0.16524692 -0.17707320 -0.18100993 -0.15142070 -0.18690981 -0.16721960
## 1st Qu.  0.01999867  0.01999867  0.01999867  0.01999867  0.01999867  0.01999867  0.01999867  0.01999867
## Median   0.03998934  0.03998934  0.04008926  0.03998934  0.03998934  0.04398582  0.03998934  0.05996406
## Mean     0.08751204  0.14680016  0.24460847  0.08887742  0.08062664  0.16013123  0.63088778  1.70828019
## 3rd Qu.  0.06196035  0.07991491  0.07991491  0.07941646  0.06595218  0.07991491  0.08340324  0.40391399
## Max.     7.08279442  7.32972389  7.43474432  6.84242825  1.46402005  7.38365935  7.45447025  8.68154088
##                CD43        CD64       CD138       CD103        IgM     CD44       MHCII
## Min.    -0.18690981 -0.17313375 -0.17510381 -0.18100993 0.00000000 0.000000 -0.16129965
## 1st Qu.  0.03998934  0.01999867  0.01999867  0.01999867 0.07592703 4.829017  0.03799086
## Median   0.06595218  0.03998934  0.03998934  0.03998934 6.60762156 6.909876  0.05597077
## Mean     1.88886045  0.07659209  0.10106515  0.08128831 4.19015198 5.338956  1.02714338
## 3rd Qu.  6.10786268  0.05996406  0.06146127  0.07592703 7.05397007 7.181547  0.12517282
## Max.     7.56270230  1.32688962  6.77933212  1.43194709 8.89478545 8.752822  8.53646238

We can at least see that the LLM data does indeed include negative values. This is otherwise quite a lot to take in, so we are going to make it a bit easier: we will plot the mean versus the standard deviation, marker by marker, and then color by dataset.

msd_sam <- apply(surface_sam, 2, function(i) {
  curr_mean <- mean(i)
  curr_sd <- sd(i)
  return(c(mean = curr_mean, sd = curr_sd))
}) %>% t() %>% as_tibble()

msd_llm <- apply(surface_llm, 2, function(i) {
  curr_mean <- mean(i)
  curr_sd <- sd(i)
  return(c(mean = curr_mean, sd = curr_sd))
}) %>% t() %>% as_tibble()

# what it looks like
msd_sam[1:5,]
## # A tibble: 5 × 2
##      mean    sd
##     <dbl> <dbl>
## 1 -0.0242 0.179
## 2  1.45   0.743
## 3  0.311  0.402
## 4  0.326  0.872
## 5  0.0694 0.378

Now we combine the data and label for our plot.

msd_sam$dataset <- "samusik"
msd_llm$dataset <- "llm"

msd <- dplyr::bind_rows(msd_sam, msd_llm)
msd[1:5,]
## # A tibble: 5 × 3
##      mean    sd dataset
##     <dbl> <dbl> <chr>  
## 1 -0.0242 0.179 samusik
## 2  1.45   0.743 samusik
## 3  0.311  0.402 samusik
## 4  0.326  0.872 samusik
## 5  0.0694 0.378 samusik

And from here, we make our plot.

ggplot(msd, aes(x = mean, y = sd, color = dataset)) + geom_point(size = 3)

And we have just discovered something very, very strange. The mean and sd of the markers form a parabola. This is actually very weird. I really did not expect this. What might it mean?

0.5 Investigating the parabola

The first thing we are going to do, in order to investigate this further, is fit a quadratic to the curve.

fit <- lm(sd ~ mean + I(mean^2), data = msd_llm)
coef(fit)  # c (Intercept), b (x), a (I(x^2))
## (Intercept)        mean   I(mean^2) 
##   0.2216885   2.0294554  -0.2843874
c0 <- coef(fit)[1]
b  <- coef(fit)[2]
a  <- coef(fit)[3]

sprintf("y = %.4f x^2 + %.4f x + %.4f", a, b, c0)
## [1] "y = -0.2844 x^2 + 2.0295 x + 0.2217"

Now we overlay the fitted curve on the plot:

ggplot(msd, aes(x = mean, y = sd, color = dataset)) +
    geom_point(size = 3) +
    stat_function(fun = function(x) a*x^2 + b*x + c0,
                  color = "black", linetype = "dashed")

Not perfect. The one we have is a bit shorter and wider. This warrants further investigation, but it will do for now. We’re moving to the outer edge of my wheelhouse now, but this might come about as a result of very tightly bounded mins and maxes in the sampling. To quote ChatGPT:

A very tight upside-down parabola in (mean, sd) across features usually means each marker is behaving like a bounded on/off variable with (almost) the same min and max across markers.

This is sort of what we saw in the biaxials. So here’s what we can do: the same mean-sd exercise, but with the min versus the max per marker.

mm_sam <- apply(surface_sam, 2, function(i) {
  curr_min <- min(i)
  curr_max <- max(i)
  return(c(min = curr_min, max = curr_max))
}) %>% t() %>% as_tibble()

mm_llm <- apply(surface_llm, 2, function(i) {
  curr_min <- min(i)
  curr_max <- max(i)
  return(c(min = curr_min, max = curr_max))
}) %>% t() %>% as_tibble()

# what it looks like
mm_sam[1:5,]
## # A tibble: 5 × 2
##      min   max
##    <dbl> <dbl>
## 1 -0.190  1.40
## 2  0.137  4.44
## 3 -0.188  2.63
## 4 -0.189  4.47
## 5 -0.190  4.30

And from here, we combine the data and make our plots.

mm_sam$dataset <- "samusik"
mm_llm$dataset <- "llm"

mm <- dplyr::bind_rows(mm_sam, mm_llm)
mm[1:5,]
## # A tibble: 5 × 3
##      min   max dataset
##    <dbl> <dbl> <chr>  
## 1 -0.190  1.40 samusik
## 2  0.137  4.44 samusik
## 3 -0.188  2.63 samusik
## 4 -0.189  4.47 samusik
## 5 -0.190  4.30 samusik

And now we plot:

ggplot(mm, aes(x = min, y = max, color = dataset)) + geom_point(size = 3)

So this tells us that the max values of the LLM dataset are more tightly clustered, at both high and low values, than those of the Samusik dataset. I think the point in the upper right of the LLM dataset is CD45, whose distribution we saw earlier.

So now we are going to attempt to reproduce the upside-down parabola. After doing some visualization, it seems that a bimodal distribution that is very tight around two values (e.g. 0 and 7) might produce exactly this: the mean and standard deviation would be tightly bounded by a parabola. A mean between the two peaks would have the maximum standard deviation (incorporating both peaks equally), while a mean near one side would have the minimum standard deviation, closest to that of the single peak, because most data points would sit on that peak and few on the other.

Let’s just run a simulation to put this into pictures.

make_bimodal <-  function(p, sd_tight = 0.001) {
  side <- sample(c(0, 1), 1, prob = c(1 - p, p))
  if(side == 0) {
    result <- rnorm(1, mean = 0, sd = sd_tight)
  } else {
    result <- rnorm(1, mean = 7, sd = sd_tight)
  }
  return(result)
}

probs <- runif(39, min = 0, max = 1)

dists <- lapply(seq(1000), function(i) {
  result <- sapply(probs, function(j) {
    make_bimodal(j)
  })
  return(result)
})

dists[1:3]
## [[1]]
##  [1]  7.000156e+00  6.998986e+00  6.999212e+00  6.999520e+00 -3.621034e-04  7.000346e+00 -3.403421e-04
##  [8]  6.997734e+00 -9.991783e-04  2.697323e-03  7.000889e+00  7.001053e+00  3.316079e-04  3.461340e-04
## [15]  7.000093e+00  6.999955e+00  6.216383e-04  6.999729e+00  1.217954e-03  6.999986e+00  6.998312e+00
## [22]  7.000178e+00  7.000536e+00 -3.131586e-04 -9.640669e-04 -3.844386e-04  7.000438e+00  7.441518e-04
## [29] -2.396784e-04  8.401598e-05 -4.203988e-04  4.841057e-04 -9.310505e-04  6.999681e+00  3.635099e-04
## [36]  7.000485e+00  7.956802e-04  7.000368e+00  7.001714e+00
## 
## [[2]]
##  [1]  5.053340e-04  6.999675e+00 -9.465689e-04  6.999416e+00 -1.525179e-03  6.999398e+00 -1.419020e-03
##  [8] -6.276005e-05 -9.035718e-04  7.346070e-04 -1.257077e-03  9.564911e-04  6.999904e+00 -5.628557e-04
## [15] -1.644493e-03 -8.551046e-04  7.000756e+00  6.999889e+00  6.999604e+00  6.999333e+00  4.968569e-04
## [22]  6.999753e+00  7.000967e+00  3.306486e-04 -2.068207e-03  7.000613e+00  3.798048e-04 -3.449821e-04
## [29] -1.474497e-03 -4.261783e-04  3.233054e-04 -7.004294e-05 -3.738492e-04  7.000600e+00 -1.430327e-03
## [36]  3.961244e-04  6.999576e+00  7.000288e+00 -6.058894e-04
## 
## [[3]]
##  [1] -1.866398e-03  7.000273e+00  7.000619e+00  6.999274e+00  5.677327e-04  6.997856e+00  5.110735e-04
##  [8]  5.418109e-04  1.737952e-03  3.033823e-04  7.000354e+00  2.023997e-03  7.000189e+00 -1.160867e-03
## [15]  6.997018e+00  6.999626e+00  7.001172e+00  7.000662e+00  6.998291e+00  7.002196e+00 -1.594206e-03
## [22]  6.999705e+00 -9.711627e-04 -7.280571e-05  7.000155e+00  7.000740e+00  7.000661e+00  1.740819e-04
## [29] -1.362673e-03  7.000483e+00  7.000479e+00  3.747400e-04  6.998916e+00  2.135212e-05  7.000594e+00
## [36]  7.001298e+00  1.055461e-03 -3.276885e-04 -7.465639e-04

And now we make our mean and standard deviation for this.

dists_mat <- do.call(rbind, dists)
dists_summary <- tibble(dist_mean = colMeans(dists_mat), dist_sd = apply(dists_mat, 2, sd))

And from here, we plot:

ggplot(dists_summary, aes(x = dist_mean, y = dist_sd)) + geom_point(size = 3)

Not shown: I tried the above with various standard deviations in the make_bimodal function. What I found is that as soon as I dropped the standard deviation to effectively zero, such that it was an on/off (Bernoulli) process, I was able to replicate the parabola.

This suggests that the LLM is thinking more in terms of on/off, rather than sampling from a distribution in its training data.
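This on/off picture also explains the parabola exactly. For a two-level variable that takes the value M with probability p and 0 otherwise, the population mean is p*M and the population sd is M*sqrt(p*(1-p)). Substituting p = mean/M gives sd = sqrt(mean * (M - mean)), so the squared sd is a quadratic in the mean, peaking at mean = M/2. A quick check, with M = 7 as in the simulation above:

```r
# Population mean and sd of a 0-or-M on/off variable across a range of
# "on" probabilities p
M <- 7
p <- seq(0.05, 0.95, by = 0.05)
mu <- p * M
bernoulli_sd <- M * sqrt(p * (1 - p))

# sd = sqrt(mean * (M - mean)) holds exactly (up to floating point)
max(abs(bernoulli_sd - sqrt(mu * (M - mu))))
```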

0.6 Raw distributions

0.6.1 Marker-positive values

We will check one more thing. We looked above at the post-transform distributions. Now we are going to look at the raw, pre-transform data. This might give us some insight into what the LLM is doing when it actually tries to pick values that “make sense.”

So we go back to the original datasets and have a look. We have already looked at CD45. So let’s see what that looks like raw. We go back to the “cells” object, which is not yet transformed.

The Samusik dataset.

ggplot(sam, aes(x = CD45.2)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The LLM dataset.

ggplot(llm, aes(x = CD45.2)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The LLM dataset does not look much like a real distribution. It seems to be bounded between something like 1000 counts and 5000 counts (with a few exceptions). This would suggest that the LLM is thinking in “boundaries.” We also notice that these values are very high in comparison to the Samusik dataset, which peaks in the hundreds.
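This also connects back to the transformed summary statistics, where the LLM’s CD45.2 sat in a tight band around 7: the asinh transform squeezes the entire raw 1000-5000 band into about 1.6 transformed units.

```r
# The raw 1000-5000 "boundary" band after asinh(x/5)
asinh(1000 / 5)                    # ~ 5.99
asinh(5000 / 5)                    # ~ 7.60
asinh(5000 / 5) - asinh(1000 / 5)  # ~ 1.61: the whole band lands near 7
```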

Now let’s go ahead and look at the biaxials we had previously plotted. Starting with CD3 and CD11b.

ggplot(cells, aes(x = CD3, y = CD11b, color = cells$dataset)) + geom_point()
## Warning: Use of `cells$dataset` is discouraged.
## ℹ Use `dataset` instead.

Once again, we see very large values for the marker-positive populations in the LLM dataset in comparison to the Samusik dataset. CD3+ seems to be bounded between 1000 and 5000 with a rather uniform distribution. CD11b+ seems to be bounded between 1000 and 4000 with a few exceptions underneath. So this points to bounded uniform distributions.

Now we can look at CD4 and CD8 again, untransformed, and see if we see similar patterns.

ggplot(cells, aes(x = CD4, y = CD8, color = cells$dataset)) + geom_point()
## Warning: Use of `cells$dataset` is discouraged.
## ℹ Use `dataset` instead.

We have the same story with the very large marker-positive values coming out of the LLM dataset. CD4+ has rather loose bounds between 1000 and 4000, and CD8+ has similar loose bounds. While CD4’s distribution looks fairly uniform (at least eyeballing it), CD8 looks like it has a cluster at 2000 and another that starts just below 3000. This would suggest different “levels” of CD8 in the training data; perhaps this value changes across cell subsets or datasets.

Either way, the untransformed LLM output seems to point toward bounded ranges. The shape of the distributions requires further investigation.
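If we wanted to go beyond eyeballing, one hedged way to test the “bounded uniform” impression would be a Kolmogorov-Smirnov test of the marker-positive raw values against a uniform on their observed range. Toy stand-in data below, not the real llm object; note that estimating the bounds from the data makes the test slightly anti-conservative.

```r
# Toy stand-in for something like the CD3+ raw values in the LLM output
set.seed(1)
vals <- runif(500, min = 1000, max = 5000)

# Compare to a uniform on the observed range; a small D statistic is
# consistent with the "bounded uniform" reading
ks <- ks.test(vals, "punif", min = min(vals), max = max(vals))
unname(ks$statistic)
```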

0.6.2 Marker-negative values

Now, a very interesting observation can be made if we look at the negative values. As per the plots above, they appear to be clustering around zero. But let’s take a closer look.

llm_surface_raw <- llm[type_markers]

apply(llm_surface_raw, 2, function(i) {
  i[i < 5] %>% table() %>% sort(decreasing = TRUE) %>% .[. > 10] 
})
## $Ter119
## .
##   0.4   0.3  0.08   1.4  0.44 0.031  0.18  0.31   0.8  1.23  1.45  0.45  0.15 0.041  0.05  0.67 
##   361   113    68    50    48    36    28    28    21    21    21    19    17    16    15    15 
## 
## $CD45.2
## integer(0)
## 
## $Ly6G
## .
##   0.3   0.2  0.38  0.05  0.28  0.08  0.21   1.3 0.028 0.038  0.55 0.024   1.8  0.18  0.03   0.6  1.45 
##   263   226    61    51    31    27    22    21    20    18    16    13    13    12    11    11    11 
## 
## $IgD
## .
##   0.2   0.1  0.29  0.03   0.3  0.19  0.09 0.029   1.8   1.3  0.11  0.48  1.12  0.22  1.23 
##   263   207    57    42    29    24    22    18    17    16    15    15    13    12    12 
## 
## $CD11c
## .
##   0.4   0.3   0.2  0.51  0.07  0.06  0.09  0.24   0.5   0.1  0.14 0.033   1.1   1.8  0.19 0.019  0.04 
##   287   117    52    43    40    27    22    22    20    19    16    14    13    13    12    11    11 
## 
## $F480
## .
##   0.3   0.2  0.04   0.1  0.33   0.4  0.11   1.6  0.42  0.03  0.07 0.033  0.19  0.06  0.29 
##   262   164    48    44    27    27    24    22    21    13    13    12    12    11    11 
## 
## $CD3
## .
##   0.2   0.1  0.33  0.07  0.06   0.3  0.08  0.03 0.027  0.17 
##   280   182    45    35    33    33    18    17    11    11 
## 
## $NKp46
## .
##   0.3   0.2   0.1  0.03  0.28  0.05  0.22  0.02  0.27  0.44  0.21 0.021   0.4  0.41 
##   242   163    95    38    25    20    16    15    14    14    12    11    11    11 
## 
## $CD23
## .
##   0.1   0.2   0.4   0.3  0.02  0.04  0.05  0.28  0.18  0.03 0.025  0.06  0.07  0.41   1.2 
##   213   182    61    52    29    20    16    16    15    14    12    12    12    11    11 
## 
## $CD34
## .
##   0.2   0.1   0.3  0.03   0.4  0.05  0.04 0.019  0.02  0.06  0.22  0.09 
##   216   199    62    29    27    23    22    21    21    20    14    11 
## 
## $CD115
## .
##  0.3  0.2  0.1  0.4 0.04 0.09 0.03 0.08 0.22  1.3 0.05 0.06 0.07 
##  250  124   80   45   25   19   18   15   14   14   13   13   11 
## 
## $CD19
## .
##  0.2  0.1  0.3  0.4 0.03 0.06 
##   49   23   19   15   14   12 
## 
## $`120g8`
## .
##   0.2   0.1   0.3  0.05   0.4  0.03  0.19  0.02  0.04  0.06  0.44 0.018  0.25  0.31  0.39   1.3 
##   240   197    46    24    24    23    17    16    16    15    15    12    12    11    11    11 
## 
## $CD8
## .
##  0.2  0.3  0.4  0.1 0.06 0.04 0.44 0.08 0.31 0.03 0.05 0.07 
##  206  132   78   74   23   22   20   18   16   14   12   11 
## 
## $Ly6C
## .
##   0.3   0.1   0.2   0.4  0.04  0.06  0.03  0.19  0.38 0.019  0.27  0.31  0.44 
##   174   127    89    89    27    23    19    14    12    11    11    11    11 
## 
## $CD4
## .
##  0.2  0.1  0.3  0.4 0.07 0.06 0.04 0.31 0.03 
##  190  103   76   44   14   13   12   12   11 
## 
## $CD11b
## .
##  0.3  0.2  0.1  0.4 0.05 0.07 0.03 0.08 0.29 0.11 0.04 0.44 
##  173  172   59   54   20   17   16   16   15   12   11   11 
## 
## $CD27
## .
##  0.2  0.3 0.05  0.1  0.4 0.03 0.09 
##   65   36   23   23   20   18   12 
## 
## $CD16_32
## .
##  0.2  0.1  0.3  0.4 0.04 0.03 0.29 0.07 0.06 
##  190  167   82   63   23   22   14   13   12 
## 
## $SiglecF
## .
##   0.2   0.1   0.3  0.02  0.03  0.06  0.27 0.018  1.23 0.019  0.04  0.22  0.07  0.08  0.33 
##   213   185   102    39    28    17    15    14    13    12    12    12    11    11    11 
## 
## $Foxp3
## .
##   0.1   0.2   0.3  0.04  0.02  0.03  0.06  0.19   0.4 0.018  0.38   1.1 0.019 
##   232   185    69    31    29    24    23    16    15    12    12    12    11 
## 
## $B220
## .
## 0.3 0.2 
##  22  12 
## 
## $CD5
## .
##   0.2   0.3   0.1   0.4  0.06  0.04  0.05  0.03  0.02  0.29  0.09 0.024  0.44 
##   139   115    49    44    18    17    17    15    14    14    13    12    11 
## 
## $FceR1a
## .
##   0.2   0.1   0.3  0.03  0.04  0.02  0.05   1.2 0.019  0.17  0.07  0.08  0.38 
##   253   191    57    28    24    19    17    16    13    12    11    11    11 
## 
## $TCRgd
## .
##   0.2   0.1   0.3  0.03  0.05   0.4  0.04  0.07  0.22  0.44 0.019  0.02  0.31  1.23 0.021  0.09   1.1 
##   184   175   128    23    22    22    21    20    15    13    12    12    12    12    11    11    11 
## 
## $CCR7
## .
##  0.2  0.3  0.1  0.4 0.03 0.06 0.02 0.04 0.05 0.19 0.31  1.4 0.33 0.38 0.44  1.3 
##  177  160  123   41   21   21   17   17   13   12   12   12   11   11   11   11 
## 
## $Sca1
## .
##  0.2  0.3  0.1  0.4 0.06 0.03 0.04 0.05 0.07 0.08 0.29 
##  194  146  100   63   22   20   16   16   14   12   11 
## 
## $CD49b
## .
##  0.2  0.1  0.3  0.4 0.04 0.02 0.03 0.44 0.05 0.08 
##  201  185   87   38   27   24   22   19   13   12 
## 
## $cKit
## .
##   0.2   0.1   0.3   0.4  0.03  0.04  0.05  0.02 0.019  0.38  0.06  0.19  0.07  1.23 
##   193   178   105    32    25    25    25    19    15    15    12    12    11    11 
## 
## $CD150
## .
##  0.2  0.3  0.1  0.4 0.03 0.06 0.04 0.02 0.08 0.11 0.44 
##  177  165  102   55   23   21   18   15   15   11   11 
## 
## $CD25
## .
##  0.1  0.2  0.3 0.03  0.4 0.04 0.02 0.06 0.07  1.2 0.08 0.05 
##  170  164   87   30   28   22   17   14   14   14   12   11 
## 
## $TCRb
## .
##  0.2  0.1  0.3  0.4 0.06 0.05 0.03 0.02 0.04 0.07 0.41 
##  162  105   84   47   22   19   18   13   12   11   11 
## 
## $CD43
## .
##   0.2   0.3   0.1  0.03  0.05   0.4  0.04  0.33 0.018  0.09  0.08 
##   122    93    54    18    17    17    14    13    12    12    11 
## 
## $CD64
## .
##   0.2   0.1   0.3  0.04  0.03  0.07  0.05 0.019 0.021  0.21  0.28   0.4  0.02  0.18  0.19  0.31  1.23 
##   207   200    87    30    25    16    15    14    14    14    14    14    12    11    11    11    11 
##   1.4 
##    11 
## 
## $CD138
## .
##   0.2   0.1   0.3  0.02  0.03   0.4  0.04  0.06  0.19  0.08  0.18 0.019  0.05 
##   215   170    93    30    29    22    20    16    14    13    13    12    12 
## 
## $CD103
## .
##   0.2   0.1   0.3   0.4  0.04  0.02  0.03  0.06  0.05  0.27 0.022  0.41   1.1 
##   190   176   106    33    23    21    21    20    14    13    12    11    11 
## 
## $IgM
## .
##  0.3  0.2  0.1 0.05 
##   42   39   31   11 
## 
## $CD44
## .
##  0.2  0.3 0.05  0.4 0.09  0.1 
##   25   24   14   13   12   11 
## 
## $MHCII
## .
##  0.2  0.3  0.1 0.03 0.04 0.05  0.4 0.08 0.02 0.06 
##  233  162   64   21   21   16   16   13   12   12

What do you notice? The top values are disproportionately decimals rounded to the tenths place: 0.1, 0.2, 0.3, 0.4, and so forth.

Let’s put a number to this.

llm_freq_table <- llm_surface_raw %>%
  pivot_longer(everything(), names_to = "marker", values_to = "value") %>%
  filter(value >= 0, value < 1) %>%
  count(value, sort = TRUE) %>%
  mutate(pct = n / (nrow(llm_surface_raw) * ncol(llm_surface_raw)) * 100)

llm_freq_table
## # A tibble: 192 × 3
##    value     n   pct
##    <dbl> <int> <dbl>
##  1  0.2   6161 14.9 
##  2  0.1   4316 10.5 
##  3  0.3   4095  9.92
##  4  0.4   1746  4.23
##  5  0.03   725  1.76
##  6  0.04   649  1.57
##  7  0.05   536  1.30
##  8  0.06   529  1.28
##  9  0.02   484  1.17
## 10  0.08   419  1.02
## # ℹ 182 more rows

So you can see that nearly 15% of the dataset is 0.2 alone.

llm_freq_table$pct[1:4] %>% sum()
## [1] 39.54728

This means that around 39.5% of the data is just 0.1, 0.2, 0.3, or 0.4. We now check the Samusik dataset as our control.

sam[type_markers] %>%
  pivot_longer(everything(), names_to = "marker", values_to = "value") %>%
  filter(value >= 0, value < 1) %>%
  count(value, sort = TRUE) %>%
  mutate(pct = n / (nrow(llm_surface_raw) * ncol(llm_surface_raw)) * 100)
## # A tibble: 63,075 × 3
##     value     n     pct
##     <dbl> <int>   <dbl>
##  1 0.0728     2 0.00485
##  2 0.134      2 0.00485
##  3 0.209      2 0.00485
##  4 0.219      2 0.00485
##  5 0.258      2 0.00485
##  6 0.275      2 0.00485
##  7 0.277      2 0.00485
##  8 0.290      2 0.00485
##  9 0.310      2 0.00485
## 10 0.322      2 0.00485
## # ℹ 63,065 more rows

The frequency table does not show anything rounded to the tenths place. All of this is very telling in terms of which tokens the LLM is choosing. There seems to be a heuristic of “a tenths-place decimal of 0.5 or below” as a “near zero” value, one that has nothing to do with what CyTOF data look like.
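One way to put a single number on that heuristic is to measure what fraction of the sub-1 values sit exactly on a tenths-place grid. A minimal sketch in base R (the toy vectors and the tolerance are illustrative assumptions, not values from the datasets above):

```r
# Share of values in [0, 1) that are exact multiples of 0.1
# (0.1, 0.2, ...), up to a small floating-point tolerance.
tenths_share <- function(x, tol = 1e-8) {
  x <- x[x >= 0 & x < 1]
  mean(abs(x * 10 - round(x * 10)) < tol)
}

# Toy illustration: LLM-like values pile up on tenths multiples,
# while real CyTOF-like values essentially never do.
llm_like  <- c(0.2, 0.2, 0.1, 0.3, 0.4, 0.07, 0.2)
real_like <- c(0.0728, 0.134, 0.209, 0.219, 0.258)

tenths_share(llm_like)   # high: most of these sit on the grid
tenths_share(real_like)  # zero here: none sit on the grid
```

Run against `llm_surface_raw` and the Samusik control, this would turn the frequency-table eyeballing above into a one-number discreteness summary per dataset.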

And just to hammer the point home, we look at just 0.1 and 0.2.

llm_freq_table$pct[1:2] %>% sum()
## [1] 25.3914

This shows that a quarter of the LLM dataset’s “type” marker data are just 0.1 and 0.2, a very strong bias in which tokens the LLM chooses to represent “near zero.”
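Another way to see the same token bias is to count distinct values: the LLM frequency table above collapses to only 192 distinct sub-1 values, while the Samusik control has 63,075. A toy sketch of that discreteness check (the two vectors are illustrative stand-ins, not the real data):

```r
set.seed(1)

# Ratio of distinct values to total values: near 1 for a continuous
# measurement process, tiny for token-quantized output.
distinct_ratio <- function(x) length(unique(x)) / length(x)

llm_like  <- sample(c(0.1, 0.2, 0.3, 0.4), 1000, replace = TRUE)
real_like <- runif(1000)  # continuous, so virtually every value is unique

distinct_ratio(llm_like)   # collapses onto 4 tokens -> 0.004
distinct_ratio(real_like)  # ~1: essentially all values unique
```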

0.7 Discussion

Here, we find that the LLM-generated data were not in line with the actual Samusik data. You can see in the UMAP that the LLM-generated data primarily formed a separate island. This suggests that not only did the LLM not “rip” the data, it also could not approximate the data convincingly with the prompt at hand.

We note that an agentic system like Claude Code would likely generate a script that literally simulates data from the distributions of the Samusik dataset, which it would download. But here, we were concerned primarily with the question of to what extent the Samusik dataset was sitting in the LLM’s training data.

A follow-up experiment prompted Claude Sonnet to simulate 100 cells rather than 1 cell. When this happened, it ran a Python script under the hood rather than pulling from its own latent space. Those doing “batch simulation” follow-ups should keep this in mind and engineer the prompt so that this does not happen. Note that if you’re using the likes of OpenRouter, this might be invisible to you; in fact, it is possible that this took place in some of the runs above. But when I tested the “1-cell run” prompt on Claude Sonnet in Anthropic’s UI, it did not run a script. It pulled from its “training data.”

We found a strange relationship between mean and standard deviation in the simulated data: an upside-down parabola. Upon running some simulations, we determined that this parabola arises from a Bernoulli process, whereby you sample on/off rather than sampling from a bimodal distribution. This suggests that Claude is not sampling from a distribution, but rather thinking in terms of marker-positive versus marker-negative.
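To see why an on/off process gives that shape: if a marker is positive with probability p at some fixed level A and zero otherwise, the mean is pA while the standard deviation is A·sqrt(p(1 − p)), which peaks at p = 0.5 and falls to zero at both extremes. A minimal simulation of this Bernoulli model (the level `A`, the grid of p values, and the sample size are all illustrative assumptions):

```r
set.seed(42)

# Bernoulli ("marker-positive or marker-negative") model of expression:
# each cell is positive with probability p at fixed level A, else zero.
A  <- 4
ps <- seq(0.05, 0.95, by = 0.05)

sim <- t(sapply(ps, function(p) {
  x <- A * rbinom(10000, size = 1, prob = p)
  c(mean = mean(x), sd = sd(x))
}))

# Plotting sd against mean traces the upside-down parabola:
# sd is largest at mid-range means (p near 0.5) and shrinks toward
# both ends, unlike draws from a genuinely bimodal distribution.
plot(sim[, "mean"], sim[, "sd"], xlab = "mean", ylab = "sd")
```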

Furthermore, the untransformed data it returned seemed to follow bounded distributions (e.g. between 1000 and 4000), consistent with this positive-vs-negative model. Strikingly, the untransformed data also revealed that the marker-negative values were disproportionately represented by 0.1, 0.2, 0.3, and 0.4, but especially 0.1 and 0.2, which alone accounted for 25% of the data.

Additional prompt engineering may well lead to better-approximated data. This first pass was an attempt to see what would happen with minimal instructions beyond “simulate a cell.” But if we were more explicit about what it means to simulate a cell (pull from the marker distributions that are expected given your training data around all things CyTOF…), then maybe we would get more convincing simulated data (by this metric, data that line up with the Samusik dataset on the UMAP above, though there are more precise metrics we can use down the line).

We also note that the Samusik dataset is a hard dataset to start with. Perhaps taking a step back and trying PBMCs might lead to more success, as there is likely more PBMC data in the LLM’s training data than mouse bone marrow. But this is yet to be determined.