Imagine a universe entirely without structure, without shape, without connections. A cloud of microscopic events, like fragments of space-time … except that there is no space or time. What characterizes one point in space, for one instant? Just the values of the fundamental particle fields, just a handful of numbers. Now, take away all notions of position, arrangement, order, and what’s left? A cloud of random numbers.
But if the pattern that is me could pick itself out from all the other events taking place on this planet, why shouldn’t the pattern we think of as ‘the universe’ assemble itself, find itself, in exactly the same way? If I can piece together my own coherent space and time from data scattered so widely that it might as well be part of some giant cloud of random numbers, then what makes you think that you’re not doing the very same thing?
Greg Egan, Permutation City
Here, we are going to take the Samusik dataset and the LLM-generated data, merge them together, process them, and make our UMAP. We will then color by which dataset each cell came from.
So we can be clear about how the LLM-generated data came about, we will be explicit about how it was done. The LLM used was Claude Sonnet 4.6, a frontier model at the time of the experiment. It was accessed through OpenRouter, where the prompt asking it to simulate one cell from the Samusik dataset was run repeatedly, a set number of times.
Below is the script (which was not run directly as part of this markdown). Note that in order to use this script, I set up a “chatbot” tool that is accessible through the command line. I show you how to do this here.
num_runs=1000; # However many you want
prompt="Please simulate one cell in the flagship Samusik CyTOF dataset. Make the data raw. Use floating points as you'd do in a fcs file. I'll do the asinh transform later. I would like you to output as follows.
Here are the names of the parameters: Time, Cell_length, BC1, BC2, BC3, BC4, BC5, BC6, Ter119, CD45.2, Ly6G, IgD, CD11c, F480, CD3, NKp46, CD23, CD34, CD115, CD19, 120g8, CD8, Ly6C, CD4, CD11b, CD27, CD16_32, SiglecF, Foxp3, B220, CD5, FceR1a, TCRgd, CCR7, Sca1, CD49b, cKit, CD150, CD25, TCRb, CD43, CD64, CD138, CD103, IgM, CD44, MHCII, DNA1, DNA2, Cisplatin, beadDist
You will create the numbers for these parameters. Your output will just be the numbers comma separated. No names, but the nmbers will be outputted in the order of the markers that we specified above, from Time to beadDist.
Just output this. No other commentary";
for i in $(seq 1 $num_runs);
do
chatbot "claude" "$prompt" >> output.txt;
done
library(tidyverse)
suppressPackageStartupMessages(library(here))
setwd(here("..", "data"))
num_cells <- 10000
sam <- readr::read_rds("samusik_01.rds") %>% as_tibble()
sam <- sam[sample(nrow(sam), num_cells),]
sam$dataset <- "samusik"
Now we read in our LLM results.
setwd(here("..", "local_data"))
# Read in the file
llm <- readr::read_lines("output.txt")
# Tidy
llm <- lapply(llm, function(i) stringr::str_split(i, ",")[[1]] %>% as.numeric()) %>%
do.call(rbind, .) %>%
as_tibble()
names(llm) <- names(sam)
llm$dataset <- "llm"
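As an aside, the tidying step above can be sketched outside R as well. Below is a minimal Python version (hypothetical: it assumes one comma-separated cell per line and simply skips any line that fails to parse, such as stray model commentary, whereas the R code above would produce NAs for such lines):

```python
def parse_llm_lines(lines):
    """Parse LLM output: one comma-separated row of floats per line.
    Lines that fail to parse entirely are skipped."""
    rows = []
    for line in lines:
        try:
            rows.append([float(x) for x in line.strip().split(",")])
        except ValueError:
            continue  # skip malformed lines (e.g. model commentary)
    return rows

# Tiny example with one malformed line in the middle
rows = parse_llm_lines(["1.0, 2.5, 0.2", "not a cell", "3.0, 0.1, 0.4"])
# rows is [[1.0, 2.5, 0.2], [3.0, 0.1, 0.4]]
```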
And now for the mishmash
cells <- bind_rows(sam, llm)
cells
## # A tibble: 11,058 × 52
## Time Cell_length BC1 BC2 BC3 BC4 BC5 BC6 Ter119 CD45.2 Ly6G IgD CD11c F480
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10918876 43 173. 200. 210. 12.0 6.56 2.57 0.903 4.37 1.10 -0.270 -0.258 1.32
## 2 13201089 21 504. 281. 337. 5.03 7.93 5.85 -0.597 23.3 2.30 -0.433 0.856 9.22
## 3 6335806 17 241. 190. 297. 16.1 4.76 3.75 -0.655 2.89 0.300 1.57 -0.432 0.342
## 4 14895295 19 152. 91.9 122. 6.80 0.717 -0.306 -0.378 13.5 -0.168 64.6 -0.0998 -0.548
## 5 14834515 18 331. 251. 440. 9.18 2.64 0.465 -0.184 8.87 1.01 -0.266 -0.682 9.20
## 6 563248 19 243. 196. 186. 2.23 3.36 0.133 -0.367 25.2 0.686 -0.207 -0.210 5.14
## 7 9141112 24 100. 118. 169. 17.1 1.12 0.271 -0.0644 7.10 0.421 0.536 -0.233 5.94
## 8 13914301 18 228. 214. 308. 23.2 -0.0301 3.36 2.41 12.6 2.36 -0.395 -0.648 2.39
## 9 3131507 25 226. 170. 233. 8.45 7.82 10.2 -0.652 27.2 2.03 -0.315 1.27 42.4
## 10 8605530 27 275. 178. 316. 25.5 8.82 2.98 -0.433 12.9 0.882 -0.333 -0.310 -0.643
## # ℹ 11,048 more rows
## # ℹ 38 more variables: CD3 <dbl>, NKp46 <dbl>, CD23 <dbl>, CD34 <dbl>, CD115 <dbl>, CD19 <dbl>,
## # `120g8` <dbl>, CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>, CD16_32 <dbl>,
## # SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>, CCR7 <dbl>, Sca1 <dbl>,
## # CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>, CD64 <dbl>, CD138 <dbl>,
## # CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>, DNA1 <dbl>, DNA2 <dbl>, Cisplatin <dbl>,
## # beadDist <dbl>, dataset <chr>
From here, we do our pre-processing. Do we remember what the surface markers are for the Samusik dataset?
setwd(here("..", "local_data"))
md <- readr::read_rds("marker_types.rds")
type_markers <- dplyr::filter(md, marker_class == "type")$marker_name
surface <- cells[type_markers]
surface <- asinh(surface/5) %>% as_tibble()
surface
## # A tibble: 11,058 × 39
## Ter119 CD45.2 Ly6G IgD CD11c F480 CD3 NKp46 CD23 CD34 CD115 CD19
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.180 0.789 0.218 -0.0539 -0.0517 0.262 -0.120 -0.148 0.00429 0.946 -0.00555 -0.101
## 2 -0.119 2.24 0.446 -0.0865 0.170 1.37 -0.0261 -0.0619 -0.0452 0.337 -0.0923 -0.0732
## 3 -0.131 0.551 0.0600 0.310 -0.0862 0.0683 0.340 0.00179 -0.0294 1.45 0.731 -0.0558
## 4 -0.0756 1.72 -0.0336 3.25 -0.0200 -0.109 -0.155 0.115 0.227 -0.00814 -0.0119 1.71
## 5 -0.0367 1.34 0.200 -0.0532 -0.136 1.37 0.0372 0.730 0.0470 0.845 1.81 -0.150
## 6 -0.0734 2.32 0.137 -0.0413 -0.0421 0.901 0.0716 -0.137 -0.112 1.37 1.77 0.139
## 7 -0.0129 1.15 0.0840 0.107 -0.0466 1.01 0.119 -0.157 -0.0433 -0.108 0.387 -0.0885
## 8 0.464 1.65 0.457 -0.0790 -0.129 0.462 -0.0627 -0.140 -0.00825 0.399 -0.0699 0.0409
## 9 -0.130 2.40 0.395 -0.0630 0.251 2.83 0.302 1.16 -0.0799 0.306 0.594 0.0980
## 10 -0.0865 1.68 0.175 -0.0666 -0.0619 -0.128 -0.0132 -0.0161 -0.0732 1.58 0.536 -0.0219
## # ℹ 11,048 more rows
## # ℹ 27 more variables: `120g8` <dbl>, CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>,
## # CD16_32 <dbl>, SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>,
## # CCR7 <dbl>, Sca1 <dbl>, CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>,
## # CD64 <dbl>, CD138 <dbl>, CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>
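A note on the transform we just applied: asinh(x/5), with cofactor 5 being the CyTOF convention, behaves roughly linearly near zero and logarithmically for large counts, which is why raw values in the thousands end up in the 6 to 8 range post-transform. A quick illustration in Python (a sketch of the math, not of any package's API):

```python
import math

def cytof_transform(x, cofactor=5):
    """The standard CyTOF transform: asinh(x / cofactor)."""
    return math.asinh(x / cofactor)

near_zero = cytof_transform(0.5)   # roughly linear here: ~ 0.5/5 = 0.1
large = cytof_transform(1000)      # roughly log(2*1000/5) = log(400), ~ 5.99
```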
And from here, we are going to make a UMAP.
library(umap)
dimr <- umap(surface)$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")
And from here, we are going to plot it as such:
library(ggplot2)
ggplot(dimr, aes(x = umap1, y = umap2, color = cells$dataset)) +
geom_point()
We are now going to go a bit more in depth and determine whether the “island” of LLM-generated cells has any sort of biological plausibility at all. To this end, we are going to start with some simple plots, per marker.
ggplot(surface, aes(x = CD45.2, color = cells$dataset)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is already a problem. The LLM data has CD45 being very high. Let’s look at a biaxial now. A standard one. CD3 by CD11b.
ggplot(surface, aes(x = CD3, y = CD11b, color = cells$dataset)) + geom_point()
We can see that this is absolute garbage. So this suggests the data that was pulled out of the LLM makes absolutely no sense. Let’s try another one to see if the pattern holds.
ggplot(surface, aes(x = CD4, y = CD8, color = cells$dataset)) + geom_point()
Yes, the pattern still holds: things that make no sense, like CD4+ CD8+ cells. What this might suggest is a sort of next-token prediction taken to the extreme, where the model was just outputting values more or less at random.
We note that both the value distributions and the combinatorics of the value ranges made no sense: the distributions were much tighter, and there were marker combinations that are biologically implausible.
Let’s do some more summary statistics. We just said that the value distributions are tighter. We can look at that a bit more closely. First, we will split our processed surface marker data.
surface_sam <- dplyr::filter(surface, cells$dataset == "samusik")
surface_llm <- dplyr::filter(surface, cells$dataset == "llm")
From here, we will run our summary statistics.
apply(surface_sam, 2, summary)
## Ter119 CD45.2 Ly6G IgD CD11c F480 CD3 NKp46
## Min. -0.19016366 0.1373186 -0.187688925 -0.18889759 -0.19011366 -0.189593166 -0.18824457 -0.18799585
## 1st Qu. -0.11603891 0.8751696 -0.006833073 -0.09822101 -0.10952191 -0.005818719 -0.10232228 -0.09045308
## Median -0.06781758 1.4206448 0.204757288 -0.03072763 -0.05042880 0.329297839 -0.03403289 -0.01504077
## Mean -0.02420201 1.4536391 0.310679707 0.32568406 0.06942321 0.636488142 0.16442833 0.12334237
## 3rd Qu. -0.01606623 1.9679153 0.535478280 0.17221381 0.05045427 1.018653927 0.12733671 0.19236035
## Max. 1.40240393 4.4354540 2.633664261 4.47037564 4.29823038 5.992107183 4.27247637 2.38599891
## CD23 CD34 CD115 CD19 120g8 CD8 Ly6C CD4
## Min. -0.18988526 -0.18841122 -0.18740831 -0.18903033 -0.18780069 -0.18904360 -0.18710816 -0.19060301
## 1st Qu. -0.11179613 -0.04098163 0.03501379 -0.09769185 -0.10470486 -0.10284890 0.06080606 -0.06966148
## Median -0.05482977 0.15893521 0.32545574 -0.02172942 -0.04220942 -0.03813746 1.98258249 0.07327919
## Mean 0.01912709 0.46696830 0.49978284 0.21548033 0.17983092 0.13230566 1.98359971 0.36153713
## 3rd Qu. 0.01786339 0.64528997 0.81276852 0.27012358 0.09609460 0.11658728 3.58530069 0.46398813
## Max. 2.48313964 4.79682270 3.81789678 3.14682837 5.17723352 5.02384261 6.40423000 6.11056252
## CD11b CD27 CD16_32 SiglecF Foxp3 B220 CD5 FceR1a
## Min. -0.18787495 -0.1894748 -0.1841988 -0.18623274 -0.18983913 -0.18860330 -0.18764791 -0.1891613
## 1st Qu. 0.03832147 -0.0622791 1.6342462 0.03510659 -0.08682806 -0.07000977 0.03637336 -0.0720568
## Median 1.39813860 0.1065137 2.8887704 0.37118737 -0.00253725 0.07642104 0.29984983 0.0520543
## Mean 1.80935542 0.4823741 2.6192100 0.67652048 0.11904613 0.82988936 0.56677586 0.1741945
## 3rd Qu. 3.47878162 0.5780958 3.6649569 0.98288575 0.20543632 1.26447642 0.76579921 0.2894772
## Max. 6.03959576 4.9172335 6.7695805 5.31040054 2.67291918 5.30962116 6.42188386 3.6151468
## TCRgd CCR7 Sca1 CD49b cKit CD150 CD25 TCRb
## Min. -0.190200662 -0.18927588 -0.18829775 -0.19064167 -0.189021321 -0.18877257 -0.18903614 -0.19083474
## 1st Qu. -0.111388009 -0.07606703 -0.03771335 -0.10166505 -0.086742817 -0.09849456 -0.10466090 -0.09806690
## Median -0.056599576 0.03788346 0.16816522 -0.03532581 -0.009417906 -0.02771282 -0.03735943 -0.03026985
## Mean -0.002960151 0.16375119 0.79517759 0.18190695 0.403733439 0.07549126 0.04755663 0.14730894
## 3rd Qu. 0.001927813 0.27507162 0.82555206 0.14380906 0.380707161 0.13319113 0.09678068 0.12761044
## Max. 2.273246265 2.64955647 7.12668585 4.72589475 5.616765855 2.57341739 4.94917915 4.56986693
## CD43 CD64 CD138 CD103 IgM CD44 MHCII
## Min. -0.18834818 -0.18854107 -0.18619332 -0.18792006 -0.18889136 -0.1817913 -0.1856824
## 1st Qu. 0.07755234 0.08630538 -0.07793945 -0.10672799 -0.04143494 2.1828865 0.3753040
## Median 0.69137412 0.48202782 0.02170501 -0.04552825 0.15510626 3.7133102 0.9573889
## Mean 1.17521918 0.72992300 0.14329976 0.01775892 0.80894544 3.2423301 1.6343682
## 3rd Qu. 2.07871909 1.15570223 0.24189563 0.05656577 0.84892095 4.3657561 2.1325015
## Max. 5.87229564 5.70470558 3.27610534 3.44751803 8.15718185 6.8124887 7.6840793
And now we will do the same on the LLM data.
apply(surface_llm, 2, summary)
## Ter119 CD45.2 Ly6G IgD CD11c F480 CD3 NKp46
## Min. -0.24358407 5.043725 -0.15142070 -0.20848635 -0.20261092 -0.20065089 -0.18494388 -0.16721960
## 1st Qu. 0.05996406 6.842333 0.03998934 0.01999867 0.03998934 0.03449315 0.01999867 0.01999867
## Median 0.07991491 7.048010 0.05996406 0.03998934 0.07991491 0.05996406 0.03998934 0.05397379
## Mean 0.10579625 7.056548 0.08124992 0.09519684 0.11134510 0.10511200 0.74695211 0.08275429
## 3rd Qu. 0.08788682 7.167594 0.07592703 0.05796753 0.09983408 0.07991491 0.06595218 0.05996406
## Max. 1.83289343 8.906865 1.54934538 7.05321994 2.24975640 7.05313694 7.59835948 1.54853220
## CD23 CD34 CD115 CD19 120g8 CD8 Ly6C CD4
## Min. -0.18494388 -0.18494388 -0.18100993 0.000000 -0.18100993 -0.17510381 -0.17707320 -0.13756570
## 1st Qu. 0.01999867 0.01999867 0.01999867 6.476020 0.01999867 0.01999867 0.01999867 0.03998934
## Median 0.03998934 0.03998934 0.05996406 6.912342 0.03998934 0.04598379 0.05996406 0.05996406
## Mean 0.20607827 0.07109850 0.08853507 5.396984 0.10447629 0.45501336 0.40201223 2.32965535
## 3rd Qu. 0.07991491 0.06595218 0.07991491 7.130835 0.06595218 0.07991491 0.07991491 6.62082137
## Max. 7.22165227 1.21128984 1.77453898 8.894791 7.05317152 7.54579164 7.15067070 8.04832054
## CD11b CD27 CD16_32 SiglecF Foxp3 B220 CD5 FceR1a
## Min. -0.18690981 -0.14944312 -0.18494388 -0.17904190 -0.18297726 0.000000 -0.17116302 -0.15734989
## 1st Qu. 0.02999550 0.04598379 0.01999867 0.01999867 0.01999867 6.701628 0.03799086 0.01999867
## Median 0.05996406 6.10786936 0.03998934 0.03998934 0.03998934 7.119455 0.05996406 0.03998934
## Mean 0.52548637 3.68449444 0.30579851 0.07359914 0.13550081 6.128205 1.89078788 0.08089356
## 3rd Qu. 0.10132649 6.99729787 0.07991491 0.05996406 0.05996406 7.231808 6.19923476 0.05996406
## Max. 7.34637185 8.68753666 7.35284143 1.32688962 7.52259520 8.978333 7.53941024 6.62087466
## TCRgd CCR7 Sca1 CD49b cKit CD150 CD25 TCRb
## Min. -0.18297726 -0.19083950 -0.16524692 -0.17707320 -0.18100993 -0.15142070 -0.18690981 -0.16721960
## 1st Qu. 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867
## Median 0.03998934 0.03998934 0.04008926 0.03998934 0.03998934 0.04398582 0.03998934 0.05996406
## Mean 0.08751204 0.14680016 0.24460847 0.08887742 0.08062664 0.16013123 0.63088778 1.70828019
## 3rd Qu. 0.06196035 0.07991491 0.07991491 0.07941646 0.06595218 0.07991491 0.08340324 0.40391399
## Max. 7.08279442 7.32972389 7.43474432 6.84242825 1.46402005 7.38365935 7.45447025 8.68154088
## CD43 CD64 CD138 CD103 IgM CD44 MHCII
## Min. -0.18690981 -0.17313375 -0.17510381 -0.18100993 0.00000000 0.000000 -0.16129965
## 1st Qu. 0.03998934 0.01999867 0.01999867 0.01999867 0.07592703 4.829017 0.03799086
## Median 0.06595218 0.03998934 0.03998934 0.03998934 6.60762156 6.909876 0.05597077
## Mean 1.88886045 0.07659209 0.10106515 0.08128831 4.19015198 5.338956 1.02714338
## 3rd Qu. 6.10786268 0.05996406 0.06146127 0.07592703 7.05397007 7.181547 0.12517282
## Max. 7.56270230 1.32688962 6.77933212 1.43194709 8.89478545 8.752822 8.53646238
We can at least see that the LLM data does indeed include negative values. This is otherwise quite a lot to take in, so we are going to make it a bit easier: we are going to plot the mean versus the standard deviation, marker by marker. We will then color by dataset.
msd_sam <- apply(surface_sam, 2, function(i) {
curr_mean <- mean(i)
curr_sd <- sd(i)
return(c(mean = curr_mean, sd = curr_sd))
}) %>% t() %>% as_tibble()
msd_llm <- apply(surface_llm, 2, function(i) {
curr_mean <- mean(i)
curr_sd <- sd(i)
return(c(mean = curr_mean, sd = curr_sd))
}) %>% t() %>% as_tibble()
# what it looks like
msd_sam[1:5,]
## # A tibble: 5 × 2
## mean sd
## <dbl> <dbl>
## 1 -0.0242 0.179
## 2 1.45 0.743
## 3 0.311 0.402
## 4 0.326 0.872
## 5 0.0694 0.378
Now we combine the data and label for our plot.
msd_sam$dataset <- "samusik"
msd_llm$dataset <- "llm"
msd <- dplyr::bind_rows(msd_sam, msd_llm)
msd[1:5,]
## # A tibble: 5 × 3
## mean sd dataset
## <dbl> <dbl> <chr>
## 1 -0.0242 0.179 samusik
## 2 1.45 0.743 samusik
## 3 0.311 0.402 samusik
## 4 0.326 0.872 samusik
## 5 0.0694 0.378 samusik
And from here, we make our plot.
ggplot(msd, aes(x = mean, y = sd, color = dataset)) + geom_point(aes(size = 3))
And we have just discovered something very, very strange. The mean and sd of the markers form a parabola. This is actually very weird. I really did not expect this. What might it mean?
The first thing we are going to do, in order to investigate this further, is to fit an equation to the curve.
fit <- lm(sd ~ mean + I(mean^2), data = msd_llm)
coef(fit) # c (Intercept), b (x), a (I(x^2))
## (Intercept) mean I(mean^2)
## 0.2216885 2.0294554 -0.2843874
c0 <- coef(fit)[1]
b <- coef(fit)[2]
a <- coef(fit)[3]
sprintf("y = %.4f x^2 + %.4f x + %.4f", a, b, c0)
## [1] "y = -0.2844 x^2 + 2.0295 x + 0.2217"
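For what it’s worth, the same quadratic fit is easy to reproduce outside of lm. Here is a sketch using numpy (assuming numpy is available; the mean/sd arrays below are synthetic points generated from the fitted coefficients above, just to show that polyfit recovers them):

```python
import numpy as np

# Synthetic (mean, sd) points lying exactly on the curve fit in the text
mean = np.array([0.1, 1.0, 2.0, 3.5, 5.0, 6.5, 7.0])
sd = 0.2217 + 2.0295 * mean - 0.2844 * mean ** 2

# np.polyfit returns coefficients highest degree first: [a, b, c]
a, b, c = np.polyfit(mean, sd, 2)
# a ~ -0.2844, b ~ 2.0295, c ~ 0.2217
```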
Now we overlay the fitted curve on the plot:
ggplot(msd, aes(x = mean, y = sd, color = dataset)) +
geom_point(aes(size = 3)) + stat_function(fun = function(x) a*x^2 + b*x + c0, color = "black", linetype = "dashed")
Not perfect. The one we have is a bit shorter and wider. This warrants further investigation, but this will do for now. We’re moving to the outer edge of my wheelhouse now, but this might come about as a result of very tightly bounded mins and maxes in the sampling. To quote ChatGPT:
A very tight upside-down parabola in (mean, sd) across features usually means each marker is behaving like a bounded on/off variable with (almost) the same min and max across markers.
This is sort of what we saw in the biaxials. So here’s what we can do: the same exercise as the mean-sd plot, but with the min vs max per marker, followed by a biaxial of min vs max.
mm_sam <- apply(surface_sam, 2, function(i) {
curr_min <- min(i)
curr_max <- max(i)
return(c(min = curr_min, max = curr_max))
}) %>% t() %>% as_tibble()
mm_llm <- apply(surface_llm, 2, function(i) {
curr_min <- min(i)
curr_max <- max(i)
return(c(min = curr_min, max = curr_max))
}) %>% t() %>% as_tibble()
# what it looks like
mm_sam[1:5,]
## # A tibble: 5 × 2
## min max
## <dbl> <dbl>
## 1 -0.190 1.40
## 2 0.137 4.44
## 3 -0.188 2.63
## 4 -0.189 4.47
## 5 -0.190 4.30
And from here, we combine the data and make our plots.
mm_sam$dataset <- "samusik"
mm_llm$dataset <- "llm"
mm <- dplyr::bind_rows(mm_sam, mm_llm)
mm[1:5,]
## # A tibble: 5 × 3
## min max dataset
## <dbl> <dbl> <chr>
## 1 -0.190 1.40 samusik
## 2 0.137 4.44 samusik
## 3 -0.188 2.63 samusik
## 4 -0.189 4.47 samusik
## 5 -0.190 4.30 samusik
And now we plot:
ggplot(mm, aes(x = min, y = max, color = dataset)) + geom_point(aes(size = 3))
So this tells us that the max values of the LLM dataset are more tightly bounded (from both above and below) than those of the Samusik dataset. I think the point in the upper right of the LLM dataset is CD45, whose histogram we saw earlier.
So now we are going to attempt to reproduce the upside-down parabola. After doing some visualization, it seems that if we have a bimodal distribution that is very tight around two numbers (e.g. 0 and 7), then the mean and the standard deviation of that distribution might be tightly bound to the parabola: a mean midway between the two peaks would have the maximum standard deviation (incorporating both peaks equally), while a mean close to one side would have a minimum standard deviation, closest to that of the single peak, because most of the data points would sit on that peak and only a few on the other.
Let’s just run a simulation to put this into pictures.
make_bimodal <- function(p, sd_tight = 0.001) {
side <- sample(c(0, 1), 1, prob = c(1 - p, p))
if(side == 0) {
result <- rnorm(1, mean = 0, sd = sd_tight)
} else {
result <- rnorm(1, mean = 7, sd = sd_tight)
}
return(result)
}
probs <- runif(39, min = 0, max = 1)
dists <- lapply(seq(1000), function(i) {
result <- sapply(probs, function(j) {
make_bimodal(j)
})
return(result)
})
dists[1:3]
## [[1]]
## [1] 7.000156e+00 6.998986e+00 6.999212e+00 6.999520e+00 -3.621034e-04 7.000346e+00 -3.403421e-04
## [8] 6.997734e+00 -9.991783e-04 2.697323e-03 7.000889e+00 7.001053e+00 3.316079e-04 3.461340e-04
## [15] 7.000093e+00 6.999955e+00 6.216383e-04 6.999729e+00 1.217954e-03 6.999986e+00 6.998312e+00
## [22] 7.000178e+00 7.000536e+00 -3.131586e-04 -9.640669e-04 -3.844386e-04 7.000438e+00 7.441518e-04
## [29] -2.396784e-04 8.401598e-05 -4.203988e-04 4.841057e-04 -9.310505e-04 6.999681e+00 3.635099e-04
## [36] 7.000485e+00 7.956802e-04 7.000368e+00 7.001714e+00
##
## [[2]]
## [1] 5.053340e-04 6.999675e+00 -9.465689e-04 6.999416e+00 -1.525179e-03 6.999398e+00 -1.419020e-03
## [8] -6.276005e-05 -9.035718e-04 7.346070e-04 -1.257077e-03 9.564911e-04 6.999904e+00 -5.628557e-04
## [15] -1.644493e-03 -8.551046e-04 7.000756e+00 6.999889e+00 6.999604e+00 6.999333e+00 4.968569e-04
## [22] 6.999753e+00 7.000967e+00 3.306486e-04 -2.068207e-03 7.000613e+00 3.798048e-04 -3.449821e-04
## [29] -1.474497e-03 -4.261783e-04 3.233054e-04 -7.004294e-05 -3.738492e-04 7.000600e+00 -1.430327e-03
## [36] 3.961244e-04 6.999576e+00 7.000288e+00 -6.058894e-04
##
## [[3]]
## [1] -1.866398e-03 7.000273e+00 7.000619e+00 6.999274e+00 5.677327e-04 6.997856e+00 5.110735e-04
## [8] 5.418109e-04 1.737952e-03 3.033823e-04 7.000354e+00 2.023997e-03 7.000189e+00 -1.160867e-03
## [15] 6.997018e+00 6.999626e+00 7.001172e+00 7.000662e+00 6.998291e+00 7.002196e+00 -1.594206e-03
## [22] 6.999705e+00 -9.711627e-04 -7.280571e-05 7.000155e+00 7.000740e+00 7.000661e+00 1.740819e-04
## [29] -1.362673e-03 7.000483e+00 7.000479e+00 3.747400e-04 6.998916e+00 2.135212e-05 7.000594e+00
## [36] 7.001298e+00 1.055461e-03 -3.276885e-04 -7.465639e-04
And now we make our mean and standard deviation for this.
dists_mat <- do.call(rbind, dists)
dists_summary <- tibble(dist_mean = colMeans(dists_mat), dist_sd = apply(dists_mat, 2, sd))
And from here, we plot:
ggplot(dists_summary, aes(x = dist_mean, y = dist_sd)) + geom_point(aes(size = 3))
Not shown: I tried the above with various standard deviations in the make_bimodal function. What I found is that as soon as I dropped the standard deviation to effectively zero, such that it was an on/off (Bernoulli) process, I was able to replicate the parabola.
This suggests that the LLM is thinking more in terms of on/off, rather than sampling from a distribution in its training data.
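There is a closed form behind this. If a marker takes only two values, 0 (“off”) and M (“on”), with probability p of being on, then the mean is Mp and the variance is M²p(1−p), which works out to exactly mean·(M − mean): an upside-down parabola in the mean, with the sd being its square root. A quick check in Python, with M = 7 as in the simulation above:

```python
import math

M = 7  # the "on" level, matching make_bimodal above

for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    mean = M * p                # E[X] for a {0, M} on/off variable
    var = M ** 2 * p * (1 - p)  # population variance
    # the parabola: variance = mean * (M - mean)
    assert abs(var - mean * (M - mean)) < 1e-12
    sd = math.sqrt(var)         # sd peaks at mean = M / 2
```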
We will check one more thing. We looked above at post-transform distributions. Now we are going to look again at the raw data. Pre-transform. This might give us some insights as to what the LLM is doing when it is actually trying to pick values that “make sense.”
So we go back to the original datasets and have a look. We have already looked at CD45, so let’s see what that looks like raw. We go back to the “sam” and “llm” objects, which are not yet transformed.
The Samusik dataset.
ggplot(sam, aes(x = CD45.2)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The LLM dataset.
ggplot(llm, aes(x = CD45.2)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The LLM dataset does not look much like a distribution. It seems to be bounded between something like 1000 and 5000 counts (with a few exceptions). This would suggest that the LLM is thinking in “boundaries.” We also notice that these values are very high in comparison to the Samusik dataset, whose values peak in the hundreds.
Now let’s go ahead and look at the biaxials we had previously plotted. Starting with CD3 and CD11b.
ggplot(cells, aes(x = CD3, y = CD11b, color = cells$dataset)) + geom_point()
## Warning: Use of `cells$dataset` is discouraged.
## ℹ Use `dataset` instead.
Once again, we see very large values for the marker-positive populations in the LLM dataset in comparison to the Samusik dataset. CD3+ seems to be bounded between 1000 and 5000 with a rather uniform distribution. CD11b+ seems to be bounded between 1000 and 4000 with a few exceptions underneath. So this seems to suggest bounded uniform distributions.
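These raw bounds line up with the post-transform plateau we saw in the summary statistics. For large x, asinh(x/5) is approximately log(2x/5), so a raw range of 1000 to 5000 maps to roughly 6.0 to 7.6 after the transform, which is right where the LLM’s marker-positive values sat. A quick check (a sketch of the arithmetic only):

```python
import math

# Transform the raw bounds observed in the LLM biaxials
lo = math.asinh(1000 / 5)   # ~ 5.99
hi = math.asinh(5000 / 5)   # ~ 7.60

# Compare against the log approximation asinh(y) ~ log(2y) for large y
approx_lo = math.log(2 * 1000 / 5)
```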
Now we can look at CD4 and CD8 again, untransformed, and see if we see similar patterns.
ggplot(cells, aes(x = CD4, y = CD8, color = cells$dataset)) + geom_point()
## Warning: Use of `cells$dataset` is discouraged.
## ℹ Use `dataset` instead.
We have the same story with the very large marker-positive values coming out of the LLM dataset. CD4+ has rather loose bounds between 1000 and 4000. CD8+ has similar loose bounds between 1000 and 4000. While CD4’s distribution looks fairly uniform (at least eyeballing it), CD8 looks like there’s a cluster at 2000 and a cluster that starts just below 3000. This would suggest different “levels” of CD8 in the training data. Perhaps this value can change depending on different cell subsets, or datasets.
Either way, the untransformed LLM output seems to be pointing toward bounded ranges. The shapes of the distributions require further investigation.
Now, a very interesting observation can be made if we look at the negative values. As per the plots above, they appear to be clustering around zero. But let’s take a closer look.
llm_surface_raw <- llm[type_markers]
apply(llm_surface_raw[type_markers], 2, function(i) {
i[i < 5] %>% table() %>% sort(decreasing = TRUE) %>% .[. > 10]
})
## $Ter119
## .
## 0.4 0.3 0.08 1.4 0.44 0.031 0.18 0.31 0.8 1.23 1.45 0.45 0.15 0.041 0.05 0.67
## 361 113 68 50 48 36 28 28 21 21 21 19 17 16 15 15
##
## $CD45.2
## integer(0)
##
## $Ly6G
## .
## 0.3 0.2 0.38 0.05 0.28 0.08 0.21 1.3 0.028 0.038 0.55 0.024 1.8 0.18 0.03 0.6 1.45
## 263 226 61 51 31 27 22 21 20 18 16 13 13 12 11 11 11
##
## $IgD
## .
## 0.2 0.1 0.29 0.03 0.3 0.19 0.09 0.029 1.8 1.3 0.11 0.48 1.12 0.22 1.23
## 263 207 57 42 29 24 22 18 17 16 15 15 13 12 12
##
## $CD11c
## .
## 0.4 0.3 0.2 0.51 0.07 0.06 0.09 0.24 0.5 0.1 0.14 0.033 1.1 1.8 0.19 0.019 0.04
## 287 117 52 43 40 27 22 22 20 19 16 14 13 13 12 11 11
##
## $F480
## .
## 0.3 0.2 0.04 0.1 0.33 0.4 0.11 1.6 0.42 0.03 0.07 0.033 0.19 0.06 0.29
## 262 164 48 44 27 27 24 22 21 13 13 12 12 11 11
##
## $CD3
## .
## 0.2 0.1 0.33 0.07 0.06 0.3 0.08 0.03 0.027 0.17
## 280 182 45 35 33 33 18 17 11 11
##
## $NKp46
## .
## 0.3 0.2 0.1 0.03 0.28 0.05 0.22 0.02 0.27 0.44 0.21 0.021 0.4 0.41
## 242 163 95 38 25 20 16 15 14 14 12 11 11 11
##
## $CD23
## .
## 0.1 0.2 0.4 0.3 0.02 0.04 0.05 0.28 0.18 0.03 0.025 0.06 0.07 0.41 1.2
## 213 182 61 52 29 20 16 16 15 14 12 12 12 11 11
##
## $CD34
## .
## 0.2 0.1 0.3 0.03 0.4 0.05 0.04 0.019 0.02 0.06 0.22 0.09
## 216 199 62 29 27 23 22 21 21 20 14 11
##
## $CD115
## .
## 0.3 0.2 0.1 0.4 0.04 0.09 0.03 0.08 0.22 1.3 0.05 0.06 0.07
## 250 124 80 45 25 19 18 15 14 14 13 13 11
##
## $CD19
## .
## 0.2 0.1 0.3 0.4 0.03 0.06
## 49 23 19 15 14 12
##
## $`120g8`
## .
## 0.2 0.1 0.3 0.05 0.4 0.03 0.19 0.02 0.04 0.06 0.44 0.018 0.25 0.31 0.39 1.3
## 240 197 46 24 24 23 17 16 16 15 15 12 12 11 11 11
##
## $CD8
## .
## 0.2 0.3 0.4 0.1 0.06 0.04 0.44 0.08 0.31 0.03 0.05 0.07
## 206 132 78 74 23 22 20 18 16 14 12 11
##
## $Ly6C
## .
## 0.3 0.1 0.2 0.4 0.04 0.06 0.03 0.19 0.38 0.019 0.27 0.31 0.44
## 174 127 89 89 27 23 19 14 12 11 11 11 11
##
## $CD4
## .
## 0.2 0.1 0.3 0.4 0.07 0.06 0.04 0.31 0.03
## 190 103 76 44 14 13 12 12 11
##
## $CD11b
## .
## 0.3 0.2 0.1 0.4 0.05 0.07 0.03 0.08 0.29 0.11 0.04 0.44
## 173 172 59 54 20 17 16 16 15 12 11 11
##
## $CD27
## .
## 0.2 0.3 0.05 0.1 0.4 0.03 0.09
## 65 36 23 23 20 18 12
##
## $CD16_32
## .
## 0.2 0.1 0.3 0.4 0.04 0.03 0.29 0.07 0.06
## 190 167 82 63 23 22 14 13 12
##
## $SiglecF
## .
## 0.2 0.1 0.3 0.02 0.03 0.06 0.27 0.018 1.23 0.019 0.04 0.22 0.07 0.08 0.33
## 213 185 102 39 28 17 15 14 13 12 12 12 11 11 11
##
## $Foxp3
## .
## 0.1 0.2 0.3 0.04 0.02 0.03 0.06 0.19 0.4 0.018 0.38 1.1 0.019
## 232 185 69 31 29 24 23 16 15 12 12 12 11
##
## $B220
## .
## 0.3 0.2
## 22 12
##
## $CD5
## .
## 0.2 0.3 0.1 0.4 0.06 0.04 0.05 0.03 0.02 0.29 0.09 0.024 0.44
## 139 115 49 44 18 17 17 15 14 14 13 12 11
##
## $FceR1a
## .
## 0.2 0.1 0.3 0.03 0.04 0.02 0.05 1.2 0.019 0.17 0.07 0.08 0.38
## 253 191 57 28 24 19 17 16 13 12 11 11 11
##
## $TCRgd
## .
## 0.2 0.1 0.3 0.03 0.05 0.4 0.04 0.07 0.22 0.44 0.019 0.02 0.31 1.23 0.021 0.09 1.1
## 184 175 128 23 22 22 21 20 15 13 12 12 12 12 11 11 11
##
## $CCR7
## .
## 0.2 0.3 0.1 0.4 0.03 0.06 0.02 0.04 0.05 0.19 0.31 1.4 0.33 0.38 0.44 1.3
## 177 160 123 41 21 21 17 17 13 12 12 12 11 11 11 11
##
## $Sca1
## .
## 0.2 0.3 0.1 0.4 0.06 0.03 0.04 0.05 0.07 0.08 0.29
## 194 146 100 63 22 20 16 16 14 12 11
##
## $CD49b
## .
## 0.2 0.1 0.3 0.4 0.04 0.02 0.03 0.44 0.05 0.08
## 201 185 87 38 27 24 22 19 13 12
##
## $cKit
## .
## 0.2 0.1 0.3 0.4 0.03 0.04 0.05 0.02 0.019 0.38 0.06 0.19 0.07 1.23
## 193 178 105 32 25 25 25 19 15 15 12 12 11 11
##
## $CD150
## .
## 0.2 0.3 0.1 0.4 0.03 0.06 0.04 0.02 0.08 0.11 0.44
## 177 165 102 55 23 21 18 15 15 11 11
##
## $CD25
## .
## 0.1 0.2 0.3 0.03 0.4 0.04 0.02 0.06 0.07 1.2 0.08 0.05
## 170 164 87 30 28 22 17 14 14 14 12 11
##
## $TCRb
## .
## 0.2 0.1 0.3 0.4 0.06 0.05 0.03 0.02 0.04 0.07 0.41
## 162 105 84 47 22 19 18 13 12 11 11
##
## $CD43
## .
## 0.2 0.3 0.1 0.03 0.05 0.4 0.04 0.33 0.018 0.09 0.08
## 122 93 54 18 17 17 14 13 12 12 11
##
## $CD64
## .
## 0.2 0.1 0.3 0.04 0.03 0.07 0.05 0.019 0.021 0.21 0.28 0.4 0.02 0.18 0.19 0.31 1.23
## 207 200 87 30 25 16 15 14 14 14 14 14 12 11 11 11 11
## 1.4
## 11
##
## $CD138
## .
## 0.2 0.1 0.3 0.02 0.03 0.4 0.04 0.06 0.19 0.08 0.18 0.019 0.05
## 215 170 93 30 29 22 20 16 14 13 13 12 12
##
## $CD103
## .
## 0.2 0.1 0.3 0.4 0.04 0.02 0.03 0.06 0.05 0.27 0.022 0.41 1.1
## 190 176 106 33 23 21 21 20 14 13 12 11 11
##
## $IgM
## .
## 0.3 0.2 0.1 0.05
## 42 39 31 11
##
## $CD44
## .
## 0.2 0.3 0.05 0.4 0.09 0.1
## 25 24 14 13 12 11
##
## $MHCII
## .
## 0.2 0.3 0.1 0.03 0.04 0.05 0.4 0.08 0.02 0.06
## 233 162 64 21 21 16 16 13 12 12
What do you notice? That the top values are disproportionately decimal values rounded to the tenths place: 0.1, 0.2, 0.3, 0.4, and so forth.
Let’s put a number to this.
llm_freq_table <- llm_surface_raw %>%
pivot_longer(everything(), names_to = "marker", values_to = "value") %>%
filter(value >= 0, value < 1) %>%
count(value, sort = TRUE) %>%
mutate(pct = n / (nrow(llm_surface_raw) * ncol(llm_surface_raw)) * 100)
llm_freq_table
## # A tibble: 192 × 3
## value n pct
## <dbl> <int> <dbl>
## 1 0.2 6161 14.9
## 2 0.1 4316 10.5
## 3 0.3 4095 9.92
## 4 0.4 1746 4.23
## 5 0.03 725 1.76
## 6 0.04 649 1.57
## 7 0.05 536 1.30
## 8 0.06 529 1.28
## 9 0.02 484 1.17
## 10 0.08 419 1.02
## # ℹ 182 more rows
So you can see that nearly 15% of the dataset is 0.2 alone.
llm_freq_table$pct[1:4] %>% sum()
## [1] 39.54728
So this means that around 39.5% of the data is just 0.1, 0.2, 0.3, or 0.4. We now check the Samusik dataset as our control.
sam[type_markers] %>%
pivot_longer(everything(), names_to = "marker", values_to = "value") %>%
filter(value >= 0, value < 1) %>%
count(value, sort = TRUE) %>%
mutate(pct = n / (nrow(llm_surface_raw) * ncol(llm_surface_raw)) * 100)
## # A tibble: 63,075 × 3
## value n pct
## <dbl> <int> <dbl>
## 1 0.0728 2 0.00485
## 2 0.134 2 0.00485
## 3 0.209 2 0.00485
## 4 0.219 2 0.00485
## 5 0.258 2 0.00485
## 6 0.275 2 0.00485
## 7 0.277 2 0.00485
## 8 0.290 2 0.00485
## 9 0.310 2 0.00485
## 10 0.322 2 0.00485
## # ℹ 63,065 more rows
The frequency table does not show anything rounded to the tenths place. All of this is very telling in terms of what tokens the LLM is choosing. There seems to be a heuristic of “tenths-place decimal of 0.5 or below” as a “near zero” value, which has nothing to do with what CyTOF data look like.
And just to hammer the point home, we look at just 0.1 and 0.2.
llm_freq_table$pct[1:2] %>% sum()
## [1] 25.3914
This shows that a quarter of the LLM dataset’s “type” marker data are just 0.1 and 0.2, showing a very strong bias as to which tokens the LLM chooses to represent “near zero.”
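If you want to reuse this check on other LLM outputs, the “share of tenths-place values” is easy to package as a small helper (a hypothetical function, not part of the analysis above; the floating-point comparison is done with a tolerance):

```python
def one_decimal_share(values, tol=1e-9):
    """Fraction of values that are exact multiples of 0.1 (0.1, 0.2, 0.3, ...)."""
    hits = sum(1 for v in values if abs(v - round(v, 1)) < tol)
    return hits / len(values)

# Three of these four values sit exactly on a tenths-place number
share = one_decimal_share([0.2, 0.1, 0.33, 0.4])  # 0.75
```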
Here, we find that the LLM-generated data were not in line with the actual Samusik data. You can see that in the UMAP, the LLM-generated data primarily formed a separate island. This suggests that not only did the LLM not “rip” the data, but it also could not approximate the data convincingly, with the prompt at hand.
We note that the use of an agentic system like Claude Code would likely lead to the generation of a script that would literally simulate data from the distributions of the Samusik dataset, which it would download. But here, we were concerned primarily with the question of to what extent the Samusik dataset was sitting in the LLM’s training data.
A follow-up experiment prompted Claude Sonnet to simulate 100 cells, rather than 1 cell. When this happened, it ran a Python script under the hood rather than pulling from its own latent space. So for those doing “batch simulation” follow-ups, this should be kept in mind, and the prompt must be engineered so that this does not happen. Note that if you’re using the likes of OpenRouter, this might be invisible to you. In fact, it is possible that this took place in some of the runs above. But when I tested the “1-cell run” prompt on Claude Sonnet in Anthropic’s UI, it did not run a script. It pulled from its “training data.”
We found a strange relationship between mean and standard deviation in the simulated data: an upside-down parabola. Upon running some simulations, we determined that the parabola can be produced by a Bernoulli process, whereby you’re sampling on/off as opposed to sampling from a bimodal distribution. This suggests that Claude is not sampling from a distribution, but rather thinking in terms of marker-positive or marker-negative.
Furthermore, the untransformed data it returned seemed to follow bounded distributions (e.g. between 1000 and 4000), consistent with this positive-vs-negative model. Strikingly, the untransformed data also revealed that the marker-negative values were disproportionately represented by 0.1, 0.2, 0.3, and 0.4, and especially by 0.1 and 0.2, which alone made up 25% of the data.
Additional prompt engineering may well lead to better-approximated data. This first pass was an attempt to see what would happen with minimal instructions beyond “simulate a cell.” But if we were to be more explicit with the instructions, in terms of what it means to simulate a cell (pull from the marker distributions that are expected given your training data around all things CyTOF…), then maybe we would get more convincing simulated data (by this metric, data that line up with the Samusik dataset on the UMAP above, though there are more precise metrics we can use down the line).
We also note that the Samusik dataset is a hard dataset to start with. Perhaps taking a step back and trying with PBMCs might lead to more success here. This is because there is likely more PBMC data in the LLM’s training data, as compared to mouse bone marrow. But this is yet to be determined.