Imagine a universe entirely without structure, without shape, without connections. A cloud of microscopic events, like fragments of space-time … except that there is no space or time. What characterizes one point in space, for one instant? Just the values of the fundamental particle fields, just a handful of numbers. Now, take away all notions of position, arrangement, order, and what’s left? A cloud of random numbers.
But if the pattern that is me could pick itself out from all the other events taking place on this planet, why shouldn’t the pattern we think of as ‘the universe’ assemble itself, find itself, in exactly the same way? If I can piece together my own coherent space and time from data scattered so widely that it might as well be part of some giant cloud of random numbers, then what makes you think that you’re not doing the very same thing?
Greg Egan, Permutation City
Here, we are going to take the Samusik dataset and the LLM-generated data and merge them together, process them, and make our UMAP. We will then color the UMAP by which dataset each cell came from.
To be clear about how the LLM-generated data came about, we will be explicit about how it was done. The LLM used was Claude Sonnet 4.6, a frontier model at the time of the experiment. It was accessed through OpenRouter, where the prompt of simulating one cell from the Samusik dataset was run repeatedly for a set number of iterations.
Below is the script (which was not run directly as part of this markdown). Note that in order to use this script, I set up a “chatbot” tool that is accessible through the command line. I show you how to do this here.
num_runs=1000; # However many you want
prompt="Please simulate one cell in the flagship Samusik CyTOF dataset. Make the data raw. Use floating points as you'd do in a fcs file. I'll do the asinh transform later. I would like you to output as follows.
Here are the names of the parameters: Time, Cell_length, BC1, BC2, BC3, BC4, BC5, BC6, Ter119, CD45.2, Ly6G, IgD, CD11c, F480, CD3, NKp46, CD23, CD34, CD115, CD19, 120g8, CD8, Ly6C, CD4, CD11b, CD27, CD16_32, SiglecF, Foxp3, B220, CD5, FceR1a, TCRgd, CCR7, Sca1, CD49b, cKit, CD150, CD25, TCRb, CD43, CD64, CD138, CD103, IgM, CD44, MHCII, DNA1, DNA2, Cisplatin, beadDist
You will create the numbers for these parameters. Your output will just be the numbers comma separated. No names, but the numbers will be outputted in the order of the markers that we specified above, from Time to beadDist.
Just output this. No other commentary";
for i in $(seq 1 $num_runs);
do
chatbot "claude" "$prompt" >> output.txt;
done
library(tidyverse)
suppressPackageStartupMessages(library(here))
setwd(here("..", "data"))
num_cells <- 10000
sam <- readr::read_rds("samusik_01.rds") %>% as_tibble()
sam <- sam[sample(nrow(sam), num_cells),]
sam$dataset <- "samusik"
Now we read in our LLM results.
setwd(here("..", "local_data"))
# Read in the file
llm <- readr::read_lines("output.txt")
# Tidy
llm <- lapply(llm, function(i) stringr::str_split(i, ",")[[1]] %>% as.numeric()) %>%
do.call(rbind, .) %>%
as_tibble(.name_repair = "minimal") # the matrix has no colnames; we name it below
# sam carries an extra "dataset" column, so drop it when assigning names
names(llm) <- setdiff(names(sam), "dataset")
llm$dataset <- "llm"
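One caveat with the tidying above: LLM output is not guaranteed to be well-formed, and a line with stray commentary or the wrong number of fields would silently corrupt the `rbind`. Below is a more defensive variant of the parse, a sketch rather than what was run above; `expected_fields` of 51 is taken from the parameter list in the prompt.

```r
# Defensive parsing: keep only lines that split into exactly 51 fields,
# all of which parse as numbers. Malformed LLM responses are dropped.
expected_fields <- 51

parsed <- stringr::str_split(readr::read_lines("output.txt"), ",")
ok <- vapply(parsed, function(x) {
  length(x) == expected_fields && !anyNA(suppressWarnings(as.numeric(x)))
}, logical(1))

llm <- lapply(parsed[ok], as.numeric) %>%
  do.call(rbind, .) %>%
  as_tibble(.name_repair = "minimal")
```

Counting how many lines fail the check (`sum(!ok)`) is also a quick way to gauge how often the model ignored the "no other commentary" instruction.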
And now for the mishmash
cells <- bind_rows(sam, llm)
cells
## # A tibble: 11,058 × 52
## Time Cell_length BC1 BC2 BC3 BC4 BC5 BC6 Ter119 CD45.2 Ly6G IgD CD11c F480
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11119416 25 63.0 49.8 110. 7.11 1.95 0.0990 -0.783 2.14 -0.561 -0.226 -0.271 1.13
## 2 16572431 21 319. 239. 282. 0.500 0.551 1.68 -0.449 3.35 0.755 -0.407 3.67 -0.516
## 3 12275170 20 202. 195. 256. 7.74 3.73 1.80 0.713 2.59 -0.458 -0.430 0.421 -0.560
## 4 10763247 28 118. 125. 231. 16.5 13.7 -0.157 -0.633 2.66 -0.117 -0.545 2.24 2.52
## 5 12306886 20 131. 101. 161. 5.24 2.13 6.26 -0.0692 7.70 -0.305 -0.390 -0.744 2.01
## 6 7113868 22 237. 214. 266. 18.0 5.68 -0.422 -0.645 3.65 1.85 0.413 0.681 7.05
## 7 9415969 30 408. 284. 377. 33.3 14.9 6.25 -0.517 15.5 1.15 0.779 -0.583 12.5
## 8 7734687 26 230. 170. 185. 2.60 14.4 1.98 -0.585 3.67 1.50 0.256 -0.456 5.28
## 9 6116212 20 84.1 39.1 62.5 3.59 -0.123 -0.647 0.146 6.25 -0.597 44.7 -0.147 -0.232
## 10 4295451 33 237. 153. 234. 5.64 6.42 -0.0204 4.68 7.23 0.305 -0.520 0.325 -0.269
## # ℹ 11,048 more rows
## # ℹ 38 more variables: CD3 <dbl>, NKp46 <dbl>, CD23 <dbl>, CD34 <dbl>, CD115 <dbl>, CD19 <dbl>,
## # `120g8` <dbl>, CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>, CD16_32 <dbl>,
## # SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>, CCR7 <dbl>, Sca1 <dbl>,
## # CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>, CD64 <dbl>, CD138 <dbl>,
## # CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>, DNA1 <dbl>, DNA2 <dbl>, Cisplatin <dbl>,
## # beadDist <dbl>, dataset <chr>
From here, we do our pre-processing. Do we remember what the surface markers are for the Samusik dataset?
setwd(here("..", "local_data"))
md <- readr::read_rds("marker_types.rds")
type_markers <- dplyr::filter(md, marker_class == "type")$marker_name
surface <- cells[type_markers]
surface <- asinh(surface/5) %>% as_tibble()
surface
## # A tibble: 11,058 × 39
## Ter119 CD45.2 Ly6G IgD CD11c F480 CD3 NKp46 CD23 CD34 CD115 CD19 `120g8`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.156 0.416 -0.112 -0.0453 -0.0541 0.224 -0.163 -0.0276 -0.0216 -0.0161 0.0337 -0.148 -0.109
## 2 -0.0897 0.628 0.150 -0.0814 0.680 -0.103 0.371 0.0221 -0.150 0.831 -0.0806 -0.0858 2.37
## 3 0.142 0.498 -0.0914 -0.0860 0.0841 -0.112 -0.00972 -0.163 -0.100 0.500 0.147 -0.138 -0.143
## 4 -0.126 0.509 -0.0233 -0.109 0.435 0.484 -0.0504 -0.0224 -0.0843 -0.0177 0.0927 -0.163 -0.153
## 5 -0.0138 1.22 -0.0610 -0.0779 -0.148 0.393 0.360 -0.145 0.0297 0.101 -0.0147 0.171 -0.0361
## 6 -0.129 0.677 0.361 0.0824 0.136 1.14 0.239 -0.0989 0.0626 0.736 0.195 -0.0676 -0.156
## 7 -0.103 1.85 0.229 0.155 -0.116 1.65 0.280 0.739 -0.0767 -0.0310 0.703 -0.144 -0.0406
## 8 -0.117 0.681 0.295 0.0512 -0.0911 0.920 -0.0238 0.0400 -0.0808 0.873 0.117 -0.0776 -0.0332
## 9 0.0291 1.05 -0.119 2.89 -0.0294 -0.0464 -0.101 1.19 -0.0832 -0.0835 1.28 0.730 -0.0158
## 10 0.836 1.16 0.0609 -0.104 0.0649 -0.0537 0.0998 -0.116 -0.0230 0.363 0.243 -0.0140 -0.0484
## # ℹ 11,048 more rows
## # ℹ 26 more variables: CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>, CD16_32 <dbl>,
## # SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>, CCR7 <dbl>, Sca1 <dbl>,
## # CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>, CD64 <dbl>, CD138 <dbl>,
## # CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>
And from here, we are going to make a UMAP.
library(umap)
dimr <- umap(surface)$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")
Then we plot it, coloring by dataset:
library(ggplot2)
dimr$dataset <- cells$dataset
ggplot(dimr, aes(x = umap1, y = umap2, color = dataset)) +
geom_point()
Here, we find that the LLM-generated data were not in line with the actual Samusik data. You can see in the UMAP that the LLM-generated cells primarily formed a separate island. This suggests that the LLM not only did not “rip” the data, but also could not approximate the data convincingly, with the prompt at hand.
We note that the use of an agentic system like Claude Code would likely lead to a generated script that pulls the actual Samusik dataset and simulates data directly from its marker distributions. But here, we were concerned primarily with the question of to what extent the Samusik dataset was sitting in the LLM’s “latent space.”
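For illustration, the distribution-based simulation just described could be sketched as follows. This is a naive version that resamples each marker independently from its empirical distribution in the real data, ignoring marker-marker correlations; `simulate_cells` is a hypothetical helper, not something an agent actually produced.

```r
# Resample each marker column independently from the real data's
# empirical distribution. Correlations between markers are lost,
# so this is a lower bound on what a real simulation script would do.
simulate_cells <- function(real, n) {
  as_tibble(lapply(real, function(col) sample(col, n, replace = TRUE)))
}

fake <- simulate_cells(dplyr::select(sam, -dataset), 100)
```

Cells simulated this way would match every marginal distribution by construction, so on a UMAP they would likely sit much closer to the real data than the LLM's raw guesses do.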
Additional prompt engineering may well lead to better-approximated data. This first pass was an attempt to see what would happen with minimal instructions beyond “simulate a cell.” If we were more explicit with the instructions about what it means to simulate a cell (pull from the marker distributions that are expected given your training data around all things CyTOF…), then maybe we would get more convincing simulated data. By this metric, that means data that line up with the Samusik dataset on the UMAP above, though there are more precise metrics we can use down the line.
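One such more precise metric could be nearest-neighbor purity: for each cell, the fraction of its k nearest neighbors in surface-marker space that come from the same dataset. If the LLM cells truly form their own island, their purity will be near 1; well-mixed simulated data would push it toward the overall dataset proportions. This sketch assumes the FNN package is installed, and k = 20 is an arbitrary choice.

```r
# Nearest-neighbor purity in the asinh-transformed marker space.
# High purity for the "llm" group means the simulated cells cluster
# with each other rather than mixing into the real data.
library(FNN)

k <- 20
nn <- FNN::get.knn(as.matrix(surface), k = k)$nn.index
same <- vapply(seq_len(nrow(surface)), function(i) {
  mean(cells$dataset[nn[i, ]] == cells$dataset[i])
}, numeric(1))

tapply(same, cells$dataset, mean)
```

Unlike eyeballing the UMAP, this number is computed in the original marker space, so it is not subject to UMAP's distortions.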
We also note that the Samusik dataset is a hard dataset to start with. Perhaps taking a step back and trying PBMCs might lead to more success here, because there is likely more PBMC data in the LLM’s training corpus than mouse bone marrow data. But this is yet to be determined.