Imagine a universe entirely without structure, without shape, without connections. A cloud of microscopic events, like fragments of space-time … except that there is no space or time. What characterizes one point in space, for one instant? Just the values of the fundamental particle fields, just a handful of numbers. Now, take away all notions of position, arrangement, order, and what’s left? A cloud of random numbers.
But if the pattern that is me could pick itself out from all the other events taking place on this planet, why shouldn’t the pattern we think of as ‘the universe’ assemble itself, find itself, in exactly the same way? If I can piece together my own coherent space and time from data scattered so widely that it might as well be part of some giant cloud of random numbers, then what makes you think that you’re not doing the very same thing?
Greg Egan, Permutation City
Here, we are going to take the Samusik dataset and the LLM-generated data, merge them together, process them, and make our UMAP. We will then color by which dataset each cell came from.
So we can be clear about how the LLM-generated data came about, we will be explicit about how it was done. The LLM used was Claude Sonnet 4.6, a frontier model at the time of the experiment. It was accessed through OpenRouter, where the prompt asking it to simulate one cell from the Samusik dataset was run repeatedly, a set number of times.
Below is the script (which was not run directly as part of this markdown). Note that in order to use this script, I set up a “chatbot” tool that is accessible through the command line. I show you how to do this here.
num_runs=1000; # However many you want
prompt="Please simulate one cell in the flagship Samusik CyTOF dataset. Make the data raw. Use floating points as you'd do in a fcs file. I'll do the asinh transform later. I would like you to output as follows.
Here are the names of the parameters: Time, Cell_length, BC1, BC2, BC3, BC4, BC5, BC6, Ter119, CD45.2, Ly6G, IgD, CD11c, F480, CD3, NKp46, CD23, CD34, CD115, CD19, 120g8, CD8, Ly6C, CD4, CD11b, CD27, CD16_32, SiglecF, Foxp3, B220, CD5, FceR1a, TCRgd, CCR7, Sca1, CD49b, cKit, CD150, CD25, TCRb, CD43, CD64, CD138, CD103, IgM, CD44, MHCII, DNA1, DNA2, Cisplatin, beadDist
You will create the numbers for these parameters. Your output will just be the numbers comma separated. No names, but the nmbers will be outputted in the order of the markers that we specified above, from Time to beadDist.
Just output this. No other commentary";
for i in $(seq 1 $num_runs);
do
chatbot "claude" "$prompt" >> output.txt;
done
library(tidyverse)
suppressPackageStartupMessages(library(here))
setwd(here("..", "data"))
num_cells <- 10000
sam <- readr::read_rds("samusik_01.rds") %>% as_tibble()
sam <- sam[sample(nrow(sam), num_cells),]
sam$dataset <- "samusik"
Now we read in our LLM results.
setwd(here("..", "local_data"))
# Read in the file
llm <- readr::read_lines("output.txt")
# Tidy
llm <- lapply(llm, function(i) stringr::str_split(i, ",")[[1]] %>% as.numeric()) %>%
do.call(rbind, .) %>%
as_tibble()
names(llm) <- names(sam)
llm$dataset <- "llm"
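As an aside, the tidying step above can be sketched outside R as well. Below is a minimal Python version (hypothetical: it assumes one comma-separated cell per line and simply skips any line that fails to parse, such as stray model commentary, whereas the R code above would produce NAs for such lines):

```python
def parse_llm_lines(lines):
    """Parse LLM output: one comma-separated row of floats per line.
    Lines that fail to parse entirely are skipped."""
    rows = []
    for line in lines:
        try:
            rows.append([float(x) for x in line.strip().split(",")])
        except ValueError:
            continue  # skip malformed lines (e.g. model commentary)
    return rows

# Tiny example with one malformed line in the middle
rows = parse_llm_lines(["1.0, 2.5, 0.2", "not a cell", "3.0, 0.1, 0.4"])
# rows is [[1.0, 2.5, 0.2], [3.0, 0.1, 0.4]]
```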
And now for the mishmash
cells <- bind_rows(sam, llm)
cells
## # A tibble: 11,058 × 52
## Time Cell_length BC1 BC2 BC3 BC4 BC5 BC6 Ter119 CD45.2 Ly6G IgD CD11c F480
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10918876 43 173. 200. 210. 12.0 6.56 2.57 0.903 4.37 1.10 -0.270 -0.258 1.32
## 2 13201089 21 504. 281. 337. 5.03 7.93 5.85 -0.597 23.3 2.30 -0.433 0.856 9.22
## 3 6335806 17 241. 190. 297. 16.1 4.76 3.75 -0.655 2.89 0.300 1.57 -0.432 0.342
## 4 14895295 19 152. 91.9 122. 6.80 0.717 -0.306 -0.378 13.5 -0.168 64.6 -0.0998 -0.548
## 5 14834515 18 331. 251. 440. 9.18 2.64 0.465 -0.184 8.87 1.01 -0.266 -0.682 9.20
## 6 563248 19 243. 196. 186. 2.23 3.36 0.133 -0.367 25.2 0.686 -0.207 -0.210 5.14
## 7 9141112 24 100. 118. 169. 17.1 1.12 0.271 -0.0644 7.10 0.421 0.536 -0.233 5.94
## 8 13914301 18 228. 214. 308. 23.2 -0.0301 3.36 2.41 12.6 2.36 -0.395 -0.648 2.39
## 9 3131507 25 226. 170. 233. 8.45 7.82 10.2 -0.652 27.2 2.03 -0.315 1.27 42.4
## 10 8605530 27 275. 178. 316. 25.5 8.82 2.98 -0.433 12.9 0.882 -0.333 -0.310 -0.643
## # ℹ 11,048 more rows
## # ℹ 38 more variables: CD3 <dbl>, NKp46 <dbl>, CD23 <dbl>, CD34 <dbl>, CD115 <dbl>, CD19 <dbl>,
## # `120g8` <dbl>, CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>, CD16_32 <dbl>,
## # SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>, CCR7 <dbl>, Sca1 <dbl>,
## # CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>, CD64 <dbl>, CD138 <dbl>,
## # CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>, DNA1 <dbl>, DNA2 <dbl>, Cisplatin <dbl>,
## # beadDist <dbl>, dataset <chr>
From here, we do our pre-processing. Do we remember what the surface markers are for the Samusik dataset?
setwd(here("..", "local_data"))
md <- readr::read_rds("marker_types.rds")
type_markers <- dplyr::filter(md, marker_class == "type")$marker_name
surface <- cells[type_markers]
surface <- asinh(surface/5) %>% as_tibble()
surface
## # A tibble: 11,058 × 39
## Ter119 CD45.2 Ly6G IgD CD11c F480 CD3 NKp46 CD23 CD34 CD115 CD19
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.180 0.789 0.218 -0.0539 -0.0517 0.262 -0.120 -0.148 0.00429 0.946 -0.00555 -0.101
## 2 -0.119 2.24 0.446 -0.0865 0.170 1.37 -0.0261 -0.0619 -0.0452 0.337 -0.0923 -0.0732
## 3 -0.131 0.551 0.0600 0.310 -0.0862 0.0683 0.340 0.00179 -0.0294 1.45 0.731 -0.0558
## 4 -0.0756 1.72 -0.0336 3.25 -0.0200 -0.109 -0.155 0.115 0.227 -0.00814 -0.0119 1.71
## 5 -0.0367 1.34 0.200 -0.0532 -0.136 1.37 0.0372 0.730 0.0470 0.845 1.81 -0.150
## 6 -0.0734 2.32 0.137 -0.0413 -0.0421 0.901 0.0716 -0.137 -0.112 1.37 1.77 0.139
## 7 -0.0129 1.15 0.0840 0.107 -0.0466 1.01 0.119 -0.157 -0.0433 -0.108 0.387 -0.0885
## 8 0.464 1.65 0.457 -0.0790 -0.129 0.462 -0.0627 -0.140 -0.00825 0.399 -0.0699 0.0409
## 9 -0.130 2.40 0.395 -0.0630 0.251 2.83 0.302 1.16 -0.0799 0.306 0.594 0.0980
## 10 -0.0865 1.68 0.175 -0.0666 -0.0619 -0.128 -0.0132 -0.0161 -0.0732 1.58 0.536 -0.0219
## # ℹ 11,048 more rows
## # ℹ 27 more variables: `120g8` <dbl>, CD8 <dbl>, Ly6C <dbl>, CD4 <dbl>, CD11b <dbl>, CD27 <dbl>,
## # CD16_32 <dbl>, SiglecF <dbl>, Foxp3 <dbl>, B220 <dbl>, CD5 <dbl>, FceR1a <dbl>, TCRgd <dbl>,
## # CCR7 <dbl>, Sca1 <dbl>, CD49b <dbl>, cKit <dbl>, CD150 <dbl>, CD25 <dbl>, TCRb <dbl>, CD43 <dbl>,
## # CD64 <dbl>, CD138 <dbl>, CD103 <dbl>, IgM <dbl>, CD44 <dbl>, MHCII <dbl>
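A note on the transform we just applied: asinh(x/5), with cofactor 5 being the CyTOF convention, behaves roughly linearly near zero and logarithmically for large counts, which is why raw values in the thousands end up in the 6 to 8 range post-transform. A quick illustration in Python (a sketch of the math, not of any package's API):

```python
import math

def cytof_transform(x, cofactor=5):
    """The standard CyTOF transform: asinh(x / cofactor)."""
    return math.asinh(x / cofactor)

near_zero = cytof_transform(0.5)   # roughly linear here: ~ 0.5/5 = 0.1
large = cytof_transform(1000)      # roughly log(2*1000/5) = log(400), ~ 5.99
```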
And from here, we are going to make a UMAP.
library(umap)
dimr <- umap(surface)$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")
And from here, we are going to plot it as such:
library(ggplot2)
ggplot(dimr, aes(x = umap1, y = umap2, color = cells$dataset)) +
geom_point()
We are now going to go a bit more in depth and determine whether the “island” of LLM-generated cells has any sort of biological plausibility at all. To this end, we are going to start with some simple plots, per marker.
ggplot(surface, aes(x = CD45.2, color = cells$dataset)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is already a problem. The LLM data has CD45 being very high. Let’s look at a biaxial now. A standard one. CD3 by CD11b.
ggplot(surface, aes(x = CD3, y = CD11b, color = cells$dataset)) + geom_point()
We can see that this is absolute garbage. So this suggests the data that was pulled out of the LLM makes absolutely no sense. Let’s try another one to see if the pattern holds.
ggplot(surface, aes(x = CD4, y = CD8, color = cells$dataset)) + geom_point()
Yes, the pattern still holds: things that make no sense, like CD4+ CD8+ cells. What this might suggest is a sort of next-token prediction taken to the extreme, where the model was just outputting values more or less at random.
We note that both the value distributions and the combinatorics of the value ranges made no sense: the distributions were much tighter, and there were marker combinations that are biologically implausible.
Let’s do some more summary statistics. We just said that the value distributions are tighter. We can look at that a bit more closely. First, we will split our processed surface marker data.
surface_sam <- dplyr::filter(surface, cells$dataset == "samusik")
surface_llm <- dplyr::filter(surface, cells$dataset == "llm")
From here, we will run our summary statistics.
apply(surface_sam, 2, summary)
## Ter119 CD45.2 Ly6G IgD CD11c F480 CD3 NKp46
## Min. -0.19016366 0.1373186 -0.187688925 -0.18889759 -0.19011366 -0.189593166 -0.18824457 -0.18799585
## 1st Qu. -0.11603891 0.8751696 -0.006833073 -0.09822101 -0.10952191 -0.005818719 -0.10232228 -0.09045308
## Median -0.06781758 1.4206448 0.204757288 -0.03072763 -0.05042880 0.329297839 -0.03403289 -0.01504077
## Mean -0.02420201 1.4536391 0.310679707 0.32568406 0.06942321 0.636488142 0.16442833 0.12334237
## 3rd Qu. -0.01606623 1.9679153 0.535478280 0.17221381 0.05045427 1.018653927 0.12733671 0.19236035
## Max. 1.40240393 4.4354540 2.633664261 4.47037564 4.29823038 5.992107183 4.27247637 2.38599891
## CD23 CD34 CD115 CD19 120g8 CD8 Ly6C CD4
## Min. -0.18988526 -0.18841122 -0.18740831 -0.18903033 -0.18780069 -0.18904360 -0.18710816 -0.19060301
## 1st Qu. -0.11179613 -0.04098163 0.03501379 -0.09769185 -0.10470486 -0.10284890 0.06080606 -0.06966148
## Median -0.05482977 0.15893521 0.32545574 -0.02172942 -0.04220942 -0.03813746 1.98258249 0.07327919
## Mean 0.01912709 0.46696830 0.49978284 0.21548033 0.17983092 0.13230566 1.98359971 0.36153713
## 3rd Qu. 0.01786339 0.64528997 0.81276852 0.27012358 0.09609460 0.11658728 3.58530069 0.46398813
## Max. 2.48313964 4.79682270 3.81789678 3.14682837 5.17723352 5.02384261 6.40423000 6.11056252
## CD11b CD27 CD16_32 SiglecF Foxp3 B220 CD5 FceR1a
## Min. -0.18787495 -0.1894748 -0.1841988 -0.18623274 -0.18983913 -0.18860330 -0.18764791 -0.1891613
## 1st Qu. 0.03832147 -0.0622791 1.6342462 0.03510659 -0.08682806 -0.07000977 0.03637336 -0.0720568
## Median 1.39813860 0.1065137 2.8887704 0.37118737 -0.00253725 0.07642104 0.29984983 0.0520543
## Mean 1.80935542 0.4823741 2.6192100 0.67652048 0.11904613 0.82988936 0.56677586 0.1741945
## 3rd Qu. 3.47878162 0.5780958 3.6649569 0.98288575 0.20543632 1.26447642 0.76579921 0.2894772
## Max. 6.03959576 4.9172335 6.7695805 5.31040054 2.67291918 5.30962116 6.42188386 3.6151468
## TCRgd CCR7 Sca1 CD49b cKit CD150 CD25 TCRb
## Min. -0.190200662 -0.18927588 -0.18829775 -0.19064167 -0.189021321 -0.18877257 -0.18903614 -0.19083474
## 1st Qu. -0.111388009 -0.07606703 -0.03771335 -0.10166505 -0.086742817 -0.09849456 -0.10466090 -0.09806690
## Median -0.056599576 0.03788346 0.16816522 -0.03532581 -0.009417906 -0.02771282 -0.03735943 -0.03026985
## Mean -0.002960151 0.16375119 0.79517759 0.18190695 0.403733439 0.07549126 0.04755663 0.14730894
## 3rd Qu. 0.001927813 0.27507162 0.82555206 0.14380906 0.380707161 0.13319113 0.09678068 0.12761044
## Max. 2.273246265 2.64955647 7.12668585 4.72589475 5.616765855 2.57341739 4.94917915 4.56986693
## CD43 CD64 CD138 CD103 IgM CD44 MHCII
## Min. -0.18834818 -0.18854107 -0.18619332 -0.18792006 -0.18889136 -0.1817913 -0.1856824
## 1st Qu. 0.07755234 0.08630538 -0.07793945 -0.10672799 -0.04143494 2.1828865 0.3753040
## Median 0.69137412 0.48202782 0.02170501 -0.04552825 0.15510626 3.7133102 0.9573889
## Mean 1.17521918 0.72992300 0.14329976 0.01775892 0.80894544 3.2423301 1.6343682
## 3rd Qu. 2.07871909 1.15570223 0.24189563 0.05656577 0.84892095 4.3657561 2.1325015
## Max. 5.87229564 5.70470558 3.27610534 3.44751803 8.15718185 6.8124887 7.6840793
And now we will do the same on the LLM data.
apply(surface_llm, 2, summary)
## Ter119 CD45.2 Ly6G IgD CD11c F480 CD3 NKp46
## Min. -0.24358407 5.043725 -0.15142070 -0.20848635 -0.20261092 -0.20065089 -0.18494388 -0.16721960
## 1st Qu. 0.05996406 6.842333 0.03998934 0.01999867 0.03998934 0.03449315 0.01999867 0.01999867
## Median 0.07991491 7.048010 0.05996406 0.03998934 0.07991491 0.05996406 0.03998934 0.05397379
## Mean 0.10579625 7.056548 0.08124992 0.09519684 0.11134510 0.10511200 0.74695211 0.08275429
## 3rd Qu. 0.08788682 7.167594 0.07592703 0.05796753 0.09983408 0.07991491 0.06595218 0.05996406
## Max. 1.83289343 8.906865 1.54934538 7.05321994 2.24975640 7.05313694 7.59835948 1.54853220
## CD23 CD34 CD115 CD19 120g8 CD8 Ly6C CD4
## Min. -0.18494388 -0.18494388 -0.18100993 0.000000 -0.18100993 -0.17510381 -0.17707320 -0.13756570
## 1st Qu. 0.01999867 0.01999867 0.01999867 6.476020 0.01999867 0.01999867 0.01999867 0.03998934
## Median 0.03998934 0.03998934 0.05996406 6.912342 0.03998934 0.04598379 0.05996406 0.05996406
## Mean 0.20607827 0.07109850 0.08853507 5.396984 0.10447629 0.45501336 0.40201223 2.32965535
## 3rd Qu. 0.07991491 0.06595218 0.07991491 7.130835 0.06595218 0.07991491 0.07991491 6.62082137
## Max. 7.22165227 1.21128984 1.77453898 8.894791 7.05317152 7.54579164 7.15067070 8.04832054
## CD11b CD27 CD16_32 SiglecF Foxp3 B220 CD5 FceR1a
## Min. -0.18690981 -0.14944312 -0.18494388 -0.17904190 -0.18297726 0.000000 -0.17116302 -0.15734989
## 1st Qu. 0.02999550 0.04598379 0.01999867 0.01999867 0.01999867 6.701628 0.03799086 0.01999867
## Median 0.05996406 6.10786936 0.03998934 0.03998934 0.03998934 7.119455 0.05996406 0.03998934
## Mean 0.52548637 3.68449444 0.30579851 0.07359914 0.13550081 6.128205 1.89078788 0.08089356
## 3rd Qu. 0.10132649 6.99729787 0.07991491 0.05996406 0.05996406 7.231808 6.19923476 0.05996406
## Max. 7.34637185 8.68753666 7.35284143 1.32688962 7.52259520 8.978333 7.53941024 6.62087466
## TCRgd CCR7 Sca1 CD49b cKit CD150 CD25 TCRb
## Min. -0.18297726 -0.19083950 -0.16524692 -0.17707320 -0.18100993 -0.15142070 -0.18690981 -0.16721960
## 1st Qu. 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867 0.01999867
## Median 0.03998934 0.03998934 0.04008926 0.03998934 0.03998934 0.04398582 0.03998934 0.05996406
## Mean 0.08751204 0.14680016 0.24460847 0.08887742 0.08062664 0.16013123 0.63088778 1.70828019
## 3rd Qu. 0.06196035 0.07991491 0.07991491 0.07941646 0.06595218 0.07991491 0.08340324 0.40391399
## Max. 7.08279442 7.32972389 7.43474432 6.84242825 1.46402005 7.38365935 7.45447025 8.68154088
## CD43 CD64 CD138 CD103 IgM CD44 MHCII
## Min. -0.18690981 -0.17313375 -0.17510381 -0.18100993 0.00000000 0.000000 -0.16129965
## 1st Qu. 0.03998934 0.01999867 0.01999867 0.01999867 0.07592703 4.829017 0.03799086
## Median 0.06595218 0.03998934 0.03998934 0.03998934 6.60762156 6.909876 0.05597077
## Mean 1.88886045 0.07659209 0.10106515 0.08128831 4.19015198 5.338956 1.02714338
## 3rd Qu. 6.10786268 0.05996406 0.06146127 0.07592703 7.05397007 7.181547 0.12517282
## Max. 7.56270230 1.32688962 6.77933212 1.43194709 8.89478545 8.752822 8.53646238
We can at least see that the LLM data does indeed include negative values. This is otherwise quite a lot to take in, so we are going to make it a bit easier: we are going to plot the mean versus the standard deviation, marker by marker. We will then color by dataset.
msd_sam <- apply(surface_sam, 2, function(i) {
curr_mean <- mean(i)
curr_sd <- sd(i)
return(c(mean = curr_mean, sd = curr_sd))
}) %>% t() %>% as_tibble()
msd_llm <- apply(surface_llm, 2, function(i) {
curr_mean <- mean(i)
curr_sd <- sd(i)
return(c(mean = curr_mean, sd = curr_sd))
}) %>% t() %>% as_tibble()
# what it looks like
msd_sam[1:5,]
## # A tibble: 5 × 2
## mean sd
## <dbl> <dbl>
## 1 -0.0242 0.179
## 2 1.45 0.743
## 3 0.311 0.402
## 4 0.326 0.872
## 5 0.0694 0.378
Now we combine the data and label for our plot.
msd_sam$dataset <- "samusik"
msd_llm$dataset <- "llm"
msd <- dplyr::bind_rows(msd_sam, msd_llm)
msd[1:5,]
## # A tibble: 5 × 3
## mean sd dataset
## <dbl> <dbl> <chr>
## 1 -0.0242 0.179 samusik
## 2 1.45 0.743 samusik
## 3 0.311 0.402 samusik
## 4 0.326 0.872 samusik
## 5 0.0694 0.378 samusik
And from here, we make our plot.
ggplot(msd, aes(x = mean, y = sd, color = dataset)) + geom_point(aes(size = 3))
And we have just discovered something very, very strange. The mean and sd of the markers form a parabola. This is actually very weird. I really did not expect this. What might it mean?
The first thing we are going to do, in order to investigate this further, is to fit an equation to the curve.
fit <- lm(sd ~ mean + I(mean^2), data = msd_llm)
coef(fit) # c (Intercept), b (x), a (I(x^2))
## (Intercept) mean I(mean^2)
## 0.2216885 2.0294554 -0.2843874
c0 <- coef(fit)[1]
b <- coef(fit)[2]
a <- coef(fit)[3]
sprintf("y = %.4f x^2 + %.4f x + %.4f", a, b, c0)
## [1] "y = -0.2844 x^2 + 2.0295 x + 0.2217"
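For what it’s worth, the same quadratic fit is easy to reproduce outside of lm. Here is a sketch using numpy (assuming numpy is available; the mean/sd arrays below are synthetic points generated from the fitted coefficients above, just to show that polyfit recovers them):

```python
import numpy as np

# Synthetic (mean, sd) points lying exactly on the curve fit in the text
mean = np.array([0.1, 1.0, 2.0, 3.5, 5.0, 6.5, 7.0])
sd = 0.2217 + 2.0295 * mean - 0.2844 * mean ** 2

# np.polyfit returns coefficients highest degree first: [a, b, c]
a, b, c = np.polyfit(mean, sd, 2)
# a ~ -0.2844, b ~ 2.0295, c ~ 0.2217
```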
Now we overlay the fitted curve on the plot:
ggplot(msd, aes(x = mean, y = sd, color = dataset)) +
geom_point(aes(size = 3)) + stat_function(fun = function(x) a*x^2 + b*x + c0, color = "black", linetype = "dashed")
Not perfect. The one we have is a bit shorter and wider. This warrants further investigation, but this will do for now. We’re moving to the outer edge of my wheelhouse now, but this might come about as a result of very tightly bounded mins and maxes in the sampling. To quote ChatGPT:
A very tight upside-down parabola in (mean, sd) across features usually means each marker is behaving like a bounded on/off variable with (almost) the same min and max across markers.
This is sort of what we saw in the biaxials. So here’s what we can do: the same exercise as the mean-sd plot, but with the min vs max per marker, followed by a biaxial of min vs max.
mm_sam <- apply(surface_sam, 2, function(i) {
curr_min <- min(i)
curr_max <- max(i)
return(c(min = curr_min, max = curr_max))
}) %>% t() %>% as_tibble()
mm_llm <- apply(surface_llm, 2, function(i) {
curr_min <- min(i)
curr_max <- max(i)
return(c(min = curr_min, max = curr_max))
}) %>% t() %>% as_tibble()
# what it looks like
mm_sam[1:5,]
## # A tibble: 5 × 2
## min max
## <dbl> <dbl>
## 1 -0.190 1.40
## 2 0.137 4.44
## 3 -0.188 2.63
## 4 -0.189 4.47
## 5 -0.190 4.30
And from here, we combine the data and make our plots.
mm_sam$dataset <- "samusik"
mm_llm$dataset <- "llm"
mm <- dplyr::bind_rows(mm_sam, mm_llm)
mm[1:5,]
## # A tibble: 5 × 3
## min max dataset
## <dbl> <dbl> <chr>
## 1 -0.190 1.40 samusik
## 2 0.137 4.44 samusik
## 3 -0.188 2.63 samusik
## 4 -0.189 4.47 samusik
## 5 -0.190 4.30 samusik
And now we plot:
ggplot(mm, aes(x = min, y = max, color = dataset)) + geom_point(aes(size = 3))
So this tells us that the max values of the LLM dataset are more tightly bounded (from both above and below) than those of the Samusik dataset. I think the point in the upper right of the LLM dataset is CD45, whose histogram we saw earlier.
So now we are going to attempt to reproduce the upside-down parabola. After doing some visualization, it seems that if we have a bimodal distribution that is very tight around two numbers (e.g. 0 and 7), then the mean and the standard deviation of that distribution might be tightly bound to the parabola: a mean midway between the two peaks would have the maximum standard deviation (incorporating both peaks equally), while a mean close to one side would have a minimum standard deviation, closest to that of the single peak, because most of the data points would sit on that peak and only a few on the other.
Let’s just run a simulation to put this into pictures.
make_bimodal <- function(p, sd_tight = 0.001) {
side <- sample(c(0, 1), 1, prob = c(1 - p, p))
if(side == 0) {
result <- rnorm(1, mean = 0, sd = sd_tight)
} else {
result <- rnorm(1, mean = 7, sd = sd_tight)
}
return(result)
}
probs <- runif(39, min = 0, max = 1)
dists <- lapply(seq(1000), function(i) {
result <- sapply(probs, function(j) {
make_bimodal(j)
})
return(result)
})
dists[1:3]
## [[1]]
## [1] 7.000156e+00 6.998986e+00 6.999212e+00 6.999520e+00 -3.621034e-04 7.000346e+00 -3.403421e-04
## [8] 6.997734e+00 -9.991783e-04 2.697323e-03 7.000889e+00 7.001053e+00 3.316079e-04 3.461340e-04
## [15] 7.000093e+00 6.999955e+00 6.216383e-04 6.999729e+00 1.217954e-03 6.999986e+00 6.998312e+00
## [22] 7.000178e+00 7.000536e+00 -3.131586e-04 -9.640669e-04 -3.844386e-04 7.000438e+00 7.441518e-04
## [29] -2.396784e-04 8.401598e-05 -4.203988e-04 4.841057e-04 -9.310505e-04 6.999681e+00 3.635099e-04
## [36] 7.000485e+00 7.956802e-04 7.000368e+00 7.001714e+00
##
## [[2]]
## [1] 5.053340e-04 6.999675e+00 -9.465689e-04 6.999416e+00 -1.525179e-03 6.999398e+00 -1.419020e-03
## [8] -6.276005e-05 -9.035718e-04 7.346070e-04 -1.257077e-03 9.564911e-04 6.999904e+00 -5.628557e-04
## [15] -1.644493e-03 -8.551046e-04 7.000756e+00 6.999889e+00 6.999604e+00 6.999333e+00 4.968569e-04
## [22] 6.999753e+00 7.000967e+00 3.306486e-04 -2.068207e-03 7.000613e+00 3.798048e-04 -3.449821e-04
## [29] -1.474497e-03 -4.261783e-04 3.233054e-04 -7.004294e-05 -3.738492e-04 7.000600e+00 -1.430327e-03
## [36] 3.961244e-04 6.999576e+00 7.000288e+00 -6.058894e-04
##
## [[3]]
## [1] -1.866398e-03 7.000273e+00 7.000619e+00 6.999274e+00 5.677327e-04 6.997856e+00 5.110735e-04
## [8] 5.418109e-04 1.737952e-03 3.033823e-04 7.000354e+00 2.023997e-03 7.000189e+00 -1.160867e-03
## [15] 6.997018e+00 6.999626e+00 7.001172e+00 7.000662e+00 6.998291e+00 7.002196e+00 -1.594206e-03
## [22] 6.999705e+00 -9.711627e-04 -7.280571e-05 7.000155e+00 7.000740e+00 7.000661e+00 1.740819e-04
## [29] -1.362673e-03 7.000483e+00 7.000479e+00 3.747400e-04 6.998916e+00 2.135212e-05 7.000594e+00
## [36] 7.001298e+00 1.055461e-03 -3.276885e-04 -7.465639e-04
And now we make our mean and standard deviation for this.
dists_mat <- do.call(rbind, dists)
dists_summary <- tibble(dist_mean = colMeans(dists_mat), dist_sd = apply(dists_mat, 2, sd))
And from here, we plot:
ggplot(dists_summary, aes(x = dist_mean, y = dist_sd)) + geom_point(aes(size = 3))
Not shown: I tried the above with various standard deviations in the make_bimodal function. What I found is that as soon as I dropped the standard deviation to effectively zero, such that it was an on/off (Bernoulli) process, I was able to replicate the parabola.
This suggests that the LLM is thinking more in terms of on/off, rather than sampling from a distribution in its training data.
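There is a closed form behind this. If a marker takes only two values, 0 (“off”) and M (“on”), with probability p of being on, then the mean is Mp and the variance is M²p(1−p), which works out to exactly mean·(M − mean): an upside-down parabola in the mean, with the sd being its square root. A quick check in Python, with M = 7 as in the simulation above:

```python
import math

M = 7  # the "on" level, matching make_bimodal above

for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    mean = M * p                # E[X] for a {0, M} on/off variable
    var = M ** 2 * p * (1 - p)  # population variance
    # the parabola: variance = mean * (M - mean)
    assert abs(var - mean * (M - mean)) < 1e-12
    sd = math.sqrt(var)         # sd peaks at mean = M / 2
```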
We will check one more thing. We looked above at post-transform distributions. Now we are going to look again at the raw data. Pre-transform. This might give us some insights as to what the LLM is doing when it is actually trying to pick values that “make sense.”
So we go back to the original datasets and have a look. We have already looked at CD45, so let’s see what that looks like raw. We go back to the “sam” and “llm” objects, which are not yet transformed.
The Samusik dataset.
ggplot(sam, aes(x = CD45.2)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The LLM dataset.
ggplot(llm, aes(x = CD45.2)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The LLM dataset does not look much like a distribution. It seems to be bounded between something like 1000 and 5000 counts (with a few exceptions). This would suggest that the LLM is thinking in “boundaries.” We also notice that these values are very high in comparison to the Samusik dataset, whose values peak in the hundreds.
Now let’s go ahead and look at the biaxials we had previously plotted. Starting with CD3 and CD11b.
ggplot(cells, aes(x = CD3, y = CD11b, color = cells$dataset)) + geom_point()
## Warning: Use of `cells$dataset` is discouraged.
## ℹ Use `dataset` instead.
Once again, we see very large values for the marker-positive populations in the LLM dataset in comparison to the Samusik dataset. CD3+ seems to be bounded between 1000 and 5000 with a rather uniform distribution. CD11b+ seems to be bounded between 1000 and 4000 with a few exceptions underneath. So this seems to suggest bounded uniform distributions.
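These raw bounds line up with the post-transform plateau we saw in the summary statistics. For large x, asinh(x/5) is approximately log(2x/5), so a raw range of 1000 to 5000 maps to roughly 6.0 to 7.6 after the transform, which is right where the LLM’s marker-positive values sat. A quick check (a sketch of the arithmetic only):

```python
import math

# Transform the raw bounds observed in the LLM biaxials
lo = math.asinh(1000 / 5)   # ~ 5.99
hi = math.asinh(5000 / 5)   # ~ 7.60

# Compare against the log approximation asinh(y) ~ log(2y) for large y
approx_lo = math.log(2 * 1000 / 5)
```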
Now we can look at CD4 and CD8 again, untransformed, and see if we see similar patterns.
ggplot(cells, aes(x = CD4, y = CD8, color = cells$dataset)) + geom_point()
## Warning: Use of `cells$dataset` is discouraged.
## ℹ Use `dataset` instead.
We have the same story with the very large marker-positive values coming out of the LLM dataset. CD4+ has rather loose bounds between 1000 and 4000. CD8+ has similar loose bounds between 1000 and 4000. While CD4’s distribution looks fairly uniform (at least eyeballing it), CD8 looks like there’s a cluster at 2000 and a cluster that starts just below 3000. This would suggest different “levels” of CD8 in the training data. Perhaps this value can change depending on different cell subsets, or datasets.
Either way, the untransformed LLM output seems to be pointing toward bounded ranges. The shapes of the distributions require further investigation.
Now, a very interesting observation can be made if we look at the negative values. As per the plots above, they appear to be clustering around zero. But let’s take a closer look.
llm_surface_raw <- llm[type_markers]
apply(llm_surface_raw[type_markers], 2, function(i) {
i[i < 5] %>% table() %>% sort(decreasing = TRUE) %>% .[. > 10]
})
## $Ter119
## .
## 0.4 0.3 0.08 1.4 0.44 0.031 0.18 0.31 0.8 1.23 1.45 0.45 0.15 0.041 0.05 0.67
## 361 113 68 50 48 36 28 28 21 21 21 19 17 16 15 15
##
## $CD45.2
## integer(0)
##
## $Ly6G
## .
## 0.3 0.2 0.38 0.05 0.28 0.08 0.21 1.3 0.028 0.038 0.55 0.024 1.8 0.18 0.03 0.6 1.45
## 263 226 61 51 31 27 22 21 20 18 16 13 13 12 11 11 11
##
## $IgD
## .
## 0.2 0.1 0.29 0.03 0.3 0.19 0.09 0.029 1.8 1.3 0.11 0.48 1.12 0.22 1.23
## 263 207 57 42 29 24 22 18 17 16 15 15 13 12 12
##
## $CD11c
## .
## 0.4 0.3 0.2 0.51 0.07 0.06 0.09 0.24 0.5 0.1 0.14 0.033 1.1 1.8 0.19 0.019 0.04
## 287 117 52 43 40 27 22 22 20 19 16 14 13 13 12 11 11
##
## $F480
## .
## 0.3 0.2 0.04 0.1 0.33 0.4 0.11 1.6 0.42 0.03 0.07 0.033 0.19 0.06 0.29
## 262 164 48 44 27 27 24 22 21 13 13 12 12 11 11
##
## $CD3
## .
## 0.2 0.1 0.33 0.07 0.06 0.3 0.08 0.03 0.027 0.17
## 280 182 45 35 33 33 18 17 11 11
##
## $NKp46
## .
## 0.3 0.2 0.1 0.03 0.28 0.05 0.22 0.02 0.27 0.44 0.21 0.021 0.4 0.41
## 242 163 95 38 25 20 16 15 14 14 12 11 11 11
##
## $CD23
## .
## 0.1 0.2 0.4 0.3 0.02 0.04 0.05 0.28 0.18 0.03 0.025 0.06 0.07 0.41 1.2
## 213 182 61 52 29 20 16 16 15 14 12 12 12 11 11
##
## $CD34
## .
## 0.2 0.1 0.3 0.03 0.4 0.05 0.04 0.019 0.02 0.06 0.22 0.09
## 216 199 62 29 27 23 22 21 21 20 14 11
##
## $CD115
## .
## 0.3 0.2 0.1 0.4 0.04 0.09 0.03 0.08 0.22 1.3 0.05 0.06 0.07
## 250 124 80 45 25 19 18 15 14 14 13 13 11
##
## $CD19
## .
## 0.2 0.1 0.3 0.4 0.03 0.06
## 49 23 19 15 14 12
##
## $`120g8`
## .
## 0.2 0.1 0.3 0.05 0.4 0.03 0.19 0.02 0.04 0.06 0.44 0.018 0.25 0.31 0.39 1.3
## 240 197 46 24 24 23 17 16 16 15 15 12 12 11 11 11
##
## $CD8
## .
## 0.2 0.3 0.4 0.1 0.06 0.04 0.44 0.08 0.31 0.03 0.05 0.07
## 206 132 78 74 23 22 20 18 16 14 12 11
##
## $Ly6C
## .
## 0.3 0.1 0.2 0.4 0.04 0.06 0.03 0.19 0.38 0.019 0.27 0.31 0.44
## 174 127 89 89 27 23 19 14 12 11 11 11 11
##
## $CD4
## .
## 0.2 0.1 0.3 0.4 0.07 0.06 0.04 0.31 0.03
## 190 103 76 44 14 13 12 12 11
##
## $CD11b
## .
## 0.3 0.2 0.1 0.4 0.05 0.07 0.03 0.08 0.29 0.11 0.04 0.44
## 173 172 59 54 20 17 16 16 15 12 11 11
##
## $CD27
## .
## 0.2 0.3 0.05 0.1 0.4 0.03 0.09
## 65 36 23 23 20 18 12
##
## $CD16_32
## .
## 0.2 0.1 0.3 0.4 0.04 0.03 0.29 0.07 0.06
## 190 167 82 63 23 22 14 13 12
##
## $SiglecF
## .
## 0.2 0.1 0.3 0.02 0.03 0.06 0.27 0.018 1.23 0.019 0.04 0.22 0.07 0.08 0.33
## 213 185 102 39 28 17 15 14 13 12 12 12 11 11 11
##
## $Foxp3
## .
## 0.1 0.2 0.3 0.04 0.02 0.03 0.06 0.19 0.4 0.018 0.38 1.1 0.019
## 232 185 69 31 29 24 23 16 15 12 12 12 11
##
## $B220
## .
## 0.3 0.2
## 22 12
##
## $CD5
## .
## 0.2 0.3 0.1 0.4 0.06 0.04 0.05 0.03 0.02 0.29 0.09 0.024 0.44
## 139 115 49 44 18 17 17 15 14 14 13 12 11
##
## $FceR1a
## .
## 0.2 0.1 0.3 0.03 0.04 0.02 0.05 1.2 0.019 0.17 0.07 0.08 0.38
## 253 191 57 28 24 19 17 16 13 12 11 11 11
##
## $TCRgd
## .
## 0.2 0.1 0.3 0.03 0.05 0.4 0.04 0.07 0.22 0.44 0.019 0.02 0.31 1.23 0.021 0.09 1.1
## 184 175 128 23 22 22 21 20 15 13 12 12 12 12 11 11 11
##
## $CCR7
## .
## 0.2 0.3 0.1 0.4 0.03 0.06 0.02 0.04 0.05 0.19 0.31 1.4 0.33 0.38 0.44 1.3
## 177 160 123 41 21 21 17 17 13 12 12 12 11 11 11 11
##
## $Sca1
## .
## 0.2 0.3 0.1 0.4 0.06 0.03 0.04 0.05 0.07 0.08 0.29
## 194 146 100 63 22 20 16 16 14 12 11
##
## $CD49b
## .
## 0.2 0.1 0.3 0.4 0.04 0.02 0.03 0.44 0.05 0.08
## 201 185 87 38 27 24 22 19 13 12
##
## $cKit
## .
## 0.2 0.1 0.3 0.4 0.03 0.04 0.05 0.02 0.019 0.38 0.06 0.19 0.07 1.23
## 193 178 105 32 25 25 25 19 15 15 12 12 11 11
##
## $CD150
## .
## 0.2 0.3 0.1 0.4 0.03 0.06 0.04 0.02 0.08 0.11 0.44
## 177 165 102 55 23 21 18 15 15 11 11
##
## $CD25
## .
## 0.1 0.2 0.3 0.03 0.4 0.04 0.02 0.06 0.07 1.2 0.08 0.05
## 170 164 87 30 28 22 17 14 14 14 12 11
##
## $TCRb
## .
## 0.2 0.1 0.3 0.4 0.06 0.05 0.03 0.02 0.04 0.07 0.41
## 162 105 84 47 22 19 18 13 12 11 11
##
## $CD43
## .
## 0.2 0.3 0.1 0.03 0.05 0.4 0.04 0.33 0.018 0.09 0.08
## 122 93 54 18 17 17 14 13 12 12 11
##
## $CD64
## .
## 0.2 0.1 0.3 0.04 0.03 0.07 0.05 0.019 0.021 0.21 0.28 0.4 0.02 0.18 0.19 0.31 1.23
## 207 200 87 30 25 16 15 14 14 14 14 14 12 11 11 11 11
## 1.4
## 11
##
## $CD138
## .
## 0.2 0.1 0.3 0.02 0.03 0.4 0.04 0.06 0.19 0.08 0.18 0.019 0.05
## 215 170 93 30 29 22 20 16 14 13 13 12 12
##
## $CD103
## .
## 0.2 0.1 0.3 0.4 0.04 0.02 0.03 0.06 0.05 0.27 0.022 0.41 1.1
## 190 176 106 33 23 21 21 20 14 13 12 11 11
##
## $IgM
## .
## 0.3 0.2 0.1 0.05
## 42 39 31 11
##
## $CD44
## .
## 0.2 0.3 0.05 0.4 0.09 0.1
## 25 24 14 13 12 11
##
## $MHCII
## .
## 0.2 0.3 0.1 0.03 0.04 0.05 0.4 0.08 0.02 0.06
## 233 162 64 21 21 16 16 13 12 12
What do you notice? That the top values are disproportionately decimal values rounded to the tenths place: 0.1, 0.2, 0.3, 0.4, and so forth.
Let’s put a number to this.
llm_freq_table <- llm_surface_raw %>%
pivot_longer(everything(), names_to = "marker", values_to = "value") %>%
filter(value >= 0, value < 1) %>%
count(value, sort = TRUE) %>%
mutate(pct = n / (nrow(llm_surface_raw) * ncol(llm_surface_raw)) * 100)
llm_freq_table
## # A tibble: 192 × 3
## value n pct
## <dbl> <int> <dbl>
## 1 0.2 6161 14.9
## 2 0.1 4316 10.5
## 3 0.3 4095 9.92
## 4 0.4 1746 4.23
## 5 0.03 725 1.76
## 6 0.04 649 1.57
## 7 0.05 536 1.30
## 8 0.06 529 1.28
## 9 0.02 484 1.17
## 10 0.08 419 1.02
## # ℹ 182 more rows
So you can see that nearly 15% of the dataset is 0.2 alone.
llm_freq_table$pct[1:4] %>% sum()
## [1] 39.54728
So this means that around 39.5% of the data is just 0.1, 0.2, 0.3, or 0.4. We now check the Samusik dataset as our control.
sam[type_markers] %>%
pivot_longer(everything(), names_to = "marker", values_to = "value") %>%
filter(value >= 0, value < 1) %>%
count(value, sort = TRUE) %>%
mutate(pct = n / (nrow(llm_surface_raw) * ncol(llm_surface_raw)) * 100)
## # A tibble: 63,075 × 3
## value n pct
## <dbl> <int> <dbl>
## 1 0.0728 2 0.00485
## 2 0.134 2 0.00485
## 3 0.209 2 0.00485
## 4 0.219 2 0.00485
## 5 0.258 2 0.00485
## 6 0.275 2 0.00485
## 7 0.277 2 0.00485
## 8 0.290 2 0.00485
## 9 0.310 2 0.00485
## 10 0.322 2 0.00485
## # ℹ 63,065 more rows
The frequency table does not show anything rounded to the tenths place. All of this is very telling in terms of what tokens the LLM is choosing. There seems to be a heuristic of “tenths-place decimal of 0.5 or below” as a “near zero” value, which has nothing to do with what CyTOF data look like.
And just to hammer the point home, we look at just 0.1 and 0.2.
llm_freq_table$pct[1:2] %>% sum()
## [1] 25.3914
This shows that a quarter of the LLM dataset’s “type” marker data are just 0.1 and 0.2, showing a very strong bias as to which tokens the LLM chooses to represent “near zero.”
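If you want to reuse this check on other LLM outputs, the “share of tenths-place values” is easy to package as a small helper (a hypothetical function, not part of the analysis above; the floating-point comparison is done with a tolerance):

```python
def one_decimal_share(values, tol=1e-9):
    """Fraction of values that are exact multiples of 0.1 (0.1, 0.2, 0.3, ...)."""
    hits = sum(1 for v in values if abs(v - round(v, 1)) < tol)
    return hits / len(values)

# Three of these four values sit exactly on a tenths-place number
share = one_decimal_share([0.2, 0.1, 0.33, 0.4])  # 0.75
```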
Here, we find that the LLM-generated data were not in line with the actual Samusik data. You can see that in the UMAP, the LLM-generated data primarily formed a separate island. This suggests that not only did the LLM not “rip” the data, but it also could not approximate the data convincingly, with the prompt at hand.
We note that the use of an agentic system like Claude Code would likely lead to the generation of a script that would literally simulate data from the distributions of the Samusik dataset, which it would download. But here, we were concerned primarily with the question of to what extent the Samusik dataset was sitting in the LLM’s training data.
A follow-up experiment prompted Claude Sonnet to simulate 100 cells, rather than 1 cell. When this happened, it ran a Python script under the hood rather than pulling from its own latent space. So for those doing “batch simulation” follow-ups, this should be kept in mind, and the prompt must be engineered so that this does not happen. Note that if you’re using the likes of OpenRouter, this might be invisible to you. In fact, it is possible that this took place in some of the runs above. But when I tested the “1-cell run” prompt on Claude Sonnet in Anthropic’s UI, it did not run a script. It pulled from its “training data.”
We found a strange relationship between mean and standard deviation in the simulated data: an upside-down parabola. Upon running some simulations, we determined that the parabola can be produced by a Bernoulli process, whereby you’re sampling on/off as opposed to sampling from a bimodal distribution. This suggests that Claude is not sampling from a distribution, but rather thinking in terms of marker-positive or marker-negative.
Furthermore, the untransformed data it returned seemed to follow bounded distributions (e.g. between 1000 and 4000), consistent with this positive-vs-negative model. Strikingly, the untransformed data also revealed that the marker-negative values were disproportionately represented by 0.1, 0.2, 0.3, and 0.4, and especially by 0.1 and 0.2, which alone made up 25% of the data.
Additional prompt engineering may well lead to better-approximated data. This first pass was an attempt to see what would happen with minimal instructions beyond “simulate a cell.” But if we were to be more explicit with the instructions, in terms of what it means to simulate a cell (pull from the marker distributions that are expected given your training data around all things CyTOF…), then maybe we would get more convincing simulated data (by this metric, data that line up with the Samusik dataset on the UMAP above, though there are more precise metrics we can use down the line).
We also note that the Samusik dataset is a hard dataset to start with. Perhaps taking a step back and trying with PBMCs might lead to more success here. This is because there is likely more PBMC data in the LLM’s training data, as compared to mouse bone marrow. But this is yet to be determined.