Using LLMs to Chart the Topography of Microbiome Research


Mystics claim that their ecstasies reveal to them a circular chamber containing an enormous circular book with a continuous spine that goes completely around the walls. But their testimony is suspect, their words obscure. That cyclical book is God.

Jorge Luis Borges, The Library of Babel


Introduction

This writeup contains the code and instructions needed to get an LLM to generate as many questions as you like (limited only by price and time). There are numerous unexplored applications of this at the time of writing (2025-07-30 Wed); the practical one discussed here is the ability to plumb the latent space of questions around a particular domain.

To this end, we have chosen the microbiome. This is an integrative field, which involves the relationship between the microbiome and a number of organisms, diseases, and conditions (including astronauts in space). Furthermore, my talented intern (the co-author of this project) intends to pursue a PhD in this domain, so I am compelled to help her in what little ways I can. Her relevance filter guides the product. What helps a burgeoning microbiome student?

Generating the questions

Below is a shell script that calls an LLM using the chatbot() function. You can install the chatbot() function on your command line here.

This code asks an LLM (in this case, Gemini Flash Lite) to generate a question from the prompt num_iter times, appending each response to the file named in the output_file variable.

In [16]:
output_file = "raw_questions.txt"

with open(output_file, "w") as file:
    pass  # empty the file; the loop below appends to it
In [17]:
%%bash
output_file="raw_questions.txt"
num_iter=100
prompt="Please generate a random research question about the microbiome. Output only the question. No extra stuff."
echo "The prompt is: $prompt"
echo

for (( i = 0; i < $num_iter; i++ ))
do
    chatbot geminifl "$prompt" >> "$output_file"
    printf '\n' >> "$output_file"
done
The prompt is: Please generate a random research question about the microbiome. Output only the question. No extra stuff.
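
The same loop can be sketched in Python, which makes it easy to drop exact repeats (asking for a "random" question many times can return the same question more than once). This is a minimal sketch, not the method used above: generate_questions and ask_chatbot are hypothetical names, and ask_chatbot assumes that chatbot geminifl "<prompt>" behaves as it does in the shell cell.

```python
import subprocess
from typing import Callable, List


def ask_chatbot(prompt: str) -> str:
    """Hypothetical wrapper: runs `chatbot geminifl "<prompt>"` as in the shell cell."""
    result = subprocess.run(
        ["chatbot", "geminifl", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout


def generate_questions(ask: Callable[[str], str], prompt: str, num_iter: int) -> List[str]:
    """Call `ask` num_iter times and return the unique, non-empty replies in order."""
    seen = set()
    questions = []
    for _ in range(num_iter):
        # Collapse whitespace so each question fits on one line of the output file
        question = " ".join(ask(prompt).split())
        # Compare case-insensitively so trivial capitalisation differences count as repeats
        if question and question.lower() not in seen:
            seen.add(question.lower())
            questions.append(question)
    return questions
```

Calling generate_questions(ask_chatbot, prompt, 100) and writing the result with "\n".join(...) would give a file in the same one-question-per-line shape that the later cells read. Passing the asking function as an argument also lets you swap in another model without touching the loop.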

Making the embeddings

From here, we will convert these questions into embedding vectors and then project them into two-dimensional UMAP coordinates. We will also run k-means clustering on the UMAP coordinates to group each question with other related questions.
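
To see why embeddings can serve as coordinates at all, here is a toy illustration with made-up three-dimensional vectors (real all-mpnet-base-v2 embeddings have 768 dimensions). Once the vectors are scaled to unit length, which is what normalize_embeddings=True does in the embed helper, the dot product of two vectors is their cosine similarity, so similar questions point in similar directions.

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only)
vectors = np.array([
    [1.0, 2.0, 0.0],   # question A
    [2.0, 4.0, 0.1],   # question B, nearly the same direction as A
    [0.0, 0.1, 3.0],   # question C, a different direction
])

# Scale each row to unit length, as normalize_embeddings=True does
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# For unit vectors, the dot product is the cosine similarity
similarity = unit @ unit.T
print(np.round(similarity, 3))
```

In the printed matrix, A and B score close to 1 and A and C score close to 0. UMAP with metric="cosine" works from this same notion of closeness when it lays the questions out in two dimensions.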

Necessary imports:

In [41]:
import os, warnings

warnings.filterwarnings('ignore')
os.environ['PYTHONWARNINGS'] = 'ignore'
os.environ["TOKENIZERS_PARALLELISM"] = "false" 
In [42]:
import sys
from pathlib import Path
import textwrap
from typing import List

import numpy as np
import pandas as pd
import plotly.express as px
from sentence_transformers import SentenceTransformer
import umap
from sklearn.cluster import KMeans

Defining constants:

In [43]:
MODEL_NAME = "all-mpnet-base-v2"
WRAP_WIDTH = 80
UMAP_RANDOM_STATE = 42

N_CLUSTERS = 10          # k for k means; change for desired cluster number
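
N_CLUSTERS is picked by hand here. If you would rather let the data suggest a value, one common option (not something this writeup does) is to compare mean silhouette scores across several candidate values of k. A minimal sketch, assuming scikit-learn is installed; best_k_by_silhouette is a hypothetical helper, and the three synthetic blobs merely stand in for real UMAP coordinates:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(coords, candidate_ks, random_state=42):
    """Return (best_k, scores): the k with the highest mean silhouette score."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit_predict(coords)
        scores[k] = silhouette_score(coords, labels)
    return max(scores, key=scores.get), scores


# Synthetic stand-in for UMAP coordinates: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
coords = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(coords, range(2, 7))
print(best_k, scores)
```

On real question maps the scores are usually far less clear-cut than on these blobs, so treat the result as a hint and keep the relevance filter of a human reader in the loop.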

Defining helper functions:

In [44]:
split_by_blank = lambda t: [p.strip() for p in t.strip().split("\n\n") if p.strip()]

def embed(texts: List[str]):
    return SentenceTransformer(MODEL_NAME).encode(texts, convert_to_numpy=True, normalize_embeddings=True)

def umap_2d(embeddings):
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=UMAP_RANDOM_STATE)
    return reducer.fit_transform(embeddings)

def cluster_coords(coords):
    return KMeans(n_clusters=N_CLUSTERS, random_state=UMAP_RANDOM_STATE, n_init="auto").fit_predict(coords)

def wrap(text: str) -> str:
    return textwrap.fill(text, width=WRAP_WIDTH).replace("\n", "<br>")

def build_df(questions, coords, labels):
    df = pd.DataFrame({"x": coords[:, 0], "y": coords[:, 1],
                       "question": questions, "cluster": labels})
    
    df = df.reset_index(drop=True)
    df["question_wrapped"] = df["question"].apply(wrap)
    df["cluster_label"] = df["cluster"].astype(str)
    return df

def plot(df):
    fig = px.scatter(
        df,
        x="x",
        y="y",
        color="cluster_label",
        custom_data=["question_wrapped"],
        template="plotly_white",
        title="UMAP of research questions (clustered on UMAP coords)",
    )

    fig.update_traces(
        marker=dict(size=9, opacity=0.8),
        hovertemplate="%{customdata[0]}<extra></extra>",
    )
    fig.write_html("umap_questions_with_clustering.html", auto_open=True)

The following writes two CSV files: "question_map_embeddings.csv", which holds the raw embedding of each question, and "question_map_with_clustering_df.csv", which holds the x and y UMAP coordinates and the cluster label of each question. It also opens an interactive plot (not the final one) as a quick sanity check of your results.

In [45]:
with open("raw_questions.txt", "r") as file:
    questions = file.read().splitlines()

questions = [x for x in questions if x.strip()]

if not questions:
    sys.exit("No questions found.")

print(f"{len(questions)} questions → embed → UMAP → K‑means (on coords) → plot …")

emb = embed(questions)
coords = umap_2d(emb)
labels = cluster_coords(coords)

df = build_df(questions, coords, labels)
plot(df)

# Output
emb = pd.DataFrame(emb)
emb.columns = [f"emb_{i + 1}" for i in range(emb.shape[1])]
emb.to_csv("question_map_embeddings.csv")
df.to_csv("question_map_with_clustering_df.csv")
100 questions → embed → UMAP → K‑means (on coords) → plot …
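
If you come back to the saved CSV later, note that to_csv also writes the dataframe index as a first, unnamed column. A small sketch of reading the file back and counting questions per cluster; the inline rows are made-up stand-ins with only a subset of the real columns, and df_check and cluster_sizes are names chosen for this example:

```python
import io

import pandas as pd

# Stand-in for question_map_with_clustering_df.csv (made-up rows);
# the leading comma marks the unnamed index column that to_csv writes
csv_text = """,x,y,question,cluster
0,1.2,3.4,Question one?,0
1,1.3,3.5,Question two?,0
2,9.8,7.6,Question three?,1
"""

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column
df_check = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Number of questions per cluster, ordered by cluster id
cluster_sizes = df_check["cluster"].value_counts().sort_index()
print(cluster_sizes)
```

With the real file, replace io.StringIO(csv_text) with the path "question_map_with_clustering_df.csv". Very small or very large clusters are a hint that N_CLUSTERS may need adjusting.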

Labelling the clusters as metaquestions

Next, for each cluster, we will feed all of its questions into an LLM (again, yes) to get the overarching "metaquestion" that captures the essence of the investigatory theme of the cluster.

In [46]:
# One list of questions per cluster, in ascending order of cluster id
questions_list = [
    df.loc[df["cluster"] == cluster_id, "question"].tolist()
    for cluster_id in sorted(df["cluster"].unique())
]
In [47]:
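import subprocess

output_file = "meta_questions.txt"

# Opening in "w" mode empties the file; we then write one metaquestion per line
with open(output_file, "w") as file:
    for cluster_questions in questions_list:
        questions_cleaned = [question.strip('\'"') for question in cluster_questions]
        questions_str = " ".join(questions_cleaned)
        query = f"Please give me a metaquestion that encapsulates the main investigatory essence of all of the following questions. Just the question please, no extra fluff. Here are the questions: {questions_str}"
        # Pass the query as one argument without a shell, so quotes or $ in the questions cannot break the command
        result = subprocess.run(["chatbot", "geminifl", query], capture_output=True, text=True, check=True)
        # Collapse the reply onto a single line so that line i corresponds to cluster i
        file.write(" ".join(result.stdout.split()) + "\n")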
In [48]:
with open("meta_questions.txt", "r") as file:
    meta_questions_raw = file.read().splitlines()

# Skip blank lines so that the i-th metaquestion lines up with cluster i
meta_questions_list = [line.strip().strip('\'"') for line in meta_questions_raw if line.strip()]
meta_questions_dict = dict(enumerate(meta_questions_list))
In [49]:
df["meta_question"] = df["cluster"].map(meta_questions_dict)
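
One thing to watch for: Series.map with a dict silently produces NaN for any cluster id that has no entry, which happens if the LLM returned fewer usable lines than there are clusters. A minimal guard, where check_meta_questions is a hypothetical helper and not part of the original notebook:

```python
import pandas as pd


def check_meta_questions(df: pd.DataFrame, meta_questions: dict) -> None:
    """Raise ValueError if any cluster id in df lacks a metaquestion."""
    missing = sorted(set(df["cluster"]) - set(meta_questions))
    if missing:
        raise ValueError(f"No metaquestion for cluster(s): {missing}")


# Toy example: two clusters, both covered, so this passes silently
toy = pd.DataFrame({"cluster": [0, 1, 1]})
check_meta_questions(toy, {0: "How does diet shape the gut microbiome?", 1: "How do microbes affect immunity?"})
```

Running check_meta_questions(df, meta_questions_dict) before the map turns a quiet mislabelling into an immediate error.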

Making the final interactive UMAP

Finally, we will create a UMAP plot with the metaquestions as the cluster labels. Hovering over a point shows its question and metaquestion, and the legend lists every metaquestion, so you can see how the different kinds of questions sit relative to one another in the embedding space.

In [50]:
# wrapping the metaquestion text so the legend and hover labels stay readable
df["meta_question_wrapped"] = df["meta_question"].apply(wrap)
In [51]:
fig = px.scatter(df, 
                 x = "x", 
                 y = "y", 
                 color = "meta_question_wrapped", 
                 hover_data = 
                 {"x": False, "y": False, "question_wrapped": True, "meta_question_wrapped": True, "cluster": False, "cluster_label": False},
                 template="plotly_white",
                 title="UMAP of research questions in metaquestion clusters"
                )

fig.update_traces(hoverlabel = dict(font_size=20))

fig.write_html("umap_questions_with_mq.html", auto_open=True)

The final output that you will open and use is "umap_questions_with_mq.html", which contains the clustered UMAP of the questions. The hover text shows both the question under the cursor and its cluster label, which is the metaquestion that we generated earlier.

Play around with this. After you replicate my work, try different domains. Try different models. So far as I am aware, this is largely uncharted territory.

Date: July 30, 2025 - August 5, 2025