The Simulator’s Dilemma, Revisited
Notes from the NCI Foundation Models for Cancer Workshop, March 26, 2026
Yesterday, the US National Cancer Institute (NCI) convened a workshop on foundation models in cancer biology. What struck me most was not the technical ambition on display but the quality of the skepticism, a virtual room full of people who have stopped debating whether these models belong in oncology and started asking something considerably harder: whether the frameworks we are using to build and evaluate them are capable of representing the biology we need them to predict. That question is one I have been working through for some time, and my talk was an attempt to lay out the argument as precisely as I could.
I. The molecule that knew everything except what mattered
I opened with RG7112, an MDM2 inhibitor evaluated in a Phase I trial for acute myeloid leukemia. In silico it was immaculate: low nanomolar affinity for its target, greater than 1000-fold selectivity over off-targets, favorable ADMET profile, Lipinski’s Rule of Five cleared without objection. Every computational checkpoint had been passed. The molecule had been assessed using structure-based methods drawing on thousands of crystallographic structures and the full weight of what computational drug discovery could bring to bear.
In patients, it produced dose-limiting thrombocytopenia and neutropenia, along with systemic homeostatic disruption that none of the models had anticipated. The compound had raised p53 levels exactly as designed. What no simulation had represented was MDM2’s role in hematopoietic homeostasis, its function in maintaining the very cellular populations whose depletion closed the therapeutic window. The biology responded across tissues and timescales the model had no architecture to reach. The simulation was not wrong. It was operating in a different biological register than the one that determined the outcome.
I used RG7112 not because it failed spectacularly but because it failed instructively. Since 1950, the cost of bringing a drug to market has doubled roughly every nine years, now exceeding $2.6 billion per approval, while the enabling technologies have advanced exponentially across the same period. Drug output per research dollar has collapsed anyway. We are applying increasingly powerful instruments to a problem whose fundamental difficulty has not yielded to them -- and that asymmetry demands an explanation beyond the usual invocations of biological complexity.
II. When the abstraction becomes the obstacle
The explanation I proposed at the workshop is architectural. Most modeling approaches in drug discovery are built on a reductionist biological view: Protein A influences Protein B, which influences Protein C. The abstraction is not false; it has produced real successes, most durably when the biology happens to be genuinely linear and the target genuinely necessary and sufficient, as with BCR-ABL in chronic myeloid leukemia or PD-1/PD-L1 signaling in tumors with pre-existing immune infiltration. But the danger lies in treating the abstraction as a complete description of living systems, because biological systems are dense, adaptive networks of thousands of interacting signals, most unmeasured and many unknown. When models are built on simplifying assumptions that do not hold in vivo, their predictive validity remains limited regardless of how sophisticated the computation becomes.
This was the structural failure of the first wave of AI in drug discovery, which ran roughly from the 1980s through the 2010s. QSAR models, molecular docking, and virtual screening were tools that operated within closed experimental loops, trained on proxy variables like binding affinity and validated against similar proxies. The correlations they captured were real within constrained systems. What they accumulated was not causal understanding of adaptive biological behavior but increasingly confident predictions about an abstraction that diverged from reality at the exact boundary where clinical outcomes are determined.
The current generation has a different vocabulary (generative design, foundation models, phenomics, and so on), but a version of the same core limitation persists. AI today is genuinely effective at optimization: navigating the tradeoff space among molecular potency, selectivity, metabolic stability, and synthetic accessibility with a precision that accelerates medicinal chemistry substantially. But optimization and discovery are different problem classes.
Optimization assumes the underlying biological hypothesis is correct and improves performance within that framework. Discovery requires determining whether the hypothesis is valid in the first place. No amount of optimization capability resolves uncertainty in discovery, and a model that efficiently explores chemical space around a target that turns out to be dispensable in the context of a living organism is threading the needle through the wrong fabric.
The phenomics approach makes this concrete. High-content phenotypic assays measure morphological or transcriptional changes in cell-based systems and use them as proxies for therapeutic efficacy. The technical achievement is real. But cell lines on plastic lack immune infiltration, metabolic gradients, and tissue architecture, the features that determine whether a perturbation produces clinical benefit in a patient rather than a phenotypic readout in a well. A model may predict phenotypic rescue accurately and clinical outcome not at all, because the proxy and the target are operating in different biological registers.
The data I presented from a study we published in 2020 makes the same point at the level of the entire translational pipeline. Across 108 oncology drugs evaluated in mouse, rat, dog, and monkey models, we found median positive predictive values of 0.65 and negative predictive values of 0.50 for predicting human toxicity. A clean animal signal provides essentially coin-flip assurance of human safety.
Our challenge is not confined to computational modeling; the entire pipeline from in vitro systems through animal models does not reliably approximate human biology, and any foundation model trained on data generated by that pipeline inherits its limitations as structural features of the training distribution.
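To make the coin-flip arithmetic concrete, here is how those two metrics are computed. The confusion-matrix counts below are invented for illustration; only the resulting 0.65 and 0.50 are taken from the study.

```python
# Hypothetical counts chosen to reproduce the reported metrics; not the study's data.
# "Animal-positive" = the animal models flagged toxicity; "toxic" = toxicity seen in people.
tp, fp = 65, 35   # animal-positive drugs: 65 toxic in humans, 35 not
tn, fn = 50, 50   # animal-negative drugs: 50 safe in humans, 50 toxic anyway

ppv = tp / (tp + fp)   # P(human toxicity | animal signal)       -> 0.65
npv = tn / (tn + fn)   # P(human safety   | clean animal data)   -> 0.50

print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
# An NPV of 0.50 is the coin flip: a clean animal package tells you almost
# nothing about whether the compound will be safe in people.
```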
The mathematical frame for all of this is interpolation versus extrapolation. Machine learning systems are reliable when test conditions resemble training conditions closely enough that the learned mapping holds. Chemistry and physics generally meet this condition; their governing principles are stable and well-characterized. Biology does not. Emergent behaviors, adaptive resistance, immune tolerance, and compensatory signaling across tissue types are the phenomena that determine clinical outcomes, and they exist outside the training distribution of any model built primarily on experimental proxies. This is why technical validation and clinical validation remain separated by five to seven years, a gap that generative advances have not compressed. AlphaFold predicts protein structure with near-experimental accuracy. RFdiffusion designs proteins with no natural template. These are genuine inflection points in simulation fidelity at the molecular level. But a protein that folds as predicted still has to work in a patient, and that distance has not shortened.
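A toy illustration of the interpolation/extrapolation point, with entirely synthetic data (nothing biological is being modeled here; the dose-response function, the cubic fit, and the numbers are all invented to show the geometry of the failure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented "true" system: a saturating response the model never gets to see directly.
def true_response(dose):
    return dose / (1.0 + dose)

# Training data cover only low doses: the in-distribution regime.
train_dose = rng.uniform(0.0, 1.0, 200)
train_obs = true_response(train_dose) + rng.normal(0.0, 0.01, 200)

# Fit a cubic polynomial: a flexible interpolator with no concept of saturation.
coef = np.polyfit(train_dose, train_obs, deg=3)

for dose in [0.5, 5.0]:  # inside, then far outside, the training range
    print(f"dose={dose}: predicted {np.polyval(coef, dose):.2f}, actual {true_response(dose):.2f}")
# In range the fit is nearly exact; out of range the cubic keeps climbing and
# overshoots the saturated truth by more than an order of magnitude.
```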
III. The language we use is part of the problem
Running beneath all of this is a representational problem that I addressed in a Clinical and Translational Science paper and that the workshop made more urgent. The classification systems that organize cancer biology were not designed to predict therapeutic response. They were built to describe what a pathologist sees under a microscope, and they achieved institutional authority through historical convenience rather than mechanistic precision. The example I use most often is “non-small cell lung cancer,” which defines a disease by negation; it tells you where the tumor was found and what it failed to resemble histologically. It groups malignancies with distinct molecular drivers, immune microenvironments, and resistance mechanisms under a linguistic label that encodes none of that information.
When these inherited categories become training labels for foundation models, the models do not simply inherit their limitations; they amplify them. More parameters and more computing power make the model more confident in the wrong abstraction. The clinical consequence is measurable: overall response rates with doublet chemotherapy in morphologically defined non-small cell lung cancer hover around 30%, whereas in molecularly defined, tissue-agnostic contexts they routinely reach 50-90%. The classification system produces the wrong answer the majority of the time because it was never designed for the question being asked of it.
What a mechanistically grounded alternative looks like in practice is what I discussed through TTBI, the Total Tumor Burden Index we are developing at Project Data Sphere. It is autonomous, volumetric, longitudinal, and independent of inherited tumor classification categories. The aim is a biomarker that reflects what the disease is actually doing rather than a label derived from human sensory input.
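I will not reproduce the TTBI definition here, but the shape of the idea is easy to sketch. The snippet below is a hypothetical volumetric, longitudinal burden measure of my own construction, purely to illustrate the contrast with label-based classification; the lesion volumes, dates, and aggregation rule are all invented.

```python
from datetime import date

# Invented per-scan lesion volumes in mL; not TTBI itself.
scans = {
    date(2025, 1, 10): [4.2, 1.1, 0.6],        # baseline: three measurable lesions
    date(2025, 4, 14): [3.0, 0.9, 0.8, 0.3],   # a new lesion appears, others shrink
    date(2025, 7, 18): [2.1, 0.5, 0.9, 0.7],
}

def total_burden(lesion_volumes):
    """One simple volumetric aggregate: total tumor volume across all lesions."""
    return sum(lesion_volumes)

baseline = total_burden(scans[min(scans)])
for day, lesions in sorted(scans.items()):
    change = (total_burden(lesions) - baseline) / baseline
    print(f"{day}: {total_burden(lesions):.1f} mL ({change:+.0%} vs baseline)")
# A volume-based trajectory like this tracks what the disease is doing over time
# without reference to any inherited histological category.
```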
At the workshop, Drew Williamson presented PathChat, a multimodal language model built on whole-slide images and pathology case reports, and the evaluation findings aligned closely with my argument. The NLP metrics the field has inherited for assessing language models (METEOR, ROUGE, BLEU) measure textual similarity to reference answers. In pathology, that is not the same as diagnostic utility. A model can reproduce the vocabulary and syntax of a pathology report while conveying nothing actionable for a clinician, and standard benchmarks lack a mechanism to detect this difference. When his group convened seven experienced pathologists to evaluate open-ended model responses, what emerged was not a performance score but a structural disagreement: different pathologists had substantially different preferences for granularity, format, and the appropriate balance between completeness and precision. The ground truth the field has been optimizing against does not constitute a coherent standard, meaning every model trained on it has been learning to satisfy a consensus that does not actually exist. It is the evaluation version of the same problem I raised about classification: a methodology that accumulated institutional authority before its limitations were understood, now quietly shaping what the next generation of models learns to produce.
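A small demonstration of why surface-similarity metrics are the wrong instrument here, using NLTK's sentence-level BLEU on two report fragments I made up; the only difference between them is the one word that changes what happens to the patient.

```python
from nltk.translate.bleu_score import sentence_bleu

# Invented report fragments: identical boilerplate, one diagnosis-altering word.
reference = "sections show invasive ductal carcinoma with clear surgical margins".split()
candidate = "sections show invasive ductal carcinoma with involved surgical margins".split()

score = sentence_bleu([reference], candidate)
print(f"BLEU = {score:.2f}")
# The score lands in the mid-0.6s: high textual similarity, even though clear
# versus involved margins implies an entirely different next step in care.
```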
IV. What getting it right actually requires
The three paths forward I proposed in my talk each involve changing what models are trained to represent rather than how efficiently they represent it.
The first is multimodal longitudinal foundation models that learn biology rather than chemistry, integrating structure, function, and time so that a model can represent the relationship among TGF-beta signaling, CD8+ T-cell exclusion, and checkpoint inhibitor resistance as a coherent biological system rather than a set of independent variables. The biological context is not optional. It is the prediction target. Kun-Hsing Yu’s work from Harvard demonstrated what becomes accessible when models are built with this orientation: predicting HER2 mutation status in diffuse large B-cell lymphoma at 0.95 AUC from routine H&E slides, without molecular training data, by recovering the genomic signal that histological images contain but that conventional pathology was never designed to extract. That same capacity for hidden signal recovery, his group also showed, applies to demographic confounding: significant performance disparities across race, sex, and age in cancer diagnosis tasks were reduced by 88% across 15 independent cohorts through a contrastive learning approach that separates outcome-relevant signal from demographically correlated noise. The engineering is real. What it is compensating for is a data generation process that encodes the access disparities of a healthcare system with uneven representation, and that process will continue to produce the same confounding in the next round of training data unless something changes upstream of the model.
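I do not want to imply that the snippet below is the published method; it is only a generic sketch of how a contrastive objective can be pointed at this kind of confounding, with a toy loss of my own: embeddings that share an outcome label but come from different demographic groups are pulled together, and embeddings with different outcomes are pushed apart.

```python
import numpy as np

def confound_aware_contrastive_loss(z, outcome, group, temperature=0.1):
    """Toy objective (not the published implementation): reward similarity between
    same-outcome pairs drawn from different demographic groups, penalize
    similarity to different-outcome examples regardless of group."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize embeddings
    sim = (z @ z.T) / temperature                      # scaled cosine similarities
    total, counted = 0.0, 0
    for i in range(len(z)):
        pos = [j for j in range(len(z)) if j != i and outcome[j] == outcome[i] and group[j] != group[i]]
        neg = [j for j in range(len(z)) if outcome[j] != outcome[i]]
        if not pos or not neg:
            continue
        total += -np.log(np.exp(sim[i, pos]).sum() / np.exp(sim[i, pos + neg]).sum())
        counted += 1
    return total / max(counted, 1)

# Tiny synthetic batch: four embeddings, two outcome labels, two demographic groups.
rng = np.random.default_rng(1)
z = rng.normal(size=(4, 8))
print(confound_aware_contrastive_loss(z, outcome=[1, 1, 0, 0], group=["A", "B", "A", "B"]))
```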
The second path is a shift from target-centric to systems-level modeling, treating feedback loops and compensatory signaling as the biology to be represented rather than the complexity to be abstracted away. A molecular target is an incomplete abstraction. Intervention at any node propagates through feedback mechanisms, compensatory pathways, and cross-tissue signaling that single-target models cannot represent. Eric Schadt’s patient trajectory engine from Pathos is among the most architecturally promising attempts I've encountered to operationalize this: a large language model orchestrating frozen multimodal experts across DNA, H&E, RNA, and clinical text, with a biological knowledge graph imposing mechanistic validity constraints on the reinforcement-learning reward structure. The model is penalized not just for inaccurate predictions but for predictions that are statistically plausible and biologically incoherent. In an NSCLC case he walked through, the model integrated EGFR mutation status, immune microenvironment characterization, and p53 loss into a single patient-level resistance narrative, which is how oncologists actually reason about therapeutic failure, and which current classification frameworks are structurally unable to produce. Whether this architecture generalizes from compelling case demonstrations to reliable population-level predictions is what prospective trials are for. Schadt was explicit that the clinical trial is the real test, which is the right frame, and also the one that the first wave of AI in biomedicine systematically deferred.
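I do not know the internals of the Pathos reward function, so the following is only a sketch of the general shape of that constraint: a reward that subtracts a penalty whenever a stated mechanistic link is absent from a curated knowledge graph. The edges and weights are invented, drawn from relationships mentioned elsewhere in this piece.

```python
# Sketch of a mechanism-constrained reward; invented edges and weights, not the
# Pathos implementation. The knowledge graph is a set of allowed directed links.
KNOWLEDGE_GRAPH = {
    ("TGF_beta_signaling", "CD8_T_cell_exclusion"),
    ("CD8_T_cell_exclusion", "checkpoint_inhibitor_resistance"),
    ("MDM2_inhibition", "p53_stabilization"),
}

def reward(prediction_correct, asserted_links, accuracy_weight=1.0, incoherence_weight=0.5):
    """Accuracy term minus a penalty for each mechanistic claim the knowledge graph
    does not support, so a statistically plausible but biologically incoherent
    rationale still scores poorly."""
    unsupported = set(asserted_links) - KNOWLEDGE_GRAPH
    return accuracy_weight * float(prediction_correct) - incoherence_weight * len(unsupported)

# A correct prediction justified by an unsupported mechanistic link is still penalized.
print(reward(True, {("TGF_beta_signaling", "CD8_T_cell_exclusion")}))   # 1.0
print(reward(True, {("MDM2_inhibition", "CD8_T_cell_exclusion")}))      # 0.5
```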
The third path is reorienting the field’s objective toward failure prediction. The value of AI in drug development is not only accelerating winners. It is identifying losers before they consume years of capital and patient exposure. Failure modes such as toxicity, off-target cascades, and adaptive resistance are, in principle, predictable given the right training signal, but the field has organized its investment almost entirely around abundance narratives. The leverage in systematic failure prediction remains largely unclaimed. Richard Pazdur used to say during my time at the FDA that sins of omission belong to a higher circle of the Inferno than sins of commission. At current attrition rates, an early no is worth as much as a late yes.
V. What passing the test would actually look like
Among the day's presentations, Susan Galbraith's account of AstraZeneca's computational pathology work stood out for a specific reason: it had outcome data. The TROP2 biomarker she described was not a refinement of conventional IHC scoring but a replacement for it: a continuous, normalized ratio derived from subcellular membrane quantification that captures internalization rate as a direct mechanistic determinant of ADC efficacy, rather than a categorical approximation of staining intensity that has historically correlated with it. The result in TROPION-Lung01 was a hazard ratio of 0.52 in biomarker-positive patients and no detectable benefit in biomarker-negative patients. That degree of outcome stratification does not emerge from morphological proxies in such settings. It emerges from measuring what the biology is actually doing. The EGFR mutation prediction model she presented (0.905 AUC with 96% negative predictive value from histological slides) extends the same logic to patients whom tissue scarcity and access barriers currently exclude from molecular testing.
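The specifics of AstraZeneca's quantitative scoring pipeline are theirs, not mine; the sketch below only illustrates the kind of continuous, per-cell membrane ratio the paragraph describes, with invented measurements, field names, and cut-point.

```python
# Illustrative only: invented per-cell signal measurements, not AstraZeneca's pipeline.
cells = [
    {"membrane_signal": 820.0, "cytoplasm_signal": 310.0},
    {"membrane_signal": 150.0, "cytoplasm_signal": 940.0},
    {"membrane_signal": 400.0, "cytoplasm_signal": 420.0},
]

def membrane_ratio(cell):
    """Continuous per-cell descriptor: fraction of TROP2 signal on the membrane,
    rather than a categorical 0/1+/2+/3+ staining-intensity call."""
    total = cell["membrane_signal"] + cell["cytoplasm_signal"]
    return cell["membrane_signal"] / total if total else 0.0

patient_score = sum(membrane_ratio(c) for c in cells) / len(cells)
print(f"continuous patient-level score = {patient_score:.2f}")
# A trial would dichotomize a score like this at a prespecified cut-point to define
# biomarker-positive and biomarker-negative groups; the cut-point is the trial's, not mine.
```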
These results point toward what I framed as the Turing test for drug discovery: not whether AI can design a molecule indistinguishable from human work, but whether it can design an intervention whose effect in humans is indistinguishable from the effect we intended. The TROP2 work is closer to passing a version of that test.
Our simulations have gotten considerably better. But the distance between what our best models predict and what happens in patients is real, documented, and not closing as fast as the technical progress would suggest it should, because the technical progress is happening at isolated discrete levels and the problem is at the systems level, at the representational level, at the level of what we choose to measure and call ground truth.
Closing that distance requires not more compute applied to inherited frameworks but the reconstruction of those frameworks from biological first principles. That reconstruction is the work ahead. And the workshop yesterday was a useful reminder of how much of it remains ahead.