Building and Breaking Models of Life
In this TL;DR I'm reflecting on three different Substacks that circle like sharks around digital biology and models of life.
This essay has three diving boards; each approaches digital biology and models of life from a slightly different angle, and I highly recommend all three for a read.
I've been writing through AI hype cycles since 2015, and the longer I watch these sine waves of sentiment the more questions I have. Even now, in the thick of another crescendo, the claims made on behalf of the unpluggable algorithm give me reason to pause. And there's a very simple reason why: I find the digital world overly simplistic, reductionist and, well, binary. Digital is pixelated biology, and it's high time computer scientists and engineers learnt the mandatory humility that comes with doing the life sciences.
Yet I am entranced by digital biology, and I'm torn between the visions its technology leaders put forward and my own view of the technical limitations. This piece is an attempt to square that circle. Hopefully it lands as a combination of technological optimism and scientific skepticism.
The problem statement
The last ten years of AI have thrown up concerns that cross the quandaries of science, fiction and ethics. Strangely enough, many of these issues have been kicking around in biology's problem set for so long that life scientists are straight up habituated to them. Black box issues? You haven't met the human brain. Your biological Von Neumann machine is accidentally replicating in the wilderness?1 There's a biosafety form for that. Issues with labelling and data quality? Um, we were still employing professors who did nothing but yeast taxonomy in the 1970s.
Subcellular data has been around for just shy of a hundredth of the time writing has. Yet we've only enjoyed collecting standardised biological data with meaningful labels and sensible metadata for a couple of decades.
Life has been kicking tires for eons longer than either of these time scales, and when we look at biology we observe one thing and one thing only: complexity. Complexity is an overused word, yet it is a deeply underappreciated phenomenon from which emerge the biophysical and bioinformational forces that animate life. For example, if we compare biological universal constructors with AI, only one of these things requires a nuclear reactor to turn its training wheels; the other can do nearest neighbour computations at room temperature with sugar as an energy source.
Let’s drill down on the idea of large language models of life and the notion that these models will unify over time. As we work our way through this I want you to hold in mind a joke I first picked up from a lecture series on quantum biology from about twenty years ago.
The joke goes like this: what kind of computer would you use to perfectly simulate the universe? The answer: another universe. Only analogue computing can perfectly emulate analogue computing; all else is approximation.
Large language models of life
We approximate the world via our models, and as Patrick Hsu calls out, digital biology is now filled with task-specific examples of this. Need a protein? Bam. Want some RNA? We're all over that. Want to use token generation to write DNA that will boot up a wild-type organism and do what you intend it to do… maybe not.
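To make "token generation" concrete, here is a minimal sketch of one common preprocessing step: splitting a DNA string into overlapping k-mer tokens for an LLM-style model to consume. The function name and the k-mer scheme are my own illustrative choices; real genomic models each pick their own tokenisation strategies, some down to single-base resolution.

```python
# Minimal sketch: tokenising DNA into overlapping k-mers, the kind of
# preprocessing an LLM-style DNA model might use. The k-mer scheme is
# illustrative only; real models choose their own vocabularies.

def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mer tokens."""
    sequence = sequence.upper()
    assert set(sequence) <= set("ATCG"), "expects a plain A/T/C/G string"
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ATGCGTAC"))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```

Generating plausible tokens is the easy half; whether the resulting string boots up an organism is another matter entirely.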
Yet this is where the ARC Institute is going: creating model unity to enable DNA token generation based on prompting. This is an amazing moonshot. If you combine prompting with combinatorial high-throughput screening and self-driving laboratories, you have a soft-robotic AI powerful enough to instantiate xenobiological form into reality. That is something genuinely interesting. But is it feasible? And even if it is feasible, is it gaseous waste fermentation feasible, or commercial nuclear fusion feasible? Because one of these scenarios has a slightly longer lead time.
Biology is beautiful, but it is crushingly difficult to digitally compute.
Multiscalarity
For those who haven't read Zen and the Art of Motorcycle Maintenance, I highly recommend it as a diagonal way to learn multiscalar reasoning. In short, we've grown comfortable not knowing how things work at temporal or spatial scales irrelevant to our day-to-day, and that makes it difficult for us to maintain things when they break down. I don't know how to resolder my phone; I don't even know how to open it and tear it down. I am comfortable with this because I know someone else can. I also don't know how neurons work, but experience has taught me they're pretty reliable. Whether at the quantum scale or the planetary scale, we lack intrinsic understanding of all scales at all times, and this bottlenecks AI development for digital biology through a lack of training data.
I co-authored a paper back in 2021 and I regularly reference the key figure when talking about multiscalarity. I spent weeks working on this figure and I made two huge mistakes.2
[Figure: how information abstraction stacks across spatial scales in biotic and abiotic substrates]
The point I was trying to make with this figure was how information abstraction stacks across spatial scales in biotic and abiotic substrates. AI black box problems are infantile in comparison to resolving whether non-trivial quantum phenomena occur in the mammalian brain.
I have severe doubts that we can develop unified models of life without mechanisms for high-quality data collection and biosensing across all scales of biological instantiation. I have no confidence we've even turned the lights on when it comes to the inner workings of a standard cell, let alone cell-consortia signalling or complex organism dynamics. It has never been a more exciting time to be in biotech, but I hold grave doubts we will see reliable models of life until we have simple things like a standardised multiscalar taxonomy for biological mechanisms and function. We just don't know how things work yet. Basic training for using any LLM begins and ends with "don't use it for something you don't know how to do yourself", because you can't tell if it is making a mistake or generating some statistically rare false positive. Let's not build digital biology that is set up to make mistakes we can't even identify.
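To show what I mean by a standardised multiscalar taxonomy, here is a toy sketch of what a machine-readable entry might look like. Every scale name and field below is my own illustrative assumption; no such standard exists yet, which is rather the point.

```python
# Toy sketch of a machine-readable multiscalar taxonomy entry.
# The scales and fields are illustrative assumptions, not a standard.

from dataclasses import dataclass
from enum import Enum

class Scale(Enum):
    MOLECULAR = "molecular"
    ORGANELLE = "organelle"
    CELL = "cell"
    CONSORTIUM = "cell consortium"
    ORGANISM = "organism"

@dataclass
class MechanismRecord:
    name: str
    scale: Scale
    known_function: str    # what we currently think it does
    evidence_quality: str  # e.g. "replicated", "single study", "inferred"

record = MechanismRecord(
    name="CpG methylation",
    scale=Scale.MOLECULAR,
    known_function="gene regulation; associated with aging",
    evidence_quality="replicated",
)
print(record)
```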
Multidimensionality
If multiscalarity wasn't enough of a confounding factor, then there's also multidimensionality. DNA is the book of life, but it reads like a 'choose your own adventure' driven by an industrial robot casually rolling D20s every femtosecond. The combination of stochastic environmental dynamics and gene expression creates a multidimensional layering of information within, across and between DNA,3 and DNA on its own doesn't even give you life. Half the time DNA is just a floppy disk from the 90s, and the disk drive has instant coffee crusted through it.
Even if you can generate A, T, C and G in a meaningful string through LLM-style prompting, an unknown amount of information is stored in the three-dimensional structure. Preferential chromosomal touchpoints carry meaning. DNA not only carries multiple levels of abstracted information along its two-dimensional structure, it augments this with additional meaning embodied in its three-dimensional formations. If we follow the trail of emergent meaning, then it stands to reason DNA is harbouring four-dimensional information in conformational shape changes over time too. Methylation is already associated with aging. We just don't know the full extent of information stored in this dimension because we lack robust tools for measurement.
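As a trivial illustration of how shallow our read of this dimension currently is, here's a sketch that counts CpG dinucleotides, the canonical sites where methylation can occur. Counting candidate sites is a one-liner; measuring which of them are actually methylated, and decoding what that encodes, is the hard part.

```python
# Sketch: count CpG dinucleotides, the canonical candidate sites for
# DNA methylation. Finding candidates is trivial; reading the actual
# methylation state, and its meaning, is not.

def count_cpg_sites(sequence: str) -> int:
    sequence = sequence.upper()
    return sum(1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == "CG")

print(count_cpg_sites("ATCGCGTACG"))  # 3 candidate sites
```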
Moving on from DNA, intracellular dynamics are so jam-packed that being a protein is like standing in the middle of a mosh pit at a Rage Against the Machine concert. Everything is rubbing up against everything else, crowd-surfing viruses are being thrown out by the bouncers, mitochondrial DNA is having its own party in the green room, all the while non-trivial gravitational and magnetic forces mean the floor can change at any moment. Somehow proteins whir, ligands bind, DNA reads, electrons shuttle, and the world keeps spinning. It's a symphony of chaos that makes post-Starlink LEO seem like an empty sandpit of simplicity.
Each of these intracellular interactions adds dimensionality to the computational problem set. Add in extracellular signalling and resource sharing and you quickly balloon out to large infinities of combinatorial potential. AI is good at dealing with these high-dimensional topographies, but only if it has high-quality, well-tagged, meaningful training data. We're still freezing and inflating cells just to see what's going on at t = 1.4 We're not even past the insect-collecting, flower-pressing, fungi-classifying stage of discovery. The historical equivalent is the 1800s, when rich people wrote letters to each other about a weird mould they found while walking in the woods.
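To put rough numbers on those "large infinities", here's the kind of combinatorial arithmetic involved; the sequence lengths and the atoms-in-the-universe figure are illustrative order-of-magnitude assumptions.

```python
import math

# How fast sequence space explodes at even modest lengths.
dna_100mer = 4 ** 100            # possible DNA strings of length 100
protein_100mer = 20 ** 100       # possible proteins of length 100

print(f"DNA 100-mers:     ~10^{math.log10(dna_100mer):.0f}")
print(f"protein 100-mers: ~10^{math.log10(protein_100mer):.0f}")
# Against the commonly cited ~10^80 atoms in the observable universe:
print(f"proteins per atom: ~10^{math.log10(protein_100mer) - 80:.0f}")
```

And that is sequence space alone, before a single interaction, conformation or environmental variable is layered on top.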
Without data we can't copilot with AI, and a vast volume of data needs to be collected. I don't even know if this planet has the abiotic storage and energy sources required to replicate these data stores in hard media. There's a reason life maintains its data in wet, energy-efficient biomass.
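A back-of-envelope sketch of why I doubt the hard media: even the laziest possible encoding, two bits per base and nothing else, balloons quickly once you multiply it across a body's worth of cells. All figures below are rough, commonly cited order-of-magnitude assumptions.

```python
# Back-of-envelope: raw DNA sequence per human, stored naively.
human_genome_bp = 3.2e9    # base pairs per haploid human genome
bits_per_bp = 2            # A/T/C/G = 2 bits; ignores every structural
                           # and epigenetic dimension discussed above
cells_per_human = 3.7e13   # commonly cited estimate of cells in a body

genome_bytes = human_genome_bp * bits_per_bp / 8   # ~0.8 GB per cell
total_bytes = genome_bytes * cells_per_human       # ~3e22 bytes

print(f"one genome ~ {genome_bytes / 1e9:.1f} GB")
print(f"one body   ~ {total_bytes / 1e21:.0f} zettabytes of sequence alone")
```

Roughly thirty zettabytes for one person's raw sequence, a meaningful fraction of estimates of the entire global datasphere, before we record a single protein interaction or time point.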
The bigger issue is that collecting this data will be boring. Sure, the first new collection protocol or measurement system will get into Nature, Science or Cell, but after that it's just a churn of boring experiments replicating methodology to capture every conceivable dimension of data. Public research funding organisations don't exist to fund this work, scientists won't write a grant for it, and even bachelor-student interns can't scale it up except in a citizen-sciencey sort of way. Only companies that employ technical personnel stand a chance at maintaining the motivation to run the same experiment a million times. And the companies that do this will expect a commercial return on the emergent properties of their data collection.5
I am actually optimistic
Despite the above, I do subscribe to the idea that models of life are possible and a genuinely exciting pathway science is progressing down. But I have significant concerns that the current technological promises being bandied about are nuclear fusion levels of feasible. That doesn't mean we shouldn't be planning for their emergence or their impact, just that it's a multi-decade, trillion-dollar endeavour that will be hard-wired into great power technological rivalry.
Even in the last few weeks we've seen a publication come out using AI to predict 3D folds for DNA, and another ground-breaking model for enzyme function. But I can't help thinking we're still looking at DNA and the rest of the molecular world the same way we listen to whale songs. We know there is complex meaning embodied in these information substrates, but we haven't the foggiest what that meaning is. It really could be "so long, and thanks for all the fish" and we wouldn't know.
Technological blocks to look out for that, if overcome, would change this assessment:
A dramatic increase in high-throughput soft-robotic workflows in biofoundries that incorporate analogue bio-hybrid chips to churn through data collection with AI-enabled tagging of metadata. Watch commercial biofoundries like Ginkgo to track this space.
Experimental examples of models-of-life unity that incorporate two or three models in one protocol. An early warning of change.
New measurement tools for intracellular dynamics that are amenable to industrial scale-up via robot handling. A key enabling capability.
The TL;DR series offers diving boards for reflection on the hinge and anchor points of the bioeconomy.
1. I am taking a bit of liberty here in assuming knowledge about biological universal constructors and the interwoven history of thought across physics, computing and biology.
2. Hindsight is wonderful, isn't it? Where is the DNA and RNA? That's the question I immediately asked myself after this figure was published. Then I wondered why I'd focused so wholly on the visible spectrum, because it turns out infrared is a cool way to cage genes in optogenetics.
3. Horizontal gene transfer, viral invasion of DNA, reverse transcription, on and on and on; we just keep finding weird and wonderful ways that information gets in and out of the helix. Central dogma be damned.
4. I highly recommend reading this recent post on Fast Biology for a closer look.
5. Maybe a self-driving laboratory will be able to do this at speed and scale. An exciting area to watch for potential solves of this bottleneck.