When you think about what biology research will look like ten years from now, what do you see? AI-driven data analysis? AIs running and automating labs? Or what about AIs hypothesizing, designing, and executing experiments all together?
In a recent episode of the Translating Proteomics podcast, Nautilus co-founder Parag Mallick spoke with two AI visionaries, Vijay Pande and Matt McIlwain, about the potential of AI in biotech. Vijay is the founder of a16z Bio + Health, where he leads the firm’s investments in companies at the intersection of life science and data science. He was also the founding director of Folding at Home, a distributed computing project that simulates protein dynamics. Matt McIlwain is a managing director at Madrona Venture Group, where he invests in companies at the forefront of machine learning and cloud computing.

Image generated using Google Gemini. The image has some small flaws, but it works well enough as a thumbnail for this blog post. Errors of similar magnitude in the diagnosis of a disease could have serious consequences.
In the podcast, they dive into the complexities of biological data, how advances in computing power and algorithms drive the machine learning boom, and how these gains power the future of biological research.
Let’s unpack that further.
Why is biological data so complex?
Long gone are the days when scientists only analyzed proteins and genes one at a time. With the ability to analyze whole genomes, transcriptomes, and proteomes in multiple samples across multiple variables at once, scientists are generating a vast amount of data. It’s impossible to comb through this data by hand, so biologists are turning to AI algorithms and tools to aid their analysis. But it’s not just omics that benefit from AI. AI can also help analyze things like patient data or microscopy images.
Nonetheless, analyzing biological data with AI has its challenges. Biological data is complex and multifaceted, which makes it harder for AI to analyze than many other data types. Parag, Vijay, and Matt cited several reasons biological data is complex:
- Biological data is less predictable. Biology is unpredictable because so many variables influence any given outcome. Environmental influences and genetic variation between individuals can make patterns in biological data harder to decipher.
- The bar for “good enough” is high. We all know that AI tools like ChatGPT hallucinate from time to time. This is somewhat acceptable, though annoying, if you’re using AI to create art and it gives you an image that isn’t quite right. In healthcare applications, on the other hand, even small errors can have serious consequences. A misdiagnosis is a much bigger deal than an image of a hand with six fingers, for example. Parag says that the challenge for many of these tools is that they “fail incredibly confidently,” and in biology it’s better to have a tool that can tell you it doesn’t know the answer than one that confidently gives you the wrong one.
- Biologists often work on multiple scales at once. They may study events such as protein phosphorylation or metabolite consumption that are only observable at microsecond timescales. At the same time, they may try to connect these small-scale events with processes that unfold over much longer timescales, such as cartilage deterioration following an injury. There are many gaps in our basic understanding of the connections across these timescales.
A great example of the complexity of biological data lies in proteomics. Proteomic data can encompass many variables like tissue types, timepoints, and drug conditions. Because efforts in proteomics often aim to capture the entire proteome, proteomic data analysis requires tools that can analyze thousands of proteins across all of these variables. If proteomic data is layered with other data types like genome sequencing or clinical data, it becomes even more complex.
How AI tackles the growth of biological data
The proteomics example above demonstrates the need to find ways to more easily analyze large biological data sets. Omics data are becoming higher resolution, less costly, and more ubiquitous within the life sciences. “With data abundance, you need algorithmic approaches to get to the needle in the haystack,” says Matt. Thanks to increasing computing power and new AI algorithms that can handle large amounts of data, scientists can analyze complicated data more easily. Broad descriptions of the forms this data can take and how AI is trained with them are listed below.
Structured vs unstructured data
Biological data can be structured or unstructured. Structured data, which includes things like dates and read counts, has discrete values associated with it (e.g., numbers, yes/no answers) and is easily organized and searched in a database. In contrast, unstructured data, which includes qualitative data like medical records, images, and videos, contains a lot of information that is harder to analyze because it doesn’t come as discrete units or values from the get-go. In the past, scientists believed structured data was necessary for machine learning, but the field has shifted toward unstructured data, with AI models learning features and patterns that impose structure on it. This opens a range of possibilities for machine learning-based inquiry.
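As a toy illustration of the difference (the data below is invented, and a simple TF-IDF featurizer stands in for the far richer representations modern models learn), structured data slots directly into a table, while unstructured text has to be converted into numeric features before algorithms can work with it:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Structured data: discrete, searchable values organized into fields.
structured = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "read_count": [1520, 980, 2310],
    "disease": ["yes", "no", "yes"],
})

# Unstructured data: free-text clinical notes with no predefined fields.
notes = [
    "Patient reports joint pain; elevated CRP noted on follow-up.",
    "No symptoms reported; routine screening, labs within normal range.",
    "Progressive fatigue and weight loss over three months.",
]

# A simple featurizer imposes structure: each note becomes a numeric vector
# that downstream algorithms can compare and search.
note_features = TfidfVectorizer().fit_transform(notes)
print(structured)
print("note feature matrix:", note_features.shape)  # (3 notes, vocabulary size)
```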
Types of learning
Various kinds of learning models can train AI systems using structured and unstructured data. These include:
- Supervised learning: Supervised learning uses labeled and annotated data, where data points have associated outputs or answers. This labeled data is used to train an AI model, which finds patterns and relationships between the inputs and outputs. Example: Researchers have the expression levels of a hundred proteins for both healthy individuals and people with a specific disease. Supervised learning can use these labeled data to train a model to classify new samples as healthy or diseased. (A minimal code sketch of this example follows this list.)
- Unsupervised learning: In contrast to supervised learning, unsupervised learning uses data without known groupings or outcome associations and finds patterns in it. Unsupervised learning algorithms then put the data into groups based on the identified patterns; these groups have no pre-identified parameters. Analyses of this type include clustering and principal component analysis, which are often used to group cells with similar expression patterns or to identify cell subtypes within a population.
Example: Researchers have proteomic data from cancer patients. This data includes no information about disease or treatment outcomes and is considered unlabeled with respect to these characteristics. AI uses unsupervised learning to cluster patients based on similarities in their proteomic profiles. Afterwards, researchers can compare these new clusters to existing breast cancer subtypes and look for associations with clinical characteristics such as aggressiveness or response to treatment. These associations may help clinicians more accurately assign new patients to cancer subtypes, which may lead to more effective treatment decisions. In the real world, this kind of work has been done to identify new breast cancer subtypes based on proteomics data from breast cancer tissue samples. (A clustering sketch also appears after this list.)
- Self-supervised learning: A subset of unsupervised learning, self-supervised learning uses unlabeled data and labels the data by finding latent or buried patterns in it. Later these model-derived labels are used to accomplish a particular task. If the labels are not sufficient to accomplish the task, the self-supervised learning model can look for additional patterns in the data, further label the data, and attempt to accomplish the task again. This process may be iterated until the model achieves a set level of completion for the task. Self-supervised learning splits tasks into pretext tasks, which are used to learn meaningful features from unlabeled data, and downstream tasks, which use these learned features to solve specific problems.
Example: Cytoself is a protein localization profiling tool. It uses images of endogenously tagged proteins from the OpenCell database as unlabeled data. As pretext tasks, it masks out parts of the images and reconstructs the hidden regions, and it identifies proteins based on their localization. These tasks help Cytoself learn which features of the images are relevant. They also allow Cytoself to perform a variety of downstream activities, such as creating a protein localization atlas and predicting the subcellular localization of new proteins.
- Semi-supervised learning: Semi-supervised learning uses both labeled and unlabeled data to train AI models. It uses a small amount of labeled data to infer labels for the rest of the unlabeled data.
Example: Researchers have proteomics data from a newly characterized bacterium. A small fraction of the organism’s proteins have empirically verified functions; the rest have unknown functions. Semi-supervised learning initially trains on the mass spectrometry profiles of the proteins with known functions (the labeled data). The model then predicts the functions of proteins in the unlabeled data. Predictions that are likely correct are added to the training data, and the model is retrained on the expanded dataset. However, this approach might not identify completely new protein functions. (A minimal semi-supervised sketch also appears after this list.)
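To make the supervised learning example above concrete, here is a minimal sketch using scikit-learn and synthetic data standing in for protein expression levels (the data, the choice of logistic regression, and the variable names are illustrative assumptions, not a specific published workflow):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for expression levels of 100 proteins across 200 samples.
X = rng.normal(size=(200, 100))
y = np.array([0] * 100 + [1] * 100)   # labels: 0 = healthy, 1 = diseased
X[y == 1, :5] += 1.0                  # a handful of proteins shift with disease

# Train on labeled samples, then classify held-out "new" samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print("prediction for one new sample:", model.predict(X_test[:1]))
```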
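The unsupervised example maps onto a clustering workflow. The sketch below (again with synthetic, illustrative data) reduces proteomic profiles with PCA and groups patients with k-means; comparing the resulting clusters to known subtypes or clinical variables is the separate, downstream step done by the researcher:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic proteomic profiles from 150 patients; no outcome labels attached.
profiles = np.vstack([
    rng.normal(loc=0.0, size=(50, 100)),
    rng.normal(loc=1.5, size=(50, 100)),
    rng.normal(loc=-1.5, size=(50, 100)),
])

# Reduce dimensionality, then group patients by similarity of their profiles.
reduced = PCA(n_components=10).fit_transform(profiles)
clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(reduced)

# The clusters carry no clinical meaning on their own; the follow-up step is
# to compare them against known subtypes or treatment outcomes.
print("patients per cluster:", np.bincount(clusters))
```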
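And for the semi-supervised example, scikit-learn’s SelfTrainingClassifier implements the retrain-on-confident-predictions loop described above. The data here is synthetic, and the two-class “function” labels are a stand-in for empirically verified protein functions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(2)

# Synthetic mass-spec-style features for 300 proteins. Only 30 proteins have
# verified functions (two classes); the rest are marked unlabeled with -1.
X = rng.normal(size=(300, 40))
true_class = (X[:, 0] + X[:, 1] > 0).astype(int)
y = np.full(300, -1)
labeled_idx = rng.choice(300, size=30, replace=False)
y[labeled_idx] = true_class[labeled_idx]

# Self-training: fit on the labeled proteins, then fold confident predictions
# for unlabeled proteins back into the training set and refit.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y)
print("proteins labeled during self-training:", int((model.transduction_ != -1).sum()))
```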
AI agents redefine the future of biology research
Beyond the advances in computing power and algorithms that have propelled the machine learning boom, Matt thinks that the value of AI will be captured in the application and agentic layers, rather than the model layer. In his view, it is the user interfaces and AI agents that make AI more accessible and capable of solving tangible problems for people. But what are AI agents exactly and what can they do?
AI agents are autonomous programs that interact with the real world, collect data, and use that data to perform tasks that meet preset goals. Examples include self-driving cars and customer service agents that go beyond simple chatbots. In the scientific realm, AI agents help scientists design and execute experiments in addition to analyzing data.
Parag outlines how these agents could facilitate scientific experiments in his 2024 Gilbert S. Omenn Computational Proteomics Award lecture: a scientist could share their hypothesis with an AI agent, which then designs an experiment, instructs an autonomous cloud lab to execute the experiment, and delivers results. AI agents could iterate on this process hundreds or thousands of times all while documenting everything that happens in the workflow.
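As a purely illustrative skeleton (the function names, the stubbed-out design/execute/analyze steps, and the stopping rule below are hypothetical, not a description of any real cloud-lab API or of Parag’s system), the loop might look something like this, with every iteration logged for provenance:

```python
import random

# Hypothetical skeleton of a hypothesis-driven agent loop. The design,
# execution, and analysis steps are stand-in stubs, not a real lab API.

def design_experiment(hypothesis, history):
    return {"hypothesis": hypothesis, "protocol": f"assay variant {len(history)}"}

def execute_in_cloud_lab(design):
    return {"signal": random.random()}    # stand-in for real instrument data

def analyze(raw_data):
    return {"supports_hypothesis": raw_data["signal"] > 0.9}

def run_agent(hypothesis, max_iterations=100):
    log = []                              # provenance: every step is recorded
    for i in range(max_iterations):
        design = design_experiment(hypothesis, log)
        raw_data = execute_in_cloud_lab(design)
        result = analyze(raw_data)
        log.append({"iteration": i, "design": design, "result": result})
        if result["supports_hypothesis"]:  # stop once the evidence is in
            break
    return log

print(len(run_agent("Protein X drives phenotype Y")), "iterations logged")
```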
When used this way, AI agents could enable experiments to happen continuously and eliminate gaps in scientific progress and reproducibility that arise when a researcher leaves a lab and the work changes hands.
In multiomics research, this documentation is particularly important for reproducibility because agents may capture the exact tools, parameters, and raw data used more completely than a methods section does. As Parag has demonstrated in collaboration with Yolanda Gil’s lab at the University of Southern California, it’s very difficult to reproduce multiomic experiments based on a paper’s methods section alone. Their team tried to rebuild a multiomic workflow from the methods section of a published proteogenomics paper. After many iterations, they got 90% of the data to match what was originally reported. The good news is that they were able to reproduce the cancer subtypes from the original paper. The bad news is that it took a lot of work to get there.
“In the end, our work revealed that method sections really don’t capture the fullness of the analyses,” Parag said in his lecture. “To reproduce a study accurately, we need not just the raw data files, but all the auxiliary files, the different tools that were used in the analysis, the sets of parameters selected, and versions for everything.” AI agents can help here.
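One lightweight way to capture that information is to write a machine-readable provenance record alongside every analysis. The fields below are an illustrative guess at what such a record might contain, not a description of the actual system Parag and Gil’s team used:

```python
import json
import sys
from datetime import datetime, timezone

# Illustrative provenance record. Any real workflow system will differ, but
# raw inputs, auxiliary files, tool versions, and parameters are exactly the
# pieces the lecture says a methods section fails to capture.
provenance = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "raw_data_files": ["proteome_run01.raw", "rnaseq_counts.tsv"],  # invented names
    "auxiliary_files": ["sample_annotations.csv"],                  # invented name
    "tools": {"python": sys.version.split()[0], "quant_tool": "example-quant 2.3.1"},
    "parameters": {"fdr_threshold": 0.01, "normalization": "median"},
}

with open("analysis_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```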
Scientists are already using AI agents today. A recent preprint documents the design and validation of SARS-CoV-2 nanobodies using a virtual lab with a team of AI agents each with their own expertise. These agents conducted research and met in a series of meetings (each lasting 5-10 minutes) to come up with the design of 92 nanobodies.
While AI agent workflows might seem to take the scientist out of the research process, some researchers argue that these agents simply handle the analyses and repetitive tasks so humans don’t have to. This lets researchers reach speeds and scales that wouldn’t otherwise be possible while freeing up time for more high-level, creative work. When it comes to building new companies, a small team working with AI may be able to accomplish what once required a much larger one. “[AI] is going to really change the social dynamics of how we work, how we do things, and how we build startups,” Vijay says.
Rather than being limited by computing power, headcount, or data complexity, scientists will be able to use AI to dramatically expand the range of scientific questions they can answer, and to answer them more quickly. Vijay adds, “how we do our work is going to be revolutionized.”
AI and biology resources
- 2024 Gilbert S. Omenn Computational Proteomics Award lecture: Parag’s lecture about reproducibility of computational workflows and AI agents
- AI in Proteomics Data Analysis: Revolutionizing Protein Research: An overview of how AI is used in proteomics and tools for AI-based biomarker prediction
- Empowering biomedical discovery with AI agents: A perspective in Cell on the use of AI agents in biomedical research
- Pan-cancer proteogenomics expands the landscape of therapeutic targets: Proteogenomics publication that was the basis of Parag’s reproducibility project
- Self-supervised deep learning encodes high-resolution features of protein subcellular localization: Cytoself publication in Nature Methods
- The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation: bioRxiv preprint on using AI agents to design SARS-CoV-2 nanobodies