2026 Volume 3
REVIEW   Open Access    

From foundation models to autonomous agents in biology

  • [1] Youngblut ND, Carpenter C, Nayebnazar A, Adduri A, Shah R, et al. 2025. scBaseCount: an AI agent-curated, uniformly processed, and continually expanding single cell data repository. bioRxiv 640494 doi: 10.1101/2025.02.27.640494

    [2] Ruan W, Lyu Y, Zhang J, Cai J, Shu P, et al. 2025. Large language models for bioinformatics. arXiv 2501.06271v1 doi: 10.48550/arXiv.2501.06271

    [3] Gao S, Fang A, Huang Y, Giunchiglia V, Noori A, et al. 2024. Empowering biomedical discovery with AI agents. Cell 187(22):6125−6151 doi: 10.1016/j.cell.2024.09.022

    [4] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. 2023. Attention is all you need. arXiv 1706.03762v7 doi: 10.48550/arXiv.1706.03762

    [5] Ji Y, Zhou Z, Liu H, Davuluri RV. 2021. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37(15):2112−2120 doi: 10.1093/bioinformatics/btab083

    [6] Zhou Z, Ji Y, Li W, Dutta P, Davuluri R, et al. 2024. DNABERT-2: efficient foundation model and benchmark for multi-species genome. arXiv 2306.15006v2 doi: 10.48550/arXiv.2306.15006

    [7] Zhou Z, Wu W, Ho H, Wang J, Shi L, et al. 2024. DNABERT-S: pioneering species differentiation with species-aware DNA embeddings. arXiv 2402.08777v3 doi: 10.48550/arXiv.2402.08777

    [8] Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Lopez Carranza N, Grzywaczewski AH, et al. 2025. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods 22(2):287−297 doi: 10.1038/s41592-024-02523-z

    [9] Boshar S, Evans B, Tang Z, Picard A, Adel Y, et al. 2025. A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction. bioRxiv 695963 doi: 10.64898/2025.12.22.695963

    [10] Nguyen E, Poli M, Faizi M, Thomas A, Birch-Sykes C, et al. 2023. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. arXiv 2306.15794v2 doi: 10.48550/arXiv.2306.15794

    [11] Fishman V, Kuratov Y, Shmelev A, Petrov M, Penzar D, et al. 2025. GENA-LM: a family of open-source foundational DNA language models for long sequences. Nucleic Acids Research 53(2):gkae1310 doi: 10.1093/nar/gkae1310

    [12] Brixi G, Durrant MG, Ku J, Poli M, Brockman G, et al. 2025. Genome modeling and design across all domains of life with Evo 2. bioRxiv 638918 doi: 10.1101/2025.02.18.638918

    [13] Nguyen E, Poli M, Durrant MG, Kang B, Katrekar D, et al. 2024. Sequence modeling and design from molecular to genome scale with Evo. Science 386:eado9336 doi: 10.1126/science.ado9336

    [14] Wu W, Zhou Z, Riley R, Abdulqader M, Song X, et al. 2025. Uncovering the genomic manifold via scalable learning from the global microbiome. bioRxiv 635558 doi: 10.1101/2025.01.30.635558

    [15] Avsec Ž, Latysheva N, Cheng J, Novati G, Taylor KR, et al. 2025. AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model. bioRxiv 661532 doi: 10.1101/2025.06.25.661532

    [16] Penić RJ, Vlašić T, Huber RG, Wan Y, Šikić M. 2025. RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks. Nature Communications 16:5671 doi: 10.1038/s41467-025-60872-5

    [17] Hayes T, Rao R, Akin H, Sofroniew NJ, Oktay D, et al. 2025. Simulating 500 million years of evolution with a language model. Science 387(6736):850−858 doi: 10.1126/science.ads0018

    [18] Chen B, Cheng X, Li P, Geng YA, Gong J, et al. 2024. xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein. arXiv 2401.06199v2 doi: 10.48550/arXiv.2401.06199

    [19] Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, et al. 2020. Improved protein structure prediction using potentials from deep learning. Nature 577(7792):706−710 doi: 10.1038/s41586-019-1923-7

    [20] Agarwal V, McShan AC. 2024. The power and pitfalls of AlphaFold2 for structure prediction beyond rigid globular proteins. Nature Chemical Biology 20(8):950−959 doi: 10.1038/s41589-024-01638-w

    [21] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596(7873):583−589 doi: 10.1038/s41586-021-03819-2

    [22] Abramson J, Adler J, Dunger J, Evans R, Green T, et al. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630(8016):493−500 doi: 10.1038/s41586-024-07487-w

    [23] Lewis S, Hempel T, Jiménez-Luna J, Gastegger M, Xie Y, et al. 2025. Scalable emulation of protein equilibrium ensembles with generative deep learning. Science 389(6761):eadv9817 doi: 10.1126/science.adv9817

    [24] Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, Madani A. 2023. ProGen2: exploring the boundaries of protein language models. Cell Systems 14(11):968−978.e3 doi: 10.1016/j.cels.2023.10.002

    [25] Yang J, Bhatnagar A, Ruffolo JA, Madani A. 2024. Function-guided conditional generation using protein language models with adapters. arXiv 2410.03634v2 doi: 10.48550/arXiv.2410.03634

    [26] Garau-Luis JJ, Bordes P, Gonzalez L, Roller M, de Almeida BP, et al. 2024. Multi-modal transfer learning between biological foundation models. arXiv 2406.14150v1 doi: 10.48550/arXiv.2406.14150

    [27] de Almeida BP, Richard G, Dalla-Torre H, Blum C, Hexemer L, et al. 2025. A multimodal conversational agent for DNA, RNA and protein tasks. Nature Machine Intelligence 7(6):928−941 doi: 10.1038/s42256-025-01047-1

    [28] Liu T, Xiao Y, Luo X, Xu H, Zheng WJ, et al. 2024. Geneverse: a collection of open-source multimodal large language models for genomic and proteomic research. arXiv 2406.15534v1 doi: 10.48550/arXiv.2406.15534

    [29] St John P, Lin D, Binder P, Greaves M, Shah V, et al. 2024. BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery. arXiv 2411.10548v5 doi: 10.48550/arXiv.2411.10548

    [30] Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, et al. 2023. Transfer learning enables predictions in network biology. Nature 618(7965):616−624 doi: 10.1038/s41586-023-06139-9

    [31] Chen H, Venkatesh MS, Ortega JG, Mahesh SV, Nandi TN, et al. 2024. Quantized multi-task learning for context-specific representations of gene network dynamics. bioRxiv 608180 doi: 10.1101/2024.08.16.608180

    [32] Cui H, Wang C, Maan H, Pang K, Luo F, et al. 2024. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods 21(8):1470−1480 doi: 10.1038/s41592-024-02201-0

    [33] Wang C, Cui H, Zhang A, Xie R, Goodarzi H, et al. 2025. scGPT-spatial: continual pretraining of single-cell foundation model for spatial transcriptomics. bioRxiv 636714 doi: 10.1101/2025.02.05.636714

    [34] Zeng Y, Xie J, Shangguan N, Wei Z, Li W, et al. 2025. CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells. Nature Communications 16:4679 doi: 10.1038/s41467-025-59926-5

    [35] Hao M, Gong J, Zeng X, Liu C, Guo Y, et al. 2024. Large-scale foundation model on single-cell transcriptomics. Nature Methods 21(8):1481−1491 doi: 10.1038/s41592-024-02305-7

    [36] Cao S, Yang K, Cheng J, Li J, Shen HB, et al. 2024. stFormer: a foundation model for spatial transcriptomics. bioRxiv 615337 doi: 10.1101/2024.09.27.615337

    [37] Schaar AC, Tejada-Lapuerta A, Palla G, Gutgesell R, Halle L, et al. 2024. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv 589472 doi: 10.1101/2024.04.15.589472

    [38] Levine D, Rizvi SA, Lévy S, Pallikkavaliyaveetil N, Zhang D, et al. 2024. Cell2Sentence: teaching large language models the language of biology. bioRxiv 557287 doi: 10.1101/2023.09.11.557287

    [39] Rizvi SA, Levine D, Patel A, Zhang S, Wang E, et al. 2025. Scaling large language models for next-generation single-cell analysis. bioRxiv 648850 doi: 10.1101/2025.04.14.648850

    [40] Su Z, Fang M, Smolnikov A, Dinger ME, Oates EC, et al. 2025. GeneRAIN: multifaceted representation of genes via deep learning of gene expression networks. Genome Biology 26(1):288 doi: 10.1186/s13059-025-03749-6

    [41] Ouyang Z, Li J. 2026. Scouter predicts transcriptional responses to genetic perturbations with large language model embeddings. Nature Computational Science 6(1):21−28 doi: 10.1038/s43588-025-00912-8

    [42] Luo E, Hao M, Wei L, Zhang X. 2024. scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics 40(9):btae518 doi: 10.1093/bioinformatics/btae518

    [43] Luo E, Wei L, Hao M, Zhang X, Liu Q. 2025. Multi-modal diffusion model with dual-cross-attention for multi-omics data generation and translation. bioRxiv 640020 doi: 10.1101/2025.02.27.640020

    [44] Cornejo-Páramo P, Zhang X, Louis L, Li Z, Yang Y, et al. 2025. Motif-based models accurately predict cell type-specific distal regulatory elements. Nature Communications 16:10370 doi: 10.1038/s41467-025-65362-2

    [45] Chen W, Zhang P, Tran TN, Xiao Y, Li S, et al. 2025. A visual–omics foundation model to bridge histopathology with spatial transcriptomics. Nature Methods 22(7):1568−1582 doi: 10.1038/s41592-025-02707-1

    [46] Ding T, Wagner SJ, Song AH, Chen RJ, Lu MY, et al. 2025. A multimodal whole-slide foundation model for pathology. Nature Medicine 31(11):3749−3761 doi: 10.1038/s41591-025-03982-3

    [47] Kong Z, Qiu M, Boesen J, Lin X, Yun S, et al. 2025. SPATIA: multimodal model for prediction and generation of spatial cell phenotypes. arXiv 2507.04704v2 doi: 10.48550/arXiv.2507.04704

    [48] Qian L, Dong Z, Guo T. 2025. Grow AI virtual cells: three data pillars and closed-loop learning. Cell Research 35(5):319−321 doi: 10.1038/s41422-025-01101-y

    [49] Bunne C, Roohani Y, Rosen Y, Gupta A, Zhang X, et al. 2024. How to build the virtual cell with artificial intelligence: priorities and opportunities. Cell 187(25):7045−7063 doi: 10.1016/j.cell.2024.11.015

    [50] Noutahi E, Hartford J, Tossou P, Whitfield S, Denton AK, et al. 2025. Virtual cells: predict, explain, discover. arXiv 2505.14613v3 doi: 10.48550/arXiv.2505.14613

    [51] Wei Z, Ma R, Wang Z, Li Z, Song S, et al. 2025. VCWorld: a biological world model for virtual cell simulation. arXiv 2512.00306v2 doi: 10.48550/arXiv.2512.00306

    [52] Johnson JAI, Bergman DR, Rocha HL, Zhou DL, Cramer E, et al. 2025. Human interpretable grammar encodes multicellular systems biology models to democratize virtual cell laboratories. Cell 188(17):4711−4733.e37 doi: 10.1016/j.cell.2025.06.048

    [53] Chen Z, Tian S, Pei J, Gu R, Li Y, et al. 2025. UniCure: a foundation model for predicting personalized cancer therapy response. bioRxiv 658531 doi: 10.1101/2025.06.14.658531

    [54] Adduri AK, Gautam D, Bevilacqua B, Imran A, Shah R, et al. 2025. Predicting cellular responses to perturbation across diverse contexts with State. bioRxiv 661135 doi: 10.1101/2025.06.26.661135

    [55] Zhang J, Ubas AA, de Borja R, Svensson V, Thomas N, et al. 2025. Tahoe-100M: a giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. bioRxiv 639398 doi: 10.1101/2025.02.20.639398

    [56] Ji Y, Tejada-Lapuerta A, Schmacke NA, Zheng Z, Zhang X, et al. 2025. Scalable and universal prediction of cellular phenotypes enables in silico experiments. bioRxiv 607533 doi: 10.1101/2024.08.12.607533

    [57] Xu J, Yang X, Li Y, Wang H, Li Y, et al. 2025. ODFormer: a virtual organoid for predicting personalized therapeutic responses in pancreatic cancer. bioRxiv 663664 doi: 10.1101/2025.07.08.663664

    [58] Peidli S, Green TD, Shen C, Gross T, Min J, et al. 2024. scPerturb: harmonized single-cell perturbation data. Nature Methods 21(3):531−540 doi: 10.1038/s41592-023-02144-y

    [59] Chandrasekaran SN, Cimini BA, Goodale A, Miller L, Kost-Alimova M, et al. 2024. Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations. Nature Methods 21(6):1114−1121 doi: 10.1038/s41592-024-02241-6

    [60] Kraus O, Comitani F, Urbanik J, Kenyon-Dean K, Arumugam L, et al. 2025. RxRx3-core: benchmarking drug-target interactions in high-content microscopy. arXiv 2503.20158v2 doi: 10.48550/arXiv.2503.20158

    [61] Huang AC, Hsieh THS, Zhu J, Michuda J, Teng A, et al. 2025. X-Atlas/Orion: genome-wide perturb-seq datasets via a scalable fix-cryopreserve platform for training dose-dependent biological foundation models. bioRxiv 659105 doi: 10.1101/2025.06.11.659105

    [62] Wu Y, Wershof E, Schmon SM, Nassar M, Osiński B, et al. 2025. PerturBench: benchmarking machine learning models for cellular perturbation analysis. arXiv 2408.10609v4 doi: 10.48550/arXiv.2408.10609

    [63] Li C, Ziyadeh E, Sharma Y, Dumoulin B, Levinsohn J, et al. 2025. Nephrobase cell+: multimodal single-cell foundation model for decoding kidney biology. arXiv 2509.26223v1 doi: 10.48550/arXiv.2509.26223

    [64] Liu L, Li W, Wang F, Li Y, Huang LK, et al. 2025. A pre-trained large generative model for translating single-cell transcriptomes to proteomes. Nature Biomedical Engineering 1−20 doi: 10.1038/s41551-025-01528-z

    [65] Kedzierska KZ, Crawford L, Amini AP, Lu AX. 2025. Zero-shot evaluation reveals limitations of single-cell foundation models. Genome Biology 26(1):101 doi: 10.1186/s13059-025-03574-x

    [66] DenAdel A, Hughes M, Thoutam A, Gupta A, Navia AW, et al. 2025. Evaluating the role of pre-training dataset size and diversity on single-cell foundation model performance. bioRxiv 628448 doi: 10.1101/2024.12.13.628448

    [67] Wang Q, Pan Y, Zhou M, Tang Z, Wang Y, et al. 2025. scDrugMap: benchmarking large foundation models for drug response prediction. arXiv 2505.05612v1 doi: 10.48550/arXiv.2505.05612

    [68] Zhang F, Liu T, Zhu Z, Wu H, Wang H, et al. 2025. CellVerse: do large language models really understand cell biology? arXiv 2505.07865v1 doi: 10.48550/arXiv.2505.07865

    [69] Xiao Y, Liu J, Zheng Y, Jiao S, Hao J, et al. 2025. CellAgent: LLM-driven multi-agent framework for natural language-based single-cell analysis. bioRxiv 593861 doi: 10.1101/2024.05.13.593861

    [70] Wang H, He Y, Coelho PP, Bucci M, Nazir A, et al. 2025. SpatialAgent: an autonomous AI agent for spatial biology. bioRxiv 646459 doi: 10.1101/2025.04.03.646459

    [71] Alber S, Chen B, Sun E, Isakova A, Wilk AJ, et al. 2025. CellVoyager: AI compbio agent generates new insights by autonomously analyzing biological data. bioRxiv 657517 doi: 10.1101/2025.06.03.657517

    [72] Schaefer M, Peneder P, Malzl D, Lombardo SD, Peycheva M, et al. 2025. Multimodal learning enables chat-based exploration of single-cell data. Nature Biotechnology 1−11 doi: 10.1038/s41587-025-02857-9

    [73] Huang S, Šabanović B, Peng Y, Zheng Q, Alessandri L, et al. 2026. GPTBioInsightor – leveraging large language models for transparent scRNA-seq cell type annotations. Bioinformatics Advances 6:vbag025 doi: 10.1093/bioadv/vbag025

    [74] Xie E, Cheng L, Shireman J, Cai Y, Liu J, et al. 2026. CASSIA: a multi-agent large language model for automated and interpretable cell annotation. Nature Communications 17:389 doi: 10.1038/s41467-025-67084-x

    [75] Liu W, Li J, Tang Y, Zhao Y, Liu C, et al. 2025. DrBioRight 2.0: an LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis. Nature Communications 16:2256 doi: 10.1038/s41467-025-57430-4

    [76] Zhou J, Zhang B, Li G, Chen X, Li H, et al. 2024. An AI agent for fully automated multi-omic analyses. Advanced Science 11:2407094 doi: 10.1002/advs.202407094

    [77] Mehandru N, Hall AK, Melnichenko O, Dubinina Y, Tsirulnikov D, et al. 2025. BioAgents: bridging the gap in bioinformatics analysis with multi-agent systems. Scientific Reports 15:39036 doi: 10.1038/s41598-025-25919-z

    [78] Hong G, Banos DT. 2025. Nano bio-agents (NBA): small language model agents for genomics. arXiv 2509.19566v1 doi: 10.48550/arXiv.2509.19566

    [79] Roohani Y, Lee A, Huang Q, Vora J, Steinhart Z, et al. 2025. BioDiscoveryAgent: an AI agent for designing genetic perturbation experiments. arXiv 2405.17631v3 doi: 10.48550/arXiv.2405.17631

    [80] Xu Q, Soto C, Shahnawaz M, Liu X, Jiang X, et al. 2025. Multi agent large language models for biomedical hypothesis generation in drug combination discovery. iScience 28(12):113984 doi: 10.1016/j.isci.2025.113984

    [81] Qu Y, Huang K, Yin M, Zhan K, Liu D, et al. 2026. CRISPR-GPT for agentic automation of gene-editing experiments. Nature Biomedical Engineering 10(2):245−258 doi: 10.1038/s41551-025-01463-z

    [82] Ghafarollahi A, Buehler MJ. 2024. ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. arXiv 2402.04268v1 doi: 10.48550/arXiv.2402.04268

    [83] Liu S, Lu Y, Chen S, Hu X, Zhao J, et al. 2025. DrugAgent: automating AI-aided drug discovery programming through LLM multi-agent collaboration. arXiv 2411.15692v2 doi: 10.48550/arXiv.2411.15692

    [84] Averly R, Baker FN, Watson IA, Ning X. 2025. LIDDIA: language-based intelligent drug discovery agent. arXiv 2502.13959v3 doi: 10.48550/arXiv.2502.13959

    [85] Zhang F, Zhao Y, Zhang W, Lai L. 2025. BioScientist agent: designing LLM-biomedical agents with KG-augmented RL reasoning modules for drug repurposing and mechanism of action elucidation. bioRxiv 669291 doi: 10.1101/2025.08.08.669291

    [86] Velez-Arce A, Lin X, Li MM, Huang K, Gao W, et al. 2024. Signals in the cells: multimodal and contextualized machine learning foundations for therapeutics. bioRxiv 598655 doi: 10.1101/2024.06.12.598655

    [87] Gao S, Zhu R, Kong Z, Noori A, Su X, et al. 2025. TxAgent: an AI agent for therapeutic reasoning across a universe of tools. arXiv 2503.10970v1 doi: 10.48550/arXiv.2503.10970

    [88] Schmidgall S, Su Y, Wang Z, Sun X, Wu J, et al. 2025. Agent laboratory: using LLM agents as research assistants. arXiv 2501.04227v2 doi: 10.48550/arXiv.2501.04227

    [89] Lu C, Lu C, Lange RT, Foerster J, Clune J, et al. 2024. The AI scientist: towards fully automated open-ended scientific discovery. arXiv 2408.06292v3 doi: 10.48550/arXiv.2408.06292

    [90] Penadés JR, Gottweis J, He L, Patkowski JB, Daryin A, et al. 2025. AI mirrors experimental science to uncover a mechanism of gene transfer crucial to bacterial evolution. Cell 188(23):6654−6665.e2 doi: 10.1016/j.cell.2025.08.018

    [91] Swanson K, Wu W, Bulaong NL, Pak JE, Zou J. 2025. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature 646(8085):716−723 doi: 10.1038/s41586-025-09442-9

    [92] Huang K, Zhang S, Wang H, Qu Y, Lu Y, et al. 2025. Biomni: a general-purpose biomedical AI agent. bioRxiv 656746 doi: 10.1101/2025.05.30.656746

    [93] Zhang Z, Qiu Z, Wu Y, Li S, Wang D, et al. 2026. OriGene: a self-evolving virtual disease biologist automating therapeutic target discovery. bioRxiv 657658 doi: 10.1101/2025.06.03.657658

    [94] Cong L, Smerkous D, Wang X, Yin D, Zhang Z, et al. 2025. LabOS: the AI-XR co-scientist that sees and works with humans. arXiv 2510.14861v2 doi: 10.48550/arXiv.2510.14861

    [95] Zhu L, Lai Y, Xie J, Mou W, Huang L, et al. 2025. Evaluating the potential risks of employing large language models in peer review. Clinical and Translational Discovery 5(4):e70067 doi: 10.1002/ctd2.70067

    [96] Zhu L, Lai Y, Mou W, Zhang H, Lin A, et al. 2024. ChatGPT's ability to generate realistic experimental images poses a new challenge to academic integrity. Journal of Hematology & Oncology 17(1):27 doi: 10.1186/s13045-024-01543-8

    [97] Rudin C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5):206−215 doi: 10.1038/s42256-019-0048-x

    [98] Kim Y, Jeong H, Chen S, Li SS, Park C, et al. 2025. Medical hallucinations in foundation models and their impact on healthcare. arXiv 2503.05777v2 doi: 10.48550/arXiv.2503.05777

    [99] Zhao H, Chen H, Yang F, Liu N, Deng H, et al. 2024. Explainability for large language models: a survey. ACM Transactions on Intelligent Systems and Technology 15(2):1−38 doi: 10.1145/3639372

    [100] Atti S, Subramaniam S. 2025. Fundamental limitations of foundation models in single-cell transcriptomics. bioRxiv 661767 doi: 10.1101/2025.06.26.661767

    [101] Li H, Zhang Z, Squires M, Chen X, Zhang X. 2025. scMultiSim: simulation of single-cell multi-omics and spatial data guided by gene regulatory networks and cell–cell interactions. Nature Methods 22(5):982−993 doi: 10.1038/s41592-025-02651-0

    [102] Li CP, Kalisa AT, Roohani S, Hummedah K, Menge F, et al. 2025. The imitation game: large language models versus multidisciplinary tumor boards: benchmarking AI against 21 sarcoma centers from the ring trial. Journal of Cancer Research and Clinical Oncology 151(9):248 doi: 10.1007/s00432-025-06304-9

    [103] Zhang Z, Zhou Z, Jin R, Cong L, Wang M. 2025. GeneBreaker: jailbreak attacks against DNA language models with pathogenicity guidance. arXiv 2505.23839v1 doi: 10.48550/arXiv.2505.23839

    [104] Wang M, Dupré la Tour T, Watkins O, Makelov A, Chi RA, et al. 2025. Persona features control emergent misalignment. arXiv 2506.19823v2 doi: 10.48550/arXiv.2506.19823

    [105] Guo W, Kundu J, Tos U, Kong W, Sisto G, et al. 2025. System-performance and cost modeling of large language model training and inference. arXiv 2507.02456v1 doi: 10.48550/arXiv.2507.02456

    [106] Wang Y, He J, Du Y, Chen X, Li JC, et al. 2025. Large language model is secretly a protein sequence optimizer. arXiv 2501.09274v2 doi: 10.48550/arXiv.2501.09274

    [107] Gao Y, Xiong Y, Gao X, Jia K, Pan J, et al. 2024. Retrieval-augmented generation for large language models: a survey. arXiv 2312.10997v5 doi: 10.48550/arXiv.2312.10997

    [108] Wang C, Long Q, Xiao M, Cai X, Wu C, et al. 2024. BioRAG: a RAG-LLM framework for biological question reasoning. arXiv 2408.01107v2 doi: 10.48550/arXiv.2408.01107

    [109] Jeong M, Sohn J, Sung M, Kang J. 2024. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. arXiv 2401.15269v3 doi: 10.48550/arXiv.2401.15269

    [110] Anthropic Public Benefit Corporation (Anthropic PBC). 2024. Introducing the model context protocol, Anthropic PBC, USA. www.anthropic.com/news/model-context-protocol
    [111] Khoei TT, Ehtesham A, Kumar S, Khoei TT. 2025. A survey of the model context protocol (MCP): standardizing context to enhance large language models (LLMs). Preprints doi: 10.20944/preprints202504.0245.v1

    [112] Hou X, Zhao Y, Wang S, Wang H. 2025. Model context protocol (MCP): landscape, security threats, and future research directions. arXiv 2503.23278v3 doi: 10.48550/arXiv.2503.23278

    [113] Haase J, Pokutta S. 2026. Human–AI cocreativity: exploring synergies across levels of creative collaboration. In Generative Artificial Intelligence and Creativity, eds. Worwood MJ, Kaufman JC. Amsterdam: Elsevier. pp. 205−221 doi: 10.1016/B978-0-443-34073-4.00009-5
    [114] Kim Y, Lee SJ, Donahue C. 2025. Amuse: human-AI collaborative songwriting with multimodal inspirations. arXiv 2412.18940v2 doi: 10.48550/arXiv.2412.18940

    [115] Wu A, Kuang K, Zhu M, Wang Y, Zheng Y, et al. 2024. Causality for large language models. arXiv 2410.15319v1 doi: 10.48550/arXiv.2410.15319

    [116] Liang H, Wang C, Yu H, Kirsch D, Pant R, et al. 2025. Real-time experiment-theory closed-loop interaction for autonomous materials science. Science Advances 11(27):eadu7426 doi: 10.1126/sciadv.adu7426

    [117] Bayley O, Savino E, Slattery A, Noël T. 2024. Autonomous chemistry: navigating self-driving labs in chemical and material sciences. Matter 7(7):2382−2398 doi: 10.1016/j.matt.2024.06.003

  • Cite this article

    Huang S, Lang M, Chen Z, Yang C, Huang X, et al. 2026. From foundation models to autonomous agents in biology. Genomics Communications 3: e006 doi: 10.48130/gcomm-0026-0005




Genomics Communications 3, Article number: e006 (2026)

Abstract: Advances in sequencing and multi-omics have unleashed exponential biological data growth, from genomes and transcriptomes to single-cell and spatial profiles. Traditional pipelines, reliant on manual curation, strain under this deluge. This bottleneck hampers discovery from terabyte-scale datasets. Large Language Models (LLMs) and AI agents are emerging as a powerful paradigm to address these challenges. Breakthroughs in foundation models pre-trained on biological 'languages' offer in-context learning and generative capabilities far beyond prior bioinformatics tools. By coupling LLM reasoning with multi-agent systems, Retrieval-Augmented Generation (RAG), and the Model Context Protocol (MCP), autonomous AI research agents can plan experiments, execute analyses, and even generate hypotheses with minimal human guidance. While these technologies promise to augment human intellect, their deployment presents critical challenges in reliability, biosecurity, and accessibility. Navigating these obstacles is key to ushering in an era of accelerated discovery and personalized medicine. As research moves from static models to active agents, we are witnessing the rise of the 'digital biologist'—an AI collaborator poised to reshape biomedical research. We trace this paradigm's rapid evolution, from foundation models learning the language of biological sequences and single-cell data, to autonomous agents capable of automating analysis, designing experiments, and driving drug discovery. By synthesizing these developments, we offer a strategic roadmap for researchers to navigate the opportunities and challenges of this AI-driven era. Finally, to support the community, a public, actively maintained resource list of models, agents, and datasets is available on our project website: http://awesomebio.webioinfo.top.

    • The central challenge of modern biology is no longer data generation, but data interpretation. The 3.2-billion-base-pair human genome, once the initial blueprint, has given way to a cascade of dynamic data streams that depict life in motion. Technologies like single-cell RNA-seq (scRNA-seq) generate millions of individual cellular profiles, a scale being tackled by AI-driven curation efforts like scBaseCount[1]. Simultaneously, spatial omics methods weave these narratives into the tissue's fabric, creating multi-terabyte mosaics of histology and gene expression. The result is not merely a data deluge, but a crisis of interpretation. The intricate patterns of health and disease are encoded in these datasets, yet they exist at a scale and complexity far beyond the reach of traditional, manual analysis. Human expertise, once the gold standard, is now fundamentally mismatched to this new reality, creating a chasm between data generation and biological discovery (Fig. 1a). This sets the stage for a new class of computational interpreter—one capable of navigating this complexity to automate and accelerate the extraction of knowledge.

      Figure 1. 

      The dawn of the digital biologist: a new paradigm for scientific discovery. (a) The shift from the traditional analytical bottleneck to AI-augmented discovery. On the left, a human scientist is overwhelmed by the complexity and scale of modern omics data, leading to a bottleneck that slows the pace of discovery. On the right, an AI agent acts as a collaborative 'digital biologist', empowering the human scientist by processing the same terabyte-scale data to generate actionable insights and testable hypotheses, thereby accelerating the research cycle. (b) A schematic of the iterative and collaborative workflow powered by AI. The cycle begins with the 'data deluge' from modern experiments, which is interpreted by 'foundation models' that serve as the core reasoning engine. The AI agent, operating within this framework, leverages specialized tools, such as those enabled by technologies like Retrieval-Augmented Generation (RAG) for accessing up-to-date knowledge and the Model Context Protocol (MCP) for interfacing with software to produce diverse 'scientific outputs'. These specific technologies (RAG, MCP) represent key opportunities for the field and will be detailed later in this review. These outputs can include automated data analysis, the generation of new hypotheses, and candidates for drug discovery. The 'human-in-the-loop' guides the process, evaluates the results, and refines the objectives, creating a powerful, cyclical partnership between human intuition and artificial intelligence.

      The application of artificial intelligence (AI) in biology is not a new phenomenon[2]. The field has witnessed a steady progression from classical machine learning algorithms used for classification and regression, to early deep learning models like convolutional neural networks (CNNs) for image analysis and recurrent neural networks (RNNs) for sequence data. However, the recent advent of the transformer architecture and the subsequent development of Large Language Models (LLMs) represent a qualitative, not merely quantitative, leap forward. Unlike their predecessors, LLMs possess two transformative characteristics: they are generative and exhibit remarkable in-context learning. They are pre-trained on vast, unlabeled corpora, such as the entirety of sequenced genomes or the scientific literature, to learn the fundamental statistical patterns, or 'grammar', of their respective domains. This pre-training phase creates powerful 'foundation models' that encode a general understanding of a data modality. This foundational knowledge can then be adapted to a wide array of specific downstream tasks, such as predicting DNA function or classifying cell types, often with minimal task-specific labeled data. This capacity for transfer learning is a significant advantage in biology, where labeled data is often scarce or expensive to acquire.

      While LLMs can serve as advanced analytical tools, the frontier is now shifting towards the development of AI agents, which exhibit autonomy, planning, and tool use in the pursuit of scientific goals[3]. An AI agent in this context is more than a static model; it actively decides which actions to take and can chain together multiple steps to accomplish a high-level objective. Crucially, these agents can interface with external tools and data sources, enabling them to retrieve relevant literature, call a statistical software package, or even control laboratory robotics. This paradigm extends beyond a 'smart algorithm' to an AI that functions as a research assistant or even a junior scientist. The key characteristics that distinguish agents from passive tools are: (1) Autonomy: agents can operate without step-by-step human instructions, making decisions based on intermediate results. (2) Planning and memory: agents maintain a working memory of the project and plan multi-step workflows to achieve a goal. (3) Tool integration: agents can use other software and data sources in a coordinated manner. In essence, an AI agent imitates the research process of a human scientist rather than solving a single, narrow problem. Early demonstrations of this concept, discussed later, include systems that can take a research idea and produce a complete analysis and report. This paradigm opens the door to fully automated experiments, where an AI might design and even execute laboratory work when paired with automation. This new workflow, often with a human scientist providing critical oversight (a 'human-in-the-loop'), forms a powerful iterative cycle of discovery (Fig. 1b). Although this journey is just beginning, it signals a fundamental shift from using AI within predefined workflows to having AI orchestrate the workflows themselves.
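The three characteristics above (autonomy, planning and memory, tool integration) can be caricatured in a few lines of code. This is a minimal, hypothetical sketch: the `Agent` class, the tool registry, and the hard-coded two-step plan are illustrative assumptions, not the API of any system cited in this review; a real agent would delegate the `plan` step to an LLM.

```python
# Minimal sketch of an agent loop: plan -> pick tool -> execute -> remember.
# All names (Agent, the toy tools) are illustrative, not a real framework.
from typing import Callable

class Agent:
    def __init__(self, tools: dict[str, Callable[[str], str]]):
        self.tools = tools           # tool integration: external callables
        self.memory: list[str] = []  # working memory of intermediate results

    def plan(self, goal: str) -> list[tuple[str, str]]:
        # A real agent would ask an LLM to decompose the goal; here the
        # multi-step plan is hard-coded purely for illustration.
        return [("search", goal), ("summarize", goal)]

    def run(self, goal: str) -> str:
        for tool_name, arg in self.plan(goal):  # autonomy: no per-step human input
            result = self.tools[tool_name](arg)
            self.memory.append(f"{tool_name}: {result}")
        return self.memory[-1]

tools = {
    "search": lambda q: f"3 papers found on '{q}'",
    "summarize": lambda q: f"summary of findings on '{q}'",
}
agent = Agent(tools)
print(agent.run("TP53 perturbation"))  # final memory entry
```

The loop makes the distinction from a passive model concrete: the model is called once per step, but it is the surrounding scaffold that sequences steps, routes tool calls, and accumulates state.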

      In this review, we comprehensively survey this rapidly evolving landscape. We begin by examining the foundational language models that serve as the engines of this new paradigm, from those that decipher the grammar of DNA/RNA/protein, to those that interpret the complex language of single-cell and spatial omics. We then chart the rise of autonomous AI, from single-purpose tools to collaborative multi-agent systems that tackle complex research problems. We categorize these agents by their function: automating data analysis, designing novel experiments, accelerating drug discovery, and pursuing the ambitious quest for a generalist 'AI Scientist.' Finally, we address critical challenges that must be overcome, spanning reliability, security, and ethics. We then explore the immense opportunities that lie ahead, highlighting key technologies like Retrieval-Augmented Generation (RAG) for ensuring factual grounding, the Model Context Protocol (MCP) for seamless tool integration, and the push towards causal reasoning and fully autonomous 'self-driving labs.' Our goal is to provide a structured overview of the state-of-the-art, offering a glimpse into a future where a 'digital biologist' becomes an indispensable partner in scientific discovery (Fig. 2). To complement this review, we have also created a publicly available and actively maintained resource list that includes the models, agents, and datasets discussed, available on our project website at: http://awesomebio.webioinfo.top.

      Figure 2. 

      Strategic roadmap for the era of AI-driven biology. This diagram outlines the structure of this review and the evolutionary trajectory of the field. The journey begins with foundation models (left), which ingest the 'data deluge' to learn the languages of sequences and cells. It progresses to autonomous agents (center), which leverage these models to act as active scientists capable of planning, tool use, and hypothesis generation. Finally, it points towards the future of discovery (right), addressing critical hurdles such as reliability and biosecurity, and envisioning the 'self-driving lab', where AI and robotics close the loop on scientific experimentation.

    • Foundation models are large-scale models trained on massive data using unsupervised or self-supervised learning to capture general and transferable knowledge. Most modern foundation models are based on the transformer architecture[4], whose attention mechanism models relationships between tokens, and enables highly scalable pre-training on large corpora. The power of modern biological AI stems from foundation models pre-trained on the fundamental 'languages' of biology. These models learn deep, context-aware representations of biological entities—from DNA sequences to entire cells—that can be leveraged for a vast array of downstream tasks (Fig. 3). A critical step that defines a model's capabilities is the choice of how to represent biological data as 'tokens'. The evolution from simple, human-defined tokens to more data-driven, semantically rich representations is a recurring theme and a key driver of progress. Below, we explore foundation models across two main categories: DNA/RNA/protein sequence models, and single-cell and spatial omics models.

      Figure 3. 

      Application of foundation models in biology. Foundation models are pre-trained on diverse biological data, including sequence data (DNA, RNA, proteins), and cellular data (single-cell, spatial omics). Using transformer architecture, these models learn fundamental representations of biological systems. The resulting models can then be applied to a wide range of downstream tasks (sometimes task-specifically fine-tuned), such as structure prediction and sequence generation at the sequence level, or cell type annotation and perturbation effect predictions at the cell level.

    • Early efforts to apply transformers to DNA demonstrated that nucleotide sequences have a latent 'language' of regulatory signals and patterns. DNA Bidirectional Encoder Representations from Transformers (DNABERT)[5] was a landmark: a BERT-based model trained on the human genome using a k-mer tokenization. Notably, it showed particular strength in low-data regimes, outperforming specialized tools that were trained from scratch on small, task-specific datasets. The successor, DNABERT-2, introduced crucial innovations for efficiency and performance[6]; it replaced the rigid k-mer tokenization with Byte Pair Encoding (BPE), a data-driven algorithm that builds a vocabulary of variable-length tokens by iteratively merging the most frequent adjacent characters. A further adaptation, DNABERT-S, tailored this architecture specifically for the task of differentiating species from their genomic sequences[7]. Other projects include the Nucleotide Transformer (NT) project[8], which trained transformers of up to 2.5 billion parameters on 850 genomes from across species, using a 6-mer tokenization. The latest iteration, Nucleotide Transformer v3 (NTv3), advances this paradigm by employing a U-Net–like architecture to enable single-base tokenization across 1 Mb contexts[9].
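The contrast between fixed k-mer tokenization (DNABERT) and data-driven BPE (DNABERT-2) can be made concrete with a toy example. The sketch below performs a single BPE merge step; real BPE repeats the merge until a target vocabulary size is reached, and neither function reproduces either model's actual vocabulary.

```python
from collections import Counter

def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    # Fixed-length, overlapping k-mers, as in the original DNABERT.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_merge_once(seq: str) -> list[str]:
    # One round of Byte Pair Encoding: merge the most frequent adjacent
    # pair of tokens. Real BPE iterates this to build a full vocabulary.
    tokens = list(seq)
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

seq = "ATATGC"
print(kmer_tokenize(seq))   # ['ATA', 'TAT', 'ATG', 'TGC']
print(bpe_merge_once(seq))  # most frequent pair ('A','T') merged: ['AT', 'AT', 'G', 'C']
```

The key difference is visible even at this scale: k-mer boundaries are fixed by a hand-chosen hyperparameter, whereas BPE lets token boundaries emerge from the frequency statistics of the corpus.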

      A major bottleneck for early transformers was their computational complexity, which scales quadratically with input sequence length, making it prohibitive to model the long-range dependencies essential for gene regulation. Several models introduced architectural breakthroughs to overcome this. HyenaDNA replaced the costly self-attention mechanism with a sub-quadratic design inspired by signal processing, enabling it to process sequences of up to 1 million tokens at single-nucleotide resolution on a single GPU[10]. Similarly, the GENA-LM family of models handles long sequences using sparse attention mechanisms and recurrent memory mechanisms[11]. Following these architectural advances, the field has rapidly moved towards scaling both model size and data volume. Evo2 is a 40-billion parameter model trained on 9.3 trillion DNA base pairs with an unprecedented 1-million-token context window[12,13]. Expanding the scope to the vast diversity of environmental DNA, GenomeOcean, a 4-billion-parameter model trained on 219 TB of global metagenomic data, investigates the 'Genomic Manifold Hypothesis.' By demonstrating a strong linear correspondence with Evo2's embedding space, it suggests that these models are converging on a universal, low-dimensional evolutionary manifold governed by biochemical constraints[14]. Concurrently, Google DeepMind's AlphaGenome processes 1-megabase DNA sequences to predict thousands of properties related to gene regulation, serving as a 'one-stop shop' for variant effect prediction[15]. In parallel, researchers have developed models for RNA sequences, such as RiNALMo, a 650-million-parameter model trained on 36 million noncoding RNA sequences that captures the unique 'grammar' of RNA secondary structures and motifs[16].
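The quadratic cost described above is easy to quantify: vanilla self-attention materializes an n × n score matrix, so memory grows with the square of context length. A back-of-the-envelope sketch (fp32, a single head and layer; the helper function is purely illustrative):

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_elem: int = 4) -> int:
    # Size of one n x n attention score matrix (one head, one layer, fp32).
    return n_tokens * n_tokens * bytes_per_elem

for n in (1_000, 100_000, 1_000_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:,.1f} GiB")
# At 1 million tokens a single score matrix is ~3.6 TiB, which is why
# megabase-scale, single-nucleotide contexts require sub-quadratic
# designs like HyenaDNA's rather than dense self-attention.
```

The numbers are per head and per layer, so realistic models multiply this cost further; the arithmetic makes plain why architecture, not just hardware, had to change.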

      Parallel to the advances in genomics, an equally transformative revolution has occurred in learning the language of proteins. Foundation models like Evolutionary Scale AI's ESM series have demonstrated that transformers trained on massive databases of protein sequences can learn deep representations that capture evolutionary, structural, and functional information. The latest iteration, ESM3, is a multimodal generative language model that unifies sequence and structure modeling. Instead of predicting coordinates directly, it generates discrete structure tokens, which are then decoded into high-resolution all-atom protein structures[17]. The field has also seen an explosion in scale and capability, exemplified by Baidu-backed BioMap's 100-billion-parameter xTrimoPGLM, a foundation model capable of both understanding and designing novel protein structures and functions[18]. Furthermore, the definition of biological 'language' is rapidly expanding beyond 1D sequences to include the continuous semantics of 3D geometry and molecular interactions. While AlphaFold demonstrated the initial potential of deep learning in Critical Assessment of Structure Prediction (CASP) 13[19], the paradigm shifted dramatically with AlphaFold2[20,21], which first achieved near-experimental accuracy for single-chain protein structure prediction in CASP14 benchmarks; AlphaFold3 subsequently extended this framework to a broad range of biomolecular complexes and interactions[22]. By integrating learned representations with a diffusion-based architecture, it can predict the joint structure of complexes involving proteins, DNA, RNA, ligands, and ions with markedly higher accuracy than previous methods across most complex types, providing a comprehensive view of molecular machinery in action[22].
Similarly, generative foundation models are tackling protein dynamics; for instance, Microsoft's BioEmu efficiently emulates equilibrium protein conformations, achieving thermodynamic accuracy (~1 kcal/mol) at a fraction of the computational cost, though its scope is currently limited to soluble proteins under fixed conditions[23].

      The frontier of sequence modeling is now pushing beyond single data types towards true integration, a trend foreshadowed by models like ProCALM, which blur the lines between modalities by training on both biological sequences and free-text descriptions to generate novel proteins from natural-language prompts[24,25]. Other architectures like IsoFormer advance this by employing a multi-modal transfer learning framework. Instead of training from scratch, it aggregates embeddings from specialized pre-trained encoders, specifically Enformer for DNA, Nucleotide Transformer for RNA, and ESM-2 for proteins, to connect these three core biological modalities. This approach allows it to tackle complex problems like predicting tissue-specific transcript isoform expression[26]. This trend has accelerated with the integration of natural language to make sophisticated analysis more accessible. ChatNT, for instance, is a conversational agent that understands both biological sequences and English prompts, allowing it to solve complex bioinformatics tasks interactively[27]. Supporting this drive are open-source efforts like Geneverse[28] and industry-led frameworks like NVIDIA's BioNeMo[29], which provide blueprints for ambitious integrative tasks such as enabling integrated workflows, where natural language specifications of molecular function can be combined with protein language models, structure-conditioned generators, and optimization modules to design candidate protein sequences.

    • Beyond the linear sequence of the genome, the collective state of a biological system forms a complex, multi-layered 'language' defined by gene expression, spatial organization, and phenotypic outcomes. A new class of foundation models aims to learn this language from large single-cell omics datasets. One pioneering work was Geneformer, a transformer pre-trained on 30 million (and later 100 million) single-cell gene expression profiles[30,31]. It introduced a rank-based encoding of gene expression and was trained with a masked learning objective, allowing it to learn fundamental gene network dynamics and enabling powerful zero-shot performance on downstream tasks. Another key model, scGPT, employs a generative pre-training approach and has proven effective for cell type annotation, data integration, and predicting perturbation responses[32]. Its successful continual pre-training on spatial transcriptomic profiles (called scGPT-spatial) demonstrates the adaptability of these architectures to new data modalities[33]. Other notable models include CellFM[34], scFoundation[35], stFormer[36], and Nicheformer[37]. A particularly novel approach to tokenization is presented by Cell2Sentence[38,39], which converts a cell's expression profile into a ranked list of gene names, creating a literal 'cell sentence'. This clever transformation allows standard Natural Language Processing (NLP) models (like GPT-2) and popular libraries (like Hugging Face) to be directly fine-tuned on single-cell data for tasks such as conditional cell generation and complex cell type prediction, elegantly bridging the gap between the NLP and biology domains. Complementing these single-cell approaches, GeneRAIN leverages 410,000 bulk RNA-seq samples and introduces a novel 'binning-by-gene' normalization strategy to mitigate expression bias. 
By ensuring equitable learning probabilities across the transcriptome, it generates multifaceted gene embeddings that encode diverse biological contexts, such as protein domains and disease associations, thereby enabling effective knowledge transfer to characterize understudied elements like long noncoding RNAs[40].
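Cell2Sentence's core transformation is simple enough to sketch: rank genes by expression and emit the top gene names as a space-separated 'cell sentence' that a standard NLP model can consume. The gene names and counts below are invented for illustration; the real method operates on full transcriptome-scale profiles.

```python
def cell_to_sentence(expression: dict[str, float], top_k: int = 5) -> str:
    # Rank genes by expression (descending) and keep only the names,
    # breaking ties alphabetically so the output is deterministic.
    ranked = sorted(expression, key=lambda g: (-expression[g], g))
    return " ".join(ranked[:top_k])

# Toy expression profile for one cell (values are invented).
cell = {"CD3D": 120.0, "CD8A": 95.0, "GZMB": 40.0, "MS4A1": 0.0, "NKG7": 60.0}
print(cell_to_sentence(cell, top_k=4))  # "CD3D CD8A NKG7 GZMB"
```

Because the output is ordinary text, off-the-shelf language models and tooling can be fine-tuned on it directly, which is precisely the bridge between NLP and single-cell biology that the approach exploits.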

      Beyond novel tokenization strategies, another key avenue for enriching the 'language' of the cell is to integrate it with curated biological knowledge. Rather than relying on the model to learn gene functions from expression patterns alone, models like Scouter[41] ingest textual information by utilizing high-dimensional gene embeddings derived from LLMs. This integration of biological and sequence-based priors is further advanced by specialized generative frameworks such as scDiffusion[42], which leverages foundation model-based autoencoders within a latent diffusion framework to generate high-fidelity, conditional gene expression data, and reconstruct continuous developmental trajectories via a gradient interpolation strategy. To address the complexities of multi-modal data, scDiffusion-X introduces a Dual-Cross-Attention module that adaptively learns hidden relationships between molecular layers, facilitating high-fidelity modality translation and the discovery of cell-type-specific regulatory networks[43]. Additionally, the Bag-of-Motifs framework provides a minimalist yet highly predictive sequence code for cell identity by representing distal regulatory elements as unordered counts of transcription factor motifs[44]. By conditioning the model on these fixed semantic embeddings alongside control cell expression data, it captures complex gene–gene interactions to accurately predict transcriptional responses to perturbations.
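The Bag-of-Motifs idea (representing a distal regulatory element as unordered motif counts) maps directly onto a bag-of-words vector. The sketch below uses exact-string motif matching over an invented enhancer sequence; real transcription factor motifs are position weight matrices rather than literal strings, so this is an assumption made purely for illustration.

```python
def bag_of_motifs(seq: str, motifs: list[str]) -> dict[str, int]:
    # Count (possibly overlapping) occurrences of each motif string.
    counts = {}
    for m in motifs:
        counts[m] = sum(1 for i in range(len(seq) - len(m) + 1)
                        if seq[i:i + len(m)] == m)
    return counts

# Toy enhancer and motif strings (illustrative consensus sequences only).
motifs = ["GATA", "CACGTG"]          # e.g., a GATA-factor core and an E-box
enhancer = "TTGATACCGATACACGTGA"
print(bag_of_motifs(enhancer, motifs))  # {'GATA': 2, 'CACGTG': 1}
```

The resulting count vector discards motif order and spacing, which is exactly the minimalist representation the framework bets on: a surprisingly predictive code for cell identity despite its simplicity.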

      Single-cell foundation models are also expanding to incorporate spatial context, which is crucial as a cell's state is influenced by its neighbors and microenvironment. Models like OmiCLIP[45] bridge histology images and gene expression by learning to align visual and molecular features. This allows the model to predict the gene expression profile from an H&E image and to perform cross-modal retrieval, unlocking capabilities like identifying regions in a tumor image that likely express a drug target. Expanding visual-linguistic pathology, the TITAN multimodal foundation model aligns hundreds of thousands of whole-slide images with reports using visual self-supervised learning[46]. This yields general-purpose representations that bridge pixel features and clinical semantics, enabling zero-shot or few-shot retrieval for rare diseases and cancer prognosis, while substantially reducing the need for task-specific labels. SPATIA[47] is a multi-scale generative model that directly fuses morphological tokens from cell images with transcriptomic tokens, learning spatially aware representations from the single-cell to the whole-tissue level.

      A particularly exciting direction is modeling perturbations and dynamics, aiming to provide a causal, rather than purely correlational, understanding of biology. For generative models, AI-driven virtual cells (AIVC) have gained increasing attention in recent years. Envisioned as multi-scale, multi-modal foundation models, AIVCs integrate diverse biological data to represent and simulate the behavior of molecules, cells, and tissues across varying states[48,49]. Beyond prediction, the field is advancing towards a 'Predict, Explain, Discover' paradigm, where AIVCs not only forecast functional responses to perturbations but also provide mechanistic explanations and generate testable hypotheses for lab-in-the-loop validation[50]. This evolution is supported by foundational 'data pillars'—including a priori knowledge, static architecture, and dynamic states—that nourish the growth of these silicon-based entities. Emerging methodologies are addressing the 'black box' limitation of earlier models; for instance, VCWorld introduces a 'white-box' simulator by combining LLMs with structured biological knowledge graphs to offer interpretable, step-by-step reasoning[51]. Simultaneously, new human-interpretable grammars are democratizing the field by allowing researchers to encode complex multicellular systems biology models using natural language rules[52]. Within this rapidly expanding landscape, models like UniCure[53] and Arc Institute's STATE[54] continue to push the boundaries of perturbation biology. STATE was trained on the unprecedented Tahoe-100M, a large atlas of 100 million perturbed single-cell profiles[55]. It uses a State Transition (ST) module to predict how a cell's state will shift under a given perturbation, significantly outperforming prior methods in predicting experimental outcomes. 
Broadening the scope of prediction, Prophet learns a unified representation of the experimental space to predict diverse phenotypes, from gene expression to cell morphology, encompassing various genetic and chemical perturbations[56]. Moving towards personalized medicine, ODFormer acts as a 'virtual organoid' for pancreatic cancer; by integrating transcriptomic and mutational profiles, it accurately predicts patient-specific drug responses and identifies potential biomarkers without physical testing[57]. The creation of such powerful predictive models is only possible due to the availability of large, harmonized perturbation datasets like scPerturb[58], Cell Painting Gallery[59], RxRx3[60], and the genome-wide scale of newer resources like X-Atlas/Orion[61]. The development of rigorous evaluation frameworks like PerturBench[62] and Cell-Eval[54] is also critical for benchmarking and comparing these predictive models.

      While generalist models strive for universality, specialized architectures are proving superior in specific domains. Nephrobase Cell+[63], a kidney-specific foundation model trained on nearly 40 million profiles, outperforms generalist counterparts like scGPT in tasks such as batch correction and cross-species alignment, suggesting that organ-focused pre-training creates higher-fidelity representations for specialized biology.

      As seen across the domains of genomics, proteomics, and cellular analysis, the clear trajectory is towards integration. A cell, after all, is not defined by its transcriptome or genome alone; its proteomic state, spatial environment, and regulatory state are all critical. Reflecting this trend, scTranslator acts as a generative 'translator' that infers single-cell proteomic profiles directly from transcriptomic data, effectively synthesizing multi-omics views from RNA-seq alone[64]. While many models still operate within a single domain, the most advanced architectures like AlphaFold3 and OmiCLIP are already bridging modalities. The ultimate goal remains a holistic, unified model that can encode an entire cell or tissue state, combining sequence, expression, structure, spatial context, and even clinical data, and reasoning over them together. This multi-modal integration is the critical stepping stone toward the generalist AI agents discussed next, which must handle diverse data types to solve complex research problems.

    • As the ecosystem of biological foundation models expands, rigorous benchmarking has become essential to delineate their practical utility against established baselines. Recent comprehensive evaluations highlight a nuanced landscape where increased model complexity does not always guarantee superior performance. In the realm of zero-shot learning (where models are applied to new tasks without further training), evaluations of prominent single-cell models like scGPT and Geneformer reveal significant limitations. Studies indicate that, for fundamental tasks such as cell type clustering and batch integration, these complex transformers often underperform compared to simpler, established methods like scVI, Harmony, or even identifying highly variable genes (HVG)[65]. Furthermore, the assumption that performance scales linearly with pre-training data volume is being challenged; recent investigations into dataset size and diversity suggest that model performance on downstream tasks frequently plateaus using only a small fraction (e.g., 1%) of the available training corpora, implying that current architectures may not yet be fully leveraging the vast scale of single-cell data[66]. For specific high-stakes applications like Drug Response Prediction, the choice of model depends heavily on the evaluation scenario. The scDrugMap benchmark demonstrates that while scFoundation achieves state-of-the-art performance in 'pooled-data' settings (where test data resembles training distributions), models like UCE and scGPT demonstrate superior generalizability in 'cross-data' scenarios, effectively handling unseen datasets[67]. Crucially, this study emphasizes that fine-tuning consistently yields better results than using frozen embeddings, highlighting the necessity of task-specific adaptation. Meanwhile, the integration of generalist LLMs into this domain presents a mixed picture. 
The CELLVERSE benchmark reveals that while massive generalist models (e.g., GPT-4, DeepSeek) surprisingly outperform smaller, specialized bio-LLMs (like C2S-Pythia) on language-centric biological tasks, they still struggle with complex reasoning; notably, in drug response prediction tasks, these generalist models often fail to outperform random guessing, underscoring the gap between linguistic fluency and deep biological insight[68]. These findings collectively suggest that researchers should prioritize rigorous, task-specific benchmarking over blind adoption of the largest available models.

    • While the foundation models described above represent a monumental leap in understanding the languages of biology, they are inherently passive reasoners. They can interpret a DNA sequence or predict a protein structure when prompted, but they cannot independently decide to then take that structure and screen it against a drug library. To unlock this potential, these models must be integrated into a larger, active framework. This is the role of the AI agent: an architecture that wraps a powerful LLM 'brain' in a system that provides it with memory, planning abilities, and access to external tools (Fig. 4a). However, not all agents operate with the same degree of independence. As outlined by Gao et al.[3], these systems can be classified into distinct levels of autonomy, ranging from Level 0, where ML models function merely as tools, to Level 1 'assistants' that execute specific, narrow tasks. More advanced Level 2 agents act as 'collaborators' capable of refining hypotheses, while aspirational Level 3 agents function as 'autonomous scientists' capable of de novo discovery and skeptical reasoning. This hierarchy highlights the shift from using LLMs as an analytical tool to deploying them as a research partner. In this section, we survey concrete examples of such agents, organized by their primary functions (Fig. 4b). It is worth noting that many agents could fit in multiple categories (for instance, a drug discovery agent might also generate new hypotheses), thus we will discuss each in its most salient context.

      Figure 4. 

      Anatomy and applications of the AI agent in biology. (a) Core characteristics of an AI agent. An AI agent is defined by a set of core capabilities that enable it to act as an autonomous research partner. These include: (1) autonomy, the ability to operate without step-by-step human instruction by leveraging knowledge; (2) planning and memory, the capacity to devise multi-step strategies and maintain a working memory of a project; and (3) tool integration, the ability to interact with, and control external software, databases, or laboratory robotics. (b) Key application areas. By leveraging these core characteristics, AI agents are being deployed across a wide range of biological applications. These include automating complex data analysis, accelerating drug discovery by navigating vast chemical spaces, generating novel scientific hypotheses, and pursuing the ultimate goal of a generalist AI scientist that can design and orchestrate entire experimental workflows.

    • The first breed of AI research agents focused on automating entire data analysis pipelines, effectively acting as an autonomous bioinformatician. The vision is to democratize bioinformatics, allowing non-experts to perform sophisticated analyses with minimal effort. A prime example is CellAgent[69], an LLM-driven multi-agent system for scRNA-seq analysis. It consists of specialized agents, including Planner, Executor, and Evaluator, which collaboratively design strategies, execute steps using tools like Scanpy, and check the results, adjusting the plan as needed. Similarly, SpatialAgent autonomously navigates the workflow of spatial biology, from experimental design to hypothesis generation[70]. CellVoyager is an agent that autonomously explores scRNA-seq datasets, uniquely conditioned on a human user's prior analyses to build upon existing work and explore complementary hypotheses[71]. To further enhance interactivity, CellWhisperer introduces a multimodal approach, aligning transcriptomes with text to enable natural-language interrogation of gene expression directly within the CELLxGENE browser[72]. Addressing the 'black-box' nature of automated annotation, new agents prioritize transparency alongside accuracy. GPTBioInsightor integrates differential expression genes with biological context and pathway functions via an LLM reasoning pipeline and rigorous scoring to predict cell types or states with quantitative confidence and transparent explanations[73]. Similarly, CASSIA integrates reasoning and quality assessments to calibrate confidence, effectively mitigating hallucinations and improving accuracy across rare cell populations[74]. For proteomics, DrBioRight 2.0 offers a conversational interface for exploring large-scale cancer proteomics data[75]. AutoBA is designed for fully automated analysis of various omics data types, requiring only a data path and objective to self-design and execute a complete workflow[76]. 
Demonstrating that powerful agents need not rely on the largest proprietary models, BioAgents is a multi-agent system built on Microsoft's small Phi-3 language model with RAG, designed to help users design and debug complex bioinformatics workflows while running locally, thereby enhancing accessibility[77]. Reinforcing this trend towards efficiency, the Nano Bio-Agents (NBA) framework further validates that small language models (< 10 B parameters) can achieve high accuracy in genomics question answering by orchestrating tools such as NCBI and AlphaGenome, with significantly lower computational overhead[78].
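The Planner/Executor/Evaluator division of labor used by systems like CellAgent can be caricatured as a retry loop. Everything below (the role functions, the fake clustering result, the quality check) is a hypothetical simplification for exposition, not CellAgent's actual implementation; in a real system each role would be an LLM call or a Scanpy pipeline.

```python
def planner(goal: str, attempt: int) -> str:
    # Stand-in for an LLM proposing an analysis strategy; here it just
    # raises a clustering resolution parameter on each retry.
    return f"{goal} with resolution={0.5 * (attempt + 1)}"

def executor(plan: str) -> dict:
    # Stand-in for running Scanpy steps; fakes a clustering outcome.
    resolution = float(plan.split("=")[1])
    return {"plan": plan, "n_clusters": int(10 * resolution)}

def evaluator(result: dict) -> bool:
    # Stand-in for a quality score (e.g., silhouette); here: >= 8 clusters.
    return result["n_clusters"] >= 8

def run_pipeline(goal: str, max_attempts: int = 5) -> dict:
    for attempt in range(max_attempts):
        result = executor(planner(goal, attempt))
        if evaluator(result):  # accept, otherwise loop back and replan
            return result
    raise RuntimeError("no acceptable plan found")

print(run_pipeline("cluster scRNA-seq"))
```

The point of the pattern is the feedback edge from Evaluator back to Planner: rather than executing a fixed pipeline, the system inspects its own intermediate results and revises the plan until a quality criterion is met.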

    • Moving beyond the automation of established analytical pipelines, this next class of 'active' agents represents a critical step up in autonomy: rather than merely analyzing existing data, they propose novel experiments and generate fundamentally new, testable scientific hypotheses. BioDiscoveryAgent is designed to navigate the vast hypothesis space of genetic perturbation experiments, using an LLM-based agent, together with tools and past experimental results, to propose small, targeted sets of genes to perturb to achieve a desired phenotype[79]. In domains where data is scarce, such as combinatorial therapy, Coated-LLM employs a 'scientific collaboration' metaphor: specialized agents acting as Researcher, Reviewer, and Moderator systematically debate and evaluate hypotheses to ultimately predict efficacious drug combinations for complex diseases like Alzheimer's disease[80]. Focusing on a specific technique, CRISPR-GPT is an agent that automates the detailed design of gene-editing experiments. Given a high-level goal, it decomposes the task and uses external tools to suggest target exons, design guide RNAs, predict off-target effects, and even design validation primers[81]. In protein science, ProtAgents is a multi-agent platform for de novo protein design that employs a collaborative system of planner, assistant, and critic agents to tackle complex design problems[82].

    • This area highlights the application of agentic AI to the high-stakes, multi-step process of developing new medicines. DrugAgent is a multi-agent framework that automates the complex programming tasks in AI-aided drug discovery, allowing researchers to apply sophisticated machine learning techniques without being expert coders[83]. LIDDIA (Language-based Intelligent Drug Discovery Agent) is designed to navigate the in silico drug discovery process, balancing the exploration of novel chemical space with the exploitation of known scaffolds to generate new molecules[84]. Complementing these efforts with mechanistic reasoning, BioScientist Agent unifies a massive biomedical knowledge graph with reinforcement learning. This architecture allows it to traverse complex biological relationships, identifying drug repurposing candidates while automatically generating causal reports that elucidate the putative mechanisms of action[85]. Such agents often draw upon large, curated dataset collections like the Therapeutics Data Commons (TDC), which aggregates dozens of datasets for therapeutic science[86]. For therapeutic reasoning, TxAgent represents a new level of sophistication. It is an AI agent designed to make personalized treatment recommendations by integrating real-time knowledge from a curated 'ToolUniverse' of 211 specialized tools, including sources like the FDA (Food and Drug Administration) drug database and Open Targets. Rather than relying on static training data, TxAgent performs multi-step reasoning, calling on these external tools to analyze drug interactions, contraindications, and patient-specific factors, providing transparent, evidence-backed decision traces[87]. Similarly, the NVIDIA Biomedical AI-Q Research Agent is a blueprint for creating on-premise agents that combine literature research with virtual screening capabilities, allowing a researcher to investigate a disease, identify a target, and autonomously screen for novel therapies.

    • The ultimate vision of this field is the creation of a general-purpose AI research agent, a 'digital biologist' that can work across multiple biomedical domains. Agent Laboratory is an autonomous framework that takes a human-provided research idea and then carries out the literature review, experimentation, and report writing[88]. The AI Scientist project from Sakana AI has pushed the boundaries of autonomy even further; in a landmark experiment, a paper fully generated by its AI Scientist-v2 system received scores above the acceptance threshold at an International Conference on Learning Representations (ICLR) workshop[89]. Other similar studies include AI Co-scientist[90] and The Virtual Lab[91]. In the realm of biomedicine, Biomni represents a project to build an AI with encyclopedic knowledge of biology that can autonomously execute tasks across 25 different domains using 150 specialized tools[92]. Most recently, its developers introduced Biomni-E1, a unified, curated 'action universe' that aggregates all the sources used in Biomni into a single programmatic environment, giving AI agents on-demand access to expert resources across the 25 defined biomedical subfields, and enabling tasks that span variant prioritization, drug repurposing, and multi-omics analysis. OriGene is another sophisticated architecture: a self-evolving multi-agent system that functions as a 'virtual disease biologist' for therapeutic target discovery. Critically, it features a framework that allows it to continuously integrate human and experimental feedback to iteratively refine its own reasoning templates and analytical protocols. OriGene's capabilities were validated by its successful nomination of novel cancer targets that showed significant anti-tumor activity in patient-derived models, representing a major step towards truly intelligent and adaptive scientific agents[93].
Finally, extending the AI scientist's capabilities into the physical realm, LabOS represents an end-to-end co-scientist that unites digital reasoning with wet-lab execution[94]. Unlike agents confined to in silico tasks, LabOS integrates its self-evolving multi-agent core with a physical lab module, utilizing robotic automation and multimodal perception to actively orchestrate experiments—from hypothesis generation to physical validation—thereby closing the loop between the digital 'brain' and the laboratory reality.

    • A fundamental weakness of current LLMs is their propensity to 'hallucinate': generating plausible but factually incorrect information. In a scientific context where accuracy is paramount, this is a critical failure mode[95]. An incorrect gene function or a flawed experimental protocol generated by an AI could waste months of research or, in a clinical setting, lead to dangerous outcomes[96]. Beyond mere correctness, a lack of model interpretability poses another barrier; for an AI's output to be trusted, scientists must be able to understand the 'reasoning' behind its conclusions, a feature often lacking in complex 'black box' models[97]. Ensuring the scientific accuracy of AI outputs and developing methods for models to provide calibrated confidence scores are urgent research priorities[98]. To address this, Explainable AI (XAI) and attribution methods are emerging as essential tools. Feature attribution techniques, such as SHAP and Integrated Gradients, allow researchers to quantify the contribution of a specific biological input, such as a nucleotide sequence or chemical substructure, to a model's prediction, distinguishing valid biological reasoning from spurious correlations[99]. Furthermore, there is a recurring tension between the power of massive, generalist pre-training and the superior performance of specialized models on niche tasks. Some benchmarks have shown that simpler statistical models can outperform large foundation models on specific, well-defined problems, suggesting a 'No Free Lunch' theorem for biological LLMs[65,100]. This indicates that while large-scale pre-training provides a powerful starting point, deep domain-specific knowledge remains essential, and reinforces the idea that scale is not everything.
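To make feature attribution concrete, the sketch below implements Integrated Gradients for a toy logistic 'variant scorer' with an analytic gradient. The model, weights, and inputs are hypothetical; real workflows would apply a library such as Captum or shap to a trained network. The completeness check at the end is the defining property of the method: attributions sum to the difference between the model's output at the input and at the baseline.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w):
    # Toy 'variant effect' scorer: logistic score over encoded sequence features
    return sigmoid(w @ x)

def grad(x, w):
    # Analytic gradient of the sigmoid score w.r.t. the input features
    s = model(x, w)
    return s * (1 - s) * w

def integrated_gradients(x, baseline, w, steps=200):
    # IG_i = (x_i - x'_i) * mean of dF/dx_i along the straight path x' -> x,
    # approximated with a midpoint Riemann sum
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = np.array([grad(baseline + a * (x - baseline), w) for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

w = np.array([2.0, -1.0, 0.5, 0.0])   # hypothetical feature weights
x = np.array([1.0, 1.0, 0.0, 1.0])    # observed input, e.g. encoded k-mers
baseline = np.zeros_like(x)           # 'absent' reference input

attr = integrated_gradients(x, baseline, w)
# Completeness axiom: attributions sum to f(x) - f(baseline)
print(np.allclose(attr.sum(), model(x, w) - model(baseline, w), atol=1e-4))
```

Note that the zero-weight fourth feature receives exactly zero attribution, illustrating how the method separates inputs the model actually uses from those it ignores.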

    • Evaluating the performance of an AI scientist is a complex challenge. Unlike tasks with static benchmarks, the output of a research agent can be a multifaceted hypothesis, a dataset analysis, or a draft paper. Rigorous evaluation is difficult, as success might be a novel finding that is only recognizable in hindsight. One solution is to use simulated environments where the ground truth is known, for example, hiding a known pathway in simulated data and testing whether the agent can discover it. Frameworks like scMultiSim, which can generate realistic single-cell and spatial data guided by known gene regulatory networks, are essential for creating these controlled environments for validation[101]. Another approach is direct competition, where AI agents and human teams enter the same scientific challenges[102]. Developing standardized benchmarks for 'AI as a scientist' is an active and crucial area for tracking progress.
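A toy version of such a controlled benchmark can be built in a few lines: plant a co-expressed gene module in otherwise random data, then check whether a simple correlation-based 'agent' rediscovers it. Module membership, effect size, and dataset dimensions below are arbitrary illustrative choices, not values from any cited framework.

```python
import numpy as np

rng = np.random.default_rng(42)

n_cells, n_genes = 300, 20
hidden_module = {2, 7, 11}   # ground-truth co-regulated genes (simulated)

# Background: independent noise; module genes share a latent driver signal
X = rng.normal(size=(n_cells, n_genes))
driver = rng.normal(size=n_cells)
for g in hidden_module:
    X[:, g] += 2.0 * driver

def recover_module(X, size):
    # 'Agent' under test: rank genes by total absolute correlation with others
    C = np.abs(np.corrcoef(X.T))
    np.fill_diagonal(C, 0)
    scores = C.sum(axis=1)
    return set(np.argsort(scores)[-size:])

found = recover_module(X, len(hidden_module))
print(found == hidden_module)   # did the agent rediscover the planted pathway?
```

The same pattern scales up: replace the toy generator with a simulator like scMultiSim and the correlation heuristic with the agent being benchmarked, and the planted structure provides an objective pass/fail criterion.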

    • Biomedical AI models are trained on vast datasets that often contain sensitive patient information and reflect historical inequities in healthcare, creating two major ethical risks. The first is the potential for violating patient privacy, which necessitates strict adherence to regulations and robust cybersecurity. The second is the risk of perpetuating or amplifying existing biases. A model trained on data that overrepresents certain demographic groups may perform poorly for underrepresented populations, leading to unequal health outcomes. Algorithmic bias can also reinforce scientific dogma; for instance, an agent might prioritize well-studied drug targets over novel ones, thereby narrowing the scientific horizon. Careful data curation and explicit debiasing strategies are needed to mitigate these risks.

    • The generative power of biological AI comes with a significant dual-use risk. The GeneBreaker framework has demonstrated that DNA foundation models, including the powerful Evo2, are vulnerable to 'jailbreak' attacks[103]. Through carefully crafted prompts, these models can be coerced into bypassing their safety filters and generating DNA sequences with high homology to known human pathogens, such as the SARS-CoV-2 spike protein. Although end-to-end physical synthesis from such jailbreaks has not yet been demonstrated, this in silico ability to generate potentially harmful biological material poses a profound biosecurity threat by significantly lowering the technical barrier for designing hazardous agents. This highlights the urgent need for more robust safety alignment techniques, monitoring, and tracing mechanisms for these powerful generative models. As agents become capable of not just designing, but also orchestrating wet-lab experiments, whether via robotics or by guiding human operators, the risk graduates from generating harmful information to initiating harmful synthesis.

      It is also crucial to differentiate the urgency of these threats. Cybersecurity risks are immediate and pervasive: an AI agent writing analysis code can already inadvertently or maliciously damage infrastructure or leak proprietary data. In contrast, autonomous biological synthesis remains an emerging, though high-stakes, future threat. To mitigate these distinct risks, a layered defense strategy is required: developers must implement strict sandboxing for AI-generated code to prevent digital damage, while the community must enforce rigorous screening of gene synthesis orders and hardware-level access controls for robotic platforms to prevent physical misuse.

      A recent study from OpenAI also identified a 'toxic persona' feature that most strongly controls emergent misalignment, which could potentially be exploited to make AI systems malicious[104]. On the cybersecurity side, an AI agent that writes code for analysis presents another attack surface. If an adversary poisons a dataset or library, the agent could inadvertently execute malicious code, damage compute infrastructure, or leak data. Another subtle security issue is intellectual property: if an agent is trained on proprietary databases or papers, does it 'know' things that are patented or confidential? This necessitates 'AI IP firewalls' to ensure, for example, that a pharmaceutical company's AI does not inadvertently incorporate a competitor's private data that it may have seen in public leaks during its training.

    • The development of state-of-the-art foundation models is an immensely resource-intensive endeavor. Training a model like GPT-3 is estimated to cost millions of dollars, and the computational demands continue to grow exponentially[105]. This high cost concentrates the power to build and train these models in the hands of a few large, well-funded technology companies and research institutions. This creates a significant barrier to entry for most academic labs, startups, and researchers in lower-resource settings, risking the creation of a new digital divide that could stifle innovation and competition. To quantify these practical barriers, we summarize the computational requirements and accessibility of representative models in Table 1. As illustrated, the landscape is nuanced. To mitigate resource constraints, developers of massive models like Evo2 and xTrimoPGLM often release scaled-down versions (e.g., Evo2 offers 1B, 7B, and 40B variants), enabling broader adoption on modest hardware. However, accessibility remains bifurcated by licensing and deployment models: while some are fully open-source, others, like AlphaGenome, are restricted to API access, and AlphaFold3 is currently limited to non-commercial academic use via application. Consequently, while 'lightweight' options exist, access to the most powerful frontiers sometimes remains gated. Energy use is another aspect: widespread use of huge models has an environmental footprint, and the scientific community will need to consider the trade-off between compute usage and the benefits of discoveries made faster. Some have suggested that AI should tackle the sustainability of its own workflows, perhaps by having agents optimize their experiments to use less compute by intelligently pruning the search space.

      Table 1.  Computational landscape of major bio-foundation models.

      | Model name | Type | Parameters | Training hardware | Accessibility |
      | --- | --- | --- | --- | --- |
      | DNABERT-2 | Genome | 117 M | 8 RTX 2080Ti | https://huggingface.co/zhihan1996/DNABERT-2-117M |
      | GenomeOcean | Genome | 100 M, 500 M, 4 B | 64 NVIDIA A100 | https://huggingface.co/DOEJGI/GenomeOcean-4B |
      | HyenaDNA | Genome | 1 K, 16 K, 32 K, 160 K, 450 K, 1 M | 8 NVIDIA A100 | https://huggingface.co/collections/LongSafari/hyenadna-models |
      | Nucleotide Transformer | Genome | 100 M, 500 M, 2.5 B | 128 NVIDIA A100 | https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer |
      | Nucleotide Transformer v3 | Genome | 8 M, 100 M, 650 M | / | https://huggingface.co/spaces/InstaDeepAI/ntv3 |
      | Evo2 | Genome | 1 B, 7 B, 40 B | > 2,000 NVIDIA H100 | https://huggingface.co/collections/arcinstitute/evo |
      | AlphaGenome | Genome | 450 M | 512 Google TPU + 64 NVIDIA H100 | API only |
      | RiNALMo | Genome | 650 M | 7 NVIDIA A100 | https://zenodo.org/records/15043668 |
      | ESM3 | Proteome | 1.4 B, 7 B, 98 B | / | https://huggingface.co/EvolutionaryScale/esm3-sm-open-v1 |
      | xTrimoPGLM | Proteome | 1 B, 3 B, 7 B, 100 B | 768 NVIDIA A100 | https://huggingface.co/biomap-research/proteinglm-100b-int4 |
      | AlphaFold2 | Proteome | 93 M | 128 Google TPU | https://github.com/google-deepmind/alphafold |
      | AlphaFold3 | Proteome | / | / | Apply for academic usage |
      | BioEmu | Proteome | 31 M | / | https://huggingface.co/microsoft/bioemu |
      | ProGen2 | Proteome | 151 M, 764 M, 2.7 B, 6.4 B | / | https://github.com/salesforce/progen/tree/main/progen2 |
    • While large foundation models possess a remarkable breadth of knowledge, their true power is unlocked through specialization and grounding. The opportunity lies in moving beyond general-purpose models to create highly accurate, domain-specific experts. Fine-tuning remains a cornerstone strategy, allowing pre-trained models like Evo2 or CellFM to be adapted to specific biological contexts, such as a particular cancer type, a novel sequencing technology, or a unique experimental dataset. This process sharpens their predictive accuracy and tailors their generative capabilities to the precise needs of a research question. Domain-specific adapters or low-rank adaptation (LoRA) layers can be swapped in or out to keep pace with the field while controlling cost[106].
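The arithmetic behind LoRA's appeal is easy to sketch: the frozen weight matrix W is augmented with a low-rank product B·A, so only rank·(d_in + d_out) parameters are trained per adapted layer. The numpy sketch below shows the mechanics under illustrative dimensions and scaling (all sizes and the alpha value are hypothetical, not taken from any particular model).

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 64, 64, 4             # rank << d; sizes are illustrative

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # B starts at zero: no initial drift
alpha = 8.0                               # LoRA scaling hyperparameter

def forward(x, B, A):
    # Adapted layer: frozen W plus swappable low-rank update (alpha/rank) * B @ A
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# At initialization the adapter is a no-op, so behavior matches the base model
print(np.allclose(forward(x, B, A), W @ x))

# Trainable parameters per adapted layer: rank*(d_in+d_out) vs full d_in*d_out
print(rank * (d_in + d_out), "adapter params vs", d_in * d_out, "full params")
```

Because W never changes, several adapters trained for different biological contexts can share one base model and be swapped per task, which is the cost argument made above.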

      To combat the critical issue of hallucination and ensure that AI-generated insights are scientifically valid, RAG is becoming an important tool[107]. The opportunity here is to build systems where AI agents are dynamically grounded in the latest, most reliable information (Fig. 5, left panel). Instead of relying solely on their internal, static knowledge, agents can query and incorporate data from external, curated sources in real time. This could include the entire PubMed corpus, proprietary clinical trial data, up-to-the-minute genomic databases, or even a lab's own internal experimental results. RAG helps shift the role of the LLM from primarily a 'knower' to more of a 'reasoner' that synthesizes trusted information, providing more transparent, citable, and up-to-date outputs. As mentioned previously, BioAgents demonstrated that even a smaller open-source LLM (Phi-3) can effectively assist with complex bioinformatics pipeline design by employing RAG to inject domain knowledge on the fly. Among other recent biological RAG systems, BioRAG outperforms both vanilla GPT-4 and search-engine-only pipelines on life-science question answering, underscoring how crucial specialized retrieval and embeddings are for the bench biologist[108]. Furthermore, self-BioRAG is a framework that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on generated responses for biomedical text[109].
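A minimal sketch of the retrieval-then-prompt step is shown below, using bag-of-words counts as a stand-in embedding. The document IDs and contents are fabricated for illustration; a real system would use a biomedical embedding model, a vector database, and the live literature sources named above.

```python
import math
from collections import Counter

# Toy knowledge base standing in for PubMed abstracts (contents hypothetical)
DOCS = {
    "pmid:001": "TP53 is a tumor suppressor gene frequently mutated in human cancers",
    "pmid:002": "CRISPR-Cas9 enables programmable genome editing via guide RNAs",
    "pmid:003": "Single-cell RNA sequencing profiles gene expression per cell",
}

def embed(text):
    # Stand-in embedding: word counts (a real system would use a trained
    # biomedical sentence-embedding model)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    # Semantic search: rank documents by similarity to the query embedding
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)
    return ranked[:k]

def build_prompt(query):
    # Grounding step: retrieved passages are prepended so the LLM answers
    # from citable context rather than parametric memory alone
    context = "\n".join(f"[{d}] {DOCS[d]}" for d in retrieve(query, k=1))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations:"

print(retrieve("which gene is a tumor suppressor mutated in cancers?"))
```

The key design point is that the generator only ever sees retrieved, attributable text in its context window, which is what makes RAG outputs citable and auditable.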

      Figure 5. 

      Grounding AI agents with knowledge and tools. (Left) the Retrieval-Augmented Generation (RAG) workflow. To ensure factual accuracy, a diverse 'Biological Knowledge Universe' (e.g., PubMed, clinical trials, internal lab data) is converted into a 'Biological Vector Database'. When a user provides a prompt, a semantic search retrieves the most relevant context from this database. This information is then fed to the LLM, which uses it to generate a grounded and verifiable report, often incorporating a 'Self-Reflection' step to check its own output. (Right) the Model Context Protocol (MCP) for tool integration. MCP acts as a universal, standardized interface that allows an LLM to seamlessly interact with a wide array of external resources. This includes querying 'Genomic Databases', calling 'Analysis Software', or accessing the latest 'Scientific Literature'. By creating a common language for tool use, MCP simplifies the process of building context-aware agents that can leverage the best available resources for a given task.

      A complementary development is the rise of the MCP as a flexible interface layer connecting models to tools and data sources[110]. Introduced in late 2024, MCP is an open standard that allows AI systems to securely access external services (like databases, computational tools, or lab instruments) through a unified protocol (Fig. 5, right panel). One can think of MCP as a 'USB-C port' for AI models: a standardized connector that replaces ad hoc, fragmented integrations with a consistent interface. By using MCP, a biology-focused LLM agent can, in principle, query a genomic database, call a protein folding API, or send commands to a microscope, all through the same standardized interface[111]. An MCP server could define how to package a patient's data (genomics, transcriptomics, imaging, and clinical history) into a single, coherent 'context' that an agent like TxAgent can ingest. This promises to simplify the tool augmentation of LLMs, enabling context-aware agents that automatically bring the right resources into their context window. Standardizing this process will be crucial for interoperability, allowing different AI agents and tools to seamlessly communicate and build upon each other's analyses, ensuring that the AI has the full picture before making a recommendation or generating a hypothesis[112]. Adopting MCP-style interfaces in bio-LLM services would let users chain fine-tuned models, RAG modules, and wet-lab schedulers without bespoke code, fostering an open ecosystem analogous to the early web. Today, multiple MCP servers and directories already expose tools that can be accessed through HTTP-based streaming APIs, making it easier to plug external capabilities into AI workflows.
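The sketch below caricatures this exchange with an in-process dispatcher for the two core operations, tool discovery and tool invocation. It is schematic only: real MCP runs JSON-RPC 2.0 over stdio or HTTP transports with capability negotiation, and the gc_content tool here is a made-up example rather than any published server.

```python
import json

# Schematic MCP-style tool registry (names and schema are illustrative)
TOOLS = {
    "gc_content": {
        "description": "Compute GC fraction of a DNA sequence",
        "inputSchema": {"type": "object",
                        "properties": {"seq": {"type": "string"}}},
        "fn": lambda seq: (seq.upper().count("G") + seq.upper().count("C")) / len(seq),
    }
}

def handle(request):
    # Dispatch the two core methods: tool discovery and tool invocation
    req = json.loads(request)
    if req["method"] == "tools/list":
        result = [{"name": n, "description": t["description"]}
                  for n, t in TOOLS.items()]
    elif req["method"] == "tools/call":
        tool = TOOLS[req["params"]["name"]]
        result = tool["fn"](**req["params"]["arguments"])
    else:
        result = None
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

# An agent first discovers the available tools, then calls one with
# structured, schema-validated arguments
print(handle('{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'))
print(handle(json.dumps({"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                         "params": {"name": "gc_content",
                                    "arguments": {"seq": "ATGCGC"}}})))
```

Because both sides speak the same message shapes, swapping the toy registry for a genomic database or a microscope controller changes the tool table, not the agent.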

      Taken together, improved fine-tuning methods, RAG pipelines, and MCP-based tool integrations delineate a toolkit for adapting general models into biomedical specialists. A future 'digital biologist' agent might combine all three: a foundation model fine-tuned on biomedical texts and omics data, augmented by the retrieval of patient or literature data, and interfacing with lab equipment via MCP to perform experiments, thereby tightly coupling knowledge, context, and action in service of scientific discovery.

    • The complexity of biological research rarely allows for a single expert to suffice; it requires a team. The future of AI in biology mirrors this reality, moving from monolithic, single-AI systems to collaborative multi-agent systems. As seen in frameworks like CellAgent, ProtAgents, and OriGene, the opportunity lies in creating digital research groups composed of specialized agents, each with a distinct role: 1) a Planner Agent that outlines the high-level research strategy; 2) a Literature Agent that performs comprehensive background research using RAG; 3) an Executor Agent that writes and runs code to perform data analysis using tools like Scanpy or Bioconductor; 4) a Critique Agent that evaluates the results, checks for errors, and suggests revisions; and 5) a Visualization Agent that generates publication-quality figures and charts. Notably, MCP offers broad possibilities for these agents to invoke a wide variety of tools. Likewise, environments like Biomni-E1 enable AI agents to access a wide range of specialized tools and knowledge. This modular approach can not only be more robust and scalable, but also more transparent. By breaking down a complex task like 'discover a new drug target for liver cancer' into a series of sub-tasks handled by specialized agents, the overall reasoning process becomes easier to audit, debug, and validate by human experts. This collaborative architecture allows AI to tackle far more ambitious, multi-step research programs than any single agent could alone.
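As a minimal illustration of the planner/executor/critic pattern, the sketch below wires stub agents into a retry loop. The roles, step names, and stopping criterion are hypothetical stand-ins; production systems like those cited above back each role with an LLM and real tools.

```python
# Stub planner: decompose a high-level goal into ordered analysis steps
def planner(goal):
    return ["load_data", "normalize", "cluster"]

# Stub executor: pretend to run a step, recording an artifact in shared state
def executor(step, state):
    state[step] = f"{step}:done"
    return state

# Stub critic: approve only if every planned step produced an artifact
def critic(state, plan):
    return all(step in state for step in plan)

def run(goal, max_rounds=3):
    plan, state = planner(goal), {}
    for _ in range(max_rounds):
        for step in plan:
            if step not in state:          # retry only the missing steps
                state = executor(step, state)
        if critic(state, plan):
            return state                   # plan satisfied: stop iterating
    raise RuntimeError("critic rejected all attempts")

print(sorted(run("annotate cell types in scRNA-seq data")))
```

The transparency argument from the text shows up directly here: the shared state is an auditable record of which sub-task produced which artifact, which a human reviewer can inspect step by step.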

    • The future is unlikely to be one where AI fully replaces human scientists. Rather, it points towards a 'centaur' model of collaboration, where the unique strengths of humans and AI are combined. AI systems excel at processing vast amounts of data, executing complex analyses, and exploring enormous hypothesis spaces. Humans, in contrast, provide high-level strategic direction, domain intuition, creativity, and critical judgment: the scientific 'taste' that current AI lacks. They define what constitutes a meaningful scientific question or a valuable discovery, thereby guiding the AI's vast analytical power. This partnership implies a distinct evolution for different roles: wet-lab biologists will increasingly focus on foundational research design, identifying the core scientific questions to be solved, and validating AI-driven protocols, while bioinformaticians will transition from writing routine analysis pipelines to managing the agentic systems and curating the high-quality data that fuels them. The most powerful systems will be interactive, incorporating a human-in-the-loop for continuous feedback, guidance, and refinement, creating a synergistic partnership that outperforms either human or AI alone[113,114]. Technological strides are now physically bridging this partnership. Connecting digital reasoning agents to the physical world, LabOS introduces an AI-XR framework where agents equipped with smart glasses can 'see' the experimental context[94]. By uniting computational reasoning with physical perception, it enables AI to actively assist in real-time execution of processes ranging from stem-cell engineering to cancer therapy, transforming the lab into a shared, intelligent workspace where human and machine discovery evolve together.

    • A major limitation of many current machine learning models is that they are masters of correlation, but novices at causation. They can identify that gene A's expression is associated with a disease, but they cannot inherently determine if gene A causes the disease. The next great frontier for AI in biology is to bridge this gap. Most current bio-LLMs operate on correlations; embedding causal reasoning would let them propose interventions (CRISPR edits, small-molecule perturbations) with mechanistic justifications rather than heuristic confidence scores. Recent surveys on causal discovery for LLMs discuss various approaches to integrate causal reasoning, including interpretations of self-attention as Pearl-style structural causal models, and the application of invariant risk minimization in debiasing frameworks[115]. Perturbation models like STATE are a critical step in this direction. By learning to predict the cellular outcomes of specific genetic or chemical perturbations, these models are implicitly learning the causal wiring of the cell. In other fields, such as materials science, closed-loop autonomous platforms have demonstrated the ability to accelerate experimental cycles through real-time experiment-theory integration, suggesting potential pathways for biology[116]. An AI armed with a robust causal model could move beyond descriptive analysis to perform in silico experiments. A scientist could ask, 'What is the likely outcome if I knock out this gene in this specific cell type?', or 'Which compound in this library is most likely to reverse this disease signature?'. Answering these questions is the essence of mechanistic understanding and therapeutic development. Achieving true causal reasoning will be a profound leap, enabling AI agents to generate novel, testable, and mechanistically grounded hypotheses that drive discovery forward.

    • The ultimate culmination of these opportunities is the creation of 'self-driving labs'—fully autonomous, closed-loop research systems. This visionary concept integrates all the previously discussed elements into a continuous cycle of scientific discovery, powered by AI and robotics. The workflow would be as follows: (1) Hypothesis: an AI agent, using causal reasoning and knowledge from literature, generates a novel hypothesis (e.g., 'inhibiting protein X will reverse drug resistance in this cancer cell line'); (2) Design: a specialized agent designs a series of experiments to test this hypothesis, generating the precise protocols for robotic execution; (3) Execution: the protocols are sent to automated lab robotics, which perform the experiments (e.g., cell culturing, drug application, high-throughput imaging, or sequencing); (4) Analysis: the raw data from the experiment is fed back to a data analysis agent, which processes the results; (5) Iteration: the results are interpreted by the hypothesis agent, which then updates its internal model of the biological system. It either validates, refutes, or refines its original hypothesis and immediately begins the cycle anew. Such a closed-loop system would represent a fundamental paradigm shift, compressing research cycles that currently take months or years into a matter of days or weeks. By allowing AI to autonomously explore the vast landscape of experimental possibilities 24/7, self-driving labs hold the promise of dramatically accelerating the pace of discovery, and ushering in an era of unprecedented progress in medicine and biology[117].
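The five-step cycle above can be caricatured in a few lines of code. The sketch below runs a simulated closed loop in which an agent tests candidate gene knockouts against a noisy 'robotic assay' whose ground truth is hidden from it; the gene names and effect sizes are invented purely for illustration.

```python
import random

random.seed(0)

# Hidden ground truth for the simulated wet lab: knocking out GENE_X
# reverses the resistance phenotype most strongly (values hypothetical)
TRUE_EFFECT = {"GENE_A": 0.1, "GENE_B": 0.3, "GENE_X": 0.9, "GENE_C": 0.2}

def run_experiment(gene):
    # Execution step: a robotic assay returns a noisy measurement
    return TRUE_EFFECT[gene] + random.gauss(0, 0.05)

def closed_loop(candidates, replicates=5):
    # Hypothesize -> design -> execute -> analyze -> iterate over candidates
    evidence = {}
    for gene in candidates:                       # each round tests one hypothesis
        results = [run_experiment(gene) for _ in range(replicates)]
        evidence[gene] = sum(results) / replicates   # analysis: mean effect
    return max(evidence, key=evidence.get)        # updated best hypothesis

best = closed_loop(list(TRUE_EFFECT))
print(best)
```

In a real self-driving lab, run_experiment would dispatch protocols to robotics and the candidate list would itself be proposed and pruned by the hypothesis agent, but the control flow, propose, measure, update, repeat, is exactly this loop.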

    • The convergence of massive biological datasets with the powerful representational and reasoning capabilities of LLMs and AI agents represents a genuine paradigm shift in the life sciences. This review has charted the rapid and remarkable journey from foundation models that learn the fundamental languages of the genome and the cell, to specialized assistive tools that democratize complex analysis, and finally to the dawn of autonomous agents that are beginning to function as active participants in the scientific process. This evolution is giving rise to a new entity: the 'digital biologist', an AI collaborator that can augment and amplify human intellect. The trajectory of AI in biomedicine, as outlined in this review, is one of accelerating abstraction and autonomy (Fig. 6). The journey begins with foundation models learning the basic representations of biological data, progresses to the application of these models in discrete, assistive tools, and culminates in the emergence of autonomous agents that can plan, act, and even create. Several key trends are clear: (1) a move towards integrating multi-modal data for a more holistic view; (2) the rise of a dominant architectural pattern in which a generalist LLM acts as a tool-using orchestrator; and (3) a conceptual evolution from static, one-shot systems towards dynamic, adaptive agents that can learn from feedback and collaborate with human scientists.

      Figure 6. 

      The Path to trustworthy AI in biology: navigating challenges to unlock opportunities. This roadmap illustrates the progression from the core AI opportunity (foundation models and agents) to the future of discovery. This path requires overcoming technical hurdles (e.g., performance, hallucination), navigating ethical and societal risks (e.g., data privacy, dual-use), and breaking down resource barriers (e.g., cost, accessibility). Successfully doing so will enable a new era of science defined by human-AI collaboration, causal reasoning, and autonomous labs.

      At the same time, we must critically distinguish between high-potential directions and overhyped expectations. While scaling parameters improves general fluency, the blind pursuit of scale appears increasingly overhyped, as recent evidence suggests that specialized, smaller models often outperform larger generalist models on biological tasks. Conversely, the most promising avenue lies in the 'modular expert' approach, integrating reasoning layers via RAG and MCP with specialized tools. This distinction directly frames the bottlenecks in evolving from Level 1 'assistants' to Level 3 'autonomous scientists'. The transition to Level 2 is currently gated by the challenge of robust tool integration, whereas the ultimate leap to Level 3 is stalled not by data volume but by the causality gap. Without the ability to distinguish correlation from causation and verify findings in closed-loop systems, agents cannot safely graduate from suggesting hypotheses to autonomously executing them.

      However, realizing the full potential of this paradigm requires confronting a set of formidable and interconnected challenges. The path forward is laden with significant issues of reliability, rigorous evaluation, data privacy, algorithmic bias, and the profound biosecurity risks of generative biology. These must be navigated with foresight, responsibility, and a commitment to ethical principles. Furthermore, the high cost of developing these technologies also risks creating a new digital divide that could limit the scope of scientific inquiry. If these challenges are met, the potential is immense. The future of science will likely be a hybrid one, characterized by a deep, synergistic collaboration between human researchers and AI agents. By connecting these intelligent agents to robotic automation in self-driving labs and by pushing them beyond correlation towards a true understanding of causality, we stand on the cusp of a new era. This AI-augmented science promises to dramatically accelerate the pace of discovery, deepen our understanding of the intricate machinery of life, and ultimately usher in a new age of truly personalized and predictive medicine. In closing, one cannot help but feel that we are witnessing the emergence of a new era of science, in which discoveries are accelerated by orders of magnitude. The convergence of big data and big models means that an ambitious question (like 'How do we regenerate organs?', 'How does the microbiome affect mental health?', or 'How can we finally conquer cancer?') might be tackled not by decades of slow incremental work, but by a concerted human–AI team rapidly iterating through hypotheses and experiments. If the 20th century was defined by the molecular biology revolution, the 21st may be defined by this AI-driven research revolution. 
The 'digital biologist', armed with LLMs and AI agents, will not replace the creative intuition of human scientists but will empower it: helping us see patterns we could not, generating options we would not, and carrying out work we did not have time for. Together, human scientists and AI agents can push the frontiers of knowledge faster and further, yielding new understandings of life and new solutions for health. The dawn has broken; it is up to us to ensure that this new light is used wisely and for the benefit of all.

      • During the preparation of this work, the authors used Gemini 3.0 Pro (January 2026) for language refinement and stylistic editing, and Gemini Nano Banana (January 2026) for scientific figure enhancement and visualization. The authors reviewed and edited all content produced with the assistance of these tools, verified its accuracy, and take full responsibility for the integrity and originality of the final manuscript. This work represents the authors' own intellectual contribution, and no AI tool is credited as an author.

      • The authors confirm their contributions to the paper as follows: design and writing of the manuscript and shaping the scope of the review: Huang S, Peng Y; preparation of the manuscript: Huang S, Peng Y, Lang M; critical feedback and revision of the manuscript: Lang M; critical revision of the manuscript for important intellectual content: Chen Z, Yang C, Huang X, Mohtashaminia Z. All authors reviewed and approved the final version of the manuscript.

      • We thank all the members of the Webioinfo team.

      • The authors declare that they have no conflict of interest.

      • Copyright: © 2026 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.
    Cite this article
    Huang S, Lang M, Chen Z, Yang C, Huang X, et al. 2026. From foundation models to autonomous agents in biology. Genomics Communications 3: e006 doi: 10.48130/gcomm-0026-0005
