Homotools: a suite of genomic tools for homologous retrieval and comparison

Hui Liu; Olamide Adesina; Ravi Bika; Rishabh Singh; Mithila Jugulam; Sanzhen Liu

doi:10.48130/gcomm-0024-0002

2024 Volume 1

ARTICLE Open Access

1. Department of Plant Pathology, Kansas State University, Manhattan, KS 66506-5502, USA
2. Department of Agronomy, Kansas State University, Manhattan, KS 66506-5502, USA

Corresponding author: liu3zhen@ksu.edu

Received: 07 August 2024
Revised: 04 September 2024
Accepted: 04 September 2024
Published online: 19 September 2024
Genomics Communications 1, Article number: e002 (2024)

Abstract

Genome sequencing and assemblies offer fundamental data for comprehensively exploring genomic variation among individuals. Genomic variation in genic regions is of particular interest. However, identifying homologous genes or sequences and DNA variation, including structural variation, from multiple genomes is a tedious process. Here we present the software package Homotools, which includes multiple modules for retrieval of best-hit homologs, variant discovery, variant annotation, and visualization of structural comparison. These modules streamline these processes and leverage genomic resources for single-gene studies. The tools can be used for any species as long as assembled genomes and their genome annotations are available. In a case study using Homotools, it is shown that tolerance to the herbicide nicosulfuron is associated with multiple independent genomic variants found in various maize inbred lines, including structural variants due to insertions of transposable elements. The results from Homotools generate valuable testable hypotheses for further examination. Scripts of all modules are publicly available on GitHub (liu3zhenlab/homotools).
- Comparative genomics
- Genome variation
- Structural variation

Supplementary information

Supplemental Table S1 Sensitivity of various maize inbred lines to nicosulfuron.
Supplemental Table S2 Genotypes of genomic polymorphic sites in and around Zm00001eb214410 between Il14H and B73 in other inbred lines.
Supplemental Dataset 1 Homomine report of Zm00001eb001720.
Supplemental File 1 Example codes to illustrate the usages of five modules.

Rights and permissions
Copyright: © 2024 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1] Li H, Durbin R. 2024. Genome assembly in the telomere-to-telomere era. Nature Reviews Genetics 25:658−70 doi: 10.1038/s41576-024-00718-w
[2] Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, et al. 2021. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592:737−46 doi: 10.1038/s41586-021-03451-0
[3] Gong Y, Li Y, Liu X, Ma Y, Jiang L. 2023. A review of the pangenome: how it affects our understanding of genomic variation, selection and breeding in domestic animals? Journal of Animal Science and Biotechnology 14:73 doi: 10.1186/s40104-023-00860-1
[4] Groza C, Schwendinger-Schreck C, Cheung WA, Farrow EG, Thiffault I, et al. 2024. Pangenome graphs improve the analysis of structural variants in rare genetic diseases. Nature Communications 15:657 doi: 10.1038/s41467-024-44980-2
[5] Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, et al. 2018. MUMmer4: a fast and versatile genome alignment system. PLoS Computational Biology 14:e1005944 doi: 10.1371/journal.pcbi.1005944
[6] Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094−100 doi: 10.1093/bioinformatics/bty191
[7] Song B, Marco-Sola S, Moreto M, Johnson L, Buckler ES, et al. 2022. AnchorWave: sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proceedings of the National Academy of Sciences of the United States of America 119:e2113075119 doi: 10.1073/pnas.2113075119
[8] Goel M, Sun H, Jiao WB, Schneeberger K. 2019. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology 20:277 doi: 10.1186/s13059-019-1911-0
[9] Zhang B, Huang H, Tibbs-Cortes LE, Vanous A, Zhang Z, et al. 2023. Streamline unsupervised machine learning to survey and graph indel-based haplotypes from pan-genomes. Molecular Plant 16:975−78 doi: 10.1016/j.molp.2023.05.005
[10] Hufford MB, Seetharam AS, Woodhouse MR, Chougule KM, Ou S, et al. 2021. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373:655−62 doi: 10.1126/science.abg5289
[11] Chen J, Wang Z, Tan K, Huang W, Shi J, et al. 2023. A complete telomere-to-telomere assembly of the maize genome. Nature Genetics 55:1221−31 doi: 10.1038/s41588-023-01419-6
[12] Lin G, He C, Zheng J, Koo DH, Le H, et al. 2021. Chromosome-level genome assembly of a regenerable maize inbred line A188. Genome Biology 22:175 doi: 10.1186/s13059-021-02396-x
[13] Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841−42 doi: 10.1093/bioinformatics/btq033
[14] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. Journal of Molecular Biology 215:403−10 doi: 10.1016/S0022-2836(05)80360-2
[15] Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, et al. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421 doi: 10.1186/1471-2105-10-421
[16] Shyam C, Borgato EA, Peterson DE, Dille JA, Jugulam M. 2021. Predominance of metabolic resistance in a six-way-resistant palmer amaranth (Amaranthus palmeri) population. Frontiers in Plant Science 11:614618 doi: 10.3389/fpls.2020.614618
[17] Hake S, Smith HMS, Holtan H, Magnani E, Mele G, et al. 2004. The role of Knox genes in plant development. Annual Review of Cell and Developmental Biology 20:125−51 doi: 10.1146/annurev.cellbio.20.031803.093824
[18] Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7:539 doi: 10.1038/msb.2011.75
[19] Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32:1792−97 doi: 10.1093/nar/gkh340
[20] Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30:3059−66 doi: 10.1093/nar/gkf436
[21] Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658−59 doi: 10.1093/bioinformatics/btl158
[22] Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, et al. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6:80−92 doi: 10.4161/fly.19695
[23] Liu XM, Xu X, Li BH, Yao XX, Zhang HH, et al. 2018. Genomic and transcriptomic insights into cytochrome P450 monooxygenase genes involved in nicosulfuron tolerance in maize (Zea mays L.). Journal of Integrative Agriculture 17:1790−99 doi: 10.1016/s2095-3119(18)61921-5
[24] Pataky JK, Williams MM, Riechers DE, Meyer MD. 2009. A common genetic basis for cross-sensitivity to mesotrione and nicosulfuron in sweet corn hybrid cultivars and inbreds grown throughout North America. Journal of the American Society for Horticultural Science 134:252−60 doi: 10.21273/jashs.134.2.252
[25] Liu X, Bi B, Xu X, Li B, Tian S, et al. 2019. Rapid identification of a candidate nicosulfuron sensitivity gene (Nss) in maize (Zea mays L.) via combining bulked segregant analysis and RNA-seq. Theoretical and Applied Genetics 132:1351−61 doi: 10.1007/s00122-019-03282-8
[26] Zhang Y, Zhang Q, Liu Q, Zhao Y, Xu W, et al. 2024. Fine mapping and functional validation of the maize nicosulfuron-resistance gene CYP81A9. Frontiers in Plant Science 15:1443413 doi: 10.3389/fpls.2024.1443413
[27] Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, et al. 2023. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381:eadg7492 doi: 10.1126/science.adg7492
[28] Brandes N, Goldman G, Wang CH, Ye CJ, Ntranos V. 2023. Genome-wide prediction of disease variant effects with a deep protein language model. Nature Genetics 55:1512−22 doi: 10.1038/s41588-023-01465-0
[29] Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, et al. 2020. Pangenome graphs. Annual Review of Genomics and Human Genetics 21:139−62 doi: 10.1146/annurev-genom-120219-080406
[30] Guarracino A, Heumos S, Nahnsen S, Prins P, Garrison E. 2022. ODGI: understanding pangenome graphs. Bioinformatics 38:3319−26 doi: 10.1093/bioinformatics/btac308

Cite this article

Liu H, Adesina O, Bika R, Singh R, Jugulam M, et al. 2024. Homotools: a suite of genomic tools for homologous retrieval and comparison. Genomics Communications 1: e002 doi: 10.48130/gcomm-0024-0002

Introduction

With advances in high-throughput long-read sequencing and the development of effective genome assembly algorithms, the genomics community has constructed a wealth of high-continuity genome assemblies, frequently in the form of chromosome-level pseudomolecules, for individuals from a wide variety of species^[1,2]. Such genome assemblies provide fundamental resources for the identification of genomic variation, including single nucleotide polymorphisms, small insertions and deletions, and structural variation^[3,4]. Genomic computational tools have been developed to facilitate genome alignments and variant discovery. For example, NUCMER and minimap2 are two effective sequence aligners that can handle alignments between genome sequences^[5,6]. For alignments between complex genomes with large structural variation, AnchorWave was designed for improved alignment accuracy^[7]. Genomic alignment results can be further processed to identify genomic polymorphisms^[8]. Such analyses are frequently conducted for global genomic comparisons, although the process is computationally intensive.

Genomic variation in and around genes is of particular interest. Without knowing which genes are homologous across genomes, extracting data related to single genes and their homologs from genome-wide comparative results among multiple genomes is not straightforward. In a scenario where genes of all genomes are well annotated, genic comparison could be conducted directly by aligning sequences of gene models at the DNA and protein levels. However, complete and precise gene annotation is a daunting task, and alternative splicing of a gene complicates such analysis. In addition, with substantial resources of genome assemblies, multiple rounds of sequence retrieval, homolog search, variant identification, and annotation are needed to examine a gene of interest and its homologs. The process also typically involves manual examination and judgment. Therefore, a handy computational pipeline is needed to make the process efficient. An online tool, BridgeCereal, represents one such effort to discover structural variation among multiple genomes^[9]. Homotools, a suite of scripts, was developed to ease sequence retrieval of homologs from multiple genomes and comparisons among them. Comprehensive genomic variation can be uncovered, annotated, and visualized by Homotools.

Materials and methods

Genomic data

Maize genomes of 28 inbred lines, including A188, B73, Mo17, and 25 other parents of the Nested Association Mapping (NAM) population, were used for analysis^[10−12]. All data can be downloaded from https://download.maizegdb.org.

Development of Homotools
The majority of scripts were developed using Shell scripting and Perl. Plotting was implemented in R. Frequently used bioinformatics tools include BEDTools^[13], BLAST+^[14,15], and NUCMER^[5].
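
To illustrate how a best-hit homolog might be chosen from BLAST+ results, the following is a minimal Python sketch that keeps the highest-bitscore hit per query from default tabular output (`-outfmt 6`). It is an illustration under stated assumptions, not the Homotools implementation (which uses Shell and Perl); the selection rule (highest bitscore), the function name `best_hits`, and the gene and chromosome names are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query: hit} keeping the highest-bitscore hit per query.

    Assumption: "best hit" means the largest bitscore; a real pipeline
    may also weigh coverage, identity, or the expected chromosome.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        hit["bitscore"] = float(hit["bitscore"])
        hit["sstart"] = int(hit["sstart"])
        hit["send"] = int(hit["send"])
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


# Hypothetical BLAST table: two queries, two candidate hits each.
demo_table = (
    "geneA\tchr1\t98.5\t200\t3\t0\t1\t200\t1000\t1199\t1e-100\t350\n"
    "geneA\tchr5\t90.0\t180\t18\t0\t1\t180\t500\t679\t1e-60\t220\n"
    "geneB\tchr2\t95.0\t150\t7\t1\t1\t150\t300\t449\t1e-70\t250\n"
    "geneB\tchr2\t99.0\t300\t3\t0\t1\t300\t9000\t9299\t1e-150\t540\n"
)

for query, hit in best_hits(io.StringIO(demo_table)).items():
    print(query, hit["sseqid"], hit["sstart"], hit["send"], hit["bitscore"])
```

The subject coordinates of the retained hit could then define the target region from which a homologous sequence is extracted.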

Plant materials and herbicide treatment
Maize inbred lines were provided by the North Central Regional Plant Introduction Unit, USDA, USA. Ten seeds per inbred line were sown in 10.8 cm × 10.8 cm × 9.5 cm pots. The plants were cultivated under greenhouse conditions at 28/21 °C day/night (d/n) temperature and a 16/8 h d/n photoperiod, supplemented with 600 μmol m⁻² s⁻¹ illumination provided by sodium vapor lamps. Seedlings at the 3−5 leaf stage (2 to 3 weeks old) were used for herbicide treatment. At least two seedlings from each inbred line were sprayed with nicosulfuron (Accent Q®, DuPont, Wilmington, USA) at a dose of 137 g active ingredient per hectare. Herbicide application was performed using a bench-track sprayer (DeVries Manufacturing, Hollandale, MN, USA) equipped with a flat-fan nozzle (8002 TeeJet® tip, Spraying Systems, Wheaton, IL, USA), calibrated to deliver 187 liters per hectare at a speed of 4.85 km h⁻¹ at 207 kPa pressure^[16]. Two weeks after treatment, the plants were phenotyped and photographed.

Discussion

The development of Homotools was motivated by the tedious procedure for retrieving sequence data and related information of a gene from a reference genome and of its homologous genes from other genomes. Multiple modules have been developed to simplify identification and comparison of homologs. The tools are particularly useful for single-gene genomic studies of a species with multiple genome sequences. Only genomic databases in standard formats (e.g., FASTA and General Transfer Format, GTF) are required; therefore, the tools can be used for any species for which such data are available. With Homotools, pangenomic data related to an input gene can be readily collected, and related genomic variation, including structural variation, can be accurately identified and annotated. In addition, Homotools outputs publication-quality, high-resolution figures. Specifically, module geneseq can extract sequences of genomic DNA, coding regions, proteins, and related transposable elements as long as the corresponding databases are supplied. Modules homocomp and homomine facilitate homologous retrieval, structural comparison, and variant identification. Modules homostack and homograph graphically visualize sequential alignments and multiple sequence alignments, respectively. Because many software dependencies are required, guidance for creating a Conda environment is provided.
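
As an illustration of the kind of retrieval that geneseq automates, the sketch below pulls the genomic DNA of one gene from a FASTA genome using coordinates from a GTF annotation. It is a minimal Python sketch, not the Homotools implementation; the function names, the small in-memory genome, and the gene ID `geneA` are hypothetical, and the GTF is assumed to contain `gene` feature rows with a `gene_id "..."` attribute.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {sequence_name: sequence}."""
    chunks, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID, drop the description
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def revcomp(seq):
    """Reverse complement; characters other than ACGT are left unchanged."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def gene_sequence(genome, gtf_handle, gene_id, flank=0):
    """Return the genomic DNA of gene_id, optionally with flanking bases.

    GTF coordinates are 1-based and inclusive; minus-strand genes are
    returned as the reverse complement.
    """
    for line in gtf_handle:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        if f'gene_id "{gene_id}"' not in cols[8]:
            continue
        chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        left = max(0, start - 1 - flank)            # convert to 0-based
        right = min(len(genome[chrom]), end + flank)
        seq = genome[chrom][left:right]
        return revcomp(seq) if strand == "-" else seq
    raise KeyError(f"gene not found in GTF: {gene_id}")


# Hypothetical 20-bp genome and a one-gene annotation on the minus strand.
demo_fasta = ">chr1 demo\nAACCGGTTAC\nGTACGTAAAA\n"
demo_gtf = 'chr1\tdemo\tgene\t3\t8\t.\t-\t.\tgene_id "geneA";\n'

genome = read_fasta(io.StringIO(demo_fasta))
print(gene_sequence(genome, io.StringIO(demo_gtf), "geneA"))
```

Coding and protein sequences would additionally require joining exon or CDS rows per transcript, which this sketch does not attempt.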

Modules of Homotools could be further improved to enhance computational effectiveness and efficiency. First, the mapping of a query gene to a genome and the identification of homologous genes in the homocomp and homomine modules could fail due to repetitive sequences. Approximately 3% of genes are subject to this issue in highly repetitive maize genomes. The issue may be mitigated by specifying the target chromosome and/or a rough target region. Alternatively, a more sophisticated algorithm for identifying homologous regions in a genome, such as CHOICE (Clustering HSPs for Ortholog Identification via Coordinates and Equivalence), could be adapted in the future^[9]. Second, the variant annotation in module homomine relies on the software SnpEff, which groups variants into high, moderate, low, and modifier impacts^[22]. Most moderate-impact variants are missense polymorphisms, whose functional impacts can be further quantified using other annotation tools, including artificial intelligence (AI) based approaches developed to assess the functional impacts of missense polymorphisms^[27,28]. Third, homostack performs alignments and plotting based on the order of DNA sequences that users input, and users may be uncertain about which order to use. An algorithm could be added to enable automatic ordering by determining pairwise similarities of input sequences. Fourth, homologous sequences of a gene collected from pangenomes using Homotools could be used to build a pangenome graph of genic sequences^[29]. Module homograph, which visualizes multiple sequence alignments (MSAs), can handle simple structural variations such as insertions and deletions. Inversions and translocations would be better represented in more complex graphical structures^[30]. In addition, the development of algorithms that aid in hypothesizing evolutionary trajectories and events of mutation and recombination among homologous genes would be highly valuable.
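
The automatic ordering suggested for homostack could, for example, be approached with a greedy chain over pairwise similarities. The following Python sketch is one possible approach under stated assumptions, not a feature of Homotools: similarity is approximated with the standard-library `difflib.SequenceMatcher` ratio rather than an alignment-based identity, and the function names and toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Rough similarity in [0, 1]; a stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def order_by_similarity(seqs, first=None):
    """Greedy ordering of sequences for stacked pairwise plots.

    seqs:  dict mapping sequence name to sequence.
    first: name to place on top (default: first key of seqs).
    Each step appends the remaining sequence most similar to the one
    placed last, so adjacent sequences in the stack tend to be alike.
    """
    names = list(seqs)
    if not names:
        return []
    if first is not None and first not in seqs:
        raise KeyError(first)
    sim = {}
    for a, b in combinations(names, 2):
        sim[a, b] = sim[b, a] = similarity(seqs[a], seqs[b])
    order = [first if first is not None else names[0]]
    remaining = [name for name in names if name != order[0]]
    while remaining:
        nxt = max(remaining, key=lambda name: sim[order[-1], name])
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical haplotypes: h1 carries one substitution relative to ref,
# h2 adds an 8-bp insertion to h1, and h3 adds a 10-bp insertion to h2.
demo = {
    "h3": "ATGGCGGGGGGGGGGGTACACCCCCCCCTGAAGTTCGA",
    "h1": "ATGGCGTACATGAAGTTCGA",
    "ref": "ATGGCGTACCTGAAGTTCGA",
    "h2": "ATGGCGTACACCCCCCCCTGAAGTTCGA",
}
print(order_by_similarity(demo, first="ref"))
```

A greedy chain is simple but not guaranteed to be globally optimal; for many long sequences, similarities derived from existing alignments would be a more practical input than `SequenceMatcher`.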

In summary, it has been shown that Homotools is useful for readily identifying best-hit homologous genes and genomic polymorphisms at the gene level. An online application tailored for the maize community has been developed, with the potential to extend similar platforms to a wide range of species. As the Homotools package continues to be refined and enhanced to ensure greater reliability and efficiency, community feedback will play a vital role in future improvement. We look forward to collaborative efforts to further advance the utility of this tool.

Author contributions

The authors confirm contribution to the paper as follows: herbicide tolerance experiment design and execution: Liu H, Adesina O, Singh R, Jugulam M; Homotools use and testing: Liu H, Adesina O, Bika R; Homotools package development: Liu S; draft manuscript preparation: Liu H, Liu S. All authors reviewed and revised the manuscript.
