Phylogenetic Analysis Guides Transporter Protein Deorphanization: A Case Study of the SLC25 Family of Mitochondrial Metabolite Transporters

Affiliations

  • 1 Cellular and Molecular Physiology Department, Yale School of Medicine, New Haven, CT 06510, USA.
  • 2 Systems Biology Institute, Yale West Campus, West Haven, CT 06516, USA.
  • 3 Yale College, New Haven, CT 06511, USA.
  • PMID: 37759714
  • PMCID: PMC10526428
  • DOI: 10.3390/biom13091314

Homology search and phylogenetic analysis are commonly used to annotate gene function, although they are prone to error. We hypothesize that the power of homology search in functional annotation depends on how tightly sequence variation is coupled to functional diversification, and we focus here on the SoLute Carrier (SLC25) family of mitochondrial metabolite transporters to survey this coupling in a family-wide manner. The SLC25 family is the largest family of mitochondrial metabolite transporters in eukaryotes. Its members translocate ligands with different chemical properties, ranging from nucleotides and amino acids to carboxylic acids and cofactors, so the family offers ample experimentally validated functional diversification in ligand transport. Here, we use phylogenetic analysis to profile SLC25 transporters across common eukaryotic model organisms, from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Danio rerio to Homo sapiens, and assess their sequence adaptations to the transported ligands within individual subfamilies. Using several recently studied and poorly characterized SLC25 transporters, we discuss the potential and limitations of phylogenetic analysis in guiding functional characterization.

Keywords: SLC25; deorphanization; metabolite transport; mitochondria; phylogenetic analysis.

Grants and funding

  • R35 GM150619/GM/NIGMS NIH HHS/United States

  • Open access
  • Published: 20 August 2022

Phylogenetic analysis based on single-copy orthologous proteins in highly variable chloroplast genomes of Corydalis

  • Xianmei Yin 1,
  • Feng Huang 1,
  • Xiaofen Liu 1,
  • Jiachen Guo 1,
  • Ning Cui 2,
  • Conglian Liang 3,
  • Yan Lian 1,
  • Jingjing Deng 1,
  • Hongxiang Yin 1 &
  • Guihua Jiang 1

Scientific Reports volume 12, Article number: 14241 (2022)

  • Phylogenetics

Corydalis is one of the few lineages reported to have extensive large-scale chloroplast genome (cp-genome) rearrangements. In this study, novel cp-genome rearrangements of Corydalis pinnata, C. mucronata, and C. sheareri are described. C. pinnata is a narrow endemic species distributed only at Qingcheng Mountain in southwest China. Two independent relocations of the same four genes (trnM-CAU-rbcL) were found, from the typically posterior part of the large single-copy region to the front of it. A uniform inversion of an 11–14-kb segment (ndhB-trnR-ACG) was found in the inverted repeat region, and extensive losses of the accD, clpP, and trnV-UAC genes were detected in the cp-genomes of all three species. In addition, a phylogenetic tree was reconstructed based on 31 single-copy orthologous proteins in 27 cp-genomes. This study provides insights into the evolution of cp-genomes throughout the genus Corydalis and also provides a reference for further studies on the taxonomy, identification, phylogeny, and genetic transformation of other lineages with extensive rearrangements in cp-genomes.

Introduction

Corydalis DC. is a large and diverse genus, with ~786 species, within the family Papaveraceae (http://www.worldfloraonline.org/downloadData [accessed 9 December 2021]). Plants belonging to the genus Corydalis are distributed in the Hengduan Mountains and Qinghai–Tibet Plateau and adjacent areas 1. The structures of some recognized Corydalis chloroplast genomes (cp-genomes) have undergone a series of genetic rearrangements, such as pseudogenization or the loss of genes, to adapt to drastic changes in the environment 1, 2, 3, 4. Corydalis pinnata is a narrow endemic species in China and is only distributed along the streams of Qingcheng Mountain in southwest China at altitudes between 1300 m a.s.l. and 1400 m a.s.l. Consequently, this species must also have undergone a unique genetic shift.

Most Corydalis plants have potential as medicinal agents because of their therapeutic effects against hepatitis, tumors, cardiovascular diseases, and pain 5, 6, but some species are toxic 7. As one of the most taxonomically challenging plant taxa, the genus Corydalis has extremely complex morphological variation because of typical reticulate evolution and intense differentiation during evolution 8, which has hampered the identification, taxonomy, and utilization of members of this genus.

Chloroplasts are common organelles with an essential role in the photosynthesis of green plants 9. The cp-genome is an ideal research model for studying molecular identification, phylogeny, species conservation, and genome evolution because of its conserved structure 10, 11. The increasingly wide application of the cp-genome super-barcode in identification makes the development of new cp-genome resources urgent and significant 12, 13. Cp-genome rearrangements can also be useful as phylogenetic markers because they lack homoplasy and are easily identified 14, 15, 16. Although some genetic rearrangements of Corydalis cp-genomes have been reported 1, 2, the pattern, origin, evolution, and phylogenetic relationship of cp-genome rearrangements in Corydalis remain unclear because of a lack of sufficient genetic resources. In the present study, three species of the genus Corydalis from Qingcheng Mountain, including a narrow endemic species, were identified based on their cp-genomes. In addition, 12 Corydalis cp-genomes from the National Center for Biotechnology Information (NCBI) database were included in the rearrangement analysis to represent all five subgenera of Corydalis and cover most of the distribution areas. The structural characteristics, repeat sequences, and cp-genome rearrangements were documented, and phylogenetic trees based on single-copy orthologous proteins were analyzed. The aim of the study was to assess structural variation and provide valuable resources for identification and classification of members of the genus Corydalis.

DNA features of three Corydalis cp-genomes

Cp-genomes of three species of the genus Corydalis were sequenced: Corydalis pinnata, C. mucronata, and C. sheareri. The sizes of the three newly sequenced Corydalis cp-genomes ranged from 158,399 bp (C. pinnata) to 161,105 bp (C. sheareri) (Table 1). The guanine+cytosine (G+C) contents of the three genomes were 39.6%–40.47%. The three species each had a cp-genome with the typical angiosperm quadripartite structure: a large single-copy (LSC) region, a small single-copy (SSC) region, and a pair of inverted repeats (IRs: IRA and IRB). The lengths of the LSC, SSC, and IR regions of the three newly sequenced Corydalis cp-genomes were 87,573–90,438, 20,408–23,322 and 23,778–25,209 bp, respectively (Table 1). After annotation, the whole cp-genome sequences of the three Corydalis plants were submitted to the NCBI database; GenBank accession numbers are supplied in Table 1. The C. pinnata cp-genome was taken as an example, and a physical map of the cp-genome was created according to the annotation results using OrganellarGenomeDRAW (OGDraw) 17 (Fig. 1). A total of 115–117 unique genes, comprising 80–83 protein-coding genes, 28–30 tRNA genes, 4 rRNA genes, and 4–6 pseudogenes, were present in the three newly sequenced Corydalis cp-genomes (Table 1). In total, seven genes were pseudogenized in one or more Corydalis cp-genomes, and three genes (accD, clpP, and trnV-UAC) were lost in the three newly sequenced Corydalis cp-genomes (Supplementary Table 1).
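
The two summary quantities reported above, G+C content and the total length of a quadripartite genome, can be illustrated with a minimal Python sketch. The function names and the toy inputs are assumptions made for illustration; they are not taken from the study's pipeline or from Table 1.

```python
def gc_content(seq: str) -> float:
    """Return the G+C percentage of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / acgt


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total genome length: LSC + SSC + two copies of the inverted repeat."""
    return lsc + ssc + 2 * ir


# Toy inputs, invented for illustration only.
toy_sequence = "ATGCGCATTA"
print(gc_content(toy_sequence))                       # 4 of 10 bases are G or C
print(quadripartite_length(lsc=100, ssc=20, ir=30))   # 100 + 20 + 2 * 30
```

Because the IR is present twice, a change in IR length shifts the total genome size by twice that amount, which is why IR expansion or contraction has a large effect on cp-genome size.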

Figure 1. Gene map of the chloroplast genome of C. pinnata. Genes within the circle are transcribed clockwise, and those outside are transcribed counterclockwise. Genes belonging to different functional groups are colour coded. The dark grey in the inner circle corresponds to DNA G+C content, and the light grey corresponds to A+T content.

In contrast to previously reported Corydalis cp-genomes 1, the three newly sequenced Corydalis cp-genomes in this study had 11 complete ndh genes (Supplementary Table 1). In addition, among all the predicted genes of the three Corydalis cp-genomes, introns were found in 11–13 genes, including 4–5 tRNA genes and 7–8 protein-coding genes (Supplementary Table 2). The tRNA genes with introns were trnL-UAA, trnK-UUU, trnI-GAU, trnG-UCC, and trnA-UGC. The eight protein-coding genes with introns were rps12, rpoC1, rpl2, pafI, ndhB, ndhA, atpF, and ycf2. Two of the 13 intron-containing genes had two introns (rps12 and pafI); the remaining genes contained only one intron. The trnK-UUU gene contained the largest intron (2474–2488 bp), which contained the whole matK gene. Similar to other angiosperms, the gene rpl2 in the three Corydalis cp-genomes resulted from trans-splicing activity. The 5ʹ end of rpl2 lay in the LSC region, and the 3ʹ end was located in the IR region (Supplementary Table 2).

Chloroplast genome structure rearrangement

Seventeen cp-genomes were included in the syntenic comparisons by Mauve alignment (Fig.  2 ), including 15 Corydalis cp-genomes, a representative of Papaveroideae cp-genomes ( Macleaya microcarpa ), and a sister in the Ranunculales cp-genomes ( Euptelea pleiosperma ) to represent a typical angiosperm quadripartite cp-genome structure. More than 30 locally collinear blocks (LCBs) were identified in the Corydalis cp-genomes, from which 15 rearrangements were deduced (Fig.  2 ).

Figure 2. Chloroplast structural alignment of Corydalis and E. pleiosperma using Mauve. LCBs are represented by coloured blocks, and the blocks connected by lines indicate homology. Violet lines below each cp-genome represent IR regions, and green lines represent SSC regions. Species in bold are newly sequenced.

A total of 16 relocation blocks were identified in the 15 Corydalis cp-genomes. Block 1 (approximately 6 kb) of 10 Corydalis cp-genomes contained 4–5 genes (trnM-CAU, atpE, atpB, rbcL, and trnV-UAC) relocated from the classically posterior part of the LSC region (downstream of the ndhC gene) to the front. Of the cp-genomes with block 1, three from subgenus Sophorocapnos (C. saxicola, C. fangshanensis, and C. tomentella) displayed a different type of relocation (downstream of the atpH gene) from the other subgenera (downstream of the trnK-UUU gene). In the cp-genome of C. adunca (subg. Cremnocapnos), block 2, carrying 1 kb of the rps16 gene, was relocated from the typical LSC region to downstream of the ndhF gene in the IR region. In addition, blocks 5–7, approximately 13 kb in the IR region and containing 11 genes (ndhB-trnR-ACG), were inverted uniformly in C. pinnata, C. mucronata, C. hsiaowutaishanensis (subg. Corydalis), C. sheareri (subg. Rapiferae), C. adunca (subg. Cremnocapnos), C. saxicola, and C. fangshanensis. In C. conspersa (subg. Rapiferae) and C. davidii (subg. Fasciculatae), block 6 was also inverted but blocks 5 and 7 were lost. Blocks 12–15 (~8 kb) in the SSC region, containing five genes (ndhA-ycf1), were inverted uniformly in C. hsiaowutaishanensis, C. conspersa, C. davidii, C. adunca, C. saxicola, and C. inopinata. Moreover, blocks 9–11 were inverted together with blocks 12–15 in C. hsiaowutaishanensis, C. conspersa, and C. inopinata, whereas blocks 9–11 and 16 underwent various degrees of loss in C. davidii, C. adunca, C. saxicola, and C. fangshanensis. In addition, blocks 3–8, approximately 36 kb in the IR region and containing 54 genes (trnN-GUU-psaI), were inverted uniformly in C. tomentella (subg. Sophorocapnos) compared with C. fangshanensis and C. saxicola from the same subgenus.
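
One way to reason about the block patterns described above is to write each genome as a signed block order, where a negative number marks a block in reverse orientation relative to the reference. The following Python sketch is illustrative only: the function name, the eight-block reference, and the two toy genomes are assumptions and do not reproduce the 16-block layout in Fig. 2 or the output of Mauve.

```python
from typing import Dict, List


def classify_blocks(reference: List[int], genome: List[int]) -> Dict[str, List[int]]:
    """Report which reference blocks are inverted or lost in a signed block order.

    Each entry of `genome` is a block number; a negative sign means the block
    is present in reverse orientation relative to the reference.
    """
    present = {abs(block) for block in genome}
    inverted = sorted(abs(block) for block in genome if block < 0)
    lost = [block for block in reference if block not in present]
    return {"inverted": inverted, "lost": lost}


reference_order = list(range(1, 9))  # hypothetical 8-block reference

# Toy genome A: blocks 5-7 inverted as a unit (order and orientation both flip).
genome_a = [1, 2, 3, 4, -7, -6, -5, 8]
# Toy genome B: block 6 inverted, blocks 5 and 7 absent.
genome_b = [1, 2, 3, 4, -6, 8]

print(classify_blocks(reference_order, genome_a))
print(classify_blocks(reference_order, genome_b))
```

The sketch deliberately ignores relocations, which would require comparing block adjacencies rather than orientation alone.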

Comparison of genomic variation in the three newly sequenced Corydalis cp-genomes and C. edulis cp-genome

Previous studies reported a marked IR region expansion in some Corydalis cp-genomes; the IR region expanded into the small single-copy (SSC) region and led to IR–SSC boundary variations 1, 2. In the present study, the three newly sequenced Corydalis cp-genomes were compared with the C. edulis cp-genome, which exhibited a typical angiosperm quadripartite cp-genome structure (Fig. 3). The location of the IR region in the three newly sequenced Corydalis cp-genomes was relatively conserved (Fig. 3). In these three species, rps19 was located in the LSC region, and ndhF was in the SSC region. The coding region of rpl2 was in the IR region of the C. pinnata cp-genome but spanned the LSC and IRa regions of the C. mucronata and C. sheareri cp-genomes; as a result, the copy at the IRb/LSC boundary lost its 5′ end and became a pseudogene.

Figure 3. Comparison of genome boundaries in chloroplasts from C. pinnata, C. mucronata, C. sheareri and C. edulis.

The C. edulis cp-genome was used as a reference to ascertain differences in the genomic sequences of the three newly sequenced Corydalis cp-genomes (Fig. 4a,b). The rearranged regions exhibited higher variability than the other regions of the four Corydalis cp-genomes studied (Fig. 4a). Similar to other cp-genomes of angiosperms, most of the protein-coding genes were highly conserved. The exceptions, which showed a higher degree of variation, were some protein-coding genes (e.g., rps19, rpl22, ycf1 and ycf2), intron regions (pafI, ndhA and rpl2), and intergenic regions (trnQ-UUG-psbK, psbK-psbI, atpF-atpH, atpH-atpI, rpoB-trnC-GCA, trnC-GCA-petN, trnT-GGU-psbD, trnE-UUC-trnT-GGU, trnD-GUC-trnY-GUA, psaA-pafI, pafI-trnS-GGA, rps4-trnT-UGU, trnT-UGU-trnL-UAA, trnR-ACG-trnL-CAA, and trnN-GUU-ndhB). Such higher-resolution loci have the potential to be used as barcodes in species identification.
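
The idea behind a sliding window analysis of variability can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the source does not give the window size, step size, or software used for this analysis, so the function names, the parameters, and the three-sequence toy alignment below are all hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def window_diversity(alignment: List[str], start: int, size: int) -> float:
    """Mean pairwise proportion of differing sites in one alignment window.

    Columns where either sequence of a pair has a gap ('-') are skipped.
    """
    proportions = []
    for seq_a, seq_b in combinations(alignment, 2):
        columns = [
            (a, b)
            for a, b in zip(seq_a[start:start + size], seq_b[start:start + size])
            if a != "-" and b != "-"
        ]
        if columns:
            proportions.append(sum(a != b for a, b in columns) / len(columns))
    return sum(proportions) / len(proportions) if proportions else 0.0


def sliding_diversity(alignment: List[str], window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window start, diversity) pairs along an aligned set of sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        (start, window_diversity(alignment, start, window))
        for start in range(0, length - window + 1, step)
    ]


# Toy alignment, invented for illustration.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGTTCGA"]
for start, value in sliding_diversity(toy_alignment, window=4, step=4):
    print(start, round(value, 3))
```

Windows with high values correspond to the variable loci that are candidates for barcodes; in the toy alignment only the second window contains differences.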

Figure 4. Comparative analyses of genomic differences in the chloroplasts of C. pinnata, C. mucronata, C. sheareri and C. edulis. (a) Sliding window analyses of the entire cp-genome. (b) Alignment visualisation of the chloroplast genome sequences of C. pinnata, C. mucronata, C. sheareri and C. edulis using mVISTA. Grey arrows and thick black lines indicate genes and their orientation. Purple bars indicate exons, blue bars represent untranslated regions, pink bars represent non-coding sequences, and grey bars denote mRNA. The similarity among the chloroplast genomes is shown on a vertical scale ranging from 50 to 100%.

Analyses of long repetitive sequences and SSRs

Interspersed repeated sequences (IRSs) with a repeat unit length of ≥ 39 bp were evaluated in the chloroplast genomes of C. pinnata, C. mucronata, and C. sheareri. These repeats comprised only forward and palindromic repeats and lacked the reverse and complementary repeats that are common in other species. Fifty IRSs were found, and among these, the sequence lengths in C. pinnata, C. mucronata, and C. sheareri were 40–49, > 80, and ≤ 49/≥ 80 bp, respectively. The IRS analyses of the chloroplast genomes are shown in Fig. 5a–c.

Figure 5. Long repetitive sequences and SSR distribution in the chloroplast genomes of C. pinnata, C. mucronata and C. sheareri.

In total, 46 SSRs were found in C. pinnata, including 38 mononucleotide repeats, 1 dinucleotide repeat, and 5 trinucleotide repeats; 51 SSRs were identified in C. mucronata, including 43 mononucleotide repeats, 1 dinucleotide repeat, 5 trinucleotide repeats, and 2 pentanucleotide repeats; and 46 SSRs were found in C. sheareri, including 35 mononucleotide repeats, 1 dinucleotide repeat, 5 trinucleotide repeats, 2 tetranucleotide repeats, and 3 hexanucleotide repeats (Fig. 5d).

Phylogenetic analyses

Using concatenated single-copy orthologous proteins to resolve phylogenetic relationships can avoid tree reconstruction being misled by rearrangements and provides a more reliable evolutionary framework than using several specific genes 18. Therefore, the predicted proteome was used in the phylogenetic analyses rather than the whole cp-genome sequence. Based on 31 single-copy orthologous proteins conserved in 27 species, with E. pleiosperma as the outgroup, a maximum-likelihood (ML) phylogenetic tree was reconstructed to illuminate the evolutionary history of the compared species (Fig. 6). The ML tree had three major clades: the Fumarioideae clade, the Papaveroideae clade, and a clade with the remaining members of the Ranunculales. Corydalis constituted a monophyletic sub-clade nested within the Fumarioideae clade. All lineages within Corydalis were strongly supported. The three newly sequenced Corydalis cp-genomes, namely, C. pinnata (Sect. Mucronatae), C. mucronata (Sect. Mucronatae), and C. sheareri (Sect. Asterostigmata), were closely related.
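
Concatenation of per-protein alignments into a single supermatrix, as described above, can be sketched in a few lines of Python. This is an illustrative sketch rather than the study's code: the function name, the species labels, and the toy alignments are assumptions, and real input would come from aligned FASTA files.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-protein alignments into one supermatrix keyed by species.

    Every alignment must contain the same species, each exactly once, and all
    sequences within one alignment must have the same aligned length.
    """
    species = sorted(alignments[0])
    parts: Dict[str, List[str]] = {name: [] for name in species}
    for alignment in alignments:
        if sorted(alignment) != species:
            raise ValueError("each alignment needs one sequence per species")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("sequences within an alignment must be equal in length")
        for name in species:
            parts[name].append(alignment[name])
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy aligned proteins for two hypothetical species.
protein_1 = {"species_a": "MK-L", "species_b": "MKAL"}
protein_2 = {"species_a": "GG", "species_b": "GA"}
print(concatenate_alignments([protein_1, protein_2]))
```

Because each protein is aligned separately before concatenation, the position of its gene in the genome does not matter, which is the property that makes this approach insensitive to relocations and inversions.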

Figure 6. ML tree of Corydalis and related species based on single-copy orthologous proteins. Euptelea pleiosperma was used as the outgroup. The classification of the genus Corydalis follows World Flora Online. Bootstrap support values on the branches were calculated from 1000 replicates. The three newly sequenced Corydalis species are marked in bold.

Although the three newly sequenced Corydalis cp-genomes from the same geographic region belong to two different subgenera of Corydalis , the sizes and structures of their LSC, IR, and SSC regions, as well as their total genomes, are highly similar. This includes similar gene losses, inversions, and relocations (Fig.  1 and Supplementary Table 1 ), which are common features in the Corydalis cp-genomes and are considered to be responsible for the variation in cp-genome sizes 1 .

The loss of three genes (accD, clpP, and trnV-UAC) is a synapomorphic characteristic in the Corydalis cp-genomes (Supplementary Table 1). Xu et al. 1 speculated that the loss of the accD gene occurred before divergence of the genus Corydalis. However, in the present study, the accD gene was found in the cp-genomes of a few species of the subgenus Rapiferae (Supplementary Table 1), which indicates that the loss occurred after divergence of the genus Corydalis. The exact time of the loss event should be further explored by gathering more information on Corydalis cp-genomes. The accD gene has been relocated to the nucleus in some species, such as some members of the family Campanulaceae 19, 20. The pseudogenization or loss of 11 chloroplast ndh genes that encode NADH dehydrogenase subunits only occurred in a few species of the genus Corydalis (C. conspersa, C. davidii, C. adunca, and C. inopinata; Supplementary Table 1). Strikingly, these species are all located in high-altitude areas (1000–5200 m a.s.l.) 21. Therefore, extreme changes in the environment may result in gene deletions or pseudogenization; this phenomenon has been observed in other species 22. Further studies are required to determine whether the pseudogenization or loss of ndh genes affects photosynthesis in those plants.

The chloroplast genome, as the genome of a photosynthetic organelle, is highly conserved in terms of structure, gene content, and arrangement 23, 24, 25. Large-scale rearrangement exists only occasionally in a few lineages, such as Campanulaceae 16, 17, 26, 27, 28, Ranunculaceae 29, 30, Geraniaceae 31, 32, 33, 34, 35, 36, Fabaceae 15, 37, 38, 39, 40, 41, 42, 43, 44, Oleaceae 45, Asteraceae 46, 47, 48, 49, Plantaginaceae 50, 51, 52, Euphorbiaceae 53 and Poaceae 14, 54, 55, 56, 57. In the present study, rearrangement predominantly occurred in 16 regions (blocks 1–16, Fig. 2) of Corydalis plants, which determine the diversity in Corydalis cp-genomes. Repeat sequences may contribute to structural variations in relatively stable rearrangement regions 58, 59, 60. Relocation only occurred in the LSC region of the Corydalis cp-genomes, and inversion only occurred in the IR and SSC regions (Fig. 2). This suggested that the patterns of relocation and inversion were regulated in different ways. In addition, blocks 1–16 are likely active rearrangement regions because they have various rearrangement patterns. C. hsiaowutaishanensis (subg. Corydalis), C. adunca (subg. Cremnocapnos), C. saxicola, and C. fangshanensis (subg. Sophorocapnos) all underwent the inversion of blocks 10–16, but the inversion boundaries of C. hsiaowutaishanensis expanded into block 9, suggesting that the inversion of blocks 9–16 in C. hsiaowutaishanensis was an independent event. Furthermore, some species from different subgenera have the same relocation or inversion pattern, such as the three Corydalis plants (C. pinnata, C. mucronata, and C. sheareri) collected from Qingcheng Mountain in the current study. Although they represent two subgenera, these three species have an almost identical relocation/inversion pattern in their cp-genomes (Fig. 2). Moreover, blocks 5–7 underwent at least two inversions in C. tomentella; blocks 5–7 were initially inverted independently and were then inverted together with blocks 3, 4, and 8. This active rearrangement suggested that relocation or inversion in Corydalis cp-genomes might be affected by the geographical environment.

Loss of introns and/or genes is instrumental in the regulation of gene expression and can control gene expression temporally and in a tissue-specific manner 61, 62, 63. The mechanisms by which introns regulate gene expression in plants and animals have been reported 63, 64, 65. However, the link between gene expression and intron loss in Corydalis has not been reported. Further experimental work on the roles of introns in Corydalis is therefore essential and should prove interesting. Highly variable DNA barcodes play an important role in species identification and phylogenetic analyses. In the current study, protein-coding genes (rps19, rpl22, ycf1, and ycf2), intron regions (pafI, ndhA, and rpl2), and intergenic regions (trnQ-UUG-psbK, psbK-psbI, atpF-atpH, atpH-atpI, rpoB-trnC-GCA, trnC-GCA-petN, trnT-GGU-psbD, trnE-UUC-trnT-GGU, trnD-GUC-trnY-GUA, psaA-pafI, pafI-trnS-GGA, rps4-trnT-UGU, trnT-UGU-trnL-UAA, trnR-ACG-trnL-CAA, and trnN-GUU-ndhB) exhibited some degree of variation and have great potential as DNA markers (Fig. 4b).

Cp-genomes have made marked contributions to the phylogenetic studies of angiosperms and to resolving the evolutionary relationships within phylogenetic clades 66, 67. However, active rearrangement in Corydalis cp-genomes may mislead the reconstruction of species phylogenetic relationships based on the DNA sequences of cp-genomes. Phylogenetic reconstruction of the genus Corydalis was previously explored with DNA barcoding 68 or relatively conserved nucleotide fragments in cp-genomes 1. However, deep relationships remained poorly resolved by approaches that apply only a few plastid markers. Some studies reported that the protein-coding genes shared by all taxa could be used to reconstruct a phylogeny 2, 34. However, single-copy genes (SCGs) have subsequently emerged as candidates for phylogenetic analysis because paralogues are derived from duplication events rather than speciation events and should therefore be discarded from phylogenetic analyses 69, 70. Therefore, the 31 single-copy orthologous proteins in all 27 cp-genomes were used to reconstruct the phylogeny of the genus Corydalis. Three distinct clades were defined by high bootstrap values (Fig. 6) in the resulting phylogenetic tree, which is consistent with previous studies based on molecular markers 1, 71. This indicated that the application of the single-copy orthologous proteins of cp-genomes can improve the resolution of the phylogeny and taxonomy of the genus Corydalis. Findings from the study also provide a reference for the taxonomy and identification of other plants with extensive rearrangement in cp-genomes.

Conclusions

The cp-genomes of three species of the genus Corydalis (C. pinnata, C. mucronata, and C. sheareri) from Qingcheng Mountain in southwest China, including a narrow endemic species (C. pinnata), were characterized. The cp-genomes of the three species exhibited large-scale rearrangement, including the relocation of four genes (trnM-CAU-rbcL) in the LSC region, the inversion of an 11–14-kb segment (ndhB-trnR-ACG) in the IR region, and the loss of three genes (accD, clpP, and trnV-UAC). The three Corydalis cp-genomes showed high similarity in terms of genome size, gene classes, gene sequences, rearrangement pattern, and distribution of repeat sequences. In addition, the structural alignment of 17 Corydalis cp-genomes with the typical chloroplast genomic structure of angiosperms (E. pleiosperma) revealed frequent and extensive large-scale rearrangement in the Corydalis cp-genomes. Among them, the relocation of two blocks (trnM-CAU-rbcL and rps16) frequently appeared in the LSC region, and the inversion of four blocks (rpl23-trnL-CAA, ndhB-trnR-ACG, trnN-GUU, and ndhA-ycf1) frequently appeared in the IR and SSC regions. The extensive large-scale cp-genome rearrangement may mislead phylogenetic analysis based on cp-genomes. Single-copy orthologous proteins of cp-genomes were therefore used to reconstruct the phylogeny of the genus Corydalis. This method was concluded to have good prospects for elucidating the phylogeny and taxonomy of Corydalis and could potentially be employed for the phylogenetic analysis of other lineages with extensively rearranged cp-genomes in future studies. Findings from this study provide a reference for further studies on the taxonomy, identification, and evolution of the genus Corydalis.

Materials and methods

Plant collection and sampling

The aboveground parts of the three plant species were collected from Qingcheng Mountain, Sichuan Province, China (C. sheareri, location: E 103°32ʹ4″ N 30°54ʹ5″, altitude: 720 m a.s.l.; C. mucronata, location: E 103°28ʹ35″ N 30°28ʹ35″, altitude: 980 m a.s.l.; C. pinnata, location: E 103°25ʹ27″ N 30°65ʹ5″, altitude: 1350 m a.s.l.). The voucher specimens were deposited in the herbarium of the College of Pharmacy, Chengdu University of Traditional Chinese Medicine, China (deposition numbers: C. sheareri, CDCM0005283; C. mucronata, CDCM0005284; C. pinnata, CDCM0005285). The collection of samples conformed to the management provisions of the List of State-protected Wild Plants and was approved by the National Forestry and Grassland Administration of China (Supplementary Fig. 1). The specimens were identified by Professor Guihua Jiang.

DNA sequencing, assembly and validation of the chloroplast genome

A modified cetyltrimethylammonium bromide (CTAB) method was used for DNA extraction, and the NEBNext Ultra DNA Library Prep Kit for Illumina sequencing was used for 500-bp paired-end library construction. A shotgun library (250 bp) was constructed according to the manufacturer’s (Vazyme Biotech, Nanjing, China) instructions. Sequencing was performed on the HiSeq X Ten platform (Illumina, San Diego, CA, USA) using paired-end 150-bp sequencing 10. Total raw data from a sample was approximately 10.0 G, and > 300 million paired-end reads were attained.

Raw data were filtered by Skewer-0.2.2 22 72. The resulting reads were used for genome assembly by GetOrganelle version 1.7.5 73. Another assembly for each species of the genus Corydalis was performed by ABySS with C. edulis as the reference to confirm the GetOrganelle assemblies. The draft genome was used to map clean reads by BWA version 0.7.17 74, and then clean reads were filtered using SAMtools version 1.7 75. Mapping was visualized by IGV version 2.10.0 76 to check the concatenation of contigs 1. Furthermore, junction splicing sites were verified with polymerase chain reaction (PCR) and Sanger sequencing. All of the contigs were aligned to the reference cp-genome of C. edulis with MUMmer version 4.0 77. Finally, the sequences were extended and gaps were filled with SSPACE-3.0 78.

Gene annotation and sequence analyses

Sequence annotation was performed with Plann version 1.1.2 79, using the cp-genome of C. conspersa as a reference, followed by manual correction. BLAST and Apollo 80 were used to check the start and stop codons and the intron/exon boundaries with the cp-genome of C. conspersa as a reference sequence. Complete cp-genome sequences were submitted to the NCBI. A physical map of the cp-genomes was generated with OrganellarGenomeDRAW (OGDraw) 81 (http://ogdraw.mpimp-golm.mpg.de/).

Genome structure analyses

To determine synteny and identify possible rearrangements, 19 cp-genomes were compared using Mauve 2.4.0 82 with the “progressiveMauve” algorithm, including 17 Corydalis cp-genomes, the cp-genome of Macleaya microcarpa (NC_039623) representing Papaveroideae, and the cp-genome of Euptelea pleiosperma (NC_029429) representing a typical angiosperm cp-genome. The Mauve result was then manually modified to show the notable rearrangements. The cp-genomes of species of the genus Corydalis were compared by mVISTA 83 (Shuffle-LAGAN mode) using the genome of C. edulis as the reference. Tandem Repeats Finder 84 was used to detect tandem repeats, and forward and palindromic repeats were identified with REPuter 85. SSRs were detected by Misa.pl 86 using search parameters of mononucleotides set to ≥ 10 repeat units, dinucleotides ≥ 8 repeat units, trinucleotides and tetranucleotides ≥ 4 repeat units, and pentanucleotides and hexanucleotides ≥ 3 repeat units.
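
The SSR thresholds listed above can be made concrete with a small regular-expression scan in Python. This is a sketch of the thresholding logic only, not a reimplementation of Misa.pl: the function name and the toy sequence are assumptions, and details such as compound SSR handling are deliberately left out.

```python
import re
from typing import List, Tuple

# Minimum repeat counts per motif length, as given in the text above.
MIN_REPEATS = {1: 10, 2: 8, 3: 4, 4: 4, 5: 3, 6: 3}


def find_ssrs(sequence: str) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat count) for perfect SSRs meeting the thresholds."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_repeats in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or "ATAT",
            # so that each tract is classified by its shortest motif.
            is_redundant = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if not is_redundant:
                hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Toy sequence, invented for illustration: A x12, AT x8 and AAG x4 separated by "GC".
toy = "GC" + "A" * 12 + "GC" + "AT" * 8 + "GC" + "AAG" * 4 + "C"
for start, motif, count in find_ssrs(toy):
    print(start, motif, count)
```

A run of nine A bases or seven AT units would not be reported, because it falls below the mononucleotide and dinucleotide thresholds respectively.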

Twenty-seven cp-genomes were used to reconstruct a phylogenetic tree. First, single-copy orthologous proteins were extracted by OrthoFinder version 2.3.8 87. Next, the sequences were aligned by MUSCLE version 3.8, and then the best-fit models of amino acid substitution were estimated by ProtTest version 3.4 88, with the model having the best corrected Akaike Information Criterion (AICc) value selected. Finally, an ML phylogenetic tree was reconstructed by RAxML version 8.2.12 89, including tree robustness assessment using 1000 replicates of rapid bootstrap, with the HIVb + I + G + F substitution model based on the results of ProtTest.
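
The criterion behind the first step, keeping only orthogroups with exactly one protein in every species, can be sketched in Python as follows. The function name, orthogroup identifiers, species labels, and counts are hypothetical and are not taken from OrthoFinder output for this study.

```python
from typing import Dict, List


def single_copy_orthogroups(
    counts: Dict[str, Dict[str, int]], species: List[str]
) -> List[str]:
    """Return orthogroups that have exactly one protein in every listed species.

    `counts` maps an orthogroup identifier to a per-species protein count;
    a species missing from the inner dictionary is treated as a count of zero.
    """
    return sorted(
        orthogroup
        for orthogroup, per_species in counts.items()
        if all(per_species.get(name, 0) == 1 for name in species)
    )


# Hypothetical counts for three species.
toy_counts = {
    "OG1": {"sp_a": 1, "sp_b": 1, "sp_c": 1},  # single copy everywhere: kept
    "OG2": {"sp_a": 2, "sp_b": 1, "sp_c": 1},  # duplicated in sp_a: dropped
    "OG3": {"sp_a": 1, "sp_b": 1},             # absent from sp_c: dropped
}
print(single_copy_orthogroups(toy_counts, ["sp_a", "sp_b", "sp_c"]))
```

Genes that are lost or duplicated in any genome fail this filter, which explains why only a subset of the annotated protein-coding genes is shared as single-copy orthologues across all 27 cp-genomes.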

Xu, X. & Wang, D. Comparative chloroplast genomics of Corydalis Species (Papaveraceae): Evolutionary perspectives on their unusual large scale rearrangements. Front. Plant Sci. 11 , 2243 (2021).

Ren, F. et al. Highly variable chloroplast genome from two endangered Papaveraceae lithophytes Corydalis tomentella and Corydalis saxicola . Ecol. Evol. 11 , 4158–4171 (2021).

Yu, Z., Zhou, T., Li, N. & Wang, D. The complete chloroplast genome and phylogenetic analysis of Corydalis fangshanensis W.T. Wang ex S.Y. He (Papaveraceae). Mitochondrial DNA B 6 , 3171–3173. https://doi.org/10.1080/23802359.2021.1987172 (2021).

Kanwal, N. et al. Complete chloroplast genome of a Chinese endemic species Corydalis trisecta Franch (Papaveraceae). Mitochondrial DNA B 4 , 2291–2292. https://doi.org/10.1080/23802359.2019.1627930 (2019).

Medicine, E. B. O. C. T. Chinese Tibetan Medicine (Shanghai Science and Technology Press, 1996).

Kubo, M., Matsuda, H., Tokuoka, K., Ma, S. & Shiomoto, H. Anti-inflammatory activities of methanolic extract and alkaloidal components from Corydalis tuber. Biol. Pharm. Bull. 17, 262–265 (1994).

Guo, Y. et al. The traditional uses, phytochemistry, pharmacokinetics, pharmacology, toxicity, and applications of Corydalis saxicola bunting: A review. Front. Pharmacol. 13 , 822792. https://doi.org/10.3389/fphar.2022.822792 (2022).

Lidén, M., Fukuhara, T. & Axberg, T. Systematics and Evolution of the Ranunculiflorae 183–188 (Springer, 1995).

Bruneau, A., Starr, J. R. & Joly, S. Phylogenetic relationships in the genus Rosa: New evidence from chloroplast DNA sequences and an appraisal of current knowledge. Syst. Bot. 32 , 366–378 (2007).

Yin, X. et al. The chloroplasts genomic analyses of Rosa laevigata , R. rugosa and R. canina . Chin. Med. 15 , 1–11 (2020).

Ning, C. et al. Complete chloroplast genome of Salvia plebeia : Organization, specific barcode and phylogenetic analysis. Chin. J. Nat. Med. 18 , 563–572 (2020).

Zhang, Z. L., Zhang, Y., Song, M. F., Guan, Y. H. & Ma, X. J. Species identification of dracaena using the complete chloroplast genome as a super-barcode. Front. Pharmacol. 11 , 1441 (2020).

Wu, L. et al. Plant super-barcode: A case study on genome-based identification for closely related species of Fritillaria. Chin. Med. 16 , 52. https://doi.org/10.1186/s13020-021-00460-z (2021).

Doyle, J. J., Davis, J. I., Soreng, R. J., Garvin, D. & Anderson, M. J. Chloroplast DNA inversions and the origin of the grass family (Poaceae). Proc. Natl. Acad. Sci. 89 , 7722–7726 (1992).

Doyle, J. J., Doyle, J. L., Ballenger, J. & Palmer, J. The distribution and phylogenetic significance of a 50-kb chloroplast DNA inversion in the flowering plant family Leguminosae. Mol. Phylogenet. Evol. 5 , 429–438 (1996).

Cosner, M. E., Raubeson, L. A. & Jansen, R. K. Chloroplast DNA rearrangements in Campanulaceae: Phylogenetic utility of highly rearranged genomes. BMC Evol. Biol. 4, 1–27 (2004).

Knox, E., Downie, S. & Palmer, J. Chloroplast genome rearrangements and the evolution of giant lobelias from herbaceous ancestors. Mol. Biol. Evol. 10 , 414–430 (1993).

Zhang, N., Zeng, L. P., Shan, H. Y. & Ma, H. Highly conserved low-copy nuclear genes as effective markers for phylogenetic analyses in angiosperms. New Phytol. 195 , 923–937 (2012).

Hong, C. P. et al. accD nuclear transfer of Platycodon grandiflorum and the plastid of early Campanulaceae. BMC Genomics 18 , 1–13 (2017).

Rousseau-Gueutin, M. et al. Potential functional replacement of the plastidic acetyl-CoA carboxylase subunit (accD) gene by recent transfers to the nucleus in some angiosperm lineages. Plant Physiol. 161 , 1918–1929 (2013).

Li, J. Flora of China. Harv. Pap. Bot. 13 , 301–302 (2007).

Lin, C.-S. et al. The location and translocation of ndh genes of chloroplast origin in the Orchidaceae family. Sci. Rep. 5 , 1–10 (2015).

Mower, J. P. & Vickrey, T. L. Structural diversity among plastid genomes of land plants. Adv. Bot. Res. 85 , 263–292 (2018).

Palmer, J. D. Comparative organization of chloroplast genomes. Annu. Rev. Genet. 19 , 325–354 (1985).

Wicke, S., Schneeweiss, G. M., Depamphilis, C. W., Müller, K. F. & Quandt, D. The evolution of the plastid chromosome in land plants: Gene content, gene order, gene function. Plant Mol. Biol. 76 , 273–297 (2011).

Uribe-Convers, S., Carlsen, M. M., Lagomarsino, L. P. & Muchhala, N. Phylogenetic relationships of Burmeistera (Campanulaceae: Lobelioideae): Combining whole plastome with targeted loci data in a recent radiation. Mol. Phylogenet. Evol. 107 , 551–563 (2017).

Knox, E. B. The dynamic history of plastid genomes in the Campanulaceae sensu lato is unique among angiosperms. Proc. Natl. Acad. Sci. 111 , 11097–11102 (2014).

Knox, E. B. & Li, C. The East Asian origin of the giant lobelias. Am. J. Bot. 104 , 924–938 (2017).

Choi, K. S. et al. Two Korean endemic Clematis chloroplast genomes: Inversion, reposition, expansion of the inverted repeat region, phylogenetic analysis, and nucleotide substitution rates. Plants 10 , 397 (2021).

Liu, H. et al. Comparative analysis of complete chloroplast genomes of Anemoclema, Anemone, Pulsatilla, and Hepatica revealing structural variations among genera in tribe Anemoneae (Ranunculaceae). Front. Plant Sci. 9 , 1097 (2018).

Palmer, J. D., Nugent, J. M. & Herbon, L. A. Unusual structure of geranium chloroplast DNA: A triple-sized inverted repeat, extensive gene duplications, multiple inversions, and two repeat families. Proc. Natl. Acad. Sci. 84 , 769–773 (1987).

Chumley, T. W. et al. The complete chloroplast genome sequence of Pelargonium × hortorum: Organization and evolution of the largest and most highly rearranged chloroplast genome of land plants. Mol. Biol. Evol. 23 , 2175–2190 (2006).

Guisinger, M. M., Kuehl, J. V., Boore, J. L. & Jansen, R. K. Extreme reconfiguration of plastid genomes in the angiosperm family Geraniaceae: Rearrangements, repeats, and codon usage. Mol. Biol. Evol. 28 , 583–600 (2011).

Weng, M.-L., Blazier, J. C., Govindu, M. & Jansen, R. K. Reconstruction of the ancestral plastid genome in Geraniaceae reveals a correlation between genome rearrangements, repeats, and nucleotide substitution rates. Mol. Biol. Evol. 31 , 645–659 (2014).

Röschenbleck, J., Wicke, S., Weinl, S., Kudla, J. & Müller, K. F. Genus-wide screening reveals four distinct types of structural plastid genome organization in Pelargonium (Geraniaceae). Genome Biol. Evol. 9 , 64–76 (2017).

Weng, M. L., Ruhlman, T. A. & Jansen, R. K. Expansion of inverted repeat does not decrease substitution rates in Pelargonium plastid genomes. New Phytol. 214 , 842–851 (2017).

Kolodner, R. & Tewari, K. Inverted repeats in chloroplast DNA from higher plants. Proc. Natl. Acad. Sci. 76 , 41–45 (1979).

Palmer, J. D. & Thompson, W. F. Rearrangements in the chloroplast genomes of mung bean and pea. Proc. Natl. Acad. Sci. 78 , 5533–5537 (1981).

Lavin, M., Doyle, J. J. & Palmer, J. D. Evolutionary significance of the loss of the chloroplast-DNA inverted repeat in the Leguminosae subfamily Papilionoideae . Evolution 44 , 390–402 (1990).

Cai, Z. et al. Extensive reorganization of the plastid genome of Trifolium subterraneum (Fabaceae) is associated with numerous repeated sequences and novel DNA insertions. J. Mol. Evol. 67 , 696–704 (2008).

Martin, G. E. et al. The first complete chloroplast genome of the Genistoid legume Lupinus luteus: Evidence for a novel major lineage-specific rearrangement and new insights regarding plastome evolution in the legume family. Ann. Bot. 113, 1197–1210 (2014).

Schwarz, E. N. et al. Plastid genome sequences of legumes reveal parallel inversions and multiple losses of rps16 in papilionoids. J. Syst. Evol. 53 , 458–468 (2015).

Wang, Y.-H., Qu, X.-J., Chen, S.-Y., Li, D.-Z. & Yi, T.-S. Plastomes of Mimosoideae: structural and size variation, sequence divergence, and phylogenetic implication. Tree Genet. Genomes 13 , 41 (2017).

Charboneau, J. L., Cronn, R. C., Liston, A., Wojciechowski, M. F. & Sanderson, M. J. Plastome structural evolution and homoplastic inversions in Neo-Astragalus (Fabaceae). Genome Biol. Evol. 13 , 215 (2021).

Lee, H.-L., Jansen, R. K., Chumley, T. W. & Kim, K.-J. Gene relocations within chloroplast genomes of Jasminum and Menodora (Oleaceae) are due to multiple, overlapping inversions. Mol. Biol. Evol. 24 , 1161–1180 (2007).

Jansen, R. K. & Palmer, J. D. A chloroplast DNA inversion marks an ancient evolutionary split in the sunflower family (Asteraceae). Proc. Natl. Acad. Sci. 84 , 5818–5822 (1987).

Kim, K.-J., Choi, K.-S. & Jansen, R. K. Two chloroplast DNA inversions originated simultaneously during the early evolution of the sunflower family (Asteraceae). Mol. Biol. Evol. 22 , 1783–1792 (2005).

Sablok, G., Amiryousefi, A., He, X., Hyvönen, J. & Poczai, P. Sequencing the plastid genome of giant ragweed (Ambrosia trifida, Asteraceae) from a herbarium specimen. Front. Plant Sci. 10 , 218 (2019).

Mehmood, F., Rahim, A., Heidari, P., Ahmed, I. & Poczai, P. Comparative plastome analysis of Blumea, with implications for genome evolution and phylogeny of Asteroideae. Ecol. Evol. 11 , 7810–7826 (2021).

Zhu, A., Guo, W., Gupta, S., Fan, W. & Mower, J. P. Evolutionary dynamics of the plastid inverted repeat: the effects of expansion, contraction, and loss on substitution rates. New Phytol. 209 , 1747–1756 (2016).

Kwon, W., Kim, Y., Park, C.-H. & Park, J. The complete chloroplast genome sequence of traditional medical herb, Plantago depressa Willd. (Plantaginaceae). Mitochondrial DNA B 4 , 437–438 (2019).

Asaf, S. et al. Expanded inverted repeat region with large scale inversion in the first complete plastid genome sequence of Plantago ovata . Sci. Rep. 10 , 1–16 (2020).

Wei, N. et al. Plastome evolution in the hyperdiverse Genus Euphorbia (Euphorbiaceae) using phylogenomic and comparative analyses: Large-scale expansion and contraction of the inverted repeat region. Front. Plant Sci. 12 , 1555 (2021).

Palmer, J. D. & Thompson, W. F. Chloroplast DNA rearrangements are more frequent when a large inverted repeat sequence is lost. Cell 29 , 537–550 (1982).

Michelangeli, F. A., Davis, J. I. & Stevenson, D. W. Phylogenetic relationships among Poaceae and related families as inferred from morphology, inversions in the plastid genome, and sequence data from the mitochondrial and plastid genomes. Am. J. Bot. 90 , 93–106 (2003).

Burke, S. V., Lin, C.-S., Wysocki, W. P., Clark, L. G. & Duvall, M. R. Phylogenomics and plastome evolution of tropical forest grasses (Leptaspis, Streptochaeta: Poaceae). Front. Plant Sci. 7 , 1993 (2016).

Liu, Q. et al. Comparative chloroplast genome analyses of Avena: Insights into evolutionary dynamics and phylogeny. BMC Plant Biol. 20 , 1–20 (2020).

Ogihara, Y., Terachi, T. & Sasakuma, T. Intramolecular recombination of chloroplast genome mediated by short direct-repeat sequences in wheat species. Proc. Natl. Acad. Sci. 85 , 8573–8577 (1988).

Milligan, B. G., Hampton, J. N. & Palmer, J. D. Dispersed repeats and structural reorganization in subclover chloroplast DNA. Mol. Biol. Evol. 6 , 355–368 (1989).

Li, J. et al. Assembly of the complete mitochondrial genome of an endemic plant, Scutellaria tsinyunensis , revealed the existence of two conformations generated by a repeat-mediated recombination. Planta 254 , 1–16 (2021).

Le Hir, H., Nott, A. & Moore, M. J. How introns influence and enhance eukaryotic gene expression. Trends Biochem. Sci. 28 , 215–220 (2003).

Niu, D.-K. & Yang, Y.-F. Why eukaryotic cells use introns to enhance gene expression: Splicing reduces transcription-associated mutagenesis by inhibiting topoisomerase I cutting activity. Biol. Direct 6 , 1–10 (2011).

Callis, J., Fromm, M. & Walbot, V. Introns increase gene expression in cultured maize cells. Genes Dev. 1 , 1183–1200 (1987).

Emami, S., Arumainayagam, D., Korf, I. & Rose, A. B. The effects of a stimulating intron on the expression of heterologous genes in Arabidopsis thaliana. Plant Biotechnol. J. 11, 555–563 (2013).

Choi, T., Huang, M., Gorman, C. & Jaenisch, R. A generic intron increases gene expression in transgenic mice. Mol. Cell. Biol. 11 , 3070–3074 (1991).

Haberle, R. C., Fourcade, H. M., Boore, J. L. & Jansen, R. K. Extensive rearrangements in the chloroplast genome of Trachelium caeruleum are associated with repeats and tRNA genes. J. Mol. Evol. 66 , 350–361 (2008).

Yue, F., Cui, L., Claude, W. D., Moret, B. M. & Tang, J. Gene rearrangement analysis and ancestral order inference from chloroplast genomes with inverted repeat. BMC Genomics 9 , 1–9 (2008).

Ren, F. M. et al. DNA barcoding of Corydalis, the most taxonomically complicated genus of Papaveraceae. Ecol. Evol. 9 , 1934–1945 (2019).

Sang, T. Utility of low-copy nuclear gene sequences in plant phylogenetics. Crit. Rev. Biochem. Mol. Biol. 37 , 121–147 (2002).

Debray, K. et al. Identification and assessment of variable single-copy orthologous (SCO) nuclear loci for low-level phylogenomics: A case study in the genus Rosa (Rosaceae). BMC Evol. Biol. 19 , 1–19 (2019).

Wang, Y. W. Systematics of Corydalis DC. (Fumariaceae) (The Chinese Academy of Sciences, 2006).

Jiang, H., Lei, R., Ding, S.-W. & Zhu, S. Skewer: A fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinform. 15 , 1–12 (2014).

Jin, J.-J. et al. GetOrganelle: A fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 21 , 1–31 (2020).

Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 (2013).

Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25 , 2078–2079 (2009).

Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29 , 24–26 (2011).

Delcher, A. L., Salzberg, S. L. & Phillippy, A. M. Using MUMmer to identify similar regions in large sequence sets. Curr. Protoc. Bioinform. 1 , 10.13.11-10.13.18 (2003).

Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27 , 578–579 (2011).

Huang, D. I. & Cronk, Q. C. Plann: A command-line application for annotating plastome sequences. Appl. Plant Sci. 3 , 1500026 (2015).

Misra, S. & Harris, N. Using Apollo to browse and edit genome annotations. Curr. Protoc. Bioinform. 12 , 1–28 (2005).

Lohse, M., Drechsel, O. & Bock, R. OrganellarGenomeDRAW (OGDRAW): A tool for the easy generation of high-quality custom graphical maps of plastid and mitochondrial genomes. Curr. Genet. 52 , 267–274 (2007).

Darling, A. C., Mau, B., Blattner, F. R. & Perna, N. T. Mauve: Multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14 , 1394–1403 (2004).

Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M. & Dubchak, I. VISTA: Computational tools for comparative genomics. Nucleic Acids Res. 32 , W273–W279 (2004).

Benson, G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27 , 573–580 (1999).

Kurtz, S. et al. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29 , 4633–4642 (2001).

Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. MISA-web: A web server for microsatellite prediction. Bioinformatics 33 , 2583–2585 (2017).

Emms, D. M. & Kelly, S. OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol. 20 , 1–14 (2019).

Abascal, F., Zardoya, R. & Posada, D. ProtTest: Selection of best-fit models of protein evolution. Bioinformatics 21 , 2104–2105 (2005).

Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30 , 1312–1313 (2014).

Acknowledgements

This work was supported by grants from the National Key Research and Development Program of China (2019YFC1712302, 2019YFC1712305); the Department of Science and Technology of Sichuan Province (2020YJ0369); and the project “Research on the authentic scientific connotation of Angelicae Dahuricae Radix based on the difference in efficacy of the ‘gut-brain axis’ in the treatment of migraine” (82173928).

Author information

These authors contributed equally: Xianmei Yin and Feng Huang.

Authors and Affiliations

College of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, 611130, China

Xianmei Yin, Feng Huang, Xiaofen Liu, Jiachen Guo, Yan Lian, Jingjing Deng, Hongxiang Yin & Guihua Jiang

Central Laboratory, Shandong Academy of Chinese Medicine, Jinan, 250014, China

Ning Cui & Hao Wu

College of Pharmacy, Shandong University of Traditional Chinese Medicine, Jinan, 250355, China

Conglian Liang

Contributions

G.H.J. and H.X.Y. conceived and designed the research framework; X.F.L. and J.C.G. collected and identified the samples; X.M.Y. and F.H. performed the experiments; Y.L. and N.C. analyzed the data; X.M.Y. and F.H. wrote the paper; C.L.L., J.J.D. and H.W. revised the final manuscript. All the authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Hongxiang Yin or Guihua Jiang.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Yin, X., Huang, F., Liu, X. et al. Phylogenetic analysis based on single-copy orthologous proteins in highly variable chloroplast genomes of Corydalis. Sci Rep 12, 14241 (2022). https://doi.org/10.1038/s41598-022-17721-y

Received: 28 October 2021

Accepted: 29 July 2022

Published: 20 August 2022

DOI: https://doi.org/10.1038/s41598-022-17721-y

This article is cited by

Dynamic changes in the plastid and mitochondrial genomes of the angiosperm Corydalis pauciovulata (Papaveraceae).

  • Seongjun Park
  • SeonJoo Park

BMC Plant Biology (2024)

Phylogenetic Analysis Guides Transporter Protein Deorphanization: A Case Study of the SLC25 Family of Mitochondrial Metabolite Transporters

Katie L. Byrne

1 Cellular and Molecular Physiology Department, Yale School of Medicine, New Haven, CT 06510, USA

2 Systems Biology Institute, Yale West Campus, West Haven, CT 06516, USA

3 Yale College, New Haven, CT 06511, USA

Richard V. Szeligowski

Hongying Shen

Associated Data

Sequence and alignment data files are provided within the paper.

Abstract

Homology search and phylogenetic analysis have commonly been used to annotate gene function, although they are prone to error. We hypothesize that the power of homology search in functional annotation depends on the coupling of sequence variation to functional diversification, and we herein focus on the SoLute Carrier 25 (SLC25) family of mitochondrial metabolite transporters to survey this coupling in a family-wide manner. The SLC25 family is the largest family of mitochondrial metabolite transporters in eukaryotes. Its members translocate ligands of different chemical properties, ranging from nucleotides and amino acids to carboxylic acids and cofactors, and thus present ample experimentally validated functional diversification in ligand transport. Here, we use phylogenetic analysis to profile SLC25 transporters across common eukaryotic model organisms, from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Danio rerio to Homo sapiens, and assess their sequence adaptations to the transported ligands within individual subfamilies. Using several recently studied and poorly characterized SLC25 transporters, we discuss the potential and limitations of phylogenetic analysis in guiding functional characterization.

1. Introduction

It is estimated that at least 20–30% of the proteins encoded by any model organism’s genome have poorly characterized functions [1]. To predict their functions, in silico analysis, especially sequence homology search, is frequently used to annotate them based on their characterized homologs. However, this strategy is prone to false predictions due to functional diversification in protein evolution [2]. We therefore reasoned that the power of homology search in predicting function depends on the coupling of sequence variation to functional diversification, which can be evaluated by performing a phylogenetic analysis of homologs that span sufficient experimentally validated functional diversity.

In the context of cellular metabolism, specifically for metabolic enzymes and metabolite transporters of poorly characterized function, we consider functional prediction, the “deorphanization” effort, as predicting the ligands that bind specifically at the evolutionarily conserved, putative catalytic pocket. This ligand can be a translocated ligand for a transporter, a substrate or product of a metabolic enzyme, or a regulatory molecule bound at the same pocket of pseudogenized enzymes or transporters that have gained additional moonlighting functions [3]. To test whether phylogenetic analysis is sufficient for guiding deorphanization and predicting putative ligands, we chose the mitochondrial SLC25 metabolite transporter family as the model family.

The SLC25 family is the largest protein family responsible for translocating metabolites across the mitochondrial inner membrane, a process that critically controls all aspects of mitochondrial physiology [4, 5, 6, 7, 8]. The human genome encodes 53 SLC25 genes, including four adenine nucleotide carriers (ADP/ATP carriers) that exchange ADP³⁻ for ATP⁴⁻ to support cellular bioenergetics and UCP1, which is exclusively expressed in brown adipose tissue and dissipates the proton motive force to generate heat. Despite their critical roles, at least one-third of human SLC25 transporters remain poorly characterized, leaving ample space for deorphanization efforts. Meanwhile, decades of elegant experimental studies have provided functional and mechanistic insights, presenting an ideal case for evaluating the coupling of sequence variation and functional diversification for proteins within the SLC25 family.

The SLC25 family is a prime candidate for our investigation into phylogenetic-analysis-guided studies for several reasons. First, the family is extraordinarily functionally diverse: it is the largest protein family (53 members in humans) that translocates chemically distinct metabolite ligands across the inner mitochondrial membrane with high specificity, ranging from nucleotides, amino acids, carboxylic acids, inorganic ions and different cofactors to protons.

Second, structure–function analysis, leveraging elegant primary sequence interrogation [9, 10], atomic structures of different conformational states locked by pharmacological inhibitors [11, 12] and site-directed mutagenesis screens [13, 14], has confidently pinpointed the ligand-binding residues of several characterized proteins in this family. This allows for an evaluation of the coupling of sequence adaptation to the transported ligands. The transmembrane region of each SLC25 transporter is a structural and functional monomer of only approximately 300 amino acids with six transmembrane α-helices (three pseudo-symmetrical repeats of two transmembrane helices each, annotated as “mito_carr” or “Solcar” domains) surrounding the central cavity. The ADP/ATP carrier crystal structures in both the cytoplasmic-open state (c-state) and the matrix-open state (m-state), locked by two inhibitors bound at the ligand-binding site, carboxyatractyloside and bongkrekic acid, respectively, highlight a conserved alternating-access transport mechanism that is triggered by a conformational change upon ligand binding at the central cavity [8]. Therefore, it is reasonable to speculate that the residues involved in ligand recognition and translocation would evolve and adapt to specific ligands, and that this adaptation might be reflected by phylogenetic analysis.

Third, and most remarkably, the SLC25 family is highly evolutionarily conserved yet functionally diverse in eukaryotes. This family is present in all eukaryotic species (no bacterial or archaeal homologs have been identified so far), with only a few exceptions in protist species that have significantly reduced mitochondrial function, such as Giardia lamblia and Encephalitozoon cuniculi [15], or that have completely lost mitochondria, e.g., Monocercomonoides sp. [16]. We postulate that the SLC25 family might have massively expanded even before the last eukaryotic common ancestor and diversified into distinct transporters with different ligand specificities, that is, into subfamilies or orthologous groups. This is because (i) any genome that contains the SLC25 family is found to encode multiple SLC25 family genes; for instance, the human genome encodes 53, and even the distantly diverged apicomplexans contain about a dozen; (ii) in many reported cases [17, 18, 19], the corresponding mammalian ortholog can rescue the yeast SLC25 mutant phenotypes, suggesting human-to-yeast conservation; and (iii) previous phylogenetic analyses of SLC25 transporters within the human and yeast genomes [20, 21] already suggested that their protein sequences cluster by function, specifically ligand specificity [21].

Here, to evaluate the sequence adaptation for transported ligands within the SLC25 family, we performed a phylogenetic analysis of SLC25 transporters from humans and commonly studied model organisms and annotated the tree based on the chemistry of the experimentally validated transported ligands. We observed that for the majority of the transporters, orthologous transporters that transport the same ligand cluster together across different species, suggesting an early divergence of these metabolite transporters as well as the coupling of sequence variation to ligand adaptation. They include ADP/ATP exchangers and transporters for CoA, carboxylates, glutathione, SAM and several amino acids, among others. In other cases, the sequence variation might be coupled to a conserved metabolite regulation mechanism or to other functional innovations, for instance, targeting to different subcellular membranes, including outer mitochondrial membranes or peroxisomes. A lack of homology for several species-specific transporters is also noted. In summary, we explore the power of phylogenetic analysis in ligand prediction for the SLC25 transporter family, which might guide the “deorphanization” effort for some of the poorly characterized SLC25 transporters as well as provide an example for the exploration of other transporter families.

2. Materials and Methods

We chose the following five genomes, including commonly studied model organisms, to evaluate the SLC25 protein family: Saccharomyces cerevisiae (budding yeast, id: 4932), Caenorhabditis elegans (roundworm, id: 6239), Drosophila melanogaster (fruit fly, id: 7227), Danio rerio (zebrafish, id: 7955) and Homo sapiens (human, id: 9606). We reasoned that these organisms are diverse enough to represent different eukaryotic branches, and as model organisms, the sequences are more likely to be experimentally studied and thus suitable for our analysis.

SLC25 sequences from each genome were identified via a text-mining search in the UNIPROT database [22], using a combined “mito_carr” and “Solcar” query restricted to each organism by its taxonomy code (Data S1). We chose a text-mining search in UNIPROT over an NCBI BLASTp/tBLASTn search for the following reasons: first, the proteome annotations for these model organisms in UNIPROT are relatively complete, albeit with duplications and errors; and second, we found that “mito_carr” and “Solcar” domain predictions are of high accuracy. For instance, many SLC25 protein sequences from these model organisms are considered validated (“reviewed” in UNIPROT), including all 53 human SLC25s, all 35 yeast SLC25s and tens of worm, fly, and zebrafish SLC25s.

While validating the yeast sequences, we noticed that the Cmc1 protein from the S288c yeast strain (D6W196) was curated as a truncated form of the protein due to a frameshift at position 403 in strain S288c. We therefore replaced this sequence with the full-length sequence of the same protein from the CG379 yeast strain (P0CI40).

Upon validating the yeast (35) and human (53) sequences, we then manually curated the SLC25 sequences in worm (49), fly (120) and zebrafish (111) using several rounds of multiple-sequence alignment (MUSCLE) [23, 24] within each organism (Data S2). We (1) included all UNIPROT-“reviewed” sequences; (2) eliminated obviously duplicated sequences (sequence identity value <0.005); and (3) removed obviously truncated sequences (usually <200 amino acids in length). We chose 200 amino acids as a relatively loose cutoff to accommodate SLC25 proteins that might have truncated hydrophilic loops. However, all validated SLC25 protein sequences we curated are approximately 300 amino acids long, or longer if they contain additional protein domains.
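The three curation rules can be sketched in Python as follows. This is a simplified illustration, not the authors' procedure: the entry format and function name are hypothetical, and duplicates are detected here by exact sequence match, whereas the paper applied a pairwise value threshold derived from MUSCLE alignments and manual inspection.

```python
def curate(entries, min_length=200):
    """Return the IDs kept after applying the three curation rules.

    Each entry is a dict with hypothetical keys "id", "seq" and "reviewed".
    """
    kept = []
    seen = set()
    # Process reviewed entries first so they win over unreviewed duplicates
    # (sorted() is stable, so the original order is otherwise preserved).
    for entry in sorted(entries, key=lambda e: not e["reviewed"]):
        seq = entry["seq"].upper()
        if entry["reviewed"]:
            # Rule 1: always include reviewed sequences.
            kept.append(entry["id"])
            seen.add(seq)
            continue
        if len(seq) < min_length:
            continue  # Rule 3: drop likely truncated sequences.
        if seq in seen:
            continue  # Rule 2 (simplified): drop exact duplicates.
        kept.append(entry["id"])
        seen.add(seq)
    return kept


if __name__ == "__main__":
    demo = [
        {"id": "u1", "seq": "M" * 300, "reviewed": False},
        {"id": "r1", "seq": "M" * 300, "reviewed": True},
        {"id": "u2", "seq": "A" * 150, "reviewed": False},
        {"id": "u3", "seq": "G" * 250, "reviewed": False},
    ]
    print(curate(demo))
```

In practice the duplicate test would be replaced by a comparison against a pairwise matrix, and borderline cases would still need manual review, as the text notes.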

In summary, we included in our analysis 53 human, 35 yeast, 35 worm, 48 fly and 57 zebrafish SLC25 protein sequences. We acknowledge that our curation might contain errors for the worm, fly and zebrafish sequences, especially for paralogs within the same subfamily. First, for several subfamilies that are known to contain multiple paralogous genes, for instance, ADP/ATP exchangers and CMCs, multiple sequences might have identity values <0.01 but would either map to different chromosomal regions in tBLASTn or the mapping was of low confidence; therefore, the number of paralogs might be incorrect. Second, the zebrafish genome is known to have local genome duplication, so paralogs might be missing in our analysis. Regardless of the exact number of sequences, our analysis should sufficiently represent the complete functional diversity of the family.

We then manually trimmed all protein sequences to the regions that are predicted to encompass the three “Solcar” domains: the core transporter regions. For instance, the predicted EF-hand domains in CMCs and Aralars were removed. We then included the human MCU, another inner membrane channel, as the outgroup.

We aligned the protein sequences using MAFFT v7.505 [ 25 ] and determined the best substitution model (LG + F + G4) via ModelFinder [ 26 ]. We then used this model to construct a phylogenetic tree using IQ-Tree v2.1.2 [ 27 ], with branch support values obtained via ultrafast bootstrapping from 10,000 replicates [ 28 ]. iTOL [ 29 ] was used to visualize and annotate the resulting tree, which we rooted using the human protein MCU as the outgroup.
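
As a rough sketch of how such an alignment and tree-building run might be scripted, the Python snippet below assembles the two command lines. The exact options used in the study are not reported, so the MAFFT `--auto` strategy, the `iqtree2` executable name and all file names are assumptions; only the model (LG+F+G4) and the 10,000 ultrafast bootstrap replicates come from the text.

```python
import subprocess
from typing import List


def build_commands(fasta: str, alignment: str, prefix: str) -> List[List[str]]:
    """Return the MAFFT and IQ-TREE 2 command lines as argument lists."""
    # MAFFT writes the alignment to standard output; --auto is an assumed strategy.
    mafft = ["mafft", "--auto", fasta]
    iqtree = [
        "iqtree2",           # assumed executable name for IQ-TREE v2.x
        "-s", alignment,     # input alignment
        "-m", "LG+F+G4",     # substitution model reported in the text
        "-B", "10000",       # ultrafast bootstrap replicates
        "--prefix", prefix,  # prefix for output files (hypothetical)
    ]
    return [mafft, iqtree]


def run_pipeline(fasta: str, alignment: str, prefix: str) -> None:
    """Run both steps; requires MAFFT and IQ-TREE 2 to be installed."""
    mafft, iqtree = build_commands(fasta, alignment, prefix)
    with open(alignment, "w") as handle:
        subprocess.run(mafft, stdout=handle, check=True)
    subprocess.run(iqtree, check=True)
```

Rooting on the MCU outgroup is left to the visualization step, as described above.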

To perform the phylogenetic analysis, we collected amino acid sequences via a text-mining search (using “mito_carr” and “Solcar” in the UNIPROT database) against each model organism using their taxonomy codes. These organisms included Homo sapiens (human), Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly) and Danio rerio (zebrafish). We chose these model organisms because they are evolutionarily diverse and are likely to be experimentally studied and annotated. Upon manual curation (see the Materials and Methods), we finalized the sequence entries, which included 53 human sequences, 57 zebrafish sequences, 48 fly sequences, 35 worm sequences and 35 yeast sequences. The large numbers of SLC25 transporters in all model organisms are consistent with a putative massive gene family expansion prior to the time when these eukaryotic lineages diverged. We then performed a multiple sequence alignment, using a nonhomologous mitochondrial calcium uniporter sequence in human (MCU) as the outgroup, and constructed the phylogenetic tree with bootstrap values to highlight clade confidence (a bootstrap value > 70 denotes high confidence).

We then sought to functionally annotate the tree by color-coding major clades based on the chemistry of the transporting ligands identified for the known transporters in the clade. Specifically, the subfamily is manually annotated by ligand based on the biochemical activity of the well-characterized representative yeast or human SLC25 proteins in each clade and then expanded to the other SLC25 proteins in the clade with bootstrapping values > 90 ( Figure 1 and Figure S1 ). Indeed, the majority of the well-characterized paralogs and orthologs across all model organisms translocating the same ligand are well-clustered with extremely high levels of confidence (bootstrapping values > 90), followed by homologs translocating ligands with similar chemical structures. For instance, the well-characterized SLC25 ADP/ATP exchanger (also named AACs, ANTs, ANCs and ADTs) that include four human transporters (SLC25A4, SLC25A5, SLC25A6 and SLC25A31) and their orthologs are all clustered together and are within the nucleotide transporter clade. Because several recent reviews have already summarized the characterized SLC25 transporters [ 21 , 30 , 31 ], here, we will highlight a few recently identified transporters and transporters of poorly characterized functions or of uncertainty in the human genome, with the goal of guiding future functional characterizations.
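
The annotation rule described above (extend a ligand label from characterized members to the rest of a clade only when bootstrap support exceeds 90) can be illustrated with a small sketch. The function name, the flat clade representation and the placeholder zebrafish entry are assumptions made for illustration; the actual annotation was done manually on the tree.

```python
from typing import Dict, List, Optional

SUPPORT_CUTOFF = 90  # bootstrap threshold used for extending annotations


def annotate_clade(
    members: List[str], support: float, known: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Extend a ligand label to uncharacterized members of a well-supported clade."""
    ligands = {known[m] for m in members if m in known}
    inferred = None
    # Extend only when support is high and characterized members agree.
    if support > SUPPORT_CUTOFF and len(ligands) == 1:
        inferred = next(iter(ligands))
    return {m: known.get(m, inferred) for m in members}


# Hypothetical clade: two characterized human carriers and one placeholder entry.
known = {"SLC25A4_HUMAN": "ADP/ATP", "SLC25A5_HUMAN": "ADP/ATP"}
members = ["SLC25A4_HUMAN", "SLC25A5_HUMAN", "uncharacterized_DANRE"]
high_support = annotate_clade(members, 95, known)
low_support = annotate_clade(members, 59, known)
```

With a support of 95 the placeholder entry inherits the clade label, whereas with a support of 59 it is left unannotated.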

Figure 1. Phylogenetic analysis of SLC25 mitochondrial transporter family members. All human (human), Danio rerio (DANRE), Caenorhabditis elegans (CAEEL), Drosophila melanogaster (DROME), and Saccharomyces cerevisiae (YEAST, or YEASX for CMC1, see the Materials and Methods) SLC25 family transporters were aligned, built into a tree, then color-annotated by the transporting ligands. Bold lines represent bootstrap values > 70.

3.1. Recently Characterized Transporters

  • Glutathione for SLC25A39 and SLC25A40

SLC25A39 was recently identified as the evolutionarily conserved, putative transporter for the major cellular antioxidant glutathione, the tripeptide in which the γ-peptide bond preserves the amino acid chemical group of the glutamate [ 32 , 33 ]. The phylogenetic tree supports SLC25A40 as the human paralog [ 32 ].

  • BCAAs for SLC25A44

The recently identified branched-chain amino acid (BCAA) transporter SLC25A44 [ 34 ] contains orthologs in animals (fly, worm and zebrafish) but no obvious orthologs in yeast.

Both the SLC25A39 and SLC25A44 clades are nested within a cluster of transporters (despite having low homology with the other groups) in which the majority translocate amino acids, supporting their putative identification.

  • NAD+ for SLC25A51 and SLC25A52

Recently identified putative human NAD+ transporters, SLC25A51 [ 35 , 36 , 37 ] and SLC25A52 [ 35 ], cluster within the same clade. This clade also contains orthologous genes in all animal model organisms but exhibits little homology with the yeast NAD transporters Ndt1/Yia6 and Ndt2/Yea6 [ 38 ]. It would be intriguing if fungi and animals independently evolved their ability to transport NAD cofactors, as well as their dependence on NAD-mediated OXPHOS to support mitochondrial bioenergetics. A homologous human transporter, SLC25A53, remains an orphan. This clade is adjacent to the majority of the nucleotide transporter clade, but a lack of high homology (bootstrap values < 70) might suggest that a homology search is not sufficient for guiding ligand characterization.

3.2. Transporters of Uncertainty

The evolutionarily conserved SLC25A32 subfamily has been proposed to transport both dinucleotide cofactor FAD and tetrahydrofolate (THF) cofactors supporting a mitochondrial one-carbon metabolism. While in vitro [ 39 ], mouse genetics [ 40 ] and cellular characterizations [ 41 , 42 , 43 ] supported its role in THF cofactor transport and one-carbon metabolism, this SLC25A32 clade’s homology to other nucleotide transporters might also support FAD transport [ 44 , 45 , 46 ].

The previously proposed putative glycine transporter SLC25A38 [ 47 , 48 ] was found to contain orthologs in zebrafish, yeast and human. This clade was recently proposed to also transport isopentenyl pyrophosphate (IPP), which is involved in CoQ biosynthesis (yeast ortholog Hem25, a fungi-specific function [ 49 ]), and to regulate other metabolite transport. The SLC25A38 clade appears to be within a majority amino-acid-transporting group.

3.3. Poorly Characterized Transporters

  • Peroxisomal SLC25A17

The only non-mitochondrial SLC25 protein, peroxisomal SLC25A17 [ 50 ], which is present in human, fruit fly, and zebrafish, does not exhibit homology of high confidence with other SLC25 transporters. Its clustering with the yeast peroxisomal nucleotide transporter Ant1 [ 51 ] cannot be validated via a bidirectional best hit (BBH) search. This lack of obvious homology might suggest an early divergence with potentially broader substrate specificity [ 50 ] and a situation in which sequence variation might not be coupled to ligand specificity but to subcellular targeting.
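
For readers unfamiliar with the bidirectional best hit criterion, the sketch below shows the logic in a few lines. The score tables are hypothetical placeholders (for example, alignment scores from searching each proteome against the other); the text does not describe how the BBH search was run, so this only illustrates the definition.

```python
from typing import Dict, Optional

Scores = Dict[str, Dict[str, float]]  # query -> {subject: score}


def best_hit(scores: Scores, query: str) -> Optional[str]:
    """Return the highest-scoring subject for a query, or None if it has no hits."""
    hits = scores.get(query, {})
    return max(hits, key=hits.get) if hits else None


def is_bbh(a: str, b: str, a_vs_b: Scores, b_vs_a: Scores) -> bool:
    """True when a and b are each other's best hit in the two searches."""
    return best_hit(a_vs_b, a) == b and best_hit(b_vs_a, b) == a


# Hypothetical scores for proteins a1, a2 (species A) and b1, b2 (species B).
a_vs_b = {"a1": {"b1": 120.0, "b2": 90.0}, "a2": {"b1": 100.0}}
b_vs_a = {"b1": {"a1": 118.0, "a2": 101.0}, "b2": {"a1": 88.0}}
```

Here `a1` and `b1` form a bidirectional best hit, whereas `a2` does not pair with `b1` because the best hit of `b1` is `a1`.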

  • SLC25A34 and SLC25A35

The distinct SLC25A34 and SLC25A35 subfamily of completely unknown function contains human, fly, zebrafish and yeast orthologs, such as yeast Oac1 [ 52 ], but does not have any worm orthologs. The clade exhibits homology with the clade of dicarboxylate transporters SLC25A10 and SLC25A11 (bootstrap = 59), followed by homology with the UnCoupling Protein subfamily UCP1-5 (SLC25A7, A8, A9, A14 and A30) (bootstrap = 95), in which dicarboxylate transport activities have been proposed for several UCPs [ 53 , 54 , 55 ]. Such insights might guide the characterization of these proteins.

The vertebrate-specific SLC25A43 (only in human and zebrafish) of completely unknown function sparked interest because the clade is nested within a majority nucleotide group with high confidence (bootstrap = 86). The clade’s high homology to ADP/ATP carriers (SLC25A4, A5, A6 and A31), CoA transporters (A16 and A42) and the nucleotide-importer SCAMC subfamily, followed by thiamine pyrophosphate transporters, might guide the deorphanization of SLC25A43.

Despite a close clustering of high confidence, we cannot validate the homology via BBH search between SLC25A43 and the yeast Ugo1, an outer mitochondrial membrane SLC25 protein that is important for mitochondrial fusion with unknown ligand significance [ 56 ]. We therefore suspect that this clustering might be an artifact due to long-branch attraction of unknown causes [ 57 ], and the SLC25A43 clade might be a vertebrate-specific innovation.

  • SLC25A45, SLC25A47 and SLC25A48

The homologous SLC25A45, SLC25A47 and SLC25A48 are three poorly characterized transporters that are clustered within a distinct clade within the amino acid transporters, separated (bootstrap = 83) but homologous (bootstrap = 78) with the putative basic amino acid transporter SLC25A29, followed by a homology with the carnitine transporter SLC25A20 and the ornithine transporters SLC25A15 and SLC25A2 (bootstrap = 100). Recent loss-of-function studies in mouse livers suggest that deleting the liver-specific [ 58 , 59 , 60 ] Slc25a47 leads to a reduction in respiration [ 58 ], an increase in the mitochondrial stress response [ 58 , 60 ] and an impaired lipid metabolism [ 59 ]. While the exact endogenous ligand is yet to be identified, our phylogenetic analysis does not support a nucleotide substrate using this family-wide conserved ligand recognition mechanism.

  • Outer mitochondrial membrane SLC25A46, MTCH1 and MTCH2

The three human outer mitochondrial membrane SLC25 proteins, SLC25A46, MTCH1 and MTCH2, are homologous with high levels of confidence with other animal sequences in worm, fly and zebrafish. A lack of homology with the yeast outer mitochondrial membrane protein Ugo1 might suggest two independent evolutionary innovations to utilize the inner membrane SLC25 protein scaffold for new outer membrane function. The recently proposed novel role of MTCH2 as an outer mitochondrial membrane insertase [ 61 ] is intriguing, and future studies are required to bridge this biochemical activity with the previously proposed roles of MTCH2 in membrane fusion [ 62 , 63 ] and apoptosis [ 64 ] in yeast and in animals. The significance of potential metabolite binding at the family-wide conserved ligand binding pocket is yet to be explored.

4. Discussion

Here, we performed a phylogenetic analysis of SLC25 family transporters from common model organisms to evaluate the power of homology search in guiding ligand characterization. Our analysis exhibited an extremely high degree of homology between SLC25 transporters across evolution that transport the same ligand (bootstrapping value > 90), consistent with an early functional divergence of SLC25 proteins. Consistent with previous phylogenetic analyses [ 20 , 21 ], we also observed a high degree of homology between SLC25 proteins transporting ligands with similar chemistries, e.g., nucleotides, amino acids and carboxylates (bootstrapping value usually >50), supporting the coupling of sequence variation to functional divergence. The ability of the homology search to guide ligand prediction is promising for several remaining “orphan” SLC25 transporters and is discussed case-by-case.

While several phylogenetic analyses have been performed on automatically collected SLC25 sequences across a broad range of eukaryotic taxa [ 65 , 66 , 67 ], our analysis, which focused on a few model organisms, provides unique insights to guide future functional annotation. Because the model organism proteomes are well curated, we were able to carefully exhaust all SLC25s in each species and more importantly, immediately distinguish the evolutionarily conserved SLC25s from the SLC25s that might be due to recent innovation in vertebrates or in mammals. Unsurprisingly, the biochemical activities of most evolutionarily conserved SLC25s have been characterized, and a majority of the poorly characterized transporters are only present in certain species, suggesting they are dispensable for central and conserved mitochondrial metabolism in animals—an important insight to interpret for any future research detecting their biochemical activities in the context of the human mitochondrial metabolism.

Our analysis should not be affected by the potentially incorrect number of SLC25 family proteins in each species because the error should only occur for species-specific paralogs for which we cannot distinguish whether certain transporter sequences have inaccurate annotations or are taxa-specific gene duplications. A more complete analysis might benefit from the inclusion of other model organisms, including the plant Arabidopsis thaliana, the house mouse, Mus musculus, and the western clawed frog, Xenopus tropicalis (often used for the study of ion channels). For recently innovated SLC25 proteins, a phylogenetic analysis focusing on sequenced vertebrate [ 68 ] or mammalian genomes [ 69 ] might inform the understanding of their adaptive evolution in human physiology.

Several limitations to utilizing homology search for ligand characterization are noted. First, singleton transporters with little homology with other transporters provide little guidance toward their characterization. Examples of this include the peroxisome SLC25A17 clade (discussed above) and many species-specific transporters, for instance, the fruit fly MME1 (Q9VM51, annotated as “Mitochondrial Magnesium Exporter 1”) [ 70 ] and the COLT of unknown function (Q9VQG4, annotated as “Congested-like trachea protein”) [ 71 ], both of which are within the amino acid transporter clade. Second, we highlighted that homology-based sequence variation cannot distinguish between adaptation to new ligands and adaptation to different subcellular organelle targeting or to other moonlighting functions. For instance, the homology among the UnCoupling Proteins (UCPs) might potentially suggest an adapted metabolite regulation mechanism instead of an adaptation to transported ligands, for example, the regulation of UCP1 by purine nucleotides, demonstrated via structure [ 72 , 73 ] and molecular simulation [ 74 ].

A phylogenetic analysis could potentially guide experimental ligand characterization efforts for certain SLC25 transporters as an orthogonal strategy to in vitro proteoliposome-based transport assays, cell-based assays and in vivo studies. Other molecular evolution approaches could also provide exciting insights [ 66 ]. A comprehensive evaluation of how a homology search and phylogenetic analysis could guide deorphanization efforts probably requires a “dream world” scenario in which experimentally validated functional annotation is available for every Cluster of Orthologous Groups (COG) in bacteria and in eukaryotes [ 75 ].

Acknowledgments

We thank Xiaojian Shi and other members of the Shen lab for the fruitful discussion and feedback.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biom13091314/s1 , Figure S1: Phylogenetic analysis of the SLC25 mitochondrial transporter family members with bootstrap values; Data S1: SLC25 transporter protein sequence file; Data S2: SLC25 transporter protein alignment file.

Funding Statement

This research was funded by the Yale School of Medicine and Yale West Campus startup (H.S.) and NIH grant R35GM150619 (H.S.).

Author Contributions

Conceptualization, H.S. and K.L.B.; analysis and investigation, K.L.B., R.V.S. and H.S.; writing—original draft preparation, K.L.B. and H.S.; editing, K.L.B., R.V.S. and H.S.; supervision, H.S. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Data availability statement, conflicts of interest.

The authors declare no conflict of interest.

  • Methodology article
  • Open access
  • Published: 20 March 2007

A practical approach to phylogenomics: the phylogeny of ray-finned fish (Actinopterygii) as a case study

  • Chenhong Li 1 ,
  • Guillermo Ortí 1 ,
  • Gong Zhang 2 &
  • Guoqing Lu 3  

BMC Evolutionary Biology volume 7, Article number: 44 (2007)

Molecular systematics occupies one of the central stages in biology in the genomic era, ushered in by unprecedented progress in DNA technology. The inference of organismal phylogeny is now based on many independent genetic loci, a widely accepted approach to assemble the tree of life. Surprisingly, this approach is hindered by a lack of appropriate nuclear gene markers for many taxonomic groups, especially at high taxonomic levels, partially due to the lack of tools for efficiently developing new phylogenetic markers. We report here a genome-comparison strategy for identifying nuclear gene markers for phylogenetic inference and apply it to the ray-finned fishes – the largest vertebrate clade in need of phylogenetic resolution.

A total of 154 candidate molecular markers – relatively well conserved, putatively single-copy gene fragments with long, uninterrupted exons – were obtained by comparing whole genome sequences of two model organisms, Danio rerio and Takifugu rubripes . Experimental tests of 15 of these (randomly picked) markers on 36 taxa (representing two-thirds of the ray-finned fish orders) demonstrate the feasibility of amplifying by PCR and directly sequencing most of these candidates from whole genomic DNA in a vast diversity of fish species. Preliminary phylogenetic analyses of sequence data obtained for 14 taxa and 10 markers (total of 7,872 bp for each species) are encouraging, suggesting that the markers obtained will make significant contributions to future fish phylogenetic studies.

We present a practical approach that systematically compares whole genome sequences to identify single-copy nuclear gene markers for inferring phylogeny. Our method is an improvement over traditional approaches (e.g., manually picking genes for testing) because it uses genomic information and automates the process to identify large numbers of candidate markers. This approach is shown here to be successful for fishes, but also could be applied to other groups of organisms for which two or more complete genome sequences exist, which has important implications for assembling the tree of life.

The ultimate goal of obtaining a well-supported and accurate representation of the tree of life relies on the assembly of phylogenomic data sets for large numbers of taxa [ 1 ]. Molecular phylogenies based on DNA sequences of a single locus or a few loci often suffer from low resolution and marginal statistical support due to limited character sampling. Individual gene genealogies also may differ from each other and from the organismal phylogeny (the "gene-tree vs. species-tree" issue) [ 2 , 3 ], in many cases due to systematic biases (i.e., compositional bias, long-branch attraction, heterotachy), leading to statistical inconsistency in phylogenetic reconstruction [ 4 – 7 ]. Phylogenomic data sets – using genome sequences to study evolutionary relationship – provide the best solution to these problems [ 1 , 8 ]. This approach requires compilation of large data sets that include many independent nuclear loci for many species [ 9 – 14 ]. Such data sets are less likely to succumb to sampling and systematic errors [ 13 ] by offering the possibility of analyzing large numbers of phylogenetically informative characters from different genomic locations, and also of corroborating phylogenetic results by varying the species sampled. If any systematic bias may be present in a fraction of individual loci sampled, it is unlikely that all affected loci will be biased in the same direction. Powerful analytical approaches that accommodate model heterogeneity among data partitions are becoming available to efficiently analyze such complex phylogenomic data sets [ 15 , 16 ].

Constructing phylogenomic data sets for large numbers of taxa still is, however, quite challenging. Most attempts to use this approach have been based either on the few available complete genomic sequence data [ 13 , 17 , 18 ], or on cDNA and EST sequences [ 9 , 12 , 18 , 19 ] for relatively few taxa. Availability of complete genomes limits the number of taxa that can be analyzed [ 13 , 17 ], imposing known problems for phylogenetic inference associated with poor taxon sampling [ 20 , 21 ]. On the other hand, methods based on ESTs or cDNA sequence data are not practical for many taxa because they require construction of cDNA libraries and fresh tissue samples. In addition, some genes may not be expressed in certain tissues or developmental stages, leading to cases with undesirable amounts of missing data [ 9 ]. The most efficient way to collect nuclear gene sequences for many taxa is to directly amplify target sequences using "universal" PCR primers, an approach so far used for just a few widely-used nuclear genes [ 22 – 25 ], or selected taxonomic groups (e.g., placental mammals and land plants). Widespread use of this strategy in most taxonomic groups has been hindered by the paucity of available PCR-targeted gene markers.

Mining genomic data to obtain candidate phylogenetic markers requires stringent criteria, since not all loci are likely to carry the appropriate historical signal. The phylogenetic informativeness of characters has been extensively debated on theoretical grounds [ 26 , 27 ], as well as in empirical cases [ 28 – 30 ]. Our study does not intend to contribute to this debate, but rather to focus on the practical issues involved in obtaining the raw data for analysis. What is the best strategy to select a few hundred candidate loci from thousands of genes present in the genome? For practical purposes, a good phylogenetic nuclear gene marker must satisfy three criteria.

First, orthologous genes should be easy to identify and amplify in all taxa of interest. One of the main problems associated with nuclear protein-coding genes used to infer phylogeny is uncertainty about their orthology [ 3 ]. This is especially true when multiple copies of a target gene are amplified by PCR from whole genomic DNA. To minimize the chance of sampling paralogous genes among taxa (the trap of "mistaken paralogy" that will lead to gene-tree-species-tree discordance), our approach is initiated by searches for single-copy nuclear genes in genomic databases. Under this criterion, even if gene duplication events may have occurred during evolution of the taxa of interest (e.g., the fish-specific whole-genome duplication event) [ 31 , 32 ], duplicated copies of a single-copy nuclear gene tend to be lost quickly, possibly due to dosage compensation [ 33 ]. Some authors estimate that almost 80% of the paralogs have been secondarily lost following the genome-duplication event [ 34 , 35 ]. Thus, if duplicated copies are lost before the relevant speciation events occur (Figure 1a, b ), no paralogous gene copies would be sampled. If the alternative situation occurs (Figure 1c ), paralogy will mislead phylogenetic inference, resulting in topological discordance among genes. In the latter case, the topological distribution of this discordance may be used to reconstruct putative duplication/extinction events and clarify the putative mistaken paralogy [ 36 ].

The second criterion used to facilitate efficient data collection is to identify protein-coding genes with long exons (longer than a practical threshold determined by current DNA sequencing technology, for example 800 bp). Most genes are fragmented into small exons and large introns. For high taxonomic-level phylogenetic inference (deep phylogeny), intron sequences evolve too fast and are usually not informative, becoming an obstacle for the amplification and sequencing of more informative exon-coding sequences.

The third criterion used seeks to identify reasonably conserved genes. Genes with low rates of evolution are less prone to accumulate homoplasy, and also provide the practical advantage of facilitating the design of universal primers for PCR that will work on a diversity of taxa. Furthermore, conserved protein-coding genes also are easy to align for analysis, based on their amino acid sequence.

figure 1

Single-copy genes are useful markers for phylogeny inference . Gene duplication and subsequent loss may not cause incongruence between gene tree and species tree if gene loss occurs before the first speciation event (a), or before the second speciation event (b). The only case that would cause incongruence is when the gene survived both speciation events and is asymmetrically lost in taxon 2 and taxon 3 (c).

Sequence conservatism and long exonic regions have been used as preferred criteria to select phylogenetic markers in the past [ 37 ]. However, finding many preferred, easy-to-apply gene markers is unlikely when candidate genes are manually screened from databases or taken from isolated studies of a few individual genes. This complexity partially explains the scarcity of currently available nuclear gene markers in many taxonomic groups. To address the problem, we present a simple bioinformatic approach to obtain nuclear gene markers from complete genomic data, based on the three aforementioned criteria. Our method incorporates two improvements over the traditional way of manually picking genes and testing their phylogenetic utilities. These improvements include using full genomic information and automating the process of searching for candidate markers. We apply the method to Actinopterygii (ray-finned fish), the largest vertebrate clade – they make up about half of all known vertebrate species – with a poorly known phylogeny [ 38 – 42 ]. We also present experimental tests to show that PCR primers designed for a subset of the candidate markers can efficiently amplify these markers for a highly diverse sample of ray-finned fishes. Comparative analyses of the sequences obtained show encouraging phylogenetic properties for future studies.

The bioinformatic pipeline used is shown in Figure 2 . Within-genome sequence comparisons resulted in 2,797 putative single-copy exons (> 800 bp) in zebrafish ( D. rerio ), 1,822 in torafugu ( T. rubripes ), 2,132 in stickleback ( G. aculeatus ), and 1,809 in Japanese rice fish ( O. latipes ). Note that our operational definition of a "single-copy" gene only requires that the fragment is not present as a second copy in the genome with similarity higher than 50%. Some single-copy genes may, in fact, have duplicates in the genome that are less than 50% similar. Pairwise between-genome comparisons of the single-copy exon sequences resulted in a range of 113 to 281 putative orthologs shared among genomes that have similarity greater than 70%. The lowest number of "conserved orthologs" was detected between zebrafish and rice fish, and the highest between torafugu and stickleback. The number of putative conserved orthologs shared among three or more genomes varied from case to case; for example, it peaked at 155 when comparing torafugu, Japanese rice fish, and stickleback, but was only 61 for the comparison involving torafugu, Japanese rice fish, and zebrafish. All the information resulting from these analyses is publicly available on our website [ 43 ], and a sample output of candidate markers is shown in Additional file 1 .
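
The thresholds quoted above can be combined into a single predicate, sketched below. It assumes the similarity values have already been computed by a sequence search and are expressed as percentages; the function and constant names are hypothetical, and the coverage parameter (C) is omitted because its value is not given in this passage.

```python
MIN_EXON_LENGTH = 800          # L: exons must be longer than this (bp)
MAX_WITHIN_SIMILARITY = 50.0   # S: no second copy more similar than this (%)
MIN_BETWEEN_SIMILARITY = 70.0  # Sx: ortholog in the other genome must exceed this (%)


def is_candidate_marker(
    exon_length: int,
    best_within_similarity: float,
    best_between_similarity: float,
) -> bool:
    """Apply the length, single-copy and conservation criteria to one exon."""
    long_enough = exon_length > MIN_EXON_LENGTH
    # Operationally single-copy: no other locus in the same genome above S.
    single_copy = best_within_similarity <= MAX_WITHIN_SIMILARITY
    # Conserved: the best match in the second genome exceeds Sx.
    conserved = best_between_similarity > MIN_BETWEEN_SIMILARITY
    return long_enough and single_copy and conserved
```

Relaxing `MIN_BETWEEN_SIMILARITY` would admit less conserved fragments, which may suit phylogenies of more closely related organisms.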

figure 2

The bioinformatic pipeline for phylogenetic markers development . It involves within- and across-genome sequences comparison, in silico test with sequences in other species, and experimental validation. Numbers of genes and exons identified for D. rerio are indicated by the asterisk. Exon length (L), within-genome similarity (S), between-genome similarity (Sx), and coverage (C) are adjustable parameters (see methods).

To investigate the properties of candidate markers, we analyzed those found in the zebrafish and torafugu comparison, since their genome sequences are well annotated. Among them, 154 putative homologs were identified between zebrafish and torafugu by cross-genome comparison. Further comparison with EST sequences from other fish species reduced this number to 138 candidate markers (Supplementary Table 1). The 154 candidate markers shared between these two genomes according to our search criteria are distributed among 24 of the 25 chromosomes of zebrafish, and a Chi-square test did not reject a Poisson distribution of markers among chromosomes (χ² = 16.99, df = 10, p = 0.0746). The size of candidate markers ranged from 802 to 5811 bp (in D. rerio ). Their GC content ranged from 41.6% to 63.9% (in D. rerio ), and the average similarity of the DNA sequence of these markers between D. rerio and T. rubripes varied from 77.3% to 93.2% (constrained by the search criteria).

To test the practical value of potential phylogenetic markers, 15 gene fragments were randomly picked from the candidate list of 154 and tested experimentally on 36 taxa, chosen to represent two-thirds of all ray-finned fish orders (see Additional file 2 ). PCR primers were designed on conserved flanking regions for each fragment, based on the genomic sequences and tested on all taxa (Table 1 ). Ten out of the 15 markers examined successfully amplified a single product of the predicted size by a nested PCR approach in 31 taxa. For comparative sequence analyses, we took only 14 taxa ( Amia calva , D. rerio , Semotilus atromaculatus , Ictalurus punctatus , Oncorhynchus mykiss , Brotula multibarbata , Fundulus heteroclitus , Oryzias latipes , Oreochromis niloticus , Gasterosteus aculeatus , Lycodes atlanticus , T. rubripes , Morone chrysops , Lutjanus mahogoni ) that could be amplified and sequenced directly for the set of 10 markers [GenBank: EF032909 – EF033038]. The size of the sequenced fragments ranged from 666 to 987 bp, and the average uncorrected genetic distances for DNA sequence of the 10 markers among the 14 taxa ranged from 13% to 21%. We present (Table 2 ) additional characteristics of the data set such as the substitution rate, consistency index (CI), gamma shape parameter (α), relative composition variability (RCV), and treeness [ 44 ] resulting from phylogenetic analysis of the sequences of the 10 new markers. Values obtained are similar to those observed in a commonly used phylogenetic marker – recombination activating gene 1 (RAG-1, Table 2 ). For the newly characterized phylogenetic markers, the substitution rate is negatively correlated with CI (r = -0.84, P = 0.0026) and marginally correlated with α (r = -0.56, P = 0.095). In contrast, base composition heterogeneity (RCV) and the phylogenetic signal to noise index (treeness index) are not correlated with substitution rate. 
Based on the treeness value, genes ENC1, plagl2, Ptr, Gylt and tbr1 seem well suited for phylogenetic studies at high taxonomic level among ray-finned fishes.
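
The "uncorrected genetic distance" reported above is the proportion of differing sites between two aligned sequences (the p-distance). A minimal sketch follows; the decision to skip sites with gaps or ambiguous bases is an assumption, since the handling of such sites is not described in the text.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip gaps and ambiguous bases (assumed convention).
        if a in "-N?" or b in "-N?":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared
```

For example, two aligned fragments of eight bases that differ at one site have a p-distance of 0.125.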

A phylogeny of the 14 taxa using concatenated sequences of all 10 markers (total of 7,872 bp) was inferred on the basis of protein and DNA sequences. For the protein sequence data, a JTT model with gamma parameter accounting for rate heterogeneity was selected by ProtTest [ 45 ]. The data were partitioned by gene, as this strategy was favoured by the Akaike information criterion (AIC) over treating the concatenated sequences as a single partition. Maximum likelihood (ML) and Bayesian analysis (BA) resulted in the same tree (Figure 3a ). A similar topology to Figure 3a was obtained by ML analysis of nucleotide sequences with RY-coded nucleotides to address potential artefacts due to base compositional bias [ 44 ]. The positions of Brotula and Morone remain somewhat unresolved, receiving low bootstrap support and conflicting resolution based on either protein or RY-coded nucleotide data. When analyzed separately, all individual gene trees display low support in many branches and none of them has the same topology as the "total evidence" tree based on all 10 genes (see Additional file 3 ). However, only 6 individual genes exhibit significant differences with the total evidence tree (based on one-tailed SH tests with p < 0.05), the exceptions being myh6 (p = 0.113), Gylt (p = 0.091), plagl2 (p = 0.056), and sreb2 (p = 0.080).
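
RY-coding, mentioned above, recodes purines (A, G) as R and pyrimidines (C, T) as Y so that only transversions remain as differences, which reduces the effect of base compositional bias. A minimal sketch, with a hypothetical function name and with gaps and ambiguity codes left unchanged by assumption:

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_TABLE = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def ry_code(sequence: str) -> str:
    """Recode a nucleotide sequence into purine (R) and pyrimidine (Y) states."""
    # Characters outside A, C, G, T (gaps, N) pass through unchanged.
    return sequence.upper().translate(RY_TABLE)
```

After recoding, a transition such as A to G no longer counts as a difference, while a transversion such as A to C still does.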

Figure 3. A comparison of the maximum likelihood phylogram inferred in this study with the conventional phylogeny. (a) Left panel: the phylogram of 14 taxa inferred from protein sequences of 10 genes; (b) right panel: a "consensus" phylogeny following Nelson [50]. The numbers on the branches are Bayesian posterior probabilities, ML bootstrap values estimated from protein sequences, and ML bootstrap values estimated from RY-coded nucleotide sequences. Asterisks indicate bootstrap support below 50.

The bioinformatic approach implemented in this study resulted in a large set (154 loci for the zebrafish and torafugu comparison) of candidate genes to infer high-level phylogeny of ray-finned fishes. The actual number of candidate loci depended on the genomes being compared and the fixed search parameters. Experimental tests of a smaller subset (15 loci) demonstrate that a large fraction (2/3) of these candidates are easily amplified by PCR from whole genomic DNA extractions in a vast diversity of fish taxa. The assumption that these loci are represented by a single copy in the fish genomes could not be rejected by the PCR assays in the species tested (all amplifications resulted in a single product), increasing the likelihood that the genetic markers are orthologous and suitable to infer organismal phylogeny. Our method is based on searching, under specific criteria, the available complete genomic databases of organisms closely related to the taxa of interest. Therefore, the same approach that is shown to be successful for fishes could be applied to other groups of organisms for which two or more complete genome sequences exist. Parameter values (L, S, and C) used for the search (Figure 2 ) may be altered to obtain fragments of different size or with different levels of conservation (i.e., less conserved for phylogenies of more closely related organisms).

An alternative way to develop nuclear gene markers for phylogenetic studies is to construct a cDNA library or sequence several ESTs for a small pilot group of taxa, and then to design specific PCR primers to amplify the orthologous gene copy in all the other taxa of interest [ 19 , 46 ]. The major potential problem with this approach stems from the fact that the method starts with a cDNA library or a set of EST sequences, with no prior knowledge of how many copies a gene has in each genome. As discussed above, this condition may lead to mistaken paralogy. In our approach, we search the genomic database to find single-copy candidates so no duplicate gene copies, if present, would be missed (see below).

Recent studies have proposed whole genome duplication events during vertebrate evolution and also genome duplications restricted to ray-finned fishes [ 31 , 32 , 47 , 48 ]. Our results indicate that many single-copy genes still exist in a wide diversity of fish taxa (representing 28 orders of actinopterygian fishes), in agreement with previous estimates that a vast majority of duplicated genes are secondarily lost [ 34 , 35 ]. All 154 candidates were identified as single-copy genes in D. rerio and T. rubripes, according to our search criteria. Our results also show the 154 candidate genes are randomly distributed in the fish genome (at least among chromosomes of D. rerio). In the experimental tests, 10 out of 15 markers were found in single-copy condition in all successful amplifications, including the tetraploid species, O. mykiss. However, when the search criteria were relaxed to include targets less than 50% similar in a subsequent BLAST search against the zebrafish genome, 7 of the 10 genes were found to have "alignable paralogs" (the 3 exceptions were myh6, tbr1, and Gylt). Genomes of medaka, stickleback, and fugu were also checked for these 3 genes, and no "paralogs" were detected, suggesting the sequences of ray-finned fish collected for these 3 genes are unambiguously orthologous to each other. Phylogenetic analyses for each of the 7 genes that include the putative paralogs found by this procedure produced tree topologies that strongly suggest an ancient duplication event in the vertebrate lineage, before the divergence of tetrapods from ray-finned fishes. Paralogous sequences are placed at the base of the tetrapod-actinopterygian divergence, or as part of a basal polytomy with the other tetrapod and ray-finned fish sequences. In the terminology proposed by Remm et al. [ 49 ] these would be considered out-paralogs.
In no case are these sequences nested among ingroup actinopterygian sequences (see Additional file 4), as would be expected for in-paralogs [ 49 ]. The stringent search criteria implemented in our approach, followed by phylogenetic analysis, can distinguish between orthologs and putative out-paralogs. Although the method will not guarantee that single-copy genes amplified by PCR in several taxa are orthologs as opposed to in-paralogs, the existence and identification of genome-scale single-copy nuclear markers should facilitate the construction of the tree of life, even if the evolutionary mechanism responsible for maintaining single-copy genes is poorly known [ 33 ].

The molecular evolutionary profiles of the 10 newly developed markers are in the same range as RAG-1, a widely used gene marker in vertebrates. The genes with high treeness values have intermediate substitution rates, suggesting that optimal rate and base composition stationarity are important factors that determine the suitability of a phylogenetic marker. The phylogeny based on individual markers revealed incongruent phylogenetic signal among 6 of the 10 individual genes. This incongruence suggests that significant biases in the data might obscure the true phylogenetic signal in some individual genes, but the direction of the bias is hardly shared among genes (Additional file 3), justifying the use of genome-scale gene markers to infer organismal phylogeny.

Finally, with respect to the phylogenetic results per se , there are two significant areas of discrepancy between the phylogeny obtained in this study (Figure 3a ) and a consensus view of fish phylogeny (Figure 3b ) [ 50 ]. Although these differences could be due to poor taxonomic sampling, we discuss them briefly. First, the traditional tree groups cichlids with other perciforms, whereas our results showed the cichlid O. niloticus is more closely related to atherinomorphs (Cyprinodontiformes + Beloniformes) than to other perciforms. This result also was supported by two recent studies analysing multiple nuclear genes [ 17 , 51 ]. The second difference is that the traditional tree groups Lycodes with other perciforms, while Lycodes was found closely related to Gasterosteus (Gasterosteiformes) in our results. Interestingly, the sister-taxa relationship between Lycodes and Gasterosteus also is supported by recent studies using mitochondrial genome data [ 38 , 52 ]. The difference between our "total evidence" tree and the classical hypothesis is significant based on the new data, as indicated by a one-tailed Shimodaira-Hasegawa (SH) test (p = 0.000) [ 53 ].

We developed a genome-based approach to identify nuclear gene markers for phylogeny inference that are single-copy, contain large exons, and are conserved across extensive taxonomic distances. We show that our approach has practical value through direct experimentation on a representative sample of ray-finned fish, the largest vertebrate clade in need of phylogenetic resolution. The same approach, however, could be applied to other groups of organisms as long as two or more complete genome sequences are available. This research may have important implications for assembling the tree of life.

Genome-scale mining for phylogenetic markers

Whole genomic sequences of Danio rerio and Takifugu rubripes were retrieved from the ENSEMBL database [ 54 ]. Exon sequences with length > 800 bp were then extracted from the genome databases. The exons extracted were compared in two steps: (1) within-genome sequence comparisons and (2) between-genome comparisons. The first step is designed to generate a set of single-copy nuclear gene exons (length > 800 bp) within each genome, whereas the second step should identify single-copy, putatively orthologous exons between D. rerio and T. rubripes (Figure 2). The BLAST algorithm was used for sequence similarity comparison. In addition to the parameters available in the BLAST program, we applied another parameter, coverage (C), to identify global sequence similarity between exons. The coverage was defined as the ratio of the total length of locally aligned sequences to the length of the query sequence. The similarity (S) was set to S < 50% for within-genome comparison, which means that only genes that have no counterpart more than 50% similar to themselves were kept. The similarity was set to S > 70% and the coverage was set to C > 30% in cross-genome comparison, which selected genes that are more than 70% similar and more than 30% aligned between D. rerio and T. rubripes. Subsequent comparisons were performed on the newly available genomes of stickleback (Gasterosteus aculeatus) and Japanese rice fish (Oryzias latipes), as described above. We automated this procedure in Perl and made the source code publicly available on our website [ 43 ]. We are working to make it available for other genomic sequences and parameter values.
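The two filtering steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' Perl pipeline: the `Hit` record, the function names, the merging of overlapping local alignments before computing coverage, and the use of the best local alignment's percent identity as the similarity S are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One local alignment of a query exon (hypothetical record layout)."""
    subject_id: str
    q_start: int      # 1-based, inclusive, on the query
    q_end: int
    identity: float   # percent identity of this local alignment


def coverage(hits: list[Hit], query_length: int) -> float:
    """C: aligned query length / query length (overlaps merged by assumption)."""
    intervals = sorted(
        (min(h.q_start, h.q_end), max(h.q_start, h.q_end)) for h in hits
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / query_length


def is_single_copy(non_self_hits: list[Hit], max_similarity: float = 50.0) -> bool:
    """Within-genome step: keep exons with no counterpart at S >= 50%."""
    return all(h.identity < max_similarity for h in non_self_hits)


def is_candidate_ortholog(
    hits_to_subject: list[Hit],
    query_length: int,
    min_length: int = 800,
    min_similarity: float = 70.0,
    min_coverage: float = 0.30,
) -> bool:
    """Cross-genome step: L > 800 bp, S > 70%, C > 30%."""
    if query_length <= min_length or not hits_to_subject:
        return False
    best_identity = max(h.identity for h in hits_to_subject)
    return (
        best_identity > min_similarity
        and coverage(hits_to_subject, query_length) > min_coverage
    )
```

Because the thresholds are keyword arguments, the L, S, and C values can be varied to retrieve shorter fragments or less conserved loci without touching the filter logic.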

Experimental testing for candidate markers

PCR and sequencing primers were designed on aligned sequences of D. rerio and T. rubripes for 15 randomly selected genes. Primer3 was used to design the primers [ 55 ]. Degenerate primers and a nested-PCR design were used to ensure amplification of each gene in most of the taxa. Ten of the 15 genes tested were amplified as a single fragment in most of the 36 taxa examined. PCR primers for the 10 gene markers are listed in Table 1. The amplified fragments were directly sequenced, without cloning, using the BigDye system (Applied Biosystems). Sequences of the frequently used RAG1 gene were retrieved for the same taxa from GenBank for comparison to the newly developed markers [GenBank: AY430199, NM_131389, U15663, AB120889, DQ492511, AY308767, AF108420, EF033039 – EF033043]. When RAG1 sequences for the same taxa were not available, a taxon of the same family was used, i.e. Nimbochromis was used instead of Oreochromis and Neobythites was used instead of Brotula.

Phylogenetic analysis

Sequences of the 10 new markers in the 14 taxa were used in phylogenetic analysis to assess their performance. Sequences were aligned using ClustalX [ 56 ] on the translated protein sequences. Uncorrected genetic distances were calculated using PAUP [ 57 ]. The relative substitution rate for each marker was estimated using a Bayesian approach [ 58 ]. Relative composition variability (RCV) and treeness were calculated following Phillips and Penny [ 44 ]. ProtTest [ 45 ] was used to choose the best model for the protein sequence data, and the AIC was used to determine the data partitioning scheme. Bayesian analysis implemented in MrBayes v3.1.1 and maximum likelihood analysis implemented in TreeFinder [ 59 ] were performed on the protein sequences. One million generations with 4 chains were run for the Bayesian analysis, and the trees sampled prior to reaching convergence were discarded (as burn-in) before computing the consensus tree and posterior probabilities. Two independent runs were used to provide additional confirmation of convergence of the posterior probability distribution. Given the biased base composition in the nucleotide data indicated by the RCV value (Table 2), we analyzed the nucleotide data under the RY-coding scheme (C and T = Y, A and G = R), partitioned by gene in TreeFinder, since RY-coded data are less sensitive to base compositional bias [ 44 ]. Alternative hypotheses were tested by the one-tailed Shimodaira and Hasegawa (SH) test [ 53 ] with 1000 RELL bootstrap replicates implemented in TreeFinder.
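The RY-coding scheme and the two summary statistics named above are simple enough to sketch directly. The recoding follows the mapping given in the text (C and T = Y, A and G = R). The RCV and treeness functions follow our reading of the definitions in Phillips and Penny [ 44 ]: RCV as the summed absolute deviation of each taxon's base counts from the across-taxon mean, divided by the number of taxa times the number of sites, and treeness as the share of total tree length found on internal branches. The function names and input layout are assumptions made for illustration.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_MAP = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def ry_code(sequence: str) -> str:
    """Recode a nucleotide sequence; gaps and other symbols are left as-is."""
    return sequence.upper().translate(RY_MAP)


def rcv(alignment: dict[str, str]) -> float:
    """Relative composition variability for an aligned set of sequences."""
    seqs = [s.upper() for s in alignment.values()]
    n = len(seqs)          # number of taxa
    t = len(seqs[0])       # number of sites
    if any(len(s) != t for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for base in "ACGT":
        counts = [s.count(base) for s in seqs]
        mean = sum(counts) / n
        total += sum(abs(c - mean) for c in counts)
    return total / (n * t)


def treeness(internal_lengths: list[float], terminal_lengths: list[float]) -> float:
    """Proportion of total tree length that lies on internal branches."""
    internal = sum(internal_lengths)
    return internal / (internal + sum(terminal_lengths))
```

A marker whose taxa share the same base composition gives an RCV of 0, and larger values indicate stronger compositional heterogeneity, which is the situation that motivates RY-coding in the first place.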

Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005, 6 (5): 361-375. 10.1038/nrg1603.

Pamilo P, Nei M: Relationships Between Gene Trees and Species Trees. Mol Biol Evol. 1988, 5 (5): 568-583.

Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool. 1970, 19 (2): 99-113. 10.2307/2412448.

Lopez P, Casane D, Philippe H: Heterotachy, an important process of protein evolution. Mol Biol Evol. 2002, 19 (1): 1-7.

Felsenstein J: Case in which parsimony or compatibility methods will be positively misleading. Syst Biol. 1978, 27: 401-410.

Weisburg WG, Giovannoni SJ, Woese CR: The Deinococcus-Thermus phylum and the effect of rRNA composition on phylogenetic tree construction. Syst Appl Microbiol. 1989, 11: 128-134.

Foster PG, Hickey DA: Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J Mol Evol. 1999, 48 (3): 284-290. 10.1007/PL00006471.

Eisen JA, Fraser CM: Phylogenomics: intersection of evolution and genomics. Science. 2003, 300 (5626): 1706-1707. 10.1126/science.1086292.

Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, Casane D: Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol. 2004, 21 (9): 1740-1752. 10.1093/molbev/msh182.

Driskell AC, Ane C, Burleigh JG, McMahon MM, O'Meara B C, Sanderson MJ: Prospects for building the tree of life from large sequence databases. Science. 2004, 306 (5699): 1172-1174. 10.1126/science.1102036.

Takezaki N, Figueroa F, Zaleska-Rutczynska Z, Klein J: Molecular phylogeny of early vertebrates: monophyly of the agnathans as revealed by sequences of 35 genes. Mol Biol Evol. 2003, 20 (2): 287-292. 10.1093/molbev/msg040.

Bapteste E, Brinkmann H, Lee JA, Moore DV, Sensen CW, Gordon P, Durufle L, Gaasterland T, Lopez P, Muller M, Philippe H: The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc Natl Acad Sci U S A. 2002, 99 (3): 1414-1419. 10.1073/pnas.032662799.

Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003, 425 (6960): 798-804. 10.1038/nature02053.

Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O'Brien SJ: Molecular phylogenetics and the origins of placental mammals. Nature. 2001, 409 (6820): 614-618. 10.1038/35054550.

Brandley MC, Schmitz A, Reeder TW: Partitioned Bayesian analyses, partition choice, and the phylogenetic relationships of scincid lizards. Syst Biol. 2005, 54 (3): 373-390. 10.1080/10635150590946808.

Castoe TA, Doan TM, Parkinson CL: Data partitions and complex models in Bayesian analysis: the phylogeny of Gymnophthalmid lizards. Syst Biol. 2004, 53 (3): 448-469. 10.1080/10635150490445797.

Chen WJ, Ortí G, Meyer A: Novel evolutionary relationship among four fish model systems. Trends Genet. 2004, 20 (9): 424-431. 10.1016/j.tig.2004.07.005.

Rokas A, Kruger D, Carroll SB: Animal evolution and the molecular signature of radiations compressed in time. Science. 2005, 310 (5756): 1933-1938. 10.1126/science.1116759.

Whittall JB, Medina-Marino A, Zimmer EA, Hodges SA: Generating single-copy nuclear gene data for a recent adaptive radiation. Mol Phylogenet Evol. 2006, 39 (1): 124-134. 10.1016/j.ympev.2005.10.010.

Hillis DM, Pollock DD, McGuire JA, Zwickl DJ: Is sparse taxon sampling a problem for phylogenetic inference?. Syst Biol. 2003, 52 (1): 124-126. 10.1080/10635150390132911.

Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu YL, Chase MW, Farris JS, Stefanovic S, Rice DW, Palmer JD, Soltis PS: Genome-scale data, angiosperm relationships, and "ending incongruence": a cautionary tale in phylogenetics. Trends Plant Sci. 2004, 9 (10): 477-483. 10.1016/j.tplants.2004.08.008.

Lovejoy NR, Collette BB: Phylogenetic relationships of new world needlefishes (Teleostei: Belonidae) and the biogeography of transitions between marine and freshwater habitats. Copeia. 2001, 2001 (2): 324-338. 10.1643/0045-8511(2001)001[0324:PRONWN]2.0.CO;2.

Saint KM, Austin CC, Donnellan SC, Hutchinson MN: C-mos, a nuclear marker useful for squamate phylogenetic analysis. Mol Phylogenet Evol. 1998, 10 (2): 259-263. 10.1006/mpev.1998.0515.

Mohammad-Ali K, Eladari ME, Galibert F: Gorilla and orangutan c-myc nucleotide sequences: inference on hominoid phylogeny. J Mol Evol. 1995, 41 (3): 262-276. 10.1007/BF01215173.

Groth JG, Barrowclough GF: Basal divergences in birds and the phylogenetic utility of the nuclear RAG-1 gene. Mol Phylogenet Evol. 1999, 12 (2): 115-123. 10.1006/mpev.1998.0603.

Lyons-Weiler J, Hoelzer GA, Tausch RJ: Relative apparent synapomorphy analysis (RASA). I: The statistical measurement of phylogenetic signal. Mol Biol Evol. 1996, 13 (6): 749-757.

Philippe H, Zhou Y, Brinkmann H, Rodrigue N, Delsuc F: Heterotachy and long-branch attraction in phylogenetics. BMC Evol Biol. 2005, 5: 50. 10.1186/1471-2148-5-50.

Collins TM, Fedrigo O, Naylor GJ: Choosing the best genes for the job: the case for stationary genes in genome-scale phylogenetics. Syst Biol. 2005, 54 (3): 493-500. 10.1080/10635150590947339.

Steel MA, Lockhart PJ, Penny D: Confidence in evolutionary trees from biological sequence data. Nature. 1993, 364 (6436): 440-442. 10.1038/364440a0.

Phillips MJ, Delsuc F, Penny D: Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 2004, 21 (7): 1455-1458. 10.1093/molbev/msh137.

Amores A, Force A, Yan YL, Joly L, Amemiya C, Fritz A, Ho RK, Langeland J, Prince V, Wang YL, Westerfield M, Ekker M, Postlethwait JH: Zebrafish hox clusters and vertebrate genome evolution. Science. 1998, 282 (5394): 1711-1714. 10.1126/science.282.5394.1711.

Meyer A, Van de Peer Y: From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). Bioessays. 2005, 27 (9): 937-945. 10.1002/bies.20293.

Ciccarelli FD, von Mering C, Suyama M, Harrington ED, Izaurralde E, Bork P: Complex genomic rearrangements lead to novel primate gene function. Genome Res. 2005, 15 (3): 343-351. 10.1101/gr.3266405.

Woods IG, Wilson C, Friedlander B, Chang P, Reyes DK, Nix R, Kelly PD, Chu F, Postlethwait JH, Talbot WS: The zebrafish gene map defines ancestral vertebrate chromosomes. Genome Res. 2005, 15 (9): 1307-1314. 10.1101/gr.4134305.

Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, Mauceli E, Bouneau L, Fischer C, Ozouf-Costaz C, Bernot A, Nicaud S, Jaffe D, Fisher S, Lutfalla G, Dossat C, Segurens B, Dasilva C, Salanoubat M, Levy M, Boudet N, Castellano S, Anthouard V, Jubin C, Castelli V, Katinka M, Vacherie B, Biemont C, Skalli Z, Cattolico L, Poulain J, De Berardinis V, Cruaud C, Duprat S, Brottier P, Coutanceau JP, Gouzy J, Parra G, Lardier G, Chapple C, McKernan KJ, McEwan P, Bosak S, Kellis M, Volff JN, Guigo R, Zody MC, Mesirov J, Lindblad-Toh K, Birren B, Nusbaum C, Kahn D, Robinson-Rechavi M, Laudet V, Schachter V, Quetier F, Saurin W, Scarpelli C, Wincker P, Lander ES, Weissenbach J, Roest Crollius H: Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature. 2004, 431 (7011): 946-957. 10.1038/nature03025.

Page RD, Cotton JA: Vertebrate phylogenomics: reconciled trees and gene duplications. Pac Symp Biocomput. 2002, 536-547.

Friedlander TP, Regier JC, Mitter C: Nuclear gene sequences for higher level phylogenetic analysis: 14 promising candidates. Syst Biol. 1992, 41 (4): 483-490. 10.2307/2992589.

Miya M, Takeshima H, Endo H, Ishiguro NB, Inoue JG, Mukai T, Satoh TP, Yamaguchi M, Kawaguchi A, Mabuchi K, Shirai SM, Nishida M: Major patterns of higher teleostean phylogenies: a new perspective based on 100 complete mitochondrial DNA sequences. Mol Phylogenet Evol. 2003, 26 (1): 121-138. 10.1016/S1055-7903(02)00332-9.

Stiassny MLJ, Wiley EO, Johnson GD, de Carvalho MR: Gnathostome fishes. Assembling The Tree of Life. Edited by: Cracraft J, Donoghue MJ. 2004, New York , Oxford University Press, 410-429.

Stiassny MLJ, Parenti LR, Johnson GD: Interrelationships of fishes. 1996, San Diego, Academic Press, xiii, 496 p.

Arratia G: Phylogenetic relationships of teleostei: past and present. Estud Oceanol. 2000, 19: 19-51.

Greenwood PH, Miles RS, Patterson C: Interrelationships of fishes. 1973, London, Academic Press, 536 p.

Phylomarker - mining phylogenetic markers for assembling the Tree of Life [http://bioinfo-srv1.awh.unomaha.edu/phylomarker].

Phillips MJ, Penny D: The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003, 28 (2): 171-185. 10.1016/S1055-7903(03)00057-5.

Abascal F, Zardoya R, Posada D: ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 2005, 21 (9): 2104-2105. 10.1093/bioinformatics/bti263.

Small RL, Cronn RC, Wendel JF: L. A. S. Johnson Review No. 2. Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany. 2004, 17: 145-170. 10.1071/SB03015.

Taylor JS, Braasch I, Frickey T, Meyer A, Van de Peer Y: Genome duplication, a trait shared by 22000 species of ray-finned fish. Genome Res. 2003, 13 (3): 382-390. 10.1101/gr.640303.

Van de Peer Y, Taylor JS, Meyer A: Are all fishes ancient polyploids?. J Struct Funct Genomics. 2003, 3 (1-4): 65-73. 10.1023/A:1022652814749.

Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314 (5): 1041-1052. 10.1006/jmbi.2000.5197.

Nelson JS: Fishes of the world. 4th edition. 2006, New York, John Wiley and Sons, Inc., 601 pp.

Steinke D, Salzburger W, Meyer A: Novel Relationships Among Ten Fish Model Species Revealed Based on a Phylogenomic Analysis Using ESTs. J Mol Evol. 2006, 62 (6): 772-784. 10.1007/s00239-005-0170-8.

Miya M, Satoh TP, Nishida M: The phylogenetic position of toadfishes (order Batrachoidiformes) in the higher ray-finned fish as inferred from partitioned Bayesian analysis of 102 whole mitochondrial genome sequences. Biol J Linn Soc Lond. 2005, 85: 289-306. 10.1111/j.1095-8312.2005.00483.x.

Shimodaira H, Hasegawa M: Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol. 1999, 16: 1114-1116.

Ensembl [www.ensembl.org/index.html].

Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-386.

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997, 25 (24): 4876-4882. 10.1093/nar/25.24.4876.

Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. 2003, Sinauer Associates, Sunderland, Massachusetts.

Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003, 19 (12): 1572-1574. 10.1093/bioinformatics/btg180.

Jobb G, von Haeseler A, Strimmer K: TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. BMC Evol Biol. 2004, 4: 18. 10.1186/1471-2148-4-18.

Acknowledgements

This work was supported by grants from the University of Nebraska-Lincoln (to C. L.), the National Science Foundation, DEB-9985045 (to G. O.), and the University of Nebraska-Omaha (to G. L.). We thank Fred J. Potmesil and Thaine W. Rowley for help in computer programming.

Author information

Authors and affiliations.

School of Biological Sciences, University of Nebraska, Lincoln, NE, 68588, USA

Chenhong Li & Guillermo Ortí

Department of Mathematics, University of Nebraska, Omaha, NE, 68182, USA

Department of Biology, University of Nebraska, Omaha, NE, 68182, USA

Corresponding authors

Correspondence to Chenhong Li or Guoqing Lu .

Additional information

Authors' contributions.

CL conceived of the study, designed the bioinformatic pipeline, carried out the experimental tests, and drafted the manuscript. GO conceived of the study and helped to draft the manuscript. GZ implemented the bioinformatics pipeline and developed the web pages. GL conceived of the study, designed the bioinformatics pipeline and the web pages, and helped to draft the manuscript. All authors have read and approved the final manuscript.

Electronic supplementary material

12862_2006_338_moesm1_esm.rtf.

Additional file 1: Exon ID, exon length, and GC content of predicted single-copy nuclear gene markers in zebrafish and torafugu, as well as the BLAST result between orthologous genes. (RTF 249 KB)

Additional file 2: Results of PCR amplification of 10 new markers in 36 species of ray-finned fishes. (RTF 165 KB)

12862_2006_338_moesm3_esm.pdf.

Additional file 3: Maximum likelihood phylogenies based on protein sequences of individual genes: zic1, myh6, RYR3, Ptr, tbr1, ENC1, Gylt, SH3PX3, plagl2, and sreb2. Bootstrap values higher than 50% are mapped on branches. (PDF 7 KB)

12862_2006_338_MOESM4_ESM.pdf

Additional file 4: ML phylogenies based on protein sequences of individual genes and their out-paralogs found by relaxing our search criteria to include fragments with similarity < 50%. (PDF 9 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article.

Li, C., Ortí, G., Zhang, G. et al. A practical approach to phylogenomics: the phylogeny of ray-finned fish (Actinopterygii) as a case study. BMC Evol Biol 7 , 44 (2007). https://doi.org/10.1186/1471-2148-7-44

Received : 25 September 2006

Accepted : 20 March 2007

Published : 20 March 2007

DOI : https://doi.org/10.1186/1471-2148-7-44

  • Candidate Marker
  • Phylogenetic Marker
  • Compositional Bias
  • Phylogenetic Informativeness
  • Organismal Phylogeny

BMC Ecology and Evolution

ISSN: 2730-7182

  • Research article
  • Open access
  • Published: 25 April 2022

Phylogenomic approaches untangle early divergences and complex diversifications of the olive plant family

  • Wenpan Dong   ORCID: orcid.org/0000-0002-8371-3246 1 ,
  • Enze Li 1 ,
  • Yanlei Liu 2 ,
  • Chao Xu 2 ,
  • Yushuang Wang 1 ,
  • Kangjia Liu 1 ,
  • Xingyong Cui 1 ,
  • Jiahui Sun 3 ,
  • Zhili Suo 2 ,
  • Zhixiang Zhang 1 ,
  • Jun Wen 4 &
  • Shiliang Zhou 2  

BMC Biology volume 20, Article number: 92 (2022)

Deep-branching phylogenetic relationships are often difficult to resolve because phylogenetic signals are obscured by the long history and complexity of evolutionary processes, such as ancient introgression/hybridization, polyploidization, and incomplete lineage sorting (ILS). Phylogenomics has been effective in providing information for resolving both deep- and shallow-scale relationships across all branches of the tree of life. The olive family (Oleaceae) is composed of 25 genera classified into five tribes with tribe Oleeae consisting of four subtribes. Previous phylogenetic analyses showed that ILS and/or hybridization led to phylogenetic incongruence in the family. It was essential to distinguish phylogenetic signal conflicts, and explore mechanisms for the uncertainties concerning relationships of the olive family, especially at the deep-branching nodes.

We used the whole plastid genome and nuclear single nucleotide polymorphism (SNP) data to infer the phylogenetic relationships and to assess the variation and rates among the main clades of the olive family. We also used 2608 and 1865 orthologous nuclear genes to infer the deep-branching relationships among tribes of Oleaceae and subtribes of tribe Oleeae, respectively. Concatenated and coalescence trees based on the plastid genome, nuclear SNPs and multiple nuclear genes suggest events of ILS and/or ancient introgression during the diversification of Oleaceae. Additionally, there was extreme heterogeneity in the substitution rates across the tribes. Furthermore, our results supported that introgression/hybridization, rather than ILS, is the main factor for phylogenetic discordance among the five tribes of Oleaceae. The tribe Oleeae is supported to have originated via ancient hybridization and polyploidy, and its most likely parentages are the ancestral lineage of Jasmineae or its sister group, which is a “ghost lineage,” and Forsythieae. However, ILS and ancient introgression are mainly responsible for the phylogenetic discordance among the four subtribes of tribe Oleeae.

Conclusions

This study showcases that using multiple sequence datasets (plastid genomes, nuclear SNPs and thousands of nuclear genes) and diverse phylogenomic methods such as data partition, heterogeneous models, quantifying introgression via branch lengths (QuIBL) analysis, and species network analysis can facilitate untangling long and complex evolutionary processes of ancient introgression, paleopolyploidization, and ILS.

Understanding the evolutionary processes remains central to addressing questions about diversification of life on Earth. One of the most difficult challenges in systematics and evolution is inferring the deep-branching relationships during periods of incomplete lineage sorting (ILS), ancient introgression/hybridization, polyploidization, and rapid radiation. Phylogenomic studies often focus on resolving deep-branching relationships, such as the root of angiosperms [ 1 , 2 ], the backbone of animals [ 3 ], the family relationships of asterids [ 4 ], the subfamilies of legumes [ 5 , 6 ], and deep recalcitrant relationships within a family [ 7 , 8 ]. These studies have shown that such relationships may remain unresolved even when large genome-scale molecular sequencing data are used, due to the discordant phylogenetic signals among genes from different genomes (nuclear, plastid and mitochondrial genomes) or different genomic regions [ 9 , 10 , 11 ]. However, phylogenomic analyses can provide effective information to gain insights into the complexity of evolutionary processes and the underlying causes of the lack of phylogenetic resolution and conflicting phylogenetic results.

One of the most significant phenomena in phylogenomic analyses is gene tree and species tree discordance in empirical studies. Gene tree discordance has numerous causes, such as substitution rate variation [ 12 ], gene duplication/loss, gene tree estimation errors, or random noise from uninformative genes [ 13 ], as well as ILS and introgression/hybridization [ 11 , 14 , 15 , 16 , 17 ]. Among these potential sources of gene tree discordance, ILS is recognized as the cause to explain conflicting genealogies [ 17 ]. ILS or deep coalescence describes the pattern due to stochasticity of the coalescent, representing the retention of ancestral polymorphism and fixation in the descendant lineages after speciation events due to stochastic genetic drift. Meanwhile, introgression/hybridization can similarly result in gene tree discordance. More recently, several methods have been developed to differentiate between the two or infer phylogenetic networks while accounting for ILS and introgression/hybridization simultaneously [ 18 , 19 , 20 ], but they are most commonly used at shallow phylogenetic scales, such as the species level [ 21 , 22 , 23 ]. For deeper phylogenetic scales (such as at the subfamily level or genus level), distinguishing true discordance causes can be challenging because the long history of evolutionary processes may obscure phylogenetic signals [ 6 , 24 , 25 ]. To overcome these limitations, comparing phylogenetic signals among genetic markers with different inheritances (plastid and nuclear genomes) and the use of multiple phylogenetic tools are essential to disentangle causes of phylogenetic conflict and provide insight into evolutionary histories.

The olive family (Oleaceae) is composed of 25 genera and approximately 600 species of temperate and tropical shrubs or woody climbers and trees distributed from the north temperate to the southern parts of Australia, Africa, and South America. Oleaceae are important components of temperate and tropical ecosystems [ 26 , 27 ]. Moreover, many Oleaceae species are economically important, e.g., olive ( Olea europaea ) is cultivated for its fruit and oil, Jasminum , Forsythia , Osmanthus , Syringa , and Ligustrum are cultivated extensively as ornamentals and for fragrances, and ash trees ( Fraxinus ) are grown for timber as well as ornamentals.

Within the Lamiales, Oleaceae is sister to the small tropical Asian family Carlemanniaceae, and this clade is an early-diverging group in Lamiales [ 4 , 28 ]. More than two decades after the first molecular phylogenies of the Oleaceae were inferred [ 26 ], the family is now supported to include five tribes (Myxopyreae, Fontanesieae, Forsythieae, Jasmineae, and Oleeae), and the tribe Oleeae is divided into four subtribes (Schreberinae, Ligustrinae, Fraxininae, and Oleinae). The evolutionary history of Oleaceae is very complex, e.g., Oleeae originated from paleopolyploid events with one of the parental genomes closely related to Jasminum [ 29 ], and some of the recognized genera are polyphyletic [ 26 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 ] or paraphyletic [ 38 ]. Furthermore, phylogenetic incongruence between plastid and nuclear data has been reported, suggesting ILS and/or hybridization within several genera [ 34 , 39 ]. Heterogeneous evolutionary rates among clades and genes might also account for conflicting relationships [ 35 , 36 , 37 ].

Previous molecular phylogenetic analyses did not resolve the origin and early evolution of the family well, including the deep-branching relationships among the five tribes and among the subtribes of Oleeae (Fig. 1 ). Six and four possible topologies among the five tribes and the four subtribes of Oleeae, respectively, appeared in previous studies and showed obvious incongruence when datasets from different genomes were used. Moreover, previous olive phylogenies have relied heavily on chloroplast and mitochondrial markers [ 39 , 41 ], and a handful of nuclear genes have shown different topologies [ 36 ]. Extensive sampling of molecular datasets, especially unlinked nuclear genes, which can account for the different evolutionary histories of individual genes, is preferable for inferring species trees and exploring the causes of conflicts at deep branches.

figure 1

Phylogenetic hypotheses of Oleaceae from previous studies. a–f The six alternate topologies of the five tribes. g–j The four alternate topologies of the four subtribes of Oleeae. a Dupin et al. [ 36 ] using the 80 concatenated plastid coding genes based on the maximum likelihood (ML) method. b Dupin et al. [ 36 ] using the 37 concatenated mitochondrial genes based on the ML method. c Dupin et al. [ 36 ] using the RY-coded nrDNA based on the ML method. d Ha et al. [ 40 ] using six cpDNA sequence datasets ( matK , rbcL , ndhF , atpB , rps16 , and trnL-F ) based on the Bayesian inference (BI) method and Dupin et al. [ 36 ] using the nuclear genes of phyB-1 and phyE-1. e Dupin et al. [ 36 ] using the nontransformed nrDNA cluster based on the ML method. f Wallander and Albert [ 26 ] using two plastid genes, rps16 and trnL-F , based on maximum parsimony (MP) methods. g Dupin et al. [ 36 ] using the 80 concatenated plastid coding genes, 37 concatenated mitochondrial genes, and RY-coded nrDNA based on the ML method. h Dupin et al. [ 36 ] using the nuclear genes of phyB-1 and phyE-1. i Van de Paer et al. [ 41 ] using the nuclear mtpt4 based on the ML method. j Dupin et al. [ 36 ] using the nontransformed nrDNA cluster based on the ML method. Myx, Myxopyreae; Fon, Fontanesieae; For, Forsythieae; Jas, Jasmineae; Ole, Oleeae; Lig, Ligustrinae; Sch, Schreberinae; Fra, Fraxininae; Olei, Oleinae

Beyond resolving the complex history of the olive family, our main objectives are to investigate the causes of the lack of resolution, distinguish phylogenetic signal conflicts, and explore alternative scenarios for the uncertainties concerning deep-branching relationships of the olive family. First, we estimated the olive family relationships using 180 samples from 24 genera representing all five tribes based on the whole plastid genomes and nuclear SNP datasets. These analyses were used to test whether markers of different inheritance caused the lack and/or conflict of phylogenetic signals. We employed multiple phylogenetic methods and data partitioning schemes to resolve recalcitrant relationships at both deep and shallow nodes. Second, we analyzed thousands of nuclear gene alignments harvested from whole genome sequencing and published complete genomes of representative species from the tribes of Oleaceae or the subtribes of Oleeae. Upon inferring the most likely species tree, we analyzed and distinguished the signals of gene tree discordance produced by ILS, introgression/hybridization, and hard polytomy among deep branches and explored the implications for understanding the early evolutionary diversification of the olive family.

Phylogenomic relationships based on plastid datasets and molecular evolutionary rate variation among clades of Oleaceae

To resolve the phylogeny of Oleaceae, we expanded the taxon sampling (Additional file 1 : Tables S1-S2), employed extensive data from plastid genomes, and used multiple methods to dissect the phylogenetic signals (Table 1 and Table 2 ) and explore information and conflicts among the phylogenetic trees. In total, seven plastid datasets were constructed to infer the phylogeny of Oleaceae (Table 1 ), and a total of 19 ML (maximum likelihood) trees (Table 2 ) were constructed based on different datasets and phylogenetic methods. The ML tree from the 180s77Gaa dataset under a gene partitioning scheme was used as our main reference or summary tree for iterative topological concordance analyses of the plastid gene trees (Fig. 2 , Table 2 , Additional file 2 : Fig. S1; the reason for using this tree as the reference tree is given in Additional file 3 ), which visualized the proportions of gene trees supporting each alternative topology. Our analyses revealed all tribes as monophyletic with full support. However, conflicting topologies were detected at several nodes among different trees (see below). The relationships among the tribes were less robustly resolved; in particular, the positions of Fontanesieae and Forsythieae conflicted in some analyses (Fig. 2 ). Myxopyreae, the first diverged lineage of the olive family, was strongly supported in all analyses. The plastid nucleotide sequence datasets and the 180s77Gaa dataset analyzed under the posterior mean site frequency (PMSF) model supported Forsythieae as sister to the clade comprising Fontanesieae, Jasmineae, and Oleeae (topology a (Myxopyreae (Forsythieae, (Fontanesieae, (Jasmineae, Oleeae)))) in Fig. 1 ).
In contrast to the plastid nucleotide sequence phylogeny, the analyses of the amino acid sequence data (180s77Gaa), except under the PMSF model, showed that Fontanesieae was sister to the clade comprising Forsythieae, Jasmineae, and Oleeae (topology d (Myxopyreae (Fontanesieae, (Forsythieae, (Jasmineae, Oleeae)))) in Fig. 1 ). However, this topology was weakly supported by the 180s77Gaa dataset, with bootstrap support values of 25%, 32%, and 35% under the three partitioning schemes (Table 3 and Additional file 1 : Table S3). This suggests that, for the plastid data, topology a of the five tribes in Fig. 1 is the most likely, given its high support values when the whole plastome data were used. The phylogenetic signal in the plastid data with regard to this topology appears to be sufficient. The sister relationship of Jasmineae and Oleeae was strongly supported in all analyses.

figure 2

Maximum likelihood phylogeny of Oleaceae inferred from RAxML analysis of the plastid 77G180saa dataset based on the gene partition models. Pie charts present the proportion of 19 plastid gene trees that support that clade (blue), or support the main alternative bifurcation (green), or support the remaining alternative (red), and the proportion that have < 80% bootstrap support (gray). Only pie charts for major clades are shown, and Additional file 2 : Fig. S1 shows pie charts for all nodes. Myx, Myxopyreae; Fon, Fontanesieae; For, Forsythieae; Jas, Jasmineae; Ole, Oleeae

In contrast to the problematic deep relationships of the family, our analyses robustly supported relationships among major clades within the tribe Oleeae. There was 100% support for the monophyly of each subtribe, and the topology of (Schreberinae (Ligustrinae, (Oleinae, Fraxininae))) (topology g in Fig. 1 ) was strongly supported by all the analyses, consistent with previous studies [ 35 , 36 ]. Within the Oleeae, at least seven genera were not monophyletic (i.e., Schrebera , Syringa , Chionanthus , Olea , Osmanthus , Phillyrea , and Nestegis ), and Chionanthus was the most complex polyphyletic genus (Fig. 2 ). Three genera ( Forestiera , Hesperelaea , and Priogymnanthus ) and the species Chionanthus ligustrinus formed a highly supported clade that was sister to the rest of the subtribe Oleinae. The internode certainty all (ICA) value for the backbone of Oleinae was low (Fig. 2 and Additional file 2 : Fig. S1), indicating major incongruence between species trees. The conflict can therefore, at least partially, reflect incomplete lineage sorting and/or introgression/hybridization [ 33 , 35 ].

The ML tree based on the plastid genome data showed significant differences in branch lengths (Fig. 2 and Fig. 3 b) among the tribes and subtribes of Oleaceae. The tribe Jasmineae and the Oleeae subtribe Ligustrinae had the longest branch lengths, while Forsythieae and Oleeae had relatively short branch lengths. Genetic distances showed a pattern similar to that of the branch lengths (Fig. 3 a).

figure 3

Variation in plastid substitution rates among clades of Oleaceae. a Genetic distance among clades/branches of Oleaceae. b Comparison of intratribal and intrasubtribal plastid branch lengths among the Oleaceae based on the ML tree of the “77G180snt” dataset using the gene partitioned model, as assessed by root-to-tip branch lengths, from the common ancestor of each respective clade to each sampled tip
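The root-to-tip measure described in the caption can be illustrated with a minimal Python sketch. The tree representation (a tuple of label, branch length, and list of children), the function name, and the branch lengths are all hypothetical and are not taken from the study's trees.

```python
def root_to_tip(node, distance=0.0, out=None):
    """Sum branch lengths from the clade's common ancestor to every tip.

    `node` is (label, branch_length, children); tips have an empty child list.
    This layout is an illustrative assumption, not the format used in the study.
    """
    if out is None:
        out = {}
    label, length, children = node
    here = distance + length
    if not children:
        out[label] = here
    for child in children:
        root_to_tip(child, here, out)
    return out


# Hypothetical clade with made-up branch lengths (substitutions per site).
toy_clade = ("crown", 0.0, [
    ("tipA", 0.02, []),
    ("inner", 0.01, [("tipB", 0.03, []), ("tipC", 0.05, [])]),
])
distances = root_to_tip(toy_clade)
```

Comparing the distribution of such distances between clades is one simple way to visualize lineage-specific rate differences before formal clock tests.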

Branch model tests in Baseml/PAML indicated that the results significantly departed from the null hypothesis that all rates were equal among clades (“global clock” model) (Table 4 ). Model M1, which allows a local clock for Jasmineae, had a significantly better fit than M0. The rates for Jasmineae branches were 5.58 times higher than the background (Table 4 ). Meanwhile, Model M2 (a local clock for Jasmineae and the Oleeae subtribe Ligustrinae) had a better fit than Model M1, and the rates for Jasmineae and the Oleeae subtribe Ligustrinae were 6.98 and 2.29 times higher than those for the remaining Oleaceae species. According to the AICc comparison and Bonferroni-corrected likelihood ratio tests, Model M3 was the best-fitting model, which indicated that branch rates varied among most clades of Oleaceae.
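The model comparison logic above (likelihood ratio tests between nested clock models, Bonferroni correction, and AICc) can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's pipeline: the function names are hypothetical, and the test assumes the two nested models differ by exactly one free rate parameter, so that the chi-square reference distribution has one degree of freedom.

```python
import math


def lrt_pvalue_df1(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models that differ by one parameter.

    With one degree of freedom the chi-square survival function reduces to
    erfc(sqrt(x / 2)), so only the standard library is needed.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, math.erfc(math.sqrt(stat / 2.0))


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def aicc(lnl, k, n):
    """Small-sample corrected AIC for log-likelihood lnl, k parameters, n sites."""
    return 2 * k - 2 * lnl + (2 * k * (k + 1)) / (n - k - 1)
```

Given the log-likelihoods reported by a clock analysis, the model with the lowest AICc is preferred, and a corrected p-value below the chosen threshold rejects the simpler clock model.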

Phylogenomic relationships of Oleaceae based on nuclear datasets

Following the methods of Olofsson et al. [ 35 ], we obtained three nuclear SNP datasets using the oleaster ( Olea europaea var. sylvestris ), ash ( Fraxinus excelsior ), and Forsythia suspensa nuclear genomes as the reference sequences (Table 1 ). Six gene trees were then reconstructed from these datasets using two phylogenetic methods (Table 2 ). Using the SNP-ash dataset, 41 gene trees were reconstructed. These results are shown in Fig. 4 , Additional file 2 : Fig. S2 and Fig. S3, respectively.

figure 4

Maximum likelihood phylogeny of Oleaceae inferred from RAxML analysis of the SNP-ash dataset. The left and the right pie charts present the proportion of nine SNP data trees and the proportion of 41 gene trees based on the dividing method using the SNP-ash dataset, respectively. The pie charts indicate support for that clade (blue), or support for the main alternative bifurcation (green), or support for the remaining alternative (red), and the proportion that have < 80% bootstrap support (gray). Only pie charts for major clades are shown, and Additional file 2 : Fig. S2 and S3 show pie charts for all nodes. Myx, Myxopyreae; Fon, Fontanesieae; For, Forsythieae; Jas, Jasmineae; Ole, Oleeae

All six gene trees from the three SNP datasets supported the monophyly of all tribes and of all subtribes of Oleeae, with concordant relationships among the deep nodes in the six trees (Fig. 4 and Additional file 2 : Fig. S2). The topology of (Myxopyreae (Fontanesieae, (Forsythieae, (Jasmineae, Oleeae)))) (topology d in Fig. 1 ) for the five tribes of Oleaceae and the topology of (Schreberinae (Ligustrinae, (Oleinae, Fraxininae))) (topology j in Fig. 1 ) for the four subtribes of Oleeae were strongly supported. The nuclear SNP datasets also inferred that seven genera were not monophyletic in Oleeae. Most of the backbone of Oleinae was resolved with high ICA values. However, some nodes had major conflicts among the gene trees, such as the backbone of Fraxinus (Additional file 2 : Fig. S2).

At the tribe level, the backbone relationships had low support and showed conflicting phylogenetic signals (Fig. 4 ) using the SNP dataset, indicating a complex early evolutionary history. The four subtribes of Oleeae were well supported, consistent with the whole SNP dataset results. The SNP dataset suggested that some shallow nodes had conflicting phylogenetic signals, e.g., the species relationships within Ligustrum and Olea (Additional file 2 : Fig. S3).

Assessing phylogenetic relationships and conflicts of phylogenetic signals

Half of the nodes had a consistent topology among the 25 gene trees (plastid and nuclear SNP datasets, Additional file 2 : Fig. S4); however, the backbone of the family was characterized by high levels of gene tree discordance. The most significant conflicting nodes are at the tribe level, and our data supported two alternative topologies (topologies a and d in Fig. 1 ). The incongruence was higher at the shallow branches, but in general, at most conflicting nodes the majority of gene trees were uninformative (Additional file 2 : Fig. S4). For example, most trees (17/25) were uninformative at the node of the sister group relationship between Olea javanica and the clade consisting of O. neriifolia , O. parvilimba , and O. brachiata . Insufficient information could lead to spurious tree inference, thus producing noise and/or conflict.

Overall, the three types of datasets showed incongruence in topology when compared with trees derived from implicit (e.g., distance-based) analyses (Fig.  5 a). The nuclear SNP trees, in particular, had high support values in the backbone branches. This high resolution is directly related to the larger sampling of parsimony-informative sites (Table  1 ). On the other hand, the phylogenetic relationships recovered by the plastid data were impacted by the robustness of the method. Meanwhile, the nuclear SNPs sampled across the genome are probably unlinked, while the plastid genes constitute just a single locus. These two types of datasets hence track different evolutionary histories, leading to the incongruence in topology.

figure 5

Comparison of topologies of multiple gene trees. Twenty-five gene trees were reconstructed based on the 77 plastid coding genes, plastome data, and SNP datasets. a Matrix of Robinson-Foulds (RF) distance, which measures the overall topological discrepancy between two trees. The numbers on the x -axis and y -axis represent the gene trees, and the corresponding information is shown in Table 2 . b PCoA of the RF distance matrix
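As an illustration of the RF distance named in the caption, the following Python sketch counts the clades found in only one of two trees. It is a simplified rooted variant written for this example; the nested-tuple tree encoding and the function names are assumptions, and the software and RF variant used to build the matrix in Fig. 5 a are not specified here. The two example trees encode tribe-level topologies a and d from Fig. 1.

```python
def clades(tree):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def rooted_rf(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


# Tribe-level topologies a and d from Fig. 1, using the figure's abbreviations.
topology_a = ("Myx", ("For", ("Fon", ("Jas", "Ole"))))
topology_d = ("Myx", ("Fon", ("For", ("Jas", "Ole"))))
rf_a_d = rooted_rf(topology_a, topology_d)
```

The two topologies share the (Jasmineae, Oleeae) clade and differ only in whether Fontanesieae or Forsythieae joins it first, so each tree contributes one unique clade to the distance.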

To further evaluate the impact of heterogeneity of sequence evolution across sites on relationships, we used the heterogeneous model, PMSF, and general heterogeneous evolution on a single topology (GHOST), which considers heterogeneity in the amino acid and nucleotide substitution process. The impact of using the GHOST model instead of homogeneous models on the topology was small compared with the data types (Fig.  5 a). Meanwhile, the GHOST and PMSF trees continued to support a large portion of phylogenetic relationships among the deep nodes. The PMSF trees have different topologies (topology a in Fig.  1 ) among the five tribes compared to the trees from site homogeneous models (topology d in Fig.  1 ). Gene partitioned analyses using the two plastid gene datasets (180s77Gnt and 180s77Gaa) also produced fewer effective topologies.

The principal coordinates analysis (PCoA) (Fig.  5 b) showed that all nuclear SNP gene trees were clearly separated from the plastid gene trees along the first and the second axes. The three plastid gene trees were separated along the second axis. Within the datasets, gene trees obtained with different phylogenetic methods are spread across the tree space.

Widespread introgression across the five tribes in Oleaceae

To further assess inherent conflicts between gene trees and species trees across the five tribes in Oleaceae, we estimated the plastid genome tree, individual nuclear gene trees, and a species tree based on the 2608 single-copy orthologous genes among the five species representing the five tribes and the outgroup Origanum vulgare (Fig. 6 a, b). The plastid genome tree showed that Fontanesieae was sister to a clade of Jasmineae and Oleeae, which was inconsistent with the species tree and the nuclear concatenated gene tree, both of which supported Forsythieae, Jasmineae, and Oleeae forming a clade. All branches in the species tree had low major quartet scores (q1), gene concordance factor (gCF), and site concordance factor (sCF) of < 0.5 (Fig. 6 b), and these three branches received almost equal quartet scores for q1, q2, and q3, suggesting that the gene trees yielded random topologies with respect to the species tree, which was also supported by the overlapping gene trees (Fig. 6 c).

figure 6

Phylogeny and tests for gene introgression of five tribes of Oleaceae. a Plastome concatenated tree inferred from a 76-coding gene supermatrix. b ASTRAL species tree and the nuclear concatenated phylogeny inferred from 2608 nuclear genes. Pie charts in the nodes present the proportion of gene trees that support the main topology (red), the first alternative (blue), and the second alternative (green). Gene concordance factor (gCF)/site concordance factor (sCF) values are shown above the branches. ML bootstrap/astral local posterior probabilities are shown below branches. c Cladograms of the coalescent-based species tree (heavy black lines) and 500 gene trees (in green) randomly sampled from 2608 inferred gene trees. d The most common topologies in gene trees, sorted by frequency of occurrence, as shown in brackets. e Comparison of branch length of five tribes. The root-to-tip branch length of each gene tree and each sample were assessed. f Pairwise D per species pair (lower diagonal) and the mean total proportion of introgressed loci per species pair inferred through QuIBL analysis (upper diagonal). 0 values correspond to nonsignificant values. More details were provided in Table S5. g – i Phylogenetic network analysis using PhyloNet. Numerical values next to curved branches indicate inheritance probabilities for each hybrid node. Myx, Myxopyreae; Fon, Fontanesieae; For, Forsythieae; Jas, Jasmineae; Ole, Oleeae

The frequencies of all 105 possible topologies are shown in Additional file 1 : Table S4, and 103 of them appeared in the 2608 gene trees. The frequencies of the eleven most frequent topologies (topo1 to topo11) ranged from 6.02% to 2.57% (Fig. 6 d), indicating significant conflict among the gene trees. Only 6.02% of these gene trees (topo1) were consistent with the species tree, and the plastid genome tree (topo3) was the third most frequent topology, accounting for 4.29%. The second most frequent topology (topo2, accounting for 5.14%) showed that Jasmineae and Oleeae were the first and second divergent groups, respectively, and Forsythieae was sister to a clade of Myxopyreae and Fontanesieae. A one-way analysis of variance test showed that the branch lengths of all gene trees among the five nodes had significant differences ( P  < 0.05), indicating that there was rate variation among the tribes in the nuclear data (Fig. 6 e). The ASTRAL polytomy tests resulted in the same bifurcating species tree for the nuclear gene dataset and rejected the null hypothesis that any branch was a polytomy ( P  < 0.01).
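The figure of 105 possible topologies follows from the standard count of rooted, fully bifurcating trees on n ingroup taxa, (2n − 3)!!. A minimal Python sketch (the function name is hypothetical) reproduces this count for the five tribes, and the same formula gives the 15 possible topologies for four taxa.

```python
def rooted_binary_topologies(n_taxa):
    """Number of rooted, fully bifurcating topologies on n taxa: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


five_tribes = rooted_binary_topologies(5)
four_taxa = rooted_binary_topologies(4)
```

With 103 of the 105 topologies observed and none exceeding about 6% of gene trees, the gene trees are spread almost evenly across tree space.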

To further assess whether the observed gene tree incongruences were mainly due to hybridization/gene flow, we calculated the D -statistic, which uses the ABBA-BABA test for introgression between species. The D -statistic showed that D was significant in all the triplets ( P  < 0.002, Z  > 3; Additional file 1 : Table S5). A mean value of absolute D for a species pair was calculated from all triplets (Fig.  6 f and Additional file 1 : Table S5). The absolute D was significant in most of the pairwise species comparisons (six out of ten pairwise comparisons) and varied from 0.09 to 0.41 (Fig.  6 f). The highest D value was among Forsythieae, Oleeae, and Fontanesieae, which could explain the phylogenetic relationships of topo4, topo7, topo8, and topo11 in which Fontanesieae was sister to Forsythieae or Oleeae. For Oleeae and Jasmineae, D was not significantly different from zero, and Myxopyreae showed little or no gene flow with the other four tribes. Considering the lower support value and the D value of the five tribes, gene flow might have contributed to the observed phylogenetic discordance.
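For readers unfamiliar with the ABBA-BABA test, the sketch below shows how D is commonly computed from site-pattern counts, D = (ABBA − BABA) / (ABBA + BABA), together with a simplified, unweighted delete-one block jackknife for the Z-score. The function names and the per-block counts are hypothetical, and the block scheme is an assumption made for illustration; it is not a description of the software settings used in this study.

```python
import math


def d_statistic(abba, baba):
    """Patterson's D from counts of ABBA and BABA site patterns."""
    total = abba + baba
    if total == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / total


def block_jackknife(blocks):
    """Return (D, standard error, Z) from per-block (ABBA, BABA) counts.

    Uses an unweighted delete-one jackknife, which assumes blocks of
    similar size; this is a simplification for illustration.
    """
    n = len(blocks)
    total_abba = sum(a for a, _ in blocks)
    total_baba = sum(b for _, b in blocks)
    d_all = d_statistic(total_abba, total_baba)
    leave_one_out = [
        d_statistic(total_abba - a, total_baba - b) for a, b in blocks
    ]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    se = math.sqrt(variance)
    z = d_all / se if se > 0 else float("inf")
    return d_all, se, z


# Made-up counts for four genomic blocks, purely for demonstration.
example_blocks = [(12, 8), (15, 9), (10, 7), (14, 10)]
d_value, d_se, d_z = block_jackknife(example_blocks)
```

Under ILS alone the two discordant patterns are expected in equal numbers, so D near zero is consistent with no gene flow, while a D that differs from zero with a large Z-score points to introgression between two of the three ingroup lineages.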

Phylogenetic incongruences can potentially be associated with both ILS and introgression, and the quartet score (QS) values for q1, q2, and q3 were almost equal, indicating a high level of ILS [ 42 ]. We used a recently developed tree-based method, QuIBL [ 19 ], to distinguish these two processes. The QuIBL analysis revealed that most of the triplets showed significant evidence for introgression (26 of 30 triplets, dBIC < − 10, Additional file 1 : Table S6). The mean value of the proportion of trees arising via introgression for a species pair was calculated from all triplets (Additional file 1 : Table S7). We found a strong signal for gene flow among all ten species pairs (Fig. 6 f), suggesting widespread introgression across the ancestral region of the five tribes.

Furthermore, we inferred the phylogenetic networks to visualize gene flow among the five tribes. The PhyloNet analyses identified extremely complicated and statistically significant signals for gene flow across the five tribes (Fig. 6 g–i). When reticulation events were set to 1, 2, and 3, all corresponding optimal networks supported the hybrid origin of the tribe Oleeae ( n  = 46) between tribe Forsythieae and tribe Jasmineae. The tribe Oleeae was connected to Forsythieae by an inheritance probability of 0.76, 0.73, and 0.73, respectively, under the three different reticulation scenarios. In each of the three reticulation events, large portions of the genome were exchanged. The other two reticulations are between the ancestral lineage of Jasmineae/Forsythieae/Oleeae (inheritance probability: 0.35) and Myxopyreae (0.65) and between Forsythieae (0.31) and Myxopyreae (0.69). These reticulation events were all supported by the D -statistic or QuIBL.

Collectively, our results suggested that introgression/hybridization, rather than ILS, was the main factor contributing to the phylogenetic discordance among the five tribes. This is especially evident for Oleeae, whose origin is supported by ancient hybridization and polyploidy, with the ancestral lineages of Jasmineae and Forsythieae as the most likely parents.

Comparison of genome collinearity between Oleeae and two putative parental tribes

To further identify the parentage of tribe Oleeae, we compared the genome collinearity among Oleeae, Jasmineae, and Forsythieae (Fig. 7 ). In the BLAST searches, 20,040 transcript sequences of O. europaea were successfully mapped to the genome of J. sambac , while 34,542 sequences were mapped to the genome of Forsythia suspensa . For transcripts of Fraxinus excelsior , 38,240 sequences were mapped to the genome of J. sambac and 47,590 to that of Forsythia suspensa . The genome synteny comparison of O. europaea and Fraxinus excelsior with their putative parental lineages showed that 173 synteny blocks were found between the genomes of O. europaea and J. sambac , fewer than between O. europaea and Forsythia suspensa (303). The same pattern was found in comparisons between Fraxinus excelsior and the putative parental lineages: 388 synteny blocks with J. sambac and 470 synteny blocks with Forsythia suspensa (Fig. 7 ). Hence, the two gene copies in Oleeae from the putative ancestral lineages (Jasmineae and Forsythieae) showed unequal inheritance. Alternatively, Jasmineae may not be the direct parental lineage.

figure 7

Comparisons of genome synteny of Oleeae with that of Forsythieae and Jasmineae. Two genome synteny plots were generated for Olea europaea and Fraxinus excelsior of Oleeae with Jasminum sambac and Forsythia suspensa , respectively. a Synteny of Olea europaea with the putative parental lineages: 303 synteny blocks were found with Forsythia suspensa , while 173 synteny blocks were found with Jasminum sambac . b Synteny of Fraxinus excelsior with the putative parental lineages: 470 synteny blocks were found with Forsythia suspensa , while 388 synteny blocks were found with Jasminum sambac . The ribbons of the top 5% most similar syntenic blocks are marked in green. c Bar plot of numbers of synteny blocks from different synteny combinations. The numbers in parentheses represent the number of syntenic sequences. For, Forsythia suspensa ; Jas, Jasminum sambac ; Ole, Olea europaea ; Fra, F. excelsior

ILS and introgression as the main sources of phylogenetic discordance of the four subtribes in tribe Oleeae

The plastid genome data, nuclear concatenated gene tree, and species tree based on 1865 single-copy orthologous genes had identical topologies, supporting Schreberinae as the first divergent group, and Ligustrinae forming a clade with Oleinae and Fraxininae. Gene tree concordance factors (QS, gCF, and sCF) showed that the nodes of the clades of Ligustrinae, Fraxininae, and Oleinae were supported by only small fractions, and the QS, gCF, and sCF values were 0.44, 39.57, and 49.29, respectively, whereas the sister group of Fraxininae and Oleinae had higher support values and concordance factors (Fig.  8 a and b).

figure 8

Phylogeny and tests for gene introgression of four subtribes of Oleeae. a Plastome concatenated tree inferred from 76-coding gene supermatrix, ASTRAL species tree and the nuclear concatenated phylogeny inferred from 1865 nuclear genes. Pie charts in the nodes present the proportion of gene trees that support the main topology (red), the first alternative (blue), and the second alternative (green). Gene concordance factor (gCF)/site concordance factor (sCF) values are shown above the branches. ML bootstrapping with chloroplast genes and nuclear genes and astral local posterior probability are shown below branches. b Cladograms of the coalescent-based species tree (heavy black lines) and 500 gene trees (in green) randomly sampled from 1,865 inferred gene trees. c Comparison of branch length of four subtribes. The root-to-tip branch length of each gene tree and each sample were assessed. d The most common topologies in gene trees, sorted by frequency of occurrence, as shown in brackets. e Pairwise D per species pair (lower diagonal) and the mean total proportion of introgressed loci per species pair inferred through QuIBL analysis (upper diagonal). 0 values correspond to nonsignificant values. More details were provided in Table S9. f , g Phylogenetic network analysis using PhyloNet. Numerical values next to curved branches indicate inheritance probabilities for each hybrid node. Lig, Ligustrinae; Sch, Schreberinae; Fra, Fraxininae; Olei, Oleinae

All 15 possible topologies appeared in the 1865 gene trees (Additional file 1 : Table S8), and three topologies were the most frequent (> 15%). A total of 30.03% of these gene trees (topo1) were consistent with the species tree. The second and third most frequent topologies (topo2 and topo3, accounting for 18.28% and 17.80% gene trees, respectively) showed Schreberinae as sister to the Fraxininae–Oleinae clade, and forming a clade with Ligustrinae, respectively (Fig.  8 d). There was significant branch length variation among the four subtribes of Oleeae (Fig.  8 c, one-way analysis of variance test, P  < 0.05), indicating that heterotachous evolution, such as the rate variation of the lineages, was a likely factor affecting tree discordance. The ASTRAL polytomy test results also rejected the null hypothesis that any branch is a polytomy ( P  < 0.01) in the four subtribes.

D -statistics showed little or no gene flow among the four subtribes (Fig. 8 e). Gene flow was only identified between Ligustrinae and Oleinae, as well as between Ligustrinae and Fraxininae, but the D values were much lower than most of those among the five tribes (Additional file 1 : Table S9). QuIBL analysis revealed that only one of the six species pairs showed significant evidence for introgression (Fig. 8 e, and Additional file 1 : Tables S10-S11), suggesting that ILS was the main factor behind gene tree discordance among the four subtribes. PhyloNet analyses supported two reticulation events, between Ligustrinae and the ancestral lineage of Fraxininae and Oleinae, and between Fraxininae and Oleinae (Fig. 8 f and Fig. 8 g). These two reticulation events were also supported by the D -statistic or QuIBL.

In summary, our results revealed that ILS and ancient introgression had both contributed to phylogenetic discordance among the four subtribes of tribe Oleeae. Two introgression events were supported: one between Ligustrinae and the ancestral lineage of Fraxininae and Oleinae and the other between Fraxininae and Oleinae.

Timescale for the Oleaceae tree of life

Using the 91s77G dataset and four calibration priors (Additional file 1 : Table S12), we inferred the divergence times of Oleaceae (Additional file 2 : Fig. S5). The Oleaceae stem node dated back to the Paleocene (62.59 Ma, 95% highest posterior density, HPD: 60.63–64.53 Ma) and the crown node was 60.51 Ma (95% HPD: 56.01–64.07 Ma). From the late Paleocene (60.51 Ma) to the early Eocene (52.47 Ma), an approximately 8 Ma interval, five ancestral lineages corresponding to the tribes became genealogically divergent. The crown ages of Myxopyreae, Forsythieae, Jasmineae, and Oleeae were dated to 29.47 Ma during the early Oligocene, 19.22 Ma during the early Miocene, 37.78 Ma during the late Eocene, and 46.66 Ma during the middle Eocene, respectively. The four subtribes of Oleeae diverged from 46.66 Ma to 39.43 Ma during the middle Eocene, and the crown ages for the four subtribes were 22.51 Ma, 34.06 Ma, 27.69 Ma, and 33.78 Ma, respectively.

Variation in substitution rates among the clades of Oleaceae

Our study clearly suggests faster rates of genome evolution in tribe Jasmineae and some branches of the Oleeae subtribe Ligustrinae than in the other clades of Oleaceae, as evidenced by longer branch lengths and larger genetic distances in Jasmineae and Oleeae subtribe Ligustrinae as well as branch model tests. The branch model test in baseml/PAML, e.g., the M1 model (Table  4 ) shows a 5.5-fold average variation among Jasmineae and the rest of the clades in Oleaceae.

In comparison to previous results, we report here that the lower phylogenetic signal at the deep branches is related to extreme variation in substitution rates in Oleaceae. We sampled representatives of nearly all genera and inferred broad relationships of the tribes and of the subtribes of Oleeae using heterogeneous models (e.g., PMSF, GHOST) and multiple partitioning schemes; however, the deep nodes had low support values and showed conflicts with species trees (Fig. 2 and Additional file 1 : Table S3; see below for more details), suggesting that rate heterogeneity severely obscured plastid relationships [ 43 ].

Variations in substitution rates among different lineages have long been studied in plants [ 44 , 45 , 46 , 47 ]. A hypothesis commonly invoked to explain rate variation is generation time, i.e., nucleotide substitution rates are negatively correlated with generation time. This hypothesis has been supported in plants by comparing the rates of long-lived woody plants and short-lived herbaceous plants [ 44 , 45 ]. Our results also support the generation time hypothesis, as Jasmineae species are woody climbers, shrubs, and herbs, while the remaining Oleaceae species are mostly woody. However, the mechanism behind the influence of generation time on the substitution rate is unclear in plants because, unlike animals, plants do not sequester their germ line, and somatic mutations can be passed down. Lanfear et al. [ 48 ] found a consistently negative relationship between plant height and substitution rate across angiosperms. Differences in the rates of mitosis in the apical meristem can account for the observed differences in rates of molecular evolution among plants of different heights [ 48 ]. Taller, long-lived woody plants would otherwise accumulate more mutations per generation, increasing the chance of deleterious mutations. One way for them to avoid this is to have fewer opportunities for DNA replication errors than short-lived plants [ 49 ].

Species diversification in angiosperms is positively correlated with substitution rates [ 49 , 50 ]. Our results for Oleaceae also support this correlation, as Jasmineae is the most species-rich clade (with approximately 220 species throughout the Old World tropics and warm temperate regions) in comparison with the other major clades in the family [ 27 ].

Approximately 20% of angiosperm species have biparental plastid inheritance [ 51 , 52 ], and plastid genome rearrangement events are associated with this inheritance [ 53 , 54 , 55 , 56 , 57 ]. Jasminum is a group with biparental plastid inheritance, and the plastid genomes of Jasminum and Menodora show several distinctive rearrangements, including inversions, gene duplications, insertions, inverted repeat expansions, and gene and intron losses [ 58 ]. Meanwhile, the substitution rate is correlated with plastid genome rearrangements [ 46 , 59 , 60 ]. A possible explanation is that the biparental inheritance of plastomes influences both substitution rates and plastid genome rearrangements. One scenario is that aberrant DNA repair/recombination/replication (RRR) associated with biparental inheritance is responsible for both the increase in substitution rates and the highly rearranged plastomes [ 59 , 61 ].

Strong discordance among gene trees

The results showed strong discordance of gene trees among different datasets and phylogenomic methods. Exploration of gene tree discordance is fundamental to unravel recalcitrant backbone relationships of Oleaceae, and multiple types (whole plastomes, nuclear SNPs, and multiple nuclear genes) of data were used to tease apart alternative hypotheses concerning the source of gene tree heterogeneity along the backbone phylogeny of Oleaceae.

Although the plastid analyses largely resolved relationships of the olive family, we identified multiple instances of strongly supported conflicts among datasets, sequence types (nucleotide vs. amino acid), and phylogenetic models. In the 19 gene trees based on the plastid datasets, we recovered conflicting or uninformative support at ~33% of nodes (Additional file 2 : Fig. S2). The sources of conflict in plastid genome phylogenies remain poorly understood, although several factors have been shown to be relevant, such as phylogenetic signal, rapid radiation, and rate heterogeneity [ 6 , 62 ]. In Oleaceae, rate heterogeneity among the clades likely explains the conflict at deep-branching nodes, which was reduced by using the amino acid dataset, whereas rapid radiation may explain the conflict at shallow nodes [ 35 , 37 ]. Nevertheless, heteroplasmic recombination deserves consideration in light of the supported conflict [ 6 ].

Our analyses clearly show that the plastid gene tree conflicts with the nuclear SNP gene tree among terminal branches, as well as in some deeper nodes (Fig. 5 a). Cytonuclear discordance is well known in plants and has been traditionally attributed to chloroplast capture. Recently, ILS, organellar introgression, positive selection, branch length, and geography have been invoked to explain much of the widespread cytonuclear discordance in closely related taxa [ 10 , 16 , 63 ]. For the deep nodes, the majority of the incongruences within the olive family can be explained by ancient introgression. For intraspecific or intrageneric relationships, these discordances probably mirror differences in evolutionary processes (e.g., differences in effective population size and different rates of pollen and seed gene flow) [ 22 , 63 ]. Nevertheless, allopolyploidization likely explains a portion of the observed discordance. Several species (e.g., Fraxinus chinensis, subspecies of O. europaea) have been demonstrated to be of recent hybrid origin [ 29 , 64 , 65 ].

Based on the phylogenetic analyses, ancient introgression and ILS were mainly responsible for the phylogenetic discordance observed in the deeper nodes. However, the two processes leave similar phylogenetic signals, and it is difficult to differentiate ancient introgression from ILS [ 66 ], especially for deep divergences such as the earliest dichotomy. Indeed, gene tree discordance caused by ILS is thought to be common when internodes are short owing to rapid diversification [ 5 , 13 , 25 ], and this is often a main factor explaining gene tree discordance at all taxonomic levels. Using the D-statistic, QuIBL, and phylogenetic networks, we attempted to differentiate deep coalescence from post-speciation gene flow at both the tribe level and the subtribe level. The D-statistic showed a signal of introgression in seven possible locations, and QuIBL detected introgression in all possible locations among the five tribes of Oleaceae (Fig. 6 f). The inferred introgression events agreed with the reticulation scenarios from the phylogenetic network analysis (Fig. 6 g–i). The signal of the D-statistic may be lost or distorted when there are multiple or “hidden” reticulations [ 67 ], which may explain why no introgression was detected between Oleeae and Jasmineae with the D-statistic, although it was detected by QuIBL and the phylogenetic network analysis. Our phylogenetic tree also exhibited short internal branches at deep nodes (Fig. 2), and the distribution of gene tree frequencies supports the presence of a polytomous topology (Additional file 1 : Table S4); however, the polytomy test in ASTRAL rejected a polytomous topology for the five tribes. Indeed, ancient introgression, not ILS, is consistent with our findings and with the extensive discordance we identified in our phylogenetic analyses of the five tribes.

The level of post-speciation gene flow inferred with the D -statistic and QuIBL test was very low (Fig.  8 e), and ILS was the main cause of the gene tree discordance within the subtribes of Oleeae. Ancient admixture of ancestral lineages is a powerful means for rapid radiation to occur [ 68 ]. The results of our phylogenetic analyses, QuIBL tests, and phylogenetic networks support that Oleeae is likely to be the result of ancient allopolyploidization and rapid radiation.

Early evolutionary history of Oleaceae

We propose two scenarios for the early diversification of Oleaceae based on the results of this study (Fig. 9). The species tree from the nuclear genes and the gene tree from SNPs supported the relationships among the five tribes of the olive family as (Myxopyreae, (Fontanesieae, (Forsythieae, (Jasmineae, Oleeae)))). Oleaceae originated in the Paleocene, and the first divergence of Myxopyreae from the remaining clades was at c. 60.5 Ma; within approximately 8 Ma, five major lineages corresponding to the five tribes became diversified. During this time, there was frequent reticulate evolution. The basic chromosome number [ 27 , 69 ] and the phylogenomic results [29, this study] support the hypothesis that the tribe Oleeae originated via ancestral allopolyploidization at c. 52.5 Ma. All plastid datasets showed Jasmineae as sister to Oleeae, supporting the ancestral Jasmineae as the maternal parent (left scenario in Fig. 9); however, the phylogenetic network results did not support an inheritance probability of approximately 50% for the potential parents (Jasmineae and Forsythieae), which is also consistent with the low-level gene flow inferred using the D-statistic and QuIBL test (Fig. 6 f–i). Moreover, the genome synteny analyses revealed that both O. europaea and Fraxinus excelsior of tribe Oleeae showed higher genome synteny to tribe Forsythieae (Forsythia suspensa) than to tribe Jasmineae (J. sambac), indicating that the ancestral lineages of Jasmineae may not be the direct ancestors (Fig. 7). We hence propose an alternative scenario in which there was a “ghost lineage” that was sister to Jasmineae, and this extinct “ghost lineage” was the likely maternal parent of the tribe Oleeae. Phylogenetic network analysis strongly supports the ancestral Forsythieae as the paternal parent. The allopolyploid Oleeae experienced a rapid radiation, and the most likely species tree of the four subtribes is (Schreberinae, (Ligustrinae, (Fraxininae, Oleinae))). ILS, together with limited introgression, is the most likely driving force for the divergences of the four subtribes of Oleeae.

Fig. 9 Two alternative models of the evolutionary diversification of Oleaceae. Myx, Myxopyreae; Fon, Fontanesieae; For, Forsythieae; Jas, Jasmineae; Ole, Oleeae; Lig, Ligustrinae; Sch, Schreberinae; Fra, Fraxininae; Olei, Oleinae

In this study, we employed multiple genomic datasets to resolve the phylogenetic relationships, especially the deep nodes of the olive family Oleaceae. Analyses of the whole plastid genome and the nuclear genes provide evidence for extreme heterogeneity of plastid substitution rates among the different clades, and these findings have implications for systematics of the family. Although our phylogenetic results confirm support for monophyly of the family and each of the five tribes and the four subtribes of tribe Oleeae, we have also detected strong conflicts in relationships inferred from the plastid and nuclear SNP datasets, as well as the nuclear gene trees. By evaluating conflicting phylogenetic signals, we have resolved the backbone phylogeny of Oleaceae and have detected ancient introgression and ILS in the deeper nodes. More generally, this study adds valuable genomic data of the economically important olive plant family and explores gene tree discordance in detail, providing a strong case study on exploring the complexity of the plant tree of life in the genomic age.

Taxon sampling, plant material, and the deposition of vouchers

We sampled 179 ingroup samples, including 140 species, and one outgroup (Carlemannia griffithii, Carlemanniaceae), which represents the sister family of Oleaceae. The ingroup included species representing all currently recognized tribes (five), subtribes (four), and genera (24) in Oleaceae (except the genus Dimetra, which includes only one species, Dimetra craibiana) according to the classifications of E Wallander and VA Albert [ 26 ] and PS Green [ 27 ]. Eighty-four samples were obtained in this study (Additional file 1 : Table S1), and 96 samples were from GenBank (Additional file 1 : Table S2).

The 84 samples obtained in this study were mainly collected from the field and from herbarium specimens. All samples were identified based on morphological characters. Leaf material from the field was dried using silica gel, and the voucher specimens were deposited in the herbarium of the Institute of Botany, Chinese Academy of Sciences (PE). The herbarium materials were obtained from PE, and the specimens were selected using two criteria according to the results of Xu et al. [ 70 ]: (1) the collection date of the specimen was as recent as possible and (2) the specimen was from a healthy plant. Every specimen was inspected under a dissecting microscope to ensure that there were no visible fungal infections. All the samples were collected according to the local, national, or international guidelines and legislation.

DNA isolation and sequencing

Leaf material was ground using the mechanical lapping method, and the total DNA was isolated using a modified CTAB protocol (mCTAB) [ 71 ]. DNA concentration was measured with the Qubit 2.0 Fluorometer (Thermo Fisher Scientific), and the length of the DNA fragments was assessed on an agarose gel for a subset of the samples. Samples with a total DNA amount > 1 μg were chosen for Illumina sequencing.

Genome skimming was used to obtain plastid genome data and nuclear SNPs and to identify multiple nuclear genes [ 35 , 72 ]. Total DNA was fragmented by sonication into 350-bp fragments, except for some herbarium materials that had already degraded to less than 350 bp. The DNA was used to construct 350-bp insert libraries, whereas the degraded DNA of herbarium material was used to construct 200-bp insert libraries, using the Nextera XT DNA Library Preparation Kit (Illumina, San Diego, CA, USA). Each sample was paired-end sequenced (150 bp) on the Illumina HiSeq X-ten at Novogene in Tianjin, China. Most samples yielded approximately 5 Gb of 150-bp paired-end reads. The samples used for whole genome sequencing yielded 35 Gb of data.

Plastome assembly and annotation

Raw reads were cleaned and filtered as follows: Illumina adapter artifacts, low-quality reads, and low-quality bases at the read ends were trimmed with Trimmomatic 0.39 (using settings: ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15) [ 73 ]. Two methods were used to assemble the plastomes. First, the whole plastomes were assembled using GetOrganelle [ 74 ] with a range of k-mers of 65, 75, 85, 95, and 105. If GetOrganelle was unsuccessful at assembling a complete plastome, we used the second method to assemble it.

For the second assembly method, clean data from Trimmomatic were assembled de novo into contigs using SPAdes version 3.13.1 [ 75 ]. The plastome contigs were extracted directly by BLAST search from the de novo assembled contigs against Fraxinus excelsior, Jasminum nudiflorum, and Olea europaea plastome reference sequences using custom Python scripts. The extracted contigs were further assembled using Sequencher v5.4.5 (Gene Codes Corporation, Ann Arbor, MI, USA). The gaps between the contigs were filled using clean reads that were mapped to the contigs. The plastomes were further checked by mapping the paired reads to the assembled plastomes and scanned by eye to confirm appropriate mapping using Geneious Prime version 2020.0.5 [ 76 ].

Finished plastomes were annotated using the Perl script Plann [ 77 ], and the missing or incorrect genes were checked in Geneious. The physical maps of the Oleaceae were drawn using OrganellarGenomeDRAW [ 78 ]. Finally, the newly assembled plastomes and the raw Illumina data were deposited in GenBank (Additional file 1 : Table S1).

Nuclear SNP calling

Olofsson et al. [ 35 ] described a reference-based approach to call SNPs using low-depth whole genome sequencing data. This method maps quality-filtered reads onto a reference genome, extracts high-quality SNP positions from uniquely mapped reads while taking differences in sequencing depth between samples into account [ 35 ], and then reconstructs genotypes from the uniquely mapped reads using a series of bioinformatic pipelines. Three whole genomes of Oleaceae were used as the reference genomes for SNP calling. The oleaster (Olea europaea var. sylvestris) [ 79 ] and ash (Fraxinus excelsior) [ 80 ] both belong to tribe Oleeae, and Forsythia suspensa [ 81 ] belongs to tribe Forsythieae.

Raw reads were first subjected to quality control using the NGS QC toolkit version 2.3.3 [ 82 ]. Reads with more than 20% of bases with quality scores below 20 were removed, and low-quality bases (Q < 20) were trimmed from the 3′ end of each read. Quality-controlled reads of all 180 samples were mapped to the three reference genomes using Bowtie 2 [ 83 ], and uniquely mapped reads in proper pairs were identified using SAMtools version 1.3.1 [ 84 ] and Picard tools version 1.92 ( http://broadinstitute.github.io/picard/ ). The high-quality nuclear SNPs were called in SAMtools [ 84 ] using the “mpileup” module. The individual genotypes were merged in BCFtools version 1.3.1 [ 85 ] and filtered in VCFtools version 0.1.14 according to the following criteria: (1) quality value ≥ 20; (2) for each sample, only sites with coverage between 0.5 and 2 times the median coverage were retained; (3) a minor allele count of at least three; and (4) SNPs with ≥ 20 missing genotypes among the 180 samples were removed.
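
The four filtering criteria above can be illustrated with a short Python sketch. It is not the pipeline used in the study, which relied on BCFtools and VCFtools. The record layout (a `qual` value plus one `(genotype, depth)` pair per sample), the function name, and the reading of criterion (2) as masking individual genotypes whose depth falls outside 0.5 to 2 times that sample's median are all assumptions made for the example, and sites are assumed to be biallelic.

```python
from statistics import median

def filter_snps(sites, min_qual=20, min_mac=3, max_missing=20):
    """Apply the four site filters to a list of SNP records.

    Hypothetical layout: each site is {"qual": float, "calls": [(gt, dp), ...]}
    with one (genotype, depth) pair per sample; gt is a tuple of allele
    indices such as (0, 1), or None when the genotype is missing.
    """
    n_samples = len(sites[0]["calls"]) if sites else 0
    # Median depth per sample, computed over called genotypes only.
    med = []
    for s in range(n_samples):
        depths = [site["calls"][s][1] for site in sites
                  if site["calls"][s][0] is not None]
        med.append(median(depths) if depths else 0)

    kept = []
    for site in sites:
        if site["qual"] < min_qual:                      # criterion 1
            continue
        calls = []
        for s, (gt, dp) in enumerate(site["calls"]):
            # criterion 2: mask genotypes outside 0.5-2x the sample median
            if gt is not None and not (0.5 * med[s] <= dp <= 2 * med[s]):
                gt = None
            calls.append(gt)
        counts = {}
        for gt in calls:
            if gt is not None:
                for allele in gt:
                    counts[allele] = counts.get(allele, 0) + 1
        # criterion 3: minor allele count (0 for monomorphic sites)
        mac = min(counts.values()) if len(counts) > 1 else 0
        n_missing = sum(gt is None for gt in calls)      # criterion 4
        if mac >= min_mac and n_missing < max_missing:
            kept.append({"qual": site["qual"], "genotypes": calls})
    return kept
```

With 180 samples, the default `max_missing=20` reproduces criterion (4); the toy inputs used to exercise the function are necessarily much smaller.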

Plastid gene/genome alignment and data matrix construction

Whole plastid genome datasets.

In total, 180 whole plastomes were aligned (excluding one copy of the inverted repeat) using Mauve Version 1.1.1 [ 86 ] to identify potential genome rearrangements such as inversions. The genome rearrangements were adjusted manually according to the gene order of Fraxinus excelsior . The alignment was done using MAFFT version 7.313. As regions of introns and spacers can be difficult to align at high taxonomic levels, we used TrimAl version 1.3 [ 87 ] to explore the effect of inferring phylogenetic relationships based on the four automated trimming methods (Table  1 ).

Protein coding loci

GenBank files were generated in Sequin for all the newly assembled plastomes, and other Oleaceae plastome data were downloaded from GenBank. The coding genes were extracted from the annotated plastomes using a custom Python script. Each gene was aligned with the codon-based alignment model in the MAFFT version 7.313 plugin in PhyloSuite version 1.2.2 [ 88 ]. The ycf1 and ycf2 genes were excluded from the following analyses because of the greater number of indels in the alignment. Alignments were visualized and concatenated in PhyloSuite version 1.2.2. The resulting matrix comprised 77 protein-coding genes, 180 samples, and 55,296 aligned bp.

Three separate protein-coding matrices were analyzed: (1) “180s77Gnt,” the nucleotide sequences of all protein coding loci including all taxa; (2) “180s77Gaa,” the amino acid sequences of all protein coding loci including all taxa; (3) “91s77G,” a reduced sample set from 180s77Gnt with nearly all representative lineages of Oleaceae used for divergence time analyses.

Orthologous nuclear gene identification

Eight species from Oleaceae (one species representing each tribe or subtribe) and Origanum vulgare from Lamiaceae were used to identify orthologous gene families. Four species (Myxopyreae: Myxopyrum hainanense, Fontanesieae: Fontanesia phillyreoides, Jasmineae: Jasminum mesnyi, and Oleeae subtribe Ligustrinae: Syringa pubescens) were subjected to whole genome sequencing, and the sequencing depth was approximately 30X. The raw data of Schrebera swietenioides (Oleeae subtribe Schreberinae) were downloaded from the SRA database (SRR8247314). Three sequenced genomes of Oleaceae, namely Fraxinus excelsior (Oleeae subtribe Fraxininae), Olea europaea (Oleeae subtribe Oleinae), and Forsythia suspensa (Forsythieae), together with that of the outgroup Origanum vulgare (Lamiaceae), were downloaded from published databases.

The raw data were subjected to Trimmomatic 0.39 for quality control and assembled de novo into contigs using SPAdes 3.6.1 [ 75 ]. The completeness of the assembled genomes was estimated by BUSCO 4.0 [ 89 ]. Groups of orthologous sequences were defined using OrthoFinder2 [ 90 ] with the parameter “-S diamond.” Each single-copy orthogroup was aligned via MAFFT version 7 [ 91 ] with the setting “--auto,” and all alignments were further trimmed using TrimAl version 1.2 [ 87 ] with the “automated1” method.

To reveal the evolutionary history of Oleaceae at different levels, two nuclear datasets were constructed at the tribe and subtribe levels. The tribe-level nuclear dataset included five ingroup species (one species representing each tribe, i.e., Myxopyrum hainanense, Fontanesia phillyreoides, Forsythia suspensa, Jasminum mesnyi, and Fraxinus excelsior) and one outgroup species (Origanum vulgare). A total of 2,608 single-copy orthologous genes, each more than 300 bp in length, were identified. The subtribe-level nuclear dataset of tribe Oleeae included four ingroup species (one species representing each subtribe, i.e., Schrebera swietenioides, Syringa pubescens, Fraxinus excelsior, and Olea europaea) and one additional species, Forsythia suspensa. A total of 1,865 single-copy orthologous genes were identified using OrthoFinder2.

Gene tree reconstruction based on plastid and SNP datasets

Gene trees were reconstructed using the maximum likelihood (ML) methods as implemented in the programs RAxML-NG [ 92 ] and IQ-TREE 2 [ 93 ]. RAxML-NG is a from-scratch reimplementation of the established greedy tree search algorithm of RAxML/ExaML, and it offers improved accuracy and speed [ 92 ]. IQ-TREE is a user-friendly and widely used software package for phylogenetic inference using maximum likelihood and supports more evolutionary models.

Each analysis used the best-fit models, which were selected using ModelFinder [ 94 ]. For the datasets 180s77Gnt and 180s77Gaa, we used the following partition schemes: (i) unpartitioned, (ii) partitioned according to results from PartitionFinder 2 [ 95 ] with predefined partitioning by genes, (iii) partitioned by genes, and (iv) partitioned by codons (only in the 180s77Gnt dataset). All partitioning analyses were run in PartitionFinder 2 [ 95 ] using the corrected Akaike information criterion (AICc) for model selection and with branch lengths linked. RAxML-NG [ 92 ] was run for the ML tree with 500 bootstrap replicates. To investigate phylogenetic incongruence within the SNP data, we used a dividing method rather than relying only on concatenation-based ML analyses under the GTR+G model. The SNP-ash dataset was used for this analysis because it included the largest number of SNPs. The SNPs were divided into blocks of 10 kb, and each block was used as a separate data matrix for tree reconstruction.
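
The dividing step can be sketched as follows. The text does not state whether the 10-kb blocks refer to genomic coordinates or to runs of SNPs, so this example assumes non-overlapping 10-kb genomic windows on each scaffold; the input layout and the function name are hypothetical.

```python
from collections import defaultdict

def split_snps_by_window(snps, n_samples, window=10_000):
    """Group SNP columns into non-overlapping genomic windows.

    snps: iterable of (scaffold, position, column), where position is
    1-based and column is a string with one base per sample
    (hypothetical layout). Returns {(scaffold, window_index): [seq, ...]}
    with one concatenated SNP sequence per sample for each window.
    """
    columns = defaultdict(list)
    for scaffold, pos, column in snps:
        if len(column) != n_samples:
            raise ValueError("each SNP column needs one base per sample")
        columns[(scaffold, (pos - 1) // window)].append((pos, column))

    matrices = {}
    for key, cols in columns.items():
        cols.sort()  # order SNPs by position within the window
        # Transpose the columns into one sequence per sample.
        matrices[key] = ["".join(col[i] for _, col in cols)
                         for i in range(n_samples)]
    return matrices
```

Each returned matrix could then be written out as an alignment and passed to a tree-building program; that step is omitted here.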

Many studies have shown that heterotachous evolution, i.e., rate variation across sites and lineages, may mislead phylogenetic inference [ 11 , 96 , 97 ]. The posterior mean site frequency (PMSF) model [ 98 ] and general heterogeneous evolution on a single topology (GHOST) model [ 99 ] were used to reconstruct alternative trees. The PMSF model implemented in IQ-TREE considers mixture classes of rates and substitution models (here, the LG model) across sites as a rapid approximation to the CAT model in PhyloBayes [ 100 ]. The dataset 180s77Gaa was used for PMSF phylogenetic reconstruction because this method only supported the amino acid data. Specifically, we used the LG + C60+G+F model for PMSF phylogenetic reconstruction. PMSF requires a guide tree, which we obtained from RAxML-NG analysis. Nodal support was assessed with 1000 replicates of the ultrafast bootstrapping (UFBoot) method [ 101 ].

GHOST is an edge-unlinked mixture model consisting of several site classes, each having a separate set of model parameters and edge lengths on the same tree topology. All nucleotide datasets were used to infer phylogenetic relationships using this model implemented in IQ-TREE. Branch support values were computed using the UFBoot method.

Comparison of multiple trees

The normalized Robinson-Foulds (RF) distance was used to examine the topological congruence between each pair of gene trees. The RF distance was calculated using IQ-TREE. Principal coordinates analysis (PCoA) based on the RF distance, which calculates the best reduced-space visualization of the distances between trees, was used to assess the clustering pattern of multiple trees. PCoA was performed using R.
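
For readers unfamiliar with the measure, the following Python sketch computes a normalized RF distance between two Newick trees with identical tip sets. It is an illustration only and not the IQ-TREE implementation used in the study. The normalization by 2(n − 3), the maximum RF distance between two fully resolved unrooted trees with n tips, is an assumption, and the minimal parser only handles plain Newick strings without quoted labels.

```python
def _clades(newick):
    """Return (all tips, list of tip sets below each internal node)."""
    s = newick.strip().rstrip(";")
    pos = 0
    clades = []

    def skip_label():
        nonlocal pos
        # Ignore support values and branch lengths that follow a node.
        while pos < len(s) and s[pos] not in ",()":
            pos += 1

    def parse():
        nonlocal pos
        if s[pos] == "(":
            tips = set()
            pos += 1                      # skip "("
            while True:
                tips |= parse()
                if s[pos] == ",":
                    pos += 1
                    continue
                pos += 1                  # skip ")"
                break
            skip_label()
            clades.append(frozenset(tips))
            return tips
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        skip_label()
        return {name}

    return frozenset(parse()), clades

def bipartitions(newick):
    """Non-trivial splits of the unrooted tree, each stored as the side
    that does not contain a fixed reference tip."""
    tips, clades = _clades(newick)
    ref = min(tips)
    splits = set()
    for clade in clades:
        side = clade if ref not in clade else tips - clade
        if 2 <= len(side) <= len(tips) - 2:
            splits.add(side)
    return tips, splits

def normalized_rf(newick_a, newick_b):
    tips_a, splits_a = bipartitions(newick_a)
    tips_b, splits_b = bipartitions(newick_b)
    if tips_a != tips_b:
        raise ValueError("trees must share the same tip set")
    rf = len(splits_a ^ splits_b)         # splits found in only one tree
    max_rf = 2 * (len(tips_a) - 3)        # both trees fully resolved
    return rf / max_rf if max_rf else 0.0
```

A matrix of such pairwise distances is the kind of input a PCoA takes; the ordination itself is not shown.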

Concordance among the trees generated from the plastid datasets and SNP datasets was analyzed using PhyParts [ 102 ] and visualized using PhyParts_PieCharts ( https://github.com/mossmatters/MJPythonNotebooks ; last accessed August 13, 2021). Both internode certainty all (ICA) values and conflicting/concordant bipartitions were calculated. For these analyses, nodes with branch support values below 80% were regarded as uninformative for the corresponding node of the reference tree.

Assessment of discordance between gene trees and the species tree

For the nuclear single-copy orthologs, we used RAxML-NG to infer the best ML trees from unpartitioned alignments for each locus using a GTR + G substitution model, and the branch support value was computed with 200 bootstrap replicates.

Species trees were reconstructed by summarizing gene trees using ASTRAL-III [ 42 ]. Local posterior probabilities (LPPs) were calculated for branch support [ 103 ]. We further used the quartet scores (QS), gene concordance factor (gCF), and site concordance factor (sCF) to measure the amount of gene tree conflict around each branch of the species tree. The QS was calculated in ASTRAL to examine the number of gene tree quartets supporting the primary (q1), second (q2), and third (q3) alternative topologies. gCF and sCF represent the percentage of decisive gene trees and sites supporting a branch in the reference trees [ 104 ], respectively. gCF and sCF were computed in IQ-TREE.

To further visualize conflict, we randomly sampled 500 gene trees and built a density tree using the Toytree Python toolkit ( https://github.com/eaton-lab/toytree ; last accessed August 13, 2021). All gene trees were converted to ultrametric trees in TreePL [ 105 ].

We also used topological weighting to reduce the complexity of the six-taxon phylogeny of Oleaceae and the five-taxon phylogeny of tribe Oleeae. Ignoring branch lengths, there are 105 and 15 possible topologies for binary trees of six and five terminal branches, respectively, when one terminal branch is used to root the tree. We calculated the frequency of the alternative topologies using the Python script ( twisst.py ; https://github.com/simonhmartin/twisst ; last accessed August 13, 2021).
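
The counts of 105 and 15 match the standard formula for the number of rooted binary topologies on n tips, (2n − 3)!!, with n = 5 and n = 4. Reading these as the number of ingroup tips once one taxon is fixed as the root is an interpretation, not something the text states. A short check, with a hypothetical function name:

```python
def n_rooted_topologies(n_tips):
    """Number of rooted binary tree topologies on n_tips labelled tips,
    i.e. the double factorial (2n - 3)!! = 3 * 5 * ... * (2n - 3)."""
    if n_tips < 1:
        raise ValueError("need at least one tip")
    count = 1
    for k in range(3, 2 * n_tips - 2, 2):
        count *= k
    return count

# Six-taxon dataset: five tips remain once one taxon roots the tree.
print(n_rooted_topologies(5))
# Five-taxon dataset: four tips remain.
print(n_rooted_topologies(4))
```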

D-statistic

We analyzed the D-statistic in the form D = (nABBA − nBABA)/(nABBA + nBABA) in a rooted tree (((P1, P2), P3), O) to assess whether species P1 or P2 had gene flow with P3. The null hypothesis of no gene flow between the species is rejected when the D-statistic significantly deviates from 0 [ 106 , 107 ]. We used a threshold of Z > 3 to reject the null hypothesis, which corresponds to P < 0.002. In the outcome of the D-statistic analysis, P2 and P3 were inferred to have gene flow if the Z-score was > 3 and D was > 0, and P1 and P3 were inferred to have gene flow if the Z-score was > 3 and D was < 0. All possible combinations of the four-taxon topology were subjected to the D-statistic analyses using the evobiR package in R ( https://github.com/coleoguy/evobir ; last accessed August 13, 2021).
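
The formula above can be made concrete with a minimal Python sketch that counts ABBA and BABA site patterns in four aligned sequences and returns D. It is an illustration, not the evobiR code used in the study; the function names are hypothetical, and only biallelic sites without gaps or ambiguity codes are counted, which is an assumption. Significance testing (the Z-score) is not shown.

```python
def abba_baba_counts(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns for the tree (((P1, P2), P3), O)."""
    abba = baba = 0
    for a, b, c, o in zip(p1.upper(), p2.upper(), p3.upper(), outgroup.upper()):
        bases = {a, b, c, o}
        if len(bases) != 2 or not bases <= set("ACGT"):
            continue                 # keep biallelic, unambiguous sites only
        if a == o and b == c:        # P2 and P3 share the derived allele
            abba += 1
        elif b == o and a == c:      # P1 and P3 share the derived allele
            baba += 1
    return abba, baba

def d_statistic(n_abba, n_baba):
    """D = (nABBA - nBABA) / (nABBA + nBABA)."""
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites")
    return (n_abba - n_baba) / total
```

A positive D indicates an excess of ABBA sites (allele sharing between P2 and P3), and a negative D an excess of BABA sites (P1 and P3), in line with the decision rule given above.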

QuIBL is based on the analysis of branch length distributions across gene trees to infer putative introgression patterns, which can be used to test hypotheses of whether phylogenetic discordance between all possible triplets is explained by ILS alone or by a combination of ILS and gene flow [ 19 ]. QuIBL uses the distribution of internal branch lengths and calculates the likelihood that the discordant gene tree is due to introgression rather than ILS. The Bayesian information criterion (BIC) was used to test whether the gene trees discordant from the species tree were more similar to introgression or ILS. We used a stringent cutoff of dBIC < − 10 to accept the ILS + introgression model, as suggested by the author [ 19 ]. The single-copy orthologous genes were used for QuIBL analyses.
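
The model comparison behind the dBIC cutoff can be sketched in Python. This is not QuIBL itself: the log-likelihoods and parameter counts in any call are hypothetical inputs, and the sign convention (dBIC is the BIC of the ILS + introgression model minus the BIC of the ILS-only model, so that strongly negative values favor introgression) is an assumption consistent with the cutoff stated above.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * lnL."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def compare_ils_vs_introgression(lnl_ils, lnl_mix, n_obs,
                                 k_ils=1, k_mix=3, cutoff=-10.0):
    """Return (dBIC, accept_introgression).

    lnl_ils: log-likelihood of the ILS-only model (hypothetical input)
    lnl_mix: log-likelihood of the ILS + introgression model
    k_ils, k_mix: free-parameter counts; the defaults are placeholders,
    not values taken from the study.
    """
    d_bic = bic(lnl_mix, k_mix, n_obs) - bic(lnl_ils, k_ils, n_obs)
    return d_bic, d_bic < cutoff
```

Because BIC penalizes the extra parameters of the mixture model, a small gain in likelihood is not enough to cross the cutoff.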

Species network analysis

We inferred a species network to assess the effect of gene tree conflicts due to hybridization. The species network was inferred from the gene trees of the single-copy orthologous genes using the maximum pseudolikelihood method InferNetwork_MPL included in the package PhyloNet [ 108 ]. We carried out three network searches allowing one to three reticulations and performed 10 independent searches for each reticulation setting to avoid local optima. The optimal networks were displayed in Dendroscope 3 [ 109 ].

Polytomy test

To test whether the gene tree discordance could be explained by polytomies instead of bifurcating nodes, quartet-based polytomy tests were carried out in ASTRAL-III following Sayyari and Mirarab [ 110 ]. Quartet frequencies for all branches were inferred using the gene trees to determine the presence of polytomies, where P < 0.05 was considered to reject the null hypothesis of a polytomy. The analysis was run a second time after collapsing branches with < 50% bootstrap support to minimize the effect of gene tree error.

Genome synteny analysis

We downloaded four genomes: Forsythia suspensa (Accession Number: GCA_020510225.1) of tribe Forsythieae [ 111 ], Jasminum sambac (Accession Number: GCA_018223645.1) of tribe Jasmineae [ 112 ], and Olea europaea (Accession Number: GCA_002742605) and Fraxinus excelsior (Accession Number: GCA_019097785) of tribe Oleeae [ 79 , 113 ]. Transcripts of O. europaea and F. excelsior were downloaded as well. We first ran BLAST searches of the transcripts of O. europaea against the genomes of F. suspensa and J. sambac, respectively. We used whole transcripts of O. europaea and F. excelsior separately as queries for BLAST matches, and the maximum e-value was set to 1e-5 during the analysis. When one query matched multiple locations, we retained the match with the highest hit score and removed the rest to ensure that each query matched only one position on the genome.
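
The best-hit filtering described above can be sketched in Python. The searches in the study were run in Geneious Prime, so the input format here is an assumption: the example expects the standard 12-column BLAST tabular output (query, subject, identity, length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score), and the function name is hypothetical.

```python
def best_hits(blast_lines, max_evalue=1e-5):
    """Keep the single highest-scoring hit per query.

    blast_lines: iterable of tab-separated lines in the 12-column BLAST
    tabular layout (an assumed input format). Hits with an e-value above
    max_evalue are discarded before the best hit is chosen.
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        s_start, s_end = int(fields[8]), int(fields[9])
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = {"subject": subject, "start": s_start,
                           "end": s_end, "evalue": evalue,
                           "bitscore": bitscore}
    return best
```

The resulting one-to-one table of query and genome positions is the kind of anchor list that a synteny plot is drawn from.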

We compared genome synteny among O. europaea, J. sambac, and F. suspensa based on the results from the BLAST search. Genome synteny between F. excelsior and the putative parental lineages was analyzed with the same method. Local BLAST database construction and BLAST searches were run in Geneious Prime [ 76 ], while genome synteny plots were constructed following the MCscan pipeline from Tang et al. [ 114 ].

Time calibration of the phylogeny

We used BEAST v2.5.1 [ 115 ] to estimate the divergence times of Oleaceae using the 91s77G dataset. Four calibration priors were utilized in this study (Additional file 1 : Table S12). According to the results of Zhang et al. [ 4 ], the mean time to the most recent common ancestor (TMRCA) of the Oleaceae and Carlemanniaceae (the root of the tree) was 62.23 Ma. The samaras of Fraxinus wilcoxiana Berry were described from the Middle Eocene Claiborne Formation of western Tennessee, USA [ 116 ]. Following Besnard et al. [ 39 ] and Hong-Wa and Besnard [ 33 ], we implemented this age as a lower bound of the TMRCA of subtribe Fraxininae and subtribe Oleinae. These fossil priors were given a lognormal distribution with offset values of 40 Ma and a standard deviation of 3 Ma. Fossils of Olea subgenus Olea occurred before 23 Ma [ 117 , 118 , 119 ] and were used to calibrate the crown of Olea subgenus Olea at > 23 Ma. Pollen of Fraxinus praedicta Heer from the upper Miocene in Europe (12 Ma), representing the extant taxon Fraxinus angustifolia, was used to set the minimum age for the living European ashes (set to the crown of F. angustifolia and F. excelsior) [ 117 ]. For these two priors, we used lognormal distributions with offset values of 23 and 12 Ma, respectively, a mean of 1 Ma, and a standard deviation of 0.5 Ma, allowing for the possibility that these nodes are considerably older than the fossils themselves.

We ran analyses with the GTR + G site model, a relaxed lognormal clock to account for rate variability among lineages, a Yule speciation tree model, and 500,000,000 MCMC generations. The sampling frequency was every 50,000 generations, and the parameters were checked using Tracer 1.6 [ 120 ] to evaluate convergence and to ensure effective sample sizes (ESS) exceeding 200. A maximum clade credibility tree was computed after discarding 10% of the saved trees as burn-in using TreeAnnotator v2.4.7.
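
As a quick check of what these settings imply, the number of saved and retained trees follows from simple arithmetic. The variable names are illustrative, and the initial state of the chain is ignored.

```python
chain_length = 500_000_000   # MCMC generations, from the text
sample_every = 50_000        # sampling frequency, from the text
burnin_percent = 10          # share of saved trees discarded as burn-in

n_saved = chain_length // sample_every        # trees written to the log
n_burnin = n_saved * burnin_percent // 100    # trees discarded
n_kept = n_saved - n_burnin                   # trees summarized in the MCC tree

print(n_saved, n_burnin, n_kept)
```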

Plastid substitution rate analyses and inference of rate changes

To assess variation in substitution rates among clades within Oleaceae, branch lengths from the root node to the tip of each sample were calculated on the ML tree of the 180s77gnt dataset inferred under the gene partition model. Branch lengths were calculated using the Toytree Python toolkit. Genetic p-distances between Carlemannia griffithii (the outgroup species) and the Oleaceae samples were calculated using MEGA 7.0 [ 121 ]. t tests were performed in R to test for differences in branch lengths and genetic distances among clades.
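The two quantities compared here can be sketched in a few lines of Python. This is an illustration only: the study used MEGA for p-distances and R for the t tests, the function names are hypothetical, the handling of gaps and ambiguous bases (sites skipped pairwise) is an assumption, and the branch-length values are made up for the example. Welch's unequal-variance form of the t statistic is used here as one common choice; the text does not say which variant was applied.

```python
import math
from statistics import mean, variance


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Assumption: sites where either sequence has a gap or an ambiguous base
    are skipped (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se


# Hypothetical root-to-tip branch lengths (substitutions per site) for two clades.
clade_fast = [0.061, 0.058, 0.064, 0.060]
clade_slow = [0.021, 0.019, 0.023, 0.020]
print(p_distance("ACGTACGT", "ACGTTCGA"), welch_t(clade_fast, clade_slow))
```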

We used the baseml module of PAML v.4.8 [ 122 ] to test the null hypothesis that Oleaceae evolve under a “Global Clock” (all rates equal among the clades/branches). Different “branch models” were tested, allowing rates to vary in prespecified regions of the tree corresponding to clades, as opposed to a “background” rate. Four models were used to test for different rates among the clades (tribes or subtribes) of Oleaceae. Model M0 specified a global clock for all Oleaceae; Model M1 allowed Jasmineae to evolve under a local clock; Model M2 allowed local clocks for Jasmineae and Oleeae subtribe Ligustrinae; and Model M3 allowed the four clades of Jasmineae, Oleeae subtribe Ligustrinae, Oleeae, and Forsythieae to have independent local clocks. To evaluate significant differences in model fit, we used likelihood ratio tests and corrected Akaike information criterion comparisons following the method of Barrett et al. [ 123 ].
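The model comparison described above rests on two standard formulas: the likelihood ratio statistic 2(lnL1 − lnL0), referred to a chi-square distribution, and AICc = 2k − 2lnL + 2k(k + 1)/(n − k − 1). The Python sketch below shows both under stated assumptions. The log-likelihoods, parameter counts, and sample size are hypothetical placeholders rather than values from the study, and the choice of n (for example, alignment length) is left to the reader, since the text does not specify it.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    starting from Q(1/2, z) = erfc(sqrt(z)) or Q(1, z) = exp(-z), with z = x / 2.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) for nested models differing by df parameters."""
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return stat, chi2_sf(stat, df)


def aicc(lnl, k, n):
    """Corrected Akaike information criterion for k parameters and sample size n."""
    return 2.0 * k - 2.0 * lnl + 2.0 * k * (k + 1) / (n - k - 1)


# Hypothetical comparison: global clock (null) versus one extra local-clock rate.
stat, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-990.0, df=1)
print(stat, p_value, aicc(-1000.0, 10, 5000), aicc(-990.0, 11, 5000))
```

The model with the lower AICc is preferred; the likelihood ratio test applies only to nested pairs, such as a global clock against a model that adds local-clock rates to it.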

Availability of data and materials

Illumina sequence reads generated in this study have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession numbers PRJNA820313 [ 124 ] and PRJNA704245 [ 125 ]. The samples and voucher specimens used in this study are deposited at the PE herbarium. Information on the samples can be found in Additional file 1 : Table S1.

Goremykin VV, Nikiforova SV, Cavalieri D, Pindo M, Lockhart P. The root of flowering plants and total evidence. Syst Biol. 2015;64(5):879–91.

Albert VA, Barbazuk WB, Depamphilis CW, Der JP, Leebens-Mack J, Ma H, et al. The Amborella genome and the evolution of flowering plants. Science. 2013;342(6165):1241089.

Morgan CC, Foster PG, Webb AE, Pisani D, McInerney JO, O'Connell MJ. Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013;30(9):2145–56.

Zhang C, Zhang T, Luebert F, Xiang Y, Huang C-H, Hu Y, et al. Asterid phylogenomics/phylotranscriptomics uncover morphological evolutionary histories and support phylogenetic placement for numerous whole-genome duplications. Mol Biol Evol. 2020;37(11):3188–210.

Koenen EJM, Ojeda DI, Steeves R, Migliore J, Bakker FT, Wieringa JJ, et al. Large-scale genomic sequence data resolve the deepest divergences in the legume phylogeny and support a near-simultaneous evolutionary origin of all six subfamilies. New Phytol. 2020;225(3):1355–69.

Zhang R, Wang YH, Jin JJ, Stull GW, Bruneau A, Cardoso D, et al. Exploration of plastid phylogenomic conflict yields new insights into the deep relationships of Leguminosae. Syst Biol. 2020;69(4):613–22.

Ma Z-Y, Nie Z-L, Ren C, Liu X-Q, Zimmer EA, Wen J. Phylogenomic relationships and character evolution of the grape family (Vitaceae). Mol Phylogenet Evol. 2021;154:106948.

Watson LE, Siniscalchi CM, Mandel J. Phylogenomics of the hyperdiverse daisy tribes: Anthemideae, Astereae, Calenduleae, Gnaphalieae, and Senecioneae. J Syst Evol. 2020;58(6):841–52.

Feng C, Wang J, Harris AJ, Folta KM, Zhao M, Kang M. Tracing the diploid ancestry of the cultivated octoploid strawberry. Mol Biol Evol. 2021;38(2):478–85.

Lee-Yaw JA, Grassa CJ, Joly S, Andrew RL, Rieseberg LH. An evaluation of alternative explanations for widespread cytonuclear discordance in annual sunflowers (Helianthus). New Phytol. 2019;221(1):515–26.

Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet. 2020;21(7):428–44.

Mendes FK, Hahn MW. Gene tree discordance causes apparent substitution rate variation. Syst Biol. 2016;65(4):711-21.

Cai L, Xi Z, Lemmon EM, Lemmon AR, Mast A, Buddenhagen CE, et al. The perfect storm: gene tree estimation error, incomplete lineage sorting, and ancient gene flow explain the most recalcitrant ancient angiosperm clade, Malpighiales. Syst Biol. 2021;70(3):491–507.

Degnan JH, Rosenberg NA. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol. 2009;24(6):332–40.

Philippe H, Roure B. Difficult phylogenetic questions: more data, maybe; better methods, certainly. BMC Biol. 2011;9:91.

Hodel RGJ, Zimmer E, Wen J. A phylogenomic approach resolves the backbone of Prunus (Rosaceae) and identifies signals of hybridization and allopolyploidy. Mol Phylogenet Evol. 2021;160:107118.

Dong W, Liu Y, Li E, Xu C, Sun J, Li W, et al. Phylogenomics and biogeography of Catalpa (Bignoniaceae) reveal incomplete lineage sorting and three dispersal events. Mol Phylogenet Evol. 2022;166:107330.

Blischak PD, Chifman J, Wolfe AD, Kubatko LS. HyDe: a Python package for genome-scale hybridization detection. Syst Biol. 2018;67(5):821–9.

Edelman NB, Frandsen PB, Miyagi M, Clavijo B, Davey J, Dikow RB, et al. Genomic architecture and introgression shape a butterfly radiation. Science. 2019;366(6465):594.

Solís-Lemus C, Ané C. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 2016;12(3):e1005896.

Wang G, Zhang X, Herre EA, McKey D, Machado CA, Yu W-B, et al. Genomic evidence of prevalent hybridization throughout the evolutionary history of the fig-wasp pollination mutualism. Nat Commun. 2021;12(1):718.

Rose JP, Toledo CAP, Lemmon EM, Lemmon AR, Sytsma KJ. Out of sight, out of mind: Widespread nuclear and plastid-nuclear discordance in the flowering plant genus Polemonium (Polemoniaceae) suggests widespread historical gene flow despite limited nuclear signal. Syst Biol. 2021;70(1):162–80.

Wang K, Lenstra JA, Liu L, Hu Q, Ma T, Qiu Q, et al. Incomplete lineage sorting rather than hybridization explains the inconsistent phylogeny of the wisent. Commun Biol. 2018;1(1):169.

Zhu Q, Mai U, Pfeiffer W, Janssen S, Asnicar F, Sanders JG, et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat Commun. 2019;10(1):5477.

Morales-Briones DF, Kadereit G, Tefarikis DT, Moore MJ, Smith SA, Brockington SF, et al. Disentangling sources of gene tree discordance in phylogenomic data sets: testing ancient hybridizations in Amaranthaceae s.l. Syst Biol. 2021;70(2):219–35.

Wallander E, Albert VA. Phylogeny and classification of Oleaceae based on rps16 and trnL-F sequence data. Am J Bot. 2000;87(12):1827–41.

Green PS: Oleaceae. In: Flowering Plants · Dicotyledons: Lamiales (except Acanthaceae including Avicenniaceae). Edited by Kadereit JW. Berlin, Heidelberg: Springer Berlin Heidelberg; 2004: 296-306.

Xia Z, Wen J, Gao Z. Does the enigmatic Wightia belong to Paulowniaceae (Lamiales)? Front Plant Sci. 2019;10:528.

Julca I, Marcet-Houben M, Vargas P, Gabaldón T. Phylogenomics of the olive tree ( Olea europaea ) reveals the relative contribution of ancient allo- and autopolyploidization events. BMC Biol. 2018;16(1):15.

Yuan W-J, Zhang W-R, Han Y-J, Dong M-F, Shang F-D. Molecular phylogeny of Osmanthus (Oleaceae) based on non-coding chloroplast and nuclear ribosomal internal transcribed spacer regions. J Syst Evol. 2010;48(6):482–9.

Guo S-Q, Xiong M, Ji C-F, Zhang Z-R, Li D-Z, Zhang Z-Y. Molecular phylogenetic reconstruction of Osmanthus Lour. (Oleaceae) and related genera based on three chloroplast intergenic spacers. Plant Syst Evol. 2011;294(1):57–64.

Besnard G, Green PS, Bervillé A. The genus Olea : molecular approaches of its structure and relationships to other Oleaceae. Acta Botanica Gallica. 2002;149(1):49–66.

Hong-Wa C, Besnard G. Intricate patterns of phylogenetic relationships in the olive family as inferred from multi-locus plastid and nuclear DNA sequence analyses: a close-up on Chionanthus and Noronhia (Oleaceae). Mol Phylogenet Evol. 2013;67(2):367–78.

Hong-Wa C, Besnard G. Species limits and diversification in the Madagascar olive ( Noronhia , Oleaceae). Bot J Linn Soc. 2014;174(1):141–61.

Olofsson JK, Cantera I, Van de Paer C, Hong-Wa C, Zedane L, Dunning LT, et al. Phylogenomics using low-depth whole genome sequencing: a case study with the olive tribe. Mol Ecol Resour. 2019;19(4):877–92.

Dupin J, Raimondeau P, Hong-Wa C, Manzi S, Gaudeul M, Besnard G. Resolving the phylogeny of the olive family (Oleaceae): Confronting information from organellar and nuclear genomes. Genes. 2020;11(12):1508.

Dong W, Sun J, Liu Y, Xu C, Wang Y, Suo Z, Zhou S, Zhang Z, Wen J. Phylogenomic relationships and species identification of the olive genus Olea (Oleaceae). J Syst Evol. 2021. https://doi.org/10.1111/jse.12802 .

Li J, Alexander JH, Zhang D. Paraphyletic Syringa (Oleaceae): evidence from sequences of nuclear ribosomal DNA ITS and ETS regions. Syst Bot. 2002;27(3):592–7.

Besnard G, Rubio de Casas R, Christin P-A, Vargas P. Phylogenetics of Olea (Oleaceae) based on plastid and nuclear ribosomal DNA sequences: tertiary climatic shifts and lineage differentiation times. Ann Bot. 2009;104(1):143–60.

Ha Y-H, Kim C, Choi K, Kim J-H. Molecular phylogeny and dating of Forsythieae (Oleaceae) provide insight into the Miocene history of Eurasian temperate shrubs. Front Plant Sci. 2018;9:99.

Van de Paer C, Bouchez O, Besnard G. Prospects on the evolutionary mitogenomics of plants: a case study on the olive family (Oleaceae). Mol Ecol Resour. 2018;18(3):407–23.

Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics. 2018;19(Suppl 6):153.

Zhong B, Deusch O, Goremykin VV, Penny D, Biggs PJ, Atherton RA, et al. Systematic error in seed plant phylogenomics. Genome Biol Evol. 2011;3:1340–8.

Smith SA, Donoghue MJ. Rates of molecular evolution are linked to life history in flowering plants. Science. 2008;322(5898):86–9.

Amanda R, Li Z, Van de Peer Y, Ingvarsson PK. Contrasting rates of molecular evolution and patterns of selection among gymnosperms and flowering plants. Mol Biol Evol. 2017;34(6):1363–77.

Schwarz EN, Ruhlman TA, Weng M-L, Khiyami MA, Sabir JSM, Hajarah NH, et al. Plastome-wide nucleotide substitution rates reveal accelerated rates in Papilionoideae and correlations with genome features across legume subfamilies. J Mol Evol. 2017;84:187–203.

Choi K, Weng M-L, Ruhlman TA, Jansen RK. Extensive variation in nucleotide substitution rate and gene/intron loss in mitochondrial genomes of Pelargonium . Mol Phylogenet Evol. 2021;155:106986.

Lanfear R, Ho SYW, Jonathan Davies T, Moles AT, Aarssen L, Swenson NG, et al. Taller plants have lower rates of molecular evolution. Nat Commun. 2013;4(1):1879.

Bromham L, Hua X, Lanfear R, Cowman PF. Exploring the relationships between mutation rates, life history, genome size, environment, and species richness in flowering plants. Am. Nat. 2015;185(4):507–24.

Barraclough TG, Savolainen V. Evolutionary rates and species diversity in flowering plants. Evolution. 2001;55(4):677–83.

Corriveau JL, Coleman AW. Rapid screening method to detect potential biparental inheritance of plastid DNA and results for over 200 angiosperm species. Am J Bot. 1988;75(10):1443–58.

Zhang Q, Liu Y, Sodmergen. Examination of the cytoplasmic DNA in male reproductive cells to determine the potential for cytoplasmic inheritance in 295 angiosperm species. Plant Cell Physiol. 2003;44(9):941–51.

Wicke S, Schaferhoff B, Depamphilis CW, Muller KF. Disproportional plastome-wide increase of substitution rates and relaxed purifying selection in genes of Carnivorous Lentibulariaceae. Mol Biol Evol. 2014;31(3):529-45.

Sabir J, Schwarz E, Ellison N, Zhang J, Baeshen NA, Mutwakil M, et al. Evolutionary and biotechnology implications of plastid genome variation in the inverted-repeat-lacking clade of legumes. Plant Biotechnol J. 2014;12(6):743–54.

Nevill PG, Howell KA, Cross AT, Williams AV, Zhong X, Tonti-Filippini J, et al. Plastome-wide rearrangements and gene losses in Carnivorous Droseraceae. Genome Biol Evol. 2019;11(2):472–85.

Rabah SO, Shrestha B, Hajrah NH, Sabir MJ, Alharby HF, Sabir MJ, et al. Passiflora plastome sequencing reveals widespread genomic rearrangements. J Syst Evol. 2019;57(1):1–14.

Shrestha B, Weng M-L, Theriot EC, Gilbert LE, Ruhlman TA, Krosnick SE, et al. Highly accelerated rates of genomic rearrangements and nucleotide substitutions in plastid genomes of Passiflora subgenus Decaloba. Mol Phylogenet Evol. 2019;138:53–64.

Lee H-L, Jansen RK, Chumley TW, Kim K-J. Gene relocations within chloroplast genomes of Jasminum and Menodora (Oleaceae) are due to multiple, overlapping inversions. Mol Biol Evol. 2007;24(5):1161–80.

Guisinger MM, Kuehl JNV, Boore JL, Jansen RK. Genome-wide analyses of Geraniaceae plastid DNA reveal unprecedented patterns of increased nucleotide substitutions. Proc Nat Acad Sci USA. 2008;105(47):18424–9.

Weng M-L, Blazier JC, Govindu M, Jansen RK. Reconstruction of the ancestral plastid genome in geraniaceae reveals a correlation between genome rearrangements, repeats, and nucleotide substitution rates. Mol Biol Evol. 2014;31(3):645–59.

Barnard-Kubow KB, Sloan DB, Galloway LF. Correlation between sequence divergence and polymorphism reveals similar evolutionary mechanisms acting across multiple timescales in a rapidly evolving plastid genome. BMC Evol Biol. 2014;14(1):268.

Dong W, Xu C, Wu P, Cheng T, Yu J, Zhou S, et al. Resolving the systematic positions of enigmatic taxa: manipulating the chloroplast genome data of Saxifragales. Mol Phylogenet Evol. 2018;126:321–30.

Xu L-L, Yu R-M, Lin X-R, Zhang B-W, Li N, Lin K, Zhang D-Y, Bai W-N: Different rates of pollen and seed gene flow cause branch-length and geographic cytonuclear discordance within Asian butternuts. New Phytol 2021; n/a(n/a).

Besnard G, Rubio de Casas R, Vargas P: Plastid and nuclear DNA polymorphism reveals historical processes of isolation and reticulation in the olive tree complex ( Olea europaea ). J Biogeogr 2007, 34(4):736-752.

Wright JW. New chromosome counts in Acer and Fraxinus . Morris Arboretum Bull. 1957;8:33–4.

Meleshko O, Martin MD, Korneliussen TS, Schröck C, Lamkowski P, Schmutz J, Healey A, Piatkowski BT, Shaw AJ, Weston DJ. Extensive genome-wide phylogenetic discordance is due to incomplete lineage sorting and not ongoing introgression in a rapidly radiated bryophyte genus. Mol Biol Evol. 2021;38(7):2750–66.

Leo Elworth RA, Allen C, Benedict T, Dulworth P, Nakhleh L. DGEN: a test statistic for detection of general introgression scenarios. bioRxiv. 2018:348649.

Marques DA, Meier JI, Seehausen O. A combinatorial view on speciation and adaptive radiation. Trends Ecol Evol. 2019;34(6):531–44.

Taylor H. Cyto-taxonomy and phylogeny of the Oleaceae. Brittonia. 1945;5(4):337–67.

Xu C, Dong W, Shi S, Cheng T, Li C, Liu Y, et al. Accelerating plant DNA barcode reference library construction using herbarium specimens: improved experimental techniques. Mol Ecol Resour. 2015;15(6):1366–74.

Li J, Wang S, Jing Y, Wang L, Zhou S. A modified CTAB protocol for plant DNA extraction. Chin Bull Bot. 2013;48(1):72–8.

Dong W, Liu Y, Xu C, Gao Y, Yuan Q, Suo Z, et al. Chloroplast phylogenomic insights into the evolution of Distylium (Hamamelidaceae). BMC Genomics. 2021;22(1):293.

Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.

Jin J-J, Yu W-B, Yang J-B, Song Y, de Pamphilis CW, Yi T-S, et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020;21(1):241.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.

Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28(12):1647–9.

Huang DI, Cronk QCB. Plann: a command-line application for annotating plastome sequences. Appl Plant Sci. 2015;3(8):1500026.

Greiner S, Lehwark P, Bock R. OrganellarGenomeDRAW (OGDRAW) version 1.3.1: expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Res. 2019;47(W1):W59–64.

Unver T, Wu Z, Sterck L, Turktas M, Lohaus R, Li Z, et al. Genome of wild olive and the evolution of oil biosynthesis. Proc Natl Acad Sci. 2017;114(44):E9413.

Sollars ES, Harper AL, Kelly LJ, Sambles CM, Ramirez-Gonzalez RH, Swarbreck D, et al. Genome sequence and genetic diversity of European ash trees. Nature. 2017;541(7636):212–6.

Li L-F, Cushman SA, He Y-X, Li Y. Genome sequencing and population genomics modeling provide insights into the local adaptation of weeping forsythia. Hortic Res. 2020;7(1):130.

Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLOS ONE. 2012;7(2):e30619.

Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27(8):1157–8.

Darling AE, Mau B, Perna NT. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLOS ONE. 2010;5(6):e11147.

Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T: trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009, 25(15):1972-1973.

Zhang D, Gao F, Jakovlic I, Zou H, Zhang J, Li WX, et al. PhyloSuite: an integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies. Mol Ecol Resour. 2020;20(1):348–55.

Seppey M, Manni M, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness. Methods Mol Biol. 2019;1962:227–45.

Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20(1):238.

Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80.

Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35(21):4453–5.

Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(5):1530–4.

Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 2017;14(6):587–9.

Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B. PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Mol Biol Evol. 2017;34(3):772-3.

Wang H-C, Susko E, Roger AJ. The relative importance of modeling site pattern heterogeneity versus partition-wise heterotachy in phylogenomic inference. Syst Biol. 2019;68(6):1003–19.

Philippe H, Brinkmann H, Lavrov DV, Littlewood DT, Manuel M, Worheide G, et al. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol. 2011;9(3):e1000602.

Wang H-C, Minh BQ, Susko E, Roger AJ. Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Syst Biol. 2018;67(2):216–35.

Crotty SM, Minh BQ, Bean NG, Holland BR, Tuke J, Jermiin LS, et al. GHOST: recovering historical signal from heterotachously evolved sequence alignments. Syst Biol. 2020;69(2):249–64.

Rodrigue N, Lartillot N. Site-heterogeneous mutation-selection models within the PhyloBayes-MPI package. Bioinformatics. 2014;30(7):1020–1.

Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: improving the ultrafast bootstrap approximation. Mol Biol Evol. 2018;35(2):518–22.

Smith SA, Moore MJ, Brown JW, Yang Y. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants. BMC Evol Biol. 2015;15(1):150.

Sayyari E, Mirarab S. Fast coalescent-based computation of local branch support from quartet frequencies. Mol Biol Evol. 2016;33(7):1654–68.

Minh BQ, Hahn MW, Lanfear R. New methods to calculate concordance factors for phylogenomic datasets. Mol Biol Evol. 2020;37(9):2727–33.

Smith SA, O’Meara BC: treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 2012, 28(20):2689-2690.

Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, et al. A draft sequence of the neandertal genome. Science. 2010;328(5979):710.

Martin SH, Davey JW, Jiggins CD. Evaluating the use of ABBA–BABA statistics to locate introgressed loci. Mol Biol Evol. 2015;32(1):244–57.

Than C, Ruths D, Nakhleh L. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics. 2008;9:322.

Huson DH, Scornavacca C. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Syst Biol. 2012;61(6):1061–7.

Sayyari E, Mirarab S. Testing for polytomies in phylogenetic species trees using quartet frequencies. Genes. 2018;9(3):132.

Li L-F, Cushman SA, He Y-X, Li Y. Genome sequencing and population genomics modeling provide insights into the local adaptation of weeping forsythia. Hortic Res. 2020;7(1):1-12. https://www.nature.com/articles/s41438-41020-00352-41437 .

Xu S, Ding Y, Sun J, Zhang Z, Wu Z, Yang T, Shen F, Xue G. A high-quality genome assembly of Jasminum sambac provides insight into floral trait formation and Oleaceae genome evolution. Mol Ecol Resour. 2022;22(2):724-739. https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13497 .

Sollars ESA, Harper AL, Kelly LJ, Sambles CM, Ramirez-Gonzalez RH, Swarbreck D, Kaithakottil G, Cooper ED, Uauy C, Havlickova L, et al. Genome sequence and genetic diversity of European ash trees. Nature. 2017;541(7636):212-216. http://www.nature.com/articles/nature20786 .

Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. Synteny and collinearity in plant genomes. Science. 2008;320(5875):486-488. https://www.science.org/doi/10.1126/science.1153917 .

Bouckaert R, Heled J, Kuhnert D, Vaughan T, Wu CH, Xie D, et al. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comp Biol. 2014;10(4):e1003537.

Call VB, Dilcher DL. Investigations of angiosperms from the Eocene of southeastern North America: samaras of Fraxinus wilcoxiana Berry. Rev. Palaeobot. Palynol. 1992;74(3):249–66.

Palamarev E. Paleobotanical evidences of the Tertiary history and origin of the Mediterranean sclerophyll dendroflora. Plant Syst Evol. 1989;162(1/4):93–107.

Muller J. Fossil pollen records of extant angiosperms. Bot Rev. 1981;47(1):1–142.

Terral JF, Badal E, Heinz C, Roiron P, Thiebault S, Figueiral I. A hydraulic conductivity model points to post-neogene survival of the mediterranean olive. Ecology. 2004;85(11):3158–65.

Rambaut A, Suchard M, Xie D, Drummond A. Tracer v1.6. 2014. Available from http://beast.bio.ed.ac.uk/Tracer .

Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016;33(7):1870–4.

Yang ZH. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91.

Barrett CF, Baker WJ, Comer JR, Conran JG, Lahmeyer SC, Leebens-Mack JH, et al. Plastid genomes reveal support for deep phylogenetic relationships and extensive rate variation among palms and other commelinid monocots. New Phytol. 2016;209(2):855–70.

Dong W, Li E, Liu Y, Xu C, Liu K, Cui X, et al. Genome skimming data for: Phylogenomic approaches untangle early divergences and complex diversifications of the olive plant family. NCBI BioProject. 2022. https://identifiers.org/bioproject:PRJNA820313 .

Dong W, Li E, Liu Y, Xu C, Liu K, Cui X, et al. Genome skimming data for: Phylogenomic approaches untangle early divergences and complex diversifications of the olive plant family. NCBI BioProject. 2022. https://identifiers.org/bioproject:PRJNA704245 .

Acknowledgements

We thank Bo Xu for assistance with PAML analysis and the DNA Bank of China for providing materials.

This research was supported by CACMS Innovation Fund (No.CI2021A03909) and the Science and Technology Basic Resources Investigation Program of China (No. 2021FY100200).

Author information

Authors and affiliations.

Laboratory of Systematic Evolution and Biogeography of Woody Plants, School of Ecology and Nature Conservation, Beijing Forestry University, Beijing, 100083, China

Wenpan Dong, Enze Li, Yushuang Wang, Kangjia Liu, Xingyong Cui & Zhixiang Zhang

State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China

Yanlei Liu, Chao Xu, Zhili Suo & Shiliang Zhou

State Key Laboratory Breeding Base of Dao-di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China

Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington, DC, 20013-7012, USA

Contributions

WD: supervision, conceptualization, methodology, formal analysis, investigation, writing—original draft, writing—review and editing. EL: methodology, software, data curation. YL: data curation, investigation; CX: resources, writing—original draft. YW: data curation, methodology. KL: investigation, methodology, software. XC: resources, methodology, data curation. JS: supervision, resources, funding acquisition. ZS: resources, investigation. ZZ: supervision, investigation. JW: conceptualization, writing—original draft, writing—review and editing; SZ: supervision, writing—review and editing, writing—original draft. The authors all read and approved the final manuscript.

Corresponding authors

Correspondence to Wenpan Dong , Jiahui Sun or Jun Wen .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

Taxa included in this study with locality and voucher numbers.

Table S2. Information from the GenBank data, including the accession number of chloroplast genome sequences and Sequence Read Archive (SRA).

Table S3. Branch support values of the 25 gene trees at the tribe level. The trees are numbered as in Table 2.

Table S4. Frequency of all the possible tree topologies from six species at the tribe level of Oleaceae.

Table S5. D-statistic test results at the tribe level of Oleaceae with Origanum vulgare as an outgroup.

Table S6. QuIBL analysis results at the tribe level of Oleaceae.

Table S7. Average total introgression proportion per species pair in the QuIBL analysis at the tribe level of Oleaceae.

Table S8. Frequency of all the possible tree topologies from five species at the subtribe level of tribe Oleeae.

Table S9. D-statistic test results at the subtribe level of tribe Oleeae with Forsythia suspensa as an outgroup.

Table S10. The QuIBL analysis results at the subtribe level of tribe Oleeae.

Table S11. Average total introgression proportion per species pair in the QuIBL analysis at the subtribe level of tribe Oleeae.

Table S12. Details of the four calibration points used in the BEAST analysis.

Additional file 2: Fig. S1.

The maximum likelihood tree estimated from the 77G180saa based on the gene partition models used as a reference to evaluate conflict and concordance among the 19 plastid datasets trees (Table 2). Pie charts depict conflict amongst the input trees, with the blue, green, red, and gray slices representing, respectively, the proportion of input bipartitions concordant, conflicting (supporting a single main alternative topology), conflicting (supporting various alternative topologies), and uninformative (BS < 80) at each node. The numbers below each branch are ICA values.

Fig. S2. The maximum likelihood tree estimated from the SNP-ash dataset used as a reference to evaluate conflict and concordance among the six SNP gene trees (Table 2). Pie charts depict conflict amongst the input trees, with the blue, green, red, and gray slices representing, respectively, the proportion of input bipartitions concordant, conflicting (supporting a single main alternative topology), conflicting (supporting various alternative topologies), and uninformative (BS < 80) at each node. The numbers below each branch are ICA values.

Fig. S3. The maximum likelihood tree estimated from the SNP-ash dataset used as a reference to evaluate conflict and concordance among the 41 gene trees using the dividing methods. Pie charts depict conflict amongst the input trees, with the blue, green, red, and gray slices representing, respectively, the proportion of input bipartitions concordant, conflicting (supporting a single main alternative topology), conflicting (supporting various alternative topologies), and uninformative (BS < 80) at each node. The numbers below each branch are ICA values.

Fig. S4. The maximum likelihood tree estimated from the 77G180saa based on the gene partition models used as a reference to evaluate conflict and concordance among the 24 trees (plastid datasets and SNP datasets, Table 2). Pie charts depict conflict amongst the input trees, with the blue, green, red, and gray slices representing, respectively, the proportion of input bipartitions concordant, conflicting (supporting a single main alternative topology), conflicting (supporting various alternative topologies), and uninformative (BS < 80) at each node. The numbers below each branch are ICA values.

Fig. S5. The divergence time of Oleaceae was estimated by BEAST according to age calibrations of four nodes based on the concatenated 76-coding gene dataset.

Additional file 3.

Note. The reason for using the ML tree from the 180s77Gaa dataset under a gene partitioning scheme as the reference tree.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Dong, W., Li, E., Liu, Y. et al. Phylogenomic approaches untangle early divergences and complex diversifications of the olive plant family. BMC Biol 20, 92 (2022). https://doi.org/10.1186/s12915-022-01297-0

Received: 02 December 2021

Accepted: 13 April 2022

Published: 25 April 2022

DOI: https://doi.org/10.1186/s12915-022-01297-0

  • Ancient introgression
  • Gene tree conflict
  • Incomplete lineage sorting
  • Phylogenomics
  • Rate heterogeneity

BMC Biology

ISSN: 1741-7007

Open Access

Peer-reviewed

Research Article

Implications of gene tree heterogeneity on downstream phylogenetic analyses: A case study employing the Fair Proportion index

Roles Conceptualization, Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing

Affiliation Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ, United States of America

Roles Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

Affiliation Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, United States of America

Roles Conceptualization, Investigation, Methodology, Supervision, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliations Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH, United States of America, Department of Statistics, The Ohio State University, Columbus, OH, United States of America

  • Kristina Wicke, 
  • Md. Rejuan Haque, 
  • Laura Kubatko

PLOS

  • Published: April 25, 2024
  • https://doi.org/10.1371/journal.pone.0300900

Many questions in evolutionary biology require the specification of a phylogeny for downstream phylogenetic analyses. However, with the increasingly widespread availability of genomic data, phylogenetic studies are often confronted with conflicting signal in the form of genomic heterogeneity and incongruence between gene trees and the species tree. This raises the question of determining what data and phylogeny should be used in downstream analyses, and to what extent the choice of phylogeny (e.g., gene trees versus species trees) impacts the analyses and their outcomes. In this paper, we study this question in the realm of phylogenetic diversity indices, which provide ways to prioritize species for conservation based on their relative evolutionary isolation on a phylogeny, and are thus one example of downstream phylogenetic analyses. We use the Fair Proportion (FP) index, also known as the evolutionary distinctiveness score, and explore the variability in species rankings based on gene trees as compared to the species tree for several empirical data sets. Our results indicate that prioritization rankings among species vary greatly depending on the underlying phylogeny, suggesting that the choice of phylogeny is a major influence in assessing phylogenetic diversity in a conservation setting. While we use phylogenetic diversity conservation as an example, we suspect that other types of downstream phylogenetic analyses such as ancestral state reconstruction are similarly affected by genomic heterogeneity and incongruence. Our aim is thus to raise awareness of this issue and inspire new research on which evolutionary information (species trees, gene trees, or a combination of both) should form the basis for analyses in these settings.

Citation: Wicke K, Haque MR, Kubatko L (2024) Implications of gene tree heterogeneity on downstream phylogenetic analyses: A case study employing the Fair Proportion index. PLoS ONE 19(4): e0300900. https://doi.org/10.1371/journal.pone.0300900

Editor: Ruriko Yoshida, Naval Postgraduate School, UNITED STATES

Received: December 16, 2023; Accepted: March 1, 2024; Published: April 25, 2024

Copyright: © 2024 Wicke et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data and code are available at the following GitHub repository: https://github.com/lkubatko/FP-Index-Discordance/ .

Funding: KW was supported by The Ohio State University’s President’s Postdoctoral Scholars Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Estimating the evolutionary relationships among organisms is a central goal in evolutionary biology. On one hand, phylogenies are inferred to uncover past evolutionary trajectories and understand the relatedness and differences among organisms. On the other hand, phylogenies provide the basis for downstream phylogenetic analyses such as studying trait evolution (see, e.g., the review by [ 1 ]), ancestral state reconstruction (see, e.g., [ 2 ] for an overview and [ 3 ] for a more recent study), estimation of diversification rates and testing of macroevolutionary models (e.g., [ 4 – 6 ]), and quantifying biodiversity (e.g., [ 7 – 9 ]).

To provide meaningful results, these downstream analyses rely on accurate phylogenies representing the evolutionary relationships among the organisms studied. Traditionally, phylogenetic trees have been inferred from single genes and the resulting gene trees were assumed to be a valid estimate for the species tree, i.e., the “true” evolutionary history of the species under consideration. However, the advent of whole genome sequencing has resulted in increased appreciation of the differences between these gene trees and the species tree. Discordance among gene trees and between gene trees and the species tree is known to arise from numerous biological processes, including incomplete lineage sorting, lateral gene transfer, and gene duplication and loss (e.g., [ 10 – 12 ]). Downstream phylogenetic analyses thus face the challenge of conflicting signal in the data and the question of which evolutionary information (species trees, gene trees, or a combination of both) to employ in these settings.

In this paper, we use phylogenetic diversity conservation as an example setting in which to study the effects of phylogenetic variation in the form of gene tree heterogeneity. Phylogenetic diversity (PD) is a quantitative measure of biodiversity introduced by [ 7 ]. Given a weighted phylogeny, it sums the branch lengths connecting a subset of the species, thereby linking evolutionary history to feature diversity [ 7 ]. In fact, preserving PD and the “Tree of Life” has become an integral part of conservation considerations (see, e.g., the “Phylogenetic Diversity Task Force” initiated by the IUCN [ 13 ]).
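
As a rough illustration of how PD sums branch lengths, the following Python sketch computes PD on a toy rooted tree. The tree, its branch lengths, and the function name `phylogenetic_diversity` are invented for this example, and the sketch assumes the rooted convention in which the paths from the chosen species up to the root are counted; other conventions (e.g., the minimal subtree connecting the species) exist.

```python
# Toy rooted tree: each node maps to (parent, length of the branch above it).
# Topology and branch lengths are made up for illustration only.
TREE = {
    "A": ("u", 1.0),
    "B": ("u", 1.0),
    "u": ("root", 2.0),
    "C": ("root", 3.0),
}


def phylogenetic_diversity(tree, species):
    """Sum the lengths of all branches lying on a root-to-leaf path of `species`."""
    used = set()
    for node in species:
        # Walk up towards the root; the root has no entry in `tree`.
        while node in tree:
            used.add(node)  # a branch is identified by the node below it
            node = tree[node][0]
    return sum(tree[n][1] for n in used)


pd_ab = phylogenetic_diversity(TREE, {"A", "B"})        # branches A, B, u
pd_ac = phylogenetic_diversity(TREE, {"A", "C"})        # branches A, u, C
pd_all = phylogenetic_diversity(TREE, {"A", "B", "C"})  # every branch
print(pd_ab, pd_ac, pd_all)
```

In this toy tree the pair {A, C} captures more PD than the pair {A, B}, because A and B share most of their evolutionary history.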

While PD captures the biodiversity of sets of species, evolutionary isolation metrics, also known as phylogenetic diversity indices, apportion the total diversity of a tree among its leaves and quantify the relative importance of species for overall biodiversity based on their placement in the tree. Various methods can be devised to distribute the total diversity of a tree across present-day species, and a variety of phylogenetic diversity indices for phylogenetic trees has been introduced in the literature (for an overview, see, e.g., [ 9 , 14 ]). One of the most popular indices is the Fair Proportion (FP) index (also known as evolutionary distinctiveness (ED) score) introduced by [ 8 , 15 ]. Its underlying idea is to assign each species a “fair proportion” of the total PD of a tree. More precisely, each species descended from a given branch in the phylogeny receives an equal proportion of that branch’s length.
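
The verbal definition above (each species descended from a branch receives an equal share of that branch's length) can be sketched in Python as follows. The toy tree and the function name `fair_proportion` are hypothetical and serve only to illustrate the idea; this is not the code used in the study.

```python
# Toy rooted tree: each node maps to (parent, length of the branch above it).
# Topology and branch lengths are made up for illustration only.
TREE = {
    "A": ("u", 1.0),
    "B": ("u", 1.0),
    "u": ("root", 2.0),
    "C": ("root", 3.0),
}


def fair_proportion(tree):
    """Return the FP index of every leaf of a rooted tree with branch lengths."""
    parents = {parent for parent, _ in tree.values()}
    leaves = [node for node in tree if node not in parents]

    # Count how many leaves descend from each branch.
    n_descendants = {node: 0 for node in tree}
    for leaf in leaves:
        node = leaf
        while node in tree:
            n_descendants[node] += 1
            node = tree[node][0]

    # Each leaf collects an equal share of every branch on its path to the root.
    fp = {}
    for leaf in leaves:
        total, node = 0.0, leaf
        while node in tree:
            total += tree[node][1] / n_descendants[node]
            node = tree[node][0]
        fp[leaf] = total
    return fp


fp = fair_proportion(TREE)
print(fp)
```

Here A and B each receive their own pendant branch plus half of the shared branch above them, while C keeps its whole pendant branch. The FP values sum to the total branch length of the tree, which is the sense in which the index apportions the total PD among the leaves.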

The FP index has been employed in the EDGE of Existence programme, a global conservation initiative focusing specifically on threatened species that represent a significant amount of unique evolutionary history ([ 8 ]; see also https://www.edgeofexistence.org/ ). This project aims at identifying species that are both evolutionarily distinct and globally endangered (EDGE), where evolutionary distinctiveness is measured in terms of the FP index and the globally endangered score is based on the IUCN Red List Category. However, because the FP index heavily depends on the underlying phylogenetic tree and its branch lengths, different phylogenies for the same set of species may result in different prioritization orders. In this note, we show that this is indeed a widespread phenomenon occurring across various groups of organisms by analyzing nine multilocus data sets from the literature and comparing the FP prioritization orders obtained on individual gene trees, on species trees, and by averaging over gene trees.

We remark that while the FP index is currently employed in the EDGE of Existence project, a revised “EDGE2 protocol” has recently been advocated for in the literature in order to overcome some of the shortcomings of the existing approach [ 16 ]. The new protocol is more comprehensive and for instance incorporates uncertainty in the phylogeny and extinction risks of species as well as PD complementarity [ 17 ]. In particular, it replaces the FP index by an “ED2” score which is equivalent to the so-called “heightened evolutionary distinctiveness (HED)” score [ 18 ]. Intuitively speaking, the ED2 score of a species takes into account both the species’ placement in an underlying phylogeny as well as the extinction risk of its close relatives. Importantly, however, the ED2 score also relies on a given phylogeny as a precursor and is thus likely also affected by genetic heterogeneity and incongruence between gene trees and the species tree. In this study, we thus employ the simpler FP index to illustrate the effects of phylogenetic variation on species prioritization. However, we emphasize that we employ the FP index primarily as an example of a phylogenetic downstream analysis. While we analyze nine multilocus data sets from the literature covering a wide range of organisms, our study does not imply that these species currently deserve or do not deserve conservation attention. Moreover, while EDGE-like studies typically use dated and ultrametric phylogenies [ 16 ], we do not have enough information (e.g., about the fossil record) for the nine multilocus data sets we analyze to date them (however, as described below, we enforce a molecular clock for all species trees but not gene trees). Additionally, phylogenetic conservation studies are ideally performed using large phylogenies with near-complete taxonomic groups, which our data sets unfortunately also lack. Our case study is thus meant as an illustration of the impacts of phylogenetic heterogeneity on downstream analyses in general and not as a real-world conservation study.

Materials and methods

The FP index

Fig 1.

https://doi.org/10.1371/journal.pone.0300900.g001

Data collection

Nine multilocus data sets were curated from the literature. In some cases, we reduced the published data sets to a subset of species and/or genes due to large amounts of missing data or missing outgroups. Detailed information on both the data reduction process as well as the species and loci included in our analyses is available from the S1 Appendix . The full data set was used when not otherwise noted below:

  • Dolphin data set [ 19 ]: DNA sequence data from 24 genes for 47 aquatic mammals. We reduced the data set to 28 species and 22 genes.
  • Fungi data set [ 20 ]: Amino acid sequence data, gene tree, and species tree estimates based on 683 genes for 25 individuals of bipolar budding yeasts and four outgroups.
  • Mammal data set [ 21 , 22 ]: DNA sequence data and gene tree estimates based on 447 genes for 33 species of mammals and four outgroups.
  • Plant data set [ 20 ]: DNA sequence data, gene tree, and species tree estimates based on 363 genes for 48 Lamiaceae and four outgroup species. We reduced the data set to 318 genes for the 52 species.
  • Primate data set [ 23 ]: DNA sequence data from 52 genes for four primate species.
  • Rattlesnake data set [ 24 ]: DNA sequence data from 19 genes for 24 individuals of six subspecies of Sistrurus rattlesnakes and two outgroup species. We picked one sequence per subspecies and one outgroup sequence (i.e., seven sequences in total), and used 16 of the 19 loci.
  • Rodent data set [ 20 ]: DNA sequence data, gene tree, and species tree estimates based on 794 genes for 37 rodent species. We reduced the data set to 761 genes for the 37 species.
  • Snake data set [ 25 ]: DNA sequence data, gene tree, and species tree estimates based on 333 genes for 31 caenophidians and two outgroup species.
  • Yeast data set [ 26 ]: DNA sequence data from 106 genes for eight yeast species.

Tree estimation and data analysis

Five of the nine data sets (fungi, mammal, plant, rodent, and snake) contained gene tree estimates, which we subsequently used in the analysis. For the remaining four data sets (dolphin, primate, rattlesnake, and yeast), we estimated gene trees under the GTR+Gamma model using RAxML version 8.2.12 [ 27 ].

Except for the fungi and snake data sets, a species tree estimate was obtained using SVDquartets [ 28 ] as implemented in the PAUP* package [ 29 ] (note that we re-estimated species trees for the rodent and plant data sets instead of using the published species trees since we reduced the corresponding data sets in the gene tree analysis). Finally, maximum likelihood branch lengths were computed for all species trees under the GTR+Gamma model (for all but the fungi data set, for which we used the LG model for amino acid sequence evolution) with the molecular clock enforced using PAUP*.

FP indices on individual gene and species trees as well as average FP indices across gene trees were calculated and the resulting values were transformed into rankings (using the “1224” standard competition ranking for ties). Based on this, boxplots of the distribution of ranks across gene trees were generated for each species (an example is depicted in Fig 2 ) and their interquartile ranges (IQRs) were computed. The mean, minimum, and maximum IQR across taxa, standardized by dividing by the number of taxa, were computed. Larger values indicate more variability in the distribution of ranks, and thus ranks that differ more across loci. Kendall’s τ rank correlation was calculated between the rankings obtained from pairs of gene trees, and from individual gene trees and the species tree ( Fig 3 ), as well as from the average FP index and the species tree ( Table 1 ). Finally, for each data set, the set of species ranked in the top quartile on the species tree was computed and compared with the percentage of gene trees supporting the placement of these species in the top quartile ( Table 2 ). All statistical analyses were performed using the R Statistical Software [ 30 ].
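
The ranking and IQR steps can be sketched as follows. The analyses in the paper were run in R; this Python sketch uses made-up FP values, hypothetical function names, and two assumptions that are not stated in the text: rank 1 is given to the species with the largest FP value, and quartiles are interpolated with the `inclusive` method of `statistics.quantiles`.

```python
from statistics import mean, quantiles

# Hypothetical FP values for four species on three gene trees (made-up numbers).
GENE_FP = [
    {"A": 3.0, "B": 2.0, "C": 2.0, "D": 1.0},
    {"A": 1.0, "B": 3.0, "C": 2.0, "D": 4.0},
    {"A": 2.0, "B": 4.0, "C": 1.0, "D": 3.0},
]


def competition_rank(scores):
    """'1224' ranking: tied species share the best rank and the next rank is skipped.

    Assumption: rank 1 goes to the largest FP value.
    """
    return {s: 1 + sum(other > v for other in scores.values())
            for s, v in scores.items()}


def standardized_iqr(ranks, n_taxa):
    """IQR of one species' ranks across gene trees, divided by the number of taxa."""
    q1, _, q3 = quantiles(ranks, n=4, method="inclusive")
    return (q3 - q1) / n_taxa


species = sorted(GENE_FP[0])
gene_ranks = [competition_rank(fp) for fp in GENE_FP]
iqr = {s: standardized_iqr([r[s] for r in gene_ranks], len(species))
       for s in species}
iqr_summary = (mean(iqr.values()), min(iqr.values()), max(iqr.values()))

# Ranking induced by the FP index averaged across gene trees.
average_fp = {s: mean(fp[s] for fp in GENE_FP) for s in species}
average_rank = competition_rank(average_fp)

print(gene_ranks[0])   # B and C are tied, so the ranks read 1, 2, 2, 4
print(iqr_summary)
print(average_rank)
```

Dividing by the number of taxa makes the IQR comparable across data sets of different size, since the possible range of ranks grows with the number of species.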

Fig 2.

In addition, the ranks obtained from the average FP index across the 22 gene trees (dots) and the ranks on the species tree (triangles) are depicted. The boxplots show large variability in ranks across gene trees. Moreover, the ranks obtained from the average FP index across gene trees and the FP index on the species tree are sometimes ordered differently. For example, F. attenuata receives a higher rank than P. electra on the species tree, but a smaller rank based on the average FP index.

https://doi.org/10.1371/journal.pone.0300900.g002

Fig 3.

Note that for the plant data set, Kendall’s τ rank correlation between the rankings obtained from pairs of gene trees was based on the ranks of all species present in both gene trees, respectively (due to missing data, not all gene trees contained the complete set of species).

https://doi.org/10.1371/journal.pone.0300900.g003

Table 1.

Column 2: Kendall’s τ rank correlation between the ranking of species induced by the average FP index across gene trees and that induced by the FP index on the species tree. Columns 3–5: Mean, minimum, and maximum IQR of the ranks obtained in the gene tree analyses (standardized by dividing by the number of taxa).

https://doi.org/10.1371/journal.pone.0300900.t001

Table 2.

Note that when the top 25% of species was not an integer (e.g., when the total number of species was not a multiple of four or when there were ties), we extended the set to include some additional species.

https://doi.org/10.1371/journal.pone.0300900.t002

Results

We observed extensive variation in the rankings of taxa induced by the FP index across different gene trees for all data sets analyzed. As a representative example, Fig 2 shows boxplots of the rankings for each of 28 marine mammal species for a data set with 22 genes [ 19 ]. Note in particular that some species (e.g., G. griseus ) have ranks that span almost the entire range of possibilities, while relatively few have consistent ranks across all genes (as would be indicated by “narrow” boxplots in Fig 2 ). The corresponding boxplots for the other data sets, which vary in the type of organism and the number of genes included, show similar behavior (see S1 – S8 Figs).

To measure the extent of variation in the rankings of species across gene trees, the IQR of ranks assigned was computed for each species. The mean, minimum, and maximum of the IQRs across species ( Table 1 , columns 3–5) are relatively stable, with the mammal, plant, and yeast data sets showing the lowest variability in rankings and the primate data set the highest. The large maximum IQR for the dolphin and fungi data sets suggests the presence of at least one species for which large variation in ranks across genes is observed.

Fig 3 summarizes results across all data sets examined. The blue boxplots show the distribution of values of Kendall’s τ rank correlation between rankings of species obtained from pairs of gene trees. While there tends to be a positive correlation overall, some boxes are closer to zero than to one, and some negative values (meaning that the corresponding rankings are negatively correlated; in other words, they tend to rank species in opposite order) are present. Thus, using different gene trees for the calculation of the FP index can result in very different prioritization orders.
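
To make the meaning of a negative rank correlation concrete, the sketch below computes Kendall's τ for two rankings of the same species. It is a plain-Python implementation of the tie-corrected variant (τ-b); the choice of τ-b, the function name, and the example ranks are assumptions for illustration, since the text does not state which variant was used.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long lists of ranks (ties allowed)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied in both rankings: ignored
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1          # both rankings order the pair the same way
        else:
            discordant += 1          # the rankings order the pair in opposite ways
    # Pairs untied in x times pairs untied in y.
    denom = sqrt((concordant + discordant + only_y_tied)
                 * (concordant + discordant + only_x_tied))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical ranks of species A-D on two gene trees.
ranks_gene1 = [1, 2, 2, 4]
ranks_gene2 = [4, 2, 3, 1]
tau = kendall_tau_b(ranks_gene1, ranks_gene2)
print(tau)
```

In this toy case five of the six species pairs are ordered in opposite ways by the two gene trees and one pair is tied on the first tree, so τ is close to -1.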

A similar trend occurs for the correlation between rankings obtained from individual gene trees and the species tree. The red boxplots in Fig 3 again show a tendency for positive correlations, but in several cases the correlation is closer to zero than to one, and for some data sets (e.g., the primate and rattlesnake data sets) negative values are again observed.

Given the variability in ranks obtained from different gene trees, we additionally considered how the FP index averaged across gene trees compared to that obtained from the species tree. For most data sets, we observe some changes in the orderings of ranks when using the average over gene trees vs. the species tree (see Fig 2 for an example). However, as Table 1 shows, the rankings are all positively correlated and, with the exception of the plant and rattlesnake data sets, the correlation is larger than 0.5. Thus, the average FP index across loci and the FP index based on the species tree tend to result in similar prioritization orders of species in general.

Finally, from a conservation viewpoint it may be relevant to know whether, e.g., the top 25% of species are consistently placed in the top quartile of species across different gene trees, even if there are small rank switches among them. For this reason, we additionally computed the set of species placed in the top quartile in the ranking based on the species tree, together with the percentage of gene trees supporting their placement in the top quartile. Table 2 shows that in most cases species ranked high on the species tree are indeed also ranked high on the majority of gene trees. This indicates that while there might be small differences in the precise ranking orders, different gene trees mostly agree on the set of highly ranked species. However, there are some outliers with very low support on the gene tree level. These correspond to species for which the rank on the species tree is also very different from the average rank across gene trees (see, e.g., P. bambusetorum and H. sanguinea in the plant data set and compare this to the rank from the species tree and the rank from the mean over gene trees given in S3 Fig ).
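
The top-quartile comparison can be sketched as follows. All FP values and function names are hypothetical. The sketch assumes that rank 1 corresponds to the largest FP value and that the top quartile is extended by rounding the cutoff up and by including species tied at the cutoff; the exact extension rule used for Table 2 may differ. Gene trees with missing species are not handled.

```python
from math import ceil


def competition_rank(scores):
    """'1224' ranking; assumes rank 1 goes to the largest FP value."""
    return {s: 1 + sum(other > v for other in scores.values())
            for s, v in scores.items()}


def top_quartile(scores):
    """Species ranked within the top 25%, rounding up and keeping ties."""
    cutoff = ceil(len(scores) / 4)
    return {s for s, r in competition_rank(scores).items() if r <= cutoff}


def gene_tree_support(species_tree_fp, gene_fps):
    """Percentage of gene trees whose top quartile contains each species
    that is in the top quartile on the species tree."""
    tops = [top_quartile(fp) for fp in gene_fps]
    return {s: 100.0 * sum(s in top for top in tops) / len(tops)
            for s in sorted(top_quartile(species_tree_fp))}


# Made-up FP values for four species on a species tree and four gene trees.
SPECIES_TREE_FP = {"A": 2.0, "B": 3.5, "C": 1.5, "D": 3.0}
GENE_FP = [
    {"A": 3.0, "B": 2.0, "C": 2.0, "D": 1.0},  # top quartile: A
    {"A": 1.0, "B": 3.0, "C": 2.0, "D": 4.0},  # top quartile: D
    {"A": 2.0, "B": 4.0, "C": 1.0, "D": 3.0},  # top quartile: B
    {"A": 1.0, "B": 4.0, "C": 2.0, "D": 4.0},  # B and D tie, both included
]

support = gene_tree_support(SPECIES_TREE_FP, GENE_FP)
print(support)
```

A low percentage for a species flags the situation described above: it is prioritized on the species tree but rarely on individual gene trees.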

Discussion

Many questions in evolutionary biology require the specification of a phylogeny as a precursor to downstream analyses and are thus challenged by conflicting evolutionary signals such as heterogeneity in gene trees or discordance between gene trees and the species tree. In this case study, we employed the FP index, a popular tool for prioritizing species for conservation based on their placement in an underlying phylogeny, to illustrate the effect of gene tree variation on downstream phylogenetic analyses. In analyzing species rankings obtained from the FP index for nine multilocus data sets for a broad range of organisms, we found that different loci/gene trees can induce very different prioritization orders of species (reflected by large variability in the rank a species is assigned across gene trees). However, with few exceptions, taking all loci into account by averaging the FP index across loci or by constructing a species phylogeny based on all loci resulted in similar prioritization orders. Additionally, we found that the set of species ranked in the top quartile on the species tree is in most cases also ranked highly by the majority of gene trees. On one hand, this result is in line with the previous observation that averaging the FP index across loci tends to result in similar prioritization orders as using the species tree. On the other hand, it may be relevant for actual conservation action targeted at the top set of species rather than individual species, as it illustrates that while different gene trees may rank species differently, in most cases they tend to agree on the set of most evolutionarily distinct species. However, we also saw some exceptions to this trend, i.e., species ranked highly on the species tree, but with very low support on the gene tree level, and it would be interesting to investigate the reasons and implications of this further in future research.

While we have used the FP index to illustrate the effect of gene tree variation on prioritization, this phenomenon is not limited to the FP index. Because phylogenetic diversity indices rely on specification of an underlying phylogeny, all such indices have the potential for the particular phylogeny chosen to strongly influence the results. We remark, however, that the FP index is weighted towards the tips of the phylogeny (since, by its definition, the lengths of edges close to the tips have a higher influence than more ancient edges) and thus may be more strongly affected by processes like incomplete lineage sorting than other PD indices. It would thus be interesting to see how other PD indices, such as the new ED2 score mentioned in the introduction, fare in this setting.

More generally, it would be interesting to analyze the behavior of other phylogenetic downstream analyses in the presence of phylogenetic heterogeneity. Given the role of evolutionary processes like incomplete lineage sorting and hybridization in generating variation in the trees for individual loci throughout the genome, the choice of phylogeny to use for downstream analyses is far from straightforward. Recent work has for instance shown that analysis of quantitative trait evolution can benefit from the incorporation of gene tree variation [ 31 ]. The analyses presented here thus serve to underscore the importance of carefully considering the choice of phylogeny for settings such as these.

Our results demonstrate a need for future research into the effect of phylogeny choice on downstream inference. While in many cases use of a species tree might provide a robust estimate of evolutionary relationships, an argument can be made that use of a species tree could obscure relationships of interest to a particular question if such relationships were strongly supported only by a subset of genes. Approaches that more explicitly accommodate variability in the underlying histories of the loci by computing single-locus trees may be promising in such cases, as they would allow an examination of the contribution of individual phylogenies to the overall results. However, this comes at the expense of increased computation, as such approaches would require separate estimation of phylogenies for all individual loci. Another possibility would be to adopt a Bayesian approach to formally integrate over the distribution of gene trees, again incurring a substantial computational cost. Overall, we hope that our analyses of the FP index provide a call for future research on how to best infer evolutionary relationships for downstream analyses.

Supporting information

S1 Fig. Boxplot of the ranks obtained from the FP index for the fungi data set consisting of 25 species and 683 genes.

In addition, the ranks obtained from the average FP index across the 683 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s001

S2 Fig. Boxplot of the ranks obtained from the FP index for the mammal data set consisting of 37 species and 447 genes.

In addition, the ranks obtained from the average FP index across the 447 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s002

S3 Fig. Boxplot of the ranks obtained from the FP index for the plant data set consisting of 52 species and 318 genes.

In addition, the ranks obtained from the average FP index across the 318 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s003

S4 Fig. Boxplot of the ranks obtained from the FP index for the primate data set consisting of 4 species and 52 genes.

In addition, the ranks obtained from the average FP index across the 52 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s004

S5 Fig. Boxplot of the ranks obtained from the FP index for the rattlesnake data set consisting of 7 species and 16 genes.

In addition, the ranks obtained from the average FP index across the 16 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s005

S6 Fig. Boxplot of the ranks obtained from the FP index for the rodent data set consisting of 37 species and 761 genes.

In addition, the ranks obtained from the average FP index across the 761 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s006

S7 Fig. Boxplot of the ranks obtained from the FP index for the snake data set consisting of 33 species and 333 genes.

In addition, the ranks obtained from the average FP index across the 333 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s007

S8 Fig. Boxplot of the ranks obtained from the FP index for the yeast data set consisting of 8 species and 106 genes.

In addition, the ranks obtained from the average FP index across the 106 gene trees (dots) and the ranks on the species tree (triangles) are depicted.

https://doi.org/10.1371/journal.pone.0300900.s008

S1 Appendix. Details of the curation of the nine data sets.

The Appendix provides details of the methods used to curate each of the nine data sets included in the manuscript.

https://doi.org/10.1371/journal.pone.0300900.s009

Acknowledgments

We thank those who reviewed earlier versions of this paper for their valuable feedback and suggestions.

References

  • 9. Vellend M, Cornwell WK, Magnuson-Ford K, Mooers AO. Measuring Phylogenetic biodiversity. In: Magurran AE, McGill BJ, editors. Biological Diversity: Frontiers in Measurement and Assessment. Oxford: Oxford University Press; 2011. p. 194–207.
  • 13. IUCN. IUCN SSC Phylogenetic Diversity Task Force; 2019. https://www.pdtf.org/ .
  • 15. Redding DW. Incorporating genetic distinctness and reserve occupancy into a conservation prioritisation approach. University Of East Anglia, Norwich, UK; 2003.
  • 29. Swofford D. PAUP*: Phylogenetic Analysis using Parsimony (*and other methods), version 4a168; 2021. Available from: www.paup.phylosolutions.com .
  • 30. R Core Team. R: A Language and Environment for Statistical Computing; 2021. Available from: https://www.R-project.org/ .

  • Open access
  • Published: 15 May 2024

Identification and expression analysis of the Xyloglucan transglycosylase/hydrolase ( XTH ) gene family under abiotic stress in oilseed ( Brassica napus L.)

  • Jingdong Chen 1,2,
  • Heping Wan 2,
  • Huixia Zhao 2,
  • Xigang Dai 2,
  • Wanjin Wu 2,
  • Jin Liu 1,
  • Jinsong Xu 1,
  • Rui Yang 1,
  • Benbo Xu 1,
  • Changli Zeng 2 &
  • Xuekun Zhang 1

BMC Plant Biology volume 24, Article number: 400 (2024)

XTH genes are key genes that regulate the hydrolysis and recombination of XG components and play a role in the structure and composition of plant cell walls. Therefore, clarifying the changes that occur in XTHs during plant defense against abiotic stresses is informative for the study of the plant stress regulatory mechanism mediated by plant cell wall signals. XTH proteins in Arabidopsis thaliana were selected as the seed sequences and, in combination with their protein structural domains, 80 members of the BnXTH gene family were identified from the whole genome of Brassica napus ZS11 and analyzed for their encoded protein physicochemical properties, phylogenetic relationships, covariance relationships, and interacting miRNAs. Based on the transcriptome data, the expression patterns of BnXTHs were analyzed in response to different abiotic stress treatments. The relative expression levels of some BnXTH genes under Al, alkali, salt, and drought treatments after 0, 6, 12 and 24 h were analyzed by qRT-PCR to explore their roles in abiotic stress tolerance in B. napus . BnXTHs showed different expression patterns in response to different abiotic stress signals, indicating that the response mechanisms of oilseed rape against different abiotic stresses are also different. This paper provides a theoretical basis for clarifying the function and molecular genetic mechanism of the BnXTH gene family in abiotic stress tolerance in rapeseed.

Introduction

In complex and changeable natural environments, crops may be exposed to abiotic stresses such as temperature, drought, salt, heavy metals, etc., potentially compromising their yield and quality [ 1 ]. As the first barrier, the cell wall plays an important role in the defense of plants against external environmental stresses [ 2 ].

The cell wall has a complex structure consisting of pectin-embedded microfibrils and non-cellulosic neutral polysaccharides, crosslinked with structural proteins and, depending on the tissue and organ, with lignin [ 3 ]. Plant cell walls are divided into the intercellular layer and the primary and secondary walls [ 4 ]. The main component of the intercellular layer is pectin; the layer is located between two neighboring cells, adheres to them, and buffers intercellular extrusion [ 5 ]. The primary wall is mainly composed of cellulose, hemicellulose, and structural proteins; its high plasticity allows the cell to maintain a certain shape while still extending as the cell grows [ 6 ]. The secondary wall, which is mainly composed of cellulose and often contains lignin, is located between the plasma membrane and the primary wall; it is usually thicker and harder, giving the cell wall great mechanical strength [ 7 ].

Xyloglucan (XG) is the main chain component of the polysaccharide hemicellulose that accounts for the largest proportion in the primary cell wall of dicotyledonous plants [ 8 ]. The glucan backbone of XG is composed of β-1,4-D-Glup, and the O-6 position is connected to α-D-Xylp. These Xyls can be further modified by other glycosyl groups [ 9 ]. The structures of XG are diverse, showing variability among plants, organs, tissues and even within the same plant [ 10 ]. The basic function of XG is to bind to the outside of cellulose fibrils to form a load-bearing network of the primary cell wall, which determines the tension of the cell wall. The modification and reconstruction of XG can adjust the elasticity and strength of the cell wall skeleton [ 11 ].

The recombination of XG requires the participation of xyloglucan transglycosylase/hydrolase (XTH) of the GH16 family. XTH has the dual activities of xyloglucan endotransglycosylase (XET) and xyloglucan hydrolase (XEH), allowing it to independently complete the cleavage and recombination of XG and to catalyze the extension of the XG chain without the direct involvement of ribose [ 12 ]. XTH genes play roles in plant growth and development, especially in response to abiotic stress. Compared with the Col-0 wild type, the cell wall structure and composition of the Arabidopsis xth19 mutant were altered, which resulted in lower freezing tolerance after cold and sub-zero acclimation [ 13 ]. In wheat, the expression of TaXTHs was significantly altered under drought stress compared with the control; transgenic Arabidopsis plants overexpressing TaXTH12.5a showed improved drought tolerance, with a germination rate, root length, hypocotyl length, and number of green leaves during the germination and vegetative growth stages significantly higher than those of the control lines [ 14 ]. Under salt stress, there were 11 differentially expressed PtrXTHs in the roots of poplar, nine in the stems, and seven in the leaves [ 15 ]; in transgenic tobacco plants overexpressing PeXTH from Populus, the water retention capacity of the leaves was improved and the net photosynthetic rate was significantly enhanced, which increased the salt tolerance of tobacco and indicates that XTHs might play roles in the plant response to salt stress [ 16 ].
Many soybean XTHs were found to respond to ethylene and flooding treatments. When the Arabidopsis thaliana AtXTH31 gene was overexpressed in soybean, the transgenic plants showed a higher germination rate and longer roots/hypocotyls under flooding stress than the control lines at the seedling and vegetative growth stages, confirming that XTHs play roles in regulating the response of soybean to abiotic stresses [ 17 ]. Du et al. found that aluminum (Al) induced the up-regulation of ZmXTHs in the maize root system, especially in the root tips, and that this induction was Al 3+ specific; Arabidopsis plants overexpressing ZmXTHs grew more healthily under Al stress, with lower Al content in their roots and root cell walls than the wild type, suggesting that ZmXTHs confer aluminum tolerance on transgenic Arabidopsis plants by reducing the accumulation of aluminum in roots and cell walls [ 18 ]. XTHs also play important roles in the response of plants to heavy metal stress: most XTHs in ramie (Boehmeria nivea) were responsive to cadmium stress, and heterologous expression of BnXTH3, BnXTH6 and BnXTH15 significantly enhanced the cadmium tolerance of transgenic yeast cells [ 19 ]; Xuan et al. identified 44 possible MtXTHs in the Medicago truncatula genome, 28 of which were responsive to HgCl 2 , and copper/mercury stresses significantly induced the expression of MtXTH3 in both roots and shoots. In addition, the expression of MtXTH3 showed an increasing trend with the elevation of Cu and Hg concentrations [ 20 ].

While the impact of XTHs on plant cell wall composition and abiotic stress resilience has been thoroughly investigated in Arabidopsis and various crops, the expression profiles of these genes under abiotic stress conditions in Brassica napus remain comparatively underexplored. In this study, we undertook a comprehensive genome-wide identification of the BnXTH gene family members in rapeseed; additionally, we conducted an analysis encompassing protein characteristics, gene structure, phylogenetic relationships, and collinearity. The expression patterns of BnXTHs in two varieties of rapeseed were compared under alkaline (Na 2 CO 3 ) stress using RNA-seq analysis. At the same time, the relative expression of BnXTHs under aluminum (AlCl 3 ), alkali (Na 2 CO 3 ), salt (NaCl) and drought (PEG 6000) stress at four time points (0, 6 h, 12 h and 24 h) was examined by qPCR, which provided evidence for the response of BnXTHs to abiotic stress. The findings of our study establish a foundational framework for subsequent investigations into the functional roles of these genes in the abiotic stress response of rapeseed. Furthermore, they offer a theoretical underpinning for the selection and breeding of abiotic stress-tolerant rapeseed germplasms.

Results

Identification and chromosomal localization of the BnXTH gene family

Figure 1. Chromosomal location of the BnXTHs in the B. napus genome

Joint identification was conducted using BLASTP and HMMER, and the NCBI-CDD tool was used to remove sequences with incomplete structural domains. Finally, 80 BnXTH proteins were obtained and named BnXTH1 – BnXTH80 according to the order of their encoding genes on the chromosomes. Among these 80 genes, 79 were mapped to 18 identified chromosomes (Fig. 1), and one was mapped to the pseudochromosome Scaffold0027. Of these, 10 BnXTHs were localized on ChrA03 and ChrC07, nine on ChrC09, seven on ChrA01, six on ChrA08 and ChrC09, five on ChrA06 and ChrC01, three on ChrA07, ChrC02, ChrA05, and ChrA09, two on ChrA10 and ChrC05, and one on ChrA02, ChrA04, and ChrC04, while no possible BnXTHs were detected on ChrC06. The number of exons in BnXTHs ranged from one to eight, and their chromosomal positions are provided in Table S1.

Physicochemical properties and subcellular localization prediction of BnXTH proteins

By predictive analysis, the number of amino acids (AAs) of the BnXTH proteins ranged from 212 to 473, the relative molecular mass (MW) ranged from 24.42 to 55.10 kDa, and the theoretical isoelectric point (pI) ranged from 4.77 to 9.96 (Table S1). Subcellular localization prediction showed that 54 XTH proteins may be located in the cytoplasm, 21 in the cell wall, and 5 in the nucleus.

Phylogenetic analysis of BnXTH proteins

In order to explore the phylogenetic relationships among XTH proteins, 80 XTH proteins of B. napus and 33 XTH proteins of Arabidopsis thaliana were combined to construct a phylogenetic tree. Based on their relationships and on the positions of the branches containing the AtXTH proteins, the XTH proteins were divided into four groups: the Early diverging group, Group I/II, Group IIIA and Group IIIB (Fig. 2 A). The number of proteins varied among groups: most proteins clustered into Group I/II, with 22 AtXTHs and 49 BnXTHs; Group IIIB contained 5 AtXTHs and 16 BnXTHs; the Early diverging group contained 4 AtXTHs and 8 BnXTHs; and Group IIIA had the fewest proteins, with 2 AtXTHs and 7 BnXTHs (Table 1).

To further confirm the phylogenetic relationships of XTHs within Brassicaceae plants, a total of 52 XTH proteins from Brassica rapa and 47 XTH proteins from Brassica oleracea were identified by using BLASTP and HMMER methodologies for the construction of a phylogenetic tree (Table S2 , Fig.  2 B). The analysis revealed that these proteins could also be classified into four groups, with the groupings of BnXTHs remaining unchanged, thereby validating the reliability of the phylogenetic tree. Notably, the Group I/II contained the highest number of proteins, with 35 BrXTHs and 33 BoXTHs, followed by Group IIIB, which comprised 8 BrXTHs and 8 BoXTHs. Group IIIA included 4 BrXTHs and 4 BoXTHs, and the Early diverging group consisted of 4 BrXTHs and 4 BoXTHs (Table  1 ).

Figure 2. Phylogenetic tree of XTH proteins. ( A ) Phylogenetic tree of XTH proteins in B. napus and A. thaliana. ( B ) Phylogenetic tree of XTH proteins in B. napus, B. rapa, and B. oleracea

Gene structures, conserved motifs and promoter cis-acting elements analysis of BnXTHs

Analysis of motifs in conserved domains is a powerful tool for understanding the function, structure, and evolution of proteins. Conserved motif analysis of BnXTH proteins showed that Motif 2, Motif 3, Motif 5 and Motif 6 were common to all BnXTH proteins; Motif 7 was absent only from BnXTH75; Motif 4 was absent from BnXTH33 and BnXTH75 of Group I/II; Motif 1 was absent from BnXTH7 and BnXTH24 of the Early diverging group; most Group IIIB proteins and a small number of Early diverging group proteins did not contain Motif 10; Motif 8 was present in some Group I/II proteins and in BnXTH35; and Motif 9 was present in most of the Group IIIB proteins (Fig. 3 A). The protein structural characteristics of BnXTHs were identified, and all proteins were found to contain a Glyco_hydro_16 and a XET_C domain (Fig. 3 B). Analysis of promoter cis-acting elements showed that BnXTHs may respond to stress and plant hormone signals (Fig. 3 C). Among them, there were as many as 910 light responsiveness elements, far more than any other type of cis-acting element. In addition, BnXTH41 and BnXTH45 contained the most cis-acting elements, amounting to 41, indicating that they may play roles in the growth and development of rapeseed. The study of the mature mRNA structure of BnXTHs revealed that they contained 1–8 CDS (coding sequence) regions (Fig. 3 D), and some BnXTHs also contained 1–2 UTRs (untranslated regions).

Figure 3. Gene structure analysis of the XTH family in B. napus. ( A ) Conserved motifs of BnXTH family proteins. ( B ) Pfam structure of BnXTH family proteins. ( C ) Promoter cis-acting elements of BnXTHs. ( D ) The mRNA structure encoded by the BnXTHs

Collinearity analysis of XTH gene family

In order to investigate the collinearity of XTH genes, BnXTH80, located on the pseudochromosome, was removed and the collinearity maps of rapeseed (B. napus), B. rapa and B. oleracea were drawn (Fig. 4). There were 92 pairs of collinear genes between the XTH genes of B. napus and B. rapa, and 88 pairs of collinear genes between the XTH genes of B. napus and B. oleracea (Table S3). Collinearity analysis within the B. napus genome showed that there were 83 pairs of collinear genes among BnXTHs (Fig. 5 and Table S4). To understand the selection pressure acting on the collinear BnXTH gene pairs, their Ka/Ks values were calculated (Table S5). The results showed that the Ka/Ks values of these 83 collinear gene pairs were all less than 1, indicating that they were all subject to purifying selection.

Figure 4. Collinearity of XTH genes in B. napus, B. rapa, and B. oleracea

Figure 5. Collinearity of BnXTHs. The circles in the figure from inside to outside represent the unknown base N ratio, gene density, GC ratio, GC skew, and chromosome length of the B. napus genome

Prediction of targeting relationship between miRNA and BnXTHs

As shown in Fig. 6, a total of 32 miRNAs of B. napus targeted 29 BnXTH genes through cleavage, and 5 miRNAs of B. napus targeted 3 BnXTH genes through translational inhibition. No miRNA was found to have both cleavage and translational inhibition effects on the same BnXTH gene. The results suggest that multiple miRNAs are involved in the post-transcriptional regulation of BnXTHs by targeting them through cleavage and translational inhibition.

Figure 6. Sankey diagram of the targeting relationships between miRNAs and BnXTH transcripts. The 3 columns represent miRNA, BnXTHs and inhibition effect

Analysis of transcriptome expression patterns of BnXTHs under abiotic stresses

In order to explore the gene expression patterns of BnXTHs under different abiotic stresses, we obtained the transcriptome TPM data of BnXTHs in B. napus ZS11 under CK, salt, drought, freezing, cold, heat and osmotic stress, and expression heat maps were plotted after log 10 (TPM + 1) transformation (Fig. 7). Some BnXTHs were not expressed or had low expression in leaves and roots under different stresses; however, BnXTH26, BnXTH63 and BnXTH37 showed high expression in leaves and roots under various stresses; BnXTH13 and BnXTH52 had high expression in roots under various stresses and high expression in leaves under some of the stresses; BnXTH15, BnXTH66, BnXTH25, and BnXTH53 were more highly expressed in leaves under various stresses; and BnXTH17, BnXTH68, BnXTH22, BnXTH58, and BnXTH73 were more highly expressed in roots under various stresses.

Figure 7. Analysis of expression patterns of BnXTHs under abiotic stress (CK, salt, drought, freezing, cold, heat, and osmotic) treatments. L: leaves; R: roots

Detection of BnXTHs expression under various abiotic stresses by qPCR technique

The cell wall is the first barrier protecting plant cells from damage, and XTH genes play an important role in plant defense against abiotic stress. Therefore, to investigate the effect of BnXTHs on the response of rapeseed to abiotic stress, we performed qPCR on leaves of ZS11 oilseed rape seedlings treated with 0.5 mmol·L -1 AlCl 3 , 0.2% (w/v) Na 2 CO 3 , 1.2% (w/v) NaCl, and 20% (w/v) PEG 6000 for 0, 6 h, 12 h and 24 h to analyze the relative expression of 9 different BnXTHs (Fig. 8). The results showed that they responded significantly to different abiotic stresses.

Under AlCl 3 treatment, the expression of BnXTH7, BnXTH17, and BnXTH6 did not change significantly at any time point; the expression of BnXTH72 and BnXTH37 increased with the extension of the treatment time; and the expression of BnXTH24, BnXTH9, BnXTH20, and BnXTH61 increased with treatment time, reached its maximum at 12 h, and then decreased at 24 h.

Under Na 2 CO 3 treatment, the expression of BnXTH7, BnXTH17, BnXTH6, BnXTH20, and BnXTH37 increased with the prolongation of the treatment time; the expression of BnXTH24, BnXTH9, and BnXTH61 increased and then decreased; and the expression of BnXTH72 first showed a slight, non-significant decrease and then increased significantly at 24 h.

Under NaCl treatment, the expression of BnXTH7 and BnXTH6 did not change significantly; the expression of BnXTH72 , BnXTH37 , and BnXTH61 increased with the prolongation of time; and the expression of BnXTH24 , BnXTH17 , BnXTH9 , and BnXTH20 increased and then decreased.

Figure 8. The relative expression of BnXTH7, BnXTH24, BnXTH17, BnXTH6, BnXTH9, BnXTH20, BnXTH72, BnXTH37, and BnXTH61 in leaves under 0.5 mmol·L -1 AlCl 3 , 0.2% (w/v) Na 2 CO 3 , 1.2% (w/v) NaCl, and 20% (w/v) PEG 6000 after 0, 6 h, 12 h and 24 h. Data represent the mean ± standard error for three biological experiments. Student's t-test was used to determine differences. *: significant differences between treatments at p ≤ 0.05. **: significant differences between treatments at p ≤ 0.01

Under PEG 6000-induced drought stress, the expression of BnXTH6, BnXTH72, and BnXTH37 did not change significantly; the expression of BnXTH7 increased with the prolongation of time; the expression of BnXTH17, BnXTH9, BnXTH20, and BnXTH61 increased and then decreased; and the expression of BnXTH24 increased at 6 h, decreased at 12 h, and then increased again at 24 h.

Discussion

Gene family analysis is conducive to mining key functional genes in crop genomes and provides a genetic research basis for the development of high-yield and high-quality germplasms. In this study, the Arabidopsis XTH protein sequences were used as seed sequences and combined with the Pfam domain search results to identify 80 members of the B. napus XTH gene family, distributed on 18 definite chromosomes and 1 pseudochromosome. We also identified 52 BrXTH proteins and 47 BoXTH proteins, numbers that differ from those of previous studies (53 XTHs of B. rapa and 38 XTHs of B. oleracea) [ 21 ]. This discrepancy may be due to differences in the sources of genome databases and the selection of thresholds. B. napus, as an allotetraploid, was identified with 80 XTH genes, fewer than the sum of the 52 BrXTHs and 47 BoXTHs, which suggests that recombination or mutation at the gene or chromosome level may have occurred in the B. napus genome during evolution. Phylogenetic analysis was performed based on the XTH protein sequences of A. thaliana and B. napus, and the distribution of XTH proteins in the phylogenetic tree was statistically analyzed. BnXTH proteins were found to be more uniformly distributed among the four groups of the phylogenetic tree, suggesting that gene duplication events might have occurred in the genome of B. napus at the whole-genome level in these four groups (Fig. 2 A). In the phylogenetic trees constructed for the three Brassicaceae species, the distribution of XTHs within the four groups was broadly consistent (Fig. 2 B; Table 1). According to the U-triangle model, B. napus (AACC: 2n = 38), an allopolyploid species, originated from a cross between B. rapa (AA: 2n = 20) and B. oleracea (CC: 2n = 18) [ 22 ]. This analogous distribution pattern of XTHs across the species underscores their genetic conservation. Furthermore, the fact that the total number of BnXTHs is lower than the combined total of BrXTHs and BoXTHs highlights the evolutionary distinctiveness of BnXTHs.

Further analysis of the gene structures and conserved domains showed that evolutionary conservation may exist among species including A. thaliana [ 23 ], Osmanthus fragrans [ 24 ], Schima superba [ 25 ], B. rapa and B. oleracea [ 21 ]. Motif 2, Motif 3, Motif 5, and Motif 6 were identified as ubiquitous across all BnXTH proteins. This uniform presence suggests that these four motifs likely constitute characteristic sequences of BnXTHs, playing pivotal roles in the structural and functional attributes of these proteins. Analysis of promoter cis-acting elements showed that the function of BnXTHs may be related to photosynthesis, given the large number of light responsiveness elements. By modifying the cell wall, XTHs influence leaf thickness, which can affect the internal leaf architecture and, consequently, the efficiency of light capture and the distribution of light within the leaf [ 16 ]. XTHs are also involved in the development and functioning of stomata by modulating the flexibility and integrity of the cell walls surrounding guard cells [ 26 ]. While the direct relationship between XTH activity and photosynthesis requires further empirical study, XTHs, through their role in cell wall modification, may indirectly contribute to optimizing the conditions necessary for efficient photosynthesis.

Ka/Ks analysis is a method used to measure the selective pressure on genes, commonly applied in the study of gene or gene family evolution. In this context, Ka represents the rate of nonsynonymous substitutions, which lead to changes in amino acids, while Ks denotes the rate of synonymous substitutions, which do not result in amino acid changes [ 27 ]. Ka/Ks analysis serves as a crucial tool for understanding the dynamics of gene evolution and revealing the key mechanisms behind the adaptive evolution of organisms. By collinearity analysis between genomes, we found that there were 92 pairs of XTH collinear genes between B. napus and B. rapa, 88 pairs of XTH collinear genes between B. napus and B. oleracea (Fig. 4), and 83 collinear XTH gene pairs within the B. napus genome (Fig. 5); the Ka/Ks values of these 83 pairs were all less than 1 (Table S5). The number of collinear gene pairs of XTHs among species was greater than the number of BnXTHs we identified, indicating that XTHs are evolutionarily correlated and have both inter- and intra-species evolutionary conservation.

miRNAs (microRNAs) are small non-coding RNAs that regulate gene expression at the post-transcriptional level by inhibiting the translation of messenger RNAs (mRNAs) or by promoting mRNA degradation [ 28 ]. miRNAs are approximately 20–24 nucleotides (nt) in size and are encoded by plants, animals and some viruses [ 29 ]. In plants, miRNAs inhibit gene expression mainly by mediating target RNA cleavage or translational repression [ 30 ]; by controlling the level of protein synthesis of their target genes, they have an impact on plant growth, development, and response to environmental stresses [ 31 ]. For example, bna-miR159, bna-miR6029, and bna-miR827 negatively regulate target genes related to the nitrogen metabolism pathway in B. napus, thereby affecting nitrogen signaling within the plant and consequently its pod thickness [ 32 ]. Transgenic B. napus lines overexpressing bna-miR319 exhibited abnormal development of serrated leaves and stem tip meristematic tissues [ 33 ]. Fu et al. analyzed miRNA-mRNA expression in Cd-stressed B. napus seedlings and found that Cd treatment significantly affected the expression of 22 miRNAs belonging to 11 families in roots and 29 miRNAs belonging to 14 families in shoots, and identified 8 miRNA-mRNA interaction pairs in roots and 8 miRNA-mRNA interaction pairs in shoots [ 34 ]. In this study, a total of 37 miRNAs of B. napus targeted 29 BnXTH genes through cleavage and translational repression (Fig. 6), and it was hypothesized that miRNAs form a miRNA-target gene regulatory network with BnXTHs, which are involved in the process of reorganization and hydrolysis of XG in B. napus cell walls. Interestingly, BnXTHs have similar cis-acting elements (Fig. 3), but can have relatively different expression patterns (Fig. 7).
For example, BnXTH2 and BnXTH41 were both targeted by bna-miR167 and had relatively consistent expression patterns: they were not expressed in leaves under various abiotic stresses but showed a certain amount of expression in roots. However, their expression patterns differed from those of other BnXTHs.

Various abiotic stresses may affect the integrity of plant cell walls, but plants are able to repair adverse changes in the cell wall, including changing its composition, structure, and mechanical properties to maintain growth [ 35 ]. There has been some research on how the cell wall responds to stress [ 36 ]. However, due to the constraints of imaging and biomechanical equipment and the potential crosstalk between cell wall signaling and stress signaling, it is difficult to draw general conclusions about the regulatory mechanisms of the cell wall in response to various abiotic stresses [ 2 ]. XTH genes are key genes that regulate the hydrolysis and recombination of XG components and play an important role in the structure and composition of plant cell walls [ 25 ]. Therefore, clarifying the changes that occur in XTHs during plant defense against abiotic stresses is informative for the study of the plant stress regulatory mechanism mediated by plant cell wall signals. In this study, qPCR was used to detect the relative expression of BnXTH7, BnXTH24, BnXTH17, BnXTH6, BnXTH9, BnXTH20, BnXTH72, BnXTH37, and BnXTH61 in leaves under 4 abiotic stress treatments at 4 time points: 0, 6 h, 12 h, and 24 h (Fig. 8). It was found that, in response to abiotic stresses, the expression of most of the BnXTHs either gradually increased or first increased and then decreased as the treatment time increased, suggesting that oilseed rape needs large amounts of recombined XG in response to abiotic stress in order to repair the damage that the stress may cause to the cell wall. As time passed, the plants slowly adapted to the stress and some of the BnXTHs may have received the relevant signals, so that their expression fell back.

Materials and methods

Identification and chromosome mapping of BnXTH family members

The sequence files of ZS11 (a variety of B. napus), cabbage (B. rapa), and kale (B. oleracea) were obtained from the BRAD database (BRAD: http://brassicadb.cn/#/ ) [ 37 ]. Thirty-three AtXTH protein sequences were downloaded from the A. thaliana genome database (TAIR: https://www.arabidopsis.org/ ) as seed sequences and used to search the whole protein sequence set of B. napus for possible BnXTH proteins by BLASTp (E-value < 1e-10) [ 38 ]. The PF06955 and PF00722 conserved structural domain files were downloaded from the Pfam database and used to search for possible BnXTH proteins with HMMER software ( http://www.hmmer.org/ ) [ 39 , 40 ]; the results were combined with the BLAST results to obtain the hypothetical BnXTH proteins. The putative protein sequences were uploaded to the NCBI-CDD website ( https://www.ncbi.nlm.nih.gov/cdd/ ) for further confirmation, and 80 members of the BnXTH gene family were finally identified. The BnXTHs were named BnXTH1 ~ BnXTH80 according to their order on the chromosomes. The ProtParam tool on the ExPASy website was used to predict the physical and chemical properties of BnXTH proteins [ 41 ], and subcellular localization was predicted with the Plant-mPLoc website [ 42 ]. Chromosomal location information of BnXTHs was retrieved from the downloaded annotation files and visualized with TBTools [ 43 ].
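
The joint identification step can be sketched in Python as follows. This is a minimal illustration rather than the pipeline actually used: it assumes that the BLASTP results are available in the standard 12-column tabular format, that the HMMER hits have already been reduced to sets of protein IDs per Pfam model, and that combining the two searches means keeping proteins supported by both. The function names and the toy protein IDs are hypothetical; only the E-value threshold (1e-10) and the Pfam accessions come from the text.

```python
E_VALUE_CUTOFF = 1e-10  # threshold stated in the text


def blast_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Collect subject IDs from 12-column BLAST tabular output below the cutoff."""
    hits = set()
    for line in tabular_text.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        subject_id = fields[1]       # column 2: subject (B. napus protein) ID
        e_value = float(fields[10])  # column 11: E-value
        if e_value < cutoff:
            hits.add(subject_id)
    return hits


def candidate_members(blast_ids, hmmer_ids_by_domain):
    """Keep BLAST hits also found by every Pfam domain search (assumed rule)."""
    with_all_domains = set.intersection(*hmmer_ids_by_domain.values())
    return sorted(blast_ids & with_all_domains)


# Toy input: hypothetical protein IDs, not real ZS11 identifiers.
toy_blast = "\n".join([
    "AtXTH1\tprotA\t85.0\t280\t40\t1\t1\t280\t1\t279\t1e-150\t520",
    "AtXTH1\tprotB\t30.0\t90\t60\t2\t5\t94\t10\t99\t2e-05\t45",
    "AtXTH2\tprotC\t78.0\t270\t55\t2\t1\t270\t3\t272\t3e-120\t430",
])
toy_hmmer = {
    "PF00722": {"protA", "protB", "protC"},
    "PF06955": {"protA"},
}

candidates = candidate_members(blast_hits(toy_blast), toy_hmmer)
print(candidates)
```

In this toy input, protB fails the E-value filter and protC lacks one of the two domains, so only protA is retained. The surviving candidates would still need the NCBI-CDD domain check described above.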

Phylogenetic analysis

The XTH protein sequences of A. thaliana and B. napus were imported into MEGA 11 software, and a phylogenetic tree was constructed using the neighbor-joining (NJ) method with 1000 bootstrap replicates; the tree was then annotated on the iTOL website ( https://itol.embl.de/ ) [ 44 , 45 ].

Prediction of gene structure, protein conserved motifs and promoter cis-acting elements

The CDS/UTR region and gene structure information of BnXTH gene family members were obtained from the NCBI-CDD website and Pfam database. The MEME 5.5.1 website was used to obtain the conserved motifs of the BnXTH proteins [ 46 ], and the cis-acting elements in the 2000 bp upstream promoter region of the BnXTHs were predicted through the PlantCARE website [ 47 ]. All data were visualized with TBTools.
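
Extracting the 2000 bp upstream region depends on gene orientation, which the following Python sketch makes explicit. It is an illustration under stated assumptions, not the procedure used in the study: the window is measured from the annotated gene start (the text does not say whether the gene start or the start codon was used), coordinates are 1-based and inclusive as in GFF annotation files, and the function names and toy sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def promoter_interval(gene_start, gene_end, strand, chrom_length, upstream=2000):
    """Return the 1-based inclusive (start, end) of the upstream window, or None."""
    if strand == "+":
        start, end = gene_start - upstream, gene_start - 1
    elif strand == "-":
        # For a minus-strand gene, "upstream" lies beyond the larger coordinate.
        start, end = gene_end + 1, gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 1)           # clip at the chromosome start
    end = min(end, chrom_length)    # clip at the chromosome end
    if start > end:
        return None                 # gene sits at the very edge of the chromosome
    return start, end


def promoter_sequence(chrom_seq, gene_start, gene_end, strand, upstream=2000):
    """Return the upstream sequence, read in the orientation of the gene."""
    interval = promoter_interval(gene_start, gene_end, strand, len(chrom_seq), upstream)
    if interval is None:
        return ""
    start, end = interval
    seq = chrom_seq[start - 1:end]  # convert 1-based inclusive to a Python slice
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq


# Toy chromosome (hypothetical) with a 3 bp window for readability.
toy_chrom = "AACCGGTTAC"
print(promoter_sequence(toy_chrom, 4, 6, "+", upstream=3))
print(promoter_sequence(toy_chrom, 4, 6, "-", upstream=3))
```

The resulting sequences could then be submitted to a cis-element prediction service such as PlantCARE; clipping at chromosome ends means that some promoters may be shorter than 2000 bp.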

Collinearity analysis of XTH genes

MCScanX software was used to analyze the collinear relationships of XTHs among the genomes of B. napus, B. rapa, and B. oleracea, as well as within the B. napus genome [ 48 ]. Collinearity information was visualized using TBTools. Ka/Ks values were calculated using KaKs_calculator 3.0 software [ 49 ].
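
The interpretation of Ka/Ks values can be illustrated with a short Python sketch. It does not reproduce the calculation performed by KaKs_calculator; it only assumes that Ka and Ks have already been estimated for each collinear pair and applies the usual reading of the ratio (below 1 for purifying selection, above 1 for positive selection, equal to 1 for neutral evolution). The pair names and values are invented for illustration.

```python
def selection_mode(ka, ks):
    """Classify a gene pair by its Ka/Ks ratio."""
    if ks <= 0:
        return "undefined"  # ratio cannot be computed without synonymous changes
    ratio = ka / ks
    if ratio < 1:
        return "purifying"
    if ratio > 1:
        return "positive"
    return "neutral"


def summarize(pairs):
    """Count how many (name, Ka, Ks) pairs fall into each selection mode."""
    counts = {}
    for _name, ka, ks in pairs:
        mode = selection_mode(ka, ks)
        counts[mode] = counts.get(mode, 0) + 1
    return counts


# Hypothetical values, not taken from Table S5.
toy_pairs = [
    ("pair1", 0.02, 0.25),
    ("pair2", 0.10, 0.40),
    ("pair3", 0.30, 0.20),
]
summary = summarize(toy_pairs)
print(summary)
```

Pairs with Ks equal to zero are reported as undefined rather than forced into a category, since the ratio is not meaningful in that case.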

The CDS sequences of BnXTHs were uploaded to the psRNATarget website ( https://www.zhaolab.org/psRNATarget/analysis?function=2/ ) [ 50 ], and combined with the miRNA mature sequences information of B. napus collected by the website, the targeting relationships between miRNA and BnXTHs were analyzed. The data were visualized using the Bioinformatics website ( https://www.bioinformatics.com.cn/plot_basic_alluvial_plot_017 ).

Analysis of the expression pattern of BnXTHs under different abiotic stress

To investigate the gene expression patterns of BnXTHs under abiotic stress treatments, the BnXTH gene IDs were uploaded to the BNIR database [ 51 ] to obtain the transcripts per million (TPM) values of B. napus ZS11 under control (CK), salt (200 mmol·L -1 NaCl treatment), drought (exposure to airflow for 1 h), freezing (recovering to 25 °C after 3 h of stress at -4 °C), cold (4 °C), heat (recovering to 25 °C after 3 h of stress at 38 °C), and osmotic stress (300 mmol·L -1 mannitol) after 24 h, and the data were visualized using TBTools software after calculating log 10 (TPM + 1).
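
The log 10 (TPM + 1) transformation applied before plotting can be written in a few lines of Python. The sketch below is only an illustration: the gene and condition labels are hypothetical, and the pseudocount of 1 is the one stated in the text, which keeps genes with a TPM of 0 at a transformed value of 0.

```python
import math


def log10_tpm(tpm):
    """Return log10(TPM + 1); the pseudocount avoids log10(0)."""
    if tpm < 0:
        raise ValueError("TPM values cannot be negative")
    return math.log10(tpm + 1.0)


def transform_matrix(tpm_matrix):
    """Apply the transformation to a {gene: {condition: TPM}} mapping."""
    return {
        gene: {condition: log10_tpm(value) for condition, value in row.items()}
        for gene, row in tpm_matrix.items()
    }


# Hypothetical TPM values for two genes in leaves (L) under two conditions.
toy_tpm = {
    "geneA": {"CK_L": 0.0, "salt_L": 9.0},
    "geneB": {"CK_L": 99.0, "salt_L": 999.0},
}
transformed = transform_matrix(toy_tpm)
for gene, row in transformed.items():
    print(gene, {condition: round(value, 2) for condition, value in row.items()})
```

The transformation compresses the dynamic range, so that highly expressed genes do not dominate the color scale of the heat map.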

Materials and treatment

B. napus ZS11 seeds were germinated on wet gauze (soaked with water) in a plant growth chamber at 20 to 22 °C and 65% humidity under a long-day condition (16-h-light/8-h-dark cycle) [ 52 ]. The one-week-old seedlings were then transferred into a previously described hydroponic system [ 53 ] under the same culture conditions for nearly 20 days until the fourth leaves had extended (for plants used for transcriptome sequencing, root samples were also harvested). For stress treatment research, leaf samples from 4-week-old plants of ZS11 were collected after 0, 6, 12 and 24 h of 0.5 mmol·L -1 AlCl 3 , 0.2% (w/v) Na 2 CO 3 , 1.2% (w/v) NaCl, and 20% (w/v) PEG 6000 treatment. Seedlings without any stress treatment were used as the control. Each treatment included three biological replicates. Leaves were harvested, immediately frozen in liquid nitrogen and stored at -80 °C for RNA extraction.

Relative expression of BnXTHs under different abiotic stresses

Total RNA of ZS11 under control and abiotic stress treatments was extracted using the RNAsimple Total RNA Kit (Tiangen Biotechnology Co., Ltd., Beijing, China) according to the manufacturer's protocol. cDNA was synthesized from 1 µg RNA of each sample with HiScript® II Q Select RT SuperMix with gDNA wiper (Vazyme, Nanjing, China). Gene-specific primers used for quantitative real-time PCR (qRT-PCR) are listed in Table 2. qRT-PCR was run on the AriaMx real-time PCR system (Agilent Technologies). The following cycling parameters were used: initial denaturation at 95 °C for 5 min; 40 amplification cycles consisting of denaturation at 95 °C for 10 s and annealing and extension at 60 °C for 30 s. The melting curve was then tested at 65–95 °C. The internal standard was the B. napus actin gene (BnaA01g27090D). To investigate the expression patterns of BnXTHs under various environmental conditions, 9 BnXTHs were randomly selected from the 4 groups for qRT-PCR experiments. Three biological replicates were performed for each sample, and each replicate contained three technical replicates. Relative expression levels were calculated according to the 2 −ΔΔCt method [ 54 ]. Data represent the mean ± standard error for three biological experiments. Student's t-test was used to determine differences. *: significant differences between treatments at p ≤ 0.05. **: significant differences between treatments at p ≤ 0.01.
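
The 2 −ΔΔCt calculation and the mean ± standard error summary can be sketched in Python as below. This is a minimal illustration with invented Ct values, not the authors' analysis script; it assumes that the Ct values passed in have already been averaged over technical replicates and that the 0 h sample serves as the control. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Return the fold change by the 2^-ddCt method."""
    delta_treated = ct_target_treated - ct_ref_treated   # dCt in treated sample
    delta_control = ct_target_control - ct_ref_control   # dCt in control sample
    return 2.0 ** -(delta_treated - delta_control)


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for replicate values."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical Ct values for three biological replicates:
# (target treated, reference treated, target control, reference control)
toy_replicates = [
    (24.0, 18.0, 26.0, 18.0),
    (24.5, 18.2, 26.1, 18.1),
    (23.8, 17.9, 25.9, 18.0),
]
fold_changes = [relative_expression(*ct) for ct in toy_replicates]
avg, sem = mean_and_standard_error(fold_changes)
print(round(avg, 2), round(sem, 2))
```

A target whose Ct drops by two cycles relative to the reference gene, as in the first toy replicate, corresponds to a four-fold increase in relative expression.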

Conclusions

The B. napus genome was found to contain 80 XTH genes distributed on 18 definite chromosomes and one pseudochromosome. Their phylogenetic relationships, protein physicochemical properties, subcellular localization, gene structures, promoter cis-acting elements, collinearity relationships, and interacting miRNAs were predicted and analyzed, and their transcriptional expression patterns as well as differences in expression in response to abiotic stress treatments were investigated. The results showed that the expression patterns of BnXTHs under abiotic stress treatments varied, suggesting that cell wall signaling in B. napus in response to abiotic stresses changes depending on the type of stress. The analysis of the BnXTH gene family and the study of its response pattern under abiotic stresses provide a good theoretical basis for further research on this family of genes in resistance breeding of B. napus.

Data availability

All the data generated or analyzed during this study are included in this published article and its supplementary information files.

Kopecká R, Kameniarová M, Černý M, Brzobohatý B, Novák J. Abiotic stress in Crop Production. Int J Mol Sci 2023, 24(7).

Rui Y, Dinneny JR. A wall with integrity: surveillance and maintenance of the plant cell wall under stress. New Phytol. 2020;225(4):1428–39.

Swaminathan S, Lionetti V, Zabotina OA. Plant Cell Wall Integrity Perturbations and Priming for Defense. Plants (Basel, Switzerland) 2022, 11(24).

Le Gall H, Philippe F, Domon JM, Gillet F, Pelloux J, Rayon C. Cell wall metabolism in response to Abiotic Stress. Plants (Basel Switzerland). 2015;4(1):112–66.

Zamil MS, Geitmann A. The middle lamella-more than a glue. Phys Biol. 2017;14(1):015004.

Bidhendi AJ, Chebli Y, Geitmann A. Fluorescence visualization of cellulose and pectin in the primary plant cell wall. J Microsc. 2020;278(3):164–81.

Xu H, Giannetti A, Sugiyama Y, Zheng W, Schneider R, Watanabe Y, Oda Y, Persson S. Secondary cell wall patterning-connecting the dots, pits and helices. Open Biology. 2022;12(5):210208.

Park YB, Cosgrove DJ. Xyloglucan and its interactions with other components of the growing cell wall. Plant Cell Physiol. 2015;56(2):180–94.

von Schantz L, Gullfot F, Scheer S, Filonova L, Cicortas Gunnarsson L, Flint JE, Daniel G, Nordberg-Karlsson E, Brumer H, Ohlin M. Affinity maturation generates greatly improved xyloglucan-specific carbohydrate binding modules. BMC Biotechnol. 2009;9:92.

Hsieh YS, Harris PJ. Xyloglucans of monocotyledons have diverse structures. Mol Plant. 2009;2(5):943–65.

Park YB, Cosgrove DJ. A revised architecture of primary cell walls based on biomechanical changes induced by substrate-specific endoglucanases. Plant Physiol. 2012;158(4):1933–43.

Hrmova M, Stratilová B, Stratilová E. Broad specific xyloglucan:xyloglucosyl transferases are formidable players in the re-modelling of plant cell wall structures. Int J Mol Sci. 2022;23(3):1656.

Takahashi D, Johnson KL, Hao P, Tuong T, Erban A, Sampathkumar A, Bacic A, Livingston DP 3rd, Kopka J, Kuroha T, et al. Cell wall modification by the xyloglucan endotransglucosylase/hydrolase XTH19 influences freezing tolerance after cold and sub-zero acclimation. Plant Cell Environ. 2021;44(3):915–30.

Han J, Liu Y, Shen Y, Li W. A surprising diversity of Xyloglucan Endotransglucosylase/Hydrolase in wheat: New in Sight to the roles in Drought Tolerance. Int J Mol Sci 2023, 24(12).

Cheng Z, Zhang X, Yao W, Gao Y, Zhao K, Guo Q, Zhou B, Jiang T. Genome-wide identification and expression analysis of the xyloglucan endotransglucosylase/hydrolase gene family in poplar. BMC Genomics. 2021;22(1):804.

Han Y, Wang W, Sun J, Ding M, Zhao R, Deng S, Wang F, Hu Y, Wang Y, Lu Y, et al. Populus Euphratica XTH overexpression enhances salinity tolerance by the development of leaf succulence in transgenic tobacco plants. J Exp Bot. 2013;64(14):4225–38.

Song L, Valliyodan B, Prince S, Wan J, Nguyen HT. Characterization of the XTH Gene Family: New Insight to the roles in soybean flooding tolerance. Int J Mol Sci 2018, 19(9).

Du H, Hu X, Yang W, Hu W, Yan W, Li Y, He W, Cao M, Zhang X, Luo B, et al. ZmXTH, a xyloglucan endotransglucosylase/hydrolase gene of maize, conferred aluminum tolerance in Arabidopsis. J Plant Physiol. 2021;266:153520.

Ma YS, Jie HD, Zhao L, Lv XY, Liu XC, Tang YY, Zhang Y, He PL, Xing HC, Jie YC. Identification of the Xyloglucan Endotransglycosylase/Hydrolase (XTH) gene family members expressed in Boehmeria nivea in response to cadmium stress. Int J Mol Sci 2022, 23(24).

Xuan Y, Zhou ZS, Li HB, Yang ZM. Identification of a group of XTHs genes responding to heavy metal mercury, salinity and drought stresses in Medicago truncatula. Ecotoxicol Environ Saf. 2016;132:153–63.

Wu D, Liu A, Qu X, Liang J, Song M. Genome-wide identification, and phylogenetic and expression profiling analyses, of XTH gene families in Brassica rapa L. and Brassica oleracea L. BMC Genomics. 2020;21(1):782.

Yim WC, Swain ML, Ma D, An H, Bird KA, Curdie DD, Wang S, Ham HD, Luzuriaga-Neira A, Kirkwood JS, et al. The final piece of the triangle of U: evolution of the tetraploid Brassica carinata genome. Plant Cell. 2022;34(11):4143–72.

Becnel J, Natarajan M, Kipp A, Braam J. Developmental expression patterns of Arabidopsis XTH genes reported by transgenes and Genevestigator. Plant Mol Biol. 2006;61(3):451–67.

Yang Y, Miao Y, Zhong S, Fang Q, Wang Y, Dong B, Zhao H. Genome-Wide Identification and Expression Analysis of XTH Gene Family during Flower-Opening Stages in Osmanthus fragrans. Plants (Basel, Switzerland) 2022, 11(8).

Yang Z, Zhang R, Zhou Z. The XTH Gene Family in Schima superba: genome-wide identification, expression profiles, and Functional Interaction Network Analysis. Front Plant Sci. 2022;13:911761.

Choi JY, Seo YS, Kim SJ, Kim WT, Shin JS. Constitutive expression of CaXTH3, a hot pepper xyloglucan endotransglucosylase/hydrolase, enhanced tolerance to salt and drought stresses without phenotypic defects in tomato plants (Solanum lycopersicum cv. Dotaerang). Plant Cell Rep. 2011;30:867–77.

Hurst LD. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 2002;18(9):486–7.

Correia de Sousa M, Gjorgjieva M, Dolicka D, Sobolewski C, Foti M. Deciphering miRNAs’ Action through miRNA Editing. Int J Mol Sci 2019, 20(24).

Kilikevicius A, Meister G, Corey DR. Reexamining assumptions about miRNA-guided gene silencing. Nucleic Acids Res. 2022;50(2):617–34.

Li M, Yu B. Recent advances in the regulation of plant miRNA biogenesis. RNA Biol. 2021;18(12):2087–96.

Bai JF, Wang YK, Wang P, Duan WJ, Yuan SH, Sun H, Yuan GL, Ma JX, Wang N, Zhang FT, et al. Uncovering male fertility transition responsive miRNA in a wheat photo-thermosensitive genic male sterile line by deep sequencing and Degradome Analysis. Front Plant Sci. 2017;8:1370.

Chen Z, Huo Q, Yang H, Jian H, Qu C, Lu K, Li J. Joint RNA-Seq and miRNA profiling analyses to Reveal Molecular mechanisms in regulating thickness of Pod Canopy in Brassica napus. Genes 2019, 10(8).

Lu H, Chen L, Du M, Lu H, Liu J, Ye S, Tao B, Li R, Zhao L, Wen J, et al. miR319 and its target TCP4 involved in plant architecture regulation in Brassica napus. Plant Science: Int J Experimental Plant Biology. 2023;326:111531.

Fu Y, Mason AS, Zhang Y, Lin B, Xiao M, Fu D, Yu H. MicroRNA-mRNA expression profiles and their potential role in cadmium stress response in Brassica napus. BMC Plant Biol. 2019;19(1):570.

Novaković L, Guo T, Bacic A, Sampathkumar A, Johnson KL. Hitting the Wall-Sensing and Signaling pathways involved in Plant Cell Wall Remodeling in response to Abiotic Stress. Plants (Basel Switzerland) 2018, 7(4).

Wang Z, Wang M, Yang C, Zhao L, Qin G, Peng L, Zheng Q, Nie W, Song CP, Shi H, et al. SWO1 modulates cell wall integrity under salt stress by interacting with importin ɑ in Arabidopsis. Stress Biology. 2021;1(1):9.

Chen H, Wang T, He X, Cai X, Lin R, Liang J, Wu J, King G, Wang X. BRAD V3.0: an upgraded Brassicaceae database. Nucleic Acids Res. 2022;50(D1):D1432–41.

Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40(Database issue):D1202–1210.

Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021;49(D1):D412–9.

Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic Acids Res. 2018;46(W1):W200–4.

Duvaud S, Gabella C, Lisacek F, Stockinger H, Ioannidis V, Durinx C. Expasy, the Swiss Bioinformatics Resource Portal, as designed by its users. Nucleic Acids Res. 2021;49(W1):W216–27.

Horton P, Park KJ, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, Nakai K. WoLF PSORT: protein localization predictor. Nucleic Acids Res. 2007;35(Web Server issue):W585–587.

Chen C, Chen H, Zhang Y, Thomas HR, Frank MH, He Y, Xia R. TBtools: an integrative Toolkit developed for interactive analyses of big Biological Data. Mol Plant. 2020;13(8):1194–202.

Tamura K, Stecher G, Kumar S. MEGA11: Molecular Evolutionary Genetics Analysis Version 11. Mol Biol Evol. 2021;38(7):3022–7.

Letunic I, Bork P. Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49(W1):W293–6.

Bailey TL, Johnson J, Grant CE, Noble WS. The MEME suite. Nucleic Acids Res. 2015;43(W1):W39–49.

Lescot M, Déhais P, Thijs G, Marchal K, Moreau Y, Van de Peer Y, Rouzé P, Rombauts S. PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Res. 2002;30(1):325–7.

Wang Y, Tang H, Debarry JD, Tan X, Li J, Wang X, Lee TH, Jin H, Marler B, Guo H, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012;40(7):e49.

Zhang Z. KaKs_Calculator 3.0: calculating selective pressure on Coding and non-coding sequences. Genom Proteom Bioinform. 2022;20(3):536–40.

Dai X, Zhuang Z, Zhao PX. psRNATarget: a plant small RNA target analysis server (2017 release). Nucleic Acids Res. 2018;46(W1):W49–54.

Yang Z, Wang S, Wei L, Huang Y, Liu D, Jia Y, Luo C, Lin Y, Liang C, Hu Y, et al. BnIR: a multi-omics database with various tools for Brassica napus research and breeding. Mol Plant. 2023;16(4):775–89.

Zhang H, Zhang X, Zhao H, Hu J, Wang Z, Yang G, Zhou X, Wan H. Genome-wide identification and expression analysis of phenylalanine ammonia-lyase (PAL) family in rapeseed (Brassica napus L). BMC Plant Biol. 2023;23(1):481.

Wan H, Chen L, Guo J, Li Q, Wen J, Yi B, Ma C, Tu J, Fu T, Shen J. Genome-wide Association Study reveals the Genetic Architecture Underlying Salt Tolerance-related traits in rapeseed (Brassica napus L). Front Plant Sci. 2017;8:593.

Livak KJ, Schmittgen TD. Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) method. Methods (San Diego Calif). 2001;25(4):402–8.

Acknowledgements

We thank the Hubei Province “515” Action (Cooperative Extension) “Rapeseed New Technology Demonstration and Science and Technology Service Rapeseed Industry Chain Project”, the Ministry of Agriculture and Rural Affairs Plantation Management Department project “Rapeseed Industry Policy Research and Main Pests and Diseases Prevention and Control Strategies” (15214011), and the National Major Biological Breeding Project of China (2022ZD04010) for their financial support.

Funding

This study was supported by grants from the Hubei Province “515” Action (Cooperative Extension) “Rapeseed New Technology Demonstration and Science and Technology Service Rapeseed Industry Chain Project”, the Ministry of Agriculture and Rural Affairs Plantation Management Department project “Rapeseed Industry Policy Research and Main Pests and Diseases Prevention and Control Strategies” (15214011), the National Major Biological Breeding Project of China (2022ZD04010), and the Jianghan University scientific research project funding scheme (2022XKZX17).

Author information

Authors and affiliations.

MARA Key Laboratory of Sustainable Crop Production in the Middle Reaches of the Yangtze River (Co-construction by Ministry and Province), College of Agriculture, Yangtze University, Jingzhou, 434025, China

Jingdong Chen, Jin Liu, Jinsong Xu, Rui Yang, Benbo Xu & Xuekun Zhang

Hubei Engineering Research Center for Protection and Utilization of Special Biological Resources in the Hanjiang River Basin, College of Life Science, Jianghan University, Wuhan, 430056, Hubei, China

Jingdong Chen, Heping Wan, Huixia Zhao, Xigang Dai, Wanjin Wu & Changli Zeng

Contributions

X.Z. and C.Z. designed and managed the project. J.C. and H.W. completed the experiments and wrote the paper. H.Z., X.D., W.W., J.L., J.X. and R.Y. performed experiments, material sampling, and laboratory data measurements. X.Z., C.Z. and B.X. revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Changli Zeng or Xuekun Zhang.

Ethics declarations

Ethics approval and consent to participate.

Study complied with local and national regulations for using plants.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

12870_2024_5121_MOESM1_ESM.xlsx

Supplementary Material 1: Table S1: Physicochemical properties and subcellular localization prediction of BnXTH proteins.

Supplementary Material 2: Table S2: XTH IDs of B. rapa and B. oleracea.

Supplementary Material 3: Table S3: Collinearity of XTH genes in B. napus, B. rapa, and B. oleracea.

Supplementary Material 4: Table S5: Collinearity of XTH genes in B. napus.

Supplementary Material 5: Table S4: Collinearity of XTH genes in B. napus.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Chen, J., Wan, H., Zhao, H. et al. Identification and expression analysis of the Xyloglucan transglycosylase/hydrolase (XTH) gene family under abiotic stress in oilseed (Brassica napus L.). BMC Plant Biol 24, 400 (2024). https://doi.org/10.1186/s12870-024-05121-5

Received: 26 October 2023

Accepted: 09 May 2024

Published: 15 May 2024

DOI: https://doi.org/10.1186/s12870-024-05121-5

  • Brassica napus
  • XTH gene family
  • Abiotic stress tolerance
  • Expression profiling

BMC Plant Biology

ISSN: 1471-2229

COVID-19 Genomic Epidemiology Toolkit

What to know

The Office of Advanced Molecular Detection presents this toolkit to address topics related to the application of genomics to epidemiologic investigations and public health response to SARS-CoV-2. The COVID-19 Genomic Epidemiology Toolkit is meant to further the use of genomics in responding to COVID-19 at the state and local level.

Welcome and Overview

Overview: CDC's Dr. Greg Armstrong gives an introduction to the COVID-19 Genomic Epidemiology Toolkit and describes the role for genome sequencing in public health.

Posted: 01/08/21

Presenter: Gregory L. Armstrong, MD Director, Advanced Molecular Detection Program, CDC

Part 1: Introduction

Modules 1.1 - 1.4.

Module 1.1 – What is Genomic Epidemiology?

Module 1.2 – The SARS-CoV-2 Genome

Module 1.3 – How to read a phylogenetic tree

Module 1.4 – Emerging variants of SARS-CoV-2

Part 2: Case Studies

Modules 2.1 - 2.7.

Module 2.1 – SARS-CoV-2 sequencing in Arizona

Module 2.2 – Healthcare cluster transmission

Module 2.3 – Investigating workplace-community transmission

Module 2.4 – Superspreading event in a pre-symptomatic population

Module 2.5 – Confirming SARS-CoV-2 reinfection with whole genome sequencing

Module 2.6 – Detecting and prioritizing SARS-CoV-2 variants

Module 2.7 – Wastewater-based variant tracking for SARS-CoV-2

Part 3: Implementation

Modules 3.1 - 3.6.

Module 3.1 – Getting started with Nextstrain

Module 3.2 – Getting started with MicrobeTrace

Module 3.3 – Real-time phylogenetics with UShER

Module 3.4 – Walking through Nextstrain trees

Module 3.5 – Public genome repositories for SARS-CoV-2

Module 3.6 – Sequencing strategies for SARS-CoV-2

AMD integrates next-generation genomic sequencing technologies with bioinformatics and epidemiology expertise to help us find, track, and stop pathogens.

‘Out of Africa’ origin of the pantropical staghorn fern genus Platycerium (Polypodiaceae) supported by plastid phylogenomics and biogeographical analysis

Bine Xue, Erfeng Huang, Guohua Zhao, Ran Wei, Zhuqiu Song, Xianchun Zhang, Gang Yao, ‘Out of Africa’ origin of the pantropical staghorn fern genus Platycerium (Polypodiaceae) supported by plastid phylogenomics and biogeographical analysis, Annals of Botany, Volume 133, Issue 5-6, May/June 2024, Pages 697–710, https://doi.org/10.1093/aob/mcae003

The staghorn fern genus Platycerium is one of the most commonly grown ornamental ferns, and it evolved to occupy a typical pantropical intercontinental disjunction. However, species-level relationships in the genus have not been well resolved, and the spatiotemporal evolutionary history of the genus also needs to be explored.

Plastomes of all the 18 Platycerium species were newly sequenced. Using plastome data, we reconstructed the phylogenetic relationships among Polypodiaceae members with a focus on Platycerium species, and further conducted molecular dating and biogeographical analyses of the genus.

The present analyses yielded a robustly supported phylogenetic hypothesis of Platycerium. Molecular dating results showed that Platycerium split from its sister genus Hovenkampia ~35.2 million years ago (Ma) near the Eocene–Oligocene boundary and began to diverge ~26.3 Ma during the late Oligocene, while multiple speciation events within Platycerium occurred during the middle to late Miocene. Biogeographical analysis suggested that Platycerium originated in tropical Africa and then dispersed eastward to southeast Asia–Australasia and westward to neotropical areas.

Our analyses using a plastid phylogenomic approach improved our understanding of the species-level relationships within Platycerium. The global climate changes of both the Late Oligocene Warming and the cooling following the mid-Miocene Climate Optimum may have promoted the speciation of Platycerium, and transoceanic long-distance dispersal is the most plausible explanation for the pantropical distribution of the genus today. Our study investigating the biogeographical history of Platycerium provides a case study not only for the formation of the pantropical intercontinental disjunction of this fern genus but also for the ‘out of Africa’ origin of plant lineages.

  • Online ISSN 1095-8290
  • Print ISSN 0305-7364
  • Copyright © 2024 Annals of Botany Company

The patient was treated with 6 weeks of itraconazole, 100 mg twice daily.

A k-mer analysis was conducted in CLC Genomics Workbench using assembled genomes from 11 isolates recovered from patients in New York City, 8 publicly available Indian T indotineae genomes, and 1 publicly available T interdigitale genome from an isolate originating from Germany. K-mers with a prefix of ATGAC and a length of 16 on either strand were included in the analysis. The scale bar for branch length indicates the level of similarity in k-mer distribution among isolates. These isolates form a cluster distinct from T indotineae isolates from India.
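The k-mer selection described in this legend (k-mers of length 16 beginning with ATGAC, counted on either strand) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the toy sequences are invented, and the Jaccard index is a stand-in similarity measure, since the legend does not specify how CLC Genomics Workbench scores similarity in k-mer distribution.

```python
def revcomp(seq):
    """Reverse complement of a DNA string; non-ACGT bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq))


def prefixed_kmers(seq, prefix="ATGAC", k=16):
    """Set of k-mers of length k that start with `prefix`, on either strand."""
    seq = seq.upper()
    found = set()
    for strand in (seq, revcomp(seq)):
        start = strand.find(prefix)
        while start != -1:
            kmer = strand[start:start + k]
            # Keep only full-length k-mers without ambiguous bases
            if len(kmer) == k and "N" not in kmer:
                found.add(kmer)
            start = strand.find(prefix, start + 1)
    return found


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (assumed stand-in measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy sequences for illustration only (not real Trichophyton genome data).
genome_a = "ATGAC" + "G" * 11 + "TT"
genome_b = revcomp(genome_a)  # same molecule read from the other strand
similarity = jaccard(prefixed_kmers(genome_a), prefixed_kmers(genome_b))
print(similarity)
```

Because k-mers are collected from both strands, a genome and its reverse complement yield the same k-mer set, which is why strand orientation of the assemblies does not affect the comparison.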

A, New York City T indotineae isolate sequencing reads were mapped to the Indian T indotineae reference isolate TIMM20114, and a phylogenetic tree was generated using a maximum likelihood algorithm with a Jukes-Cantor nucleotide substitution model and 1000 bootstrap replicates. Bootstrap values on branches indicate the percentage of bootstrap replicates that support a particular branching pattern. The scale bar shows the number of substitutions per nucleotide site. Terbinafine minimum inhibitory concentration (MIC) values and squalene epoxidase (SQLE) variants for each isolate are indicated by the red and blue bars to the right of the phylogenetic tree, respectively. B, Pairwise single nucleotide variation matrix of New York City T indotineae isolates. Single nucleotide variation values are color-coded along a gradient from low (blue) to high (pink). The outbreak isolates are shown from patients A to K.
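To make the two quantities in this legend concrete, the sketch below computes a pairwise single nucleotide variation count and the Jukes-Cantor distance, d = −(3/4) ln(1 − 4p/3), where p is the proportion of differing sites. It is a minimal illustration, assuming pre-aligned sequences of equal length; the function names and toy sequences are hypothetical, and it does not reproduce the maximum likelihood tree search or bootstrapping used for the figure.

```python
import math


def snv_count(a, b):
    """Number of differing sites between two aligned sequences (ACGT only)."""
    return sum(x != y for x, y in zip(a.upper(), b.upper())
               if x in "ACGT" and y in "ACGT")


def p_distance(a, b):
    """Proportion of differing sites among comparable (ACGT) positions."""
    comparable = sum(1 for x, y in zip(a.upper(), b.upper())
                     if x in "ACGT" and y in "ACGT")
    if comparable == 0:
        raise ValueError("no comparable sites")
    return snv_count(a, b) / comparable


def jukes_cantor(p):
    """Jukes-Cantor (JC69) distance in substitutions per site."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def snv_matrix(seqs):
    """Pairwise SNV counts for a dict of name -> aligned sequence."""
    return {(i, j): snv_count(seqs[i], seqs[j]) for i in seqs for j in seqs}


# Toy aligned sequences for illustration only (not isolate data).
isolates = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "ACGAACGTAT"}
matrix = snv_matrix(isolates)
print(matrix[("A", "C")], jukes_cantor(p_distance(isolates["A"], isolates["C"])))
```

The Jukes-Cantor correction is always at least as large as the raw proportion of differences, because it accounts for multiple substitutions at the same site under the assumption of equal base frequencies and equal substitution rates.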

eMethods. Supplementary Methods

eFigure 1. Phylogenetic Tree of Trichophyton indotineae Isolates Based on ITS Sequence

eFigure 2. Model of Trichophyton indotineae SQLE Enzyme Bound to Terbinafine

eTable 1. GenBank Accession Numbers of Trichophyton Isolates Analyzed in This Study for the Construction of ITS Phylogenetic Tree

eTable 2. New York City Trichophyton indotineae WGS Reads Accession Numbers

eTable 3. Genome Assembly Details

eTable 4. GenBank Accession Numbers for SQLE Coding Sequences of Trichophyton indotineae

Data Sharing Statement

  • Resistant Trichophyton indotineae Dermatophytosis—An Emerging Pandemic, Now in the US JAMA Dermatology Editorial May 15, 2024 Toan S. Bui, BS; Kenneth A. Katz, MD, MSc, MSCE

Caplan AS , Todd GC , Zhu Y, et al. Clinical Course, Antifungal Susceptibility, and Genomic Sequencing of Trichophyton indotineae . JAMA Dermatol. Published online May 15, 2024. doi:10.1001/jamadermatol.2024.1126

© 2024

Clinical Course, Antifungal Susceptibility, and Genomic Sequencing of Trichophyton indotineae

  • 1 The Ronald O. Perelman Department of Dermatology, NYU Grossman School of Medicine, New York, New York
  • 2 Dermatology Service, Bellevue Hospital Center, New York, New York
  • 3 Wadsworth Center Mycology Laboratory, New York State Department of Health, Albany
  • 4 SUNY Downstate Health Sciences University, Department of Dermatology, Brooklyn, New York
  • 5 Department of Dermatology, Weill Cornell Medicine, New York, New York
  • 6 Division of Infectious Diseases, Department of Pediatrics, Weill Cornell Medicine, New York, New York
  • 7 Division of Infectious Diseases, Department of Medicine, Weill Cornell Medicine, New York, New York
  • 8 Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, New York
  • 9 NYC Health + Hospitals/Lincoln Medical Center, Department of Dermatology, Bronx, New York
  • 10 Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, Georgia
  • 11 New York City Department of Health and Mental Hygiene, New York, New York
  • 12 Division of Translational Medicine, Wadsworth Center, New York State Department of Health, Albany
  • 13 Department of Biomedical Sciences, School of Public Health, University at Albany, Albany, New York

Question   What are the clinical features, antifungal susceptibility profiles, and genomic characteristics of Trichophyton indotineae infections?

Findings   In this case series of 11 patients in New York City, severe disease, ineffective standard antifungal treatments, and diagnostic delays were common. Squalene epoxidase sequence variations L393S and F397L, and terbinafine minimum inhibitory concentration values of 0.5 μg/mL or higher were associated with terbinafine therapy failure, with US isolates showing differences from known Indian isolates.

Meaning   The manifestation of T indotineae involves extensive and recalcitrant infections, often resistant to standard terbinafine therapy, while analysis of travel history and isolate relatedness suggests a probable origin of these infections in Bangladesh.

Importance   Trichophyton indotineae is an emerging dermatophyte causing outbreaks of extensive tinea infections often unresponsive to terbinafine. This species has been detected worldwide and in multiple US states, yet detailed US data on T indotineae infections are sparse; such data could improve treatment practices and medical understanding of transmission.

Objective   To correlate clinical features of T indotineae infections with in vitro antifungal susceptibility testing results, squalene epoxidase gene sequence variations, and isolate relatedness using whole-genome sequencing.

Design, Setting, and Participants   This retrospective cohort study of patients with T indotineae infections in New York City spanned May 2022 to May 2023. Patients with confirmed T indotineae infections were recruited from 6 New York City medical centers.

Main Outcome and Measure   Improvement or resolution at the last follow-up assessment.

Results   Among 11 patients with T indotineae (6 male and 5 female patients; median [range] age, 39 [10-65] years), 2 were pregnant; 1 had lymphoma; and the remainder were immunocompetent. Nine patients reported previous travel to Bangladesh. All had widespread lesions with variable scale and inflammation, topical antifungal monotherapy failure, and diagnostic delays (range, 3-42 months). Terbinafine treatment failed in 7 patients at standard doses (250 mg daily) for prolonged duration; these patients also had isolates with amino acid substitutions at positions 393 (L393S) or 397 (F397L) in squalene epoxidase that correlated with elevated terbinafine minimum inhibitory concentrations of 0.5 μg/mL or higher. Patients who were treated with fluconazole and griseofulvin improved in 2 of 4 and 2 of 5 instances, respectively, without correlation between outcomes and antifungal minimum inhibitory concentrations. Furthermore, 5 of 7 patients treated with itraconazole cleared or had improvement at the last follow-up, and 2 of 7 were lost to follow-up or stopped treatment. Based on whole-genome sequencing analysis, US isolates formed a cluster distinct from Indian isolates.

Conclusion and Relevance   The results of this case series suggest that disease severity, diagnostic delays, and lack of response to typically used doses and durations of antifungals for tinea were common in this primarily immunocompetent patient cohort with T indotineae , consistent with published data. Itraconazole was generally effective, and the acquisition of infection was likely in Bangladesh.

Dermatophytosis is a common, contagious superficial skin, hair, or nail infection caused by dermatophyte fungi, 1 , 2 most commonly by members of the genus Trichophyton . 2 - 4 Skin infections are often mild and resolve with topical antifungals. Systemic antifungals, such as terbinafine, a first-line oral antifungal that inhibits the enzyme squalene epoxidase (SQLE), are used to treat refractory or extensive dermatophytosis or infections involving hair follicles or nails.

In the past decade, dermatophytes failing typical doses and durations of antifungal therapy have become a global public health concern, illustrated by major outbreaks of severe, recalcitrant dermatophytosis in South Asia. 5 - 9 These outbreaks have been driven by the emergence of a recently described species, Trichophyton indotineae , formerly Trichophyton mentagrophytes genotype VIII. 6 T indotineae causes extensive, pruritic plaques, typically on the trunk, extremities, and groin, 10 which are often minimally inflammatory, do not resolve with topical antifungals alone, and typically fail oral terbinafine treatment 7 , 11 at doses and durations used for tinea infections. Clinical failure with other antifungals, including azoles and griseofulvin, is also reported. 11 Itraconazole is currently recommended for patients who do not respond to terbinafine, but high doses and long treatment durations (>6 weeks) are required to clear T indotineae infections, and relapse has been reported. 12 , 13

Published data demonstrate that T indotineae isolates displaying terbinafine minimum inhibitory concentration (MIC) values of 0.5 μg/mL or higher are associated with treatment failure at standard doses and durations of therapy, and they harbor specific sequence variations in SQLE, including L393S and F397L. 9 , 14 - 16 Daily doses of terbinafine at 500 mg per day may overcome mildly elevated terbinafine MICs, 17 , 18 yet response rates vary. 19 Isolates responsive to terbinafine therapy are correlated with SQLE alterations at position 448 (A448T). 9 , 16 However, no established clinical breakpoints for antifungal medications exist for dermatophytes, and in vitro antifungal susceptibility testing (AFST) does not necessarily correspond to clinical response, creating additional treatment challenges.

In May 2023, dermatologists and public health officials reported the first 2 confirmed T indotineae cases in the US. 20 Subsequently, a retrospective review of dermatophyte AFST data identified T indotineae in several US states, with the earliest confirmed isolate from 2017. 4 Despite increased US spread, cases are likely underrecognized due to lack of awareness. Furthermore, T indotineae is a member of the T mentagrophytes species complex and cannot be morphologically differentiated from other members of the complex. As such, T indotineae is frequently misidentified as T mentagrophytes or T interdigitale . 6 , 20 Molecular methods, such as DNA sequencing, are required to accurately identify T indotineae . 6 , 20 Due to these challenges, the prevalence of T indotineae in the US is unknown, and published US cases lack data linking the clinical course to AFST. 4 , 20 To inform approaches to diagnosis, treatment, and outcomes, we describe what is, to our knowledge, the largest cohort of US patients with confirmed T indotineae infection to date, correlating epidemiologic and clinical features of cases with AFST results, SQLE sequence information, and modeling of terbinafine binding to SQLE. Finally, to improve our understanding of the emergence of T indotineae in the US, we applied whole-genome sequencing (WGS) to determine the relatedness of the recovered isolates to Indian isolates with available WGS data.

This case series includes all T indotineae isolates identified at the Wadsworth Center, New York State Department of Health, between May 2022 and May 2023. Cases were associated with 6 New York City medical centers and confirmed from testing of laboratory samples. Data were collected on patient demographics, exposure characteristics, underlying conditions, health care utilization, and treatment outcomes using a standardized case report form created by public health authorities and distributed to clinicians after T indotineae laboratory confirmation. Dermatophytes other than T indotineae were excluded. This study encompassed public health surveillance activities and is exempt according to New York State Department of Health Institutional Review Board guidelines. Informed consent was waived because the data were deidentified. Laboratory methods are described briefly herein. Additional details are available in the eMethods in Supplement 1 . The Strengthening the Reporting of Observational Studies in Epidemiology ( STROBE ) reporting guideline was followed.

T indotineae identification was determined by DNA sequencing of the internal transcribed spacer (ITS) region. Genomic DNA from each suspected T indotineae isolate was extracted, followed by polymerase chain reaction amplification of the ITS region, DNA sequencing of the polymerase chain reaction product, and analysis of the sequence data using the Basic Local Alignment Search Tool. ITS sequences were submitted to GenBank with accession numbers OR483778 to OR483790 (eTable 1 and eFigure 1 in Supplement 1 ).

AFST was performed by a broth microdilution assay according to Reference Method M27-A3 of the Clinical and Laboratory Standards Institute guidance. 21 MICs were determined after 96 hours of incubation at 35 °C.
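As a rough illustration of how an MIC is read from a broth microdilution series, the sketch below reports the lowest concentration at which growth stays at or below a cutoff, requiring that every higher concentration is also inhibited. The 20% growth cutoff, the function name `read_mic`, and the example well values are assumptions made for illustration; they are not taken from the study or from the Clinical and Laboratory Standards Institute document, whose reading endpoints depend on the drug.

```python
def read_mic(wells, cutoff=0.2):
    """Return the MIC from a dilution series, or None if no well is inhibited.

    wells  : dict mapping drug concentration (ug/mL) to growth relative
             to the drug-free control (0.0 = no growth, 1.0 = full growth).
    cutoff : growth fraction at or below which a well counts as inhibited
             (hypothetical value, chosen only for this example).
    """
    mic = None
    # Walk from the highest concentration downward and stop at the first
    # well that shows growth above the cutoff.
    for conc in sorted(wells, reverse=True):
        if wells[conc] <= cutoff:
            mic = conc
        else:
            break
    return mic


# Hypothetical twofold dilution series (ug/mL) with made-up growth readings.
example_wells = {128: 0.0, 64: 0.0, 32: 0.05, 16: 0.6, 8: 0.9}
example_mic = read_mic(example_wells)
```

If the well at the highest concentration still shows growth, the function returns None, which corresponds to reporting the MIC as greater than the top concentration tested (for example, >128 μg/mL).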

To determine relatedness among T indotineae isolates, T indotineae genomes were sequenced, and raw sequencing reads were deposited in the Sequence Read Archive with accession numbers SRR27198731 to SRR27198741 (eTable 2 in Supplement 1 ). Bioinformatic analyses were performed in CLC Genomics Workbench using CLC Microbial Genomics Module software (version 23.0.4; Qiagen, Inc). Briefly, whole genome assemblies were prepared de novo for each isolate and compared with all assembled T indotineae genomes available in GenBank (eTable 3 in Supplement 1 ). Sequencing reads from T indotineae isolates in the US were mapped to the reference T indotineae isolate, TIMM20114, from India. 22 Subsequently, single nucleotide variations (SNVs) were detected and used to construct a phylogenetic tree. From each isolate’s DNA sequence, the protein sequence of SQLE was determined and variants were identified.
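The SNV comparison described above can be sketched as a pairwise count of positions at which two isolates differ after mapping to the same reference. This is a minimal illustration only: the analysis in the study was run in CLC Genomics Workbench, and the data layout (a dictionary of reference position to called base), the function names, and the example calls below are all hypothetical.

```python
from itertools import combinations


def snv_distance(calls_a, calls_b):
    """Count positions where two isolates differ.

    Each argument maps a reference position to the variant base called
    there; a position missing from a dict is treated as matching the
    reference.
    """
    positions = set(calls_a) | set(calls_b)
    return sum(calls_a.get(pos) != calls_b.get(pos) for pos in positions)


def pairwise_snv_distances(isolates):
    """Return {(name_a, name_b): SNV difference} for every pair of isolates."""
    return {
        (a, b): snv_distance(isolates[a], isolates[b])
        for a, b in combinations(sorted(isolates), 2)
    }


# Made-up variant calls for three isolates against one reference.
isolates = {
    "iso1": {101: "T", 2050: "G"},
    "iso2": {101: "T", 2050: "G"},
    "iso3": {101: "T", 3300: "A", 4100: "C"},
}
distances = pairwise_snv_distances(isolates)
```

A distance of 0 between two isolates corresponds to the "identical isolates" case reported for household contacts; a distance matrix like this is the usual input to a distance-based tree-building method.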

To characterize the molecular mechanism underpinning terbinafine resistance, an AlphaFold 23 model for T mentagrophytes SQLE was used to generate a model of T indotineae SQLE using template-based homology modeling with Swiss-Modeler. 24 The structure for terbinafine was obtained from its 3D model in PubChem. Terbinafine docking was performed using QuickVina2 25 with the AutoDock Vina scoring function. 26 To obtain an ideal starting location for terbinafine bound to SQLE, terbinafine was aligned to 2 previously solved human SQLE protein crystal structures with bound inhibitors (Protein Data Bank identifier: 6C6N and 6C6P). ChimeraX, version 1.4, 27 was used to generate molecular figures, and Gimp (version 2.10) was used to make composite figures.

Between May 2022 and May 2023, T indotineae isolates from 11 unique patients from 6 different New York City medical centers were confirmed at the Wadsworth Center, New York State Department of Health ( Table 1 ). AFST was performed against a panel of antifungal agents ( Table 2 ). The median (range) patient age was 39 (10-65) years with 6 male and 5 female patients. Two patients (patients B and E) were pregnant. One patient had untreated lymphoma (patient H); the remainder had no immunocompromising conditions. All but 2 patients (patients B and C) reported travel in Bangladesh before acquiring a T indotineae infection; patient B had no travel history or known contact with an infected person, and exposure characteristic data were missing for patient C. Household transmission was considered highly likely in 3 instances and possible in another instance.

The median (range) time from tinea onset to diagnosis was 10 (3 [patient H] to 42 [patient K]) months. All patients had lesions affecting multiple body sites, most commonly the trunk, extremities, buttocks, and groin; 2 patients (patients F and G) had facial involvement. Nine patients reported pruritus. Six patients received medium- to high-potency topical corticosteroids before tinea diagnosis; 2 patients received the topical corticosteroid as part of combination corticosteroid-antifungal creams obtained in Bangladesh. All patients received at least 1 topical antifungal medication, none of which was effective as monotherapy. Duration of topical and oral antifungal therapy and adherence to therapy abroad before presentation were unknown. Ten of the patients received multiple topical antifungal agents. All patients demonstrated incomplete responses to typical doses and duration of oral antifungal therapy for tinea corporis/cruris ( Figure 1 ).

Seven patients (patients A, B, D, F, G, J, and K) received a course of terbinafine without resolution of infection. Where known, the terbinafine dose was 250 mg daily, and treatment durations ranged from 14 days (patients B, J, and K) to 28 days (patients D and F) to 42 days (patient A). The dose and duration of terbinafine for 1 patient (patient G) were unknown. Seven patients’ isolates had elevated terbinafine MIC values (range, 0.5 to >128 μg/mL). AFST data and terbinafine MIC values were unavailable for patient A, as this patient’s isolate was not saved. Three patients (patients C, E, and I) who did not receive oral terbinafine treatment had isolates with terbinafine MIC values of 0.0039 μg/mL or lower.

Four patients (patients A, E, I, and J) received fluconazole treatment. Two patients (patients A and E) had a resolution while taking fluconazole; fluconazole AFST data were unavailable for patient A. The fluconazole MIC value for the isolate recovered from patient E was 32 μg/mL. Patient A received fluconazole, 150 mg, weekly for 4 weeks, and patient E received fluconazole, 200 mg, weekly for 12 weeks. Fluconazole therapy failed in 2 patients (patients I and J), with fluconazole MIC values of 32 μg/mL and 16 μg/mL, respectively. The fluconazole dose for patient I was unobtainable. Patient J received fluconazole, 200 mg, weekly for 4 weeks, without resolution.

Five patients (patients A, C, D, G, and J) were treated with griseofulvin. Patient D’s infection resolved after an 8-week course of ultramicrosize griseofulvin (375 mg/d); the griseofulvin MIC value for this patient’s isolate was 4 μg/mL. Patient J also showed improvement while taking griseofulvin after 2 months of therapy, with a MIC value of 2 μg/mL. Among the patients (patients A, C, and G) who did not respond to griseofulvin, the griseofulvin MIC values were all 4 μg/mL. Patients A and C received griseofulvin for 1 and 2 months, respectively.

Seven patients (patients B, C, F, G, H, I, and K) were treated with itraconazole, 3 of whom (patients B, F, and I), had resolution after treatment with itraconazole (dose and duration in Table 1 ). Two patients (patients G and H) were improving at the last known follow-up. Patient C was lost to follow-up after starting itraconazole. Patient K stopped itraconazole due to gastrointestinal adverse effects. Of the 7 patients treated with itraconazole, all had itraconazole MIC values of 0.5 μg/mL or lower. Following treatment and resolution of infection, 1 patient (patient I) developed acute urticaria and dermatographism. One patient (patient J) had contraindications to itraconazole, precluding its use. No other complications of therapy were reported. No patient received voriconazole in the US. Two patients (patients F and G) reported receiving voriconazole in Bangladesh, but the dose and duration of therapy are unknown.

All 11 T indotineae isolates were evaluated for SQLE variants ( Table 2 ). Five isolates harbored the point variant F397L (recovered from patients A, F, G, J, and K) and exhibited elevated terbinafine MIC values (32 to >128 μg/mL). Three isolates (recovered from patients B, D, and H) had a point variant at position 393 (L393S), corresponding to terbinafine MIC values of 0.5 to 1 μg/mL. Three isolates (recovered from patients C, E, and I) harbored a change at position 448 (A448T) and demonstrated low terbinafine MIC values, 0.0039 μg/mL or lower ( Table 2 ).
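The variant groupings reported above can be collected into a small lookup for quick cross-referencing. The patient letters, variant names, and the 0.5 μg/mL threshold come from the text; the variable and function names are hypothetical, and the threshold is the association reported here rather than an established clinical breakpoint.

```python
# Terbinafine MIC (ug/mL) at or above which treatment failure was
# reported in the text; not an established clinical breakpoint.
ELEVATED_TERBINAFINE_MIC = 0.5

# SQLE variant -> patients whose isolates carried it, as listed in the text.
PATIENTS_BY_SQLE_VARIANT = {
    "F397L": ["A", "F", "G", "J", "K"],
    "L393S": ["B", "D", "H"],
    "A448T": ["C", "E", "I"],
}

# Variants located in the modeled terbinafine binding pocket.
RESISTANCE_ASSOCIATED = {"F397L", "L393S"}

# Invert the mapping so each patient can be looked up directly.
SQLE_VARIANT_BY_PATIENT = {
    patient: variant
    for variant, patients in PATIENTS_BY_SQLE_VARIANT.items()
    for patient in patients
}


def is_elevated(mic):
    """True if a terbinafine MIC (ug/mL) meets the reported threshold."""
    return mic >= ELEVATED_TERBINAFINE_MIC


def has_resistance_associated_variant(patient):
    """True if the patient's isolate carried L393S or F397L."""
    return SQLE_VARIANT_BY_PATIENT[patient] in RESISTANCE_ASSOCIATED


resistant_patients = sorted(
    p for p in SQLE_VARIANT_BY_PATIENT if has_resistance_associated_variant(p)
)
```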

A phylogenetic k-mer analysis revealed that the US isolates formed a cluster distinct from Indian isolates ( Figure 2 ). When mapped to the most closely related Indian isolate (TIMM20114 22 ), SNV analysis of the US isolates ( Figure 3 A) revealed that 2 patients who lived in the same household and reported previous travel to Bangladesh had identical isolates (no difference in SNVs). Two patients who lived in the same household but separately traveled to Bangladesh had very closely related isolates (a difference of 2 SNVs). The other isolates had between 32 and 373 SNVs ( Figure 3 B).
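To give a sense of what a k-mer comparison measures, the following sketch computes a Jaccard distance between the k-mer sets of two sequences. This is one common alignment-free distance and is offered only as an illustration; the exact k-mer method used by the CLC Microbial Genomics Module is not described in the text, and the function names, the default k, and the toy sequences are assumptions.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """1 minus (shared k-mers / all k-mers) for two sequences.

    Returns 0.0 for sequences with identical k-mer content and 1.0 for
    sequences that share no k-mer.
    """
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:  # both sequences shorter than k
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


# Toy sequences, far shorter than any real genome assembly.
same = jaccard_distance("ACGTACGTAC", "ACGTACGTAC")
disjoint = jaccard_distance("AAAAAA", "CCCCCC", k=2)
```

Isolates that cluster together in a k-mer tree are those with small pairwise distances of this kind.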

To obtain a mechanistic understanding of how SQLE variants might lead to decreased terbinafine susceptibility, an AlphaFold model of T mentagrophytes SQLE was used to model the terbinafine binding site in T indotineae SQLE using template-based homology modeling with Swiss-Modeler. The model revealed that residues L393 and F397, but not A448, form part of a hydrophobic pocket that accommodates the naphthalene moiety of terbinafine (eFigure 2 in Supplement 1 ).

To our knowledge, this case series is the largest US cohort describing T indotineae infections and highlights several important points. Patients experienced extensive, prolonged pruritic lesions that generally failed monotherapy with topical antifungals and showed inadequate response to typical doses and durations of oral antifungal medications, including prolonged terbinafine therapy at standard doses, consistent with findings from international reports. 5 , 9 , 15 , 16 , 28 Diagnostic delays were common, and most patients did not have immunocompromising conditions that might predispose them to severe dermatophytosis. These findings, in addition to others (eg, recent travel to a high-prevalence region and/or initial laboratory culture identification of T mentagrophytes or T interdigitale in patients with compatible physical examination findings and history) should prompt consideration of T indotineae , which requires molecular-based techniques for definitive diagnosis. Because T indotineae identification was retrospective in most cases described in this report, treating dermatologists were often unaware of T indotineae diagnosis at the time of treatment, likely leading to the selection of ineffective or suboptimal antifungal treatment. Data highlight that 100-mg or 200-mg daily dosing of itraconazole for 6 to 8 weeks is the current preferred therapy, yet longer durations and higher doses may be required with reported relapse occurring. 11 , 12 Terbinafine at higher than standard doses (500 mg daily) may be effective for some patients. 17 , 19 Griseofulvin and fluconazole show limited efficacy. 7 , 13

Consistent with published literature, 9 , 15 , 16 , 29 T indotineae isolates recovered from patients in this study whose infections were unresponsive to terbinafine at standard doses and durations harbored variants at either position L393 (L393S; MIC values of 0.5 to 1.0 μg/mL) or position F397 (F397L; MIC values of 32 to >128 μg/mL). Three patients (patients B, D, and H) who had isolates with terbinafine MIC values of 0.5 to 1 μg/mL received terbinafine at 250 mg dosing; whether a 500-mg daily terbinafine dose would have achieved cure in these patients, as has been reported elsewhere, is unknown. 17 Only 3 of 11 patients (patients C, E, and I) had isolates with terbinafine MIC values in ranges reported to correlate with terbinafine susceptibility (≤0.0039 μg/mL); however, the clinical response is unknown as none was treated with terbinafine. These 3 patient isolates harbored the A448T variant, consistent with published literature. 9 , 30

Itraconazole therapy did not fail in any patients in this series, yet prolonged treatment durations were required to achieve a cure. Only 1 isolate in this series had an elevated itraconazole MIC value (patient G, 0.5 μg/mL). This patient was improving while taking itraconazole but was lost to follow-up. Isolates from 3 patients (patients C, I, and K) displayed itraconazole MIC values of 0.12 μg/mL; the infection resolved successfully for 1 patient (patient I). Patient C was lost to follow-up; patient K had itraconazole-associated gastrointestinal adverse effects and stopped therapy. In 1 large study of Indian isolates, the presence of the SQLE variant A448T was associated with elevated voriconazole and itraconazole MIC values 15 ; however, this finding may have been due to chance and requires further investigation. In the data herein, of the 3 patients whose T indotineae isolates harbored the SQLE A448T variant (patients C, E, and I), all had voriconazole MIC values of 0.25 to 4 μg/mL, but none were treated with voriconazole and thus clinical responsiveness could not be determined. The third patient (patient E) with SQLE A448T had an itraconazole MIC of 0.25 μg/mL but was not treated with itraconazole, so clinical responsiveness could not be determined.

Patient response to other antifungals did not correlate with antifungal MIC values. Only 2 of 5 patients treated with griseofulvin exhibited improvement, despite the low MIC values for all isolates (2-4 μg/mL). Notably, griseofulvin is not available for use in Bangladesh, presumably the source of acquisition for most patients in this report, which precludes comparison of clinical efficacy. 13 , 31 Similarly, of 4 patients treated with fluconazole, 2 (patients A and E) were successfully cured while 2 (patients I and J) were not, despite isolates from these patients having MIC values of 16 to 32 μg/mL.

The phylogenetic k-mer tree revealed that the 11 T indotineae isolates grouped together and were distinct from T indotineae isolates from India. Except for the present study, no T indotineae WGS data originating from countries outside of India were available in GenBank at the time of acceptance of this article. Based on the recent travel history of 9 of 11 patients ( Table 1 ), the US T indotineae isolates likely originated in Bangladesh, suggesting that T indotineae may be endemic in Bangladesh. T indotineae easily transmits from person to person, and WGS data revealed that 2 patients had identical isolates, suggesting household transmission or acquisition from the same source, while 2 separate patients had very closely related isolates (only 2 SNVs difference between isolates), suggesting either transmission among household contacts or acquisition from a similar source. The other 7 isolates had 32 to 373 SNVs, which could indicate several independent introductions of T indotineae or variants of T indotineae isolates within New York City.

Based on the homology modeling of T indotineae SQLE, A448 resides outside of the terbinafine binding pocket, consistent with the inability of the A448T variant to change terbinafine susceptibility. 9 , 16 In contrast, residues L393 and F397 form part of the hydrophobic binding site for the naphthalene moiety of terbinafine. Variants in these residues result in higher terbinafine MIC values, 9 , 14 - 16 likely due to disruption of terbinafine binding. The importance of an appropriately sized hydrophobic binding pocket is supported by a study demonstrating that terbinafine is a weak, partial inhibitor of human SQLE, and modeling of terbinafine in human SQLE revealed that residues I197 and L324 did not leave sufficient room for the bulky naphthalene group of terbinafine. 32 In Trichophyton species, these residues are valines with smaller hydrophobic side chains that can likely better accommodate terbinafine binding, resulting in the terbinafine susceptible phenotype. These findings 32 along with the T indotineae SQLE model with L393S and F397L in the terbinafine binding site support the notion that even minor disruptions to the binding pocket’s size and hydrophobicity may lead to terbinafine resistance.

This study is limited by the small sample size and inclusion of patients from only 1 region of the country. Furthermore, there are few published data on T indotineae in Bangladesh, the presumed source of infection for most of these patients, limiting the potential for comparative analyses. Additionally, this study is limited by the lack of treatment details for some patients before their presentation in New York City. Further studies are needed that encompass other geographic regions in the US, expand on clinical, epidemiologic, molecular, and mycologic information, and prospectively follow up patients with confirmed T indotineae infection.

In this case series, we describe the largest US-based cohort to date to our knowledge of patients with T indotineae infection, highlighting the importance of prompt clinical recognition and the need for molecular testing to accurately identify T indotineae . Additionally, we review current preferred treatment approaches and correlate AFST data with T indotineae SQLE variants. Given the global spread of T indotineae , a strong international collaborative effort among clinicians, public health professionals, and clinical microbiologists is needed to characterize T indotineae pathophysiology and transmissibility, promote the judicious use of antifungal medications, develop standardized treatment algorithms, and monitor and mitigate the spread of emerging resistance in dermatophyte infections.

Accepted for Publication: March 18, 2024.

Published Online: May 15, 2024. doi:10.1001/jamadermatol.2024.1126

Open Access: This is an open access article distributed under the terms of the CC-BY License . © 2024 Caplan AS et al. JAMA Dermatology .

Corresponding Author: Sudha Chaturvedi, PhD, Mycology Laboratory, Wadsworth Center, New York State Department of Health, 120 New Scotland Ave, Albany, NY 12208 ( [email protected] ).

Author Contributions: Dr Chaturvedi had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Drs Caplan and Todd contributed equally as co–first authors.

Concept and design: Caplan, Jakus, Lipner, Fonseca, Cline, Gold, Lockhart, Chiller, Chaturvedi.

Acquisition, analysis, or interpretation of data: Caplan, Todd, Zhu, Sikora, Akoh, Jakus, Lipner, Babbush, Acker, Morales, Marrero Rolon, Westblade, Fonseca, Smith, Greendyke, Manjari, Banavali, Chaturvedi.

Drafting of the manuscript: Caplan, Todd, Lipner, Morales, Cline, Gold, Manjari, Banavali, Chaturvedi.

Critical review of the manuscript for important intellectual content: Caplan, Zhu, Sikora, Akoh, Jakus, Lipner, Babbush, Acker, Marrero Rolon, Westblade, Fonseca, Lockhart, Smith, Chiller, Greendyke, Chaturvedi.

Administrative, technical, or material support: Caplan, Zhu, Akoh, Jakus, Babbush, Gold, Lockhart, Greendyke, Banavali.

Conflict of Interest Disclosures: Dr Westblade reported grants from bioMerieux, Inc, and Hardy Diagnostics outside the submitted work. No other disclosures were reported.

Funding/Support: This work was supported in part by the Centers for Disease Control and Prevention Antibiotic Resistance Laboratory Network (grant NU50CK000516) and the Clinical Laboratory Reference System, Wadsworth Center, New York State Department of Health.

Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: The data presented in this article are the authors’ own and do not reflect the view of the Wadsworth Center, New York State Department of Health, Centers for Disease Control and Prevention, NYU Grossman School of Medicine, New York, Bellevue Hospital Center, New York, Weill Cornell Medicine, New York, New York City Department of Health and Mental Hygiene, New York.

Data Sharing Statement: See Supplement 2 .

Additional Contributions : We thank the Wadsworth Center Advanced Genomic Technologies Core for DNA sequencing and the Media and Tissue Culture Core for preparing various reagents and media for the study. We also thank the health care staff and patients who participated in this study. We thank Lisa A. Biega, MS, Director’s Office, Wadsworth Center, New York State Department of Health, for agreeing to review the manuscript critically. We thank the patient in Figure 1, Avrom S. Caplan, MD, George A. Zakhem, MD, MBA, Jadesola Olayinka, MD, and Miriam K. Pomeranz, MD, for the use of clinical pictures in Figure 1.


Genomic analysis and phylogenetic characterization of Himalayan snow trout, Schizothorax esocinus based on mitochondrial protein-coding genes

  • Original Article
  • Published: 15 May 2024
  • Volume 51 , article number  659 , ( 2024 )


  • G. Akhter   ORCID: orcid.org/0000-0002-6255-8988 1 ,
  • I. Ahmed   ORCID: orcid.org/0000-0003-4821-1707 1 &
  • S. M. Ahmad   ORCID: orcid.org/0000-0001-6036-6467 2  

Mitochondrial DNA (mtDNA) has become a significant tool for exploring genetic diversity and delineating evolutionary links across diverse taxa. Within the group of cold-water fish species that are native to the Indian Himalayan region, Schizothorax esocinus holds particular importance due to its ecological significance and is potentially vulnerable to environmental changes. This research aims to clarify the phylogenetic relationships within the Schizothorax genus by utilizing mitochondrial protein-coding genes.

Standard protocols were followed for the isolation of DNA from S. esocinus . For the amplification of mtDNA, overlapping primers were used, and then subsequent sequencing was performed. The genetic features were investigated by the application of bioinformatic approaches. These approaches covered the evaluation of nucleotide composition, codon usage, selective pressure using nonsynonymous substitution /synonymous substitution (Ka/Ks) ratios, and phylogenetic analysis.
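The nucleotide composition measures mentioned above are usually computed with the standard definitions AT-skew = (A − T)/(A + T) and GC-skew = (G − C)/(G + C). The sketch below applies those definitions to a single sequence; the function name and the toy sequence are hypothetical, and the source does not state which software the authors used for this step.

```python
from collections import Counter


def composition_stats(seq):
    """Return A+T content, AT-skew and GC-skew for a nucleotide sequence.

    Only A, C, G and T are counted; ambiguous bases are ignored.
    """
    counts = Counter(seq.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    total = a + c + g + t
    return {
        "at_content": (a + t) / total if total else 0.0,
        "at_skew": (a - t) / (a + t) if (a + t) else 0.0,
        "gc_skew": (g - c) / (g + c) if (g + c) else 0.0,
    }


# Toy sequence: 3 A, 1 T, 1 G, 3 C.
stats = composition_stats("AAATGCCC")
```

A positive AT-skew means A outnumbers T, and a negative GC-skew means C outnumbers G on the strand analyzed.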

The study specifically examined the 13 protein-coding genes of Schizothorax species, which belong to the Schizothoracinae subfamily. Nucleotide composition analysis showed a bias towards A + T content, consistent with other cyprinid fish species, suggesting evolutionary conservation. Relative Synonymous Codon Usage analysis highlighted leucine as the most frequently encoded amino acid (5.18%) and cysteine as the least frequent (0.78%). The positive AT-skew and the predominantly negative GC-skew indicated that A was more abundant than T and C more abundant than G. Comparative analysis revealed significant conservation of amino acids in multiple genes. The majority of amino acids were hydrophobic rather than polar. Purifying selection was indicated by the genetic distance and Ka/Ks ratios. Phylogenetic analysis revealed a significant genetic divergence between S. esocinus and other Schizothorax species, with interspecific K2P distances ranging from 0.00 to 8.87% and an average of 5.76%.
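For readers unfamiliar with the K2P (Kimura 2-parameter) distance quoted above, it is defined as d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of sites showing transitions and transversions between two aligned sequences. The sketch below is a minimal implementation of that formula, assuming pre-aligned sequences of equal length; the function name and toy sequences are hypothetical and this is not the authors' pipeline.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        if same_class:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / sites, transversions / sites
    arg = (1 - 2 * p - q) * math.sqrt(1 - 2 * q) if q < 0.5 else 0.0
    if arg <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(arg) + 0.0  # + 0.0 normalizes -0.0 to 0.0


# One transition (A -> G) in ten sites: P = 0.1, Q = 0.
example_distance = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

Distances are returned as substitutions per site, so a value of 0.0576 corresponds to the 5.76% average reported above.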

The present study provides significant contributions to the understanding of mitochondrial genome diversity and genetic evolution mechanisms in Schizothoracinae, hence offering vital insights for the development of conservation initiatives aimed at protecting freshwater fish species.


Data availability

The data that support the findings of this study are openly available in GenBank via Bankit: http://www.ncbi.nlm.nih.gov/BankIt/ .

Yin F, Cadenas E (2015) Mitochondria: the cellular hub of the dynamic coordinated network. Antioxid Redox Signal 22:961–964. https://doi.org/10.1089/ars.2015.6313

Article   CAS   PubMed   PubMed Central   Google Scholar  

Smith DR, Keeling PJ (2015) Mitochondrial and plastid genome architecture: reoccurring themes, but significant differences at the extremes. Proc Natl Acad Sci U S A 112:10177–10184. https://doi.org/10.1073/pnas.1422049112

Bucklin A, Steinke D, Blanco-Bercial L (2011) DNA barcoding of marine metazoa. Annu Rev Mar Sci 3:471–508. https://doi.org/10.1146/annurev-marine-120308-080950

Article   Google Scholar  

Bernt M, Braband A, Schierwater B, Stadler PF (2013) Genetic aspects of mitochondrial genome evolution. Mol Phylogenet Evol 69:328–338. https://doi.org/10.1016/j.ympev.2012.10.020

Article   CAS   PubMed   Google Scholar  

Anderson S, Bankier AT, Barrell BG et al (1981) Sequence and organization of the human mitochondrial genome. Nature 290:457–465. https://doi.org/10.1038/290457a0

Satoh TP, Miya M, Mabuchi K, Nishida M (2016) Structure and variation of the mitochondrial genome of fishes. BMC Genomics. https://doi.org/10.1186/s12864-016-3054-y

Article   PubMed   PubMed Central   Google Scholar  

Inoue JG, Miya M, Tsukamoto K, Nishida M (2003) Basal actinopterygian relationships: a mitogenomic perspective on the phylogeny of the “ancient fish.” Mol Phylogenet Evol 26:110–120. https://doi.org/10.1016/S1055-7903(02)00331-7

Peng Z, Wang J, He S (2006) The complete mitochondrial genome of the helmet catfish Cranoglanis bouderius (Siluriformes: Cranoglanididae) and the phylogeny of otophysan fishes. Gene 376:290–297. https://doi.org/10.1016/j.gene.2006.04.014

Domingues VS, Santos RS, Brito A, Alexandrou M, Almada VC (2007) Mitochondrial and nuclear markers reveal isolation by distance and effects of Pleistocene glaciations in the northeastern Atlantic and Mediterranean populations of the white seabream (Diplodus sargus, L.). J Exp Mar Bio Ecol 346:102–113. https://doi.org/10.1016/j.jembe.2007.03.002

Article   CAS   Google Scholar  

Wu X, Wang L, Chen S, Zan R, Xiao H, Zhang YP (2010) The complete mitochondrial genomes of two species from Sinocyclocheilus (Cypriniformes: Cyprinidae) and a phylogenetic analysis within Cyprininae. Mol Biol Rep 37:2163–2171. https://doi.org/10.1007/s11033-009-9689-x

Zhang X, Yue B, Jiang W, Song Z (2009) The complete mitochondrial genome of rock carp Procypris rabaudi (Cypriniformes: Cyprinidae) and phylogenetic implications. Mol Biol Rep 36:981–991. https://doi.org/10.1007/s11033-008-9271-y

Kong XH, Wang XZ, Gan XN, Li JB, He SP (2007) Phylogenetic relationships of Cyprinidae (Teleostei: Cypriniformes) inferred from the partial S6K1 gene sequences and implication of indel sites in intron 1. Sci China, Ser C Life Sci 50:780–788. https://doi.org/10.1007/s11427-007-0076-3

Chen YF, Cao WX (2000) Schizothoracinae. Fauna Sinica, Osteichthyes, Cypriniformes III. Science Press, Beijing, pp 273–335

Mir FA, Mir JI, Chandra S (2013) Phenotypic variation in the Snowtrout Schizothorax richardsonii (Gray, 1832) (Actinopterygii: Cypriniformes: Cyprinidae) from the Indian Himalayas. Contrib Zool 82:115–122. https://doi.org/10.1163/18759866-08203001

Mirza M (1991) A contribution to the systematics of the Schizothoracine fishes (Pisces: Cyprinidae) with the description of three new tribes. Pak J Zool 23:339–341

Ganai FA, Yousuf AR, Dar SA, Wani SU, Tripathi NK (2011) Cytotaxonomic status of schizothoracine fishes of Kashmir Himalaya (Teleostei: Cyprinidae). Caryologia 64:435–445. https://doi.org/10.1080/00087114.2011.10589811

Jhingran VG (1991) Fish and fisheries of India. Hindustan Pub. Corp, New Delhi, p 727

Sunder S, Bhagat MJ (1979) A note on the food of Schizothorax plagiostomus (McClelland) in the Chenab drainage of Jammu Province during 1973–74. J Inland Fish Soc India 11:117–118

Bashir A, Bisht BS, Mir JI, Patiyal RS, Kumar R (2016) Morphometric variation and molecular characterization of snow trout species from Kashmir valley, India. Mitochondrial DNA A DNA Mapp Seq Anal 27:4492–4497. https://doi.org/10.3109/19401736.2015.1101537

Kullander SO, Fang F, Delling B, Åhlander E (1999) The fishes of the Kashmir Valley. River Jhelum, Kashmir Valley: impacts on the aquatic environment. Göteborg, Swedmar, pp 99–167

Ahmad SM, Bhat FA, Balkhi MUH, Bhat BA (2014) Mitochondrial DNA variability to explore the relationship complexity of Schizothoracine (Teleostei: Cyprinidae). Genetica 142:507–516. https://doi.org/10.1007/s10709-014-9797-y

Kartavtsev YP, Batischeva NM, Bogutskaya NG, Katugina AO, Hanzawa N (2017) Molecular systematics and DNA barcoding of Altai osmans, Oreoleuciscus (Pisces, Cyprinidae, and Leuciscinae), and their nearest relatives, inferred from sequences of cytochrome b (Cyt-b), cytochrome oxidase c (Co-1), and complete mitochondrial genome. Mitochondrial DNA A DNA Mapp Seq Anal 28:502–517. https://doi.org/10.3109/24701394.2016.1149822

Ma Q, He K, Wang X, Jiang J, Zhang X, Song Z (2020) Better resolution for cytochrome b than cytochrome c oxidase subunit I to identify Schizothorax species (Teleostei: Cyprinidae) from the Tibetan Plateau and its adjacent area. DNA Cell Biol 39:579–598. https://doi.org/10.1089/dna.2019.5031

Lakra WS, Goswami M, Gopalakrishnan A (2009) Molecular identification and phylogenetic relationships of seven Indian Sciaenids (Pisces: Perciformes, Sciaenidae) based on 16S rRNA and cytochrome c oxidase subunit I mitochondrial genes. Mol Biol Rep 36:831–839. https://doi.org/10.1007/s11033-008-9252-1

Akhtar T, Ali G, Shafi N et al (2020) Sequencing and characterization of mitochondrial protein-coding genes for Schizothorax niger (Cypriniformes: Cyprinidae) with phylogenetic consideration. Biomed Res Int. https://doi.org/10.1155/2020/5980135

Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser 41:95–98

Tamura K, Stecher G, Kumar S (2021) MEGA11: molecular evolutionary genetics analysis version 11. Mol Biol Evol 38:3022–3027. https://doi.org/10.1093/molbev/msab120

Rozas J, Ferrer-Mata A, Sanchez-DelBarrio JC, Guirao-Rico S, Librado P, Ramos-Onsins SE, Sanchez-Gracia A (2017) DnaSP 6: DNA sequence polymorphism analysis of large data sets. Mol Biol Evol 34:3299–3302. https://doi.org/10.1093/molbev/msx248

Xia X (2013) DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol 30:1720–1728. https://doi.org/10.1093/molbev/mst064

Bouckaert R, Vaughan TG, Barido-Sottani J et al (2019) BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput Biol. https://doi.org/10.1371/journal.pcbi.1006650

Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10:512–526

Wang C, Song Y, Zi F, Ge J, Chen S (2022) The mitochondrial genome of Schizothorax argentatus from Northern Xinjiang and its phylogenetic analysis. Mitochondrial DNA B Resour 7:1834–1836. https://doi.org/10.1080/23802359.2022.2133555

Wang X, Wang J, He S, Mayden RL (2007) The complete mitochondrial genome of the Chinese hook snout carp Opsariichthys bidens (Actinopterygii: Cypriniformes) and an alternative pattern of mitogenomic evolution in vertebrate. Gene 399:11–19. https://doi.org/10.1016/j.gene.2007.04.019

Takashima Y, Morita T, Yamashita M (2006) Complete mitochondrial DNA sequence of Atlantic horse mackerel Trachurus trachurus and molecular identification of two commercially important species T. trachurus and T. japonicus using PCR-RFLP. Fish Sci 72:1054–1065. https://doi.org/10.1111/j.1444-2906.2006.01256.x

Waldbieser GC, Bilodeau AL, Nonneman DJ (2003) Complete sequence and characterization of the channel catfish mitochondrial genome. DNA Seq 14:265–277. https://doi.org/10.1080/1042517031000149057

Liu ZZ, Wang CT, Ma LB, He AY, Yang JQ, Tang WQ (2012) Complete mitochondrial genome of the mudskipper Boleophthalmus pectinirostris (Perciformes, Gobiidae): Repetitive sequences in the control region. Mitochondrial DNA 23:31–33. https://doi.org/10.3109/19401736.2011.643879

Khan MF, Khattak MNK, He D, Liang Y, Li C, Dawar FU, Chen Y (2016) The mitochondrial genome of Schizothorax esocinus (Cypriniformes: Cyprinidae) from Northern Pakistan. Mitochondrial DNA 27:3772–3773. https://doi.org/10.3109/19401736.2015.1079899

Khan MF, Khattak MNK, He D, Liang Y, Li C, Dawar FU, Chen Y (2016) The complete mitochondrial genome organization of Schizothorax plagiostomus (Teleostei: Cyprinidae) from Northern Pakistan. Mitochondrial DNA 27:3630–3632. https://doi.org/10.3109/19401736.2015.1079829

Barat A, Ali S, Sati J, Sivaraman GK (2012) Phylogenetic analysis of fishes of the subfamily Schizothoracinae (Teleostei: Cyprinidae) from Indian Himalayas using cytochrome b gene. Indian J Fish 59:43–47

Akhtar T, Ali G, Shafi N, Rauf A (2020) Molecular characterization of subfamily Schizothoracinae (Teleostei: Cyprinidae) using complete sequence of mitochondrial 16S rRNA gene. Pak J Zool 52:273–282

Sharma P, Purohit S, Kothiyal S, Bhattacharya I (2023) Molecular phylogeny of Schizothorax species based on concatenated CO-I and Cyt b sequences. J Mt Res. https://doi.org/10.51220/jmr.v18i1.14

Qiao H, Cheng Q, Chen Y, Chen W, Zhu Y (2013) The complete mitochondrial genome sequence of Coilia ectenes (Clupeiformes: Engraulidae). Mitochondrial DNA 24:123–125. https://doi.org/10.3109/19401736.2012.731405

Rehman A, Khan MF, Bibi S, Nouroz F (2020) Comparative phylogeny of (Schizothorax esocinus) with reference to 12s and 16s ribosomal RNA from River Swat, Pakistan. Mitochondrial DNA A DNA Mapp Seq Anal 31:81–85. https://doi.org/10.1080/24701394.2020.1741561

Tsigenopoulos CS, Berrebi P (2000) Molecular phylogeny of north Mediterranean freshwater barbs (genus Barbus: Cyprinidae) inferred from cytochrome b sequences: biogeographic and systematic implications. Mol Phylogenet Evol 14:165–179. https://doi.org/10.1006/mpev.1999.0702

Liu S, Liu Y, Zhou G, Zhang X, Luo C, Feng H, He X, Zhu G, Yang H (2001) The formation of tetraploid stocks of red crucian carp × common carp hybrids as an effect of interspecific hybridization. Aquaculture 192:171–186. https://doi.org/10.1016/S0044-8486(00)00451-8

Yang Z, Bielawski JR (2000) Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15:496–503. https://doi.org/10.1016/S0169-5347(00)01994-7

Schaack S, Ho EKH, MacRae F (2020) Disentangling the intertwined roles of mutation, selection and drift in the mitochondrial genome. Philos Trans R Soc B Biol Sci. https://doi.org/10.1098/rstb.2019.0173

Nielsen R (2005) Molecular signatures of natural selection. Annu Rev Genet 39:197–218. https://doi.org/10.1146/annurev.genet.39.073003.112420

Barrientos A, Barros MH, Valnot I, Rötig A, Rustin P, Tzagoloff A (2002) Cytochrome oxidase in health and disease. Gene. https://doi.org/10.1016/S0378-1119(01)00803-4

Ruan H, Li M, Li Z, Huang J, Chen W, Sun J, Liu L, Zou K (2020) Comparative analysis of complete mitochondrial genomes of three Gerres fishes (Perciformes: Gerreidae) and primary exploration of their evolution history. Int J Mol Sci. https://doi.org/10.3390/ijms21051874

Dukler N, Mughal MR, Ramani R, Huang YF, Siepel A (2022) Extreme purifying selection against point mutations in the human genome. Nat Commun. https://doi.org/10.1038/s41467-022-31872-6

Cvijović I, Good BH, Desai MM (2018) The effect of strong purifying selection on genetic diversity. Genetics 209:1235–1278. https://doi.org/10.1534/genetics.118.301058

Bibi S, Fiaz Khan M (2019) Phylogenetic association of Schizothorax esocinus with other Schizothoracinae fishes based on protein coding genes. Mitochondrial DNA B Resour 4:352–355. https://doi.org/10.1080/23802359.2018.1536445

Chen W, Yue X, He S (2017) Genetic differentiation of the Schizothorax species complex (Cyprinidae) in the Nujiang River (upper Salween). Sci Rep. https://doi.org/10.1038/s41598-017-06172-5

Acknowledgements

The authors extend gratitude to the Department of Zoology, University of Kashmir and the Division of Animal Biotechnology, SKUAST-K for laboratory facilities. Appreciation is also expressed to the Council of Scientific & Industrial Research (CSIR), Government of India, for providing financial support through a CSIR fellowship to author Ms. Gulshan Akhter.

This research received no specific grant from any funding agency.

Author information

Authors and affiliations.

Fish Nutrition Research Laboratory, Department of Zoology, University of Kashmir, Hazratbal, Srinagar, Jammu and Kashmir, 190 006, India

G. Akhter & I. Ahmed

Division of Biotechnology, Faculty of Veterinary Sciences & Animal Husbandry, Sher-E-Kashmir University of Agricultural Sciences and Technology, Srinagar, India

S. M. Ahmad

Contributions

Gulshan Akhter: conceptualization; formal analysis; validation; data curation; visualization; writing – original draft; writing – review and editing. Imtiaz Ahmed: conceptualization; validation; visualization; supervision; writing – review and editing. S.M. Ahmad: conceptualization; formal analysis; writing – review and editing.

Corresponding authors

Correspondence to I. Ahmed or S. M. Ahmad.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Ethical approval

Sampling of the target organism followed the ethical standards and approved protocols of the Committee for the Purpose of Control and Supervision of Experiments on Animals (CPCSEA; Reference Number 801/Go/RE/S/2003/CPCSEA).

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 46 KB)

Supplementary file2 (DOCX 448 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Akhter, G., Ahmed, I. & Ahmad, S.M. Genomic analysis and phylogenetic characterization of Himalayan snow trout, Schizothorax esocinus based on mitochondrial protein-coding genes. Mol Biol Rep 51, 659 (2024). https://doi.org/10.1007/s11033-024-09622-2

Received: 17 February 2024

Accepted: 07 May 2024

Published: 15 May 2024

DOI: https://doi.org/10.1007/s11033-024-09622-2

Keywords

  • Codon usage
  • Genetic diversity
  • Protein coding genes
  • Schizothoracinae
