Bayesian species delimitation in Pleophylla chafers (Coleoptera) – the importance of prior choice and morphology

Background Defining species units can be challenging, especially during the earliest stages of speciation, when phylogenetic inference and delimitation methods may be compromised by incomplete lineage sorting (ILS) or secondary gene flow. Integrative approaches to taxonomy, which combine molecular and morphological evidence, have the potential to be valuable in such cases. In this study we investigated the South African scarab beetle genus Pleophylla using data collected from 110 individuals of eight putative morphospecies. The dataset included four molecular markers (cox1, 16S, rrnL, ITS1) and morphometric data based on male genital morphology. We applied a suite of molecular and morphological approaches to species delimitation, and implemented a novel Bayesian approach in the software iBPP, which enables continuous morphological trait and molecular data to be combined. Results Traditional morphology-based species assignments were supported quantitatively by morphometric analyses of the male genitalia (eigenshape analysis, CVA, LDA). While the ITS1-based delineation was also broadly congruent with the morphospecies, the cox1 data resulted in over-splitting (GMYC modelling, haplotype networks, PTP, ABGD). In the most extreme case morphospecies shared identical haplotypes, which may be attributable to ILS based on statistical tests performed using the software JML. We found the strongest support for putative morphospecies based on phylogenetic evidence using the combined approach implemented in iBPP. However, support for putative species was sensitive to the use of alternative guide trees and alternative combinations of priors on the population size (θ) and rootage (τ0) parameters, especially when the analysis was based on molecular or morphological data alone. Conclusions We demonstrate that continuous morphological trait data can be extremely valuable in assessing competing hypotheses to species delimitation. In particular, we show that the inclusion of morphological data in an integrative Bayesian framework can improve the resolution of inferred species units. However, we also demonstrate that this approach is extremely sensitive to guide tree and prior parameter choice. These parameters should be chosen with caution – if possible – based on independent empirical evidence, or careful sensitivity analyses should be performed to assess the robustness of results. Young species provide exemplars for investigating the mechanisms of speciation and for assessing the performance of tools used to delimit species on the basis of molecular and/or morphological evidence. Electronic supplementary material The online version of this article (doi:10.1186/s12862-016-0659-3) contains supplementary material, which is available to authorized users.

with shape and scale parameters equal to 1, G (1,1). The invariant sites parameter was specified using a uniform prior, U(0, 1). Rate heterogeneity across sites was modelled using the discrete gamma model, with four independent rate categories and the shape parameter was specified using an exponential prior, with mean equal E(0.5).
The uncorrelated lognormal relaxed clock model was implemented to estimate divergence times [30]. The mean substitution rate of cox1 was fixed, and a diffuse exponential prior (E(1/3)) was used to specify the parameter that describes variation in the substitution rate (ucld.stdev). Clock model parameters were unlinked across genes, and the rate of ITS1 was estimated relative to that of cox1. We applied a range of mean branch rates, in five independent sets of analyses (2, 2.5, 3, 3.5 or 4% My -1 ). The parameter used to describe rate variation across branches in the relaxed clock modelthe standard deviation of the lognormal distribution of rateswas specified using an exponential prior, E(0. 3). The mean of cox1 was fixed as described above, and the mean rate of ITS1 was specified using an exponential prior, E(1).
We assumed a Yule speciation process, with a constant speciation rate and population size through time. The population size and birth rate parameters were specified using the non-informative improper prior, 1/x. Two independent MCMC runs were performed for each analysis, each consisting of 500 million iterations, sampling every 5,000th generation and discarding the first 50 million steps, resulting in 90,000 samples post burn-in. As above, convergence was assessed using Tracer 1.5. Mean node ages and credibility intervals were calculated using TreeAnnotator 1.7.5. and ITS1). The substitution model and rate parameters were specified using the parameters estimated by *BEAST. The relative rates for each cox1 partition were calculated by multiplying the relative rate of each codon by the mean branching rate of cox1. Appropriate heredity scalars were selected for cox1 (= 0.5) and ITS1 (= 2).

Sequence data, alignment and model selection
We obtained 438 new sequences for 110 individuals (Supplementary Table S1). There is remarkably low molecular variation among the members of the genus. Excluding sequences for the outgroup species (Omalopia nigromarginata and O. ruricola) were included in our analysis [42]. We also obtained ITS1 sequences for O. nigromarginata and O. ruricola, however it was not possible to align unambiguously these sequences with the ingroup taxa, so these sequences were excluded from further analysis. The final concatenated alignment contained 2,795 characters, including 3.57% missing data and gaps. The partition schemes and optimal substitution models selected using PartitionFinder for each dataset and for use in different programs are presented in Supplementary Table S3.

Phylogenetic analysis and the monophyly of morphospecies
Morphospecies sp09 was the only species that was ubiquitously recovered as monophyletic in the independent analyses of all four markers. In the analysis of rrnL specimens Pleo834751 (sp10) and Pleo834776 (sp11) were recovered within the sp11 and sp06 clades respectively, probably due to misleading phylogenetic signal or laboratory errors. Otherwise there was consistently good support for the monophyly of sp11. Morphospecies sp10 and sp06 were not recovered as monophyletic by cox1, but these groups received strong support in both the ITS1 gene tree and the combined analyses. The clade that contains morphospecies spX2 + five female specimens was also strongly supported.
The interspecific relationships were also strongly supported in the combined analysis. Morphospecies sp09 was always recovered as the most ancestral branching member of the genus. The next most ancestral branching group was morphospecies sp12 + sp11, however the position of morphospecies sp12 fluctuates in the rrnL, cox1 and combined mitochondrial trees (Figs. S3-S5). The interspecific relationships among morphospecies sp09, sp11, sp10, sp06, spX2 and the morphospecies sp01/ sp02 group were fully resolved and strongly supported in the combined analysis using different tree inference approaches (Figs. S3-S5).
The Bayesian species tree estimated using *BEAST for the combined cox1 and ITS1 dataset resulted in strong support for the same interspecific relationships estimated using the cox1 data. Individual gene trees (for cox1 and ITS1) shared the same topology as the gene trees estimated using PhyML, RAxML and MrBayes, but in contrast, the species tree favoured a sister group relationship between sp06 and sp10. The estimated divergence times for each species pair and associated uncertainty obtained under variable substation rates are presented in Supplementary table 6.

The impact of the branch length prior in MrBayes
Changing the branch length prior implemented in MrBayes had no impact on the inferred topology but had a large impact on tree length (the sum of branch lengths) (Supplementary Table S4). Note that in some cases the Bayesian 95% posterior intervals do not contain the maximum likelihood estimate. A large difference was observed between the estimates of tree length obtained using the Bayesian analysis, under the default branch length prior in MrBayes, and the maximum likelihood estimates obtained using RAxML and PhyML (Supplementary Table S4). Changing the branch length prior to favour shorter branch lengths reduced this discrepancy, and for analyses of cox1 actually produced shorter lengths than the maximum likelihood analyses. Partitioned maximum likelihood analysis produced longer tree lengths. In particular, when cox1 is included in the analysis, partitioning by codon had a large impact on the branch lengths.

JMLresults obtained for independent partitions
The observed pairwise distances between individuals of all morphospecies were not lower than expected at the 5% level (P > 0.05), given the null model (the coalescent with no migration or hybridization). However, the results obtained using JML varied among partitions (Supplementary Table S7). The minimum distances and the probabilities of observing these distances were similar for the first and second (cox1) codon partitions (P > 0.1 for all pairwise comparisons). In contrast, the minimum distances were much greater for the third (cox1) codon partition, and the probabilities of obtaining the observed distances under the null model were smaller. The probability of obtaining the observed pairwise distances between morphospecies sp09 or sp11 and all other species was lower than expected at the 10% level (P < 0.1; P > 0.1 for all other pairwise comparisons). Note that the third codon position evolves at one and two orders of magnitude faster than the first and second positions respectively (Supplementary Table   S5). For the ITS1 data, the probabilities of obtaining the observed sequence distances under the null model were not lower than expected (P > 0.2 for all pairwise comparisons).   Table S1. Accession numbers for specimens included in the morphometric and phylogenetic analyses, along with voucher numbers and geographical origin. All specimens were included in the phylogenetic analyses.
Specimens that were included in the morphometric analysis are indicated by an asterisk (*).          Table S13. Posterior probabilities of speciation splits of all guide trees for prior only analyses and 3 data sets, 9 τ and θ prior combinations, and 10 repeats of each analysis.
è Excel file 'Table_S12.xls' Table S14. Lumping behaviour of BPP analyses with simultaneous species tree estimation. Posterior probabilities for species delimitation scenarios are shown for all initial guide trees and all θ and τ 0 prior combinations. Lumped species are indicated by parentheses. Scenarios with pp < 0.05 in all analyses are not shown.
è Excel file 'Table_S13.xls' Table S15. Posterior probabilities for a priori defined morphospecies from BPP analyses with simultaneous species tree estimation for all initial guide trees. Species that were never sampled are denoted with 'na'.     Figure S6. Pairwise plots of Eigenshape axes 1-3 from the Eigenshape analysis of partial paramere outlines.