Improved phylogenomic sampling of free-living nematodes enhances resolution of higher-level nematode phylogeny

Background Nematodes are among the most diverse and abundant metazoans on Earth, but research on them has been biased toward parasitic taxa and model organisms. Free-living nematodes, particularly from the clades Enoplia and Dorylaimia, have been underrepresented in genome-scale phylogenetic analyses to date, leading to poor resolution of deep relationships within the phylum. Results We supplemented publicly available data by sequencing transcriptomes of nine free-living nematodes and two important outgroups and conducted a phylum-wide phylogenomic analysis including a total of 108 nematodes. Analysis of a dataset generated using a conservative orthology inference strategy resulted in a matrix with a high proportion of missing data and moderate to weak support for branching within and placement of Enoplia. A less conservative orthology inference approach recovered more genes and resulted in higher support for the deepest splits within Nematoda, recovering Enoplia as the sister taxon to the rest of Nematoda. Relationships within major clades were similar to those found in previously published studies based on 18S rDNA. Conclusions Expanded transcriptome sequencing of free-living nematodes has contributed to better resolution among deep nematode lineages, though the dataset is still strongly biased toward parasites. Inclusion of more free-living nematodes in future phylogenomic analyses will allow a clearer understanding of many interesting aspects of nematode evolution, such as morphological and molecular adaptations to parasitism and whether nematodes originated in a marine or terrestrial environment. Electronic supplementary material The online version of this article (10.1186/s12862-019-1444-x) contains supplementary material, which is available to authorized users.


Background
Nematodes are ubiquitous and diverse metazoans that are found free-living in nearly every terrestrial and aquatic habitat and parasitizing most animals and plants. Fewer than 30,000 species have been described, but the actual diversity of the phylum may be closer to 1 million species [1]. Despite estimates that at least half of all nematodes are free-living [1,2], most research has focused on parasitic nematodes of medical and agricultural importance. Particularly neglected are the free-living marine nematodes, with only around 6900 species described [3] and no genomes published to date [4]. Of significance, free-living nematodes are generally the most abundant and diverse metazoans of marine sediments [5][6][7][8] where they are important as decomposers, predators, food for higher trophic levels [9], and as bioindicators for climate change and ecological disturbance [10][11][12].
Despite the importance of nematodes as free-living animals and as parasites of humans, livestock, and crops, and despite more than a century of intensive research, certain aspects of their origin and early evolution, such as the branching order near the root of Nematoda, are not yet fully understood [13,14]. Nematode evolutionary history is particularly interesting because of the diversity of niches they occupyranging from the blood and tissues of vertebrate and invertebrate animals, unicellular eukaryotes, all parts of plants, virtually every terrestrial habitat, and all aquatic environments including deep-sea hydrothermal vent communitiesis unrivaled in Metazoa [15][16][17][18]. Thus, resolving nematode phylogeny, especially the branching order close to the root of the nematode tree, will not only improve our understanding of the origin of economically important groups, but will provide a phylogenetic framework for understanding the underlying key characters (e.g., genomic modifications) corresponding to different nematode lifestyles, advancing all aspects of nematology, from basic evolutionary biology to pathogen control and drug development [19].
Morphology-based hypotheses of higher-level nematode relationships (reviewed by [17,[20][21][22]) placed emphasis on the presence or absence of a lateral canal excretory system and a number of esophageal features [23,24]. These characters were interpreted as evidence of two major lineages: a primarily terrestrial (but also including many plant and animal parasites) grouping called Secernentea, and a primarily aquatic grouping called Adenophorea. Subsequent morphological investigations by Andrássy [25] and Malakhov [26] distinguished three main lineages, elevating the primarily aquatic Enoplia and Chromadoria out of Adenophorea and re-classifying most Secernentea as Rhabditia.
Enoplia has generally been thought to represent the sister group to all remaining nematodes [7,30] because of the presence of presumably ancestral developmental features, which are common in other animal phyla but not seen in other lineages of nematodes thus far investigated. These include indeterminate development [31][32][33] and retention of the nuclear envelope in mature spermatozoa (other nematodes investigated to date have determinate development and spermatozoa that lose the nuclear envelope upon maturation [34]). As other metazoan lineages are thought to have marine origins, nematodes have traditionally been assumed to have evolved in the marine environment [24,26,35]. Thus the primarily marine habits of Enoplia, combined with their presumed ancestral developmental features, have led to them being viewed as the earliest-branching nematode lineage [7,30]. On the other hand, De Ley and Blaxter [22] suggested the possibility of a terrestrial origin of nematodes with the sister group to all other nematodes being the taxon least represented in the marine environment, Dorylaimia. Ribosomal DNA-based studies have been unable to resolve the branching order among these deepest branches within Nematoda. Even studies focused on improved representation of diverse marine free-living nematodes [7,36] failed to find resolution at the base of the nematode tree, suggesting that additional molecular markers are needed to resolve deep nematode phylogeny.
Recently, phylogenomic studies employing dozens to hundreds of nuclear protein-coding genes have addressed questions of nematode evolution, but taxon sampling in these studies has largely built on publicly available genome and transcriptome datasets [4,[37][38][39][40][41]. Until now, phylogenomic analyses of Nematoda have focused on parasitic taxa and model Caenorhabditis spp., with little or no representation of other freeliving nematodes. For example, Blaxter and Koutsovoulos [40] and Koutsovoulos [41] curated the largest phylogenomic datasets for Nematoda to date but included only a single member of Enoplia in their studies. The latest comparative phylogenomic study focusing on parasitic worms included a handful of free-living nematodes (mostly model organisms), but no representatives of Enoplia or early branching Chromadoria [42]. Likewise, phylogenetic analyses based on mitochondrial genomes have never included representatives of Enoplia [43][44][45][46] because the mitochondrial genome has not yet been sequenced for any member of this clade.
Here, we have assembled the largest and most diverse phylogenomic dataset for Nematoda to date with expanded transcriptome representation for previously undersampled free-living nematode taxa. Leveraging this dataset, we reexamine relationships among early-branching clades and provide a robustly resolved and expanded phylogenetic framework for Nematoda.

Results
Publicly available nematode and outgroup genomes and transcriptomes were supplemented with new transcriptomes from nine free-living nematodes, one nematomorph, and one kinorhynch for a total of 131 taxa sampled (Table 1, Additional file 2: Tables S1-S2). Building on an established phylogenomic data processing pipeline [47], we assembled two datasets using two different sequence selection strategies (see Methods). The first strategy used a strict orthology inference approach that refines initial orthology inference made by HaMStR [48] with PhyloTreePruner [49]. This strategy resulted in a dataset with 931 genes totalling 298, 009 amino acids in length with 84.67% missing data. The second strategy employed SCaFoS [50] to select the best sequence for each taxon in the HaMStR output. The SCaFoS strategy resulted in a dataset with 1025 genes totalling 321,951 amino acids in length with 35.01% missing data.
Our results based on the matrix assembled with the more conservative PhyloTreePruner orthology inference strategy but with a higher proportion of missing data ( Fig. 1) strongly support nematode monophyly (IQ-TREE / RAxML bootstrap support, bs = 100%/100%), and subsequent branching, with Enoplia being monophyletic (as previously recovered [7,51]), and the sister clade to Dorylaimia and Chromadoria. Enoplida+Triplonchida was moderately supported (bs = 88%/74%). However, Enoplida was paraphyletic with respect to Triplonchida, a single representative of which, Tobrilus sp., was included as an ingroup. Dorylaimia and Chromadoria were recovered as sister taxa with strong support in the IQ-TREE analysis (bs = 99%) and moderate support in the RAxML analysis (bs = 80%).
Dorylaimia was strongly supported (bs = 100%/96%). This clade was primarily represented by members of the animal parasitic Trichinellida (Trichinella and Trichuris), which was also strongly supported as monophyletic (bs =100%/100%). Dorylaimida, which was represented by the virus-transmitting plant pests Longidorus elongatus and Xiphinema index, was also strongly supported as monophyletic (bs =100%/100%). Monophyly could not be tested for the remaining three orders represented by just one taxon each: Mononchida (represented by Prionchulus punctatus), Mermithida (represented by Romanomermis culicivorax), and Dioctophymatida (represented by Soboliphyme baturini). Mermithida was recovered as the sister to Mononchida with maximal support.
Chromadoria was strongly supported (bs =100%/100%) with the sole representative of Chromadorida (Euchromadora sp.) sister to a well-supported (bs = 99%/82%) clade of all other Chromadoria with Odontophora sp., the single representative of Araeolaimida, at the base. The two sampled representatives of Plectida (Plectus sambesii and Anaplectus granulosus) were recovered in a clade (bs =100%/ 100%) sister to Rhabditida. Rhabditida includes most described species of Chromadoria, and also most of the currently available transcriptomes and genomes. It received maximal support as did its three subclades: Spirurina, Tylenchina, and Rhabditina. Relationships within the major clades of Rhabditida were also consistently strongly supported. All genera for which monophyly was testable (i.e., those with more than one representative available for study), were recovered monophyletic, with the exception of Heterorhabditis. Heterorhabditis bacteriophora was strongly supported as sister to a clade composed of Trichostrongylidae and Ancylostomatidae within Rhabditina (as expected), while a species identified as H. indica was strongly supported as the sister taxon of Globodera spp. in Tylenchina.
Examination of the Heterorhabditis indica dataset [52] revealed that this organism was incorrectly identified or mislabelledpartial sequences of the nuclear ribosomal operon mined from the H. indica transcriptome assembly show high similarity to reference sequences from various species of the genera Heterodera and Globodera (Hoplolaimidae, Tylenchina), and not Heterorhabditis (Heterorhabditidae, Rhabditina). This is further confirmed by the results of our tree-based taxonomy assignment using the 18S rDNA gene fragment (Additional file 1: Figure S1). Unfortunately, these partial sequences mined from the transcriptome of H. indica are relatively short, one with only 588 bases of the 5′ end of 18S rDNA and the other with just 863 bases of the 5′ end of 28S rDNA. They do not contain enough phylogenetically informative sites to ensure species-level identification.
Because of the high amount of missing data (84.67%) in the dataset assembled using PhyloTreePruner, we also used a less conservative orthology inference approach that did not employ an additional tree-based orthology confirmation after initial HaMStR orthology inference. This resulted in a larger and much more complete dataset with 1026 genes totalling 321,951 amino acids in length with 64.99% matrix completeness. Analysis of this SCaFoSbased dataset resulted in a nearly identical branching order as that of the PhyloTreePruner-based dataset (Fig. 2) . Whereas support for Enoplia was weak in the analysis of the PhyloTreePruner-based dataset, analysis of this dataset recovered Enoplia monophyletic and sister to the rest of Nematoda with maximal support. Tobrilus sp. (Triplonchida) was recovered sister to Enoplida with maximal support and Bathylaimus sp. was recovered sister to all other Enoplida with maximal support, which is in agreement with 18S rDNA-based analyses by van Megen et al. [53], Bik et al. [7], and Smythe [51]. Relationships within Dorylaimia were strongly supported and identical to the results based on the PhyloTreePruner dataset with the exception of relationships among Trichinella nativa, T. britovi, and T. murrelli. Likewise, relationships within Chromadoria were nearly identical; the one difference was placement of Oscheius tipulae, which was recovered sister to Rhabditomorpha sensu De Ley & Blaxter, 2004 [22] in the analysis of the PhyloTreePruner dataset and sister to Strongyloidea sensu De Ley & Blaxter, 2004 [22] in the analysis of the SCaFoS dataset.
With respect to higher-level ecdysozoan ( Fig. 3) relationships, both analyses recovered Scalidophora (represented by Priapulida + Kinorhyncha) monophyletic and sister to the rest of Ecdysozoa with strong support. IQ-TREE analysis of the PhyloTreePruner dataset recovered Onychophora sister to Arthropoda with strong support (bs = 98%) while the RAxML analysis had only moderate support for this placement (bs = 78%). However, analyses of the SCaFoS dataset recovered Onychophora sister to all nonscalidophoran ecdysozoans with similar levels of support (bs = 100%/82%). IQ-TREE analyses recovered Tardigrada Fig. 1 Phylogeny of Nematoda based on the IQ-TREE maximum likelihood analysis of the PhyloTreePruner dataset. "Classification" bar on the left side serves as a scale and represents the relative known taxonomic diversity of different taxa within Nematoda: the height of each colored bar is proportional to a number of known species (also given in the brackets after each taxon name), with the height of the entire multicolored background rectangle equal to 100% of known nematode diversity. IQ-TREE / RAxML bootstrap support values < 100% are shown. "Habitat" describes the lifestyle for each analysed species, such as animal parasitic (animal par.), plant parasitic (plant par.), entomopathogenic or entomoparasitic (entomop.), free-living freshwater (freshwater), terrestrial (terrestrial) and marine (marine). Newly generated transcriptomes are marked with an asterisk sister to Nematoda with moderate to strong support (bs = 84-97%) whereas RAxML analyses recovered Tardigrada + Nematomorpha sister to Nematoda. This was strongly supported in the analysis of the SCaFoS dataset (bs = 100%) but weakly supported in the analysis of the PhyloTreePruner dataset (bs = 66%).

Deep nematode phylogeny
Early evolution and diversification of nematodes has been a matter of much controversy (reviewed by [4,15,21,22,54]). Molecular phylogenetic studies have generally supported the existence of three major lineages and the monophyly of Chromadoria, but resolution of the deepest splits within Nematoda -relationships among Enoplia, Dorylaimia, and Chromadoria -has been recalcitrant. As in prior analyses based on 18S rDNA [7,36,53], analysis of our PhyloTreePruner-based dataset lacked support for relationships among these deepest branches in Nematoda. Enoplia received moderate support (bs = 88), while monophyly of Enoplida could not be established. Insufficient taxon sampling and limited matrix occupancy for Enoplia is, in our opinion, the prime issue to be considered and addressed in efforts to resolve relationships among these deep branches.
Our initial dataset assembly strategy employed Phylo-TreePruner [49], which helps exclude paralogous sequences and contamination missed by the initial orthology inference approach. PhyloTreePruner examines single-gene trees and, if there are two or more sequences from a taxon that do not form a clade, the tree is pruned to the largest subclade in which all taxa are represented by just one sequence. Only the subset of sequences corresponding to that subtree is retained for concatenation and species tree reconstruction. Unfortunately, the PhyloTree-Pruner algorithm can result in the unnecessary exclusion of large numbers of sequences when even a single taxon has two or more sequences that do not form a clade in single-gene trees (Thálen and Kocot, unpublished data). Aside from paralogy, putative single-gene trees with two or more sequences from the same taxon that do not form a clade may also be caused by the presence of very short and/or mis-aligned contigs, low-quality contigs, or incorrect single gene trees. This problem is exacerbated as the number of sampled taxa increases (Thálen and Kocot, unpublished data).
Use of PhyloTreePruner with its strict orthology inference approach on this rather species-rich dataset resulted in exclusion of large subtrees worth of sequences for many of the orthogroups identified by HaMStR and a final concatenated dataset with just 15.33% matrix A B completeness. Because the HaMStR "model organisms" core ortholog set used in this study is known to consist of genes that are single copy across diverse metazoan phyla [48], paralogy is unlikely to be problematic with this dataset (although taxon-specific gene duplications are possible). Thus, we re-ran our pipeline using SCaFoS [50] to select sequences for concatenation. SCaFoS excludes highly divergent sequences (i.e., it is still able to exclude, non-nematode contamination) and selects the best sequence for each taxon based on average pdistance. As noted above, this resulted in a larger and much more complete dataset (64.99% matrix occupancy) . Despite substantial differences in matrix completeness, analysis of the SCaFoS-based matrix resulted in a very similar topology to that of the PhyloTreePruner-based matrix. Of significance, analysis of this more complete data matrix resulted in strong support for relationships among the major lineages of Nematoda, placing a monophyletic Enoplia sister to all other nematodes with maximal support, and supporting the monophyly of Enoplida. Our SCaFoS-based phylogeny supports the "traditional" view of early nematode evolution with Enoplia sister to the rest of Nematoda, a topology used as a basis for the long-standing yet poorly explored hypothesis that the phylum arose in the marine environment [22,24,26,35]. The alternative hypothesis of the primarily terrestrial Dorylaimia as the sister to the rest of Nematoda [22], receives no support from either of our analyses.
Placement of Enoplia as sister to the rest of Nematoda, however, does not deny the possibility of a terrestrial origin of Nematoda [22] as early-branching clades are equally represented by marine, freshwater and terrestrial taxa (Fig. 4). Enoplia splits into predominantly marine Enoplida and predominantly freshwater/terrestrial Triplonchida, while its sister clade (unnamed, containing the rest of Nematoda) consists of primarily marine Chromadoria and primarily freshwater/terrestrial Dorylaimia. A comprehensive hypothesis of nematode origin and early evolution must build on a greatly expanded phylogenomic dataset with better sampling of Enoplia and Dorylaimia and closely related phyla (Nematomorpha, Tardigrada, Priapulida, Kinorhyncha and Loricifera, Onychophora). This would better enable ancestral character state reconstruction analysis for Nematoda and Ecdysozoa as a whole.

Relationships within major nematode clades
In terms of relationships within major nematode clades, our results are largely consistent with earlier studies based on the 18S rDNA gene [27,30,36,53] and previous phylogenomic studies [40,42]. One exception is the topology within Dorylaimia, which is somewhat different: 18S rDNA-based trees place Dorylaimida as the earliest branching clade [36,53], although relationships among Mononchida, Mermithida, Trichinellida and Dioctophymatida vary. Our results place a clade containing Dorylaimida, Mermithida and Mononchida sister to a clade with Dioctophymatida and Trichinellida. Our recovery of Mermithida as the sister taxon of Mononchida is in agreement with 18S rDNA based phylogenetic studies (e.g. [27], but in discordance with morphology-based theories, which suggest closer affinities between Mermithida and Dorylaimida [55,56] or Mermithida and Trichinellida (=Trichocephalida) [57]. Another exception is in the branching pattern of Rhabditida: our analysis places Spirurina as a sister to Tylenchina + Rhabditina (in full agreement with all 18S rDNA-based and most phylogenomic studies), while [42] recovered Tylenchina as a sister to Rhabditina + Spirurina, albeit with relatively low bootstrap support. Fig. 4 Simplified nematode phylogeny based on Fig. 2 indicating marine versus freshwater/terrestrial distribution for each order, considering the distribution of the majority of species. Notes: * includes equal number of marine, freshwater and terrestrial taxa, with molecular phylogenies suggesting terrestrial clades to be earlier (deeper); ** based on distribution of hosts, marine taxa may be of secondary origin; *** based on distribution of hosts; **** based on distribution of hosts and free-living stages "Minor" problems in nematode phylogeny Early radiation within the phylum Nematoda is the most challenging problem but not the only one in the systematics of this group of animals. There are a number of "orphaned" nematode taxa for which phylogenetic affinities and thus placement in the classification remain unclear. Such are the phylogenetic relationships of nematode families Teratocephalidae [22], Chambersiellidae [58], Brevibuccidae [22], Myolaimidae [59], Aegialoalaimidae [60], Cyartonematidae [61], Aulolaimidae [62,63], Paramicrolaimidae [60,64], Haliplectidae [60], Richtersiidae [65], Rhabdodemaniidae [51,66], Thalassogeneridae [67], suborder Ceramonematina [60] and orders Benthimermithida [68, 69], Marimermithida [70] and Rhaptothyreida [71]. They often possess unusual morphologies [59,63,64] or are highly specialized parasites [69,70], and have no clear place in morphologybased classifications.
Acquisition of transcriptome or genome data from the understudied taxa is needed in order to resolve these "minor" phylogenetic issues that could not be clarified in phylogenetic studies based on rDNA loci or morphology, which have provided contradictory results depending on the data or methodology used. Besides finally achieving stable classification, many of these taxa are important for understanding of morphological character evolution, transitions between marine and terrestrial lifestyles, and evolution of symbiosis in the marine environment.

Phylogeny of Ecdysozoa
Although taxon sampling of the present study focused on Nematoda, we aimed to broadly sample relevant outgroups using only high-quality, publicly available data plus new transcriptomes from a nematomorph and a kinorhynch. Relationships among ecdysozoan phyla have varied somewhat dramatically among studies (reviewed by [72]), prompting numerous conflicting phylogenetic hypotheses. Our results find no support for some traditionally hypothesized groups including Nematoida (Nematoda + Nematomorpha), Panarthropoda (Arthropoda, Onychophora, and Tardigrada), or Cycloneuralia (Scalidophora + Nematoida). Interestingly, we recover Tardigrada as the sister taxon of Nematoda. A close relationship of Tardigrada to Nematoda has been recovered in other recent phylogenomic studies [73][74][75][76][77], but data from representatives of Nematomorpha have been limited. Interestingly, the PhyloTreePruner-based analysis recovers the traditionally hypothesized placement of Onychophora as the sister taxon of Arthropoda with strong support (bs = 98) but in the SCaFoS-based analysis, it is recovered as the sister taxon of a clade of all other nonscalidophoran ecdysozoans with maximal support. The limited taxon sampling for key ecdysozoan clades (e.g., just one onychohoran, one nematomorph, no heterotardigrades, no loriciferans, etc.) further demonstrates the need for high-quality genomic and transcriptomic resources from this part of the animal tree.
Expand sampling of free-living nematodes to learn more about parasites The origin and evolution of animal parasitic nematodes from their free-living ancestors has been an active area of research for 80 years [78][79][80][81][82][83]. Two simplified scenarios describe evolutionary pre-adaptations and morphophysiological changes leading towards parasitism via commensalism in aquatic environments [84,85] and via a saprobiontic lifestyle in terrestrial environments [80,86,87]. We are just beginning to understand the genomic changes involved in these processes [42,88]. Furthermore, many other important questions about parasite biology remain unanswered, such as how parasites locate and invade hosts, suppress host immune response, acquire nutrients, etc. [40].
Comprehensive understanding of morphological, ecological, behavioral and genomic adaptations involved in the evolution of a parasitic lifestyle can not be achieved without thorough comparison between parasites and their close, free-living relatives [19,40,89]. One of the complications, however, is that animal parasitic nematodes evolved independently at least 18 times [90], if not more [40,86], and one cannot expect the same underlying mechanism to be behind these numerous independent events. Moreover, the majority of animal parasitic clades have no identified, closely related freeliving taxon suitable for comparative analysis [19]. These include all parasites from the subclass Dorylaimia and the most diverse and economically important Spirurina. Even the closest relative of such a well-researched taxon as the entomopathogenic genus Steinernema remains unclear [58,91]. Thus, further expanding sampling of free-living nematodes in phylogenomic studies will be an integral part of any future research aiming to understand the evolution of parasitismit will help elucidate sistergroup relationships of those parasitic taxa for which the closest free-living relatives are yet unidentified and provide much needed comparative data for identification of parasitism-related genetic modifications.
With 97 published and nine new nematode genomes and transcriptomes, our phylogenetic analyses, which are by far the most comprehensive to date, cover less than 0.5% of the approximately 23,000 valid nematode species [92]. For comparison, the latest phylum-wide 18S rDNA-based phylogeny [53] included 1215 sequences or just about 5% of the known diversity. Of the 108 nematode species included in our analyses, 80 belong to Rhabditidaa clade with over 13,400 known species including most economically and medically important parasites as well as the model species Caenorhabditis elegans and satellite model Pristionchus pacificus. Of the Rhabditida species included in our analyses, 50 are parasites of animals, 12 are plant parasitic, and the remaining 18 are thought to be free-living inhabitants of soil or saprophytic communities (although some are phoretically associated with invertebrates). The next largest set of species, 11 in number, represent exclusively the parasitic order Trichinellida (with about 400 known species). The remainder of the phylum, consisting of 17 orders and including free-living (in particular almost all known marine species), plant-and animal parasitic nematodes (with about 9200 species in total), is unevenly represented in our analysis: nine orders are represented by 17 species, while eight orders are not included at all. Out of 108 species included in this phylogenetic analysis, 63 are animal parasitic and 14 are associated with plants, while only 31 are free-living, of which 22 are fresh water and soil inhabitants and only nine are marine. Thus, vast habitat diversity, and the morphological and molecular adaptations that allow nematodes to live in those environments, remains unrepresented in transcriptome-based phylogenies.

Recommended sampling strategies
Three possible sampling strategies to increase and diversify nematode genomic and transcriptomic datasets can be suggested, depending on the research goals. Those researchers who are interested solely in the origin and early evolution of animal parasitism can find interesting models among free-living Enoplida [93], Chromadorida [94], Monhysterida [95,96] and Plectida [97][98][99] species with parasitic lifestyles but with morphology retaining many features of their close, free-living relatives. Phylogenetic analysis and subsequent ancestral character state reconstruction would elucidate features of freeliving ancestors of parasites and generate new hypotheses regarding the evolution of parasitism. Secondly, studies aimed at improving general nematode phylogeny and classification must focus on the species described above in the "Minor" problems in nematode phylogeny and taxa to which they were once believed to be related to. Finally, large taxonomic categories currently represented by single or few genomes/transcriptomes (Triplonchida, Mononchida, Dorylaimida, etc) also deserve attention, and further sampling of those taxa would elucidate relationships in those clades and likely spur research into yet more unanswered questions.

Conclusion
This study represents the largest phylogenomic analysis of nematodes to date, and furthers our understanding of nematode relationships. We have also, however, revealed how poorly sampled the current dataset is relative to the tremendous diversity of nematodes on Earth. Sequencing and re-sequencing of more species and broad scale comparative studies can also reveal and correct misidentified or mislabelled datasets (the case of Heterorhabditis indica). Transcriptome sequencing of nematodes is still strongly biased toward parasitic and "model" taxa, particularly those in the Rhabditida, neglecting the freeliving clades that hold the key to the origins of the phylum. Our understanding of nematode early evolution and various pathways towards parasitism will be improved only by broader sampling and sequencing of free-living taxa.
Total RNA was extracted from all samples but Gordius sp. using the Ambion RNAqueous-Micro Kit. For Anaplectus granulosus, Euchromadora sp. and Tobrilus sp., 1000 μL of lysis solution was added directly to the original sample (nematodes in 100 μl of nuclease-free water), while individual specimens of the remaining nematodes and the kinorhynch were manually transferred from RNAlater or nuclease-free water to lysis solution. Subsequent steps of RNA extraction and DNAse treatment followed the manufacturer's protocol. RNA was extracted from the nematomorph Gordius sp. using the Omega Bio-Tek EZNA Mollusc RNA kit using a rotor-stator homogenizer for homogenization and oncolumn DNAse treatment.
For Anaplectus granulosus, Bathylaimus sp., Euchromadora sp., Odontophora sp., Pontonema sp., Symplocostoma sp. and Tobrilus sp., library preparation and cDNA synthesis was performed using the Clontech SMARTer PCR cDNA Synthesis Kit following manufacturer's instructions. Resulting double-stranded cDNA was purified using the QIAquick PCR Purification Kit. Concentration of double-stranded cDNA was measured using Qubit dsDNA HS Assay Kit and Qubit 3.0 Fluorometer. Final library preparation and transcriptome sequencing were performed at the Swedish National Genomics Infrastructure in Stockholm, Sweden using the Illumina TruSeq PCR-free protocol and an Illumina HiSeq 2500 in high-output mode with V4 2 X 125 bp paired-end reads.
For Oncholaimidae sp. and Thoracostomopsidae sp., total RNA (not quantified; < 1 ng) was sent to Macrogen Inc. (Seoul, South Korea) for cDNA library preparation with the SMARTer low input RNA kit and sequencing using on the Illumina HiSeq 2500 using HiSeq SBS V4 with 2 X 100 bp paired-end reads. For Gordius sp., total RNA (1 μg) was sent to Macrogen for Illumina TruSeq RNA library preparation and sequencing using the Illumina HiSeq 2500 using HiSeq SBS V4 with 2 X 100 bp pairedend reads.
Dataset assembly and analysis followed the approach of Kocot et al. [47]. Publicly available genomic data [101,102] were downloaded as predicted proteins if available (Additional file 2: Pycnophyes sp., Gordius sp., Oncholaimidae sp. and Thoracostomopsidae sp. as well as publicly available transcriptomes available only as raw reads were quality filtered, adapter-trimmed, and assembled using Trinity 2.2.0 with the --trimmomatic and --normalize_reads flags [104] on the University of Alabama UAHPC cluster. Transcripts were translated with TransDecoder 2.0.0 or 2.0.1 [106] using the UniProt SwissProt database (accessed on September 20th, 2016; The Uniprot Consortium 2014) and PFAM (Pfam-A.hmm) version 27 [107].
For orthology inference, HaMStR 13 [48] was used with the "model organisms" core-ortholog set. Translated transcripts for all taxa except Caenorhabditis elegans were searched against the 1031 profile hidden Markov models (pHMMs) using the "-central" flag and otherwise with the default options. Sequences matching a pHMM were compared to the proteome of Caenorhabditis elegans using BLASTP with the default search settings of HaMStR. If the Caenorhabditis elegans amino acid sequence contributing to the pHMM was the best BLASTP hit in each of these back-BLASTs, the sequence was then assigned to that putative orthology group (simply referred to as "gene" henceforth). Redundant sequences that were identical (including partial sequences that were identical at least where they overlapped) were then removed with UniqHaplo (http:// raven.wrrb.uaf.edu/~ntakebay/teaching/programming/ perl-scripts/uniqHaplo.pl), leaving only unique sequences for each taxon. Each gene was then aligned with MAFFT 7.215 using the automatic alignment strategy with a "maxiterate" value of 1000 [108]. Alignments were then trimmed with BMGE (−g 0.5) to remove ambiguously aligned regions and any alignments shorter than 50 bp were deleted. Sequences that did not overlap with all other sequences in the alignment by at least 20 amino acids were deleted, starting with the shortest sequences not meeting this criterion. This step was necessary for downstream single-gene tree reconstruction. Finally, genes sampled for fewer than 10 taxa after these steps were discarded.
In some cases, a taxon was represented in an alignment by two or more sequences (splice variants, lineagespecific gene duplications [=inparalogs], undetected paralogs, or exogenous contamination). To screen for evidence of paralogy or contamination and select just one sequence for each taxon, an approximately maximum likelihood tree was inferred for each remaining alignment using FastTree 2 [109] using the -slow and -gamma options. PhyloTreePruner [49] was then employed to use a tree-based approach to screen each single-gene alignment for evidence of paralogy or contamination. First, nodes with support values below 0.95 were collapsed into polytomies. Next, the maximally inclusive subtree was selected where each taxon was represented by no more than one sequence or, in cases where more than one sequence was present for any taxon, all sequences from that taxon formed a clade or were part of the same polytomy. Putative paralogs and contaminants (sequences falling outside of this maximally inclusive subtree) were then deleted from the input alignment. In cases where multiple sequences from the same taxon formed a clade or were part of the same polytomy, all sequences except the longest were deleted. Concatenation of remaining sequences to assemble the data matrix henceforth referred to as the "original full dataset" was performed using FASconCAT-G [110].
Because PhyloTreePruner can result in the unnecessary exclusion of large numbers of sequences when even a single taxon has unstable or contaminant sequences, we also ran our pipeline using SCaFoS [50] instead of PhyloTreePruner. The default settings were used to exclude highly divergent sequences and select the best sequence for each taxon based on average p-distance to all other sequences in the alignment.
Maximum likelihood (ML) analyses were conducted in RAxML 8.2.8 [111] and IQ-TREE [112]. Because of the very large number of taxa in our matrices, for the RAxML analyses, data matrices were partitioned by gene but the PROTGAMMALG model was specified for all partitions rather than empirically inferring the bestfitting model for each partition. A preliminary run of PartitionFinder 2 [113] found that the LG model was the best fit for the vast majority of partitions. The tree with the best likelihood score after 10 random addition sequence replicates was retained and nodal support was assessed with rapid bootstrapping with the number of replicates determined by the autoMRE criterion. IQ-TREE analyses were performed using IQ-TREE 1.5.5 with the site heterogeneous PMSF model [63] (−m LG + C20 + G + F) with the RAxML bipartitions tree provided as the required guide tree (−ft). Nodal support was assessed with 1000 rapid bootstraps (−bb 1000).
Several taxonomy assignment approaches were used to identify the transcriptome of Heterorhabditis indica. At first, transcriptome database for Heterorhabditis indica available at http://insilico.iari.res.in/hindica/was mined for possible ribosomal DNA sequences using built-in BLAST search and 18S rDNA sequence of Plectus aquatilis (chosen to be equally distantly related from both Heterorhabditis and Heterodera) as a target. Four recovered transcripts were then compared with the publicly available sequences from the nucleotide collection of NCBI GenBank using blastn suite (alignment-based taxonomy assignment approach, see review in [114]). One of the recovered transcripts (labelled as Locus_123_ Transcript_1/1) showed high similarity (> 99% identity, E-value = 0) to several 18S sequences from different species of the genera Heterodera and Globodera, with Heterodera glycines (GenBank acc. Number AY043247) having the highest identity score, albeit with partial overlap. The other two transcripts (labelled as Locus_90_ Transcript_1/2 and Locus_90_Transcript_2/2 respectively) also showed high similarity (99% identity, Evalue = 0) to several sequences from different species of the genera Heterodera and Globodera, partially overlapping various reference sequences that may include partial 18S, ITS1, 5.8S, ITS2 and partial 28S, with Heterodera cajani (GenBank acc. Number AY274389) having the highest identity score. Similar results were obtained by mining the transcriptome assembly downloaded directly from GenBank.
The longest section of 18S rDNA sequence mined from the Heterorhabditis indica transcriptome database (588 base long partial 5′ section from the Locus_123_Tran-script_1/1) was then used in tree-based taxonomy assignment approach (see review in [115]) to double-check the results of alignment-based taxonomy assignment. This section was added to a selection of 18S rDNA sequences downloaded from SILVA database [116] and representing all major clades of Rhabditida, including all available nearfull length sequences for identified species from the genera Heterorhabditis, Heterodera and Globodera. The alignment was created using MUSCLE at https://www.ebi.ac.uk/ Tools/msa/muscle/ under default settings and trimmed to a size of a fragment from the Heterorhabditis indica transcriptome. A phylogenetic tree was inferred using RAxML-HPC2 under default settings with 1000 bootstrap replicates.

(DOCX 22 kb)
Abbreviations bs: Bootstrap support; ML: Maximum likelihood Acknowledgements OH acknowledges support from the National Genomics Infrastructure funded by Science for Life Laboratory, the Knut and Alice Wallenberg Foundation and the Swedish Research Council, and SNIC/Uppsala Multidisciplinary Center for Advanced Computational Science for assistance with massively parallel sequencing and access to the UPPMAX computational infrastructure. OH also acknowledges the use of public servers usegalaxy.org (Center for Comparative Genomics and Bioinformatics at Penn State, the Department of Biology and at Johns Hopkins University and the Computational Biology Program at Oregon Health & Science University) and galaxy.ncgas-trinity. indiana.edu (National Center for Genome Analysis Support, Pervasive Technology Institute at Indiana University) for analysis of some of the data. Sampling in the Skagerrak was conducted by OH using vessels ("Skagerak" and "Oscar von Sydow") and facilities of the Sven Lovén Centre for Marine Sciences in Kristineberg. Further thanks are to Dr. Sarah Atherton (Swedish Museum of Natural History) who collected samples containing Euchromadora sp. and Symplocostoma sp. KMK thanks Dr. Christoph Held, the Alfred Wegener Institute, and the scientists and crew of the R/V Polarstern PS 96 cruise, which provided samples used in this work. KMK also thanks Deb Crocker and Robert Griffin for assistance with the University of Alabama High-Performance Computing cluster. Dr. Andreas Hejnol (SARS Centre, University of Bergen), Dr. Philipp Schiffer (CLOE, University College London), and Dr. Christopher Kraus (Zoological Institute, Universität zu Köln) kindly provided unpublished transcriptome data. The authors gratefully acknowledge use of the resources of the Alabama Water Institute at The University of Alabama.
Authors' contributions KMK, OH, and AS collected specimens, OH and AS identified specimens, KMK and OH performed RNA extraction and library preparation, and KMK analyzed the data. All authors contributed to the writing of the manuscript and approved the final version prior to submission.
Funding AS received funding from the Virginia Military Institute to support field work, identification of specimens, and manuscript preparation. Nematode sampling and transcriptome sequencing was supported primarily by two grants to OH from the Swedish Taxonomy Initiative, Artdatabanken: "Systematics of Swedish free-living nematodes of the orders Desmodorida and Araeolaimida" (Dha 2013-140) and "Systematics of poorly known marine nematodes of the class Chromadorea from Sweden" (SLU.dha.2017.4.3-102), and by a grant from Riksmusei Vänner: "Transcriptome sequencing of marine nematodes." KMK received funding from The University of Alabama to support field work, lab work, transcriptome sequencing, data analysis, and manuscript preparation.