The general role of NATs in the detoxification and metabolic activation of aromatic amine xenobiotics, ranging from dietary components to common environmental toxins and pharmaceuticals, has been well documented (reviewed in [34–36]). The possibility that the NAT enzymes could be broadening or changing their substrate specificity in accordance with the high diversity of xenobiotic compounds available in the environment suggests that their metabolic function could be under adaptive evolution and makes them a promising subject for evolutionary analyses. Numerous studies have indeed identified the signature of different selective pressures in genes involved in the metabolism of exogenous substances [37–44]. However, in contrast to the evolutionary processes affecting the NAT2 gene in humans, which have been the subject of extensive research, little is known about the role of molecular adaptation in the evolution and diversification of the NAT gene family as a whole on a longer evolutionary time scale.
The ongoing sequencing of entire genomes from various organisms is providing an unparalleled opportunity to trace the evolutionary history of the NAT gene family by enabling the inference of the phylogenetic relationships among NAT sequences over a wide range of taxa. Several previous surveys of public genomic databases have retrieved NAT-like sequences and documented the distribution of NAT genes across all major clades of life [1, 2, 45]. The last survey, by Glenn et al., provided an exhaustive annotation of NAT-homologous sequences recovered through inspection of 2445 genomes encompassing all major taxa, from bacteria and archaea to protists, fungi, and animals. The same investigators also performed phylogenetic analyses of the retrieved NAT-homologous protein sequences, providing a broad perspective on NAT evolution. Another study, focusing specifically on the phylogeny of NATs in fungi, has been published since then. In the present study, we expanded the survey by Glenn et al. by performing a comprehensive search for NAT-like sequences in the genomes of vertebrate species, and we compiled an up-to-date dataset of 77 vertebrate NAT sequences from 38 distinct species, among which 26 were identified for the first time (Figure 1). In contrast to the previous dataset of 55 vertebrate sequences in Glenn et al., we decided to exclude one lizard and three fish sequences from the analysis (Anolis carolinensis NAT3, Oryzias latipes NAT3, Fugu rubripes NAT1, Tetraodon nigroviridis NAT1) because they aligned poorly with other vertebrate NATs and introduced too many gaps and alignment ambiguities. The deduced protein sequences of these four genes clustered together in a monophyletic clade with three invertebrate NAT protein sequences at the basal position of the vertebrate NAT phylogeny reconstructed by Glenn et al.
The authors suspected the possibility that this clustering might be an artifact stemming from long branch attraction, supporting our decision to exclude these particular sequences from our phylogenetic analysis of vertebrate NAT genes. It is also important to note that, because the vertebrate genomes we considered are in various stages of draft sequencing, assembly, and annotation, we cannot exclude that the present survey missed additional extant NAT sequences. This warrants some caution in the conclusions that may be drawn from the present phylogenetic analysis. The increased availability of genome sequence data from diversified taxa is likely to continue to improve our understanding of the evolutionary history of the NAT gene family in vertebrates.
Our phylogenetic analysis of vertebrate NAT sequences nevertheless demonstrated that the NAT gene family has evolved under a dynamic process called birth-and-death evolution. This process, which operates through three major mechanisms (neofunctionalization, subfunctionalization, and pseudogenization), is thought to be an important source of genetic diversity and evolutionary change that affords functional diversification over short timescales. Our results are in accordance with the previous observations by Thomas that genes encoding enzymes that function as xenobiotic detoxifiers are often phylogenetically unstable genes that undergo rapid birth-and-death evolution, possibly in response to changing environmental conditions. It is noteworthy that Martins et al. found similar patterns of gain and loss in fungi, arguing for complex dynamics of these genes in a broader range of organisms.
Concerted evolution via interlocus gene conversion is increasingly recognized as a major feature of evolution in small multigene families (e.g., [37, 47–49]). Because gene conversion homogenizes the sequences of paralogous gene copies, the converted paralogs may come to resemble one another more than they resemble orthologous sequences in other species. Gene conversion may therefore alter the apparent relationships among paralogs and lead one to infer independent duplications where multiple gene conversions have in fact occurred in multiple species. The work carried out on the CYP1A1 subfamily provides a good example of how gene conversion can obscure gene orthology relationships and lead to incorrect conclusions when these are based solely on the results of traditional phylogenetic analyses. However, identification of gene conversion events remains a challenging task, as the methods commonly employed for detecting such events can have a high false-negative rate, particularly when gene conversion is frequent and covers a large portion of the duplicates. It is therefore advisable to assess the occurrence of gene conversion using several recombination detection algorithms and to combine information from both phylogenetic analysis and fine-scale synteny maps. We used six different methods to detect possible recombination events between paralogous pairs. Only four events, each involving a NAT1-NAT2 pair, were detected by at least two methods (one each in the fish Oryzias latipes and the rat Rattus norvegicus, and two in the bat Myotis lucifugus; Table 1), suggesting that concerted evolution has played only a minor role in the diversification of the vertebrate NAT gene family.
This conclusion is further supported by: (i) the identical phylogenetic relationships inferred from NAT sequences from which the regions putatively involved in gene conversion events were removed (Additional file 2: Figure S1), and (ii) the fine-scale synteny comparisons between the three avian species analyzed (Figure 2), as well as between human and mouse (Additional file 3: Figure S2), which revealed multiple gene duplication events occurring independently in specific lineages. Thus, our analysis of vertebrate NAT sequence data suggests that gene conversion is unlikely to have played a major role in the patterns of divergence of NAT sequences and instead supports a scenario of multiple independent gene duplications.
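The "detected by at least two methods" criterion used above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the method labels and event identifiers are hypothetical placeholders, and each event is treated as an opaque label, whereas a real analysis would also have to decide when breakpoints reported by different programs describe the same event.

```python
from collections import defaultdict


def consensus_events(calls, min_methods=2):
    """Keep candidate events reported by at least `min_methods` methods.

    `calls` maps a method label to the set of event labels it reported.
    Returns a dict mapping each retained event to the sorted list of
    methods that support it.
    """
    support = defaultdict(set)
    for method, events in calls.items():
        for event in events:
            support[event].add(method)
    return {
        event: sorted(methods)
        for event, methods in support.items()
        if len(methods) >= min_methods
    }


# Hypothetical calls from three unnamed detection methods.
example_calls = {
    "method_A": {"species1:NAT1-NAT2", "species2:NAT1-NAT2"},
    "method_B": {"species1:NAT1-NAT2"},
    "method_C": {"species3:NAT1-NAT2"},
}
retained = consensus_events(example_calls)
```

Requiring agreement between methods trades sensitivity for specificity, which is why the synteny and truncated-alignment checks described above remain useful as independent lines of evidence.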
In humans, there are two NAT isoenzymes encoded at two polymorphic loci. NAT2 polymorphisms modify individual cancer risk, drug response, and susceptibility to adverse drug reactions [15, 16, 51]. Although less well established, human NAT1 also exhibits genetic polymorphism, and several studies, albeit as yet inconclusive, have suggested that variant NAT1 genotypes are associated with susceptibility to a number of diseases, including various cancers and birth defects [52–56]. Given their high level of polymorphism and their association with variable responses to environmental toxins and drugs, these genes are good candidates to test for positive selection, and several studies have investigated the possible role of natural selection in shaping genetic variation at these loci (e.g., [17, 23]). Beyond the role of these enzymes in phase II metabolism of xenobiotics, several studies have explored their possible endogenous roles. While no endogenous substrate has been identified to date for NAT2, several lines of evidence support the role of human NAT1 in folate catabolism, thereby providing an explanation for its postulated association with congenital defects linked to a disruption in folate metabolism. The widespread tissue distribution of NAT1 and its early expression in development further support a physiological role for this enzyme, which might be essential to normal embryonic development (see  for review). Therefore, since NAT1 and NAT2 appear to have distinct functional roles in humans, it is worthwhile to investigate whether these genes have experienced different selective regimes throughout evolution.
The identification of the true orthologous sequences to human NAT1 and human NAT2 in Simiiformes and their determination in 13 additional simian species (Figure 4) gave us the power needed to analyze the two sets of sequences separately and to evaluate the changes in selective pressures experienced by the two gene copies after their functional divergence. Interestingly, two distinct evolutionary patterns emerged for the two paralogs (Table 3). Our analysis suggested a dominant role of purifying selection in NAT1 protein evolution, acting to conserve biochemical function, which is in agreement with the role of NAT1 in endogenous metabolism and homeostasis. Note, however, that a signal of positive selection was detected in the human lineage for this gene and that the pattern of evolution of NAT1 is likely to be mosaic, with some evidence of positive selection in certain lineages (in humans but perhaps also in several other species not included in the present analysis). By contrast, in most of the species investigated, NAT2 was predicted to evolve under diversifying selection that changes its amino acid sequence over time, probably in response to changes in xenobiotic exposure. This finding is consistent with observations made at the population level within the human species by previously published studies, which supported an adaptive evolution of the NAT2 gene through either balancing or directional selection [17, 18, 20, 21, 23, 24]. Similarly, the low level of polymorphism reported at NAT1 within the human species [17, 23] is consistent with the action of purifying selection detected at the interspecies level in the present study. Evolutionary analyses thus strongly support a differential role of the two isoenzymes and the involvement of NAT1 in endogenous metabolic pathways.
Although the folate catabolite pABG is the only endogenous substrate identified to date, one cannot dismiss the possibility that other as yet unknown endogenous substrates and physiological roles may exist for NAT1.
Besides an evaluation of the selective forces acting on members of the NAT gene family, our ML-based phylogenetic analysis allowed us to estimate the strength of natural selection acting at the codon level and to shed light on episodes of adaptive evolution at specific sites and domains of the protein. Several amino acid sites were predicted to be under positive selection with high posterior probability (PP ≥ 95%) throughout vertebrate evolution: codons 97, 98, 104, 214, and 286. Interestingly, four of these codon positions were shown to be involved in CoA binding according to the recent structural studies performed on human NAT1 and NAT2 isoenzymes. Residues 97P and 98V are involved in hydrophobic interactions with the adenine ring of CoA, whereas the amide nitrogen of residue 104G and the hydroxyl group of 214T form hydrogen bonds with the pyrophosphate group of CoA. It is noteworthy that the fifth codon position predicted to be under adaptive evolution is the site of a well-known polymorphism in the human NAT2 protein: G286E (c.857G>A), which defines the NAT2*7 slow haplotype series, is one of the four major nonsynonymous substitutions encountered in human populations and is particularly common in Asia. Functional studies of the G286E variant in mammalian cells demonstrated reduced affinity for both substrate and the cofactor acetyl-CoA, resulting in reduced catalytic activity towards some substrates (such as sulfamethazine and dapsone) but not others (such as 2-aminofluorene and isoniazid). Codon 286 is located on the C-terminal tail in the third domain, directly adjacent to the active site. Because the C-terminal tail has an important role in defining the size and shape of the active site cavity, a significant amino acid change at this position is likely to alter active site access and substrate selectivity, which is consistent with the substrate-dependent activity changes observed experimentally for the G286E variant.
Such a significant change to a C-terminal residue adjacent to the active site is also likely to affect acetyl-CoA binding and Cys68 acetylation. Note that the branch-site tests identified four additional sites (11, 100, 102 and 272) as having evolved under positive selection along particular lineages of the vertebrate phylogeny or in a specific clade (B4, B12 and C2, Figure 3). Interestingly, codon 102 is also known to be involved in CoA binding, and codon 272 is located in the C-terminal region of the protein. Moreover, two codon sites (173 and 191) were identified as being under diversifying selection throughout simian NAT2 evolution. While the functional and/or structural significance of these two codons has not yet been explored, they are likely to be important in the function of the NAT2 protein, and further investigations are warranted to define their potential relevance. In contrast to these positively selected codons, very low values of ω, suggesting rather strong functional constraints, were observed for the three sites of the catalytic triad: 0.054, 0.051 and 0.050 for Cys68, His107 and Asp122, respectively, in the vertebrate dataset. These three sites were conserved in the entire set of NAT coding sequences considered, thereby confirming the importance of these three residues in the activation of the active site cysteine residue. Interestingly, domain I was found to be more constrained than the other domains of the protein, consistent with the lower variability displayed by this region (mean Shannon entropy (H) = 0.70) as compared to domain II (H = 0.84) and domain III (H = 0.93). Conversely, higher ω values were observed for the interdomain, 17-residue insert and C-terminal regions of the protein. In particular, the C-terminal tail was shown to evolve under more relaxed selective constraints in the four datasets examined, and many sites identified as being under adaptive evolution were located in this region.
This is consistent with the association of this region with different acetyl-CoA binding properties and substrate specificities.
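The per-domain variability comparison above rests on the mean Shannon entropy of alignment columns. The following is a minimal sketch of that calculation, assuming a base-2 logarithm and ignoring gap characters; the exact settings used in the present analysis are not restated here, and the function names and the toy alignment are illustrative only.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-."):
    """Shannon entropy (log base 2, an assumption) of one alignment column."""
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    # abs() avoids returning -0.0 for a fully conserved column.
    return abs(-sum((k / n) * math.log2(k / n) for k in counts.values()))


def mean_entropy(alignment, start, end):
    """Mean column entropy over the 0-based, half-open range [start, end)."""
    columns = list(zip(*alignment))[start:end]
    if not columns:
        raise ValueError("empty column range")
    return sum(column_entropy(col) for col in columns) / len(columns)


# Toy alignment of four equal-length sequences (not real NAT data).
toy_alignment = ["ACD", "ACE", "ACD", "AGE"]
h_all = mean_entropy(toy_alignment, 0, 3)
```

A fully conserved column, such as one holding a catalytic triad residue, contributes an entropy of zero, so domains containing many conserved positions yield a lower mean H.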