Structural and evolutionary adaptation of rhoptry kinases and pseudokinases, a family of coccidian virulence factors

Background The widespread protozoan parasite Toxoplasma gondii interferes with host cell functions by exporting the contents of a unique apical organelle, the rhoptry. Among the mix of secreted proteins are an expanded, lineage-specific family of protein kinases termed rhoptry kinases (ROPKs), several of which have been shown to be key virulence factors, including the pseudokinase ROP5. The extent and details of the diversification of this protein family are poorly understood. Results In this study, we comprehensively catalogued the ROPK family in the genomes of Toxoplasma gondii, Neospora caninum and Eimeria tenella, as well as portions of the unfinished genome of Sarcocystis neurona, and classified the identified genes into 42 distinct subfamilies. We systematically compared the rhoptry kinase protein sequences and structures to each other and to the broader superfamily of eukaryotic protein kinases to study the patterns of diversification and neofunctionalization in the ROPK family and its subfamilies. We identified three ROPK sub-clades of particular interest: those bearing a structurally conserved N-terminal extension to the kinase domain (NTE), an E. tenella-specific expansion, and a basal cluster including ROP35 and BPK1 that we term ROPKL. Structural analysis in light of the solved structures ROP2, ROP5, ROP8 and in comparison to typical eukaryotic protein kinases revealed ROPK-specific conservation patterns in two key regions of the kinase domain, surrounding a ROPK-conserved insert in the kinase hinge region and a disulfide bridge in the kinase substrate-binding lobe. We also examined conservation patterns specific to the NTE-bearing clade. We discuss the possible functional consequences of each. Conclusions Our work sheds light on several important but previously unrecognized features shared among rhoptry kinases, as well as the essential differences between active and degenerate protein kinases. We identify the most distinctive ROPK-specific features conserved across both active kinases and pseudokinases, and discuss these in terms of sequence motifs, evolutionary context, structural impact and potential functional relevance. By characterizing the proteins that enable these parasites to invade the host cell and co-opt its signaling mechanisms, we provide guidance on potential therapeutic targets for the diseases caused by coccidian parasites.


Background
Toxoplasma gondii is an intracellular parasite that infects a wide range of hosts, including an estimated one-third of the world's human population [1]. The resulting disease toxoplasmosis can be serious in pregnant women and immunocompromised individuals, and as an opportunistic infection associated with AIDS and cancer patients [2]. T. gondii and its evolutionary relatives, the Coccidia, form a clade of parasitic protozoa involved in many human and veterinary diseases such as toxoplasmosis and coccidiosis. Coccidians are a lineage within the protozoan phylum Apicomplexa, which also includes the deadly malaria pathogen Plasmodium falciparum. Thus, T. gondii also serves as an experimentally tractable model organism for studying the shared and contrasting biological properties of the Apicomplexa and other intracellular parasites [3,4].
Apicomplexans contain a unique system of apical organelles called the apical complex, consisting of rhoptries, micronemes and dense granules [5]. At the initiation of host cell invasion, the contents of the rhoptries http://www.biomedcentral.com/1471-2148/ 13/117 are injected into the host cell and the forming parasitophorous vacuole which protects the intracellular parasite [6]. Once there, the parasite proteins can disrupt host cell signaling and defense mechanisms and assist in recruiting host organelles [7].
Proteomic profiling of T. gondii rhoptries [8] and analyis of apicomplexan genomic sequences [9][10][11][12] revealed that many of the proteins secreted by coccidians are protein kinases, a class of enzymes that regulate cell signal transduction through phosphorylation. This expanded, rapidly evolving family of kinases and pseudokinases has been termed the rhoptry kinase (ROPK) family [10], or ROP2 family, in reference to a representative member of the family [9]. While rhoptry kinases appear to be unique to the Coccidia, the involvement of lineagespecific protein kinase families in host-parasite interactions is observed across the Apicomplexa [13]. Several rhoptry kinases have been shown to be involved in virulence and alteration of host cell transcription [7,14]. These include ROP18, a key modulator of parasite growth and virulence which is localized to the parasitophorous vacuole membrane (PVM) [15,16], and ROP5, another PVM-associated protein which assists ROP18 in blocking the host immune response [17][18][19][20][21]. ROP16 localizes to the host cell nucleus and interacts with the STAT3 and STAT6 immune-response signaling pathways [22][23][24][25][26], and ROP38 has been implicated in the modulation of host MAPK signaling [10].
Protein kinases are a diverse family of enzymes which have been successfully targeted for inhibition in human cancers, and show promise for treating infections by protozoan pathogens as well [27]. ATP-competitive smallmolecule inhibitors have been developed to specifically target catalytically active protein kinases in parasitic protozoa [28]. Since many of the ROPKs appear to also be catalytically active, there may be an opportunity to target these kinases for infectious diseases. However, the "catalytic triad" of residues considered essential for kinase enzymatic activity [29] is altered in about half of the identified ROPKs [10]. Pseudokinases have been observed to perform important functions in other systems, typically through inducing allosteric changes in other interacting partners (e.g. [30,31]; reviewed in [32][33][34]). The overall expansion of pseudokinases in the ROPK family underscores observations that some catalytically inactive ROPKs nonetheless play important, functional roles through interaction with other proteins [18,19,35]. Structural studies showed that the pseudokinase virulence factors ROP2, ROP8 and ROP5 do indeed form a protein kinase fold; ROP2 and ROP8 were indicated to be unable to bind ATP [36], while ROP5 was shown to bind ATP in an atypical, noncatalytic conformation [37]. An interplay between ROP5, the active kinase ROP18 and a host immunity-related GTPase has been identified [18,19], demonstrating the potential for complex interplay between rhoptry kinases and the host cell signaling pathways. However, the full extent of the diversity in ROPK family, in terms of function, potential interacting partners, protein structure and structural mechanisms, is poorly understood. With the availability of molecular sequence and structural data from multiple strains of T. gondii and related apicomplexans, we can use comparative methods to examine the molecular evolution of ROPKs and identify functional shifts that may point to distinct regulatory roles and mechanisms.
We catalogued the rhoptry kinases in several fully sequenced coccidian genomes, including Toxoplasma gondii, Neospora caninum, Sarcocystis neurona and Eimeria tenella, and compared them to the broader eukaryotic protein kinase (ePK) superfamily and to each other to study the patterns of diversification and neofunctionalization in the ROPK family and its subfamilies. We propose previously unidentified rhoptry kinases in each of these genomes, including several putative new ROPK subfamilies. We studied the variation in these subfamilies in light of the solved structures of ROP2, ROP8 and ROP5 proteins, and relative to "typical" eukaryotic protein kinases. Both pseudokinases and catalytically active kinases appear to be prevalent throughout the ROPK family. We found a striking co-evolution of structural inserts within the canonical protein kinase domain and the residues that interact with them. Most noteworthy among these is a pattern of residues surrounding the ROPK-specific αC' helix in the kinase "hinge" region. We also recovered another pattern of co-conserved cysteines that form a disulfide bond in the substrate-binding C-lobe. We then discuss some possible functional consequences of these distinguishing features of the ROPK family.

Results
To examine the molecular evolution and functional shifts in ROPKs, we used the genomic, mRNA and proteomic sequences of multiple T. gondii strains, Neospora caninum, Sarcocystis neurona and Eimeria tenella to develop profiles for 42 subfamilies of ROPK, reflecting orthology as well as chromosomal patterns of tandem repeats (see Methods).
We used these sequence profiles to perform an analysis of evolutionary constraints, applying statistical tests of contrasting conservation between gene clades to identify potential sites of subfunctionalization and neofunctionalization in the ROPK family and each ROPK subfamily. We then mapped the sites and regions of interest onto solved structures of ROP2, ROP8 and ROP5 to examine the structural and possible functional roles these features may play within the parasite proteins. http://www.biomedcentral.com/1471-2148/13/117

Global trends in the ROPK family
We used a set of HMM profiles derived from our subfamily sequence alignments to scan the translated gene model sequences available for T. gondii strains GT1, ME49 and VEG, N. caninum and E. tenella and classify putative ROPK genes into the identified subfamilies. We found 37, 55 and 38 ROPK genes in T. gondii strains GT1, ME49 and VEG, respectively, 44 in N. caninum and 27 in E. tenella (Additional file 1). The elevated ROPK counts in T. gondii ME49 relative to the other strains is probably due to differences in sequencing depth and the quality of assembly and gene model annotation; we also found genomic evidence of unannotated orthologs in the other strains. As suggested by Reese and Boyle [35], ROPK genes are often present in expanded loci (sites of gene duplication, usually in tandem array) and are probably undercounted in annotated genomes.
By incorporating sequences from multiple coccidian species into HMM profiles, we were able to identify several putative ROPKs that were not identified in previous computational surveys [9,10]. These include the proposed subfamilies ROP47, ROP48, ROP49 and ROP50, present in T. gondii and N. caninum, and the E. tenellaspecific subfamilies ROPK-Eten1, ROPK-Eten2a, ROPK-Eten2b, ROPK-Eten3, ROPK-Eten4, ROPK-Eten5 and ROPK-Eten6. We suggest these to be likely rhoptry kinases on the basis of sequence homology, phylogenetic placement, signal peptide presence, and existing experimental evidence. Protein or mRNA expression has been previously observed for at least one member of each of these proposed subfamilies, indicating that they are not pseudogenes. ROP47, ROP49 and ROP50 are predicted to contain a signal peptide. The gene coding for ROP48 has only been annotated in T. gondii strain ME49 (TGME49_234950, numbered TGME49_034950 in ToxoDB prior to version 8.0), but we identified genomic regions with 95% sequence identity to this protein sequence on chromosome X of strains VEG and GT1 as well. Recently, a proteomics study observed two E. tenella proteins expressed during the sporozoite stage and localized in the rhoptries: ETH_00027700, which we assigned to the ROPK-Eten1 subfamily, and ETH_00005190, which we assigned to the ROPK-Unique category [38]. A search of the available S. neurona ESTs and genomic scaffolds indicates that ROPKs are prevalent in this species as well, though we cannot assign a specific number until the assembly is complete. The subfamilies that have clear representatives in all four of the surveyed species are ROP21/27 and ROP35.
In S. neurona, rhoptries are present in the sporozoite [39] and bradyzoite [40] stages but absent from schizonts and merozoites [41]. Surprisingly, we found S. neurona genomic regions and expressed sequence tags (ESTs) from the schizont and merozoite stages that appear to code for rhoptry kinases. Of the ESTs currently available in the NCBI GenBank EST database, we identified seven putative rhoptry kinases [GenBank:BM303139.1, BM303688.1, BQ749596.1, BQ750005.1, BU085181.1, CO748650.1, CV193082.1], all obtained from the S. neurona merozoite stage, evidence that these genes are indeed expressed despite the absence of rhoptry organelles during this life stage. We also examined genomic open reading frames (ORFs) for signal peptides using the program Sig-nalP [42] and identified likely signal peptide regions and cleavage sites in several of the ORFs that we predicted to encode rhoptry kinases, suggesting that at least some of these are likely to be exported.
Both pseudokinases and catalytically active kinases appear to be prevalent throughout the ROPK family, in roughly equal numbers of subfamilies. The pseudokinase subfamilies are distributed throughout the phylogenetic tree, rather than forming any distinct clade, suggesting that the evolutionary pressures that lead to the degeneration of paralogs into pseudokinases have applied throughout the ROPK family.

Phylogenetic clustering reveals distinct sub-clades
We inferred a phylogenetic tree from the consensus sequences of each of the ROPK subfamilies to illustrate evolutionary patterns within the ROPK family ( Figure 1). Several distinct clades emerged, which we examined more specifically: rhoptry kinases with homology to the Nterminal extension (NTE) observed in ROP2, ROP8 and ROP5 structures (discussed below); an expanded clade of seven subfamilies specific to E. tenella; and a basal clade of divergent, ROPK-like protein kinases, including ROP35 and BPK1, which we refer to as ROPKL here.
Within the E. tenella-specific clade, the putative ROPK proteins ETH_00028855, ETH_00020620 and ETH_00000075, which we placed in the subfamilies Eten2b, Eten3 and Eten4, respectively, were recently observed to be expressed solely in merozoites [38]. The emergence of this gene clade reflects the significant phylogenetic and phenotypic divergence of the oocyst-forming E. tenella from the other tissue-cyst-forming coccidian species we have examined here [43]. E. tenella also contains several putative ROPKs outside this clade, more closely related to the ROPKs found in T. gondii and N. caninum, which we placed in the ROPK-Unique category (Additional file 1).
The previously identified proteins in the ROPKL clade are ROP33, ROP34, ROP35 and ROP46. The clade also contains the brazyzoite-expressed pseudokinase BPK1 [44]. The gene models of the ROPKL proteins in T. gondii ME49, the best-annotated strain, all contain at least one intron, in contrast to most other ROPK genes, which are typically encoded by a single exon. http://www.biomedcentral.com/1471-2148/13/117 Figure 1 Phylogeny of rhoptry kinase subfamilies. Predicted or known active kinases are labeled in bold text, and kinases that may have a noncanonical catalytic mechanism are marked with an asterisk. Newly proposed ROPK subfamilies are labeled in italic text. The clade indicated in red contains the ROPK subfamilies with a homologous N-terminal extension to the kinase domain (NTE). The clade in green is specific to E. tenella. The divergent "ROPKL" clade is shown in blue. Branch labels indicate bootstrap support. The grid along the right side indicates the species in which each subfamily appears: T. gondii (Tg), N. caninum (Nc), S. neurona (Sn) and E. tenella (Et).

Known or likely catalytic kinases
In our analysis, we consider the catalytically essential residues to be the aspartate in the catalytic loop ("HRD" motif, D166 PKA ) and the aspartate in the Mg-binding loop at the start of the activation segment ("DFG" motif, D184 PKA ); we categorize the ROPK subfamilies missing either of these residues as pseudokinases. Additionally important residues involved in ATP positioning or conformational changes necessary for catalytic activity include a glycine in subdomain I (G52 PKA ), lysine in subdomain II ("VAIK" motif, K72 PKA ), glutamate in subdomain III (E91 PKA ) and asparagine in the catalytic loop (N171 PKA ) [29,45,46], as well as the F-helix aspartate which positions the catalytic loop ("DxW" motif, D220 PKA ) [47]. While catalysis has been observed in kinases that lack one or more of these residues, their absence usually indicates a noncanonical mechanism or impairment of activity [31,48,49].
The subfamilies ROP11, ROP16, ROP17, ROP18, ROP19/29/38, ROP20, ROP21/27, ROP25, ROP28, ROP30, ROP31, ROP32, ROP35, ROP39 and ROP41 were previously suggested to be active kinases based on the conserved catalytic triad [10]. Phosphoryl transfer has been demonstrated experimentally for ROP18 [50] and ROP16 [24], and molecular modelling simulations have shown that ATP could dock in a typical conformation to ROP11, ROP16, ROP17 and ROP18 [36]. Our analysis additionally found the catalytically essential residues conserved in ROP33, ROP34 and ROP46, suggesting these may also be active kinases. Of the E. tenella-specific subfamilies we identified, ROPK-Eten1 also retains all of the key residues needed for catalysis ( Figure 2). http://www.biomedcentral.com/1471-2148/13/117 Figure 2 Conserved motifs of catalytically active rhoptry kinase subfamilies. Sequence logos of key regions in the kinase domain of the broader ePK superfamily and of predicted active ROPK subfamilies as they occur in the coccidian species examined. Letter height at each sequence position indicates greater conservation of that character in a multiple sequence alignment of a large set of ePK sequences (first row) and the annoted genomic sequences of each ROPK subfamily. Row labels indicate subfamily names, with the number of sequences in each alignment shown in parentheses. The ePK-conserved motifs shown are the glycine-rich loop in subdomain I, catalytic lysine in subdomain II, αC glutamate in subdomain III, catalytic loop in subdomain VIb, "DFG" in subdomain VII, "APE" in subdomain VIII, αF "DxW" in subdomain IX, and arginine in subdomain XI. The adjacent sequence sites surrounding each motif are included for context. Asterisks indicate above ePK motifs indicate the catalytic triad. Generated using the WebLogo [82] and ReportLab [83] libraries.

Known or likely pseudokinases
Kinases that lack one or more of the residues necessary for catalysis are likely to be non-catalytic pseudokinases. The apparent pseudokinase ROPK subfamilies are ROP2/8, ROP4/7, ROP5, ROP22, ROP23, ROP26, ROP36, ROP37, ROP40 and ROP42/43/44, as identified previously [10]. We include BPK1, previously noted as a T. gondii brazyzoite-expressed pseudokinase [44], in the ROPK family based on sequence similarity. Additionally, our proposed subfamilies ROP47, ROP49, ROP50, and the E. tenella-specific ROPK-Eten4, ROPK-Eten5 and ROPK-Eten6, are also missing key aspartates involved in the kinase catalytic mechanism and are likely to be pseudokinases ( Figure 3). ROP50 does have an aspartate at the HRD+3 position ( Figure 3), so in absence of a structure we cannot rule out that this nearby residue may play a compensatory role in catalysis.

Noncanonical kinases
The subfamilies ROP24, ROP45 and the proposed ROP48, ROPK-Eten2a and ROPK-Eten2b have most of the residues necessary for catalysis, but with some differences in other typically conserved residues that suggest the mechanisms may be noncanonical ( Figure 4).
In most active ePKs, an asparagine in the catalytic loop (N171 PKA ) coordinates a magnesium ion to position ATP in the active site [29]. This residue varies among some ROPKs: In ROP24, ROP45 and ROP48, the asparagine is replaced by a basic residue (lysine, histidine and lysine, respectively). The closely related E. tenellaspecific subfamilies ROPK-Eten2a and ROPK-Eten2b have the catalytic loop motifs HNDLKLDG and HNDLKLSS, respectively, each replacing the ePK-conserved asparagine with a different residue type. Such replacements are rare in catalytically active kinases; in an alignment of ePK sequences (not shown), we observed only two cases in which the "HRD" motif is conserved without the accomanying asparagine, both of which have been shown to have noncanonical catalytic mechanisms: CASK [49], which replaces the asparagine with a cysteine, and Type II PAK [51], which has a serine.
The ePK-conserved lysine in subdomain II (β3) is replaced with arginine in ROP45, ROPK-Eten2a and ROPK-Eten2b, though the conserved C-helix glutamate is retained, suggesting the necessary salt bridge could still form in the active state of these kinase as in other ePKs. In ROP24, however, the lysine is retained but the corresponding C-helix glutamate is instead alanine, http://www.biomedcentral.com/1471-2148/13/117 precluding a salt bridge. The DFG motif is replaced with the sequence GFT, though a potentially compensatory acidic residue appears at the DFG+1 position. These observations suggest that the activation mechanism [52,53] in ROP24 could be different from that of other ePKs. ROP48 retains the β3 lysine, αC glutamate and DFG motif; however, the substrate-binding lobe is quite divergent, with a dramatically shortened activation loop and F-helix, and the F-helix DxW motif is replaced with ESS, which suggests that the positioning of the catalytic loop occurs differently from other ePKs.
The E. tenella-specific subfamily ROPK-Eten3, in contrast to all the other identified ROPK subfamilies, appears to comprise both active and inactive kinases. The locus appears as a tandem repeat of 5 similar genes, with pairwise identity ranging from 32% to 52% (mean 41%), only one of which (ETH_00020585) retains the key residues indicating catalytic function ( Figure 4).

ROPK-conserved inserts within the protein kinase domain
ROPK-and subfamily-specific inserts within the kinase domain are widespread, suggesting unique functional adaptations [36,37,50]. We found six conserved inserts in the ROPK domain relative to the PK domain ( Figure 5). They are: The ROPKLs appear to have large deletions in this region, and may be missing the αG helix structure altogether. We note that the αG-αH loop is extended in many other protein kinases, most notably CMGC kinases [54].

Distinguishing ROPK-specific conserved sites in the protein kinase domain, and corresponding structural features
We evaluated shifts in site-specific residue conservation between the ROPK family and overall PK superfamily by performing a goodness-of-fit test of residue frequences in the two sequence sets at each aligned column of the PK domain (see Methods). The same comparisons were also performed with each subfamily versus the other ROPKs (Additional file 2).

Hinge region
The most statistically significant sites distinguishing ROPKs from PKs overall are in the kinase hinge region. The residue P358 ROP2 is typically a glutamate in most eukaryotic protein kinases (e.g. E121 PKA , E565 FGFR ), where it contributes to the opening/closing motion of the kinase during activation by forming a lobe-bridging salt bridge interaction [55]. In fibroblast growth factor receptor kinase (FGFR), for example, the equivalent residue E565 hydrogen-bonds with K641 in the β8 strand conditionally upon phosphorylation of the FGFR activation loop [56] (Figure 6D,E). In ROP2, the residues equivalent to E565 and K641 are P358 and F424, respectively ( Figure 6A,B). Since proline and phenylalanine are not charged residues, the ROP2 structure is incapable of forming the same interaction. The residue P358 ROP2 is conserved as a proline throughout most of the ROPK family, with the exception of subfamilies ROP18 (methionine), ROP21/27 (aspartate, though a Phe appears in the β8 strand), ROP26 (serine), ROP32 (histidine), ROP41 (lysine), and the E. tenella-specific subfamilies (retained as glutamate, though only ROPK-Eten1 also retains a basic residue in the β8 strand) (Additional file 2).
The residues at sites P358 ROP2 and P326 ROP2 appear to have instead taken on another structural role. In ROPKs, the residue immediately N-terminal to P358 ROP2 , a site known as the kinase "gatekeeper" residue, is a large, usually hydrophobic residue oriented toward the αC'-β4 strand and, in the ROP2 structure, packing against the ROPK-conserved P326; the hydrophobic residue immediately N-terminal to P326 (most commonly valine but also varyingly leucine, alanine, phenylalanine, isoleucine and methionine in ROPKs) is likewise oriented toward the linker in the ROP2 structure, packing against P358 ( Figure 6C). These four residues thus form a stable packing "box" bridging the αC'-β4 and β5-αD loops.

F-helix "WC" motif and disulfide bridge
A distinctive "WC" motif appears at the end of the αF helix (Figure 7) in most ROPKs. The cysteine (C485 ROP2 ), together with another ROPK-conserved cysteine (C506 ROP2 ) [9] in the αG-αH insert described above, forms a disulfide bond which has been proposed to stabilize the two helices [50]. The tryptophan (W484 ROP2 ) appears to pack against the extended αD and αE helices, pushing the αE helix futher outward. Thus the "WC" motif couples two ROPK-specific inserts to the substratebinding lobe of the kinase core. There are no other known http://www.biomedcentral.com/1471-2148/13/117 protein kinase families or subfamilies in which cysteines at the end of the F-helix and in the αG-αH loop co-occur in positions that could potentially interact. Additionally, both the WC motif and the αG-αH cysteine are absent from the E. tenella and ROPKL clades. Another site in the αF helix (W482 ROP2 ) is conserved as a glutamate in most ePKs (E230 PKA ), but unconserved in ROPKs, suggesting that a selective constraint that conserves glutamate at this site in most ePKs has been lost in the ROPK family. In at least some other ePKs, it appears that this glutamate can interact with a basic residue on the polar/charged surface of the amphipathic αD helix (R133 PKA ), as well as a conserved tyrosine in the P+1 pocket (Y204 PKA ) at the end of the activation segment http://www.biomedcentral.com/1471-2148/13/117 ( Figure 7D,E). Notably, the mutation of E230 to glutamine in PKA not only disrupted substrate recognition and phosphoryl transfer, but also resulted in higher temperature factors in the αD helix, particularly in R133 [57]. However, in ROPKs the interaction between the F and D helices occurs somewhat differently: in ROP5, R455 interacts with E345 and Y427, and in ROP2, W482 packs with H365, while the P+1-pocket Tyr replaced by F446, a side chain not capable of hydrogen bonding ( Figure 7A,B).

N-terminal extension to the protein kinase domain
Structural studies of ROP2, ROP8 and ROP5 revealed another feature common to each of these proteins, an N-terminal extension (NTE) to the canonical protein kinase domain consisting of at least two additional helices and a beta sheet, with the region between the two helices varying between ROP2/8 and ROP5 [36,37,50]. The NTE has also been suggested to be present in ROP18, ROP4/7 and ROP17 based on sequence homology, though its presence does not appear to be universal among rhoptry kinases [37,50]. We investigated the distinguishing features of NTE-containing rhoptry kinases to determine whether other ROPKs may also contain the NTE, and to look for additional conserved features that characterize this gene clade (see Methods). In addition to ROP2/8 and ROP5, we found significant matches in ROP4/7, ROP17 and ROP18, as expected, and also a number of additional subfamilies which appear to form a clade ( Figure 1): ROP23, ROP24 (originally known as ROP2L8 [6]), ROP31, ROP40, ROP42/43/44, and the proposed ROP47. Four proteins in the ROPK-Unique (species-specific) category also showed evidence for NTE homology: TGME49_296000 (TGME49_096000 in ToxoDB prior to version 8.0), also known as ROP2L12 and previously identified as a pseudogene [6]; its orthologs TGVEG_050080 and TGGT1_054010; and the E. tenella protein ETH_00005190. A small number of sites in the NTE sequence region show strong conservation (Figure 8). http://www.biomedcentral.com/1471-2148/13/117 Having identified the NTE-bearing clade, we then compared this clade to all other identifid ROPKs to identify clade-specific residue conservation patterns. In the solved structures of ROP2, ROP8 and ROP5, several of these distinctive sites in the NTE clade are spatially located around the NTE itself, primarily near the conserved β0 and α' secondary structure elements. In ROP2, V330 and P333 in the β4 sheet β4-β4' loop are positioned on either side of the β0 sheet of the NTE, close to the conserved S244; in ROP5, the equivalent residues are V310 and Q313. In each of the solved crystal structures of ROP2 [PDB:2W1Z], ROP8 [PDB:3BYV] and ROP5 [PDB:3Q60], the β0 sheet passes directly between these two side chains, suggesting a structural selective constraint in NTE-bearing ROPKs.
Three significantly contrasting sites in the E-helix may also have some bearing on the NTE conformation or placement: H378 near the αE N-terminus, oriented toward the NTE in the ROP2 structure [PDB:2W1Z]; V382, a small, nonpolar residue oriented toward the extended αD; and Q388 in the middle of the αE helix, where in the ROP2 structure it interacts with the backbone of the conserved G198 at the N-terminus of the NTE α' -though in the ROP5 structure the equivalent residue is I368 which despite having the same orientation cannot form an identical interaction.
Also in the αE helix, a hydrophobic residue (L391 ROP2 , A371 ROP5 ), in place of a usually basic residue outside the NTE clade, is oriented toward a helix which extends beyond the kinase C-terminus in the ROP2, ROP8 and ROP5 structures, previously described as the αH' helix [36]. Though this short, weakly conserved region is difficult to detect by sequence analysis, the conservation of the hydrophobic residue in the αE helix and the presence of this helix in the available structures does suggest a correlation between the presence of the NTE and C-terminal αH' helix.

Discussion
We classified the ROPKs into likely active kinases, likely pseudokinases, and predicted kinases that may be active, but with a noncanonical catalytic mechanism, based on differences in ePK-conserved residues surrounding the ATP binding pocket. Our alignment shows that conserved residues in or near the key ePK-conserved motifs, including the histidine of the canonical "HRD" motifs, are well aligned for each of these categories, so it is unlikely that the absence of the key aspartates in predicted pseudokinases is due to misalignment. Structural investigation of the unusual motifs in noncanonical subfamilies ROP24 and ROP45 in T. gondii could reveal novel kinase mechanisms of activation, ATP positioning and catalysis. Relatedly, analysis of the equivalent motifs in the ROPK pseudokinases could improve our understanding of pseudokinases in general.
Our phylogenetic tree of ROPK subfamilies revealed three specific clades of interest: the NTE-bearing ROPKs, the only clade for which crystal structures have been solved or even homology models reliably constructed; an E. tenella-specific expansion of ROPKs; and the divergent, intron-bearing ROPKLs. Notably, each of these clades contains both predicted active kinases and pseudokinases, indicating a pattern of evolution in which, in a parsimonious interpretation, pseudokinases repeatedly emerge from an ancestral state shared with active kinases, rather than a single or small number of expansions of pseudokinases.
We were unable to find conclusive published evidence that the ROPKL proteins are indeed localized to the rhoptry during the tachyzoite stage of coccidians and expelled during invasion at the same time and through the same mechanism as other ROPKs. ROP35 protein expression has been detected during the T. gondii tachyzoite stage [59] and the E. tenella merozoite stage (ETH_00005905) [38]. Signal peptides were predicted for ROP33, ROP50 and BPK1, but not ROP35, while the gene models of ROP34 and ROP46 contain a short or nonexistent N-tail to the kinase domain which could indicate a trunctated gene model. However, transcription levels across the cell cycle do not match the distinctive two-peaked pattern of T. gondii rhoptry proteins in any of the T. gondii ROPKLs [60]; the secretory organelle of BPK1 was not identified in the study that described the protein [44]. Our HMM profile search and gene trees indicated that the ROPKL proteins show stronger sequence similarity to typical ROPKs http://www.biomedcentral.com/1471-2148/13/117 than to any other characterized protein kinase family, leaving open the question of how deep their functional similarity goes.
A common theme we observe in structural features unique to the ROPK family is the interaction between ROPK-specific inserts or structural motifs, including the N-terminal extension (NTE), and conserved sites within the kinase domain that show contrasting selection in ROPKs. Two regions in particular, the kinase hinge region surrounding the αC' helix and and the dusulphide bridge at the end of the αF helix, suggest several possible functional or mechanistic consequences.
Our observations in the ROPK hinge region raise several hypotheses. The αC' insert in the αC-β4 loop has possible structural analogues in other kinases. The vacciniarelated kinase (VRK) family has a similar insert which packs hydrophobically against the αE helix and was proposed to promote a closed conformation of the kinase domain in the pseudokinase VRK3 [61]; the authors of that study suggested that related active kinases that retain the same feature would be constitutively active. Comparison of the structure of VRK3 [PDB:2JII] to that of ROP2 [PDB:2WIZ] indicates that the ROPK-conserved site L396 ROP2 (Figure 6A,B) may perform a similar role to the VRK3-conserved F296 VRK3 in hydrophobically coupling the two lobes of the kinase domain. Interestingly, the ATP-bound and apo structures of the pseudokinase ROP5 show very little overall conformational change [37]. As another example, crystal structures of the yeast SRPK protein Sky1 conserve a short αC' helix insert, and the flexibility of this region is indicated to be critical for interlobe closure [62]. Together with the ROPK-specific conservation of prolines in the αC-β4 loop and linker, this could indicate the possibility that these differences modulate interlobe closure (the kinase hinging mechanism) in ROPKs.
Another hypothesis regarding the function of the αC' helix, not necessarily conflicting with the above hypothesis, is that it could serve as a binding interface or proteinprotein interaction site. We observed that the αC' helix does not pack hydrophobically against the N-lobe of the kinase domain in the available ROP2 structures; instead, there appear to be water molecules in between [PDB: 3Q60] [37]. The B-factors are somewhat higher than in the immediately surrounding areas, and the symmetry of the ROP2 structure suggests that the insert may have been stabilized in this structure by crystal packing. Given that the same region is disordered in the available ROP5 structures, it appears possible that αC' may be relatively flexible, capable of unfolding from the helical secondary structure into a mobile loop. For comparison, in VRK3, a surface patch centered on the αC-αC' region has been proposed as a binding site [61].
In the kinase C-lobe, a pair of ROPK-conserved cysteines form a disulfide bridge between the end of the αF helix and the αG-αH loop, which is extended in most ROPKs. A conserved tryptophan adjacent to the αF cysteine packs hydrophobically against the αD and αE helices, which are also extended in ROPKs; thus the "WC" motif appears to couple both ROPK inserts to the kinase C-lobe. Notably, this stabilization occurs in the surface region of the protein that was identified as polymorphic between ROP5 alleles in T. gondii [37], and was recently shown to be the interface of an interaction with the host (mouse) immunity-related GTPase (IRG) protein [19]. Reese et al. proposed an allosteric network involving the NTE and αF helix to link the polymorphic surfaces in the C-lobe and kinase active site in ROP5 [37]. The variability of this site in ROPKs may therefore be justified by its involvement in that network, which itself appears to be variable in ROPKs. We can hypothesize that, at least in ROP5, the increased structural stability provided by the WC motif in this region permits these subfamilyspecific mutations to proliferate at this surface without compromising the folding or stability of the kinase C-lobe [63]. This hypothesis assumes that the disulfide bridge is indeed maintained throughout the lifespan of the protein; while it appears as such in the available solved structures, we note that once the protein is inside the host cell, the cytosolic environment is not conducive to disulfide bond formation. The two cysteines involved are co-conserved in not only the PVM-associated ROP2, ROP8, ROP5 and ROP18, but also ROP16, which has been shown to be localized to the host nucleus [22], among other ROPKs.
We also searched for sites that showed conservation specific to the NTE-bearing ROPK clade, rather than ROPKs as a whole. Interestingly, only a small number of strongly contrasting sites emerged as specific to this clade. This could indicate that the mechanistic roles of the NTE vary across even the NTE-bearing clade of ROPKs.
More structural information will be essential to further understand the ROPK family. Currently, only ROPKs from the ROP2/8 and ROP5 subfamilies within the NTE clade have been solved [36,37,50]. While these structures have been invaluable in understanding ROPK mechanisms and possible functions, the low sequence identity and presence of indels across subfamilies makes it difficult to produce reliable homology models for ROPK subfamilies outside this clade. We can suggest several important ROPKs outside the NTE clade which appear to be active kinases, are highly expressed [10], and from which we could gain important insights from the solved crystal structure. ROP16 was indirectly implicated in virulence differences between T. gondii strains in mice [15], and also shown to to modulate the host STAT3 and STAT6 pathway response [22][23][24][25][26], but the precise mechanisms of this action remain to be discovered. Peixoto et al. [10] found evidence that ROP38 is involved in modulating the MAPK cascade; the ROP19/29/38 subfamily was also found to be independently duplicated in T. gondii and N. caninum, thus the other subfamily members could easily be modeled if a ROP38 structure were available. Finally, ROP35 is a representative member of the divergent, poorly understood ROPKL clade; the presence of several indels relative to other ROPKs at structurally important locations in the sequence suggest that a crystal structure would almost certainly reveal surprising variations on the ePK fold and catalytic mechanisms.

Conclusion
In this study, we developed novel bioinformatic methods to study patterns of diversification and neofunctionalization in the rhoptry kinase family, and integrated the results of a systematic, multi-species analysis with the structural context provided by the solved structures. Our phylogenetic analysis revealed a subfamily-level structure shared across species, as well as lineage-specific expansions within the ROPK family and three distinct subclades of ROPK. We applied general knowledge of protein kinase mechanisms to categorize each rhoptry kinase as a likely active, likely pseudokinase, or potentially active but with an atypical catalytic mechanism. We determined the sequence and structural features that distinguish these subfamilies from each other, as well as those that distinguish the ROPK family as a whole from typical ePKs. Where possible, ROPK-specific motifs were placed into structural context to develop functional hypotheses.
This work sheds light on several important but previously unrecognized features shared among rhoptry kinases, as well as the essential differences between active and degenerate protein kinases or pseudokinases. Our studies provide specific hypothesis for further characterizing ROPK structure and function and also inform ongoing efforts to design protein kinase inhibitors for global diseases caused by coccidian parasites.

Data collection
The sequences of translated gene models, unannotated genomes and ESTs from the species Toxoplasma gondii, Neospora caninum, Eimeria tenella were retrieved from ToxoDB version 8.1 [64]. Pre-release genomic sequences and ESTs of Sarcocystis neurona were provided by the laboratories of Dan Howe, Christopher Schardl and Jessica Kissinger.
After constructing the initial ROPK subfamily profiles (below), additional ROPK sequences were identified in the NCBI databases est_others and nr and added to the profiles. To obtain putative ROPK sequences from the unannotated T. gondii and S. neurona genomes, we used the program exonerate (https://www.ebi.ac.uk/~g uy/exonerate/; also see [65]) to align the ROPK subfamily consensus sequences to each genome scaffold sequence, omitting introns according to likely splice sites. A script using Biopython [66] was then used to extract the highestscoring putative protein sequences from the exonerate output and combine identical sequences and sequence fragments.

Subfamily classification
We previously constructed a database of HMM profiles for every known protein kinase family and subfamily defined in KinBase [67], as well as several apicomplexanspecific kinase families [11]. The ROPK profile in this set was initially constructed from annotated ROPK sequences in ToxoDB, similar to the techinique described by Peixoto et al. [10]. Sequences were aligned using MAFFT version 6.940 [68] with a "seed" alignment of the protein kinase domain constructed using published PDB structures [PDB: 2W1Z, 3BYV, 3DZO, 3Q5Z, 3Q60] [36,37,50] and the structure alignment program TM-align (May 2012 release) [69]. Finally, HMM profiles were constructed from each sequence alignment and compiled into an HMM profile database (Additional file 3). We used this HMM profile database to search the protein and translated EST sequences described in the previous section; those which scored as stronger matches to the ROPKspecific HMM profile than to our ePK profiles were taken as an initial set of putative rhoptry kinases.
We developed a program called Fammer to partially automate the construction and curation of hierarchical protein subfamily sequence profiles for use with HMMer 3.0 [70] and MAPGAPS 1.0 [71], and to use these HMM and MAPGAPS profiles for sequence search, classification and alignment. The Fammer software package, including source code, documentation and the ROPK sequence profiles used in this study, is available at http://github.com/ etal/fammer.
The full-length ROPK sequences identified in each annotated coccidian genome and translated EST set were clustered using OrthoMCL version 2.0.3 [72]. We manually trimmed the sequences in each OrthoMCL cluster to the canonical protein kinase domain and aligned the sequence sets with Fammer version 0.1.0 to create an initial set of ROPK subfamily profiles, as well as a set of "unique" or orphan ROPKs which matched the ROPK HMM profile but were not placed into a larger cluster by OrthoMCL.
Iteratively, we performed the following steps to refine the ROPK subfamily classification. We constructed a phylogenetic tree of the consensus sequences of each putative ROPK subfamily, using FastTree version 2.1.5 [73], and merged ortholog groups which were separated by short branches in the tree and, for subfamilies that appeared in multiple copies within a single genome http://www.biomedcentral.com/1471-2148/13/117 (e.g. ROP2/8, ROPK-Eten3), showed co-localization in the chromosome. Existing descriptions of the annotated T. gondii proteins were used to assign names to subfamilies. Unannotated subfamilies that were phylogenetically placed basally to the known ROPKs, indicating closer relationship to other ePKs, were removed. We visually inspected each subfamily sequence set for potential outlier sequences, on the basis of conserved motifs in key regions of the kinase domain, and moved any of these to the "unique" sequence set. We used the Fammer build command to realign all sequences and to construct an HMM profile database of all subfamily profiles, then used this database with the Fammer scan command to reclassify the "unique" or outlier ROPK sequences. We included a profile of non-ROPK protein kinase sequences in this HMM database in order to identify and remove false positives in the "unique" set as well as subsequent searches of the coccidian proteome, genome and EST sequences. Finally, we used the Fammer refine command to perform leave-one-out validation of each subfamily profile versus the "unique" sequence set, following the approach described by Hedlund et al. [74]. This process yielded 42 stable subfamilies of ROPK, along with a "ROPK-Unique" profile set of unclassified orphan sequences. We then identified the ROPK complement in each annotated proteome by running the Fammer scan command with the final ROPK HMM profile database, each coccidian species' proteome sequences, and an expectation-value cutoff of 10 −10 .

Subfamily tree inference
We used the curated alignment of consensus sequences from each ROPK subfamily profile and the non-ROPK protein kinase profile as input to infer phylogenic trees. To quickly examine the structure of the ROPK family during profile refinement, we used FastTree [73] with the WAG scoring matrix, gamma model of rate variation and pseudocount correction for gaps. To infer the final tree shown in Figure 1, we first used the GUIDANCE server [75] with 100 replicates of PRANK and removed columns with less than 5% support, in order to remove alignment columns that were likely to have been misaligned while retaining most of the potentially phylogenetically informative columns. We then used a script to remove columns that were more than 30% gap characters. This filtering yielded an alignment of 279 columns, slightly less than the length of the top-level ROPK HMM profile (288 columns). We inferred the tree from this alignment using PhyML (December 2011 release) [76], with the LG scoring matrix, gamma model of rate variation, empirically estimated amino acid frequencies and 100 bootstrap runs, taking the output of FastTree as the user-supplied starting tree. Finally, we used script based on the Bio.Phylo module of Biopython [77] to reroot the tree with ePK as the outgroup, collapse all splits with less than 25% bootstrap support, colorize the specific clades of interest and visualize the tree. The alignment of subfamily consensus sequences and the inferred tree have been deposited in TreeBase (http://www.treebase.org/; Study ID: 14212).

Analysis of evolutionary constraints
To identify sites of contrasting conservation between ROPK subfamilies, and between all ROPKs and the broader protein kinase superfamily, we compared aligned sites between two given sequence sets by applying a multinomial log-likelihood test (G-test) [78] of the residue compositions of each column in the two sets. The test statistic G is derived from the frequencies of each amino acid type as observed in the "foreground" set, O i , and as expected based on the "background" set, E i , including pseudocounts taken from the amino acid frequencies of the full alignment.
To adjust for the non-independence of sequences in each set due to phylogenetic relatedness, the aligned sequences in each set are weighted according to the Henikoff heuristic [79], and the amino acid counts in each column are adjusted according to these sequence weights, an approach also used in PSI-BLAST [80]. The test statistic G follows the chi-squared distribution with 19 degrees of freedom (for the 20 amino acid types).
We implemented this test in a program called CladeCompare, available at http://github.com/etal/ cladecompare. The output of the program includes (i) a table of the probabilities (p-values) of each site in the combined alignment, (ii) a list of the significantly contrasting sites after adjusting for multiple testing using the Benjamini-Hochberg false discovery rate method [81], and (iii) images of paired "background" and "foreground" sequence logos to illustrate the contrast at significant sites, generated using the WebLogo [82] and ReportLab [83] libraries.

Detection of the N-terminal extension in additional subfamilies
To identify which ROPK subfamilies share sequence homology to the NTE region observed in the ROP2, ROP8 and ROP5 structures, and suggested to be present in ROP18, ROP4/7 and ROP17, we used the CHAIN program [84] with the previously identified NTE-bearing sequences as the query set and the complete set of fulllength ROPK sequences as the main set. CHAIN identified a "foreground" partition corresponding to the clade highlighted in Figure 1.
We then constructed an alignment of the sequence regions N-terminal to the kinase domain in the identified http://www.biomedcentral.com/1471-2148/13/117 using the "accurate" mode of T-Coffee [85], built an HMM profile from this alignment, and used HMMer 3.0 [70] to search the full-length ROPK sequences. This recovered the same ROPK subfamilies identified by CHAIN, confirming the presence of homologous NTE regions in those subfamilies.

Structural analysis
Sites of interest were mapped onto PDB protein structures with a script and visualized in PyMOL [86] for manual inspection.

Availability of supporting data
The data sets supporting the results of this article are available in the TreeBase repository, http://purl.org/ phylo/treebase/phylows/study/TB2:S14212.