Exploring the evolution of protein function in Archaea
© Goncearenco and Berezovsky; licensee BioMed Central Ltd. 2012
Received: 24 November 2011
Accepted: 24 February 2012
Published: 30 May 2012
Despite recent progress in studies of the evolution of protein function, two questions remain unresolved: what the first functional protein domains were, and what their basic building blocks were. Previously, we introduced the concept of elementary functional loops (EFLs), the functional units of enzymes that provide elementary reactions in biochemical transformations. They are presumably descendants of primordial catalytic peptides.
We analyzed distant evolutionary connections between protein functions in Archaea based on the EFLs comprising them. We show examples of the involvement of EFLs in new functional domains, as well as reutilization of EFLs and functional domains in building multidomain structures and protein complexes.
Our analysis of the archaeal superkingdom yields the dominating mechanisms in different periods of protein evolution, which resulted in several levels of the organization of biochemical function. First, functional domains emerged as combinations of prebiotic peptides with the very basic functions, such as nucleotide/phosphate and metal cofactor binding. Second, domain recombination brought to the evolutionary scene the multidomain proteins and complexes. Later, reutilization and de novo design of functional domains and elementary functional loops complemented evolution of protein function.
Protein evolution, and the evolution of protein function in particular, is a long-standing topic of keen interest in both its experimental and theoretical aspects [1–3]. Recent advances in genomics and proteomics have provided a wealth of sequences and structures, making it possible to unravel intricate evolutionary connections in the realm of protein function. Specifically, it has become feasible to follow in detail the convergence and divergence of protein function in cases of speciation and adaptation [4, 5], switching between natural and latent enzymatic activities, the evolution of promiscuous functions, and the recombination of functional domains into proteins with new functions. There is still, however, an enigmatic question about the very emergence of the first enzymatic domains from primordial functional peptides. The ultimate goal would be to draw a picture of the emergence of functional domains/folds, their fate upon formation of proteomes, and their involvement in adaptation and speciation. First, it should be understood how protein structure started from combining primitive peptides/proteins with elementary functions into folds with complex enzymatic activities. Then, the fusion and recombination of these folds into multidomain structures and protein complexes should be explored. Further, the reutilization of already existing structures and the invention of new domains/folds with unique functions should be analyzed.
In order to dig as deep as the emergence of the first enzymatic domains/folds, one has to hypothesize short peptides that preceded enzymes in the protein-RNA world. The existence of conserved functional motifs [10–12] in a large number of protein superfamilies suggests that they originated from ancestral peptides. Though severely changed, structural and functional “signatures” of these ancestors could survive in contemporary functional motifs. The first task, therefore, would be to define the unit of protein function and to use this definition for decomposing contemporary enzymatic functions into sets of elementary ones. Structurally, it has been shown that closed loops (or polypeptide chain returns) with a characteristic size of 25–30 residues can be a common basic structural element of all globular proteins [13–18]. This element is apparently a consequence of the polymer nature of polypeptide chains. Closed loops in modern proteins are also presumed to be units of protein domains, playing an important role in co-translational protein folding [20–22]. Functionally, the notion of elementary functional loops (EFLs), closed loops possessing the residues important for binding, activation, and catalysis, has been introduced [10, 23, 24]. The EFLs are presumably descendants of primordial ring-like functional peptides of the protein-RNA world, which can be reconstructed in the form of sequence profiles with the specific functional signature(s) and structure(s) of the closed loop (polypeptide chain return). The same elementary function can serve as a unit of different enzymes, forming their biochemical functions in combination with other EFLs. As a result, descendants of a particular prototype can be found in unrelated folds and functions. Therefore, evolutionary connections unraveled by prototypes and EFLs go beyond homology on the functional superfamily level, illuminating the very process of building functional domains from the elementary units [10, 11].
We use here the archaeal superkingdom as a model system for exploring the emergence and molecular evolution of protein function. Archaea is an ancient superkingdom and has a compact structure with a clear division into four phyla: Crenarchaeota, Euryarchaeota, Korarchaeota, and Nanoarchaeota. Archaeal species thrive in different extreme environments, such as high temperature and salinity, and most of them are anaerobes. Together, these features allow us to explore the emergence of new functions in the process of speciation as well as in response to the demands of the environment. One can trace, for example, evolutionary relations between the most common and ancient functions and those that emerged later. The earlier introduced notions of the archaeal “core” and “shell” and of the Last Archaeal Common Ancestor (LACA) provide an excellent background for this analysis. The database of the archaeal Clusters of Orthologous Groups of proteins (arCOGs) gives the basic functional annotation for each cluster as well as its distribution across the archaeal lineages.
Some archaeal species possess unique enzymatic functions and even unique metabolic pathways. For example, methanogenesis is a unique pathway found only in the archaeal divisions Methanobacteriales, Methanococcales, Methanomicrobiales, and Methanosarcina, which belong to the phylum Euryarchaeota. Methanogens reduce carbon-containing compounds, such as CO2, formate, and acetate, to methane under anaerobic conditions. Overall, the carbon is transferred sequentially between three carbon carriers: methanofuran (MF), tetrahydromethanopterin (H4MPT), and coenzyme M (CoM-SH), via seven major enzymatic steps of the methanogenesis pathway. These steps are very similar in all kinds of methanogens, and the main methanogenic enzymes are oxidoreductases and transferases. One-carbon metabolism is considered to be one of the most ancient, and is presumably of prebiotic origin. At the same time, genomic and geological evidence suggests that the methanogenesis pathway evolved at rather late stages of archaeal evolution (2.8 billion years ago). In total, more than 200 genes are required for methane formation. The majority of the proteins coded by these genes are involved in the biosynthesis of various coenzymes and cofactors, the synthesis of prosthetic groups, and ion transport. Despite the diversity of enzymes involved in methanogenesis, they all evolved from one ancestral set of enzymes. We analyze enzymes of the methanogenic pathways, identify structural folds and elementary functional loops, and discuss a plausible scenario of their emergence.
We analyze proteins involved in the methanogenesis pathway by using the set of profiles of elementary functional loops (EFLs) obtained for the whole archaeal superkingdom (the complete list of profiles is provided in Additional File 2). Elementary functional loops are represented by sequence profiles in the form of 30-residue long position-specific scoring matrices (PSSMs). Additional File 3: Figure S1 shows the methanogenesis-related arCOGs and their connections to the non-methanogenic ones via profiles of EFLs. Since arCOGs are in many cases multidomain proteins or protein complexes (e.g. the methanogenic enzymes formyl-MF dehydrogenase Fmd/Fwd, H4MPT S-methyltransferase Mtr, and methyl-CoM reductase Mcr), we split them into individual domains and consider the functions and evolutionary connections of each domain separately (Additional File 3: Figure S2). There are novel folds, such as the folds of the methenyl-H4MPT cyclohydrolase (Mch, 3rd step) and N(5)-N(10)-methenyl-H4MPT dehydrogenase (Mtd, 4th step in the pathway) enzymes, which emerged in response to the demand for a new/specific function. Highly designable folds, such as the β/α-barrel, the Rossmann fold, and the ferredoxin fold, are abundant in the methanogenic enzymes. For example, the β/α-barrel fold is present in Fmd/Fwd subunit A (catalyzing the 1st step in the pathway) and Mer (5th step); the ferredoxin fold is present in Ftr (2nd step) and in the other [Fe-S] cluster-containing enzymes, such as McrA and Fmd; and the Rossmann fold is present in the Hmd enzyme (4th step). There are also several rare coenzymes and cofactors working almost exclusively in methanogenesis, such as molybdopterin/tungsten-pterin (MPT), coenzyme B (CoB-SH), coenzyme F420, and the corrinoid cofactor F430. Despite their unique chemistry, some of these cofactors belong to wide groups of structurally similar chemical compounds. They may therefore invoke similar chemistry of recognition and binding, resulting in common elementary functional loops. In this case, the corresponding EFLs can be reused as building blocks of new folds and biochemical functions. There are also cases where several folds fuse and make up multidomain enzymes, e.g. the unique two-domain structure of Mch. Some folds can assemble into protein complexes, such as the homohexamer of the above-mentioned enzyme.
We consider here protein function starting from the level of elementary functional loops and proceeding to functional domains and their combinations in multidomain proteins and complexes. Below we analyze the enzymes catalyzing the first (Fwd) and the last (Mcr) steps of the methanogenic pathway, and the heterosulfide reductase (Hdr) enzyme, which links these steps by reducing the cofactors involved in them. For details on these enzymes, see the description in Additional File 3. The goal of this part is to show different ways in which function emerges and evolves, such as the reutilization of abundant folds in forming specific methanogenic enzymes, the role of elementary functional loops as building blocks of new enzymes, and the formation of protein complexes.
The Hdr enzyme is a protein complex composed, in the most general case, of three subunits: HdrA, HdrB, and HdrC. In Methanothermobacter, HdrABC forms a complex with another enzyme, the [NiFe]-hydrogenase Mvh, and acts as an electron acceptor. HdrABC uses the electrons obtained from Mvh to reduce ferredoxin and heterosulfide. The structure of heterosulfide reductase (Hdr) has not been resolved yet; however, it is possible to explore the functions of its subunits using sequence profiles of the elementary functional loops. Subunit A of Hdr contains several ferredoxin reductase-type FAD-binding motifs. In the -[RH]x[FY][TS]- motif, the R/H residue forms hydrogen bonds to the phosphate oxygen atom. Additionally, HdrA contains four motifs for binding [4Fe-4S] clusters with the common signature -CxxCxxCxxxC-, represented by profile 500. The C subunit of Hdr contains two [4Fe-4S] cluster-binding EFLs with the same signature as in subunit A. HdrB is the catalytic domain, which uses another [4Fe-4S] cluster (bound by a different cysteine-rich signature) and also contains several redox-active cysteine residues.
The task of this part of the paper is to find evolutionary connections going beyond homology in enzyme superfamilies. Three main reasons for evolutionary connections between arCOGs and their functions can be named: (i) domain fusion and recombination, as some arCOGs are multidomain proteins and/or protein complexes; (ii) distant homology, where proteins in connected arCOGs diverged from the same ancestral domain/fold; (iii) common elementary functions present in different non-homologous arCOGs. The latter describes, for instance, common steps in biochemical transformations or the binding of chemically similar substrates or cofactors. The connections between proteins sharing elementary functions can originate from primordial evolution, hence they are not restricted to a (super)family or fold. We consider here functional domains and the links between them provided by the elementary functional loops and their prototypes. Functions of individual domains rather than those of whole proteins are analyzed, since the first enzymatic domains were presumably formed from simple primordial peptides with elementary functions. We start from the arCOGs in the archaeal core (preferably single-domain arCOGs), as they represent the most common protein functions (Figure 1). Then we identify the elementary functional loops that served as building blocks of these domains and provided key steps of their biochemical functions. Below we show how EFLs unravel intricate connections between protein superfamilies with different biochemical functions.
Two clusters of arCOGs representing aminoacyl-tRNA synthetases (aaRS) are determined based on the sets of the corresponding profiles. The catalytic domains of aaRS are highly conserved, and there are several distinct signatures connecting synthetases of the same class. For example, class I aaRS (see Figure 4, cluster 1) is characterized by profile 292 with the signature -GxKMSKSxG-. The elementary functional loop containing this characteristic signature is a part of the active site in class I aaRS, where the second lysine stabilizes the aminoacyl-adenylate. Figure 5 shows the structure of the corresponding EFL. The elementary function of this EFL is the recognition and binding of the adenine moiety via a hydrogen bond. Additionally, class I aaRS are connected by the elementary functions of heavy metal and ion transport possessed by profile 177 with the signature -GDGxxD-. This functional signature describes interactions with Ca2+ and Mg2+ ions (according to the PDBeMotif database of protein-ligand interactions). Class II aaRS are also interconnected by several typical profiles (Figure 4, cluster 4). For example, profile 280 has a characteristic glycine-rich signature, -Px(2)GxGxGxxRL-, similar to the nucleotide-binding signatures. Figure 5 contains an example of the EFL corresponding to profile 280, where an elementary functional loop from a lysyl-tRNA synthetase is shown together with the substrates lysine and ATP.
It is known that structural repeats are typical for many proteins, including β-propellers, PKD domains, WD40 domains, and cell surface proteins [40–42]. Although the corresponding arCOGs are not related functionally, profiles 251 (-WxFGDGx(11)Y-) and 191 (-PxIxGx(2)IVWxD-) represent repeating motifs that serve as conserved structural building blocks (Figure 4, cluster 2). Figure 5 illustrates an example of a typical structural repeat comprising the immunoglobulin-like fold of a PKD domain with the signature of profile 251.
Transcriptional regulators belonging to different families (DtxR, HTH Helix-turn-helix, TrmB, cBS, Lrp/AsnC transcription initiation factors) share several functional signatures (Figure 4a, cluster 3). One of them is exemplified by profile 423 with the signature -IAx(9)Vx(6)LxxxGxV-. The “HTH regulator fused to ferredoxin domain” (green circle) is an example of a connection that results from domain fusion. In this case the arCOG is a multidomain protein. Figure 5 shows the catabolite gene activator protein (CAP) as a complex of the transcription factor with DNA. The EFL corresponding to profile 423 (green) provides the interface between the protein and DNA. Presumably, this elementary function of DNA binding is also used by the other arCOGs where matches of profile 423 were found.
Helicases, ATPases involved in replication, replicative SFII helicases, and recombinases have common elementary functional loops responsible for interactions with nucleic acids (Figure 4, cluster 5). For instance, profile 45 has a typical helicase signature, -Lx(3)Px(3)GKTLxAExA- [43, 44]. This profile connects several protein superfamilies: RecA-superfamily ATPases implicated in signal transduction, a protein implicated in ribosomal biogenesis, superfamily II helicases, and the replicative SFII helicase superfamily. Figure 5 shows an example of the elementary functional loop representing profile 45 in the RecA superfamily, where its function is ADP binding.
Methylases/methyltransferases are involved in the addition/transfer of a methyl group via nucleophilic and radical mechanisms. The group of methylases (Figure 4, cluster 6) has several characteristic signatures, for instance profiles 203 (-VLDxGxGxGx(6)A-) and 429 (-VxGxDx(8)A-). Figure 5 shows an example of the EFL representing profile 429 in a chemotaxis receptor methyltransferase with a bound S-adenosylmethionine (SAM) analog. It indicates that the same elementary function is shared between the SAM-dependent methyltransferases, different RNA methylases, including tRNA and rRNA methylases, and biosynthesis enzymes with methyltransferase activity.
The ATP binding cassette is a common component of the ABC transporters cluster (Figure 4, cluster 7). The cluster includes the ATPase component and transport systems for metal ions, amino acids, drugs, and small peptides. The ATP binding cassette consists of several highly conserved consecutive functional signatures. We found profiles corresponding to the major functional loops in the ABC transporters. Profile 225 (-Gx(6)GxxGxGKT-) corresponds to the Walker A motif (also called the P-loop), which interacts with the phosphate groups of the nucleotide in ATP. Profile 195 (-LSGGxxQRxxxAxxLxxxPx(5)DEPxxxLD-) contains several signatures. First, it includes the Walker B motif, which coordinates the Mg2+ ion and polarizes a water molecule. Profile 195 also contains a typical signature of all nucleotide hydrolases (-LSGG-), acting as a γ-phosphate sensor. Additionally, this profile includes a D-loop signature with conserved (-LD-) residues. Profile 294 (-GxxxQx(12)N-) represents the so-called Q-loop with a highly conserved glutamine, which provides a nucleophilic attack on the γ-phosphate of ATP. Figure 5 shows the structure of MalK, an ATPase subunit of the sugar ABC transporter in the archaeon Thermococcus litoralis, which is a representative example of the “ABC transporters” cluster. It contains three major elementary functional loops, representing profiles characteristic of the ABC transporters: 225, 195, and 294. Finally, elementary functions working in ABC transporters are also present in other biochemical functions where ATP binding is part of the reaction. For example, elementary functional loops of profiles 225 and 195 work in the cysteine desulfurase activator ATPase.
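The signatures quoted in this section (for example -Gx(6)GxxGxGKT- for the Walker A motif) are shorthand for the profiles, where x stands for any residue and x(n) for n arbitrary residues. As a minimal illustrative sketch, and not the profile-based search used in this work, such a signature can be translated into a regular expression and scanned against a sequence. The function names and the toy sequence below are hypothetical; an exact pattern match is a much cruder test than a PSSM match.

```python
import re


def signature_to_regex(signature: str) -> str:
    """Translate a signature such as '-Gx(6)GxxGxGKT-' into a regex string."""
    # Normalise the typographic minus sign, then drop flanking dashes/spaces.
    core = signature.replace("\u2212", "-").strip("- ").replace(" ", "")
    # x(n) stands for n arbitrary residues.
    core = re.sub(r"x\((\d+)\)", r".{\1}", core)
    # A bare x stands for one arbitrary residue; [RH]-style classes are kept.
    return core.replace("x", ".")


def find_signature(signature: str, sequence: str):
    """Return (start, matched_segment) for each non-overlapping exact match."""
    pattern = re.compile(signature_to_regex(signature))
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Toy sequence (hypothetical, not a real protein) containing a Walker A-like segment.
toy_sequence = "MKGAAAAAAGSSGTGKTLLD"
print(signature_to_regex("-Gx(6)GxxGxGKT-"))
print(find_signature("-Gx(6)GxxGxGKT-", toy_sequence))
```

Because a regular expression accepts or rejects a segment outright, it cannot capture the graded, position-specific preferences that a profile encodes; it is useful only as a quick way to read the signatures.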
Heavy metals, such as Mo, W, Co, Mg, and Cu, are widely used in enzymes in various biochemical and cellular processes as cofactors in catalysis, as activators, and as electron donors/acceptors in redox reactions. Cluster 8 in Figure 4 shows several profiles with the elementary function of metal binding. For instance, profile 14 with the signature -GMxCx(2)Cx(3)Vx(8)GV- characterizes the elementary function of copper binding (an example of the EFL structure is shown in Figure 5). Highly conserved cysteines and a methionine residue constitute the functional signature of this profile. In the graph in Figure 4, profile 14 connects the Copper-ion-binding protein superfamily with the Cation-transport ATPase superfamily. Both superfamilies share the elementary function of copper binding. Recently we showed that profile 14 is a derivative of the ancient prototype with the generic signature -CxxC- and the general elementary function of binding metals and metal-containing cofactors. As another example, profile 10 with the signature -Vx(3)GDGxNDAxALx(2)Ax(2)GxA- binds various ions and inorganic compounds (AlF4, BeF4, K+, Ca2+, MgF4, Mg2+, and Na+, according to PDBeMotif). In the arCOG graph (Figure 4a), profile 10 connects several protein superfamilies: HAD superfamily hydrolase, Cation transport ATPase, and Heavy metal associated domain (HMA) superfamily.
The above examples of elementary functions include binding, activation, and elementary reactions, which presumably existed in the prebiotic RNA-protein world and served as basic units in the formation of the first enzymatic domains. The binding of metals with generalized -CxxC- and aspartate-rich signatures (e.g. profiles 10, 14, and 177) and the binding of the phosphate group characterized by glycine-rich signatures (e.g. -GxxGxG-) are examples of abundant and presumably ancient elementary functions.
Contemporary proteins are sophisticated molecular machines built of hundreds or thousands of amino acid residues. Structurally, they consist of independent and compact domain(s) formed by a continuous polypeptide chain, or of several protein chains interacting and forming a protein complex. This work attempts to draw a picture of protein evolution starting from the prebiotic evolution of protein-like molecules with elementary functions and spanning into the contemporary evolution of protein structure and function.
We relied here on the concept of the elementary functional loop (EFL) as a presumed basic unit of protein function. We derived sequence profiles of EFLs using the set of complete proteomes from the archaeal superkingdom. Our analysis shows that in the earliest stages of protein evolution, or even earlier in the prebiotic world, combinations of primitive peptides/proteins with elementary functions, such as nucleotide/phosphate binding (Figures 4 and 5, clusters 1, 4, 5, and 7) or metal cofactor binding (Figures 4 and 5, cluster 8), apparently formed the first enzymatic domains. The most designable folds (such as the β/α-barrel and the Rossmann fold) apparently served as scaffolds for the biochemical functions of the first enzymatic domains. Enzymes with different folds can contain elementary functional loops that diverged from ancestral peptides with particular functions. Therefore, functional relations between enzymatic domains could have been established already in predomain evolution, when the first functional folds were formed. Figure 4 contains examples of common biochemical functions (represented by arCOGs) clustered around the key elementary functional loops. In many cases, one or a few EFLs determine the clustering of enzymes.
We used the methanogenesis pathway as a case study in order to show how enzymes with new functions can be formed from elementary functions and via reutilization of already existing functional domains. Methanogenesis is observed only in Archaea, and only in a few lineages, and is characterized by several unique folds and unusual cofactors. We considered in detail the enzymes catalyzing the first (Fwd) and the last (Mcr) steps of methanogenesis. The subunits FwdB and FwdD exemplify how binding of two molecules of the molybdopterin dinucleotide cofactor (MGD) is achieved by the joint work of two Rossmann-like (di)nucleotide-binding folds fused together. These two folds (Figure 2, cyan and green) form the catalytic domain together with an [Fe-S] cluster-binding fold (Figure 2, blue) and a beta-barrel fold (Figure 2, orange), whose elementary functions support the MGD binding. The Mcr enzyme, catalyzing the last step in the methanogenic pathway, has several different subunits that evolved from two folds: a ferredoxin-like fold and an all-α fold, which are used in several copies for building the enzyme. Mcr is also an example of utilizing metal- and nucleotide-binding signatures in interactions with the unique F430 cofactor. Using another enzyme, Hdr, we show how very similar elementary functions of [Fe-S] cluster binding can be used in different combinations in order to build a complex enzyme with oxidoreductase activity.
To conclude, a clear phylogenetic structure with four well-characterized phyla, a long evolutionary history bordering on the origin of life in the prebiotic world, and a diversity of colonized environments make Archaea an attractive subject for studies of the evolution of protein function. We were able to analyze the major ways in which protein function emerges and evolves, and to show how to reconstruct evolutionary relations between different enzymes. The future task we foresee is two-fold: (i) to obtain a set of elementary functions that would exhaustively describe the chemical transformations that existed in a prebiotic world; (ii) to determine the original set of enzymatic domains that formed from the above elementary functions and served as a seed in the evolution of protein function.
We used the arCOG database comprising 41 archaeal proteomes. Our definition of the core differs slightly from the original one and includes arCOGs present in at least 39 species. We excluded from consideration Nanoarchaeum equitans, which is an obligate symbiont and lacks a large number of core proteins due to its lifestyle. If we were to include N. equitans in the core, the number of core arCOGs would be only 79 instead of 166. We also omitted Thermoproteus tenax, because its genome was not publicly available at the time of the study. The shell (arCOGs present in at least 10 species), orphan (fewer than 10 species), and LACA (Last Archaeal Common Ancestor) groups of arCOGs are defined according to the database.
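The species-count thresholds above can be expressed as a short classification rule. The following is an illustrative sketch only: the arCOG identifiers and the presence table are hypothetical, the thresholds are the ones stated in the text (core: at least 39 species; shell: at least 10; orphan: fewer than 10), and the LACA group, which is taken from the database, is not modeled.

```python
def classify_arcog(n_species: int, core_min: int = 39, shell_min: int = 10) -> str:
    """Assign an arCOG to core, shell, or orphan by the number of species it occurs in."""
    if n_species >= core_min:
        return "core"
    if n_species >= shell_min:
        return "shell"
    return "orphan"


def group_arcogs(presence: dict) -> dict:
    """Group arCOG identifiers by class, given a map arCOG id -> set of species."""
    groups = {"core": [], "shell": [], "orphan": []}
    for arcog_id, species in sorted(presence.items()):
        groups[classify_arcog(len(species))].append(arcog_id)
    return groups


# Hypothetical presence table; species are represented by integer labels.
toy_presence = {
    "arCOG_A": set(range(40)),  # found in 40 species
    "arCOG_B": set(range(12)),  # found in 12 species
    "arCOG_C": set(range(3)),   # found in 3 species
}
print(group_arcogs(toy_presence))
```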
We used the HMM library from the Superfamily database, based on ASTRAL/SCOP release 1.75 [32, 51], in order to detect SCOP folds in arCOGs. A complete list of detected SCOP folds in the core, shell, and orphan arCOGs is provided in Additional File 1.
We used 30-residue long segments from the sequences of arCOGs as origins for deriving profiles of elementary functional loops (EFLs). The origins were iteratively matched against 68 non-redundant (70% sequence identity) archaeal proteomes until they converged into sequence profiles. Afterwards, the profiles were clustered in order to remove any remaining redundancy. The procedures for converging and clustering profiles are described in detail elsewhere [10, 11]. The computational pipeline yielded 525 sequence profiles with distinct functional signatures. We refer to the profiles by their serial numbers or by the PROSITE-like patterns uniquely identifying their signatures. Additional File 2 contains logos of all the profiles. The list of archaeal proteomes is provided in Additional File 4.
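To make the notion of a sequence profile concrete, the sketch below builds a small position-specific scoring matrix from a set of equal-length segments and finds the best-scoring window in a sequence. It is a generic illustration and not the converging procedure of [10, 11]: the log-odds scoring, the uniform background, the pseudocount, and the function names are assumptions introduced here, and the toy segments are three residues long rather than 30.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments, pseudocount=1.0):
    """Build a log-odds matrix from equal-length segments (uniform background assumed)."""
    width = len(segments[0])
    if any(len(s) != width for s in segments):
        raise ValueError("all segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(width):
        counts = Counter(s[pos] for s in segments)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def best_window(pssm, sequence):
    """Return (score, start) of the highest-scoring window, or None if the sequence is too short."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20-letter alphabet contribute a neutral score of 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy example with hypothetical three-residue segments.
toy_pssm = build_pssm(["GKT", "GKS", "GKT"])
print(best_window(toy_pssm, "AAGKTAA"))
```

An iterative scheme would rebuild the matrix from the segments matched in each round and repeat until the matrix stops changing; that loop, and the statistical threshold used to accept matches, are deliberately left out of this sketch.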
Using profile-sequence search, we looked for matches between the derived profiles and non-redundant arCOG sequences with an expected number of false hits of less than one. In order to increase the robustness of profile-arCOG matches, we excluded connections having fewer than 15 matches (five matches for methanogenic arCOGs). The resulting connections were visualized using Cytoscape 2.7.0.
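The match-count filter described above can be sketched as follows. This is an assumed reading of the rule (a profile-arCOG connection is kept when it has at least 15 matches, or at least five if the arCOG is methanogenic); the input format, the function names, and the toy identifiers are hypothetical, and the preceding significance filter on the expected number of false hits is not modeled.

```python
from collections import Counter, defaultdict


def robust_connections(matches, methanogenic=frozenset(), min_matches=15, min_methanogenic=5):
    """Count matches per (profile, arCOG) pair and keep the pairs that reach the threshold."""
    counts = Counter(matches)
    kept = {}
    for (profile, arcog), n in counts.items():
        threshold = min_methanogenic if arcog in methanogenic else min_matches
        if n >= threshold:
            kept[(profile, arcog)] = n
    return kept


def arcogs_by_profile(connections):
    """Group arCOGs that share a profile; each group is a set of connected arCOGs."""
    groups = defaultdict(set)
    for profile, arcog in connections:
        groups[profile].add(arcog)
    return dict(groups)


# Hypothetical match list: one (profile, arCOG) tuple per matching sequence.
toy_matches = (
    [(225, "arCOG_A")] * 15
    + [(225, "arCOG_B")] * 14
    + [(500, "arCOG_M")] * 5
)
kept = robust_connections(toy_matches, methanogenic={"arCOG_M"})
print(kept)
print(arcogs_by_profile(kept))
```

The surviving pairs form a bipartite profile-arCOG edge list, which can be exported as a simple two-column table for a network viewer.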
Sequence profiles of elementary functional loops were used to find matches in CDD and SCOP domains with known structure [32, 53]. For many protein families the functionally important residues are known, and their role in binding, intermolecular interactions, and the mechanism of catalysis was used to assign the profiles their elementary functions.
The methanogenic enzymes were identified by taking KEGG orthologous groups from the methane metabolism pathway and finding the corresponding COGs and arCOGs. The major enzymes of the methanogenic pathway and some common enzymes involved in cofactor biosynthesis are listed in Additional File 3.
We modeled two subunits, FwdB (UniProt AC: P61154) and FwdD (UniProt AC: Q58568), from M. jannaschii based on two structural templates: formate dehydrogenase H (FdhH) from E. coli (PDB ID: 1fdo chain A) and FwdD from Archaeoglobus fulgidus (PDB ID: 2ki8 chain A). We used the SWISS-MODEL server with a fully automated modeling procedure. FwdB contains two α/β/α Rossmann-like folds and a small Fe-S cluster-binding domain, whereas FwdD is a beta-barrel fold. The template FdhH contains the domains homologous to both subunits FwdB and FwdD in one chain; therefore, we were able to model the FwdB-FwdD complex assuming that domain interactions are conserved. We assembled the complex by aligning the A. fulgidus-based model of subunit FwdD to the corresponding beta-barrel domain of the FdhH template (from E. coli).
INB is indebted to Riva Z. L’vina for the help in writing this manuscript. The present study was supported by the national Functional Genomics Programme (FUGE) of the Research Council of Norway.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.