The proteomic complexity and rise of the primordial ancestor of diversified life
© Kim and Caetano-Anollés; licensee BioMed Central Ltd. 2011
Received: 2 February 2011
Accepted: 25 May 2011
Published: 25 May 2011
Skip to main content
© Kim and Caetano-Anollés; licensee BioMed Central Ltd. 2011
Received: 2 February 2011
Accepted: 25 May 2011
Published: 25 May 2011
The last universal common ancestor represents the primordial cellular organism from which diversified life was derived. This urancestor accumulated genetic information before the rise of organismal lineages and is considered to be either a simple 'progenote' organism with a rudimentary translational apparatus or a more complex 'cenancestor' with almost all essential biological processes. Recent comparative genomic studies support the latter model and propose that the urancestor was similar to modern organisms in terms of gene content. However, most of these studies were based on molecular sequences, which are fast evolving and of limited value for deep evolutionary explorations.
Here we engage in a phylogenomic study of protein domain structure in the proteomes of 420 free-living fully sequenced organisms. Domains were defined at the highly conserved fold superfamily (FSF) level of structural classification and an iterative phylogenomic approach was used to reconstruct max_set and min_set FSF repertoires as upper and lower bounds of the urancestral proteome. While the functional make up of the urancestral sets was complex, they represent only 5-11% of the 1,420 FSFs of extant proteomes and their make up and reuse was at least 5 and 3 times smaller than proteomes of free-living organisms, repectively. Trees of proteomes reconstructed directly from FSFs or from molecular functions, which included the max_set and min_set as articial taxa, showed that urancestors were always placed at their base and rooted the tree of life in Archaea. Finally, a molecular clock of FSFs suggests the min_set reflects urancestral genetic make up more reliably and confirms diversified life emerged about 2.9 billion years ago during the start of planet oxygenation.
The minimum urancestral FSF set reveals the urancestor had advanced metabolic capabilities, was especially rich in nucleotide metabolism enzymes, had pathways for the biosynthesis of membrane sn1,2 glycerol ester and ether lipids, and had crucial elements of translation, including a primordial ribosome with protein synthesis capabilities. It lacked however fundamental functions, including transcription, processes for extracellular communication, and enzymes for deoxyribonucleotide synthesis. Proteomic history reveals the urancestor is closer to a simple progenote organism but harbors a rather complex set of modern molecular functions.
Cellular organisms in the contemporary living world have been classified into superkingdoms Archaea, Bacteria, and Eukarya  ever since archaebacteria were discovered over three decades ago . Every newly recorded species of the over a million that have been described (belonging to the ~107-108 that probably exist on Earth [3–5]) has failed to escape from the boundary of the three-superkingdom natural system. This system confers a rigid universal taxonomic structure for a universal 'tree of life', a phylogeny that describes how lineages on Earth diversified from a primordial ancestor. In phylogenetics, the most basal and ancient (plesiomorphic) node of an evolutionary tree defines a common ancestor of the organismal set (taxa) under study. Generally, this node is a hypothetical entity, the first of a chain of ancestors giving rise to each and every organism in the tree. The tree of life defines the last universal common ancestor (LUCA), an organism responsible for the emergence of Earth's primary lineages [6–12]. However, the current tree of life is not universal, i.e. not all primary lineages are represented in the tree. The tree describes the evolution of organisms with ribosome-containing cells (ribocells) and does not incorporate viruses or other lineages that lack ribosomes (virocells), have biological boundaries that are difficult to define, or are evolutionarily highly mobile [13, 14]. The tree of life may also have reticulations because of horizontal gene transfer (HGT), convergent evolution, and recruitment processes that complicate the genetic make up of lineages. While the terms "network" or "rhizome" have been proposed, reticulations often affect the history of organismal components but maintain the integrity of lineages , and all lineages most probably had a single evolutionary origin . LUCA is believed to be a cellular entity, even though its make up has been considered contentious [6, 17]. Its cellular status was probably attained progressively, starting with molecular components drawn from the emerging and fuzzy biochemistry of primordial Earth and ending with more complex biological machinery needed to sustain the integrity of lineages in an increasingly diversified world. In particular, cell-defining ribosomes evolved from the start as ribonucleoprotein ensembles  but emerged before the loss of the first protein fold in a superkingdom 2.6 billion years (giga-annum; Ga) ago, a first indication of clear organism diversification [19, 20]. Conceptually, the primordial ancestor by definition cannot represent a formal lineage; no prior phylogenetic ancestor precedes it, being first and last of a gradually evolving community of primordial organisms. Consequently, we regard this primordial ancestral entity as the most basal node of the tree of life and name it for simplicity the 'urancestor' (root 'ur' = primitive), a primitive organism that evolved in a time when improving the molecular make up was the main focus of the evolutionary progression.
The emergence of the urancestor probably represents a singular evolutionary chain of events responsible for the universal genetic code and widely shared biological structure. However, defining the complexity (or simplicity) of its molecular features (characters) is problematic, as it often depends on the levels of molecular structure analyzed (e.g., molecular sequence, motifs in sequence, structural motifs, structural domains), the rooting of the tree of life, and the methods used for ancestral character state reconstruction. While it is likely that the urancestor had functional ribosomal RNA (rRNA) and some transfer RNA (tRNA) molecules [6, 8, 21], as revealed by phylogenetic analysis of RNA and protein structure [18, 22]), and was probably endowed with functions associated with DNA replication , its translational apparatus has been considered to be rudimentary [6, 24]. Penny and Poole  later on expanded urancestral complexity to include besides translation and DNA replication, transcription, cell division, and regulatory elements of information processing, suggesting a functionally complex entity much akin to that of modern life. Comparative genomic studies that included the first archaeal genome that was sequenced, Methanococcus jannaschii, resulted in arguable conclusions . Broader comparative analyses of entire genomic repertoires were still inconclusive, supporting the simple 'progenote' model proposed by Woese  or the complex 'cenancestor' (sensu ) model that equated the urancestor to modern organisms [7–12]. For example, Koonin  traced evolutionary histories of orthologous genes in ~100 genomes using parsimony thinking and phylogenetic trees of ribosomal proteins and rRNA. Although the numbers of urancestral genes ranged from tens to ~1,800 depending on composite parameter values of rates of differential HGT and gene loss, the gene set included most of the translation apparatus and a few transcriptional components but lacked DNA replication regardless of parametric values. In contrast, Ouzounis and colleagues initially identified ~300 ancient proteins from homologues of M. jannaschii open reading frames . Except for regulatory elements related to information (i.e. translation, transcription, etc), proteins encompassed Penny and Poole's urancestral functions. A more recent study of 184 genomes identified 669 orthologous protein families, which cover 561 detailed functional classes that are involved in almost all essential biological processes of extant life, including translation, transcription and its regulation, DNA replication, recombination, and repair, transport and membrane-associated functions, electron transfer, and metabolism . Similarly, comparison of protein fold structures among lineages of the three superkingdoms supported an urancestor with functional complexity similar to that of extant life [11, 27].
Mutation and chromosomal rearrangement change the sequence of nucleic acids and proteins continuously. The high pace of sequence change can complicate phylogenetic analysis and character state reconstruction. Differential evolutionary rates among organismal lineages, horizontal gene transfer (HGT), and non-orthologous gene displacement can generate phylogenetic artifacts such as long-branch attraction and unrecognized paralogy [8, 9, 15, 28]. This compounds with difficulty in identifying homology by sequence alignment , the troublesome task of assigning orthologous relationships , and the problem of saturation of substitutions . These phylogenetic artifacts can be responsible for the variable rootings of the tree of life. Finally, origins of genes can be highly variable depending on reference trees that are used for tracing their evolutionary histories. Except for a couple of studies [11, 27], all surveys of urancestral molecular repertoires that were fully based on sequence conservation must be considered susceptible to these problems.
In this study we focus on the three-dimensional (3D) structure of protein domains that result from the folding of polypeptide chains. These structures are grouped into fold families (FF), fold superfamilies (FSFs), and folds (F) in a robust hierarchical classification scheme, the structural classification of proteins (SCOP). In particular, we focus on FSFs, groups of FFs (protein structures that are homologous based on sequence identity) that share structural and functional features suggestive of a common evolutionary origin. SCOP currently defines ~2,000 FSFs, and ~7 × 107 proteins present in ~1,124 completely sequenced genomes (October 2009) have been assigned to at least one FSF by scanning with hidden Markov models (HMMs). The relatively small number of FFs, FSFs, and Fs present in nature indicates protein structure is much more conserved than sequence and is refractory to evolutionary change . In fact, a comparative analysis of structurally aligned protein domains and aligned protein sequences showed structures are 3-10 times more conserved than sequences . The structures are therefore good phylogenetic markers for deep events in evolutionary history. The large and growing number of genomic sequences and their associated FSF assignments also guarantees a broad organismal census. Moreover, the evolutionary conservation and deep phylogenetic signal of FSFs has been repeatedly verified by recovery of reliable phyletic patterns describing the evolution of the three superkingdoms [34, 35] and by successful exploration of the origin of modern metabolic networks . All of these features and the fact that domains diversify mostly by vertical descent [37–39] make domain structure extremely useful for the evolutionary study of proteomes and urancestral protein repertoires.
In this study we conducted a genomic census of protein domains at FSF level of structural complexity in organisms that have been completely sequenced and used this information to build trees of life describing the evolution of proteomes. We first analyzed 645 complete proteomes in 49 archaeal, 421 bacterial, and 175 eukaryotic species (the Total set). Using an E-value cutoff of 10-4, we extracted reliable proteomic HMM hits for 1,531 FSFs, 1,446 of which covered ca. 80% of the total FSFs defined in SCOP. We then generated intrinsically rooted phylogenomic trees directly from FSF abundances in the 645 proteomes. A most parsimonious reconstruction of a tree of life (Total-tree) showed that organisms in the three superkingdoms formed distinct groups (Additional file 1, Figure S1). Eukaryotic proteomes were placed in a monophyletic group that was supported by a 50% bootstrap support (BS) value. In turn, archaeal and bacterial proteomes were para- and polyphyletically grouped, respectively. A manual analysis of organismal lifestyles using various sources of information revealed that proteomes from obligate and non-obligate parasitic organisms (225 in total) were distributed throughout the tree but generally occupied the most ancient branches of the superkingdom groups (e.g. Mycoplasma spp. and Rickettsia spp. in Bacteria; Cryptosporidium spp. and Plasmodium spp. in Eukarya). Basal organisms included notable obligate parasites (N. equitans in Archaea, Chlorobium chlorochromatii in Bacteria, and Encephalitozoon cuniculi in Eukarya).
The genomes of organisms that are parasitic or that establish symbiotic relationships with other organisms have frequently experienced reductive evolution, discarding enzymatic and cellular machineries in exchange for resources from their hosts. Since the inclusion of these genomes can lead to incorrect phylogenetic trees [15, 42], we excluded genomes from all but 420 FL organisms. This FL set included 48 archaeal, 239 bacterial, and 133 eukaryotic organisms. The exclusion of non-FL genomes resulted in 26 FSFs that were absent in the proteomes of FL organisms. We therefore reconstructed a most parsimonious rooted tree that described the evolution of FL organisms from genomic abundances of the remaining 1,420 FSF. The tree of FL proteomes (FL-tree) also supported the trichotomy of the superkingdoms (Figure 1A). Unlike the Total-tree, the FL-tree showed that archaeal species formed a polyphyletic group and were basal in the tree. The proteomes of Bacteria and Eukarya formed strong monophyletic groups supported by 88 and 100% BS values, respectively. A sister-clade relationship between Bacteria and Eukarya existed but was poorly supported (57% BS). While statistics describing the skewness  of the Total-tree were close to zero (g 1 = -0.065), the FL-tree had more significant phylogenetic structure (g 1 = -0.199, P < 0.01). Our focus therefore centered on FL-tree reconstructions.
In the FL-tree, the number of bacterial proteomes dominates those of organisms in other superkingdoms. It is thus possible to argue that the bias of sampled proteomes per superkingdom can cause long-branch attraction, which can eventually lead to incorrect deep phylogenetic relationships. We thus sampled equal numbers of proteomes per superkingdom and replicated trees of proteomes. We first chose only one proteome per genus in Archaea because the numbers of proteomes vary depending on individual archaeal genera. As a result, we selected 34 out of 48 archaeal proteomes. We then classified bacterial and eukaryal proteomes at the phylum level of the two superkingdoms. While some phyla contained large number of proteomes (e.g., 50 proteomes in Firmicutes), some other contained very few (e.g., only two proteomes in Acidobacteria). We thus randomly sampled bacterial and eukaryal proteomes whose numbers are proportional to the size of corresponding phyla and then adjusted the number of sampled proteomes per superkingdom to the sample size of Archaea (34 proteomes). By following the same steps described in Methods, the genomic abundances of 1,370 FSFs in the balanced set of 102 proteomes resulted in one most parsimonious tree (Additional file 1, Figure S2). The phylogenetic relationships among the three superkingdoms that we found in the FL-tree of 420 proteomes ([Archaea, [Bacteria, Eukarya]]; Figure 1A) were consistently present in the newly reconstructed tree (Additional file 1, Figure S2). Furthermore, monophyly of Bacteria and Eukarya, and paraphyly of Archaea were maintained. These results indicate our phylogenetic approach based on genomic abundance is robust against uneven sampling of proteomes in superkingdoms.
In general, some of character-state changes on a given phylogenetic tree can be ambiguous. That is, these changes can occur on different branches without increase or decrease in the number of steps of the tree. In the root branch of the FL-tree, 50 out of the 352 FSFs had character-state changes that were ambiguously assigned. However, the ambiguity sharply decreased in the early iterations of every chain as shown in Additional file 1, Figure S3. Consequently, all of the 152 FSFs that were obtained from iteration 36 onwards of chain 26 had purely unambiguous character-state changes.
To remove FSFs that might be derived from ancient HGT or convergent evolutionary processes, we used information in a rooted tree of domain structure that describes the evolution of 1,420 FSFs (taxa) and was reconstructed from FSF abundance in the proteomes of 420 FL organisms (characters). In this tree, the most basal 130 FSFs were present in the three superkingdoms and appeared before the rise of the first FSF that was completely lost in a superkingdom (Archaea) (Figure 1B). The intersection of this ancient set of 130 FSFs with the set of 152 FSFs defined 70 urancestral FSFs. Iteration analysis of this new urancestral FSF set with the same numbers of chains and iterations showed there was no further decrease in the number of FSFs, although tree lengths varied slightly from 10,271 to 10,284 steps. We therefore define the sets of 152 and 70 urancestral FSFs as the maximum (max_set) and minimum (min_set) boundaries of the urancestral FSF repertoire, respectively. The biological rationale for the use of the iterative analysis and the tree of domain structures can be found in Discussion.
Under the 3-superkingdom taxonomic system, an individual FSF can be unique to superkingdom Archaea (A), Bacteria (B) or Eukarya (E), or can be shared by two (AB, BE, or AE) or all three superkingdoms (ABE). The initial 352_set of FSFs that were basal and plesiomorphic consisted of 314 ABE, 10 AB, 27 AE, and 1 A FSF (Figure 2A). As expected, the max_set of 152 urancestral FSFs (and the min_set of 70 FSFs embedded in it) had only ABE domains. To describe the popularity of urancestral FSFs in proteomes, we adopted two different indexes: (1) a distribution index f, the fraction of organisms that contain a certain FSF (in a 0-1 scale); and (2) the mean of G, the sum of g values of all sampled proteomes for a particular FSF divided by the total number of proteomes. The 352_set had f and mean of G values that ranged 0.055-1 and 0.114-13.781, respectively (Figure 2C). The max_set has corresponding values that ranged 0.694-1 and 1.529-13.781 and the min_set had values that ranged 0.964-1 and 3.807-13.781. Clearly, min_set FSFs are ubiquitous and highly represented in proteomes.
The enrichment of molecular functions in the urancestral FSF sets
Functional sub-categories (categories)
Urancestral FSFs (sample)
Extant FSFs (bkgd)
Rate for sample
Rate for bkgd
Small molecule binding (General)
Nucleotide m/tr (Metabolism)
Nucleotide m/tr (Metabolism)
We also compared the abundance of urancestral FSFs (FSF reuse) to that of extant organisms (Figure 5B). While it is difficult to infer multiple appearances of individual FSFs in organisms before the advent of the three superkingdoms, proteome trees were generated from FSF abundance levels (g values) and provide histories of character state changes for individual characters (FSFs). We therefore counted how many gains or losses of individual urancestral FSFs occurred in the root branches of the trees. In the process of identifying the two urancestral FSF sets, we obtained most parsimonious trees of proteomes in iterations that generated the 70 and 152 FSF sets. In these trees, min_set and max_set FSFs appeared 303 and 507 times without any event of loss, respectively. Using a similar rationale, we counted gains and losses of FSFs in individual proteomic lineages by traveling from the root to terminal leaves of the proteome tree. Total abundance levels for FSFs in each of the 420 FL proteomes (i.e. the total number of FSFs in a proteome) ranged from 1,076 to 81,230. The proteome that had the lowest total abundance level was Ignicoccus hospitalis, a chemolithoautotrophic and hyperthermophilic archaeon (Figure 5B). As expected, the FSF abundance levels of the urancestral sets were simpler than those of any FL proteome in existence today.
Using a strategy similar to the one described above for domain structure, we again regarded the two urancestral FSF sets as taxa and evaluated their phylogenomic positions on the proteome trees. Inspired by a recent phylogenomic analysis of molecular functions that was derived directly from ontological data , we defined the 49 functional sub-categories as phylogenetic characters and calculated individual g values for FSFs pooled by function. Characters in this analysis have character states that describe the number of distinct FSFs that are assigned to a particular pair of a functional sub-category and proteome. As expected, phylogenomic trees reconstructed from transformed g values for FL proteomes and artificial urancestors showed again that the urancetor was basal in the most parsimonious trees (Figure 7B). While the basal placements of the functional urancestral sets were weakly supported (BS < 50%), over 100 character-state changes between the urancestor and the set of the 10 most ancient taxa (all archaeal in origin) supported the functional simplicity of the urancestor of life.
Ribosomal proteins used as markers of ribosomal evolution in the urancestor
SCOP fold superfamily
Nucleic acid-binding proteins
Ribosomal protein S5 domain 2-like
Translation proteins SH3-like domain
Alpha-L RNA-binding motif
S13-like H2TH domain
Ribosomal protein L6
Glucocorticoid receptor-like (DNA-binding domain)
Translational machinery components
Ribosomal protein L7/12, oligomerisation (N-terminal) domain
A number of recent studies have focused on the complexity of the urancestor, its primordial functions, alternative rootings of the tree of life, and the rampancy of HGT [6–12, 23, 46]. However, results are not congruent, the rooting of the tree of life is still controversial, and the make up of the urancestor remains shrouded in mystery. It is indeed difficult to characterize an entity that existed billions of years ago using information in molecules of life that are modern. When using for example nucleic acid or protein sequences, the pervasive effects of mutation can cloud (saturate) any significant evolutionary signal. Some early studies focused on genomic sequence conservation but suffered for example from the effects of genetic losses and take-overs (e.g., non-orthologous gene displacement [9–11]). The analysis we here report is based on genomic content of protein domain structures at FSF level of the structural hierarchy. FSFs are much more conserved than protein sequence and are highly shared by organismal lineages. About half of FSFs (682 out of 1,420) in the 420 FL proteomes we analyzed in this study are common to the three superkingdoms (Venn diagram; Figure 2A). It is clear that FSFs are more robust against genetic losses and take-overs than corresponding sequences and carry deep evolutionary signatures [36, 42, 47, 48]. Furthermore, the distribution of FSFs in proteomes meets crucial phylogenetic marker criteria since: (i) it rarely changes relative to speciation events (SCOP domains and FSFs are discovered at average rates of once every ~0.1 and ~5 million years, respectively ), (ii) is minimally affected by HGT [37, 38], and (iii) is not under active natural selection (structural designs spread at rates of gene duplication and are vastly unaffected by change at sequence level ). We caution however that secondary adaptations in organism lifestyles such as parasitism could significantly affect FSF abundances in lineages undergoing reductive evolution. Given all these features, FSFs are well suited for inferring the make up of the urancestor.
Current top-down phylogenetic strategies used to build a tree of life generate unrooted trees and use deep taxa as outgroups a posteriori to root phylogenies. One major technical limitation of determining if given genes have an origin in the urancestor using this approach is the need of a universal tree that is accurately rooted. In some traditional studies that were based on sequence conservation, gene genealogies were largely dependent on the position of the root in trees that are used as reference (guide trees)[6, 8, 9, 15]. However, guide trees built from rRNAs or some ancient proteins, such as elongation factors and aminoacyl-tRNA synthases (aRSs), produced rooting scenarios that were not congruent. Inconsistencies of this kind and the possibility of unrecognized paralogy leading to incorrect gene trees have made urancestral gene assignments unreliable . Although a recent technical advance makes it possible to infer ancestral states of a given gene content using the parsimony method (GeneTRACE ), it still requires organismal or genomic trees as guides and does not explore the effects of gene abundance. Due to these technical limitations, the first use of FSFs to make inferences of the urancestor was solely based on the distributions of FSFs in genomes (analogous to the f index we use here) without any phylogenetic consideration .
In contrast, our bottom-up phylogenetic strategy uses the Lundberg method  to generate rooted phylogenomic trees without the need of outgroups (Figure 1 and Additional file 1, Figures S1 and S2). Evolution's arrow is established directly by the evolutionary model, the rationale and assumptions of which have been recently reviewed . Operationally, the tree reconstruction algorithm finds the shortest unrooted tree(s) without specifying character polarity and then roots the tree(s) by invoking a hypothetical ancestor defined by ancestral character states and selecting the rooted topology that minimizes overall tree length (see Methods). In this study, a phylogenomic tree that describes the evolution of 420 FL proteomes revealed the three superkingdoms as distinct groups and placed Archaea at the root, with a rooting that was internal (paraphyletic) to the superkingdom (Figure 1A). A tree of life describing the evolution of a balanced set of proteomes corresponding to the three superkingsdoms revealed the same diversification patterns (Additional file 1, Figure S2), suggesting that biases in taxon sampling do not affect the rooting of trees. The archaeal rooting of the tree of life has been reliably obtained in numerous studies with different proteomic sets [27, 35, 42] and is congruent with phylogenetic analysis of the structure of tRNA [22, 52], 5S rRNA  and RNase P , and of tRNA paralogs [55–58]. While its significance is not the focus and will not be discussed in this paper, a rooting in Archaea (see discussion in ) departs significantly from the 'canonical' bacterial rooting of the tree of life, which is traditionally derived from analyses of the sequence of ancient gene paralogs (e.g., ATPases, aaRSs, elongation factors). It thus questions the bacterial-like origin of cellular life inferred from sequence comparisons. In turn, the phylogenomic tree that describes the evolution of 1,420 FSF domain structures showed that the most ancient FSFs at the base of the tree (the basal_set) were shared by the three superkingdoms and were mostly universal (Figure 1B). Remarkably, the first loss of ancient FSFs occurred exclusively in archaeal lineages, an observation that also supports the ancestrality of Archaea. Again, patterns of distribution of FSFs in the trees were obtained congruently with numerous proteomic sets and releases of SCOP as genomic sequences and structures were acquired with time [27, 35, 42, 47, 59, 60].
Trees of proteomes and trees of FSFs are generated from the same genomic structural census but represent two sides of the same story (Figure 1). They describe the evolution of proteomes or the evolution of the FSF structures that make up the potein complement, respectively. Proteomes at the root of the tree of life are populated by FSFs that are shared by all three superkingdoms and proteomes at its crown are enriched in 'signature' FSFs that are unique to individual lineages. Relatively few signature FSFs exist that are specific to superkingdoms. Remarkably, proteome comparisons reveal these signatures are very unequally divided among superkingdoms  and already suggest (following parsimony thinking) that bacterial and archaeal lineages evolved from a primordial eukaryotic-like lineage by reductive loss . Phylogenomic analysis confirms this reductive evolutionary tendency, showing that the first diversified lineage to emerge by loss of FSFs gives rise to Archaea, which in turn has the least number of signature FSFs and expresses the lowest levels of diversity and reuse of FSFs in nature . The results we here report confirm once again these patterns (Figure 1B), indicating that the urancestral proteome is populated by ancestral sets of FSFs at the base of the tree of life that appeared in the tree of FSFs before the reductive evolutionary tendency in Archaea was evident .
In order to define the proteomic make up of the urancestor, we first identified a set of 352 primitive (plesiomorphic) FSFs (the 352_set) at the root branch of the tree of FL proteomes. All FSFs of the 352_set exhibited only gains in genomic abundance, most (314 FSFs) were common to the three superkingdoms, and interestingly, all were present in Archaea (see the four cells that are occupied in the Venn diagram; Figure 2A). However, tracing the 352_set FSFs in the tree of domain structures revealed the set was not conservative enough to define the urancestor. A timeline describing the age of each FSF unfolded directly from the tree shows that the FSFs that are not universally shared by superkingdoms appeared for the first time in evolution in the order: BE (nd = 0.210), AB FSFs (nd = 0.415), B (nd = 0.433), E (nd = 0.538), A (nd = 0.538) and AE (nd = 0.589) FSF groups (Figure 1B). Remarkably, no FSFs of the ancient BE-specific group were present in the 352_set, which contains besides the universal basal_set the more derived AB, AE and A groups. Similarly, the single A-specific FSF in the set (related to transcriptional regulation; d.236.1) cannot be part of the urancestor (Figure 2A). The FSF is not only absent in Bacteria and Eukarya but is also absent in nearly 50% of archaeal lineages examined (Additional file 1, Table S1). These results suggest many non-universal FSFs in the 352_set should not be considered urancestral and were the result of the 'modern effect'.
In order to decrease the number of false negatives, FSFs in the 352_set were assigned as initial characters in a parallel and iterative exercise of tree building and plesiomorphic character selection, with the goal of selecting for the most parsimonious tree of proteomes and the minimum number of FSFs. In each of 30 chains, 50 cycles of iteration dramatically reduced the space of urancestral FSFs (Figure 2A). The iterative procedure resulted in a more realistic max_set of 152 FSFs, which: (i) were common to the three superkingdoms (Figure 2A), (ii) excluded FSFs with decreased organismal distribution and genomic abundance levels (i.e with smaller values of f and mean of G; Figure 2C), and (iii) excluded FSFs that had ambiguous character-state changes in the root branch of the tree of proteomes (Additional file 1, Figure S3). Consequently, the iteration strategy works well to selectively filter false positives assigned to the root branch by the modern effect, decreases biases introduced by the archaeal rooting, and mitigates uncertainties in character-state reconstructions.
We note that real urancestral FSFs that have evolved with intensive losses in numerous proteomic lineages are still possible and their origins will be seen as more derived under the parsimony criterion. These false negatives in the urancestral set cannot be dissected from FSFs that diverged more recently. Since the initial 352_set includes a significant number of of FSFs in the proteomes examined (~25%), this initial large coverage shields against exclusion of unknown false negatives and makes the max_set a maximum bound for the urancestral proteome. On the other hand, false positives resulting from ancient HGT events [6, 40] can still occur. For example, FSFs that appeared soon after organismal diversification but transferred extensively to different lineages may be regarded as urancestral. The gap that exists between the discovery of urancestral FSFs and FSFs that emerged at the start of organismal diversification can be identified in the tree of protein domains, since this tree unfolds the evolutionary order of appearances of each of the 1,420 FSFs that are present in the modern proteomes we sampled (Figure 1B). Tracing the max_set FSFs in the tree of domain structures revealed that the set was not conservative enough to accurately define the urancestor and that many FSFs had low proteomic distribution (f) and abundance (mean of G) levels. We therefore defined a more conservative min_set by intersecting the max_set derived from phylogenetic iteration and the basal_set derived from the tree of domain structure (Figure 1B). The min_set excluded 82 FSF with relatively smaller f and mean of G values. This set can be considered a lower bound for the urancestral proteome.
We here define the urancestor as an entity that accumulated genetic information in a period that spans the emergence of life and the emergence of diversified cellular life. We also consider the urancestor as a primordial isoform of the modern ribonucleoprotein world, regardless of it being a single organism or a communal population , especially because it contains fully functional ribosomes (see below). We therefore compare the proteomic and functional sets of the two worlds, the ancient world of the urancestor and the modern world of extant organisms, and make inferences about biological complexity using information in molecules that are modern.
We find that the upper bound urancestral FSF max_set contains almost all essential biological processes, including crucial metabolism and transport activities linked to amino acids, nucleotides, carbohydrates, polysaccharides, and coenzymes, and functions associated with the Information (translation, DNA replication/repair, transcription, RNA processing), Intra-cellular processes (transport, protein modification, proteases), Regulation (e.g. kinases/phosphatases, DNA binding, RNA binding), and General (small molecule binding, protein interaction) categories (Figure 3, Additional file 1, Table S2). As expected, the set lacks the Extra-cellular processes category, which includes molecular functions linked to definition of self and inter-cellular interactions (toxins, cell adhesion, immunity, etc). Although some of the sub-categories (i.e. transcription, RNA processing, and RNA binding) were not present in the urancestral min_set, the functions of the two urancestral sets are similar and suggest a functional complex entity [8, 10]. However, the numbers of urancestral FSFs participating in individual subcategories were always smaller than those of FSFs in modern proteomes (Figure 4, Additional file 1, Table S2). Consequently, the functional repertoire of the urancestor while exhibiting almost all essential functions should be regarded as being simpler than the repertoire of modern proteomes. We suggest FSFs of this limited repertoire acted as melting pot for new molecular functions when organismal lineages emerged, with founder biological activities being primitive and relatively non-specific . The development of the ribosome illustrates such an origin .
The numbers of the FSFs in major categories of the min_set, especially in Information, were smaller than those of the max_set (Figure 4). In general, informational genes tend to form multi-component complexes stabilized by protein-protein interactions. For this reason, it has been thought that these genes are refractory to HGTs . The robustness of informational genes against transfer was previously contrasted with the rampant transfer among lineages of ancient metabolic (operational) genes [6, 8]. However, a recent study reveals HGT does not exhibit functional preferences and occurs randomly . In turn, analysis of HGT in trees that describe the evolution of function directly from ontological data are congruent with our analysis and suggests a preferential role of HGT in shaping information-related functions . Similarly, a recent comparative statistical analysis of homoplasy levels in trees of proteomes reveals information-related domains at FF level suffered limited but comparatively significant levels of lateral exchange . These FFs were discovered quite early in protein evolution before and after the start of the diversified world. Thus, phylogenomic analysis of HGT of both biological functions and FSF structures can be used to explain why the min_set excludes preferentially these ancient HGT-susceptible FSFs. To further examine the role of ancient HGT processes on urancestral proteomic make up, we tested which of 49 functional sub-categories were preferentially enriched in the max_set and min_set relative to extant FL proteomes. Remarkably, only two functional categories, translation (Information) and nucleotide m/tr (Metabolism) were enriched in the max_set while metabolism-related transferases (Metabolism), small molecule binding (General), and nucleotide m/tr (Metabolism) functions were enriched in the min_set (Table 1). We note that the three sub-categories enriched in the min_set include FSF belonging to primordial metabolic folds (see below). While we do not know how many real urancestral FSFs that evolved without major HGT effects were excluded in the max_set to min_set transition, it is apparent that the proportion of horizontally transferred FSFs in the min_set is smaller than that in the max_set. Consequently, the absence of translation and presence of nucleotide-related metabolic activities in the enriched functions of the min_set provides statistical support to operational genes being more robust against ancient HGTs than translation-related informational genes, a result that is in contrast with previous proposals [6, 8]. We hypothetize that translation was necessarily simple and flexible during its early metabolic urancestral inception . Fewer molecular interactions between components in a simpler translation system left HGT unchecked and free to shape the spread of FSFs that were recruited for the new translation functions. The numbers of translation FSFs are 6 and 34 in the min_set and max_set, respectively, which represent 6.7% and 38.2% of modern translational FSFs. While ancient HGT processes appear to have shaped the evolution of ancient translation-related genes during urancestral history, HGT may have not affected metabolism to such levels, especially because metabolism was already quite developed when translation materialized in evolution . A detailed phylogenomic analysis of protein domain structure in metabolic networks reveals that the nine most ancient folds were responsible for the explosive appearance of most modern enzymatic functions . A succession of recruitment gateways, each mediated by the discovery of a new fold showed metabolism originated in enzymes of nucleotide metabolism harboring the P-loop-containing NTP hydrolase fold (c.37), probably in pathways linked to the purine metabolic subnetwork . Crucial FSFs of these primordial metabolic folds are part of the urancestral min_set and many are part of nucleotide metabolism subnetworks [e.g., P-loop-containing NTP hydrolases (c.37.1); ribulose-phosphate binding barrel (c.1.2); NAD(P)-binding Rossmann-fold domain (c.2.1); ribonuclease H-like (c.55.3)]. The congruent enrichment of nucleotide m/tr in both urancestral sets suggests operational genes encoding nucleotide metabolism were built in the urancestor and diverged vertically in primary lineages of the superkingdoms. Interestingly, highly conserved protein-encoding sequences related to nucleotide biosynthetic pathways, including putative phosphoribosyl pyrophosphate synthase and thioredoxin enzymes, were previously identified as being important part of the urancestral set . This is also consistent with a study of physical clustering of genes in bacterial genomes, which also reveals the most ancient group of genes is related to metabolism . Due to statistical limitations of the hypergeometric distribution significant enrichments could not be resolved for the remaining 5 informational and 14 operational categories. A more comprehensive study will be needed to evaluate the extent of HGT in whole sets of ancient operational and informational genes.
Finally, a comparison of structural and functional components of the urancestor and FL proteomes revealed the complexity of the make up of the urancestor relative to extant organisms. In terms of FSF repertoires, the numbers of distinct FSFs (diversity) of the urancestral sets were significantly smaller than those of each and every one of the FL organisms we analyzed (Figure 5A), even if FSF reuse was considered (Figure 5B). Our estimates therefore indicate that the FSF repertoire of the urancestor (70-152 FSFs) and its reuse in domains (303-507 domains) was at least 5 and 3 times smaller than that of extant FL organisms, respectively. Furthermore, the inclusion of artificial urancestral proteomes resurrected in silico in tree reconstructions generated trees of proteomes that always placed the urancestor at their base (Figure 6). These results support the distant relationship that exists between the urancestor with the simplest of extant proteomes, confirming the relative simplicity of the reconstructed ancestral entity. Similarly, phylogenetic reconstructions derived from functional data confirm urancestral functions were quantitatively and phylogenetically simpler than functions in any extant FL proteome (Figure 7). Consequently, the actual repertoire of the urancestor inferred from FSFs in modern proteomes, while relatively complex in the number of molecular functions it embodies, is closer to the simple progenote model and distant from the complex cenancestor model. We note however that proteins in the relatively simple FSF repertoire of the urancestor could have been non-specific, harboring a multiplicity of functions. These would have increased the effective complexity of this primordial organism. Furthermore, the complex functional repertoire we reveal suggests the urancestor was a quite advanced version of the progenote, with a multiplicity of metabolic and biosynthetic functions.
In a previous study, we used phylogenies of FSF and FF to study the emergence of the translation apparatus . Dissection of first appearance of fundamental innovations in molecular machinery (evolutionary landmarks) associated with metabolism and translation revealed translation had metabolic origins. It appeared after the discovery of a large number of metabolic functions but before enzymes necessary for the synthesis of DNA. A clear timeline of molecular diversification was apparent, with domains associated with aminoacylation appearing first, immediately followed by molecular switches and regulatory factors important for tRNA shepherding and RNA transport. Additional file 1, Table S1 shows landmark domains that interact with RNA (some of which have metabolic roles) were present in the min_set, including class I (c.26.1, nd FSF = 0.064) and II catalytic (d.104.1, nd FSF = 0.128) and anticodon-binding (a.27.1; nd FSF = 0.141) domains of aRSs, GTP-binding (c.37.1, nd FSF = 0) and elongation factor (b.43.3, nd FSF = 0.128) domains of translation factors, and even ribonuclease P and PH domains (d.14.1, nd FSF = 0.059) crucial for endo- and exoribonucleolytic cleavage of RNA and nucleotydiltransferase activities necessary for damage repair. In contrast, none of domains present in ribonucleotide reductase enzymes responsible for producing the deoxyribonucleotide components necessary for DNA-linked functions, the ferritin-like domain (a.25.1, nd FSF = 0.242), N-terminal domain of cbl (a.48.1, nd FSF = 0.685), and the the PFL-like glycyl radical enzyme domain (c.7.1, nd FSF = 0.279), were present in the min_set. Only one of these domains, c.7.1, was present in the max_set. We note that the reduction of ribonucleotides to deoxyribonucleotides involves the production of an active site thiyl radical that requires contacts with cysteins in all protein domains of the catalytic subunit of the oligomeric enzymatic complex , suggesting modern ribonucleotide reductase functions is indeed derived. We also note that the active site domains of class III ribonucleotide reductases share the c.7.1 domain and the associated radical-based chemistry with pyruvate formate-lyase enzymes, a link proposed to have mediated the RNA-to-DNA biological transition . However, phylogenomic analysis at FF level  suggests the pyruvate formate-lyase domain (c.7.1.1; nd FF = 0.518) emerged later than its ribonucleotide reductase counterpart (c.7.1.2; nd FF = 0.235). It is therefore likely that the urancestor stored genetic information as RNA and not DNA.
The set of FSFs of ribosomal proteins that are universal establish crucial contacts with substructures of the rRNA subunits in the ribosome and appear much later than aRSs and regulatory factors . A careful phylogenetic analysis of ribosomal history directly from protein and RNA structure established the relative time of appearance of ribosomal proteins and rRNA substructures in the ribosome . The study reveals that proteins and RNA co-evolved form the start and structures supporting protein synthesis appeared in a fundamental major transition once processivity functions involving interactions with transfer and templating RNA were already functional. A set of four FSFs were recruited into ribosomal function during this initial period, including ancient ribosomal proteins with OB-fold and related SH3-like small β-barrel folds. Remarkably, all of these FSFs belong exclusively to the urancestral min_set (Table 2). A set of 6 additional FSFs that associated with the ribosome immediately after the major transition but before the appearance of the L7/L12 protein complex , are all exclusively included in the urancestral max_set (Table 2). The L7/L12 complex crucially stimulates the GTPase activity of elongation factor G, a ribosomal factor that catalyzes elongation and enhances ribosomal processivity [68, 69]. Primordial protein synthesis was therefore active in the urancestor and the processivity and efficiency of the ribosome was actively improved during urancestral evolution.
In order to place the history of the urancestor and of the ribosome in a timeline, we used a molecular clock of protein domain structure to define evolutionary timescales . Using a clock derived from the tree of FSFs of Figure 1B but using the calibration points of Wang et al. , the appearance of the youngest FSFs in the urancestral min_set and max_set suggests organismal diversification was established sometime between 2.9 and 2 Ga ago (Figure 8). Remarkably, the earliest date coincides with the discovery of arobic metabolism and the start of planet exigenation  that lead to the Great Oxidation Event , a geological time where oxygen reached 1% of present atmospheric levels. We note that integration of molecular, physiological, paleontological, and geochemical data suggests that a diversified clade of cyanobacteria with marked heterocyst and cell differentiation appeared no later than 2.1 Ga ago  and a number of FSFs linked to events of organismal diversification and the fossil record  (Figure 8) are only compatible with the existence of an urancestor before that time. This indicates the urancestral max_set is not conservative enough to accurately define the urancestor and that the min_set is clearly more appropriate. Remarkably, the first and second transitions in ribosomal evolution occurred before the youngest age of min_set FSFs, 3.04 and 2.41 Ga ago, respectively, suggesting efficient ribosomal protein synthesis was pre-requisite for organismal diversification and emerged prior to the aerobic metabolism and the start of planet oxygenation (Figure 8).
To date there are no known exceptions to a G1P backbone chemistry in Archaea and a G3P backbone chemistry in the other two superkingdoms. However, some extremophilic bacteria also contain membrane ether lipids (the typical archaeal trait), including di-glycerol ether lipids, tetraether non-isoprenoid lipids, and mixed ester-ether lipids, but they are all of the sn1,2 kind (reviewed in ).
Since DGGGPS is not stereospecific, the chirality of the isoprenoid ether lipids in Archaea appears entirely determined by GGGPS . Remarkably, the 6-phosphogluconate dehydrogenase C-terminal domain-like (a.100.1; nd FSF = 0.110) FSF of the bacterial and eukaryal G3PDH (a.100.1.6; nd FF = 0.233) and the FMN-linked oxydoreductase domain (c.1.4; nd FSF = 0.114) FSF of the GGGPS (c.1.4.1; nd FF = 0.041) necessary for downstream membrane lipid biosynthesis in Archaea were present in the urancestral min_set and appeared in the timeline of domain architectures quite early in evolution. A min_set G3DPH suggests primordial sn1,2 fatty acid ester lipids of some kind were already present in the urancestor. In turn, a min_set GGGPS enzyme harboring the only TIM β/α-barrel fold-containing enzyme with a prenyltransferase function  suggests chirality-enabling enzymatic activities downstream of G1DPH were already present in the urancestor.
We note that CDP-archaeol synthase activity, which is downstream of G1PDG, does not require specificity for ester or ether bonds nor of glycerol phosphate enantiomers  and that full conversion of bacterial-like lipids to archaeal-like lipids requires the crucial discovery or recruitment of G1PDH metalloenzyme activities necessary for the production fo the G1P backbone . The urancestral GGGPS catalyses the first CDP-archaeol biosynthesis pathway-specific step of isoprenoid ether lipids in Archaea but displays a strong preference for the G1P substrate . G1PDH does not yet have a crystallographic structural entry. However, molecular modeling suggests the enzyme was derived from glycerol dehydrogenase (EC 18.104.22.168; e.22.1.2; nd FSF = 0.288; nd FF = 0.191) , which is present in the three superkingdoms and has the dehydroquinate synthase-like domain (e.22.1) FSF. A total of 110 UniProt entries with EC 22.214.171.1241 functions were analyzed with HMMs of structural recognition. All sequence entries were indeed assigned the e.22.1 FSF. Two entries had also the HAD-like domain (c.108.1). While this analysis confirms the original structural assingment , we find the G1PDH FSF was not part of any of the urancestral FSF sets. This strongly suggests isoprenoid ether lipids derived from G1P were not present in the urancestor. Instead, and as suggested by Glansdorff et al. , sn1,2 isoprenoid ether lipids were probably synthetized by the activity of the urancestral GGGPS, which at that time could have exhibited a preference for G3P. Recruitment of a G1PDH in emerging archaeal lineages displaced the possible use of G3P by GGGPS and enabled the synthesis of sn2,3 isoprenoid ether lipids using the G1P backbone. It is significant that the molecular clock of FSFs established that the GGGPS c.1.4 FSF appeared 3.27 Ga ago and that the G1PDH e.22.1 FSF appeared 2.45 Ga ago at a time that coincides with the Great Oxidation Event (arrows, Figure 8). In this regard, cyclic and acyclic phytanes and biphytanes, which are present in sediments and petroleum in 2.7-Ga-old metasedimentary rocks in several parts of the world [80, 81] and are biomarkers of methanotrophic pelagic microbes, are thought to derive from archaeol and caldarchaeol molecules of Archaea . The existense of GGGPS prior to this date supports the existence of a primordial CDP-archaeol biosynthetic pathway and of ether and ester membrane lipids in the urancestor, prior to the loss of the first FSF in a superkingdom (Archaea) that marks the start of the tripartite world and the earliest date of organismal diversification (~2.9 Ga ago; Figure 8). The synthesis of an enantiomeric alternative of G3P in primordial archaeal lineages, 800 million years later and during late planet oxygenation, provided the proper molecular chirality and backbone necessary for cell membrane diversification. Our findings are important and support the proposal that the urancestor had sn1,2 ester and ether fatty acid lipids and that discovery and recruitment of new enzymatic activities resulted in the synthesis of sn3,4 isoprenoid ether lipids and the emergence of thermophilic archaeal lineages . Under this scenario, the ester and ether lipids in the urancestor provided already adaptations to adverse conditions (high temperature, pressure, etc), a trend that was later exploited by the emerging Archaeal lineages in a primordial quest towards extremophily.
The history of proteomes and the protein world was reconstructed from a census of protein domain structure at FSF level of structural complexity in hundreds of FL proteomes. Applying an iterative approach of character state reconstruction, we identified the most parsimonious repertoire of domain structures that was present in the urancestor. Upper and lower bounds of the repertoire defined conservative limits for the diversity of domains and associated functions and showed the uracestor had a functionally complex but relatively simple repertoire, with numbers of FSFs probably orders of magnitude below that of extant organisms. Since the lower bound is more compatible with a molecular clock of FSFs and the fossil record, we can make reasonable inferences about the molecular make up of this primordial organism. The urancestor had an advanced metabolic network, especially rich in nucleotide metabolism enzymes, had primordial pathways for the biosynthesis of membrane glycerol ether and ester lipids, crucial elements of translation, including aRSs, regulatory factors, and a primordial ribosome with protein synthesis capabilities. It lacked however transcription and in advanced evolutionary stages stored genetic information in RNA (not DNA) molecules. As this ancient organism expanded its protein biosynthetic functions ~3 Ga ago, it added crucial ribosomal proteins that enhanced the reliability and processivity of the ribosome and crucial enzymes that diversified its membrane lipid make up. We here propose the enhancement of the primordial ribosomal machinery and the make up of the cellular containment, which coincided with the rise of planetary oxygen, enabled the rise of lineages and a truly diversified world of organisms.
We downloaded SUPERFAMILY ver. 1.73  MYSQL database that assigns protein architectures to sequences in the genomes of over 1,000 organisms, 645 of which have been completely sequenced. Manual inspection of organismal lifestyles showed there were 48 archaeal, 239 bacterial, and 133 eukaryotic FL species (420 in total), 71 bacterial and 22 eukaryotic species that were parasitic (93 in total), and 1 archaeal, 111 bacterial, and 20 eukaryotic species that were obligate parasitic (132 in total) in the 645 set. According to our definitions, only one parasitic/obligate parasitic organism was present in Archaea, Nanoarchaeum equitans. Given this information, we excluded genomes from all but the 420 FL organisms. SUPERFAMILY has built HMM libraries for SCOP-defined FSFs. Proteomes deposited in the database were searched against the HMM libraries using the iterative Sequence Alignment and Modeling System (SAM) method , which has generated FSF assignments covering on average ~ 60% of amino acid residues of individual proteomes . From the local MYSQL, we obtained all FSFs assignments for each of the 645 proteomes. The E-value of 10-4 has been known as an ideal cutoff to minimize rate of false positives in the HMM searches . FSF assignments that fulfill the E-value cutoff were extracted from the individual proteomes. Domain structures were identified with SCOP concise classification strings (ccs) (e.g., c.37.1.12, where c represents the protein class, 37 the F, 1 the FSF, and 12 the FF).
In this study, phylogenomic trees describing the evolution of proteomes are reconstructed using a census of the abundance of FSF domains in proteomes  instead of simply their occurrence . We have shown that a comparison of the two methods produces phyletic patterns that are congruent [35, 42]. Since there has been a hypothetical agreement that genetic components of the urancestor were redundant and versatile [6, 85], we chose to build trees of proteomes derived from FSF abundance to identify both the FSF set and the levels of reuse of these FSF domains. This is not possible with an FSF occurrence-based approach.
In this equation, g ab describes the g value of FSF a in proteome b and g max denotes the maximum g value in the matrix. The round function normalizes g for particular FSF in a proteome relative to its maximum value, and standardizes values to a 0-23 scale. The 24 transformed values represent character states and were linearly ordered and encoded using an alphanumeric format of numbers 0-9 and letters A-N that are compatible with PAUP* ver. 4.0b10 . The matrix was used to build trees of proteomes by defining taxa and characters as proteomes and FSFs, repectively. The matrix was also transposed to build trees of FSF architectures, with taxa and characters representing FSFs and proteomes, repectively. Trees of proteomes were built by polarizing character states from '0' to 'N' using the ANCSTATES command of the PAUP*, with '0' being primordial. Trees of FSF architectures were built by polarizing states from 'N' to '0', with 'N' being the most ancient. These trees were rooted without invoking outgroup taxa using the Lundberg method ; the most ancient proteome or architecture was positioned at the base of their corresponding trees. Our evolutionary model considers that the abundance and diversity of individual FSF architectures increases progressively in nature, even when expanding FSFs suffer loss in individual lineages, and even when FSFs are selectively or differentially constrained during evolution. Consequently, we consider ancient architectures are more abundant and widely present in the protein world than younger ones, supporting the polarization from 'N' to '0'. We also assume that proteomes have built their FSF repertoires progressively in evolution, increasing both the diversity and abundance of the FSF make up. Genomes that are ancient developed their repertoires earlier from a pool of FSF architectures that was comparatively simpler. Consequently, their repertoires are today simpler than those that developed their repertoires more recently from a more complex and diverse pool. This supports the '0' to 'N' polarization. We note that convergent evolution (e.g. due to HGT) and rampant expansion of genes by gene or genome duplications in specific lineages can bias architectural complexity, especially in ancient organisms. While the extent of these effects remains controversial [6, 8], convergent evolution of domains appears to be rare [37, 38] and the effect of HGT to be limited at this high levels of hierarchical complexity [39, 45, 64]. Moreoever, trees of proteomes match for the most part established organismal classification (e.g., [35, 42]). We therefore postulate that these evolutionary mechanisms do not overwhelm the vertical inheritance of protein architectures along individual organismal lineages. Details on character argumentation and absence of circularity in assumptions were comprehensively discussed in previous publications and are not developed further here [35, 36, 42, 47].
Phylogenomic trees were reconstructed from the transformed and polarized matrices using the maximum parsimony method in PAUP* with 1,000 replicates of random taxon addition, tree bisection reconnection (TBR) branch swapping, and maxtrees unrestricted. The trees generated were rooted by the Lundberg method, which does not require the need of an outgroup taxon. Phylogenetic confidence was evaluated by the nonparametric bootstrap method with 1,000 replicates (resampling size matches the number of the genomes sampled; TBR; maxtrees, unrestricted). The consensus of most parsimonious proteome trees was obtained using the Python library SumTrees with the option of 50% majority-rule . The degree of phylogenetic signal for taxa was measured using the skewness (g 1 ) test with a tree length distribution obtained from 1,000 random trees .
Since trees are rooted and are highly unbalanced, we unfolded the relative age of protein domains directly for each phylogeny as a distance in nodes (node distance, nd) from the hypothetical ancestral architecture at the base of the trees in a relative 0-1 scale. Given a rooted tree, we calculated nd by counting the number of internal nodes along a lineage from the root to a terminal node (a leaf) of the tree on a relative 0-1 scale with the following equation: nd a = (# of internal nodes between nodes r and a)/(# of internal nodes between nodes r and m), where a means a target leaf node, r is a hypothetical root node, and m is a leaf node that has the largest possible number of internal nodes from the node r. Consequently, the nd value of the most ancestral taxon is 0 while that of the most recent one is 1. nd can be a good measure of age given a rooted tree since the emergence of protein domains (i.e. taxa) is displayed by their ability to diverge (cladogenesis or molecular speciation) rather than by the amount of character state change that exists in branches of the tree (branch lengths).
Phylogenetic characters (FSFs) defining the root of the tree of proteomes can be used to identify the urancestral repertoire of protein domains. However, FSFs appearing after the advent of the diversified world (the 'modern effect') were by definition absent in the urancestor and must be excluded to avoid bias in the identification of urancestral FSFs. We developed an iteration analysis to refine the urancestral set of FSFs. First, we extracted FSFs positioned at the root branch of the tree of proteomes, which was reconstructed prior to each step of iteration. As an initial character set, the FSFs were imported into each of 30 chains, each of which proceeded with the following steps: (i) reconstruct a proteome tree; (ii) extract FSFs that are in the root branch; (iii) update the character set with the FSFs; (iv) iterate steps 1 to 3 a total of 50 times.
We then gathered all of the most parsimonious trees that were generated from the 1,500 iterations, and selected one tree with minimal tree length and the smallest set of FSF domains. Prior to iteration analysis, simulation with hundreds of iterations per chain showed that decreases of tree length and the number of the plesiomorphic FSFs become stationary after ~40 iterations. FSF sets in the root branches were progressively stabilized as iterations progressed in individual chains and ambiguous plesiomorphic FSF assignments were reduced. In addition, multiple chains made it possible to explore as many parsimony islands that were possible to avoid the risk of the search of optimal trees being trapped by sub-optimal portions of a given tree space .
Nearly 80% of FSFs defined in SCOP have more than one family . Because a single family can be regarded as a functionally orthologous unit , FSFs with more than one family can be considered multi-functional. For example, the P-loop-containing NTP hydrolase (c.37.1) is the most abundant FSF in nature and has more than 20 distinct families, most of which are involved in tens of metabolic pathways . In spite of the difficulty of assigning functions to FSFs, Vogel and Chothia  suggested a coarse-grained functional classification for known FSFs based on information in various resources, including COG and GO databases, and literature surveys. When an FSF had multiple families, the most dominant function was assigned to the FSF. While this coarse-grained classification cannot provide a complete functional profile for individual FSFs, the classification is highly suitable to infer the originating functions of FSFs because the most ancient function of a FSF tends to be the most abundant and widely present in contemporary proteomes. For urancestral FSFs uncovered by the iteration analysis, we annotated their function using the coarse-grained functional classification. Although it is likely that functional recruitments or take-overs had arisen during evolution of the urancestral lineage, we do not consider the role of these processes. Their dissection is per se a difficult problem.
Observed values M and k indicate the number of FSFs that are assigned to a sub-category in the background and sample, respectively. The values N and n are the total number of FSFs of the two sets regardless of the functional classification. The probability P(X = k) implies the chance that a random variable X has k FSFs for a given functional category. Referring to the equation and previous literature , we calculated P values for the individual sub-categories that have k/n larger than M/N, and evaluated statistical significance with 95% confidence level (P < 0.05).
Enzymes that are related to metabolic pathways of glycerol cell membrane lipids were identified using the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/). We retrieved UniProt Knowledgebase Identifiers (UniProtKB IDs) from KEGG, the Enzyme Commission (EC) numbers of the enzymes. Protein sequences that correspond to the UniProt IDs were obtained from the UniProt (http://www.uniprot.org/). To assign SCOP protein domains at the FSF level of classification, we searched the protein sequences of the individual enzymes against HMMs of total FSFs that are provided by the locally downloaded SUPERFAMILY database. The FSFs were assigned to given protein sequences using the SAM method with an E-value cutoff of 10-4.
horizontal gene transfer
last universal common ancestor
Structural Classification of Proteins.
We thank Dr. Liudmila S. Yafremava for helping with assignments of organismal lifestyles and members of the GCA research team for valuable comments. Research was supported by the National Science Foundation (MCB-0749836), CREES-USDA, and the International Atomic Energy Agency in Vienna. Any opinions, findings, and conclusions and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.