Chaperonin genes on the rise: new divergent classes and intense duplication in human and other vertebrate genomes

Background Chaperonin proteins are well known for the critical role they play in protein folding and in disease. However, the recent identification of three diverged chaperonin paralogs associated with the human Bardet-Biedl and McKusick-Kaufman Syndromes (BBS and MKKS, respectively) indicates that the eukaryotic chaperonin-gene family is larger and more differentiated than previously thought. The availability of complete genome sequences makes possible a definitive characterization of the complete set of chaperonin sequences in human and other species. Results We identified fifty-four chaperonin-like sequences in the human genome and similar numbers in the genomes of the model organisms mouse and rat. In mammal genomes we identified, besides the well-known CCT chaperonin genes and the three genes associated with the MKKS and BBS pathological conditions, a newly-defined class of chaperonin genes named CCT8L, represented in human by the two sequences CCT8L1 and CCT8L2. Comparative analyses from several vertebrate genomes established the monophyletic origin of chaperonin-like MKKS and BBS genes from the CCT8 lineage. The CCT8L gene originated from a later duplication also in the CCT8 lineage at the onset of mammal evolution and duplicated in primate genomes. The functionality of CCT8L genes in different species was confirmed by evolutionary analyses and in human by expression data. Detailed sequence analysis and structural predictions of MKKS, BBS and CCT8L proteins strongly suggested that they conserve a typical chaperonin-like core structure but that they are unlikely to form a CCT-like oligomeric complex. The characterization of many newly-discovered chaperonin pseudogenes uncovered the intense duplication activity of eukaryotic chaperonin genes. Conclusions In vertebrates, chaperonin genes, driven by intense duplication processes, have diversified into multiple classes and functionalities that extend beyond their well-known protein-folding role as part of the typical oligomeric chaperonin complex, emphasizing previous observations on the involvement of individual CCT monomers in microtubule elongation. The functional characterization of newly identified chaperonin genes will be a challenge for future experimental analyses.


Background
Hsp60-like chaperonin proteins are well known for their role in assisting protein folding and in protecting cells from the deleterious effects of stress [1][2][3][4][5]. The eukaryotic cell expresses representatives of two distinct groups of chaperonin genes that are otherwise typical of bacteria (Group I) or archaea (Group II). In eukaryotes, Group I chaperonins are mostly expressed in mitochondria and chloroplasts, and Group II chaperonins are found in the eukaryotic cytosol [1,[6][7][8][9][10]. Chaperonin proteins form typical multi-subunit double-ringed structures collectively called "chaperonins" [9][10][11][12][13]. The Group I chaperonins are typically formed by the products of a single gene (groEL in bacteria; hsp60/cpn60 in mitochondria) assembled into a 14-subunit double-ringed structure in bacteria and into a double or single-ringed structure in mitochondria [14]. Eukaryotic Group II chaperonin proteins assemble in a similar double-ringed oligomeric structure, called TRiC or CCT complex [15], composed of 16 subunits that in human are encoded by nine distinct genes (tcp1/cct1, cct2-5, cct6A-B, cct7-8) [8][9][10]. The CCT complex is mostly known for its role in folding the cytoskeleton proteins actin and tubulin [7,16] and mutations in individual CCT subunits lead to defects in the functioning of the cytoskeleton and mitosis arrest [17].
As for other chaperones, the malfunctioning of chaperonin proteins has been associated with various human pathological conditions, the chaperonopathies [18][19][20]. In this respect, besides the canonical cct and cpn60 genes described above, three divergent hsp60-like genes have been more recently identified [21][22][23] in association with pathological conditions. One gene, MKKS [21], was named for its association with the developmental disease McKusick-Kaufman Syndrome and was soon after also identified as BBS6 [24] for its association with the Bardet-Biedl Syndrome (BBS), another developmental condition involving ciliumrelated dysfunction [25]. More recently two other hsp60-like BBS genes, named BBS10 [22] and BBS12 [23], have been identified among fourteen genes (BBS1 to BBS14) so far associated with BBS. The protein products of MKKS/BBS6, BBS10 and BBS12 localize to the basal body of cilia and to the centrosome [26][27][28]. We will hereafter refer to the MKKS/BBS6 gene as MKKS, and collectively to the three hsp60-like BBS genes as the "BBS genes". The identification of these genes provides new perspectives on the spectrum of functionalities of Hsp60-like proteins in eukaryotes and on their role in development.
The recognition of chaperonopathies has increased the importance of elucidating the entire set of chaperone genes present in the human genome [19]. The work reported here was conceived to: a) identify all Hsp60like sequences encoded in the human and other genomes including all diverged chaperonin genes; b) reconstruct the evolutionary origins and relations of diverged chaperonin genes; c) distinguish with bioinformatics methods functional genes from pseudogenes; d) characterize structural properties of the corresponding proteins. We mostly devoted our attention to the characterization of the evolutionary history and structural properties of newly or recently identified sequences, referring the reader to the vast amount of published literature for information on functional/structural properties and the evolutionary history of mitochondrial Cpn60 or CCT-complex proteins.
Exhaustive searches of hsp60-like sequences were carried out in human and other genomes following and extending our "chaperonomics" methodological protocol [29]. The extensive analysis of the genomes of human and other vertebrate species lead to the identification and characterization of many previously unknown sequences and to the discovery of a new, mammal-specific class of chaperonin proteins. Classification, evolutionary analysis and structural characterization of diverged chaperonin-like sequences should provide valuable information for future studies on the functional roles of these proteins.

Chaperonin sequences in the human genome
To identify all human hsp60-like sequences we queried the human genome using the nine human CCT subunit and mitochondrial Cpn60 sequences. Analogous extensive searches were performed in the mouse and rat genomes using corresponding queries. In the human genome, we found a total of 54 sequences with significant similarity to Hsp60 proteins (Tables 1 and 2). Fifteen sequences had a NCBI Entrez [30] gene descriptor assigned. Nine of these corresponded to the canonical CCT-subunit sequences and one, HSPD1, encoded the mitochondrial Cpn60 protein. Three sequences corresponded to the BBS genes MKKS, BBS10 and BBS12. We recovered two additional uncharacterized sequences designated in the NCBI Entrez Gene database as CCT8L1 and CCT8L2. Besides these complete Hsp60like sequences, a sequence domain conserved across eukaryote species with highest similarity to the apical domain of the CCT3 protein has also been reported in PIKFYVE [31], a kinase belonging to the Fab1p protein family involved in corneal pathological conditions [32]. In addition, we identified 39 other human hsp60 sequences that did not correspond to a gene descriptor in the NCBI Entrez Gene database ( Table 2). All of these sequences contained in-frame stop codons or frame-shifts, suggesting that they were most likely pseudogenes. Thirty-five of these had not been described in the Pseudogene.org pseudogene database [33] and 33 were not listed in the Ensembl database [34], and are here annotated and classified for the first time. In analogous searches of the complete genomes of mouse and rat, we identified in each genome 14 chaperonin genes (nine for the canonical CCT monomers, one for the mitochondrial Cpn60, three BBS genes and one CCT8L gene), 38 pseudogenes in mouse and 61 pseudogenes in rat (see additional file 1: Table S1, for mouse sequences, and additional file 2: Table S2, for rat sequences).

Evolutionary origins of human BBS and CCT8L genes
A maximum-likelihood (ML) phylogenetic tree of human chaperonin-like proteins (Figure 1a) indicated that Hsp60-like BBS proteins are monophyletic (bootstrap support 86%) and that their common ancestor derived from a duplication event in the CCT8 lineage (bootstrap support 88%). The tree also showed that the unique ancestor of the two closely related genes CCT8L1 and CCT8L2 also originated in the CCT8 lineage from a more recent duplication event (bootstrap support 75%). The relation of BBS and CCT8L proteins with the CCT8 chaperonin subunit was confirmed with strong conditional probability support (0.99) by Bayesian tree construction (Figure 1b).
Although the association of BBS and CCT8L proteins with the CCT lineage was robustly supported, the high divergence of these sequences could produce clustering in the trees due to long-branch attraction. To address this concern, we built independent ML trees for each BBS or CCT8L sequence adding them separately to the tree of CCT subunits. All individual trees confirmed with strong bootstrap support the association of each BBS or CCT8L lineage with the CCT8 lineage (see additional file 3: Figure S1, additional file 4: Figure S2, additional file 5: Figure S3 and additional file 6: Figure S4). A ML evolutionary tree including hsp60-gene homologs found in the genomes of eighteen other vertebrate species, including representatives of several mammals, chicken, frogs, and fish, also confirmed the origin of BBS and CCT8L genes from the CCT8 lineage (see additional file 7: Figure S5).
We did not find CCT8L genes in the genomes of chicken, Xenopus laevis, or Danio rerio, representatives respectively of the reptile/bird, amphibian and fish lineages. However, among mammals we identified orthologs of CCT8L genes in genomes not only of placental mammals (Eutheria), but also of the marsupial opossum (Metatheria) and of the egg-laying platypus (Prototheria), suggesting that the CCT8L gene class originated at the onset of mammal evolution. All CCT8L gene orthologs were intron-less, indicating that their ancestor originated from a retro-transposition event. Two copies of CCT8L sequences were found in human and chimp and one CCT8L gene in all other genomes examined, including those from the other primate rhesus monkey (Macaca mulatta) and gray mouse lemur (Microcebus murinus) (Figure 2), suggesting that a duplication of the CCT8L gene occurred in Hominoidea after their separation from old world monkeys. However, the lone gene copy of CCT8L identified in rhesus monkey clustered with CCT8L1 in evolutionary trees (Figure 2), suggesting an earlier duplication of the gene and successive loss of the CCT8L2 copy from the genome of rhesus monkey. Close inspection of protein alignments revealed that the rhesus monkey CCT8L sequence included an anomalously diverged segment of about 50 amino acids of uncertain alignment. Excluding this segment from the analysis we obtained a different and more robustly supported tree topology (75% vs. 20% bootstrap value, see additional file 8: Figure S6, panels a and b), consistent with a later duplication of the CCT8L gene in Hominoidea. The tree also indicated that the removed segment was alone responsible for the overall higher evolutionary rate predicted for this sequence (see additional file 8: Figure S6).  Pseudogene names follow the HUGO nomenclature. They are composed of the name of the parental gene followed by a unique number identifier and the suffix "P" (Pseudogene); 2 Start and 3 End positions of the pseudogene on the chromosome; 4 Strand; 5 Chromosome; 6 Location on the chromosome; 7 Number of exons. A question mark indicates gene fragments with uncertain numbers of exons; 8 Processed (P), duplicated (D) or undetermined (?); 9 Ratio of nonsynonymous vs. synonymous substitution rates; 10 Likelihood Ratio Test (LRT) values. Values different from 1.0 with probability p < 0.01 (**) or p < 0.05 (*) are shown in bold-face; 11 Number of Frame-Shifts recognized in the coding region of the pseudogene; 12 Number of in-frame Stop Codons recognized in the coding region of the pseudogene; 13 Length in amino acids of pseudo-translation of the recognized pseudogene sequence; 14 Ten pseudogenes previously reported in the Ensembl (roman), Pseudogene.org (italics) or NCBI (bold) databases: CCT1-3P = OTTHUMG00000033751; CCT5-1P = Human.chr13.mb78; CCT6-5P = ENSP00000275603, Human.chr7.mb64; CCT7-1P = ENST00000399032; CCT8-1P = Human.chr1.mb145; HSPD1-1P = ENSG00000162241, Human.chr5.mb135; HSPD1-2P = ENSP00000328369; HSPD1-5P = LOC644745; HSPD1-6P = LOC645548; HSPD1-14P = OTTHUMG00000016753; 15,16,18 Tandemly duplicated; 17 Previously identified as Hsp60s2 (Hsp60 short form 2).

Differentiation rate of BBS and CCT8L proteins
The branch lengths of the trees shown in Figure 1 indicate that BBS and CCT8L proteins have differentiated at much higher rates than CCT subunits. We applied a newly-developed, unbiased measure of differentiation called "B-index" (see Methods) to calculate differentiation of MKKS, BBS10 and BBS12 proteins from their respective last ancestor common to Actinopterygii (rayfinned fishes) and Sarcopterygii (including tetrapods), determined by rooting the trees with CCT8 proteins from corresponding fish and tetrapod species. Similarly, we calculated differentiation of CCT8L proteins from a eutherial ancestor rooting their tree with corresponding sets of CCT8 proteins (see footnotes of Table 3 and legend for Figure 2 for species represented in each tree). We estimated for the MKKS family an average evolutionary distance from their root of almost 0.7 substitutions per site, corresponding to a 6-fold higher rate of differentiation compared to the number of substitutions estimated in CCT8 proteins over the same period of time. For BBS10 and BBS12, we calculated a distance of about 1.0-1.2 substitutions per site, corresponding to a substitution rate about 8-10 times higher than in CCT8. Finally, for the mammal-specific family of CCT8L proteins, we estimated an evolutionary distance from their mammal root of about 0.3 substitutions per site. The smaller divergence of CCT8L proteins compared to BBS proteins reflects the more recent origin of the CCT8L gene. However, when scaled to the evolution of CCT8 sequences over the same periods of time, the substitution rate of CCT8L proteins was about 14-15 times higher than in CCT8 and 1.4-2.3 times higher than in BBS proteins.

Functional constraints in the evolution of CCT8L genes
We tested functionality of CCT8L genes from several species estimating ratios of non-synonymous and synonymous substitution rates (Ka/Ks) along their respective lineages (see Methods). The results of this analysis are shown in Table 4, which indicates the gene(s) analyzed (foreground), the two genes used to identify foreground and background branches, the estimated Ka/Ks values and their significance. The evolutionary lineages for which Ka/Ks values were evaluated correspond to the branch numbers identified in the overall tree topology shown in Figure 3. In this tree are represented the "molecular tree" of mammal phylogenetic relations [35], the gene duplication event involving the CCT8L gene family in primates as inferred by this analysis, and the premammal separations of the CCT7, CCT8 and CCT8L families of paralogs. This topology is in agreement with the evolutionary tree of CCT8L genes ( Figure 2) with the only exception of the weakly supported position of the CCT8L sequence from rhesus monkey (see above). The highly significant constraints in non-synonymous substitution rates (Ka/Ks < 1.0) estimated in the overall evolution of the CCT8L family (Table 4, foreground genes: "All CCT8L1/2") indicated that the CCT8L sequences are genes generally expressing functional proteins. In evaluating Ka/Ks ratios for individual CCT8L gene lineages (Table 4), significantly constrained evolution (Ka/Ks < 1.0) was detected for branches leading to most sequences, including those of murids, lemur, cow, dog, elephant, marsupial, and to the human CCT8L1 and CCT8L2 group along the hominoid lineage. Constrained evolution was also estimated for the CCT8L genes of armadillo and rhesus monkey, and for human CCT8L1 and human and chimp CCT8L2 after divergence of human and chimp, although in these cases Ka/Ks values did not reach significance. In the cases of the human and chimp CCT8L1 and Figure 2 Evolutionary tree of CCT8L sequences. ML tree of CCT8L sequences from various mammal genomes. The homolog of human CCT8L1 in chimp (Ptr) is characterized as pseudogene and is shown in bold-italics font. Species abbreviations: Bt, Bos taurus (cow); Cf, Canis lupus familiaris (dog); Dn, Dasypus novemcinctus (nine-banded armadillo); Dr, Danio rerio (zebrafish); Ec, Equus caballus (horse); Ga, Gasterosteus aculeatus (stickleback, fish); Gg, Gallus gallus domesticus (chicken); Hs, Homo sapiens (human); La, Loxodonta africana (african bush elephant); Md, Monodelphis domestica (south american gray short-tailed opossum, marsupial); Mm, Mus musculus (mouse); Mmu, Macaca mulatta (rhesus monkey); Mmur, Microcebus murinus (gray mouse lemur); Oa, Ornithorhynchus anatinus (platypus); Ol, Oryzias latipes (the medaka or japanese killifish); Pp, Pongo pygmaeus (northwest bornean orangutan); Ptr, Pan troglodytes (chimpanzee); Rn, Rattus norvegicus (rat); Tn, Tetraodon nigroviridis (spotted green pufferfish); Tr, Takifugu rubripes (japanese pufferfish); Xl, Xenopus laevis (african clawed frog, amphibian); Xt, Xenopus tropicalis (western clawed frog, amphibian). The scale bar represents the indicated number of substitutions per position for a unit branch length.  5 4.0300 5.0987 6.0623 1.0822 Average D ij (D B ) 6 2.0951 2.9660 3.2858 0.8017

CCT8 7
Size (W C ) 5. Average D ij is the average pair-wise evolutionary distance of the sequences; 7 Estimates for CCT8 were computed over corresponding species represented by the sets of MKKS, BBS10, BBS12 or CCT8L proteins (see footnote 2, above).    (Table 4).
CCT8L2 genes, the lack of significance can be related to the loss of power of the test since few mutations accumulated after separation of these sequences (see additional file 9: Table S3). In the case of rhesus monkey CCT8L, we found that its relatively high estimate of Ka/Ks (= 0.73) was due to the previously mentioned 50-amino-acid diverged region within this sequence. After removing this region we estimated Ka/Ks = 0.55. Only for the lineage of chimp CCT8L1 we estimated Ka/Ks ≅ 1, consistent with differentiation of a non-functional sequence. Since this sequence was also characterized by an internal stop codon and a frame-shift, all evidence strongly suggests that chimp CCT8L1 is a pseudogene. To assess the functionality of human CCT8L sequences we investigated their expression profiles in comparison to those of human CCT monomers and BBS genes (see additional file 10: Table S4). Expression of CCT8L2 was confirmed by fifteen ESTs mostly identified from the testis, whereas only one EST identified as a CCT8L1 transcript has been so far reported (NCBI UniGene database, November 20, 2009). Querying the NCBI GEO microarray database, we found 542 expression-profile records identifying expression of CCT8L2, and none identifying expression of CCT8L1 (as of November 20, 2009). It must be noted, however, that CCT8L2 and CCT8L1 have similarity of 97.3% at the DNA level. Similarly to CCT8L2, another mammal-specific chaperonin gene, CCT6B, is also expressed almost exclusively in the testis, from which 160 ESTs have been reported versus an average of 4.4 ESTs (from 0 to 10 per tissue) found in all other tissues.

Pseudogenes
We identified in the human genome 39 sequences with significant similarity to CCT or HSPD1 genes that either were short fragments or were characterized by in-frame stop codons or frame-shifts. Based on their corruption, we classified these sequences as pseudogenes (Table 2). Similarly, searching the mouse and rat genomes we identified 38 and 61 pseudogenes, respectively (see additional file 1: Table S1 and additional file 2: Table S2). Most of these sequences have not been previously reported and are here systematically annotated and classified for the first time.
Based on phylogenetic-tree reconstructions (see additional file 11: Figure S7) or on similarity for the most corrupted sequences, we identified the association of 17 pseudogenes from human, 16 from mouse and 29 from rat with one of the nine CCT genes. None of the pseudogenes were related to MKKS, BBS10, BBS12 or CCT8L. To estimate the time of origin of the pseudogenes, we constructed trees using their translated sequences and chaperonin subunits from various vertebrate species (see additional file 12: Figures S8, and additional file 13: Figure S9). The trees indicated that all recognizable human CCT pseudogenes originated in the mammal lineage after separation from the reptile/bird lineage.
Of particular interest were the evolutionary relations of CCT6 genes and pseudogenes. Two CCT6 gene copies (CCT6A and CCT6B) were found, besides placental mammals, also in platypus and in opossum (see additional file 11: Figure S7), suggesting that the duplication of the CCT6 gene occurred in mammal evolution before separation of Theria (marsupial and placental mammals) and Prototheria (monotremes). We constructed an evolutionary tree of mammal CCT6 genes and pseudogenes (Figure 4) rooted by the corresponding gene sequences from chicken and frog (the diverged sequence Oa_con2651 from platypus was excluded from this tree to avoid long-branch attraction). Surprisingly, all recognizable human, mouse, and rat pseudogenes belonging to the CCT6 class branched in the tree from the CCT6A lineage after separation of the platypus, marsupial and placental mammal lineages.
Twenty-two pseudogenes in human (Table 2), and 22 and 32 pseudogenes in mouse and rat, respectively (see additional file 1: Table S1 and additional file 2: Table  S2), associated with the mitochondrial HSPD1 gene (Group I cpn60). Evolutionary trees incorporating all pseudogenes from different vertebrate species were uninformative due to the presence among the pseudogenes of highly corrupted sequences, resulting in extensive long-branch attraction (not shown). An ML tree built using only translations of the most conserved pseudogenes ( Figure 5) showed weakly supported but consistent association of the human pseudogenes with HSPD1 from primates, whereas pseudogenes from mouse and rat all associated with murid Hspd1 sequences, also indicating their relatively recent origin.

Ka/Ks ratio in the evolution of putative pseudogene sequences
Our characterization of many hsp60 sequences as pseudogenes was based on the presence of signs of corruption in the sequence (in-frame stop codons and frameshifts). However, in-frame stop codons and frame-shifts may correspond to truncated proteins that are still functional. For example, although human HSPD1-5P and HSPD1-6P sequences contain signs of sequence corruption, EST data indicate that these sequences are expressed and possibly functional (see additional file 14: Table S5). To confirm our characterization, we estimated Ka/Ks ratios in trees that identified the pseudogene-sequence lineage (branch) including as out-group its parental gene and the orthologous gene sequence from chicken (see Methods). The results of these analyses (Table 2) showed in most cases Ka/Ks values not significantly different from 1.0, as expected in the differentiation of pseudogene sequences not constrained by coding of functional amino acids. Significant differences in mutation rate were estimated in the case of four sequences. These sequences, however, contained multiple in-frame stop codons and frame-shifts (Table 2).

Structural features of BBS and CCT8L proteins
Because of their high sequence divergence, it is unclear whether BBS and CCT8L Hsp60-like proteins conserve the typical fold of chaperonin subunits and their ability to assemble into typical oligomeric chaperonin complexes. Chaperonin monomers are characterized by three structural domains (apical, intermediate and equatorial) with distinct functional roles and it was relevant to investigate whether BBS and CCT8L proteins conserve each of the domains typical of chaperonins. Experimental models of eukaryotic Group II Figure 4 Evolutionary tree of vertebrate CCT6 proteins. ML tree of CCT6 proteins from mammals, chicken, and frog (in roman font) and translated sequences of the related pseudogenes from human, mouse, and rat (in bold-italics font). Only one copy of CCT6 was found in chicken and frog. Two copies, CCT6A and CCT6B, were found in all mammals examined, including marsupial (Md) and platypus (Oa). The CCT6 sequences from chicken (Gg) and from the two amphibians Xenopus laevis (Xl) and Xenopus tropicalis (Xt) were used to root the tree. All human, mouse, and rat pseudogenes clustered with the CCT6A sequences. Numbers next to branches indicate percent bootstrap values. Only bootstrap values > 30% are shown. For all species abbreviations see the legend for Figure 2. The scale bar represents the indicated number of substitutions per position for a unit branch length.
chaperonins are not available but their structural properties can be inferred by comparison with their closest relative, the archaeal thermosome. To infer tertiarystructure conservation in BBS and CCT8L proteins we predicted the secondary structure for each family from alignments of multiple sequences, excluding structure and sequence information from other families. The results of these predictions are schematically represented in Figure 6a, in relation to the secondary structure description of the PDB structure 1a6d chain A of the thermosome subunit ThsA from Thermoplasma acidophilum [36] (see additional file 15: Figure S10, additional file 16: Figure S11, additional file 17: Figure S12, additional file 18: Figure S13, additional file 19: Figure S14, and additional file 20: Figure S15 for detailed representations of multiple alignments, secondary structure predictions and alignments to the secondary-structure elements of ThsA). In Figure 6a, the secondary structure Figure 5 Evolutionary tree of vertebrate mitochondrial Cpn60. ML tree of mitochondrial Cpn60 proteins from mammals, chicken, and frog (in roman font) and translated sequences of the related pseudogenes from human, mouse, and rat (in bold-italics font). Highly degraded pseudogenes for which only fragments could be detected were not considered. Human pseudogenes clustered with primate Cpn60 sequences whereas mouse and rat pseudogenes clustered with rodent counterparts, indicating independent evolution of these pseudogenes in these species. For all species abbreviations see legend for  Figure 6b. Results of a blind test of the performance of the method on the corresponding ThsA sequence are also shown ( Figure  6a, line "Ta_ThsA"). In this test most strand and helix elements (all "core" helices) described in the crystal structure were correctly predicted by the method, increasing our confidence in the reliability of other predictions. As expected, extensive conservation of predicted secondary-structure elements were also obtained from the alignment of human CCT sequences ( Figure  6a, line "CCT") with only few discrepancies involving mostly short beta strands (4, 5, 18, and 21) and one short helix (P) exposed at the external surface of the archaeal thermosome complex. Secondary-structure predictions for mammal CCT8L and for vertebrate MKKS, BBS10 or BBS12 sequences were also largely consistent with the secondary-structure description of thermosome proteins. In the equatorial domain, CCT8L and BBS structure predictions corresponded to the mostly alphahelical composition of this region. Variations were more obvious in BBS12 and involved mostly terminal elements of helices (most notably helices P and Q) and exposed beta-strands (strands [19][20][21]. In the intermediate domain the core helical-bundle elements (helices F, G, and K) as well as the extensive beta-sheet composition of this region were predicted in all BBS and CCT8L proteins. Exceptions were, in all sequences, the two short strands 5 and 6, which are part of an external elongated loop in the thermosome structure, and, in BBS12, the N-terminal part of helix K, which in the thermosome protrudes towards the central cavity covering the ATP hydrolysis site (Figure 6b). The apical domain is formed in the thermosome by a 4-strand anti-parallel beta-sheet (strands 9, 10, 15, and 16) with strand 10 extending into a second parallel beta-sheet (strands 10, 12, 13, and 14). The two sheets are flanked by a helix (J) and are surmounted by a structure composed of two contacting helices (H and I) and an extended loop including strand 11. All helices and most strands of the apical domain were recognized in BBS sequences. Most obvious differences were observed in BBS12 proteins, where the long apical helix H was predicted to be shortened, and in CCT8L, where helix I and strand 11 were not predicted.

Differentiation of monomer-monomer interaction regions in BBS and CCT8L proteins
To investigate the potential of CCT8L and BBS proteins to establish intra-ring and inter-ring monomer-monomer contacts, we investigated the relative conservation of predicted contact positions in CCT, BBS and CCT8L sequences. We identified potential contact positions in these families based on homology to the positions involved in inter-monomer contacts in the crystal structure of the T. acidophilum thermosome complex (PDB code 1a6d). After identifying all contact positions in CCT monomers, we distinguished among them those that conserved similar amino acid types across the nine monomers. We counted how many amino acid types observed in all or in conserved contact positions of CCT monomers were also observed in the T. acidophilum Thsa sequence, in human CCT8Ls or in human BBS sequences (Table 5). A complete list of all and conserved positions considered and of the residue types observed in these positions in all sequences can be found in additional file 21: Table S6. Thsa and CCT subunits conserve 89% similarity in monomer-monomer contact positions, which is substantially higher than the average similarity (62%-66%) of all homologous positions between the two families. The higher similarity of monomer-monomer contact regions is consistent with functional conservation between the two families of these positions. In contrast, the high rate of differentiation in comparison to global average differentiation shown in putative monomer-monomer contact positions in BBS or CCT8L sequences (Table 5), suggests a loss of capability to associate into a typical CCT-like oligomeric complex. This result is consistent with the presence in BBS proteins of inserted elements ( Figure 6) that would interfere with formation of the complex [22,23].

Conservation of ATP-binding and hydrolysis residues in BBS and CCT8L proteins
We compared conservation in CCT, BBS and CCT8L sequences of the ATP-binding and ATP-hydrolysis motifs typical of chaperonins of Group II (Figure 7).  Although there is considerable variation among BBS and CCT8L sequences at some of the ATP-binding positions, we observed complete conservation of the crucial ATP-binding dipeptide Gly-Pro, suggesting that these otherwise divergent proteins conserve ATP-binding ability. In the ATP-hydrolysis sites, substantial loss of conservation has been reported in MKKS [27] and in BBS12 [23]. In the CCT8L, MKKS and BBS10 families, unusual substitutions are observed in phosphate-binding positions and within the catalytic triad, where only Asp is conserved in MKKS. The effect that these mutations may have on the hydrolytic activity in these protein families is unclear. The high level of differentiation of this region in BBS12 (where the ATP-hydrolysis motif is not recognizable) strongly suggests that BBS12 has lost hydrolytic activity.

Conservation of substrate-binding positions
Three positions crucial in determining substrate-specificity of CCT monomers have been identified in the distal region of helix I in the apical domain [37]. We analyzed conservation at these positions across vertebrate species in all Group II chaperonin families and in the Fab1_TCP domain across vertebrate orthologs of the PIKFYVE protein kinase (Table 6). These positions are strikingly conserved within each CCT monomer type (with the exception of CCT6B) across species and are characteristically different between monomer types. They are mostly conserved also in the Fab1_TCP domain across vertebrate sequences. In contrast, in BBS and, particularly, in CCT8L sequences, the homologous positions are significantly more differentiated.

Discussion
We identified the full complement of chaperonin hsp60 genes and pseudogenes encoded in the human genome and, for comparison, in the genomes of the model organisms mouse and rat. We delimited the set of hsp60 genes encoded in the human genome to: a) nine canonical cct genes (CCT1 to CCT8 including CCT6A and CCT6B) involved in formation of the CCT complex; b) the cpn60 gene (HSPD1) of mitochondrial origin; c) the three highly diverged hsp60-like BBS genes MKKS, BBS10 and BBS12; and d) a newly characterized class of genes, CCT8L, represented in human by CCT8L1 and CCT8L2. We also identified a plethora of pseudogene sequences, many of which had not been previously  reported. The comparative analyses of these families of functional genes and of their pseudogenes revealed their evolutionary history and relationships. In contrast to the uncertainty of the duplication pattern of canonical CCT subunits (our results and [38,39]) the origin of Hsp60-like BBS and CCT8L proteins was unambiguously identified by phylogenetic tree reconstructions. Our analyses indicated that hsp60-like BBS genes originated monophyletically from a gene duplication event in the CCT8 gene lineage. In addition, we determined that the CCT8L family also originated in the CCT8 lineage, from a more recent retrotransposition event. The presence of this gene family in placental mammals, marsupials and monotremes but not in reptiles/birds or other vertebrate species, indicates that this family originated at the onset of mammal evolution, before divergence of Theria and Prototheria. Presence of two highly similar CCT8L genes (CCT8L1 and CCT8L2) in the genomes of human and chimp and of a single copy in other mammal genomes, including rhesus monkey, suggests that the duplication of this gene occurred in the ape lineage (Hominoidea) after its divergence from the old-world monkeys (Cercopithecidae). Multiple evidence gathered in this work indicates that CCT8L sequences (and at least one of the two paralogs in Hominoidea) encode for functional genes: (i) reduced rates of non-synonymous mutation were estimated along their lineages, as expected for functionally-constrained protein-coding genes; (ii) pseudogenes as ancient or more recent than the CCT8L genes were heavily degenerated and no pseudogenes pre-dating mammal evolution could be identified. In contrast, although CCT8L sequences originated early in mammal evolution, they did not show signs of degeneration (with the exception of the chimp CCT8L1 ortholog); (iii) multiple EST and microarray data have been collected for CCT8L2, mostly from testis, and one EST for CCT8L1 has been reported from placental tissue (as per the Uni-Gene EST and GEO expression data, November 23, 2009). These features taken together are strong evidence that at least CCT8L2 in Hominoidea and the lone CCT8L gene in other mammal lineages encode for functional proteins. The sparse expression of CCT8L1 in human and the presence of one in-frame stop codon and one frame-shift in its orthologous sequence from chimp raise doubts about the functionality of this sequence.
Numerous sequences associated with cct or cpn60 genes found in the human, mouse or rat genomes were classified as pseudogenes based on the presence of internal stop codons, frame-shifts and non-significant difference in synonymous and non-synonymous mutation rates. Among them, the sequences HSPD1-5P and HSPD1-6P appear to be expressed based on EST analysis (see additional file 14: Table S5) and may represent instances of expressed pseudogenes [40]. A general explosion of pseudogene generation in the human and murid lineages after they separated from the carnivore lineage has been reported [41]. Our analysis of chaperonin pseudogenes is consistent with this observation, although their relatively high rate of degeneration suggests that pseudogenes generated before the origin of mammals may have degraded beyond recognition. The intense duplication of chaperonin sequences witnessed by the many pseudogenes identified in the human and murid genomes, very likely provided opportunities for multiple paralogy, resulting in the proliferation of chaperonin classes in the vertebrate and mammal lineages.
Although the Hsp60-like BBS and CCT8L protein families have considerably differentiated from the canonical CCT subunits and within themselves, our analyses indicated that they still conserve the overall threedomain structure typical of CCT proteins. Structure and sequence variations predicted for their apical domains may reflect distinctive substrate specificities. In particular, lack of conservation at positions crucial in providing substrate-specificity to CCT monomers [37] suggests that BBS and CCT8L proteins may interact with their substrate(s) in different regions as compared with the canonical CCT subunits. Sequence differentiation patterns and acquisition of inserted elements in correspondence to potential monomer-monomer contact regions suggested that BBS and CCT8L proteins do not assemble in a CCT-like complex. This prediction is supported by experimental evidence showing that MKKS localizes as a free monomer at the pericentriolar material of centrosomes [27]. In this respect, it is also interesting to observe that among BBS and CCT8L sequences the ATP-hydrolysis motif "Gly-Asp-Gly-Thr", remarkably conserved among canonical chaperonins [42], has differentiated in MKKS and in BBS12 [23,27]. This condition may indicate that these families have lost the hydrolytic activity necessary for the functionality of the chaperonin complex [43][44][45][46][47][48][49][50][51][52]. It has been shown for the archaeal thermosome complex that mutation of the ATP-hydrolysis-motif Asp residue prevents hydrolysis and productive protein folding [49] and that some CCT subunits, among which CCT8, dissociate in vitro from the complex in conditions that prevent hydrolysis of ATP [53].
Functionalities independent from formation of the complex have also been reported for canonical CCT subunits. TCP1 monomers not in complex confer enhanced salt tolerance in plants [54]. Individual CCT subunits have been reported to associate in vitro with cytoskeleton structures, selectively binding to microtubule filaments [55] or to actin polymerizing filaments [56]. The localization of Hsp60-like BBS proteins at the cilium basal body and at the centrosome [26][27][28] suggests that they may also interact and associate with, for example, cytoskeleton structures in promoting the correct development of cilia [28,57]. The multiple structural and experimental evidence that BBS and CCT8L proteins do not form a canonical CCT-like complex provides strong indication that eukaryotic Group II chaperonin-protein functionalities extend beyond those of the typical oligomeric complex.

Conclusions
Chaperonin proteins are key players in ensuring and preserving cell and organism functionality under normal and stressful conditions and their biological and medical importance is undeniable. The recent discovery of hsp60 genes directly implicated in specific pathological conditions, the chaperonopathies, extends our understanding of the roles of chaperonin proteins in cellular processes and enhances awareness of their importance in pathology [18][19][20]. Here, we have provided a comprehensive, unifying framework encompassing all members of the extended hsp60 family of genes and pseudogenes. This unifying framework contributes to our understanding of the evolutionary history of the extended hsp60 family and widens our perspectives on the multiple roles that chaperonin proteins have acquired in vertebrates. Our findings highlight how differentiation of the chaperonin protein family in mammals has been facilitated by intense processes of gene duplication. The roles, mechanisms of action, and involvement in pathogenesis of individual chaperonin molecules beyond those typical of their canonical oligomeric complexes constitute aspects of chaperonin physiology particularly promising for future experimental testing.

Identification of chaperonin genes in eukaryotic genomes
Searches of genes for Hsp60-like proteins were exhaustively performed using TBLASTN [58] at Ensembl [34] and BLAT [59] at UCSC [60] on the genome sequences of human (NCBI Assembly 36, Genebuild Ensembl Dec 2006), mouse (NCBI Assembly m37, Genebuild Ensembl Apr 2007) and rat (Assembly RGSC 3.4, Genebuild Ensembl Feb 2006). We used the nine canonical human CCT proteins and the Cpn60 protein (mitochondrial Hsp60) as queries. We recursively queried the genomes with the sequences recovered from previous searches until no other Hsp60 sequences were detected. We used both search engines also to recover the full list of annotated hsp60-like genes in several other mammal genomes and in chicken. Sequences from frog (Xenopus sp.) were retrieved from the NCBI nr (non-redundant) database using PSI-BLAST [61] with Cpn60 and the individual CCT subunits as queries. To recover complete hsp60 gene and pseudogene sequences, after the TBLASTN searches the genomic sequences from approximately 2,000 nt upstream to 2,000 nt downstream of the hit-regions were excised and the hsp60 sequences were extracted using the homology-based gene prediction method implemented in FGENESH+ [62] at the Softberry web site [63]. For pseudogenes, when FGENESH+ failed to recognize the complete sequence due to in-frame stop codons or frame shifts in the sequence, the coding region was manually reconstructed, aligning the three-frame-translations of the genomic sequence to the query sequence with the multiple protein alignment program ITERALIGN [64]. The Pseudogene.org [33,65] database and Ensembl [34], Entrez [30] and HUGO [66] annotations were consulted for the presence of annotated human pseudogenes, as recorded in our tables of results.

Multiple sequence alignment and secondary structure prediction
Multiple sequence alignments were obtained using MUSCLE [67], which in previous analyses [68,69] performed well when aligning divergent sequences. Alignments were manually adjusted as needed. Predictions of secondary structure for each protein family were performed from their multiple alignment using the Jnet algorithm as implemented in the JPRED-3 secondary structure prediction server [70,71].

Evolutionary tree reconstructions
To infer phylogenetic relationships, evolutionary trees were obtained using the maximum-likelihood (ML) treebuilding procedure implemented in PHYML [72] using the default JTT substitution model and 100 bootstrap resampling replicates (each ML tree reconstruction being quite time consuming). Selected trees were compared with those obtained with the Bayesian approach implemented in MrBayes 3.1 [73] using the WAG substitution model and 10,000 iterations for the MCMC process. Conditional probabilities were estimated sampling the MCMC process every 10 iterations after 2,500 burn-in iterations (sample size 750).

Estimates of evolutionary divergence of sequence families
We obtained rates of divergence among families of sequences using a newly developed estimator, called "Bindex". The B-index is an unbiased estimator of the average divergence of a family of sequences from its last common ancestor (root) that takes into consideration the correlations among sequences determined by their phylogenetic tree. Briefly, given a rooted tree, a terminal branch of length d i of the original tree is considered a "cluster" of size w i = 1 and length d = d i . Each forkstructure comprising two terminal branches (clusters) of lengths d 1 and d 2 and sizes w 1 and w 2 bifurcating from a stem-branch of length d s is considered in turn. The average length d of each fork-structure is computed as d = (d 1 + d 2 )/2 + d s and the average size w of the structure is defined as w = [2(d 1 + d 2 )/2 + 1d s ]/[(d 1 + d 2 )/2 + d s ] = (d 1 + d 2 + d s )/d. Each fork-structure is progressively replaced by a corresponding cluster of length d and size w. The procedure is repeated merging bifurcating clusters of lengths d 1 and d 2 and sizes w 1 and w 2 connected to a stem-branch of length d s into a larger cluster of average length d = (w 1 d 1 + w 2 d 2 )/(w 1 + w 2 ) + d s and average size w = (d 1 w 1 + d 2 w 2 + d s )/d, until the tree is reduced to two clusters connected to the root (d s = 0). The global average differentiation D ("B-index") and size W can finally be computed as D = (w 1 d 1 + w 2 d 2 )/(w 1 + w 2 ) and W = w 1 + w 2 . It can be shown that DW = L is the length of the tree (sum of all branch lengths). If two sequence families A and B are sampled from the same set of species and W A = W B , then D B /D A = L B /L A and the relative rate of differentiation of the two families of sequences can be estimated by the ratio of their tree lengths. The B-index has several advantages compared to the most commonly used average pair-wise sequence-similarity measure: (i) it takes into account the correlation among sequences imposed by the topology of the evolutionary tree; (ii) in contrast to average pairwise similarity, its expectations are invariant over the number and phylogenetic relations of sequences sampled from a cluster with the same common ancestor and evolutionary model; and (iii) with the B-index, the average differentiation rate of a protein family relative to a reference family sharing the same evolutionary relations (e.g., sampled from the same set of species) is simply estimated by the ratio of the lengths of the evolutionary trees of the two families.
Estimates of ratios of non-synonymous vs. synonymous mutation rate (Ka/Ks)