Long-term trends in evolution of indels in protein sequences
© Wolf et al; licensee BioMed Central Ltd. 2007
Received: 2 January 2007
Accepted: 13 February 2007
Published: 13 February 2007
In this paper we describe an analysis of the size evolution of both protein domains and their indels, as inferred from the changing sizes of whole domains or of individual unaligned regions ("spacers"). We studied relatively early evolutionary events and focused on protein domains that are conserved among various taxonomic groups.
We found that more than one third of all domains have a statistically significant tendency to increase/decrease in size in evolution as judged from the overall domain size distribution as well as from the size distribution of individual spacers. Moreover, the fraction of domains and individual spacers increasing in size is almost twofold larger than the fraction decreasing in size.
We showed that the tolerance to insertion and deletion events depends on the domain's taxonomy span. Eukaryotic domains are depleted in insertions compared to the overall test set, namely, the number of spacers increasing in size is about the same as the number of spacers decreasing in size. On the other hand, ancient domain families show some bias towards insertions or spacers which grow in size in evolution. Domains from several Gene Ontology categories also demonstrate certain tendencies for insertion or deletion events as inferred from the analysis of spacer sizes.
Proteins evolve through gene duplication, diversification and domain shuffling, which allow novel proteins to emerge via different domain combinations. Many evolutionary mechanisms shaping protein sequence and structure can be probed by studying the length distributions of proteins and protein domains. Fusion of single domain proteins and domain accretion, for example, play an important role in the evolution of eukaryotic proteins [1–3]; as a result, eukaryotic proteins are on average longer than bacterial and archaeal proteins [4–6]. Moreover, it was shown previously that sequence length correlates with protein conservation [7, 8] and with protein expression.
Diverse multi-domain proteins may consist of homologous domains, and knowledge of protein domain evolution may considerably help in reconstructing the evolutionary history of entire proteins. Changes in protein domains result mostly from point mutations, insertions and deletions. Although amino acid insertion and deletion (indel) events in proteins are less frequent than amino acid substitutions [10–12], they can have a major effect in metazoan protein evolution, and indel bias can influence the overall genome size [13–15]. It has been observed that indels most often occur in non-conserved protein loop regions. While proteins seem to be rather tolerant to indels in loops compared to core structure elements, protein loops cannot be viewed as "random coils" and indels are under constant evolutionary pressure. At the same time, indels can relax structural tension arising from amino acid substitutions in some proteins and can lead to significant structural changes.
The mechanisms of indel events are not very well understood and there are only a few statistical models describing these events in evolution [18–20]. Traditionally, indels in sequence alignments are scored using affine gap penalties despite the fact that this model does not adequately describe the evolution of insertions and deletions. In particular, the empirical distribution of indel lengths was analyzed for alignments of closely related proteins and it was shown that it can be well approximated by a Zipfian distribution [10, 21, 22]. It has also been found that the probability of a gap in the alignment of two homologous sequences depends on the evolutionary distance and that there exists a strong relationship between the evolutionary distance and the indel lengths [10, 11].
In this paper we analyze the size evolution of whole protein domains and indels in protein domains, as judged by changing sizes of whole domains or individual unaligned regions. We study intra-domain events which are not affected by domain shuffling and domain accretion in multidomain proteins. To examine these events on a wide scale of evolutionary distances we use the Conserved Domain Database (CDD), which provides accurate domain alignments of diverse sequences. We are interested in whether the insertion and deletion events in protein domains as a whole or in individual fragments are balanced and whether there exist trends toward increasing or decreasing indel or domain lengths. To answer these questions, we perform an extensive analysis of protein domain families spanning a wide range of different taxonomic and functional categories. The answers which we provide in this paper give the means to model indel events in the evolution of different domain families and to understand the nature of protein domain size diversity.
Domain size evolution
The first example represents the Ribonuclease A family (cd00163, Figure 1), which shows a significant trend to increase domain sizes (inferred from the total sizes of all spacers in the domain alignment) (ρ = 0.57). As can be seen from this Figure, the frog ribonucleases (red) have the smallest domains with about 9–16 residues in spacer regions, angiogenins (blue, 22–23 residues) and the mammalian RNase 1 (green, 28–30 residues) have intermediate size domains and finally the eosinophil ribonucleases (brown) have the largest domains with about 33–36 residues in spacer regions. The second example of the Lysozyme/Lactalbumin domain family (cd00119, Figure 2) shows a negative correlation with respect to domain sizes. In this family there are three main groups, the mammalian lysozymes (green), the insect lysozymes (red), and the mammalian alpha-lactalbumins (blue). The mammalian lysozymes are nearest to the root of the tree and have the largest domains. The insect lysozymes and alpha-lactalbumins are at about the same distance from the root and they have similar domain sizes, smaller than the mammalian lysozymes.
Assuming an equal distribution of families between the increasing and decreasing classes, we expect that half of the families (51 out of 102 for model #2) will exhibit an increasing pattern and the other half will decrease in size. However, we observed 71 increasing cases instead of 51. The probability of observing such a bias under the above assumption can be estimated from the binomial distribution and is very small (P < 0.00005). We also estimated the average rate of domain size change, which was found to be 7.2 and 5.7 residues per domain per unit of evolutionary distance for domains of increasing and decreasing size, respectively. Unlike for the number of domains, there is no statistically significant difference between the two average rates.
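The binomial estimate described above can be sketched in a few lines of Python. This is an illustration rather than the original calculation: it assumes a one-sided upper tail and a success probability of 0.5, and the function name binomial_upper_tail is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 71 of the 102 families with a significant trend (model #2) increase in size;
# under the null hypothesis each family increases or decreases with equal odds.
p_value = binomial_upper_tail(71, 102)
print(f"P(X >= 71 | n = 102, p = 0.5) = {p_value:.2e}")
```

Under these assumptions the tail probability should fall below the 0.00005 bound quoted in the text.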
Analyzing domain families with large spacers more than 50 residues long (so-called "inserted domains", 32 domains altogether), we investigated the functional and taxonomic assignments for these domains and found that 27 of them represent enzyme domains and 31 of them belong to the "Root" taxonomic category (against a background of 117 enzyme domain families and 183 ancient domain families in the overall test set). The probability of observing such a bias, given the assumption that families with "inserted domains" are distributed equally among different functional/taxonomic classes, can be estimated from the binomial distribution, which yields a p-value < 10⁻⁶ for enzymatic domains and a p-value < 3 × 10⁻⁵ for ancient domains. Thus, ubiquitous domains with enzymatic activity have a tendency to accommodate very long indels; a similar observation for enzymes has been reported earlier. In our dataset we found that long insertions predominate over long deletions: the number of enzymatic domains with spacers longer than 50 residues inserted (12 domains) is three times larger than the number of domains with long spacers deleted in evolution (4 domains). In general, the evolutionary mechanism of inserting a whole domain into another protein domain might be different from the mechanisms of short indel evolution, but excluding those domain families with spacers longer than 50 residues does not change the overall conclusions reported in this paper (data not shown).
Spacer size evolution
GO categories listed for domain families with statistically significant bias (p-value < 0.01) with respect to increasing, decreasing and stable individual spacers.
transferase activity, transferring glycosyl groups
replicative cell aging
endoplasmic reticulum membrane
positive regulation of cell proliferation
ubiquitin-dependent protein catabolism
G-protein coupled receptor protein signaling pathway
The collection of accurate, curated multiple sequence alignments from CDD gives us an opportunity to study the evolution of domain and spacer sizes on a wide scale of evolutionary distances. Thus far, indel events have been studied on sets of relatively closely related species such as human, mouse and rat, where indels can be defined explicitly. It has been found that there exists a 2–3 fold excess of deletions over insertions in non-coding regions (and pseudogenes in particular) from human and murids [15, 30–32] and a much higher deletion bias in Drosophila melanogaster pseudogenes. Moreover, deletions are approximately three times more common than insertions in loci causing Mendelian diseases. Protein coding regions, however, are generally under higher selective pressure than pseudogenes and non-coding regions and, as was shown in recent studies, the ratio of deletions to insertions in protein coding regions is much closer to unity than in non-coding regions. For example, the deletion-to-insertion ratio of microindels (up to 10–15 bp) in non-coding regions of mouse is 2.5 : 1, and this ratio is reduced to 1.1(1.05) : 1 in protein coding regions [34, 35]. Along these lines, we showed that for eukaryotic domains there exists a bias towards deletions (spacers decreasing in size) compared to the overall test set, which is significantly enriched with insertions (spacers increasing in size). As a result, for eukaryotes from our test set the total number of spacers which grow in size in evolution is approximately equal to the number of spacers which decrease in size.
For the entire test set we observed a tendency for domains and spacers to increase in size on average, with a two-fold difference between the number of domains/spacers growing in size and the number diminishing in size. The overall evolutionary scenario which we can portray based on our study is the following. It has been argued that it is unlikely that early proteins represented long peptide chains. On the contrary, various data suggest that the first protein domains emerged through the recombination of short peptides or a limited vocabulary of structural units [17, 36, 37]. Apparently, the spacers between the domain core structural elements were minimal in size, just enough to span the spatial gaps and connect the structure. Then, in the course of evolution, the majority of ancient domains acquired additional residues through a subsequent series of insertion events, although in a fraction of domains the equilibrium of indels tended towards deletions. A plausible explanation for the prevalence of increasing spacers and domains is selection for the acquisition of novel functions and the fine-tuning of existing ones. Surprisingly, as we show, the rates of net insertion and deletion size change were not significantly different from each other, with an average rate of 1.8–1.9 residues per spacer per unit of evolutionary distance. Similar observations have been made about the similarity of size distributions of insertions and deletions in three mammalian genomes.
For relatively "modern" proteins, however, the trend of domain size evolution was rather different. Eukaryotic proteins started losing residues in spacers, indicating that in eukaryotic evolution deletions played as important a role as insertions. Novel eukaryotic proteins seem to evolve mainly through acquiring new domains and through domain shuffling, which could result in longer proteins with slightly shorter individual domains. At the same time, in bacteria the deletion trend was even more pronounced (although supported by a rather limited amount of data from our test set) and, indeed, it was shown earlier that deletions in E. coli are 8 times more frequent than insertions. Such a bias towards deletions in bacteria can be explained by strong selection pressure on genome size, since bacterial genomes are primarily composed of protein coding regions [8, 39].
There are different factors which would favor shorter or longer proteins or spacers in evolution [40–43]. Efficiency of protein translation, transcription and the folding process would probably benefit from shorter proteins [41, 42]. On the other hand, certain insertions may also be advantageous and subject to positive selection. For example, lineage-specific insertions/deletions in the elastin gene have functional importance in each lineage, and housekeeping proteins from pathogenic organisms may contain insertions/deletions responsible for virulence properties. Our study showed that one third of all protein domains have a statistically significant linear correlation between the evolutionary distance and the domain/spacer sizes and, moreover, that there is a certain tendency for domain/spacer sizes to increase with evolutionary distance. We do not yet have an explanation for these observations; however, future in-depth studies may provide further insights into these phenomena.
The analysis was performed on a set of protein families with curated alignments from the NCBI Conserved Domain Database (CDD). CDD comprises diverse non-redundant sequences, and alignments are refined using three-dimensional structures and structure-structure alignments. CD alignments are block-wise multiple alignments where block regions are defined as those aligned among all family members. CD alignments are constructed to ensure enough sequence diversity and taxonomy span while avoiding bias towards highly represented sequences in the database, which is important for our analysis. The redundancy is removed by using single-linkage clustering to group the domain sequences with greater than 67% sequence identity and then choosing one representative from each preferred taxonomy node within each sequence cluster (the list of preferred taxonomy nodes can be downloaded from the CDTree). We start our analysis with a set of 362 manually curated parent node alignments from CDD version 2.00 [48, 49]. Parent alignments correspond to the top node alignments in the hierarchy of CD families. We excluded CD families consisting of short sequence repeats (e.g., SUSHI repeats) and those containing fewer than 10 sequences. The redundancy between protein domain families was checked using the procedure implemented in the CDART algorithm, and not more than one domain family from the same domain cluster was retained in the final test set, which yielded 278 domain families. A table is available listing the 278 test domains with taxonomy assignments and computed regression coefficients.
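The redundancy-removal step described above (single-linkage clustering at 67% identity, then one representative per taxonomy node within each cluster) can be sketched as follows. This is a minimal illustration, not the CDD pipeline: the pairwise-identity dictionary, the taxonomy labels and both function names are hypothetical, and keeping the first sequence seen for each taxonomy node is an assumption, since the text does not say how the representative is chosen.

```python
from collections import defaultdict


def single_linkage_clusters(n, identity, threshold=0.67):
    """Cluster sequences 0..n-1: any pair above the threshold is merged,
    so clusters grow through chains of similar pairs (single linkage)."""
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), ident in identity.items():
        if ident > threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


def pick_representatives(clusters, taxon_of):
    """Keep the first sequence seen for each taxonomy node in each cluster."""
    kept = []
    for cluster in clusters:
        seen = set()
        for seq in cluster:
            if taxon_of[seq] not in seen:
                seen.add(taxon_of[seq])
                kept.append(seq)
    return kept


# Toy data: fractional identities for some sequence pairs (made up).
identity = {(0, 1): 0.80, (1, 2): 0.70, (0, 2): 0.50, (2, 3): 0.30, (3, 4): 0.40}
taxon_of = ["mammals", "mammals", "insects", "fungi", "fungi"]

clusters = single_linkage_clusters(5, identity)
representatives = pick_representatives(clusters, taxon_of)
print(clusters, representatives)
```

In the toy data, sequences 0 and 2 are only 50% identical but still end up in the same cluster because both are linked to sequence 1, which is the defining property of single linkage.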
The domain families from the test set encompass a large spectrum of functional and taxonomic groups. Protein function was categorized by Gene Ontology (GO) terms. GO annotations were obtained from GenBank for individual family members and pooled for the whole family. The taxonomic information for each CD family was assigned according to the range of organisms in which the family members were represented. We used a simplified classification of the families into the following three categories: "R" ("Root", family members are present in at least two kingdoms among eukaryotes, bacteria and archaea and are thus thought to be of ancient origin, dating back at least to the Last Universal Common Ancestor; 182 families); "E" (eukaryote-specific protein families; 85 families) and "B" (bacteria-specific protein families; 11 families). There were no archaea-specific families in our dataset.
Phylogenetic trees were constructed from the aligned block regions (in the case of sequence repeats only one instance was kept) by the neighbor-joining method with the PHYLIP package. Blocks represent regions where all CDD sequences are aligned, so the resulting trees do not depend in any way on differences between spacer lengths. The neighbor-joining trees were rooted manually using the taxonomy of the represented organisms. If multiple subfamilies within a protein domain family were present, the root was placed on the deepest inter-subfamily branch so as to balance the average length between the root and every external node of each subtree. For about 30% of the trees an alternative root placement was checked, and it was observed that the overall results do not change if alternatively rooted trees are used. The phylogenetic trees are available at the ftp site.
Spacer length calculation
Taking advantage of CDD block structure where multiple alignments are anchored at certain conserved positions, we define a spacer as a non-aligned segment between two consecutively aligned block elements. To analyze how spacer lengths change in the course of evolution, the spacer lengths for all CDD sequences between two consecutive blocks were individually mapped to the external nodes of the phylogenetic tree for the corresponding domain family. To study the evolution of domain sizes, the sum of all spacer lengths in a CDD alignment (not counting N-terminal and C-terminal spacers) was mapped to the external nodes of the phylogenetic tree.
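The spacer definition above translates directly into arithmetic on block coordinates. The sketch below assumes a hypothetical input format, a list of (start, end) residue positions (inclusive) for the aligned blocks of one sequence, ordered from N- to C-terminus; the function names are illustrative only.

```python
def spacer_lengths(blocks):
    """Lengths of the unaligned segments between consecutive aligned blocks.

    `blocks` holds inclusive (start, end) residue positions for one sequence.
    Regions before the first block and after the last block (the N- and
    C-terminal spacers) are not counted.
    """
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:])
    ]


def total_spacer_length(blocks):
    """Sum of internal spacer lengths, used here as the domain size measure."""
    return sum(spacer_lengths(blocks))


# Toy sequence with three aligned blocks.
blocks = [(5, 20), (25, 40), (52, 70)]
print(spacer_lengths(blocks))       # residues 21-24 and 41-51
print(total_spacer_length(blocks))
```

Each per-spacer value, or the per-sequence total, is what gets attached to the corresponding external node of the family tree.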
The values of spacer lengths were inferred for the internal nodes of the phylogenetic trees using the following models. According to model #1, the spacer length for an internal node was inferred to be the same as that of the phylogenetically closest external node. Model #2 (analogous to the squared-change parsimony, Ch. 23) inferred the spacer lengths for internal nodes as a weighted average of the spacer lengths of external nodes using a recursive procedure. For a strictly binary tree we can define:
where s_i is the spacer length at internal node i; s_1 and s_2 are the spacer lengths assigned to the direct descendants of node i; and h_1 and h_2 are the heights of the descendant subtrees. The height of a subtree is, in turn, recursively defined as the branch length plus the average height of its two descendant subtrees (the latter being zero for terminal nodes).
Using these models we calculated the Pearson correlation coefficients between the distance from the root to a given internal/external node and the value of the spacer length at that node, for all nodes in a tree. It should be noted that using just internal nodes in the correlation analysis does not change the results significantly. For the domain length analysis, the correlation was calculated between the evolutionary distance from the root and the sum of spacer lengths. The p-values for the correlation coefficients were estimated under the null hypothesis that the correlation is zero; correlations for those families with p-values less than 0.01 were considered significant. The regression coefficients were calculated by linear regression analysis.
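The defining equation for model #2 is not reproduced in this text. One form consistent with the definitions above, assuming that each descendant is weighted by the inverse of its subtree height, is s_i = (s_1/h_1 + s_2/h_2) / (1/h_1 + 1/h_2). The Python sketch below implements the recursion under that assumption; the Node class and the function name are hypothetical, and the weighting should be checked against the original publication before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    branch_length: float = 0.0          # length of the branch above this node
    spacer: Optional[float] = None      # observed (leaf) or inferred (internal)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def infer_spacers(node: Node) -> float:
    """Fill in `spacer` for internal nodes and return the subtree height.

    Height = branch length + average height of the two descendant subtrees
    (zero for terminal nodes). ASSUMPTION: descendants are weighted by the
    inverse of their heights, written here as (s1*h2 + s2*h1) / (h1 + h2)
    to avoid dividing by a zero height.
    """
    if node.left is None and node.right is None:
        return node.branch_length
    h1 = infer_spacers(node.left)
    h2 = infer_spacers(node.right)
    s1, s2 = node.left.spacer, node.right.spacer
    if h1 + h2 == 0:
        node.spacer = (s1 + s2) / 2     # both subtrees have zero height
    else:
        node.spacer = (s1 * h2 + s2 * h1) / (h1 + h2)
    return node.branch_length + (h1 + h2) / 2


# Toy tree: ((a:1, b:3):1, d:3) with leaf spacer lengths 10, 20 and 0.
inner = Node(1.0, left=Node(1.0, spacer=10.0), right=Node(3.0, spacer=20.0))
root = Node(0.0, left=inner, right=Node(3.0, spacer=0.0))
root_height = infer_spacers(root)
print(inner.spacer, root.spacer, root_height)
```

With this weighting a descendant on a short branch pulls the ancestral value toward its own spacer length more strongly than a distant one, which is the behavior expected of a parsimony-like reconstruction.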
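The correlation and regression step can be illustrated with a short standard-library sketch. The inputs are hypothetical lists of root-to-node distances and spacer lengths, the function names are illustrative, and the t statistic with n - 2 degrees of freedom is a common way to test a Pearson coefficient against zero; the text does not state which test was actually used.

```python
from math import sqrt


def pearson_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y on x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy), sxy / sxx


def t_statistic(r, n):
    """t value for testing r against zero (n - 2 degrees of freedom).

    ASSUMPTION: a p-value would be read from Student's t distribution,
    for example with a statistics package; that step is omitted here.
    """
    return r * sqrt((n - 2) / (1 - r * r))


# Toy data: distance from the root and spacer length at five tree nodes.
distances = [0.0, 1.0, 2.0, 3.0, 4.0]
spacers = [10.0, 12.0, 11.0, 15.0, 16.0]

r, slope = pearson_and_slope(distances, spacers)
t = t_statistic(r, len(distances))
print(f"r = {r:.3f}, slope = {slope:.2f} residues per unit distance, t = {t:.2f}")
```

The slope plays the role of the regression coefficient reported for each family, that is, the net change in residues per unit of evolutionary distance.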
The authors thank Alex Kondrashov for helpful discussions. This work was supported by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS.
- Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, Botstein D: Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science. 1998, 282 (5396): 2022-2028. 10.1126/science.282.5396.2022.
- Aravind L, Dixit VM, Koonin EV: Apoptotic molecular machinery: vastly increased complexity in vertebrates revealed by genome comparisons. Science. 2001, 291 (5507): 1279-1284. 10.1126/science.291.5507.1279.
- Kummerfeld SK, Teichmann SA: Relative rates of gene fusion and fission in multi-domain proteins. Trends Genet. 2005, 21 (1): 25-30. 10.1016/j.tig.2004.11.007.
- Zhang J: Protein-length distributions for the three domains of life. Trends Genet. 2000, 16 (3): 107-109. 10.1016/S0168-9525(99)01922-8.
- Liang P, Riley M: A comparative genomics approach for studying ancestral proteins and evolution. Adv Appl Microbiol. 2001, 50: 39-72.
- Brocchieri L, Karlin S: Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 2005, 33 (10): 3390-3400. 10.1093/nar/gki615.
- Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA: The relationship of protein conservation and sequence length. BMC Evol Biol. 2002, 2: 20. 10.1186/1471-2148-2-20.
- Mira A, Klasson L, Andersson SG: Microbial genome evolution: sources of variability. Curr Opin Microbiol. 2002, 5 (5): 506-512. 10.1016/S1369-5274(02)00358-2.
- Warringer J, Blomberg A: Evolutionary constraints on yeast protein size. BMC Evol Biol. 2006, 6 (1): 61. 10.1186/1471-2148-6-61.
- Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol. 1993, 229 (4): 1065-1082. 10.1006/jmbi.1993.1105.
- Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol. 1992, 224 (2): 461-471. 10.1016/0022-2836(92)91008-D.
- Kondrashov AS: Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum Mutat. 2003, 21 (1): 12-27. 10.1002/humu.10147.
- Castillo-Davis CI, Kondrashov FA, Hartl DL, Kulathinal RJ: The functional genomic distribution of protein divergence in two animal phyla: coevolution, genomic conflict, and constraint. Genome Res. 2004, 14 (5): 802-811. 10.1101/gr.2195604.
- Petrov DA: Mutational equilibrium model of genome size evolution. Theor Popul Biol. 2002, 61 (4): 531-544. 10.1006/tpbi.2002.1605.
- Cooper GM, Brudno M, Stone EA, Dubchak I, Batzoglou S, Sidow A: Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res. 2004, 14 (4): 539-548. 10.1101/gr.2034704.
- Panchenko AR, Wolf YI, Panchenko LA, Madej T: Evolutionary plasticity of protein families: coupling between sequence and structure variation. Proteins. 2005, 61 (3): 535-544. 10.1002/prot.20644.
- Grishin NV: Fold change in evolution of protein structures. J Struct Biol. 2001, 134 (2-3): 167-185. 10.1006/jsbi.2001.4335.
- Thorne JL, Kishino H, Felsenstein J: An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991, 33 (2): 114-124. 10.1007/BF02193625.
- Mitchison GJ: A probabilistic treatment of phylogeny and sequence alignment. J Mol Evol. 1999, 49 (1): 11-22. 10.1007/PL00006524.
- McGuire G, Denham MC, Balding DJ: Models of sequence evolution for DNA sequences containing gaps. Mol Biol Evol. 2001, 18 (4): 481-490.
- Chang MS, Benner SA: Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J Mol Biol. 2004, 341 (2): 617-631. 10.1016/j.jmb.2004.05.045.
- Qian B, Goldstein RA: Distribution of Indel lengths. Proteins. 2001, 45 (1): 102-104. 10.1002/prot.1129.
- Marchler-Bauer A, Anderson JB, Derbyshire MK, Deweese-Scott C, Gonzales NR, Gwadz M, Hao L, He S, Hurwitz DI, Jackson JD, Ke Z, Krylov D, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Thanki N, Yamashita RA, Yin JJ, Zhang D, Bryant SH: CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. 2006.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247 (4): 536-540.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure. 1997, 5 (8): 1093-1108. 10.1016/S0969-2126(97)00260-8.
- Edwards D, Berry JJ: The efficiency of simulation-based multiple comparisons. Biometrics. 1987, 43 (4): 913-928. 10.2307/2531545.
- Aroul-Selvam R, Hubbard T, Sasidharan R: Domain insertions in protein structures. J Mol Biol. 2004, 338 (4): 633-641. 10.1016/j.jmb.2004.03.039.
- Consortium GO: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11 (8): 1425-1433. 10.1101/gr.180801.
- Shoemaker BA, Panchenko AR, Bryant SH: Finding biologically relevant protein domain interactions: conserved binding mode analysis. Protein Sci. 2006, 15 (2): 352-361. 10.1110/ps.051760806.
- Ophir R, Graur D: Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene. 1997, 205 (1-2): 191-202. 10.1016/S0378-1119(97)00398-3.
- Zhang Z, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res. 2003, 31 (18): 5338-5348. 10.1093/nar/gkg745.
- Saitou N, Ueda S: Evolutionary rates of insertion and deletion in noncoding nucleotide sequences of primates. Mol Biol Evol. 1994, 11 (3): 504-512.
- Petrov DA, Chao YC, Stephenson EC, Hartl DL: Pseudogene evolution in Drosophila suggests a high rate of DNA loss. Mol Biol Evol. 1998, 15 (11): 1562-1567.
- Taylor MS, Ponting CP, Copley RR: Occurrence and consequences of coding sequence insertions and deletions in Mammalian genomes. Genome Res. 2004, 14 (4): 555-566. 10.1101/gr.1977804.
- Snir S, Pachter L: Phylogenetic Profiling of insertions and deletions in vertebrate genomes. Research in Computational Molecular Biology. 2006, 265-280.
- Soding J, Lupas AN: More than the sum of their parts: on the evolution of proteins from peptides. Bioessays. 2003, 25 (9): 837-846. 10.1002/bies.10321.
- Panchenko AR, Luthey-Schulten Z, Cole R, Wolynes PG: The foldon universe: a survey of structural similarity and self-recognition of independently folding units. J Mol Biol. 1997, 272 (1): 95-105. 10.1006/jmbi.1997.1205.
- Halliday JA, Glickman BW: Mechanisms of spontaneous mutation in DNA repair-proficient Escherichia coli. Mutat Res. 1991, 250 (1-2): 55-71.
- Gil R, Sabater-Munoz B, Latorre A, Silva FJ, Moya A: Extreme genome reduction in Buchnera spp.: toward the minimal genome needed for symbiotic life. Proc Natl Acad Sci U S A. 2002, 99 (7): 4454-4458. 10.1073/pnas.062067299.
- Thompson MJ, Eisenberg D: Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J Mol Biol. 1999, 290 (2): 595-604. 10.1006/jmbi.1999.2889.
- Dufton MJ: Genetic code synonym quotas and amino acid complexity: cutting the cost of proteins?. J Theor Biol. 1997, 187 (2): 165-173. 10.1006/jtbi.1997.0443.
- Hartl FU, Hayer-Hartl M: Molecular chaperones in the cytosol: from nascent chain to folded protein. Science. 2002, 295 (5561): 1852-1858. 10.1126/science.1068408.
- Berezovsky IN, Shakhnovich EI: Physics and evolution of thermophilic adaptation. Proc Natl Acad Sci U S A. 2005, 102 (36): 12742-12747. 10.1073/pnas.0503890102.
- Piontkivska H, Zhang Y, Green ED, Elnitski L: Multi-species sequence comparison reveals dynamic evolution of the elastin gene that has involved purifying selection and lineage-specific insertions/deletions. BMC Genomics. 2004, 5 (1): 31. 10.1186/1471-2164-5-31.
- Cherkasov A, Lee SJ, Nandan D, Reiner NE: Large-scale survey for potentially targetable indels in bacterial and protozoan proteins. Proteins. 2006, 62 (2): 371-380. 10.1002/prot.20631.
- Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH: CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 2002, 30 (1): 281-283. 10.1093/nar/30.1.281.
- Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Yamashita RA, Yin JJ, Zhang D, Bryant SH: CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 2005, 33 (Database issue): D192-6. 10.1093/nar/gki069.
- Geer LY, Domrachev M, Lipman DJ, Bryant SH: CDART: protein homology by domain architecture. Genome Res. 2002, 12 (10): 1619-1623. 10.1101/gr.278202.
- Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2000, 28 (1): 10-14. 10.1093/nar/28.1.10.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
- Felsenstein J: PHYLIP - phylogeny inference package. Cladistics. 1989, 5: 164-166.
- Felsenstein J: Inferring phylogenies. 2004, Sinauer Associates, Inc.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.