Volume 7 Supplement 1
FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function
© Krishnamurthy et al; licensee BioMed Central Ltd. 2007
Published: 8 February 2007
Function prediction by transfer of annotation from the top database hit in a homology search has been shown to be prone to systematic error. Phylogenomic analysis reduces these errors by inferring protein function within the evolutionary context of the entire family. However, accuracy of function prediction for multi-domain proteins depends on all members having the same overall domain structure. By contrast, most common homolog detection methods are optimized for retrieving local homologs, and do not address this requirement.
We present FlowerPower, a novel clustering algorithm designed for the identification of global homologs as a precursor to structural phylogenomic analysis. Similar to methods such as PSI-BLAST, FlowerPower employs an iterative approach to clustering sequences. However, rather than using a single HMM or profile to expand the cluster, FlowerPower identifies subfamilies using the SCI-PHY algorithm and then selects and aligns new homologs using subfamily hidden Markov models. FlowerPower is shown to outperform BLAST, PSI-BLAST and the UCSC SAM Target-2K methods at discrimination between proteins in the same domain architecture class and those having different overall domain structures.
Structural phylogenomic analysis enables biologists to avoid the systematic errors associated with annotation transfer; clustering sequences based on sharing the same domain architecture is a critical first step in this process. FlowerPower is shown to consistently identify homologous sequences having the same domain architecture as the query.
FlowerPower is available as a webserver at http://phylogenomics.berkeley.edu/flowerpower/.
Biological processes such as speciation, gene duplication, and domain shuffling produce families of related genes whose gene products can have vastly different molecular functions. Inference of protein function in these cases has been shown to be prone to systematic error [1–3]. Phylogenomic analysis – inferring the function of a protein in the larger context of a protein family based on evolutionary relationships – addresses these errors and improves the accuracy of functional classification [2, 4]. In a phylogenomic approach, a phylogenetic tree is constructed from a multiple alignment of evolutionarily related sequences. The tree topology is analyzed to discriminate orthologs from paralogs, and is overlaid with existing experimental data for the members of the family. Functional inference can then be performed in an evolutionary context.
Protein domains are independently folding structural units that often confer specific functions. Roughly 65% of eukaryotic proteins and 40% of prokaryotic proteins are composed of multiple domains [5, 6]. Domain fusion and fission events produce "families" of proteins that may share only a single domain in common, and some domains are "promiscuous," in that they are present in many different domain architectures. In automated functional inference approaches, these "local" (partial) homologs are often retrieved using database search, and may, in fact, be the top hits; they may also be placed as siblings in a phylogenetic tree with proteins having entirely different domain structures. Since the function of a multi-domain protein is a composite of all its constituent domains, annotation transfer based on local homology – even in a phylogenomic context – can be misleading. This issue has received less attention than differentiation of orthologs and paralogs, but is the source of a significant number of annotation errors.
The most commonly used methods for clustering homologous proteins are BLAST [7] and PSI-BLAST [8]. There are three primary problems with the use of these and similar tools in the context of phylogenomic inference of molecular function. First, these methods are optimized for homolog detection based on local similarity; clusters are not screened to remove proteins with different domain structure. Second, overly permissive parameterization of these tools – particularly iterative methods such as PSI-BLAST – can result in the inclusion of non-homologs. Third, repeated iterations of the homolog identification process can result in profile drift, so that the seed sequence may not be included in the final cluster, or the profile may have drifted to include non-homologs in the set.
Structural phylogenomics combines evolutionary and structural analysis to elucidate changes in molecular function and structure in protein superfamilies. This approach has several applications, one of which is predicting the molecular function of unknown proteins in an evolutionary context. Phylogenomic inference has been shown to reduce the systematic errors associated with function prediction by homology; integration of structural information (or prediction) improves the accuracy of this approach. Our recommended protocol for protein function prediction integrates structural considerations in the first step of a phylogenomic pipeline, i.e. gathering homologs that share the same domain architecture. For this task we present FlowerPower, a method that discriminates between local and global homologs with much higher precision than BLAST, PSI-BLAST and the UCSC SAM Target-2K (T2K) hidden Markov model (HMM) method [9]. We also present examples of sequence annotation errors detected through the use of structural phylogenomics, which could have been avoided at the outset by adopting this approach.
Results and discussion
We compared FlowerPower, BLAST, PSI-BLAST and T2K on the task of discriminating between proteins sharing the same domain architecture (global homologs) and those having local similarity but different overall domain structures (local homologs). BLAST and PSI-BLAST are the most commonly used methods for clustering homologous sequences. The T2K method is less well known, but has been shown to outperform all other methods at remote homolog detection [10]. For these experiments, we selected nine sequences whose domain structures could be confidently predicted by PFAM [11]. Each method was allowed to select sequences from the SwissPFAM database [12]. Method parameters were varied to assess the impact on sensitivity (recall: fraction of global homologs selected) and precision (selectivity: fraction of selected sequences that were global homologs).
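The two measures defined above can be illustrated with a minimal Python sketch. The function name and the toy identifiers are hypothetical, and the sketch assumes that any hits whose status could not be determined have already been removed from both sets; it is not the evaluation code used for the experiments.

```python
def precision_recall(selected, global_homologs):
    """Return (precision, recall) for one homolog-detection run.

    selected        -- identifiers retrieved by the method
    global_homologs -- identifiers of all true global homologs in the database
    """
    selected = set(selected)
    global_homologs = set(global_homologs)
    true_positives = len(selected & global_homologs)
    # Precision: fraction of selected sequences that are global homologs.
    precision = true_positives / len(selected) if selected else 0.0
    # Recall (sensitivity): fraction of global homologs that were selected.
    recall = true_positives / len(global_homologs) if global_homologs else 0.0
    return precision, recall


# Toy example with made-up identifiers: 3 of 4 selected hits are global
# homologs, and 3 of the 6 global homologs were found.
toy_precision, toy_recall = precision_recall(
    ["a", "b", "c", "x"], ["a", "b", "c", "d", "e", "f"]
)
```

A method that selects very few sequences can score high precision with low recall, which is why both quantities are reported together.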
The structural and functional variability in protein families complicates every aspect of an automated structural phylogenomic pipeline. Restricting a set of sequences to global homologs can be particularly challenging in these circumstances, as local similarity can result in sequences with different domain architectures being included in a dataset. These partial homologs can introduce errors in function prediction based on homology, as illustrated below.
Examples of database annotation errors
We include here two examples of misannotated sequences which we discovered using structural phylogenomic inference of protein families in constructing our PhyloFacts phylogenomic resource [13]. The first sequence appears to have been annotated entirely by homology with a protein with strictly partial (local) similarity. The second, intriguingly, has been investigated experimentally, but neither the presumed species of origin (human) nor the assigned domain structure agrees with that suggested by structural phylogenomic inference. While many annotation errors can also be detected through the use of domain structure analysis (e.g., through the use of PFAM or similar domain prediction webservers), the use of FlowerPower to cluster sequences sharing the same domain structure enables us to identify potentially erroneous annotations as anomalous in the context of the family as a whole.
Rice protein XP_478746
FlowerPower, on the other hand, had much higher precision in the 43 sequences it selected from the NR database. PFAM analysis of FlowerPower sequences, requiring global matches to PFAM domains (PFAM-ls), finds only TIR domains in each of the sequences selected by FlowerPower. Allowing partial matches (PFAM-fs) detects fragmentary matches to other domains in two sequences: one (GenBank accession AAL07540) contains a fragmentary match to an NB-ARC domain, and the second (GenBank accession BAD94633) has a short (43 aa) partial match to a motile sperm domain. SMART [14] detects only TIR domains in the FlowerPower sequences.
Details of these analyses are available in Supplementary Materials. A structural phylogenomic analysis of this protein, including the FlowerPower cluster, is available at [15].
Putative human neutral sphingomyelinase (AAF19052)
FlowerPower is designed for the first step in structural phylogenomic inference of protein function: selecting a dataset upon which functional inference will be based. For phylogenomic inference to be accurate, all sequences in the set must share the same domain architecture. FlowerPower has been shown to outperform BLAST, PSI-BLAST and the HMM-based SAM-T2K method at discriminating between proteins sharing the same domain structure and those having only local similarity. The precision of FlowerPower is much higher than the other methods tested, with an average false positive error rate under 3%, though PSI-BLAST and SAM-T2K have better sensitivity.
We have presented two examples of sequences with errors in their domain structure annotations producing errors in function prediction, which would have been prevented had a structural phylogenomic inference protocol been adopted. In these two cases, errors could also have been prevented through domain-structure analyses of each individual sequence using resources such as PFAM. FlowerPower provides an independent means of both preventing such errors and post-hoc identification of existing errors, through anomaly detection. Clustering sequences using a method such as FlowerPower enables us to assume all (or at least, most) of the sequences in the set have the same domain structure. If such a set contains sequences labelled differently, as in the examples presented here, oddball annotations will stand out as anomalous, signalling a potential error. Phylogenetic tree construction of these global homology clusters for phylogenomic inference of protein function enhances the specificity of functional annotation possible.
The FlowerPower method depends on two core methods to detect and align sequences: SCI-PHY subfamily identification and subfamily HMM construction. Subfamilies identified by SCI-PHY correspond closely to conserved clades found by phylogenetic analysis and to functional subtypes found by experts (submitted). Subfamily HMMs based on SCI-PHY subfamilies model the subtypes within a diverse protein family, accommodating lineage-specific structural and functional changes. Relative to the use of a single HMM for the family as a whole, subfamily HMMs have improved sensitivity at the same false positive rate: they identify dramatically more true positives under high significance cutoffs and provide greater separation between true and false positives. In addition, novel sequences can be classified to existing subfamily HMMs with very high accuracy [18].
A web server for FlowerPower is available at [19]. The default parameters are designed to retrieve sequences sharing the same domain structure; users can override the parameters to retrieve sequences sharing local or global-local similarity. Results include the native FlowerPower subfamily-HMM based alignment and a realignment of the sequences using MUSCLE. Users also have the option of generating functional subfamilies using SCI-PHY and constructing subfamily HMMs. Alignments and the SCI-PHY tree can be viewed online or downloaded.
FlowerPower's high precision appears to be quite robust to different parameter settings, but there is clearly room for improvement in FlowerPower sensitivity. We plan to test the effect of other parameterizations of FlowerPower, including different ways to select the initial set for multiple sequence alignment, different SAM parameter settings for aligning sequences to subfamily HMMs, and score and alignment statistic requirements for inclusion of new sequences. Future experiments will also be performed on an expanded benchmark dataset, to ensure that method parameterization generalizes well to different protein architectures.
Profile drift is another area where we expect FlowerPower to provide superior performance. The use of subfamily hidden Markov models in retrieving new sequences in each iteration ensures that the seed and its homologs are included in the next iteration. Since subfamily HMMs improve the separation between true homologs and non-homologs, FlowerPower should prevent the intrusion of false positives in the final cluster. We plan to test FlowerPower's performance at these tasks.
The FlowerPower algorithm
FlowerPower is an iterative clustering algorithm akin to PSI-BLAST that retrieves and aligns sequences using profile methods. However, instead of using a single profile for homolog detection, FlowerPower constructs and uses subfamily HMMs to detect and align sequences in the next iteration; this is designed to improve alignment accuracy and prevent profile drift. Automated alignment analysis and quality control at every step ensure that homologs selected meet user-specified criteria. When parameterized for functional inference, quality control measures of alignment overlap and sequence identity result in clusters composed of sequences with the same domain architecture.
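The overall control flow described above can be sketched in Python as follows. This is only an outline under stated assumptions: the subfamily identification and subfamily-HMM scoring steps are passed in as caller-supplied functions (`find_subfamilies` and `passes_qc` are hypothetical names), and stopping when an iteration adds no new sequences, or after `max_iterations`, is an assumed termination rule rather than one taken from the text.

```python
def iterative_cluster(seed, candidates, find_subfamilies, passes_qc,
                      max_iterations=10):
    """Grow a cluster from a seed by repeated subfamily-based retrieval.

    find_subfamilies(cluster) -> list of subfamilies (each a list of members)
    passes_qc(subfamily, candidate) -> True if the candidate is accepted by
        that subfamily's model and meets the quality-control criteria
    """
    cluster = [seed]
    remaining = [c for c in candidates if c != seed]
    for _ in range(max_iterations):
        # Re-derive subfamilies from the current cluster at every iteration.
        subfamilies = find_subfamilies(cluster)
        # A candidate is accepted if any subfamily model accepts it.
        new_members = [c for c in remaining
                       if any(passes_qc(sf, c) for sf in subfamilies)]
        if not new_members:
            break  # assumed stopping rule: nothing new was retrieved
        cluster.extend(new_members)
        remaining = [c for c in remaining if c not in new_members]
    return cluster


# Toy stand-ins: "sequences" are integers, every member is its own subfamily,
# and a candidate is accepted if it lies within 1 of a subfamily member.
def singleton_subfamilies(cluster):
    return [[member] for member in cluster]


def near(subfamily, candidate):
    return any(abs(candidate - member) <= 1 for member in subfamily)


toy_cluster = iterative_cluster(5, [4, 6, 7, 20], singleton_subfamilies, near)
```

In the toy run, 7 is reached only in the second iteration (through 6), while 20 is never accepted. Because every accepted sequence must match a model built from sequences already in the cluster, the seed stays in the cluster throughout.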
Selection of sequences for FlowerPower search
The input to FlowerPower is a seed sequence, a specified database from which putative homologs will be selected, and user-selected alignment quality control criteria (coverage and percent identity cutoffs). Due to the computational complexity of HMM scoring, the first step in FlowerPower involves running PSI-BLAST to gather a set of potential homologs against which subfamily HMMs will search. When FlowerPower is used to select globally alignable matches, this set is filtered to remove sequences that are clearly too long or too short to share the same domain structure as the seed. The final set, S, is then used as the target database from which potential homologs are selected using HMM scoring.
Creation of the initial multiple sequence alignment
We select a set of sequences with high sequence similarity to the seed, as follows. We construct an HMM for the seed sequence using the SAM modelfromalign software. Sequences from S with a BLAST E-value ≤ 10^-10 to the seed are aligned to the HMM using the align2model software; using HMMs to align the sequences produces somewhat more global alignments than those produced by BLAST. Sequence alignments are then examined; sequences having ≥25% pairwise identity to the seed and passing a minimum fractional (length-dependent) bi-directional overlap between the database hit and the seed (termed the "coverage") are accepted. The "coverage" fraction varies between 0.60 for sequences of <100 aa and 0.85 for sequences of >500 aa. We then use MUSCLE [20] to realign the sequences. The MUSCLE alignment is submitted to SCI-PHY [21] for subfamily identification and subfamily HMM construction [18].
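The acceptance criteria for the initial alignment can be sketched in Python as below. Only the E-value cutoff, the 25% identity cutoff, and the two endpoint coverage values (0.60 and 0.85) come from the text. The linear interpolation between 100 and 500 residues, the use of each sequence's own length to look up its cutoff, and all function and parameter names are assumptions made for illustration.

```python
def coverage_cutoff(length):
    """Length-dependent minimum coverage.

    Endpoints follow the text: 0.60 below 100 aa, 0.85 above 500 aa.
    Linear interpolation in between is an assumption of this sketch.
    """
    if length < 100:
        return 0.60
    if length > 500:
        return 0.85
    return 0.60 + (length - 100) * (0.85 - 0.60) / (500 - 100)


def accept_for_initial_alignment(evalue, pct_identity, aligned_len,
                                 seed_len, hit_len):
    """Apply the E-value, identity and bi-directional coverage filters."""
    if evalue > 1e-10:
        return False
    if pct_identity < 25.0:
        return False
    # Bi-directional overlap: the aligned region must cover enough of the
    # seed AND enough of the database hit.
    seed_coverage = aligned_len / seed_len
    hit_coverage = aligned_len / hit_len
    return (seed_coverage >= coverage_cutoff(seed_len)
            and hit_coverage >= coverage_cutoff(hit_len))


# A near-full-length match is kept; a strong but local match to a much
# longer protein is rejected on coverage.
kept = accept_for_initial_alignment(1e-20, 40.0, 280, 300, 310)
rejected = accept_for_initial_alignment(1e-30, 60.0, 120, 300, 900)
```

Requiring coverage in both directions is what separates this filter from a purely local criterion: a hit that aligns well to one domain of a long multi-domain protein fails the check on the hit side even when its E-value is excellent.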
Iterated sequence retrieval and alignment
Subfamily identification using SCI-PHY
FlowerPower uses the SCI-PHY (Subfamily Classification In Phylogenomics) method to predict functional subfamilies in each iteration, based on the multiple sequence alignment of sequences retrieved at that stage. SCI-PHY constructs a hierarchical tree using agglomerative clustering, and cuts the tree into subtrees using a combination of information-theoretic methods and Dirichlet mixture densities [22]. SCI-PHY uses an encoding cost measurement under a Dirichlet mixture density to determine an optimal cut of the tree into subtrees. SCI-PHY subfamily classification has been shown to correspond closely to phylogenetic clades and expert-identified subtypes (submitted). A detailed description of the algorithm is published in [21].
Subfamily HMM construction
Subfamilies identified by SCI-PHY are used to construct subfamily hidden Markov models, which are used to score and align sequences for the next iteration of FlowerPower. Subfamily HMM parameters are estimated using an information-sharing protocol enabling statistics to be shared across subfamilies in a position- and subfamily-specific manner. Amino acid distributions at positions conserved across the family are fixed for each subfamily; this ensures that even very small subfamilies include information about positions defining the family as a whole. At other positions, subfamilies share statistics with subfamilies aligning similar residues, while keeping their statistics separate from subfamilies aligning very dissimilar residues. This protocol retains specificity at subfamily-defining regions or motifs, while generalizing well to more distant homologs. This provides high specificity of classification while simultaneously improving the sensitivity of the subfamily HMM to detect new members [18]. Recent experiments on a large representative dataset of 515 SCOP folds show that subfamily HMMs dramatically increase the separation between true homologs and non-homologous proteins with different folds (submitted).
Comparison of FlowerPower with BLAST, PSI-BLAST and T2K
BLAST was tested using three different E-value cut-offs: 10^-20, 10^-10 and 10^-5. For PSI-BLAST, we varied the number of iterations (three and five) and E-value cut-offs (10^-10, 10^-5 and 10^-3). The results from five iterations were almost indistinguishable from those obtained using three iterations and are not shown. SAM-T2K was run using default parameters.
Details of FlowerPower validation dataset.
AA_kinase (26–269); Acetyltransf_1 (338–414)
SH3_1 (54–109); SH2 (117–198); Pkinase (234–486)
SH2 (14–88); SH3_1 (126–181); SH3_2 (239–294)
CARD (2–91); Peptidase_C14 (163–401)
Death (31–109); TIR (163–292)
Response_reg (7–128); GerE (153–210)
RNase_PH (12–144); RNase_PH_C (147–211); PNPase (242–320); RNase_PH (323–456); RNase_PH_C (459–529); KH_1 (555–612); S1 (618–690)
MATH (38–163); BTB (190–297)
BLAST, PSI-BLAST, T2K and FlowerPower were then used to retrieve proteins from SwissPFAM version 15. Retrieved sequences were labelled as homologous, non-homologous, or indeterminate. To be called homologous, a database hit had to be clearly in the same domain architecture class as the seed sequence (i.e., the same or structurally equivalent PFAM domains (based on SCOP analysis) in the same order), with any unlabelled region restricted to less than 80 amino acids. PFAM domains are considered equivalent if they match at the level of SCOP superfamily. For instance, the PFAM CARD, DEATH and DED domains are all members of the SCOP DEATH domain superfamily, and would be considered structurally equivalent in our analysis. The set of proteins in SwissPFAM matching these criteria form the full set of global homologs; methods identifying all the global homologs would therefore have perfect sensitivity. Proteins were defined as non-homologs if they were much longer than the seed (i.e., hit length > seed length + 500), contained a PFAM domain not homologous to any domain in the seed (based on disagreement at the level of SCOP fold), contained an unequal number of homologous PFAM domains, or had a different ordering of PFAM domains than those in the seed. All other proteins were called indeterminate, as their global structural homology or lack thereof could not be rigorously determined on the basis of these analyses. For a given homolog-detection method, we then define True Positive hits (TP, global homologs correctly selected by that method), True Negatives (TN, non-global-homologs that are correctly rejected), False Positives (FP, non-global-homologs that are incorrectly accepted), and False Negatives (FN, true global homologs that are incorrectly rejected). Results of these experiments are shown in Figure 1.
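The labelling rules above can be sketched as a small Python function. This is a simplified illustration, not the procedure used to build the benchmark: each PFAM domain is assumed to arrive already mapped to a SCOP superfamily and fold (the `Domain` and `Protein` classes and the fold labels used in the example are hypothetical), and "an unequal number of homologous PFAM domains" is approximated by comparing total domain counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Domain:
    superfamily: str  # SCOP superfamily assigned to the PFAM domain
    fold: str         # SCOP fold assigned to the PFAM domain
    start: int        # first residue (1-based, inclusive)
    end: int          # last residue (inclusive)


@dataclass
class Protein:
    length: int
    domains: List[Domain]  # assumed sorted by start position


def longest_unlabelled(protein):
    """Length of the longest stretch not covered by any domain."""
    longest, position = 0, 1
    for domain in protein.domains:
        longest = max(longest, domain.start - position)
        position = domain.end + 1
    return max(longest, protein.length - position + 1)


def label_hit(seed, hit):
    seed_sf = [d.superfamily for d in seed.domains]
    hit_sf = [d.superfamily for d in hit.domains]
    # Homologous: equivalent domains in the same order, and every
    # unlabelled region shorter than 80 residues.
    if hit_sf == seed_sf and longest_unlabelled(hit) < 80:
        return "homologous"
    seed_folds = {d.fold for d in seed.domains}
    if hit.length > seed.length + 500:
        return "non-homologous"  # much longer than the seed
    if any(d.fold not in seed_folds for d in hit.domains):
        return "non-homologous"  # domain with a fold absent from the seed
    if len(hit_sf) != len(seed_sf):
        return "non-homologous"  # unequal number of domains (simplified)
    if sorted(hit_sf) == sorted(seed_sf):
        if hit_sf != seed_sf:
            return "non-homologous"  # same domains, different order
    return "indeterminate"


# Example using the Death/TIR architecture from the validation table;
# "fold_A" and "fold_B" are placeholder fold labels.
seed = Protein(296, [Domain("DEATH", "fold_A", 31, 109),
                     Domain("TIR", "fold_B", 163, 292)])
same_arch = Protein(300, [Domain("DEATH", "fold_A", 20, 100),
                          Domain("TIR", "fold_B", 150, 290)])
swapped = Protein(300, [Domain("TIR", "fold_B", 20, 150),
                        Domain("DEATH", "fold_A", 200, 290)])
long_gap = Protein(450, [Domain("DEATH", "fold_A", 31, 109),
                         Domain("TIR", "fold_B", 300, 430)])
labels = [label_hit(seed, h) for h in (same_arch, swapped, long_gap)]
```

The third example has the seed's architecture but a 190-residue unlabelled region between the two domains, so it meets neither the homologous nor the non-homologous criteria and is labelled indeterminate.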
Phylogenetic tree construction and display
[GenBank:XP_478746, GenBank:CAC82811, GenBank:ABB82024, GenBank:AAN63807, GenBank:AAM28910, GenBank:AAM28917, GenBank:AAM28914, GenBank:NP_974060, GenBank:AAB71484, GenBank:NP_175698, GenBank:AAM28917, GenBank:AAL07540, GenBank:BAD94633, GenBank:AAF19052, SwissProt:ARGA_ECOLI, SwissProt:BIR5_HUMAN, SwissProt:BLK_MOUSE, SwissProt:CRKL_MOUSE, SwissProt:I1BC_HUMAN, SwissProt:MY88_MOUSE, SwissProt:NARL_ECOLI, SwissProt:PNP_ECOLI, SwissProt:SPOP_HUMAN]
HMM: Hidden Markov model
SCOP: Structural Classification of Proteins
This work was supported in part by a Presidential Early Career Award for Scientists and Engineers (PECASE) Grant #0238311 from the National Science Foundation, and by Grant #R01 HG002769-01 from the National Institutes of Health. We thank Jason Chan and Wayne Christopher for earlier development work on the software tools, and Dan Kirshner for maintaining the FlowerPower web server.
This article has been published as part of BMC Evolutionary Biology Volume 7, Supplement 1, 2007: First International Conference on Phylogenomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcevolbiol/7?issue=S1.
1. Bork P, Koonin EV: Predicting functions from protein sequences – where are the bottlenecks?. Nat Genet. 1998, 18 (4): 313-318. 10.1038/ng0498-313.
2. Eisen JA: Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998, 8 (3): 163-167.
3. Galperin MY, Koonin EV: Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1998, 1 (1): 55-67.
4. Sjölander K: Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics. 2004, 20 (2): 170-179. 10.1093/bioinformatics/bth021.
5. Ekman D, Bjorklund AK, Frey-Skott J, Elofsson A: Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol. 2005, 348 (1): 231-243. 10.1016/j.jmb.2005.02.007.
6. Apic G, Gough J, Teichmann SA: Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol. 2001, 310 (2): 311-325. 10.1006/jmbi.2001.4776.
7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1006/jmbi.1990.9999.
8. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
9. Karplus K, Karchin R, Barrett C, Tu S, Cline M, Diekhans M, Grate L, Casper J, Hughey R: What is the value added by human intervention in protein structure prediction?. Proteins. 2001, 45 (Suppl 5): 86-91. 10.1002/prot.10021.
10. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol. 1998, 284 (4): 1201-1210. 10.1006/jmbi.1998.2221.
11. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res. 2004, 32 (Database issue): D138-141. 10.1093/nar/gkh121.
12. SWISSPFAM database. [http://pfam.janelia.org/swisspfam.shtml]
13. PhyloFacts Universal Proteome Explorer. [http://phylogenomics.berkeley.edu/phylofacts/]
14. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 2006, 34 (Database issue): D257-260. 10.1093/nar/gkj079.
15. Structural phylogenomic analysis of rice protein XP_478746. [http://phylogenomics.berkeley.edu/book/book_info.php?book=bpg004945]
16. Chatterjee S, Han H, Rollins S, Cleveland T: Molecular cloning, characterization, and expression of a novel human neutral sphingomyelinase. J Biol Chem. 1999, 274 (52): 37407-37412. 10.1074/jbc.274.52.37407.
17. Structural phylogenomic analysis of putative human neutral sphingomyelinase AAF19052. [http://phylogenomics.berkeley.edu/book/book_info.php?book=bpg004927]
18. Brown D, Krishnamurthy N, Dale JM, Christopher W, Sjölander K: Subfamily HMMs in functional genomics. Pac Symp Biocomput. 2005, 322-333.
19. FlowerPower Online Server. [http://phylogenomics.berkeley.edu/flowerpower]
20. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.
21. Sjölander K: Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Proc Int Conf Intell Syst Mol Biol. 1998, 6: 165-174.
22. Sjölander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci. 1996, 12 (4): 327-345.
23. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 2004, 32 (Database issue): D226-229. 10.1093/nar/gkh039.
24. Swofford D: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. 2002, Sunderland, Massachusetts: Sinauer Associates.
25. Zmasek CM, Eddy SR: ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics. 2001, 17 (4): 383-384. 10.1093/bioinformatics/17.4.383.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.