Supplementary Results

Generation of core-orthologs: InParanoid-TC For this study we compute orthologs for each pair of primer-taxa with InParanoid. The pair-wise orthology predictions are subsequently extended to include all primer-taxa by using a criterion of transitive closure (InParanoid-TC). This approach has the advantage of generating ortholog clusters, with only one sequence per taxon. Such clusters can then be directly used for downstream standard approaches of phylogeny reconstruction. In brief, we order the n primer-taxa such that taxon i+1 is the closest relative to taxon i in the species tree, for i = 1, 2, ..., n-1, where ties are broken randomly. For each protein in taxon 1 we carry out the following loop: a) We identify the corresponding InParanoid-orthologs in taxon 2. If more than one co-ortholog exists, we choose the one with the highest InParanoid-score. b) With the protein from taxon 2 we identify the ortholog with the highest InParanoid-score in taxon 3. This procedure continues until the ortholog-pair for taxon n-1 and n has been determined. c) The circle is then closed, by identifying the ortholog to the protein from taxon n in taxon 1. d) If the two proteins in taxon 1 at the beginning and the end of the round trip are identical, we keep the set of orthologs and call it core-ortholog. Else, we discard the proteins. We end up with a collection of core-orthologs representing genes present in all primer-taxa. In each individual core-ortholog each taxon is represented exactly once.


Results:
We present a novel approach (HaMStR) to mine EST data for the presence of orthologs to a curated set of genes. HaMStR combines a profile Hidden Markov Model search and a subsequent BLAST search to extend existing ortholog cluster with sequences from further taxa. We show that the HaMStR results are consistent with those obtained with existing orthology prediction methods that require completely sequenced genomes. A case study on the phylogeny of 35 fungal taxa illustrates that HaMStR is well suited to compile informative data sets for phylogenomic studies from ESTs and protein sequence data.

Conclusion:
HaMStR extends in a standardized manner a pre-defined set of orthologs with ESTs from further taxa. In the same fashion HaMStR can be applied to protein sequence data, and thus provides a comprehensive approach to compile ortholog cluster from any protein coding data. The resulting orthology predictions serve as the data basis for a variety of evolutionary studies. Here, we have demonstrated the application of HaMStR in a molecular systematics study. However, we envision that studies tracing the evolutionary fate of individual genes or functional complexes of genes will greatly benefit from HaMStR orthology predictions as well.

Background
The amount of protein-coding DNA sequences in the public data bases is steadily increasing. This data is mainly generated by the sequencing and annotation of entire genomes and by numerous EST sequencing projects.
Approaches to resolve the evolutionary relationships of eukaryotes on a molecular basis -frequently referred to as molecular systematics-particularly benefit from this data. Recent studies on the evolution of metazoans and fungi present trees with 40 to 77 taxa, reconstructed from more than 140 genes [1][2][3][4][5][6]. Still, these studies consider only a small fraction of the data available. For example, as of May 2008 dbEST contains 714 eukaryotic taxa with more than 2.000 ESTs each, and 394 taxa have more than 10,000 ESTs http://www.ncbi.nlm.nih.gov/dbEST. Despite the potential value of ESTs especially for molecular systematics [7] this data has rarely been used in phylogenetic studies so far. This is due to the fact that ESTs are redundant, short and of low sequence quality. Plenty of solutions exist to improve data quality by cleaning and clustering of ESTs sequences prior to analysis [8][9][10]. However, annotating ESTs and more importantly inferring their relationships to known genes in other taxa is still problematic.
For most molecular systematics approaches only sequences are admissible for which the orthology [11] or co-orthology [12] to the other sequences in the dataset is known, or at least reasonable to assume [6] (but see [13]). A variety of tools exist to define groups of orthologous proteins. Among these, phylogeny-based programs, such as RIO [14], reveal the lowest fraction of false positives in a comparison of 10 programs for orthology/homology inference [15]. It is obvious that methods depending on a known species tree are not applicable if the aim is to infer the phylogenetic relationships of taxa. However, more recent approaches have found solutions to overcome this strict dependence on a known species tree, e.g., [16]. The reciprocal best BLAST hit criterion (RBH) between pairs of sequences from two taxa, is frequently used to infer orthology when the species tree is unknown. Orthology predictions from the RBH method have a very low false positive rate [15]. However, ortholog groups are ultimately confined to two sequences. Thus, co-orthologs resulting from gene duplications after the separation of the compared taxa are missed. Several variations exist to extend RBH ortholog groups to allow for the addition of co-orthologs [12] and/or to extend the prediction to more than two taxa [17][18][19].
Orthology prediction methods based on the RBH criterion have been designed for, and successfully applied to, inference of ortholog groups for taxa for which the entire genome has been sequenced. Their application to EST based data sets, however, is not straightforward. Incomplete representation of a species' gene set, especially in small EST projects, depreciates a reciprocal best BLAST hit as an indicator of orthology. Moreover, the coding sequence of a gene can be tagged by two or more nonoverlapping ESTs. Methods such as InParanoid [12] or orthoMCL [18] will in most cases add only one of the ESTs to the ortholog cluster. Both programs allow for more than one sequence per species in an ortholog cluster, only when the intra-specific similarity for the sequences in the cluster is higher than the sequence similarity between species. However, the intra-specific similarity between non-overlapping ESTs derived from the same gene is approximately that of two randomly chosen and unrelated sequences.
To use EST data in molecular systematics, sequences have been initially BLASTed against a protein sequence database, e.g., the non-redundant protein database at NCBI, and the highest scoring hits are taken to tentatively annotate the EST sequences [20]. Post-processing of the BLAST results to infer the orthology status of the query-hit pair ranged then from the application of simple e-value cutoffs, e.g., [21], to more complex procedures including Markov Clustering and visual inspection of individual sequence trees [2]. Problematic to these approaches is that they unduly substitute e-values of a BLAST result with evolutionary relationships, involve too many human interactions to be routinely applied, or suffer from the conceptual problem of Markov Clustering when applied to non-overlapping sequence fragments representing the same protein.
Here we present a novel approach, HaMStR, to extend predefined groups of orthologous genes derived from taxa, whose genomes have been fully sequenced with data from further taxa. We show that the joint application of a profile Hidden Markov Model (pHMM) based similarity search and a subsequent re-BLAST of the hit sequences against a reference proteome identifies candidate orthologs. HaMStR exhibits a very low false positive rate and good sensitivity. The resulting collection of orthologous sequences provides an excellent basis for phylogeny reconstruction.

Algorithm
Step 1 Defining a Gene Set for the Ortholog Search 1.1 Generation of core-orthologs As input we introduce the primer-taxa (set) where each taxon is completely sequenced and where the phylogeny of the primer-taxa is undisputed. Standard orthology prediction tools such as InParanoid [12] or orthoMCL [18] can be applied to identify genes with orthologs present in all primer taxa, the so called core-ortholog groups. For this study we compute orthologs for each pair of taxa with InParanoid [12]. The pair-wise orthology predictions are subsequently extended to include all primer taxa by using a criterion of transitive closure (InParanoid-TC). We end up with a collection of core-orthologs where in each individual core-ortholog every primer taxon is represented exactly once. Details about this procedure are provided online (see Additional file 1).

Generation of profile Hidden Markov Models
For each sequence cluster in the core-orthologs, the sequences are aligned using MAFFT [22] with the options --maxiterate 1000 and --localpair. The resulting multiple sequence alignments, comprising the n sequences from the primer-taxa are then converted into a profile Hidden Markov Model (pHMM) [23]. The programs hmmbuild and hmmcalibrate from the HMMER package http:// hmmer.janelia.org are used for building, training and calibrating the pHMMs. Each core-ortholog is now represented by a pHMM.
Step 2 Extension of core-orthologs We now extend the core-orthologs with data from additional taxa, the query-taxa. As data may serve translated ESTs, or protein sequences inferred from either complete or partially sequenced genomes.

pHMM search
We use a fast implementation of the hmmsearch algorithm http://www.clccell.com to search protein sequence data from the query-taxon for matches to the individual pHMMs. If ESTs are the data for the query taxa, the individual ESTs are translated in all six reading frames prior to the search.

Re-BLAST and orthology prediction
To determine the orthology status of the hmmsearch hits, we use a reciprocity criterion ( Figure 1). Each hit is compared by BLASTP [24] to the proteome of one of the primer-taxa, the so-called reference-taxon (Proteome F in Figure 1). Ideally, the reference-taxon should be the closest related primer-taxon to the query-taxon. If the protein of the reference taxon that contributed to the pHMM provides the best BLASTP hit, then the hmmsearch hit is added to the corresponding core-ortholog. Otherwise, it is discarded. Please note that the reciprocity criterion is also fulfilled when the reference protein is among the lower ranking BLASTP hits, but has the same score as the top listed hit in the BLASTP output.

Post-processing of ESTs
To account for possible frame shifts caused by sequencing errors in ESTs we use genewise [25] to generate a codonalignment for the EST and the protein sequence of the reference-taxon. This alignment determines the coding part together with the reading frame in the EST. Figure 1 Workflow of the HaMStR approach. Standard orthology prediction tools are used to identify orthologous groups, the so called core-orthologs, for a set of completely sequenced primer taxa (Proteome A -F). The sequences in a core-ortholog are aligned and converted into a profile HMM (pHMM). A compilation of protein sequences or translated ESTs from a taxon not included in the primer-taxa (Protein set G) is searched for hits with the pHMM. The resulting candidates display features that are characteristic for the protein modelled by the pHMM. To determine the orthology status of the candidates, we introduce a reciprocity criterion. Each candidate is compared by BLASTP with the proteome of one of the primer-taxa, the so-called reference-taxon (Proteome F). If the best BLASTP hit sequence from the reference taxon corresponds to the protein that contributed to the pHMM, the candidate is called candidate-ortholog, else it is discarded.

Testing
We next test the versatility of HaMStR. The data for this step are summarized in Table 1.
We analyze three scenarios. In all cases we wanted to detect orthologs in the human proteome, where the primer-taxa set consists of opossum (Monodelphis domestica), sea squirt (Ciona intestinalis), fruit fly (Drosophila melanogaster), worm (Caenorhabditis elegans), and a fungus (Saccharomyces cerevisiae). The corresponding coreortholog set (PoP) represents 994 genes.
In the first scenario, we carried out step 1 of HaMStR and extracted the opossum proteins from the 994 coreorthologs. Then we used InParanoid to infer the corresponding orthologs in the human proteome. InParanoid identified 979 human-opossum ortholog clusters. For 15 opossum genes InParanoid did not find a human ortholog.
In the second scenario, we applied step 2 of HaMStR with the opossum as reference-taxon instead of using InParanoid. If HaMStR predicts more than one ortholog, we use ClustalW [26] to align all candidates individually to the opossum protein in the corresponding core-ortholog. The human proteins are then ranked according to their alignment score. HaMStR predicted 976 human orthologs. For the remaining 18 core-orthologs no human counterpart was found.
For 972 core-orthologs the top ranking HaMStR ortholog is the same as those found with InParanoid (Table 2). Four human genes that have been found with InParanoid were not detected in step 2 of HaMStR. In all instances, InParanoid identified several proteins as co-orthologs to the human gene. However, only one of these opossum sequences was used to train the corresponding pHMMs. During the re-BLAST (c.f. 2.2), a different co-ortholog from the same InParanoid-ortholog cluster was obtained as the best BLASTP hit. Thus, the reciprocity criterion was not fulfilled and HaMStR did not make an orthology prediction.
One human gene was found with HaMStR but not with InParanoid.  2) between ENSP00000222402 and both opossum proteins. Again, orthoMCL groups these proteins into the same ortholog cluster.
In summary, the HaMStR strategy is suitable for identifying orthologs for a pre-defined gene set. The results are consistent with those obtained with one of the most efficient existing orthology prediction programs. For 168 core-orthologs, more than 1 human protein was assigned as a putative ortholog. A detailed description of these lower ranking hits is given in the online supplementary material (see Additional file 2).
The HaMStR approach is efficient in identifying orthologs when the entire sequence of the query protein is known.
In scenario 3, we assess the performance of HaMStR when using translated ESTs. In many molecular systematics studies, EST projects are typically small [21,27,7]. To assess the performance of HaMStR in such cases, we did not cluster the ESTs. Each single EST is now an input for HaMStR.
As a typical example we selected 32,647 ESTs derived from human chromosome 2 http://www.genome.ucsc.edu according to the following criteria: A Blat search [28] with each EST against the human genome sequence must obtain a best hit on chromosome 2 covering at least 90% of the EST. Moreover, the best Blat hit must be located between the annotated start, and the end of the transcript of an annotated gene (Ensembl database; http:// www.ensembl.org), and must overlap with a coding exon. The 32,647 ESTs (hchr2-ESTs) map to 1,106 genes. From the analyses in scenario 1 and 2 we know that only 81 of these are represented by core-orthologs in the PoP primertaxa set. This set of 81 human chromosome 2 genes is tagged by 6,288 ESTs (19%).
From the 32,647 hchr2-ESTs that served as HaMStR input, 29,293 were not assigned to any of the 994 core-orthologs ( Table 3). The remaining 3,354 ESTs are highly enriched (97%) with ESTs tagging the 81 genes that are represented in the PoP set. More precisely, 3,243 ESTs were assigned correctly to 72 core-orthologs. The remaining 3% were assigned to two core-orthologs for which the human gene is not located on chromosome 2.
In summary, only 2 out of 74 core-orthologs were wrongly extended with EST sequences. Also, on the EST level only 111 ESTs were assigned to the wrong ortholog cluster. Thus, the false positive rate was in both cases very low (~3%).
The sensitivity for detecting an ortholog was also good. From the 81 orthologs present on human chromosome 2 and represented by at least 1 hchr2-EST, HaMStR identified 72 (89%). On the level of individual EST sequences, however, the sensitivity was substantially reduced. The 72 genes for which HaMStR predicted orthologs correctly are represented by 5,899 hchr2-ESTs. HaMStR detected only 3,243 (55%) of these. The ESTs, that HaMStR failed to assign, covered on average less than 13% of the coding sequence of the corresponding gene. Thus, the encoded protein fragment is too short to result in a significant hit in the HaMStR search. Figure 2 displays the dependency of the search result on the fraction of the coding sequence (CDS) covered by an EST. The length distribution of the missed ESTs explains why 9 core-orthologs represented by 389 hchr2-ESTs have been missed entirely. For these ESTs the average coding sequence coverage was in the range of 13%.
Some ESTs, however, have not been assigned even though the encoded protein fragment should be of sufficient length (Figure 2). A visual inspection of a subset of these ESTs indicates that most of them comprise splice variants lacking one or several exons (data not shown).
In summary, the false positive rate of HaMStR is convincingly small and the sensitivity depends strongly on the fraction of the CDS covered by an EST.
To increase the sensitivity of HaMStR, we re-designed the pHMMs such that local alignments between sequence and pHMM are generated during the search (options to the hmmbuild command). Using HaMStR with the modified pHMMs, it assigned 5,796 ESTs correctly to all 81 coreorthologs. Compared to the search with the default pHMM structures, this indeed results in a substantially higher sensitivity both on the level of detected genes and ESTs (100% and 92%, respectively). However, 256 ESTs were incorrectly assigned to 11 core-orthologs for which the human ortholog is not located on human chromosome 2. Thus, the gain in sensitivity leads to a pronounced increase of false positive predictions (12%).

Proof of principle: An EST based phylogeny of Fungi
We have shown that HaMStR efficiently extends coreorthologs with protein or translated EST sequences from further taxa. Now we explore the applicability of the HaM-StR approach for large-scale phylogenomic analyses. To this end we analyzed the proteome from 16 completely sequenced fungi and EST data from further 18 fungi (c.f. For the subsequent maximum likelihood tree reconstruction we used 178 genes. Each gene represents at least 40% of the taxa and at the same time each taxon is represented by at least 60% of these genes. The resulting tree has excellent branch support ( Figure 3) and is by and large congruent to the phylogeny of these taxa, that was inferred from an analysis of their entire genomes [3]. Only two taxa are differently placed. Staganospora nodorum, a dothideomycete, is placed as a sister taxon to the Eurotiomycetes, whereas it is sister to the Sordariomycetes/Leothiomycetes clade in Fitzpatrick et al. (2006) [3]. Aspergillus nidulans is positioned basal to the other Aspergillus species instead of grouping together with A. fumigatus. We cannot rule out the possibility that the differences might be due to erroneous orthology predictions by HaMStR. However, the following explanations argue against a wrong prediction.
For S. nodorum, the entire annotated proteome was used. As exemplified in our validation of HaMStR (c.f. Table 2), errors in the orthology prediction are unlikely for entire proteomes. The discrepancy between both trees may be caused by the fact that S. nodorum is the only representative of the Dothideomycetes. Together with the short internal branches in this region of the trees an accurate placement of this single taxon is difficult [29]. Moreover, as of today, the Fungi section of the ITOL-project http:// tolweb.org/fungi considers the placement of Dothideomycetes as still unresolved.
A. nidulans, the second problematic taxon, is the only Aspergillus species in our data for which orthologs are Sensitivity of HaMStR as a function of CDS coverage Figure 2 Sensitivity of HaMStR as a function of CDS coverage. Fraction of the coding sequence (CDS) covered by the ESTs that have been correctly annotated and missed by HaMStR, respectively. derived from ESTs. No quality information was available for these sequences, indicating that sequencing errors could interfere with the correct placement of A. nidulans in our phylogenetic tree. However, our A. nidulans placement agrees with other results [30].
Our example clearly shows that HaMStR is well suited to compile phylogenomic datasets, including incomplete sequence data like ESTs or unfinished whole genome sequencing projects.

Implementation
We offer the HaMStR tool online and for download at http://www.deep-phylogeny.org/hamstr. Currently, the users can search EST data for orthologs to core-orthologs derived from six predefined sets of primer-taxa, covering different regions of the metazoan tree. The referencetaxon for the re-BLAST can be selected from the taxa that constitute the primer-taxa. To increase EST data quality, we provide options to clean and cluster EST data, taking into account user information about base qualities [31] of the uploaded sequences. The output gives information to which core-ortholog set a particular EST was assigned to.
Links to detailed information about the protein of the chosen reference taxon for the re-BLAST are also provided. A fasta-file containing the sequences of the proteins in the core-ortholog cluster together with a tentative translation of the annotated EST sequence can be downloaded. An example-output is provided on the HaMStR web pages.

Discussion
ESTs constitute a paradigm of a random sample of fragments from a species' gene set. We show that HaMStR is capable of automatically predicting orthologs for a predefined set of genes (core-orthologs) in such data. The results are consistent with orthology predictions obtained with reciprocal BLAST based orthology prediction tools, which however require completely sequenced and annotated genomes. HaMStR performs a targeted search to extend predefined ortholog clusters derived from the analysis of complete genome sequences. This approach has three major advantages.
First, we search for orthologs only to genes for which whole genome comparisons provide prior evidence that orthologs can be identified. For example, core-orthologs including sequences from fungi and animals represent evolutionary old genes that were already present in the last common ancestor of both groups. Thus, an ortholog search for these genes in incomplete and fragmentary data, such as ESTs, from fungi or animals has a good chance of success. In the other extreme, genes with predicted orthologs confined to e.g. vertebrates are presumably of evolutionary recent origin or have diverged beyond recognition in taxa outside the vertebrates. Thus, a search for orthologs in ESTs from non-vertebrates will probably fail.
Second, the pHMMs constructed from the core-orthologs allow for a more refined ortholog search compared to a conventional BLAST. In our case example with the human ESTs HaMStR has a very low fraction of false positives (3%, c.f. Table 3). The low fraction of false positive orthology predictions by HaMStR comes at the price of low sensitivity. This is due to the fact that we try to match the pHMM over its entire length to the individual query sequences. ESTs, however, are generally short and cover in most cases only a fraction of the coding sequence. Our search may, therefore, be too stringent. Clustering ESTs prior to the HaMStR search is an obvious strategy to address this problem. However, especially in small EST projects many genes will have only a small fraction of their CDS covered by ESTs, even after clustering. Performing local alignments between sequence and pHMM substantially increases the sensitivity both on the level of detected genes and ESTs (100% and 92%, respectively). Thus, the gain in sensitivity leads to a pronounced increase of false positive predictions (12%). Still, this false positive rate is substantially smaller compared to that of other orthology prediction programs based on a reciprocal best BLAST hit when used with ESTs (data not shown). However, this emphasizes that short local similarities between sequences make orthology inference hard. Most studies -particularly in molecular systematics-rely on an accurate orthology prediction and results can be severely compromised by false positives in the data set. It is a major advance of HaMStR that it automatically filters sequences for which the orthology prediction is unreliable and we suggest using the global pHMM search as default.
Third, HaMStR is computationally efficient. The use of core-orthologs limits the search to only those genes where we expect to find orthologs. Moreover, also targeted ortholog searches are possible.
The performance of HaMStR depends on the choice of the primer-taxa and on the method to determine the coreorthologs. Primer-taxa should cover the diversity of the taxa of interest and thus constitute a scaffold of the corresponding phylogenetic tree. To address the evolutionary relationships of 35 fungal taxa we have used H. sapiens, C. intestinalis, D. melanogaster, C. elegans, and S. cerevisiae as primer-taxa. In principle, a primer taxon set including several fungi is more appropriate. It would minimize the evolutionary distance to the taxon in which the HaMStR search will be performed. It would also allow searching for orthologs to genes that emerged on the fungal lineage or got lost in metazoa. However, we deliberately did not adapt the primer-taxa. By doing so, we addressed the question of how phylogenomic datasets compiled with HaM-StR perform, when no or only a single completely sequenced genome exists for a taxon group of interest. The reconstructed tree for the 35 fungi is entirely resolved, highly supported and compatible to other published phylogenies [3,30]. This indicates that HaMStR, together with this primer-taxon set serves in general as a good starting point for any phylogenomic study within fungi and animals.
For building and training of the pHMMs, a large number of diverse primer-taxa is favorable. However, this holds only if the annotation of genes in the individual genomes is comprehensive. Despite the small set of primer-taxa, each core-ortholog was extended by at least one fungal sequence. Thus, the considerably small set of sequences used to train the pHMMs appears not to impair the sensitivity of the ortholog search in this particular case.
We determined the core-orthologs with InParanoid-TC (see Additional file 1). However, HaMStR allows the use of core-orthologs determined with any orthology predic-tion method, e.g., orthoMCL [18]. By that HaMStR can be adapted to the particular evolutionary problem one is interested in. If the HaMStR output serves as a data basis for a standard phylogenomics study, many approaches require that each taxon should be represented only by one sequence per gene. In such cases core-orthologs determined with InParanoid-TC or the recently developed OMA-algorithm [32] are a good starting point for HaM-StR. On the other hand, if one is interested in the evolution of a gene family, it may be desirable that the ortholog cluster used by HaMStR also contains co-orthologs. In such cases ortholog groups from more inclusive orthology prediction programs can be used as core-orthologs.

Conclusion
HaMStR extends in a standardized manner a pre-defined set of orthologs with proteins or ESTs from further taxa. A profile Hidden Markov Model that captures the essential features of the known orthologs is used in the ortholog search. The introduction of a reciprocity criterion into the ortholog search makes the orthology predictions symmetric, as is the case when complete proteomes are compared. Variation of the pHMM structure allows searches at different stringency. One may either carry out an alignment of part of the EST-or protein sequence against the entire profile (stringent), or a local alignment of the sequence against part of the profile (relaxed). Profile HMMs used in the HaMStR search can be based on arbitrary primer-taxa and any source of initial orthology predictions. Thus, our procedure is applicable regardless of the molecular systematics problem one is interested in. Testing the performance of HaMStR on a test set of human EST sequences revealed a remarkably low false positive rate of the orthology predictions with good sensitivity. The reconstruction of a phylogenetic tree for 35 fungal taxa based on 178 genes shows that HaMStR can compile informative data sets for contemporary phylogenomic studies. Our procedure may help to standardize data acquisition for studies in molecular systematics, allowing for a better comparability between individual results.

Methods
For phylogeny reconstruction we aligned the ortholog cluster individually with MAFFT --maxiterate 1000 --localpair [22]. The alignments were concatenated using the perl script concatenate_alignments.pl. Alignment columns containing more than 50% undetermined amino acid positions or gaps were removed with the perl script degapper.pl. Both scripts are available upon request. Maximum likelihood phylogeny reconstruction was performed with RAxML [33]. The consensus tree from 100 bootstrap replicates was computed with TREE-PUZZLE [34] and the percentages of bootstrap trees supporting a split in the consensus tree were used as branch support values. All datasets compiled in this study are available upon request.