We used the publicly available genomes from GenBank [45] and the Joint Genome Institute [46]. We relied on the annotations to identify protein-coding genes, but not to identify orthologous sequences. All the software developed for this project is available from the authors upon request.

### Orthologous sequences

We used our own data-mining program (manuscript in submission), similar to an all-against-all BLAST search [47], to find orthologous sequences and rank them by degree of conservation. Given a set of genomes and a sequence length, say 300 amino acid residues, this program returns a ranked list of the 200 most conserved proteins at that length, along with pointers into the genomes for the locations of the sequences. If there are not 200 well-conserved orthologous families within the set of genomes – for example if the set includes a mix of eubacteria and archaea – then the program returns only as many families as are deemed to be obvious homology (approximately 30% pairwise identity). The choice of 200 is somewhat arbitrary, but seemed to be close to the maximum for genome sets containing more than one bacterial phylum, without incurring much misalignment or mutational saturation. The program is not designed to find remote homology, and existing tools such as PSI-BLAST [48] are in fact better for this task. We ran the program for several different sequence lengths from 60 to 300, and included each representative protein only once, at the maximum length for which it was representative.

We could have used existing tools such as BLAST or existing databases such as COGs for ortholog assembly, but our own program offered several advantages. First, it ranks ortholog families by conservation level, measured by the quartile log odds similarity over all pairs, that is, the similarity score greater than 1/4 and smaller than 3/4 of the pairwise similarities. (Thus if a gene is missing from more than 1/4 of the genomes under study, it has a conservation level near zero.) The program limits attention to sequences of fixed length, such as 300 residues, in order to compare conservation levels fairly. Then by examining ortholog families in order of decreasing conservation level, the program screens out families that are not sufficiently conserved. Second, it is much faster than BLAST, so that we could perform ortholog assembly separately for each set of genomes, thereby finding proteins conserved within the taxa under study, but not conserved more generally. Third, because it finds ortholog families using all pairwise alignments of candidate orthologs, rather than by a reciprocal or circular BLAST search, it also minimizes the problem of hidden paralogy [49]. We rarely observed paralogs (as identified by GenBank annotations) when the sequence length was greater than 100 residues. Moreover, selection of representative proteins should filter out remaining hidden paralogy that could mislead the tree-building program.

### Representative sequences

The next step condenses the large amount of information in the conserved sequences to a matrix { **D** (*i*, *j*)} of pairwise evolutionary distances. The step also retains a matrix { *D* (*i, j, k*)} for each protein *k*, that is, the pairwise evolutionary distances given by the *k* -th set of orthologous sequences. The amount of agreement between { *D* (*i, j, k*)} and { **D** (*i*, *j*)} determines whether protein *k* is representative. We start by computing, for each pair of organisms and each protein, a pairwise alignment and a log odds score (Smith-Waterman algorithm using the BLOSUM50 substitution matrix with default gap costs). This gives a three-dimensional array of log odds similarity scores *S* (*i, j, k*), where *i* and *j* index the organisms and *k* indexes the proteins. There are many alternative choices of similarity scores, simpler scores such as percent identity and more complicated scores involving multiple BLOSUM matrices; the alternatives we tried gave nearly identical choices of representative proteins. We assume that each protein has its own "clock" as in Yang's proportional model [50], so that in any given time period, the log odds of a given amount of evolution for protein *k* is a fixed multiple of the log odds of the same amount of evolution for protein *k'*. We compute the multiplier *M*
_{
k
}for protein *k* with an iterative procedure. We first detect whether an organism *i* is missing protein *k*; for such an *i* and *k*, score *S* (*i, j, k*) is very low for all *j*. We discard these *S* (*i, j, k*) scores altogether. Then starting with *M*
_{
k
}= 1 for all *k*, we alternate the following steps. Convergence (to three decimal places) resulted after only three iterations.

- 1.
For each *i* and *j*, set **S** (*i, j*) ← trimmed mean *M*
_{
k
}
*S* (*i, j, k*). For each *i* and *j*, we drop the top and bottom 20% of the *M*
_{
k
}
*S* (*i, j, k*) values and compute the mean of the middle 60%. We chose this trimmed mean after estimating the size of distribution tails (anomalous distances) in histograms such as those shown in Figure 6.

- 2.
For each *k*, set *M*
_{
k
}← median **S** (*i, j*)/*S* (*i, j, k*). Then normalize *M*
_{
k
}← *M*
_{
k
}· *c*, where *c* = *m*/∑_{
k
}
*M*
_{
k
}and *m* is the number of proteins, so that the *M*
_{
k
}'s average 1.0.

Step 1 forms a consensus over proteins and step 2 forms a consensus over organism pairs. We used robust statistics (trimmed means and medians) instead of ordinary means due to the many outliers. Multipliers typically varied from about 0.5 for the most conserved protein to about 1.6 for the 200-th most conserved protein. Novichkov et al. [26] independently developed a similar procedure; however, they used the median rather than trimmed mean in Step 1 and simply ran the two steps once each, without iteration. Because the consensus similarities S (*i, j*) improve with estimates of the multipliers *M*
_{
k
}, iteration gives better results, as we discuss below.

We convert the scaled similarity scores *M*
_{
k
}· *S* (*i, j, k*) to evolutionary distances *D* (*i, j, k*) = *C* - log (*M*
_{
k
}· *S* (*i, j, k*)), where *C* is a constant chosen so that all the *D* (*i, j, k*) are positive. In the case that organism *i* or *j* was identified as missing protein *k*, we set *D* (*i, j, k*) = ∞. Figure 6 shows histograms of *D* (*i, j, k*) values for three different pairs of organisms; the distances for missing proteins appear at 100+.

We set the consensus distance **D** (*i, j*) between organisms *i* and *j* to be the trimmed mean (again using the middle 60%) of the finite (not from missing proteins) *D* (*i, j, k*) values. We tested how well distance matrices conformed to trees by running the Fitch-Margoliash algorithm [1, 51] (program FITCH in the PHYLIP package [52]). For example, the Fitch-Margoliash tree [see Figure S1 in Additional file 1] gives 0.199 relative squared errror, that is, 0.199 = ∑_{
i, j
}(**T** (*i, j*) - **D** (*i, j*))^{2}/**D** (*i, j*)^{2}, where **T** (*i, j*) is the pairwise distance given by the tree, and gives 1.51% average percent standard deviation (APSD) [51, 52], a measure of the typical error of **T** (*i, j*) relative to **D** (*i, j*). For comparison, we tested the median procedure of Novichkov et al. by computing consensus similarities with **S**
_{
N
}(*i, j*) = median *S* (*i, j, k*) and consensus distances with **D**
_{
N
}(*i, j*) = *C* - log **S**
_{
N
}(*i, j*). On 8 out of the 9 distance matrices used in our study, { **D** (*i*, *j*)} conformed to tree distances (computed with FITCH) more closely than did { **D**
_{
N
}(*i, j*)}, with APSDs ranging from 0.63 to 3.67 for { **D** (*i, j*)} and from 0.92 to 4.26 for { **D**
_{
N
}(*i, j*)}. For the organisms in Figure S1 the FITCH tree produced from { **D**
_{
N
}(*i, j*)} broke several accepted clades and achieved a relatively poor APSD of 4.24%. The one case on which { D
_{
N
}(*i, j*)} outperformed { **D** (*i, j*)} was the taxon sample that included the partially sequenced genome Chloroflexus.

The distances { *D* (*i, j, k*)} given by a protein *k* can be regarded as a vector with *N* = *n* (*n* - 1)/2 entries, where *n* is the number of organisms. We computed three different measures of how well { *D* (*i, j, k*)} represents the consensus matrix { **D** (*i, j*)}. The three measures are: (1) correlation coefficient of { *D* (*i, j, k*)} with { **D** (*i, j*)}; (2) standard deviation, that is,
; and (3) max distance, that is, max_{
i, j
}| *D* (*i, j*, *k*) - **D** (*i, j*)|. We chose the top-ranking proteins by max distance, enough proteins to give approximately 10,000 columns; we marked the worst 20% of proteins by each of measures (1) and (2), and dropped all marked proteins, thereby obtaining 6000–8000 columns. Thus a protein had to be good on all three measures to be considered representative; in particular, a protein had to be ubiquitous over the organisms under study. For this reason, we initially left out organisms with very reduced genomes, such as Mycoplasma and Buchnera; these organisms are best added to the tree (Figures 3 and 5) later using representative proteins chosen for a smaller range of taxa.

### Multiple alignment and tree building

We used CLUSTAL W [53] to compute a multiple alignment for each set of orthologs. We removed columns of possibly incorrect alignment using a method recommended by Sidow (personal communication). We first removed all columns containing a gap character -, and then removed columns to the left and right of a gap column until reaching a column of chemical agreement. A column of chemical agreement is one in which all the amino acids in the column are from a single group, where the groups are acidic residues {D, E}, aromatic residues {F, W, Y}, basic residues {H, K, R}, cysteine {C}, nonpolar residues {A, G, I, L, P, V}, and polar residues {M, N, Q, S, T}. We also removed blocks of more than eight consecutive columns in which no column was one of chemical agreement; even if such a highly variable block is a correct alignment (one-for-one amino acid substitution over evolutionary time), it may have too many superimposed mutations to be helpful for phylogeny. Finally we concatenated all the cleaned alignment files, obtaining a multiple alignment of about 5000–7000 columns.

We used SEMPHY Version 0.9 to compute phylogenetic trees [54]. SEMPHY is a relatively fast EM (expectation maximization) program for ML (maximum likelihood) phylogeny, which assumes a Markov model of evolution. We used the JTT model [55], which is SEMPHY's default. The newer SEMPHY Version 1.0 models mutation rate variation among columns with a discrete gamma distribution, but we found only insignificant differences in the resulting ML trees, so we preferred the simpler, homogeneous-rate model. Cleaned alignments of representative proteins tend to be more homogeneous than alignments of random proteins. To estimate the confidence of clades, we used a bootstrapping procedure or – more precisely – a jackknife procedure. To create a random subsample of the entire alignment file we included each block of 80 consecutive columns in the subsample with probability 0.5; this gives a slightly harsher test than the standard bootstrap. We tested one tree (the overall tree for Figure 1) both ways; the jackknife numbers were all 0–10% smaller than the corresponding bootstrap numbers. The rationale for randomly sampling blocks rather than individual columns was to effectively vary the set of representative proteins as in [44]. For each ML tree computation, we created 100 random subsamples and ran SEMPHY on each subsample. Typical running times were 15–20 minutes per subsample, or about 30 hours for 100 subsamples. We also tried PhyML [56], another fast program for ML phylogeny. With a single substition rate category, PhyML Version 2.4.4 is faster than SEMPHY (7 minutes versus 17 minutes for a computation with 29 organisms and 5424 columns). On the other hand, PhyML may be more prone to converge to a suboptimal local optimum, as we sometimes obtained a greater likelihood within PhyML by inputting SEMPHY's solution as PhyML's starting tree.

To produce Figure 1, we made five different runs of tree computations with different sets of genomes. We chose representative proteins separately for each run, because representativeness depends upon the set of genomes under study. An initial run with 30 diverse bacteria [the 30 bacteria shown in Figure S1 of Additional file 1], with all phyla except Planctomycetes represented, sketched out an unrooted phylogeny. The root was placed by a run with the same 30 eubacteria along with two archaea, Aeropyrum and Methanopyrus. We then split the tree in two pieces ("left" and "right") by cutting it at the deep interior edge with best bootstrap support, the doubled edge shown in the figure. Computing a large phylogenetic tree in pieces allows the use of more proteins of greater representativeness, but such a split must be done very cautiously because each organism potentially affects all the unknown sequences at interior nodes. We added more organisms, such as Xanthomonas and Vibrio, assigning them to the left or right half according to the accepted taxonomy. We then added a right-half organism (Chlorobium) to the left half, and a left-half organism (Escherichia) to the right half, and made two more runs of tree computations, one for the left and one for the right. These runs corroborated the split by their placements of Chlorobium and Escherichia. In Figure 1 all edge lengths and bootstrap-support numbers refer to these last two runs, except for the numbers on the doubled edge and the root. (Edge lengths on the left and right sides of the tree are thus not strictly comparable as they refer to different proteins.) Finally, we added Chloroflexus by rerunning the overall computation, that is, 31 organisms in all. The draft genome, however, is missing many typically representative proteins such as GroEL and elongation factor TU, so we could harvest only about 1200 columns of representative proteins, and we indicate its position by a dashed edge without bootstrap support. (The 84 in Figure 1 is the support for the Actinobacteria/Deinococcus/Cyanobacteria clade.) In order to produce Figure 2, we reran the organisms of the left half along with some other fully sequenced Proteobacteria, Spirochaetes, and Chlamydiales. For Figure 3, we reran our procedures on *γ* -and *β* -proteobacteria, now including the reduced genomes of Buchnera and Wigglesworthia. For Figure 4, we reran most of the organisms of the right half along with some other genomes, and finally for Figure 5 we ran Actinobacteria and Firmicutes (along with Fusobacterium and Deinococcus), now including the reduced genomes of Mycoplasma and Ureaplasma. We attempted to include these reduced genomes at the higher level of Figure 4, but obtained inconsistent results depending upon the proteins selected.