# Covariance of maximum likelihood evolutionary distances between sequences aligned pairwise

- Christophe Dessimoz†
^{1, 2}and - Manuel Gil†
^{1, 2}Email author

**8**:179

https://doi.org/10.1186/1471-2148-8-179

© Dessimoz and Gil; licensee BioMed Central Ltd. 2008

**Received: **19 March 2008

**Accepted: **23 June 2008

**Published: **23 June 2008

## Abstract

### Background

The estimation of a distance between two biological sequences is a fundamental process in molecular evolution. It is usually performed by maximum likelihood (ML) on characters aligned either pairwise or jointly in a multiple sequence alignment (MSA). Estimators for the covariance of pairs from an MSA are known, but we are not aware of any solution for cases of pairs aligned independently. In large-scale analyses, it may be too costly to compute MSAs every time distances must be compared, and therefore a covariance estimator for distances estimated from pairs aligned independently is desirable. Knowledge of covariances improves any process that compares or combines distances, such as in generalized least-squares phylogenetic tree building, orthology inference, or lateral gene transfer detection.

### Results

In this paper, we introduce an estimator for the covariance of distances from sequences aligned pairwise. Its performance is analyzed through extensive Monte Carlo simulations, and compared to the well-known variance estimator of ML distances. Our covariance estimator can be used together with the ML variance estimator to form covariance matrices.

### Conclusion

The estimator performs similarly to the ML variance estimator. In particular, it shows no sign of bias when sequence divergence is below 150 PAM units (i.e. above ~29% expected sequence identity). Above that distance, the covariances tend to be underestimated, but then ML variances are also underestimated.

## Background

The estimation of evolutionary distances between gene/protein sequences is one of the most important problems in molecular evolution. In particular, it lies at the heart of most phylogenetic tree construction methods. The estimation of such distances is a two step process: first, homologous characters are identified, then the distances are estimated from the character substitution patterns. The most accurate matching of homologous characters is obtained by multiple sequence alignments (MSAs). Indeed, by considering all sequences simultaneously, MSAs yield a consistent and in principle optimal grouping of the homologous characters. Unfortunately, MSAs are hard to compute optimally (time complexity exponential in the number of sequences), and thus are in practice computed using heuristics. Alternatively, the sequences can be analyzed exclusively on the basis of pairs of sequences, using an algorithm such as Smith-Waterman [1] that yields optimal pairwise alignments (OPAs). This approach is often taken by large-scale comparative genomics analysis such as MIPS, OMA or RoundUp [2–4], which analyze the sequences pairwise due to computational constraints.

In all cases, the estimation of evolutionary distances is subject to inference uncertainty, which is commonly quantified by their variances and covariances. Indeed, the distance variance information can be used to build confidence intervals around the estimate; covariances of pairs of distances can be used to build the confidence intervals of combinations of distances. Examples of applications include generalized least squares (GLS) phylogenetic tree building [5] construction of confidence sets of trees [6], test for monophyly using likelihood ratios [7], comparison of evolutionary distances for orthology inference [3], or distance-based lateral gene transfer detection [8]

Variance estimates are provided by ML theory in both joint and pairwise distances estimation. However, ML theory only provides covariance estimates if all distances are estimated jointly. Covariance estimates for distances computed from IPAs in the context of specific parametric substitution models have been reported by Hasegawa et al. [9] and Bulmer [6], and were generalized by Susko [10] to all Markovian models of evolution. Furthermore, the covariance of distances from IPAs can also be estimated (though much more slowly) through bootstrapping [11]. As for the covariance of distances obtained from OPAs, the main difficulty in computing them is that, since sequence pairs are aligned individually, they usually have inconsistencies in their inference of the homologous characters (or else, computing an MSA from pairwise alignments would be trivial). Thus, the alignments cannot be partitioned in consistent "columns" of characters, and neither Susko's method nor resampling approaches such as bootstrapping can be applied. Indeed, in the case of analyses relying exclusively on pairwise comparison and distance estimation, i.e. where no MSA computation can be afforded, we are not aware of any previously published estimator for the covariance of distances estimates from pairwise alignments.

We have shown in a previous article [12] a numerical approximation for the constrained case of the covariance of two OPA distances involving a common sequence (i.e. on a triplet of sequences), for empirical substitution models such as PAM or JTT. In this article, we present an estimator for the covariance of ML distances estimated from OPAs that works on triplets and quartets of sequences. This solves the problem of sets of sequences of arbitrary size, because each covariance involves at most four sequences at a time. Thus, the full covariance matrix is naturally obtained through quartet analysis. We analyze the performances of the estimator in terms of bias and variance. Finally, we compare the results obtained on triplets of sequences to our previous work.

## Results and Discussion

### Case of dependence

The two distances are estimated between four distinct sequences, and they have some evolution in common (i.e. the two distance involve a common branch on the tree). With such an evolutionary history, the two distances estimates covary positively.

### Case of independence

The two distances are estimated between four distinct sequences, but they have no evolution in common (i.e. the two distance involve distinct branches on the tree). This case is informative, because a central assumption in most evolutionary models is that evolution on different branches is independent [13]. With no branch in common, the distances should not covary [6]. Thus, such a topology can be used to test the estimators as negative control.

### Case of triplet

The two distances involve a common sequence, and have some evolution in common. This case is of special interest, because we have previously presented an alternate estimator for this particular case using a different approach [12]. Thus, we can compare our results to this approach, hereafter called "the numerical approximation".

Note that the covariances are estimated using the same algorithm in all three cases: we only distinguish them from each another for the purpose of this analysis.

To assess the performance of the covariance estimator, it was compared against the Monte Carlo covariance estimator. In short, each point shown in the figures was obtained from 40,000 sets of sequences mutated along a random quartet subtree of the tree of life (see *Methods* below). That way, the evaluation is based on tree samples that are distributed as closely as possible to real biological data. To account for gene families with varying rates, each quartet was scaled with a random factor uniformly distributed between 0.5 and 2. Note that results corresponding to very large distance constitute extreme cases; for instance, when sequences are 150 PAM units apart, each position has, on average, mutated 1.5 times.

It is instructive to compare the absolute bias of the covariance estimator to the well-known ML variance estimator (see e.g. [14]). As can be seen in Fig. 3b, the ML variance is also biased for high variance values. We conjecture that this is mainly due to mis-aligned positions, which cause model violations in the parameter estimation. This problem is also likely to affect the covariance estimator. Even more directly, the ML variance estimator is a factor in the expression of the covariance estimator (see *Methods*), so any error in the ML variance is propagated to the covariance estimator. At this point, improving the ML estimator for cases of high divergence is likely to require better alignments, or an explicit modeling of the mis-aligned positions, which is beyond the scope of the present work.

In the case of triplets, the bias exceeds the standard deviation already when the fraction of anchors is about 80%. The ML variance estimator has this transition around 75% of anchors. In the case of independence, where we expect our covariance estimator to be zero, its bias is always much smaller than its standard deviation (data not shown).

*A*. We refer to $\frac{||\stackrel{\u02c6}{A}-A|{|}_{2}}{||A|{|}_{2}}$ as the relative error in $\widehat{A}$, where ||·||

_{2}denotes the two-norm. Fig. 6 shows the relative error of the 2 × 2 variance-covariance matrices computed with the ML variance estimator in the diagonal entries and our covariance estimator in the off-diagonal entries, and the same 2 × 2 matrices with only diagonal entries. The plots show that for the dependence case the the matrices with both covariance and ML variance estimators have a equal or lower relative error than the matrices with the ML variance only, except for a few cases in the region with a high fraction of anchors. In the triplet case, the variance-covariance matrices have always lower error then variance matrices. Finally, in the case of independence, the matrices with covariance do not always have lower relative error, but this is expected, because the true covariance is null in this special case.

## Conclusion

We have presented a method to estimate the covariances of distances estimated from pairwise alignments. It does not require the construction of MSAs, which are hard to compute and therefore are only approximated in practice. Furthermore, it does not rely on phylogenetic trees as it is the case with covariance estimation from joint ML, or in covariance estimation methods that model the covariances as a function of shared branch lengths [15, 16]. Tree building is not only a costly process, but is also subject to inference errors.

The accuracy of our estimator is comparable to the ML variance estimator. Both estimators are biased but in both cases the bias is, for distances below 150 PAM, far smaller than their standard deviation. The bias of the covariance estimator (as well as the ML variance, and to some extent the distance estimators) becomes worse with declining percentage of anchors. These biases arise because the alignment positions under scrutiny do not constitute an unbiased subsample of the true homologous positions. Note that misaligned positions are likely to affect distances from MSAs too. A solution to this problem would lead to better distance estimates in the first place. In the meanwhile, it is probably best to issue a warning if the percentage of anchors falls below some threshold.

The estimation of evolutionary distances is a very important process in molecular evolution, and therefore the covariance estimator presented here will be of use for various applications, such as the construction of GLS trees on OPA distances, the construction of confidence sets of trees based on the GLS test statistic, relative-rate tests, distance-based lateral gene transfer detection, and in general in any process that needs to estimate confidence of distance combinations.

## Methods

### Covariance of distances from OPAs

In this section we derive a covariance estimator for ML distances from OPAs.

#### Preliminaries

*s*

_{i,j}be the character at position

*j*in a sequence

*s*

_{ i }. Only characters in bold, for example {

*s*

_{1,1},

*s*

_{2,1},

*s*

_{3,1},

*s*

_{4,1}}, are consistently aligned in the OPAs. We call such a consistent set of characters an

*anchor*. On the other hand,

*s*

_{1,2}is aligned to

*s*

_{2,2}and to

*s*

_{3,2}, so in a consistent situation it would follow that

*s*

_{2,2}and

*s*

_{3,2}should be aligned, but it is not the case.

Given *m* sequences, the anchors can formally be defined as follows: Define a graph *G*({*s*_{
i
}}) with ${\sum}_{i}^{m}\left|{s}_{i}\right|$ vertices labeled by *s*_{i,j}. We join vertices ${s}_{{i}_{1},{j}_{1}}$ and ${s}_{{i}_{2},{j}_{2}}$ if the corresponding characters are aligned in the $\text{OPA}({s}_{{i}_{1},}{s}_{{i}_{2}})$. The set of anchors for the $\left(\begin{array}{c}m\\ 2\end{array}\right)$ OPAs is defined as the set of all cliques of size $\left(\begin{array}{c}m\\ 2\end{array}\right)$ in *G*({*s*_{
i
}}). By construction, the sub-alignments induced by the anchors define an MSA. In the derivation of our covariance estimator, we assume that the anchor positions are correctly aligned. For the non-anchor positions, we know that some proportion is wrongly aligned in at least one of the $\left(\begin{array}{c}m\\ 2\end{array}\right)$ OPAs. We do not know, though, which positions and in which alignments. In this paper we are interested in the covariance of distances from two OPAs. In each case the anchors are determined from the particular sequences involved in the corresponding covariance estimation. If the two OPAs share a sequence *m* = 3, otherwise *m* = 4. The following pseudocode shows how the anchors can be found for *m* = 4. It uses a function $\text{M}({s}_{{i}_{1}},{s}_{{i}_{2}},{j}_{1})$ which returns the index *j*_{2} of the character ${s}_{{i}_{2},{j}_{2}}$ of *s*_{i2 }aligned to ${s}_{{i}_{1},{j}_{1}}$ in $\text{OPA}({s}_{{i}_{1},}{s}_{{i}_{2}})$.

Anchors ← {}

**for** *j*_{1} ← 1 **to** *length*(*s*_{1}) **do**

*j*_{2} ← M(*s*_{1}, *s*_{2}, *j*_{1}); *j*_{3} ← M(*s*_{1}, *s*_{3}, *j*_{1}); *j*_{4} ← M(*s*_{1}, *s*_{4}, *j*_{1})

**if** M(*s*_{2}, *s*_{3}, *i*_{2}) = *j*_{3} **and** M(*s*_{2}, *s*_{4}, *i*_{2}) = *j*_{4}

**and** M(*s*_{3}, *s*_{4}, *i*_{3}) = *j*_{4} **then**

Anchors ← Anchors ∪ $\{[{s}_{1,{j}_{1}},{s}_{2,{j}_{2}},{s}_{3,{j}_{3}},{s}_{4,{j}_{4}}]\}$

**end**

**end**

#### Formulation of the covariance estimator

*p*(

*X*

_{ j },

*d*) denote the probability of a homologous character-pair

*X*

_{ j }for the

*j*-th OPA when the distance is taken to be

*d*. We assume that the gap-positions have been removed from the alignments and that the

*j*-th OPA has length

*n*

_{ j }. Denote ${\widehat{d}}_{j}$ the distance obtained by ML and

*d*

_{ j }the true distance. It is well known from ML theory (see e.g. [14]) that under appropriate smoothness conditions, the variance of ${\widehat{d}}_{j}$ is

*j*-th OPA be

*x*

_{j,l}is the realization of

*X*

_{ j }at position

*l*. To abbreviate, we set ${u}_{j,l}(d)=\frac{\partial}{\partial d}\mathrm{ln}\phantom{\rule{0.5em}{0ex}}(p({x}_{j,l},d))$. As mentioned by Susko [10], ML results yield

*A*refer to anchors,

*N*refer to non-anchors. Since virtually all Markovian models of evolution assume independent positions, we can split the score functions in a part corresponding to the anchor positions and a non-anchor part:

*l*=

*m*. Since, up to high order terms, (${\widehat{d}}_{j}$ -

*d*

_{ j }) is equal to

*-V*

_{ j }

*u*

_{ j }(

*d*

_{ j }) we can write for the covariance of ${\widehat{d}}_{j}$ and ${\widehat{d}}_{k}$

*n*

_{ A }is the number of anchors. Because of the correctness assumption of the anchors, all pairs that are not part of the same anchor are non-homologous, and therefore, their covariance is zero, i.e. for

*l*≠

*m*, $cov\left({u}_{j,l}^{A}({d}_{j}),{u}_{k,m}^{A}({d}_{k})\right)=0$ and we get

*V*

_{ j }is estimated by

where ${\overline{u}}_{j}^{A}$ denotes the sample mean.

### Simulation methods

To evaluate the performance of the covariance estimator we performed a Monte Carlo simulation on quartets and compared our estimator to the sample covariance (also referred to as the Monte Carlo covariance).

#### Sampling of quartets

The quartets were sampled uniformly from a variance weighted least squares (WLS) tree on 352 species. The WLS tree was inferred by the *LeastSquaresTree* function in Darwin [17]. To obtain the input distance and variance matrices for *LeastSquaresTree* we used data from the OMA project [3]. The inter-species distances were determined as average PAM distances over sets of groups of orthologs. A total of 100 quartets were sampled, each one contributing one data-point to the plots shown here.

#### Simulation procedure for one quartet

To explore the branch-length space, while preserving the relative branch-length structure given by the WLS tree we applied an uniformly distributed U(0.5,2) expansion/contraction factor on each quartet. Then, we generated 40,000 times three random sequences of length *m* = {200, 500, 800} and mutated each of them along the dilated model quartet. We assumed a Markovian model of evolution using the updated PAM matrices [18] and introduced gaps of Zipfian distributed length [19].

We applied our covariance estimator on each of the 40,000 quartets and estimated its expected value and variance to compare it against the sample covariance which we also refer to as *Monte Carlo covariance*. In the analysis of the results, we treated the sample covariance as a reference value, as it constitutes an unbiased estimator for the true covariance. The biases reported in the result section are defined as the estimate of the expected value of our covariance estimator minus the Monte Carlo covariance. Note that being an estimator itself, the sample covariance's variance had also to be taken into account in the analysis of the results.

## Notes

## Declarations

### Acknowledgements

The authors thank Gaston Gonnet, Adrian Schneider and Gina Cannarozzi for helpful comments on the manuscript.

## Authors’ Affiliations

## References

- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.View ArticlePubMedGoogle Scholar
- Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, Warfsmann J, Ruepp A: MIPS: analysis and annotation of proteins from whole genomes. Nucl Acids Res. 2004, 32 (suppl 1): D41-44. 10.1093/nar/gkh092.PubMed CentralView ArticlePubMedGoogle Scholar
- Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet G: OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements. RECOMB 2005 Workshop on Comparative Genomics, Volume LNBI 3678 of Lecture Notes in Bioinformatics. Edited by: McLysath A, Huson DH. 2005, Springer-Verlag, 61-72.Google Scholar
- DeLuca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP: Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics. 2006, 22 (16): 2044-2046. 10.1093/bioinformatics/btl286.View ArticlePubMedGoogle Scholar
- Cavalli-Sforza LL, Edwards AWF: Phylogenetic analysis: models and estimation procedures. Evolution. 1967, 21: 550-570. 10.2307/2406616.View ArticleGoogle Scholar
- Bulmer M: Use of the method of generalized least-squares in reconstructing phylogenies from sequence data. Mol Biol Evol. 1991, 8 (6): 868-883.Google Scholar
- Huelsenbeck JP, Hillis DM, Rasmus N: A likelihood-ratio test of monophyly. Syst Biol. 1996, 45 (4): 546-558. 10.2307/2413530.View ArticleGoogle Scholar
- Dessimoz C, Margadant D, Gonnet GH: DLIGHT – Lateral Gene Transfer Detection Using Pairwise Evolutionary Distances in a Statistical Framework. RECOMB 08: Research in Computational Molecular Biology, 12th Annual International Conference, Singapore, 2008, Proceedings, Lecture Notes in Computer Science, Springer. 2008Google Scholar
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22: 160-174. 10.1007/BF02101694.View ArticlePubMedGoogle Scholar
- Susko E: Confidence regions and hypothesis tests for topologies using generalized least squares. Mol Biol Evol. 2003, 20 (2): 862-868. 10.1093/molbev/msg093.View ArticlePubMedGoogle Scholar
- Efron B, Tibshirani RJ: An introduction to the bootstrap. 1993, Chapman & Hall, New YorkView ArticleGoogle Scholar
- Dessimoz C, Gil M, Schneider A, Gonnet GH: Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences. BMC Bioinformatics. 2006, 7: 529-10.1186/1471-2105-7-529.PubMed CentralView ArticlePubMedGoogle Scholar
- Felsenstein J: Inferring Phylogenies. 2004, Sinauer Associates, Inc., Sunderland, MAGoogle Scholar
- Rice JA: Mathematical Statistics and Data Analysis. 2001, Duxbury PressGoogle Scholar
- Nei M, Stephens JC, Saitou N: Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol Biol Evol. 1985, 2: 66-85.PubMedGoogle Scholar
- Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. 1997, 14 (7): 685-695.View ArticlePubMedGoogle Scholar
- Gonnet GH, Hallett MT, Korostensky C, Bernardin L: Darwin v. 2.0: An Interpreted Computer Language for the Biosciences. Bioinformatics. 2000, 16 (2): 101-103. 10.1093/bioinformatics/16.2.101.View ArticlePubMedGoogle Scholar
- Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science. 1992, 256 (5003): 1443-1445. 10.1126/science.1604319.View ArticlePubMedGoogle Scholar
- Benner SA, Cohen MA, Gonnet GH: Empirical and Structural Models for Insertions and Deletions in the Divergent Evolution of Proteins. J Mol Biol. 1993, 229 (4): 1065-1082. 10.1006/jmbi.1993.1105.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.