Knowledge of positions that coevolve in biological sequences can be applied to predict structures of RNAs [1–3] and proteins [4–7]; to predict intermolecular interactions [6, 8]; to identify functionally important regions of molecules [9, 10]; and to identify energetic pathways through molecules [11, 12]. Coevolutionary analyses have frequently been performed on one or a few protein families. However, just as The Adaptive Evolution Database allows proteome-wide studies of evolutionary rates, proteome-wide studies of coevolution could also be performed if sufficiently fast and well-characterized methods for detecting coevolution were available.
As biological sequences are the product of an evolutionary process, it intuitively makes sense that the accuracy of analyses of the historical processes affecting them will be improved by explicit representation of those historical processes. Incorporation of phylogenetic information has benefited diverse classes of bioinformatics algorithms, including multiple sequence alignment, comparison of microbial communities, and functional annotation of genes. Accordingly, incorporation of phylogenetic knowledge to control for patterns in biological sequence data that arise from ancestry is regarded as essential for coevolutionary analyses and best achieved by directly incorporating the phylogeny in the metric [7, 17, 18].
Many coevolution algorithms ('tree-aware' methods) have explicitly attempted to control for phylogeny (e.g., [4–6, 9, 17, 19, 20]), while others ('tree-ignorant' methods) have implicitly assumed a star phylogeny (e.g., [10, 12, 21, 22]). Drawbacks have been identified for both approaches. Likelihood-based tree-aware methods are sensitive to model mis-specification, a property common to all likelihood methods, and generally require much longer compute times than tree-ignorant methods. Tree-ignorant methods are thought to have decreased specificity because correlations arising from selective pressure are confounded with correlations arising from the shared ancestry represented by the phylogeny [9, 23, 24]. Past evaluations of the effect of tree topology on the performance of coevolution algorithms have used simulated data, and have confirmed that non-star tree topologies can cause false positives [21, 23, 25, 26].
Clearly, controlling for shared ancestry is essential, but approaches that do so without explicitly representing the phylogenetic tree are possible. We hypothesized that a tree-ignorant statistic can be informative if it is compared to a distribution of the same statistic computed on data with the same embedded ancestry but varying in coevolution. In this case, correlated evolution arising from shared ancestry dominates the background distribution. A greater magnitude of correlated evolution than this background is thus evidence of coevolution.
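The idea of scoring a tree-ignorant statistic against a background that shares the same ancestry can be illustrated with a minimal sketch. The code below computes plain mutual information (MI) for every pair of alignment columns and expresses each score as a z-score relative to all pairs in the same alignment. The function names and the z-score normalization are assumptions made for illustration; this is not the transformation used by any of the specific methods compared in this study.

```python
from collections import Counter
from itertools import combinations
from math import log2
from statistics import mean, pstdev


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def background_z_scores(alignment):
    """Score every column pair against the distribution over all pairs.

    `alignment` is a list of equal-length sequences (hypothetical input
    format). All pairs come from the same sequences and therefore share
    the same ancestry, so the background reflects that ancestry.
    """
    columns = list(zip(*alignment))
    scores = {
        (i, j): mutual_information(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        return {pair: 0.0 for pair in scores}
    return {pair: (s - mu) / sigma for pair, s in scores.items()}
```

In a toy alignment such as `["AKC", "AKD", "DEC", "DED"]`, the first two columns covary perfectly and receive the highest score relative to the background, while pairs involving the third column fall below it.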
An additional consideration for estimating coevolution is that encoding protein alignments with reduced-state amino acid alphabets reduces computational complexity, and may also increase statistical power [5, 7, 21, 27]. In a reduced-state alphabet, the twenty amino acids are collapsed to a smaller number of states. For example, a three-state 'charge' alphabet can be achieved by treating His, Lys, and Arg as the 'positively charged' state; Asp and Glu as the 'negatively charged' state; and, all other residues as the 'uncharged' state. The recoding chosen for a group of sequences constitutes an explicit hypothesis concerning the primary biochemical property subjected to coevolutionary pressures by natural selection. The motivation for choosing to recode sequences stems from the bias-variance trade-off, where statistical models with fewer parameters have lower variance (and typically greater statistical power) but more bias. A coevolution algorithm applied to sequences with fewer states should therefore have more power to identify pairs of positions which coevolve as a result of the physicochemical property being modeled (e.g., charge) because variability within each state is hidden. Information is lost in recoding to a reduced-state alphabet, so the power for detecting coevolution arising from other properties of amino acid residues (e.g., side-chain volume) decreases. The sensitivity of inference concerning coevolution to the encoding choice is unclear.
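As a minimal sketch of the three-state 'charge' recoding described above, the following maps each of the twenty amino acids to one of three state symbols. The symbols `p`, `n`, and `u`, the function names, and the choice to pass gaps and unrecognized characters through unchanged are illustrative assumptions, not details taken from this study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Residue groups for the 'charge' alphabet described in the text:
# His, Lys, Arg are positive; Asp, Glu are negative; all others uncharged.
CHARGE_GROUPS = {"p": "HKR", "n": "DE"}


def build_recoding_table(groups, default="u"):
    """Map every amino acid to a state symbol; unlisted residues get `default`."""
    table = {aa: default for aa in AMINO_ACIDS}
    for state, residues in groups.items():
        for aa in residues:
            table[aa] = state
    return table


def recode(sequence, table):
    """Recode a sequence; gaps and unrecognized characters are kept as-is."""
    return "".join(table.get(ch, ch) for ch in sequence)


charge_table = build_recoding_table(CHARGE_GROUPS)
```

For example, `recode("HKRDEAG-", charge_table)` gives `"pppnnuu-"`: the distinction between Lys and Arg is hidden, which is exactly the information that the recoding discards.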
Evaluations of coevolution algorithms on simulated data elucidate the strengths and limitations of the algorithms, but are forced to rely on simplifying assumptions about the biological systems being modeled. Evaluations on biological data are therefore important for understanding how an algorithm will perform under more realistic circumstances. Biologically relevant evaluations are difficult, however, because we have little knowledge of when sequence positions truly coevolve, and therefore do not have a good idea of what the true positives are.
Different approaches have been employed to define coevolutionary positive controls. Individual cases of coevolution are directly supported by the observation that variants known to cause disease in one species occur in another species. This class of variant has been reported for both RNA and protein-coding genes and does exhibit strong statistical evidence for coevolution. The suitability of this class of variation for examining the properties of coevolution, however, is low for both practical and biological reasons: the number of cases for which there is sufficient data from related species is low; and the identification of these variants as pathological suggests that the selection coefficients operating on them are very strong and thus may not be representative of the strength of selection responsible for most coevolution.
An alternative approach has been to focus on candidate molecular-structure influences likely to be subjected to natural selection. Past evaluations on (non-simulated) RNA alignments have treated base pairs as positive controls and all other pairs as negative controls [1, 2, 20, 24]. These have been useful for comparing algorithms on RNA, but it is not clear that performance on RNA alignments translates well to performance on protein alignments because interactions between residues in proteins are generally more complex. Protein gold standards have been designed to evaluate a method's ability to identify residue contacts in tertiary structure by defining residue pairs within a certain Cβ distance in a representative crystal structure as positive controls, and all other pairs as negative controls [5, 9, 20–22, 29–32]. The set of residue pairs within a small Cβ distance in a representative crystal structure is recognized as a coarse criterion because it is not clear that close physical proximity is an essential precondition for coevolution, and because a single crystal structure may not accurately describe the tertiary contacts in all sequences in the alignment.
To complement residue-contact-based comparisons, we present a novel secondary-structure-based method for comparing coevolution methods where the known periodic stabilizing interactions between stacked residues in protein alpha helices are taken as positive controls. Double-mutant studies of protein alpha helices have shown that stacked residues in alpha helices interact to stabilize the helix [33–36]. Statistical analyses support these results by showing that the interactions are present in diverse families of alpha helices [5, 9, 37–39]. Stabilization is thought to result from ionic interaction, aromatic-aromatic interaction, or hydrogen bonding between stacked side chains. Although there has been discussion on the validity of these studies, biophysical and statistical analyses continue to support the case for stabilizing interactions. These interactions occur between the stacked positions in the alpha helix, that is, the positions separated by three residues in primary structure (i, i + 4) (where i refers to the sequence position), and to a lesser extent between positions separated by two residues (i, i + 3), corresponding to the 3.6 residue per turn periodicity of the alpha helix. Since interactions between stacked residues appear important for alpha helix stability, we argue that positions should coevolve to conserve these interactions. Methods for detecting coevolution in proteins should therefore identify stacked residues in alpha helices, as illustrated in [5, 26, 38, 41], providing a positive control for coevolution detection algorithms. We emphasize that we are not presenting the coevolution algorithms that we test as methods for detecting alpha helices from sequence data, but rather exploiting the known regular structure of the helix as a gold standard for detecting coevolution: methods for detecting coevolution should, at minimum, be able to recapture these regularities.
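The geometry behind the (i, i + 3) and (i, i + 4) positive controls can be sketched in a few lines. With 3.6 residues per turn, each residue advances roughly 100° around the helix axis, so residues four positions apart lie about 40° apart on the helix surface and residues three positions apart about 60° apart. The function names and the zero-based, inclusive helix boundaries below are hypothetical choices for illustration and are not taken from the evaluation procedure of this study.

```python
RESIDUES_PER_TURN = 3.6  # alpha helix periodicity cited in the text


def angular_offset(separation):
    """Smallest angle (degrees) around the helix axis between residues
    i and i + separation, assuming an ideal alpha helix."""
    angle = (separation * 360.0 / RESIDUES_PER_TURN) % 360.0
    return min(angle, 360.0 - angle)


def helix_positive_controls(helix_start, helix_end, offsets=(3, 4)):
    """List (i, i + k) position pairs that fall entirely inside one helix.

    `helix_start` and `helix_end` are inclusive, zero-based alignment
    positions (an assumed convention).
    """
    return [
        (i, i + k)
        for k in offsets
        for i in range(helix_start, helix_end - k + 1)
    ]
```

For a six-residue helix spanning positions 0 to 5, `helix_positive_controls(0, 5)` yields the (i, i + 3) pairs `(0, 3)`, `(1, 4)`, `(2, 5)` followed by the (i, i + 4) pairs `(0, 4)`, `(1, 5)`.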
We report an assessment of the hypothesis that appropriately transformed tree-ignorant metrics have similar statistical power to tree-aware approaches by performing a systematic comparison of nine coevolution algorithms. Five of the algorithms – Mutual Information (MI), Normalized Mutual Information (NMI), Resampled Mutual Information (RMI) (introduced here), Statistical Coupling Analysis (SCA), and Corrected Mutual Information (MIp) – use multiple sequence alignments but no phylogenetic trees. The other four methods – LnLCorr [5, 7], Ancestral States (AS) [4, 17], the Generalized Continuous-Time Markov Process Coevolutionary Algorithm (GCTMPCA) [3, 6], and CoMap [20, 24] – use multiple sequence alignments and phylogenetic trees. We additionally considered including the method described in , but opted to include NMI and MIp instead since they are expected to perform better than the former (K. Wollenberg, personal communication). The algorithms were compared by application to real (i.e., non-simulated) protein sequence alignments.
In a secondary study, our alignments are recoded in 52 different reduced-state amino acid alphabets to evaluate the utility of amino acid alphabets that model different chemical properties, and to test the hypothesis that alphabets with fewer states are generally better for detecting coevolution.