Correlates of substitution rate variation in mammalian protein-coding sequences

Background
Rates of molecular evolution in different lineages can vary widely, and some of this variation might be predictable from aspects of species' biology. Investigating such predictable rate variation can help us to understand the causes of molecular evolution, and could also help to improve molecular dating methods. Here we present a comprehensive study of the life history correlates of substitution rate variation across the mammals, comparing results for mitochondrial and nuclear loci, and for synonymous and nonsynonymous sites. We use phylogenetic comparative methods, refined to take into account the special nature of substitution rate data. Particular attention is paid to the widespread correlations between the components of mammalian life history, which can complicate the interpretation of results.

Results
We find that mitochondrial synonymous substitution rates, estimated from the 9 longest mitochondrial genes, show strong negative correlations with body mass and with maximum recorded lifespan, but lifespan is the sole variable to remain after multiple regression and model simplification. Nuclear synonymous substitution rates, estimated from 6 genes, show strong negative correlations with body mass and generation time, and a strong positive correlation with fecundity. In contrast to the mitochondrial results, the same trends are evident in rates of nonsynonymous substitution.

Conclusion
A substantial proportion of variation in mammalian substitution rates can be explained by aspects of their life history, implying that molecular and life history evolution are closely interlinked in this group. The strength and consistency of the nuclear body mass effect suggests that molecular dating studies may have been systematically misled, but also that methods could be improved by incorporating the finding as a priori information. Mitochondrial synonymous rates also show the body mass effect, but for apparently quite different reasons, and the strength of the relationship with maximum lifespan provides support for the hypothesis that mtDNA damage is causally linked to aging.

estimated date for their split; in any case this pair was excluded from the main analyses for other reasons discussed below.) For the substitution rate contrasts, the same procedures were used, and so the contrast in synonymous substitution rates was calculated as $(\ln dS_1 - \ln dS_2)/\sqrt{2t}$. Because the species form a sister pair, substitutions will have accrued for the same period of time along both lineages, and so the difference in log branch lengths, $\ln dS_1 - \ln dS_2 = \ln(dS_1/dS_2)$, is also the maximum likelihood estimate of the difference in log substitution rates, as required. However, the time-dependent nature of the substitution process complicates tests for homogeneity of variance. The reason is that a substitution rate contrast might be associated with high variance for two quite different reasons: (1) the tendency of deeper contrasts to be associated with greater amounts of rate evolution, or (2) the fact that shallower contrasts, or slower lineages, will accumulate fewer substitutions, such that the rate estimate is associated with more stochastic noise. A successful correction of trend (1) may result in significant negative trends due to (2).
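The contrast calculation for a single sister pair can be sketched as follows. This is only an illustration: the function name, its arguments and the numerical values are hypothetical, and the standardisation by the square root of the summed branch durations (2t) is an assumption based on the usual independent-contrasts convention.

```python
import math

def rate_contrast(dS1, dS2, t, standardise=True):
    """Contrast in log synonymous rate for a sister pair that split t time units ago.

    dS1, dS2: synonymous branch lengths (substitutions per site) of the two lineages.
    """
    # Both branches span the same time t, so ln(dS1/t) - ln(dS2/t) = ln(dS1/dS2):
    # the difference in log branch lengths equals the difference in log rates.
    raw = math.log(dS1 / dS2)
    # Assumed standardisation: divide by the square root of the summed durations, 2t.
    return raw / math.sqrt(2.0 * t) if standardise else raw

# Hypothetical values, for illustration only.
print(rate_contrast(0.30, 0.15, t=8.0))                     # standardised contrast
print(rate_contrast(0.30, 0.15, t=8.0, standardise=False))  # unstandardised, ln(2)
```

Note that the split time t cancels from the unstandardised contrast, which is why only the standardisation step depends on the dating of the pair.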
Indeed, the tests of Freckleton (2000) and Garland et al. (1992) both show significant negative trends for the standardised substitution rate contrasts (Table S1 "dS all pairs").
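A diagnostic in the spirit of Garland et al. (1992) asks whether the absolute values of the standardised contrasts are correlated with their expected standard deviations. The sketch below assumes that standard deviation is sqrt(2t) for a sister pair of age t; the function names and the data are hypothetical, and the significance test itself is omitted. The implementation used for Table S1 may differ in detail.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def variance_trend(contrasts, ages):
    """Correlation between |standardised contrast| and its assumed s.d., sqrt(2t)."""
    abs_contrasts = [abs(c) for c in contrasts]
    expected_sd = [math.sqrt(2.0 * t) for t in ages]
    return pearson_r(expected_sd, abs_contrasts)

# Hypothetical data in which contrast magnitude shrinks as the pairs get deeper.
ages = [2.0, 5.0, 10.0, 20.0, 40.0]
contrasts = [0.40, -0.31, 0.22, -0.15, 0.09]
print(variance_trend(contrasts, ages))  # negative: a trend of type (2)
```

A clearly negative correlation of this kind is what would be expected when stochastic noise in the shallow pairs dominates the contrast magnitudes.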
These trends could be removed by leaving contrasts unstandardised (Table S1), but Welch and Waxman (in press) show that including the shallow or slow contrasts (for which rates are the most inaccurately measured) can obscure the true relationship between rate and the predictor variable, and inflate Type II error rates. As such, scatterplots equivalent to Figure S1 were used to identify minimum ages and rates, below which stochastic fluctuations in substitution number appeared to be an important determinant of contrast magnitude. The plots for mitochondrial synonymous rates are shown in Figure S2. We chose to exclude the two slowest contrasts (identified by asterisks), and the nine shallowest contrasts (identified by empty circles). The reduced set of 45 contrasts (filled circles) gave no indications of inhomogeneity of variance.
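The exclusion step can be pictured as a simple filter on age and rate. In the sketch below the class, the function and both thresholds are hypothetical; in the study the cut-offs were identified by inspecting the scatterplots, not fixed in advance, and comparing the slower lineage's rate with the minimum rate is an assumption made here for illustration.

```python
from dataclasses import dataclass

@dataclass
class SisterPair:
    name: str
    t: float     # estimated age of the split
    dS1: float   # synonymous branch length, lineage 1
    dS2: float   # synonymous branch length, lineage 2

def is_reliable(pair, min_age, min_rate):
    """Keep a pair only if it is deep enough and its slower lineage is fast enough."""
    slowest_rate = min(pair.dS1, pair.dS2) / pair.t  # r = dS/t
    return pair.t >= min_age and slowest_rate >= min_rate

# Hypothetical pairs and thresholds, for illustration only.
pairs = [
    SisterPair("A", t=3.0, dS1=0.020, dS2=0.030),   # too shallow
    SisterPair("B", t=25.0, dS1=0.010, dS2=0.300),  # one lineage too slow
    SisterPair("C", t=30.0, dS1=0.400, dS2=0.600),  # retained
]
kept = [p.name for p in pairs if is_reliable(p, min_age=10.0, min_rate=0.005)]
print(kept)
```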
Furthermore, the reduced set of unstandardised contrasts showed a significant positive trend (Fig. S2b; Table S1), suggesting that the lack of a trend evident in the complete set of unstandardised contrasts reflects not homogeneity, but a U-shaped function, which is consistent with theoretical predictions (Welch and Waxman, in press).
Figure S3 shows the scatterplots for the nuclear dS data. Again, standardised contrasts showed a pattern of decreasing magnitude over time, suggesting that stochastic fluctuations in substitution counts are a major determinant of contrast magnitude for the shallower pairs. Again, too, we chose to exclude the nine shallowest contrasts, but here, the data from the excluded pairs were used to define three new deeper contrasts, shown as triangles in the plots [see Additional file 1].
To demonstrate the importance of the procedures set out here, consider the regression of synonymous rate on body mass for the nuclear data set. From Table 1 in the main text, this regression is highly significant when the set of deeper contrasts is used (i.e., the filled circles and triangles in Fig. S3c), but when the complete set of 22 unstandardised contrasts is used (i.e., the filled and empty circles in Fig. S3b), the relationship between rate and trait is obscured (n = 22; r² = 0.054; p = 0.287), presumably due to the inaccuracy of the rate estimates obtained from the shorter branches.
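As a rough arithmetic check, a p-value can be recovered from r² and n alone. The sketch below assumes that the contrasts were regressed through the origin (the usual convention for contrasts, although this passage does not state it), so that the t statistic has n - 1 degrees of freedom. The function names are hypothetical, and the t distribution is integrated numerically so that only the standard library is needed. Under that assumption the result lies close to the reported p = 0.287.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the t density over [0, |t|]."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    s = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * s * h / 3.0

def p_from_r2(r2, n, through_origin=True):
    """p-value for a single-predictor regression, given r^2 and sample size n."""
    # Assumption: regression through the origin leaves n - 1 degrees of freedom;
    # a regression with a fitted intercept leaves n - 2.
    df = n - 1 if through_origin else n - 2
    t = math.sqrt(r2 * df / (1.0 - r2))
    return two_tailed_p(t, df)

print(p_from_r2(0.054, 22))                        # regression through the origin
print(p_from_r2(0.054, 22, through_origin=False))  # with an intercept, for comparison
```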
In addition to the tests described above, which are specifically designed for phylogenetic comparative methods, all regressions were subjected to the suite of diagnostic tests implemented in R (R Development Core Team 2006), and the Shapiro-Wilk test of normality.
In common with most comparative studies of substitution rates, we chose not to increase our sample size by calculating additional contrasts between reconstructed states at the internal nodes of the phylogeny (Felsenstein 1985). In addition to the possibility of overparameterisation in the rate estimates, such contrasts are problematic for a number of reasons (Welch and Waxman, in press). First, for traits such as body size, the weighted averages used to generate internodal contrasts need not represent reconstructions of ancestral states, and analyses including only such traits can succeed even if these weighted averages deviate substantially and systematically from the true ancestral states (Grafen 1989). This is not true, however, for rate contrasts, because internal molecular

Figure S2:
Diagnostic plots for the mitochondrial synonymous rate contrasts. In these plots, r = dS/t denotes the estimated substitution rate per site, with all other details as in Figure S1.
Contrasts were excluded from the analysis if trends in the contrast magnitudes indicated that the change in rate could not be reliably estimated from the molecular branch lengths.
The excluded points are indicated with empty circles (contrasts deemed too shallow), or asterisks (contrasts where rates were too slow).

Figure S3:
Diagnostic plots for the nuclear synonymous rate contrasts. All details are as for Figure S2, with the addition that triangles indicate the three new deeper contrasts, constructed from lineages excluded from the analysis (empty circles).