LineageSpecificSeqgen: generating sequence data with lineage-specific variation in the proportion of variable sites
© Grievink et al; licensee BioMed Central Ltd. 2008
Received: 07 August 2008
Accepted: 21 November 2008
Published: 21 November 2008
Commonly used phylogenetic models assume a homogeneous evolutionary process throughout the tree. It is known that these homogeneous models are often too simplistic, and that with time some properties of the evolutionary process can change (due to selection or drift). In particular, as constraints on sequences evolve, the proportion of variable sites can vary between lineages. This affects the ability of phylogenetic methods to correctly estimate phylogenetic trees, especially for long timescales. To date there is no phylogenetic model that allows for change in the proportion of variable sites, and the degree to which this affects phylogenetic reconstruction is unknown.
We present LineageSpecificSeqgen, an extension to the seq-gen program that allows generation of sequences with both changes in the proportion of variable sites and changes in the rate at which sites switch between being variable and invariable. In contrast to seq-gen and its derivatives to date, we interpret branch lengths as the mean number of substitutions per variable site, as opposed to the mean number of substitutions per site (which is averaged over all sites, including invariable sites). This allows specification of the substitution rates of variable sites, independently of the proportion of invariable sites.
LineageSpecificSeqgen allows simulation of DNA and amino acid sequence alignments under a lineage-specific evolutionary process. The program can be used to test current models of evolution on sequences that have undergone lineage-specific evolution. It facilitates the development of both new methods to identify such processes in real data, and means to account for such processes. The program is available at: http://awcmee.massey.ac.nz/downloads.htm.
Simulated sequence data are widely used for hypothesis testing , for evaluation of phylogenetic methods under different parameter settings [2–4], for testing the effect of model misspecification on tree reconstruction [5–7], for development of new models and methods [8, 9], and for approximate Bayesian inference . For these applications, it is important that the processes used to produce the simulated data closely model the underlying biological processes. Commonly used phylogenetic sequence generators employ homogeneous, time reversible, stationary models of molecular sequence evolution. These phylogenetic models assume that the overall rate of substitution is the only parameter that may change along the tree and do not allow changes in other parameters, such as the rate matrix, the distribution of rates across sites and the proportion of variable sites.
It is known, however, that as sequences diverge they can acquire independent properties. In particular, the proportion of variable sites can evolve in a lineage-specific manner due to changes in evolutionary constraints [11, 12]. The proportion of variable sites in a lineage will affect its estimated substitution rate . Failure to account for changes in the proportion of variable sites can result in erroneous rate estimates that may affect tree estimation [5, 14]. Indeed, change in the proportion of variable sites is thought to be one of the main causes of long-branch attraction [12, 15].
In addition to the possible shift in the proportion of variable sites, it is known that sites can switch between variable and invariable states due to drift. Note that invariable sites (which we are concerned with in this paper) are sites for which the probability of character substitution is zero; as opposed to invariant sites for which the probability of character substitution is greater than zero but for a certain group (sample) of taxa no substitution is found. The strict covarion model  allows sites to switch between variable and invariable states; however, at equilibrium the proportion of variable sites is constant over the different lineages. Several extensions of the covarion model [17–19] are implemented in the sequence generator seq-gen-aminocov . However, to date, there is no model that allows for change in the proportion of variable sites.
Using partitions (a partition is group of consecutive sites that are simulated on the same underlying tree), lineage-specific proportions of variable sites have been simulated with sequence generators such as seq-gen , seq-gen-cov , and seq-gen-aminocov . These simulations have proven to be very useful in facilitating our understanding of the process of lineage-specific evolution. However, the use of these programs for the purpose of simulating changes in the proportion of variable sites is limited to trees with very few 'events', where an event is defined as a position on the tree where a change in the process of evolution occurs, e.g. a change in the proportion of variable sites. This is because different proportions of variable sites are generated using pre-defined partitions, where each partition is simulated on a tree with different branch lengths (zero branch lengths are used for invariable sites). For two events, in which the proportion of variable sites changes, there are 8 partitions . In general there are 2(1+number of events) partitions (M. Steel, personal communication), so creating the input for such simulations becomes a difficult task. Furthermore, in seq-gen, invariable sites can be incorporated into sequences by either simulating on different partitions (where a partition for invariable sites is simulated on a zero length tree), or specifying a proportion of invariable sites (Pinv) using the -i option. Intuitively, one might expect the processes of evolution simulated by these two methods to be equivalent, but this is not the case. In seq-gen and its modifications published to date, branch lengths are defined as the mean expected number of substitutions per site. When sequences are simulated with a specified proportion of invariable sites, the branch lengths specified by the user are rescaled (increased) by the program to compensate for the proportion of invariable sites. Hence, increasing the proportion of invariable sites (for which the substitution rate is zero) forces a greater substitution rate on the variable sites. For example, with 80% invariable sites and an expected mean number of substitutions of 0.02, the mean number of substitutions of the variable sites will be rescaled to 0.1. Although this branch rescaling is consistent with the definition of branch lengths as the mean expected number of substitutions per site, we found that many researchers are not aware of it. Moreover, using partitions does not allow changes in the on/off switch rate of the covarion model. We have developed a program that allows the user to simulate sequence data containing changes in the proportion of variable sites, and changes in the covarion switch rate, without the need to specify partitions or rescale branch lengths.
For each input tree, the program will generate n subtrees (n = 1+number of events), with each event on the tree defining a cutting point (see Figure 1). For each input tree and each dataset, sequences are first simulated on the subtree under the root and then on the subtree under each event in an iterative manner. An array holding the state (variable/invariable) of each position is updated at each event according to the change in the proportion of variable sites specified by the user. Events can be specified as correlated, although by default they are non-correlated. For correlated events the positions of sites that switch state are identical, for non-correlated events these positions are independent. An array holding the hidden states of the covarion model is also passed down the tree. For each site, along each branch, exponential times for switches are generated; the hidden states array is updated at the internal nodes of each subtree according to the specified covarion model and the switch rate for each event. The sequence at each event is used as the ancestral sequence for the subtree beneath it. The output is an alignment of the resulting tip sequences.
For the reasons described in the Background, we added a default option where branch lengths are defined as the mean expected numbers of substitutions per variable site. This definition allows the substitution rate across variable sites to be independent of the proportion of invariable sites. When branch lengths are defined as the mean expected numbers of substitutions per variable site, the processes of evolution simulated by both specified partitions and specified Pinv are equivalent.
Results and discussion
Example 1 – generating data containing a change in the proportion of variable sites
((((A:0.1, B:0.1):0.1,(C:0.1, D:0.1):0.1):0.2,((E:0.1, F:0.1):0.1,(G:0.1, H:0.1):0.1):0.1$1st_event:0.1):0.02,(((I:0.1, J:0.1):0.1,(K:0.1, L:0.1):0.1):0.1$2nd_event:0.1,((M:0.1, N:0. 1):0.1,(O:0.1, P:0.1):0.1):0.2):0.02);
Number of sites that vary in each of the four groups, and each pair of groups, for the three datasets.
Pinv = 0 (all sites are variable)
Pinv = 0.8
Pinv = 0.8
Two correlated events
Pvar+ = 0.2
1 and 2
1 and 3
1 and 4
2 and 3
2 and 4
3 and 4
Example 2 – testing tree reconstruction accuracy for data containing a change in the proportion of variable sites
This tree is input as:
((A:0.4, B:0.3$1st_event:0.1):0.02,(C:0.3$2nd_event:0.1, D:0.4):0.02);
LineageSpecificSeqgen is a sequence generator that allows simulation of changes in the proportion of variable sites, a biochemically realistic process of evolution. It is useful for testing current models of evolution on sequences that have undergone lineage-specific evolution, developing methods to identify such processes in real data, and developing means to account for such processes.
Availability and requirements
Project name: LineageSpecificSeqgen
LineageSpecificSeqgen, including the source code and documentation, can be downloaded from http://awcmee.massey.ac.nz/downloads.htm.
Operating System: The program can be compiled and run on Unix, Linux, and Mac OS.
Programming Language: ANSI C.
Other requirements: None.
License: GNU GPL.
Any restrictions to use by non-academics: None.
LineageSpecificSeqgen is provided with no guarantee or warranty of any kind, although the authors are happy to provide assistance if needed.
We thank Pete Lockhart, Bill Martin and Mike Steel for their invaluable contribution in discussions. We also thank Matthew Spencer for his suggestions for debugging. This work was financially supported by the New Zealand Marsden fund (05-MAU-033 to BRH).
- Goldman N: Statistical Tests of Models of DNA Substitution. Journal of Molecular Evolution. 1993, 36 (2): 182-198. 10.1007/BF00166252.View ArticlePubMedGoogle Scholar
- Hillis DM, Huelsenbeck JP, Cunningham CW: Application and Accuracy of Molecular Phylogenies. Science. 1994, 264 (5159): 671-677. 10.1126/science.8171318.View ArticlePubMedGoogle Scholar
- Holland BR, Penny D, Hendy MD: Outgroup misplacement and phylogenetic inaccuracy under a molecular clock – A simulation study. Systematic Biology. 2003, 52 (2): 229-238. 10.1080/10635150390192771.View ArticlePubMedGoogle Scholar
- Shavit L, Penny D, Hendy MD, Holland BR: The Problem of Rooting Rapid Radiations. Mol Biol Evol. 2007, 24 (11): 2400-2411. 10.1093/molbev/msm178.View ArticlePubMedGoogle Scholar
- Gruenheit N, Lockhart PJ, Steel M, Martin W: Difficulties in testing for covarion-like properties of sequences under the confounding influence of changing proportions of variable sites. Mol Biol Evol. 2008, 25 (7): 1512-1520. 10.1093/molbev/msn098.View ArticlePubMedGoogle Scholar
- Kolaczkowski B, Thornton JW: A mixed branch length model of heterotachy improves phylogenetic accuracy. Mol Biol Evol. 2008, 25 (6): 1054-1066. 10.1093/molbev/msn042.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruano-Rubio V, Fares MA: Artifactual phylogenies caused by correlated distribution of substitution rates among sites and lineages: the good, the bad, and the ugly. Syst Biol. 2007, 56 (1): 68-82. 10.1080/10635150601175578.View ArticlePubMedGoogle Scholar
- Pagel M, Meade A: A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst Biol. 2004, 53 (4): 571-581. 10.1080/10635150490468675.View ArticlePubMedGoogle Scholar
- Soria-Carrasco V, Talavera G, Igea J, Castresana J: The K tree score: quantification of differences in the relative branch length and topology of phylogenetic trees. Bioinformatics. 2007, 23 (21): 2954-2956. 10.1093/bioinformatics/btm466.View ArticlePubMedGoogle Scholar
- Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian computation in population genetics. Genetics. 2002, 162 (4): 2025-2035.PubMed CentralPubMedGoogle Scholar
- Lopez P, Casane D, Philippe H: Heterotachy, an important process of protein evolution. Mol Biol Evol. 2002, 19 (1): 1-7.View ArticlePubMedGoogle Scholar
- Lockhart P, Novis P, Milligan BG, Riden J, Rambaut A, Larkum T: Heterotachy and tree building: A case study with plastids and eubacteria. Molecular Biology and Evolution. 2006, 23 (1): 40-45. 10.1093/molbev/msj005.View ArticlePubMedGoogle Scholar
- Dickerson RE: The structures of cytochrome c and the rates of molecular evolution. Molecular Evolution. 1971, 1: 26-45. 10.1007/BF01659392.View ArticleGoogle Scholar
- Lockhart PJ, Steel MA: A Tale of Two Processes. Systematic Biology. 2005, 54 (6): 948-951. 10.1080/10635150500234682.View ArticlePubMedGoogle Scholar
- Inagaki Y, Susko E, Fast NM, Roger AJ: Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1 alpha phylogenies. Molecular Biology and Evolution. 2004, 21 (7): 1340-1349. 10.1093/molbev/msh130.View ArticlePubMedGoogle Scholar
- Tuffley C, Steel M: Modeling the covarion hypothesis of nucleotide substitution. Mathematical Biosciences. 1998, 147 (1): 63-91. 10.1016/S0025-5564(97)00081-3.View ArticlePubMedGoogle Scholar
- Galtier N: Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol. 2001, 18 (5): 866-873.View ArticlePubMedGoogle Scholar
- Huelsenbeck JP: Testing a covariotide model of DNA substitution. Mol Biol Evol. 2002, 19 (5): 698-707.View ArticlePubMedGoogle Scholar
- Wang HC, Spencer M, Susko E, Roger AJ: Testing for covarion-like evolution in protein sequences. Molecular Biology and Evolution. 2007, 24 (1): 294-305. 10.1093/molbev/msl155.View ArticlePubMedGoogle Scholar
- Rambaut A, Grassly NC: Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic frees. Computer Applications in the Biosciences. 1997, 13 (3): 235-238.PubMedGoogle Scholar
- Ane C, Burleigh JG, McMahon MM, Sanderson MJ: Covarion structure in plastid genome evolution: A new statistical test. Molecular Biology and Evolution. 2005, 22 (4): 914-924. 10.1093/molbev/msi076.View ArticlePubMedGoogle Scholar
- Ronquist F, Huelsenbeck JP: MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003, 19 (12): 1572-1574. 10.1093/bioinformatics/btg180.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.