MC-Net: a method for the construction of phylogenetic networks based on the Monte-Carlo method

Background A phylogenetic network is a generalization of phylogenetic trees that allows the representation of conflicting signals or alternative evolutionary histories in a single diagram. There are several methods for constructing these networks. Some of these methods are based on distances among taxa. In practice, the methods which are based on distance perform faster in comparison with other methods. The Neighbor-Net (N-Net) is a distance-based method. The N-Net produces a circular ordering from a distance matrix, then constructs a collection of weighted splits using circular ordering. The SplitsTree which is a program using these weighted splits makes a phylogenetic network. In general, finding an optimal circular ordering is an NP-hard problem. The N-Net is a heuristic algorithm to find the optimal circular ordering which is based on neighbor-joining algorithm. Results In this paper, we present a heuristic algorithm to find an optimal circular ordering based on the Monte-Carlo method, called MC-Net algorithm. In order to show that MC-Net performs better than N-Net, we apply both algorithms on different data sets. Then we draw phylogenetic networks corresponding to outputs of these algorithms using SplitsTree and compare the results. Conclusions We find that the circular ordering produced by the MC-Net is closer to optimal circular ordering than the N-Net. Furthermore, the networks corresponding to outputs of MC-Net made by SplitsTree are simpler than N-Net.


Background
Phylogenetics is concerned with the construction and analysis of phylogenetic trees or networks to understand the evolution of species, populations, and individuals. Evolutionary processes such as hybridization between species, lateral transfer of genes, recombination within a population, and convergent evolution can all lead to evolutionary histories that are distinctly non-treelike. Moreover, even when the underlying evolution is treelike, the presence of conflicting or ambiguous signals can make a single tree representation inappropriate. In these situations, phylogenetic network methods can be particularly useful.
Phylogenetic network is a generalization of phylogenetic trees that can represent several trees simultaneously. For any network construction method, the conflicting signals should be represented in the network but it is vital that the network does not depict more conflict than is found in the data. At the same time, when the data fits well to a tree, the method should return a network that is close to a tree. Recently, in addition to biology, the phylogenetic networks methods are widely used for classifying different types of data such as those finding in linguistics, music, etc. There are many different methods to construct phylogenetic trees or networks which are based on distance matrix such as ME (minimum evolution) [1], LS (least squares) [2,3], NJ (neighbor-joining) [4], AddTree [5], N-Net (neighbor-net) [6] and Q-Net [7]. All these methods are called distance-based methods. ME is one of the most well-known methods. It was first introduced by Kidd and Sgamarella-Zonta [1]. Given a distance matrix, the ME principle consists of selecting the tree whose length (sum of its branch lengths) is minimal among all tree topologies for taxa.
Comparative studies of tree-building methods show that ME generally is an accurate criterion for selecting a true tree. Nei and Rzhetsky have shown that ME principle is statistically consistent when branch lengths are assigned by ordinary least-squares (OLS) fitting [8]. In the OLS framework, we simply minimize where δ ij is an estimation of input d ij and X is the set of taxa. In fact, the main goal is to find a tree whose induced metric is closer to d ij . The LS was first introduced in [2] and [3].
Nearly 20 years have passed by since the landmark paper in Molecular Biology and Evolution introducing NJ [4]. The method has become the most widely used method for building phylogenetic trees from distances. Steel and Gascuel showed that NJ is a greedy algorithm for ME principle [9]. The N-Net is a hybrid of NJ and split decomposition [10]. It is applicable to data sets containing hundreds of taxa. The N-Net is an algorithm for constructing phylogenetic networks.
Split decomposition, implemented in SplitsTree [11], decomposes the distance matrix into simple components based on weighted splits. These splits are then represented using a special type of phylogenetic network called split network. The N-Net works in a similar way: it first produces a circular ordering from distance matrix and then constructs a collection of weighted splits. Dan Levy and Lior Patcher showed that the N-Net is a greedy algorithm for the traveling salesman problem that minimizes the balanced length of the split system at every step and it is optimal for circular distance matrices [12]. Balanced minimum evolution (BME) is designed under the ME principle [13]. The BME is a special version of the ME principle where tree length is estimated by the weighted least squares [13].
In this work, we introduce MC-Net algorithm (Monte-Carlo Network algorithm) which works in a similar way: First, it finds a circular ordering for taxa, based on Monte-Carlo with simulated annealing, it then extracts splits from the circular ordering and uses non-negative least squares for weighting splits. We compare the results of the N-Net and the MC-Net for several data sets.

Preliminaries
A split of a given set X of taxa is a bipartition of the set X into two non-empty subsets of X. A split is called trivial if one of the two subsets contains only one taxon. Let T be a non-empty tree. Let the leaves of the T are labeled by the set of taxa, X ={x 1 ,...,x n }. Every edge e of T defines a split S = A|B, where A and B are two sets of taxa contained in the two components of T -e. For example, Figure 1 shows an eight-leaf tree. Removing the edge e from the tree produces two sets of leaves In an edge-weighted tree, the weight of each edge is assigned to its corresponding split. The Phyletic distance between any two taxa x and y in an edge-weighted tree is the sum of the weights of the edges along the path from x to y. Hence, the phyletic distance between x and y equals the sum of split weights for all those splits in which x and y belong to separate components.
Two different splits S 1 = A 1 |B 1 , and S 2 = A 2 |B 2 , are compatible, if one of the following conditions holds: , .
A collection of splits is called compatible, if all possible pairing of splits are compatible. A compatible collection of splits is represented by a phylogenetic tree [14,15]. Dress and Huson introduced SplitsTree to display more complex evolutionary patterns [16]. For a set of incompatible splits, SplitsTree outputs the split network using bands of parallel edges.
Circular collection of splits is a mathematical generalization of compatible collections of splits. Formally, a collection of splits of X is circular if there exists an ordering x 1 ,...,x n of X such that every split is of the form {x i , x i+1 ,...,x j }|X -{x i ,...,x j } for some i and j, 1 ≤ i ≤ j ≤ n. A Compatible collection of splits are always circular [10]. On the other hand, the class of circular collection of splits contains the class of the collection of splits corresponding to a tree. Andreas Dress and Daniel Huson proved that circular collections of splits always have a planar splits graph representation [16]. A distance matrix is circular (also called Kalmanson) if it is the phyletic distances for a circular collection of splits with positive weights. Because compatible splits are circular, treelike distances are circular too [6].
As mentioned above, the ME principle consists of selecting a tree whose length is minimal. In fact, the ME principle is equivalent to finding a circular ordering s = {x s(1) ,...,x s(n) } in order to find the minimum of the function h Where Σ is the set of all circular orderings of taxa x 1 ,...,x n . We call function h the energy function, and any circular ordering that minimizes h is called the optimal circular ordering.

Methods
There are a number of different methods for constructing various kinds of phylogenetic networks. A phylogenetic network can be constructed from a collection of weighted splits. N-Net uses circular ordering to construct a collection of weighted splits. Since finding an optimal circular ordering is an NP-hard problem, so we introduce a heuristic algorithm based on the Monte-Carlo method to find optimal circular ordering. The MC-Net seeks to find an optimal circular ordering from the distance matrix and then extracts a collection of weighted splits based on that ordering.

Algorithms
In this section, a new algorithm called the MC-Net, is presented to construct a set of weighted splits for taxa set X = {x 1 ,...,x n }with a given distance matrix. The MC-Net consists of two steps. In the first step, we find a circular ordering. In the second step, the splits which are obtained from the circular ordering are weighted. The core of the first step contains two procedures, namely, INITIAL and the Monte-Carlo. The INITIAL is a greedy algorithm to obtain a circular ordering, namely, the initial circular ordering. The INITIAL works in the following way: Suppose x s(1) ,...,x s(k) are ordered and let x be an is considered. This process stops when all taxa are ordered. The second procedure, or the Monte-Carlo procedure, relies on random iteration to find the optimal circular ordering. The Monte-Carlo algorithm starts its movement from the initial circular ordering, s 0 For each circular ordering s, we define the neighborhood of s, N (s), by: where Σ is the set of all circular orderings. We choose s 1 N (s 0 ) randomly. if h (s 1 ) ≤ h (s 0 ), then the system moves into ordering s 1 . However we allow non-greedy movements for the system in order to avoid having the system trapped in local minima. In other words, if h (s 1 ) >h (s 0 ), then the system moves into ordering σ 1 with a small probability e For each temperature, these movements are carried out t times. To compute the minimum energy we allow this process to continue until the temperature drops to zero (see the appendix for more details). Pseudo code of the Monte-Carlo algorithm is shown in Table 1. It is notticeable that the second procedure can start from any circular ordering other than the one obtained by the INITIAL procedure.
In the final step, we use the least squares algorithm to weight the splits of obtained circular ordering. Let A be the matrix with rows indexed by pairs of taxa and columns indexed by splits. Then for each pair of texa i and j and for each split k, A ij,k is defined by: if and are on opposite of split otherwise.

⎩ ⎩
The matrix A = [A ij,k ] is full rank [17]. Let d = (d 12 , d 13 ,...,d (n-1)n ) be an n(n -1)/2 dimensional vector corresponding to rows of A where d ij is obtained by distance matrix. Let b be the weight vector of splits, then the phyletic distance vector is p = Ab.
Return s and h(s) If we discard splits with negative weights and leave the remaining splits unchanged, the weight of the remaining splits are often grossly overestimated. Similar to the N-Net algorithm, we compute the optimal least square estimates with a non-negative constraint. In this paper, we use the FNNLS algorithm [18].

Results and Discussion
In this section, we compare the results of the MC-Net and the N-Net on some data sets. We use SplitsTree4 program [19] for drawing phylogenetic networks. Due to the limitation of space, we insert only six figures in this article.

Data sets
One of the data sets, a collection of 110 Salmonella

Optimal threshold for cooling coefficient and T low
There are two parameters, T low and cooling coefficient, in the Monte-Carlo procedure. We first adjust T low between 10 5 and 0.2 to obtain the best cooling coefficient. The value of energy function and running time of algorithm for each T low for JSA data are given in Figure  2 (for the other data sets, the figures are the same as JSA). According to Figure 2, when cooling coefficient is 0.95, running time of the algorithm compared to other coefficients increases considerably. On the other hand, the value of energy function for 0.95 or 0.9 as a cooling coefficient is significantly better than the other cooling coefficients. Hence, we conclude that the best value of energy function with respect to running time of the algorithm is achieved when cooling coefficient is 0.9 and T low < 10 -3 .

Results
The initial test for performance of our method is done by calculating the value of energy function for circular orderings obtained by the MC-Net and the N-Net ( Table 2). The first two rows of Table 2 show that in all   data sets except Salmonella, the value of energy function for the MC-Net is less than those obtained from the N-Net. The interesting feature of the MC-Net algorithm is in finding different circular orderings by changing initial ordering. So, the MC-Net algorithm could take the circular ordering obtained by the N-Net as initial ordering. The third row of Table 2 shows the values of energy function for circular orderings achieved by the MC-Net with the circular ordering obtained by the N-Net as an initial ordering. For four data sets, Its, Rubber, Salmonella, Myosin, the third row indicates better results than the first row. But for the other data sets, the conclusions mentioned above are the vice versa. Another test for the performance of our method is comparing the number of splits obtained by both the algorithms. In Table 3, the number of splits of circular orderings obtained by the MC-Net and the N-Net on different data sets are shown. In all data sets the number of splits obtained by the MC-Net is less than the N-Net except Primates. In this case, these two numbers are equal.
Let d be the input distance vector and P and P' are the phyletic distance vector of weighted splits obtained by the MC-Net and the N-Net, respectively. In Table 4, the value of norm of P -d and P' -d for each data set are shown. The norm of P -d is less than P' -d in all data sets even in Primates. It means that the results of the MC-Net algorithm give better approximation for input distance vector.
To illustrate difference between two algorithms, we present some examples of networks obtained by both the MC-Net and the N-Net using SplitsTree4 ( Figures  3,4,5,6, 7 and 8). It is obvious that both algorithms give the same classification of taxa and exhibit the same major splits. For example, in Figures 5 and 6, we highlight some edges such that by removing the same-colored edges, the same clustering of taxa is obtained. But according to what we see in Tables 3  and 4, split networks obtained by the MC-Net are less complicated than split networks obtained by the N-Net. It means that the networks obtained by the MC-Net have less noise than the networks obtained by the N-Net. According to Corollary 1 (see Appendix), when t approaches to 1, the MC-Net finds optimal circular ordering with the probability 1. We examined our algorithm on several treelike distance matrices and it returned corresponding trees quickly. The MC-Net has been implemented in Matlab and is available for download at http://bioinf.cs.ipm.ac.ir/softwares/ mc.net.

Conclusions
In this work, we propose an algorithm, MC-Net, which is a distance based method for constructing phylogenetic networks. The MC-Net scales well and can quickly produce detailed and informative networks for large number of taxa. We compare the performance of the

Appendix
Let S = {E 1 ,...,E s } be a finite set of states, and consider a physical process having these discrete states at time t. A Markov chain is a stochastic model of this system, such that the state of system at time t + 1 depends only on the state of system at time t. Consider X 0 , X 1 ,..., be a collection of Markov random variables, such that X n is the state of the system at time n. Let p ij be the probability that the system enters into the state E j from the state E i , where i, j {1,...,s} The matrix P = (p ij ) 1≤i, j≤s is called transition matrix. A probability distribution q = (q 1 ,...,q s ) such that q i is the probability that system starts its movement from the state E i , is called initial probability distribution. A Markov chain In other words, the Markov chain is irreducible, if there exist n such that the probability that the system enters into the state E j from the state E i after n times is positive. The irreducible Markov chain is called aperiodic, if for some n ≥ 0 and some state E j , The probability distribution is π is called stationary probability of the Markov chain.
It means that if P is the transition matrix and P (t) is the t th power of P, when t ∞ the j th column of transition matrix is approximately equal to π j . In the Monte-Carlo algorithm, a special kind of Markov chain is used.
Let Σ be the finite set of states and q = … ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ the initial probability distribution. For each state i the neighborhood of i, N(i), is defined as the set of all the states that are reachable from i by one movement. In this system the set of neighborhoods have to satisfy the following properties: where T is a positive constant number (constant temperature). The third property of the neighborhood shows that this Markov chain is irreducible. Also, if P ii T > 0 and P T contains non-negative entries then ( ) ( ) P T ii t > 0 for all t ≥ 0. So, it is a finite, aperiodic and irreducible Markov chain. The theorem 1 shows that for each constant temperature T and i Σ, there exists a stationary probability distribution  i T such that:  The corollary 1 illustrates that by cooling temperature (T 0 + ), system enters into one of the states of h 0 with the probability 1 after t (t ∞) time. In this article, we define the set of all circular orderings of taxa as the finite set of states. Our definition of neighborhood in the MC-Net satisfies in three properties of neighborhood and every elements of h 0 is an optimal circular ordering. Therefore, the MC-Net yields a circular ordering with approximately minimal energy function.