Birth and death of protein domains: A simple model of evolution explains power law behavior

Background Power distributions appear in numerous biological, physical and other contexts, which appear to be fundamentally different. In biology, power laws have been claimed to describe the distributions of the connections of enzymes and metabolites in metabolic networks, the number of interactions partners of a given protein, the number of members in paralogous families, and other quantities. In network analysis, power laws imply evolution of the network with preferential attachment, i.e. a greater likelihood of nodes being added to pre-existing hubs. Exploration of different types of evolutionary models in an attempt to determine which of them lead to power law distributions has the potential of revealing non-trivial aspects of genome evolution. Results A simple model of evolution of the domain composition of proteomes was developed, with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from non-coding or non-globular sequences or acquisition via horizontal gene transfer). This formalism can be described as a birth, death and innovation model (BDIM). The formulas for equilibrium frequencies of domain families of different size and the total number of families at equilibrium are derived for a general BDIM. All asymptotics of equilibrium frequencies of domain families possible for the given type of models are found and their appearance depending on model parameters is investigated. It is proved that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order. It is further proved that any power asymptotic with the degree not equal to -1 can appear only if the hypothesis of independence of the duplication/deletion rates on the size of a domain family is rejected. Specific cases of BDIMs, namely simple, linear, polynomial and rational models, are considered in details and the distributions of the equilibrium frequencies of domain families of different size are determined for each case. We apply the BDIM formalism to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show an excellent fit between these empirical data and a particular form of the model, the second-order balanced linear BDIM. Calculation of the parameters of these models suggests surprisingly high innovation rates, comparable to the total domain birth (duplication) and elimination rates, particularly for prokaryotic genomes. Conclusions We show that a straightforward model of genome evolution, which does not explicitly include selection, is sufficient to explain the observed distributions of domain family sizes, in which power laws appear as asymptotic. However, for the model to be compatible with the data, there has to be a precise balance between domain birth, death and innovation rates, and this is likely to be maintained by selection. The developed approach is oriented at a mathematical description of evolution of domain composition of proteomes, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.

parameters of these models suggests surprisingly high innovation rates, comparable to the total domain birth (duplication) and elimination rates, particularly for prokaryotic genomes.

Conclusions:
We show that a straightforward model of genome evolution, which does not explicitly include selection, is sufficient to explain the observed distributions of domain family sizes, in which power laws appear as asymptotic. However, for the model to be compatible with the data, there has to be a precise balance between domain birth, death and innovation rates, and this is likely to be maintained by selection. The developed approach is oriented at a mathematical description of evolution of domain composition of proteomes, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.

Background
Sequencing of numerous genomes from all walks of life, including multiple representatives of diverse lineages of bacteria, archaea and eukaryotes, creates unprecedented opportunities for comparative-genomic studies [1][2][3]. One of the mainstream approaches of genomics is comparative analysis of protein or domain composition of predicted proteomes [2,4,5]. These studies often concentrate on domains rather than entire proteins because many proteins have variable multidomain architectures, particularly in complex eukaryotes (throughout this work, we use the term domain to designate a distinct evolutionary unit of proteins, which can occur either in the standalone form or as part of multidomain architectures; often but not necessarily, such a unit corresponds to a structural domain). As soon as genome sequences of bacteria became available, it has been shown that a substantial fraction of the genome of each species, from approximately one third in bacteria with very small genomes, to a significant majority in species with larger genomes, consists of families of paralogs, genes that evolved via gene duplication at different stages of evolution [6][7][8][9]. Again, a comprehensive analysis of paralogous relationships between genes is probably best performed at the level of individual protein domains, first, because many proteins share only a subset of common domains, and second, because domains can be conveniently and with a reasonable accuracy detected using available collections of domain-specific sequence profiles [10][11][12]. Comparisons of domain repertoires revealed both substantial similarities between different species, particularly with respect to the relative abundance of house-keeping domains, and major differences [4,5]. The most notable manifestation of such differences is lineage-specific expansion of protein/domain families, which probably points to unique adaptations [13,14]. Furthermore, it has been demonstrated that more complex organisms, e.g. vertebrates, have a greater variety of domains and, in general, more complex domain architectures of proteins than simpler life forms [1,2].
Lineage-specific expansions and gene loss events detected as the result of comparative analysis of the domain compositions of different proteomes have been examined mostly at a qualitative level, in terms of the underlying biological phenomena, such as adaptation associated with expansion or coordinated loss of functionally linked sets of genes [15]. A complementary approach involves quantitative comparative analysis of the frequency distributions of proteins or domains in different proteomes. Several studies pointed out that these distributions appeared to fit the power law: P(i) Ϸ ci -γ where P(i) is the frequency of domain families including exactly i members, c is a normalization constant and γ is a parameter, which typically assumes values between 1 and 3 [16][17][18][19]. Obviously, in double-logarithmic coordinates, the plot of P as a function of i is a straight line with a negative slope. Power laws appear in numerous biological, physical and other contexts, which seem to be fundamentally different, e.g. distribution of the number of links between documents in the Internet, the population of towns or the number of species that become extinct within a year. The famous Pareto law in economics describing the distribution of people by their income and the Zipf law in linguistics describing the frequency distribution of words in texts belong in the same category [20][21][22][23][24][25][26][27][28][29]. Recent studies suggested that power laws apply to the distributions of a remarkably wide range of genome-associated quantities, including the number of transcripts per gene, the number of interactions per protein, the number of genes or pseudogenes in paralogous families and others [30].
Power law distributions are scale-free, i.e. the shape of the distribution remains the same regardless of scaling of the analyzed variable. In particular, scale-free behavior has been described for networks of diverse nature, e.g. the metabolic pathways of an organism or infectious contacts during an epidemic spread [20,[25][26][27]. The principal pattern of network evolution that ensures the emergence of power distributions (and, accordingly, scale-free properties) is preferential attachment, whereby the probability of a node acquiring a new connection increases with the number of connections this node already has.
However, a recent thorough study suggested that many biological quantities claimed to follow power laws, in fact, are better described by the so-called generalized Pareto function: P(i) = c(i+a) -γ where a is an additional parameter [31]. Obviously, although at i >> a, a generalized Pareto distribution becomes indistinguishable from a power law, at small i, it deviates significantly, the magnitude of the deviation depending on the value of a. Furthermore, unlike power law distributions, generalized Pareto distributions do not show scale-free properties.
The importance of the analysis of frequency distributions of domains or proteins lies in the fact that distinct forms of such distributions can be linked to specific models of evolution. Therefore, by exploring the distributions, inferences potentially can be made on the mode and parameters of genome evolution. For this purpose, the connections between domain frequency distributions and evolutionary models need to be explored theoretically within a maximally general class of models. In this work, we undertake such a mathematical analysis using simple models of evolution, which include duplication (birth), elimination (death) and de novo emergence (innovation) of domains as elementary processes (hereinafter BDIM, birth-death-innovation models). All asymptotics of equilibrium frequencies of domain families of different size possible for BDIM are identified and their dependence on the parameters of the model is investigated. In particular, analytical conditions on birth and death rates that produce power asymptotics are determined. We prove that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order, and that any power asymptotic with the degree not equal to -1 can appear only if the assumption of independence of the duplication/deletion rates on the size of a domain family is rejected. We apply the developed formalism to the analysis of the frequency distributions of domains in individual prokaryotic and eukaryotic genomes and show a good fit of these data to a particular version of the model, the second-order balanced linear BDIM.

Mathematical theory and model Fundamental definitions and assumptions
A genome is treated as a "bag" of coding sequence for protein domains, which we simply call domains for brevity. Domains are treated as independently evolving units disregarding the dependence between domains that tend to belong to the same multidomain protein. Each domain is considered to be a member of a family (including singlemember families). We consider three types of elementary evolutionary events: i) domain birth, which generates a new member within a family; the principal mechanism of birth is duplication with divergence but additional mechanisms may be considered, including acquisition of a family member from a different species via horizontal gene transfer [32], ii) domain death, which results from domain inactivation and/or deletion, and c) domain innovation, which generates a new family with one member. Innovation may occur via horizontal gene transfer from another species, via domain evolution from a non-coding sequence or a sequence of a non-globular protein, or via major change of a domain from a pre-existing family after a duplication, which makes the relationship between the given domain and its family of origin undetectable (this latter process formally combines domain birth, death and innovation in a single event). The innovation rate (ν), is considered constant for a given genome. The rates of elementary events are considered to be independent of time (i. e. only homogeneous models are considered) and of the nature (structure, biological function etc.) of individual families.
In a finite genome, the maximal number of domains in a family cannot exceed the total number of domains and, in reality, is probably much smaller; let N be the maximal possible number of domain family members. We consider classes of domain families, which have only one common feature, namely the number of members (Fig. 1). Let f i be the number of domain families in i-th class, i.e. families that are represented by exactly i domains in the given genome, i = 1,2,...N. Birth of a domain in a family of class i results in the relocation of this family from class i to class i+1 (decrease of f i and increase of f i+1 by 1). Conversely, death of a domain in a family of class i relocates the family to class i-1; death of a domain in class 1 results in the elimination of the corresponding family from the given genome, this being the only considered mechanism of family death. We consider time to be continuous and suppose it very unlikely that more than one elementary event occur during a short time interval; formally, the probability that more than one event occurs during an interval ∆t is o(∆t).
The formulation of the model The simple BDIM Let us formulate the following independence assumption: i) all elementary events are independent of each other; ii) the rates of individual domain birth (λ) and death (δ) do not depend on i (number of domains in a family). Under this assumption, the instantaneous rate, at which a domain family leaves class i, is proportional to i and the following simple BDIM describes the evolution of such a system of domain family classes: Similar models have been considered previously in several different contexts [33 v Let F(t)= f i (t) be the total number of domain families at instant t; it follows from (2.2) that The system (2.2) has an equilibrium solution f 1 ,...f N defined by the equality df i (t)/dt = 0 for all i; this solution is described below under Proposition 1. Accordingly, there exists an equilibrium solution of equation (2.3), which we will designate F eq (the total number of domain families at equilibrium). At equilibrium, ν = δ 1 f 1 , i.e. the processes of innovation and death of single domains (more precisely, the death of domain families of class 1, i.e. singletons) are balanced.
We can rewrite the model (2.2) in terms of the frequency of a domain family of class Applying this identity to p i (t) and rewriting equation (2.3) in the form we obtain the following model for frequencies of the domain family (master BDIM for frequencies), which is equivalent to (2.2): dp 1 (t)/dt = -(λ 1 + δ 1 )p 1 (t) + δ 2 p 2 (t) + ν/F(t) -(ν/F(t)δ 1 p 1 (t))p 1 (t), (2.4) System (2.4) should be solved together with equation

The Master BDIM and Markov processes
Let us note that system (2.4) for frequencies is non-linear, so it is not a system of Kolmogorov equations for state probabilities of any homogeneous Markov process. Let us further suppose that a genome had ample time to arrive at an equilibrium with respect to the total number of domain families, such that F(t) = F eq . This does not imply dp i (t)/dt = 0 or df i (t)/dt = 0; in other words, the system might rearrange the frequencies of individual families, although the total number of families remains stable. If F(t) = F eq , the master system (2.4) turns into

Figure 1
Domain dynamics and elementary evolutionary events under BDIM. d p 1 (t)/dt = -(λ 1 + δ 1 ) p 1 (t) + δ 2 p 2 (t) + ν/F eq (2.5) System (2.5) can be rewritten as a matrix equation where p(t) = {p 1 (t),...p N (t)} and the matrix Q = (q ij ) is defined by equalities q 11 = -(λ 1 + δ 1 ) + ν/F eq , q 21 = δ 2 + ν/F eq , q s1 = ν/F eq for all s > 2; It is easy to see that the sum of elements of each row (except for the first one) of the matrix Q is equal to ν/F eq > 0. Therefore the matrix Q cannot be a matrix of transition rates for any Markov process (the sum of elements of each row of a matrix of transition rates for Markov process with continuous time should be non-positive [33 v Remark. If, in system (2.5), ν = 0, then this system turns into a system of state probabilities for a Markov birth and death process with continuous time.

Equilibrium in BDIMs Equilibrium sizes and frequencies of the domain family system
Let us suppose that the genome had ample time to arrive at a complete equilibrium state, in which not only dF(t)/ dt = 0, but also df i (t)/dt = 0 for all i. Thus, the equilibrium sizes of domain families f i satisfy the system It should be emphasized that the master model does not assign a priori the value of F eq ; this value has to be computed depending on the model parameters.
The following statement is central for further analysis.
The unique equilibrium state (3.2) is globally asymptotically stable.
In addition (formally assuming λ j = 1 for i = 1), This proposition ascertains that all evolutionary trajectories of the system (2.2) exponentially (with respect to time) approach the equilibrium state (3.2). The proof is given in the Mathematical Appendix.
Remark. Let us denote the ratio of the birth rate to the innovation rate G(N) ≡ λ i f i /ν, and the ratio of the death rate to the innovation rate Then, according to Proposition 1, for any BDIM in equilibrium, The principal goal of the treatment that follows is the analysis of the asymptotic behavior of equilibrium frequencies and sizes of domain families (f 1 ,...f N ) at large N. We will differentiate two cases of asymptotic behavior according to the following Definition. Let {q i }, {s i } be sequences of real numbers; let us denote q i s i if lim q i /s i = 1 and q i ~ s i if lim q i /s i = c = const and 0<c<∞. We will also use this notation for finite but sufficiently long sequences.
Equilibrium frequencies for the simple BDIM Let us apply Proposition 1 to the simple BDIM (2.1) with λ i = λi, δ i = δi.

Definition.
A simple BDIM is balanced if θ = λ/δ = 1, i.e. if the rates of individual domain birth and death are equal.
Let us recall that a random discrete variable ξ has the logarithmic distribution with parameter θ < 1 if A random variable ξ has the truncated logarithmic distribution with parameter θ if

7)
and for all i = 1,2,...N The proof is given in the Mathematical Appendix.
Thus, a simple BDIM can have equilibrium frequencies only of the form p i = Cθ i /i, where C = const and θ is the distribution parameter. In particular, the equilibrium frequencies for a balanced simple BDIM have the power distribution with the degree equal to -1.
Simple methods exist for preliminary graphical estimation of the single distribution parameter θ [36 ch. 7, s. 7]. We will prove in the following section that, if we observe a power asymptotic for empirically observed equilibrium frequencies, then (assuming that the system can be described by a BDIM), the rates λ i and δ i should be asymptotically equal at large i. If, additionally, the degree of the asymptotic is not equal to -1, then the system dynamics cannot be described by a simple BDIM. In this case, it is necessary to consider more general models, such as the Master BDIM (2.2).

Asymptotic behavior of equilibrium frequencies of a Master BDIM: Main Theorems
Let us consider the master BDIM (2.2); we showed in 3.1 that its equilibrium frequencies are the solution of the system -(λ 1 + δ 1 )p 1 + δ 2 p 2 + ν/F eq = 0, (3.9) The following theorem gives all possible types of asymptotic behavior of the equilibrium frequencies and defines the connections between these asymptotics depending on model parameters. In particular, if there is no information on the exact form of dependence of the rates of birth and death of domains on the size of a domain family, the theorem can be used to qualitatively describe the dynamics of the asymptotic behavior of the equilibrium frequencies.
We will prove that the asymptotic behavior of a solution of system (3.9) is completely defined by the asymptotic relation between λ i and δ i . More precisely, let us define a function χ (i)= λ i-1 /δ i ; we consider only functions of power growth, i.e. χ (i) ~ i s at i→∞ for a real s. We will see that this is not a serious restriction because the most realistic situations correspond to the case of s = 0. So, let us suppose that, for large i, the following expansion is valid: where s, a are real numbers and θ > 0. Evidently, if s ≠ 0, χ (i) tends either to 0 (s < 0) or to ∞ (s > 0) with the increase of i.
We will show that the first three coefficients, s, θ and a, of asymptotic expansion (3.10) for χ (i) = λ i-1 /δ I exactly specify all possible asymptotic behaviors of BDIM equilibrium frequencies.

Theorem 1. The equilibrium frequencies p i of BDIM (2.2) have the following asymptotics
i. if the model is non-balanced, then ii. if the model is first-order balanced, then iii. if the model is second-order balanced, then iv. if the model is high-order balanced, then The proof is given in the Mathematical Appendix. The classification of BDIM according to the order of balance is illustrated in Fig. 2 and the asymptotics for different types of BDIMs are shown in Fig. 3.
It follows from this theorem that, if a BDIM is non-balanced, then its equilibrium frequencies p i (and equilibrium family sizes f i ) increase or decrease extremely fast (hyper-exponentially) with the increase of i. In contrast, if a BDIM has a non-zero order of balance, asymptotic behavior is observed.
Let us recall that a random discrete variable ξ has the Pascal (or negative binomial) distribution with parameters (r,q), [36]. We will say that sequence {p i } follows (or asymptotically has) a discrete probabilistic distribution {q i } if p i ~ q i for large enough i.

Figure 2
Different orders of balance in BDIMs.

Figure 3
Asymptotics of equilibrium distributions for balanced BDIMs of different orders.
The following implication of Theorem 1 is of principal interest.

Corollary 2. Equilibrium frequencies of a BDIM have a power asymptotic behavior if and only if the BDIM is second-order balanced.
Corollary 3. For high-order balanced BDIM, if λ i-1 /δ i = 1 for all i, the only possible distribution of equilibrium frequencies is uniform, p i = const for all i. Moreover, even if λ i-1 /δ i = 1 + O(1/i 2 ), the equilibrium frequencies asymptotically tend to the uniform distribution.

Rational BDIM
Rational models comprise a general class of BDIM (Fig. 4), for which the asymptotic behavior of the equilibrium frequencies and equilibrium sizes of domain families can be completely investigated.
Let us suppose that the birth and death rates are of the form where λ, δ are positive constants, α k , β k are real and a k , b k are non-negative for all k = 1,...N.
It is known that a wide class of mathematical functions can be well approximated by rational functions of the form (4.1) (see, e.g. [37]).
Specific cases of the rational BDIM are simple BDIM with where a 1 , b 1 are constants, and polynomial BDIM, if P(i) and Q(i) are polynomials on i.
The following theorem describes all possible asymptotic behaviors of the equilibrium frequencies of a rational BDIM. Let us denote

Theorem 2. The equilibrium sizes of domain families of a rational BDIM have the following asymptotics
where the constant The proof is given in the Mathematical Appendix.
Corollary 1. If η = 0, then the rational BDIM is first-order balanced and the sequence of equilibrium numbers of domain families {f i } has a power-exponential asymptotics

Figure 4
The hierarchy of BDIM types.
Formula (4.4) gives the asymptotics for the equilibrium sizes of domain families f i and, accordingly, for the total number of families F eq . The exact expressions for these quantities are given in the proofs of Theorem 2 and Lemma (see Mathematical Appendix).

Proposition 3.
i. The equilibrium sizes of domain families f i of a balanced (first or higher order) rational BDIM are ii. The total number of domain families at equilibrium is For the rational, second-order balanced BDIM, the ratio of the birth rate to the innovation rate is The asymptotic formulas for equilibrium frequencies of rational BDIM could be considered as particular cases of the corresponding formulas of general theorem 1. Proposition 3 allows one to calculate the constants in the corresponding asymptotic formulas for the sizes of domain families for a rational BDIM. If only equilibrium frequencies are analyzed, the values of these constants become irrelevant because they contract. However, if the actual values of f i and F eq are of interest, the values of the constants are required.
Properties of the main types of rational BDIM Simple BDIM As shown above, a simple BDIM can have equilibrium frequencies only of the form p i = Cθ i /i, C = const;in particular, if the distribution parameter θ < 1, we get the (truncated) logarithmic distribution. Logarithmic distributions are seen in many biological contexts, e.g., the distribution of species by the number of individuals in populations or, what is more relevant, the distribution of protein folds by the number of families per fold [38]. Thus, a simple BDIM could be potentially used for modeling the dynamics of biological systems with a logarithmic distribution of equilibrium densities. We examine this possibility in greater detail starting with the case λ = δ (second-order balanced simple BDIM).
We can extract from Proposition 2 some additional information, which could be helpful for estimating the model parameters. It is known that More precisely, the approximation has an error less then 10 -6 for N > 10. Thus, from (3.7), we obtain an interesting formula This means that, in the equilibrium state of the system, the total number of domain families grows only slowly Formula (5.1) can be used for estimating the model parameters on the basis of empirical data.
In the more general case λ ≠ δ, we can also obtain an estimate of the rate of innovation ν. If λ < δ (θ < 1), then the series in the right part of (3.5) quickly converges, so -ln(1-θ)/θ is a good approximation for the sum Then

we have a relation
F eq /f 1 -ln(1-θ)/θ, (5.4) which allows the parameter θ to be estimated on the basis of empirical data.
If N can be estimated independently and is not very large, we can use more exact relations: where the function .
Further, if (1-θ)N is small (i.e., θ is very close to 1), then the approximation has an error less then [N(1-θ)] 2 /4 and, in this case, If (1 -θ)N is large, then the following inequalities provide simple bounds for F eq /f 1 = θ i-1 /i: For the simple BDIM, the ratio of the rate of duplications to the innovation rate is If the simple BDIM is the 2 nd order balanced, θ = 1, then Thus, for the simple, second-order balanced BDIM, the number of duplications per time unit is N-1 times greater than the number of innovations.
The total number of domains in the equilibrium state for the simple BDIM is If a simple BDIM is second-order balanced, then G(N) = ν/λ N.

Linear BDIM
We saw that the assumption of independence of birth and death rates of individual domains on each other and on the size of domain families is incompatible with any power distribution of the equilibrium frequencies with the degree not equal to -1. The simplest case of a BDIM, which can have, depending on the parameters, three types of asymptotic behavior described by Theorem 1 (excluding the first one, hyper-exponential, which corresponds to a nonbalanced BDIM; all linear BDIMs are balanced) and, in particular, any power asymptotics, is a model with linear birth and death rates of the form: , where a and b are constants. More precisely, according to (5.7), the average birth rate per domain in a family of size i is λ i /i = λ + λa/i. So, for small i, the average birth rate is close to λ + λa, whereas, for large i, it tends to λ. Similarly, the average death rate changes from δ + δb in a small family to the limit value δ in a large family. Thus, if a and b are positive (which seems to be the case for the available data; see below), both the birth rate and the death rate per domain decrease with the increase of the class number (size of the respective domain families); conversely, if a and b are negative, these rates increase with the class number (Fig. 5).
Corollary 1 of Theorem 2 implies that equilibrium frequencies p i of a linear BDIM have asymptotics In particular, if λ ≠ δ and a = b, the linear BDIM is first-order balanced and the equilibrium frequencies p i follow the logarithmic distribution (in this case, the linear BDIM is asymptotically equivalent to the simple BDIM). If λ = δ, the linear BDIM is second-order balanced and the equilibrium frequencies p i follow the power distribution Thus, the dependence of the domain frequency on the family size is actually determined by the difference a -b. If a > b, the birth rate decreases faster than the death rate with the increase of family size, i. e. there seems to be a "competition" between domains in a family; in contrast, if a <b, the death rate drops faster, i.e. a "synergy" between domains appears to exist (Fig. 4).
More detailed information can be obtained using Proposition 4: i) for a first-order balanced linear BDIM, the equilibrium sizes f i of domain families are and the total number of domain families at equilibrium is ii) for a second-order balanced linear BDIM (θ = 1), and According to (2.3), in the equilibrium state of a linear BDIM, f 1 = ν/δ 1 = ν/(δ(1 + b)) and so, for a second-order balanced linear BDIM, we have the formula Suppose that equilibrium frequencies obtained from empirical data follow the power distribution p i ~ i -γ ; in this case, -γ is the slope of the empirical curve (lnf i versus lni) and can be estimated from the data. Assuming that the system is well described by a linear BDIM, it follows from (5.9) that a -b = 1 -γ and λ = δ. Thus, where a is the single free parameter.
For the linear second-order balanced BDIM, the ratio of the birth rate to the innovation rate is The case 1 + a -b = 0 (slope of the asymptote in double logarithmic coordinates equal to a -b -1 = -2) is a critical one.
In this case,

Accordingly, G(N)→∞ at N→∞.
The total number of domains in the equilibrium state for a second-order balanced linear BDIM is If the slope of the asymptote γ = -1, the linear second-ordered BDIM shows the same asymptotic behavior as a simple BDIM (2.1), but behaves differently at small i. If γ ≠ -1, the system cannot be described by a simple BDIM even asymptotically, but can be described by a linear BDIM. As indicated above, in this case, the average per-domain birth and death rates depend on the size of the domain family and the difference a-b characterizes this dependence.

Polynomial BDIMs
The quadratic models take into account the dependence of birth and death rates of individual domains on the simplest, pairwise interactions. If interactions of higher orders are postulated, λ i ~ P n (i) and/or δ i ~ Q m (i), where P n (i), Q m (i) are polynomials on i of the n-th and m-th degrees. Again, if the degrees n and m are different, then the BDIM is non-balanced and equilibrium frequencies have hyperexponential asymptotics. Thus, let n = m, where r k , q k are constants and r 0 = q 0 = 1. We suppose, of course, that R(i), Q(i) are positive for all integer i. Note that, in this case, χ (i) ≡ λ i-1 /δ i = θ (1+(r 1 -q 1 -m)/i+O(1/ i 2 )), where θ = λ/δ. We will suppose that θ ≤ 1.
According to Theorem 3, the polynomial BDIM with rates (5.21) has equilibrium sizes of domain families with power-exponential asymptotics where ρ = r 1 -q 1 .
In particular, if ρ -m > -1, the equilibrium frequencies p i follow the Pascal distribution with parameters (ρ -m + 1, θ); if ρ -m = -1, the equilibrium frequencies p i follow the (truncated) logarithmic distribution; if ρ -m = 0, the equilibrium frequencies p i follow the geometric distribution; if λ = δ, the polynomial BDIM is second-order balanced and the equilibrium frequencies p i follow the power distribution Note that the degree of the power distribution (5.23) depends only on m, the common degree of the polynomials (5.21), and on ρ, the difference between the coefficients r 1 and q 1 , and does not depend on other coefficients. In par-ticular, if r 1 = q 1 , then p i ~ i -m . This relation could be interpreted as follows: if the first two coefficients of polynomial rates λ i and δ i are equal, then the degree of the power distribution (5.19) is equal to the "order of interactions" of domains.
, and the equilibrium total number of domain families For the polynomial model f 1 = ν/δ 1 = ν/(δ q k ), so This formula can be used for estimating the model parameters.
For the polynomial second-order balanced BDIM, the ratio of the death rate to the innovation rate is

Approximation of the observed domain family size distributions in prokaryotic and eukaryotic genomes with different BDIMs
Having developed the mathematical theory of BDIMs, we sought to determine which of these models, if any, adequately described the empirical data on domain family size distribution. To identify the domain sets of domains encoded in each of the genomes, the CDD library of position-specific scoring matrices (PSSMs), which includes the domains from the Pfam and SMART databases, was used in RPS-BLAST searches [12] against the protein sequences from a set of completely sequenced eukaryotic and prokaryotic genomes [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome]. The CDD domain library is partially redundant, so when the results obtained from individual PSSMs showed significant overlap (more than 50% of hits overlapped for more than 50% of their length), the corresponding domains were examined caseby-case for redundancy. PSSMs representing structurally similar domains and producing overlapping lists of hits were joined into "synonymy clusters".
The results of RPS-BLAST searches against the sets of protein sequences from individual genomes were interpreted as follows: non-overlapping hits to the same protein were treated independently; among overlapping hits, only the strongest one (lowest E-value) was recorded; all hits from a synonymy cluster were assigned to one representative domain family. The number of hits that a domain family had in a genome, with the cut-off of E = 0.001, was recorded as the number of domains of the given family encoded in the given genome. The CDD domain library certainly does not include all existing domains. In practice, domains from this collection were detected in >50% in each of the analyzed species, with the only exception of human, for which the analyzed protein set is likely to contain a substantial fraction of false predictions (Table 1).
To enable statistical analysis using the χ 2 -method for the entire range of the data, including the sparsely distributed classes corresponding to large families, the data needed to be combined. For each genome, the observed domain family frequencies were grouped into bins, each containing at least 10 families; typically, bins corresponding to families with small number of members included a single size class (e.g. all single-member families, two-member families etc), whereas bins corresponding to large families may span hundreds of size classes, most of them empty.
Theoretical distribution values for a bin combining observed data from m-th to n-th class were computed as   The simplest model that resulted in a good fit to the observed domain family size distributions was the secondorder balanced linear BDIM (Fig.   6, 7,8,9,10,11,12,13,14,15). For all analyzed genomes, P(χ 2 ) for this model was >0.05, i.e. no significant difference between the model predictions and the observed data was detected. Considering the first-order balanced linear BDIM, which involves varying the parameter θ, did not result in a significant improvement of fit for any of the analyzed genomes (data not shown). In contrast, a fit to a truncated logarithmic distribution (prediction of a simple BDIM) failed for all genomes (P(χ 2 ) < 10 -5 ; Fig. 16, 17, and data not shown). An exact power-law distribution, which is often used to approximate protein family frequency distributions, similarly failed to adequately fit the observed data, even when the most deviant class 1 fami- . Notably, second-order balanced linear BDIM results in a correct prediction of the number of very large families, whereas simple BDIM systematically underestimates the number of families in the highest bins. Conversely, the power-law fit underestimates the slope of the best-fit line (in double logarithmic coordinates) compared to the asymptote of the linear BDIM prediction and, accordingly, significantly overestimates the number of families in the highest bins (Fig. 16, 17). These results are compatible with the recent observation that the domain family size distributions are better described by the generalized Pareto distribution than by power laws [31].
Fitting the observed domain family size distribution with the second-order balanced linear BDIM resulted in positive values of the parameters a and b, with a <b, for all analyzed genomes (Table 1). Accordingly, domain family size distributions in all cases asymptotically tend to the power law with the power k < -1 and, for all species with the exception of C. elegans, k < -2 (Table 1 and Fig. 8). As discussed above, this seems to indicate the existence of "synergy" between domains in a family whereby the likelihood of survival is greater for a domain that belongs to  5). For all species, we find that the innovation rate is approximately three orders of magnitude greater than the per domain birth (death) rate. Accordingly, the total per genome birth (duplication) rate is comparable to but, typically, several times greater than the innovation rate (Table 1). The ratio of the per genome birth rate to the innovation rate increases with the number of genes in a genome or the number of detected domains, with nearly identical rates seen for small prokaryotic genomes and values as high as 20 for the largest plant and animal genomes (Table 1).
The data used to fit the BDIM typically included 50-60% of the proteins encoded in a given genome (Table 1); the remaining proteins were not represented by sufficiently similar domains in the current CDD collection. It cannot be ruled out that the fit would be significantly affected as a result of including all proteins encoded in the genome, in case the proteins currently not recognized in CDD searches have a family size distribution substantially different from that of the recognized ones. However, secondorder balanced linear BDIM can accommodate considerable perturbations of the distribution through adjustment of the parameters, so we believe that this model is likely to approximate well also the size distribution of domain nome. An alternative approach that at least partially circumvents the sampling problem involves analysis of all families of paralogs detectable using clustering by sequence similarity, with employing a predefined library of domains; this analysis is beyond the scope of the present work but may be a subject of further investigation.

General discussion and conclusions
Here, we presented a complete mathematical description of the size distribution of protein domain families encoded in genomes for simple but not unrealistic models of ev-olution, which include three types of events: domain duplication (birth), domain elimination (death), and domain innovation. In biological terms, innovation could involve gene acquisition via horizontal gene transfer, emergence of a new domain from a non-coding sequence or a non-globular protein sequence, or major modification of a domain obliterating its connection with a pre-existing family. Innovation via horizontal gene transfer appears to be particularly common in prokaryotes [32,39], which might account for the apparent higher relative innovation rate in prokaryotic genomes observed in the present analysis (Table 1).

Figure 12
Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the thermophilic euryarchaeon Methanothermobacter thermautotrophicus. The panels and the designations are as in Fig. 6 We showed that birth-death-innovation models (BDIMs) with different levels of complexity lead to readily distinguishable predictions regarding the distribution of domain family sizes in genomes. In particular, we defined the exact analytic conditions that lead, exactly or asymptotically, to power law distributions, which have recently received ample attention, as they were uncovered in various biological and social contexts [20,25]. In contrast to previous analyses [16,17,30] but in agreement with the results of a recent re-examination [31], we showed that the power law only asymptotically approximates the domain family size distributions.
Three groups of observations made in this work seem to have the greatest potential of enhancing our understanding of genome evolution and, perhaps, evolution of other complex systems. First, we proved that, within the BDIM framework, there is a unique equilibrium state of the system, which is approached exponentially, with respect to time, from any initial state. In this equilibrium state, the number of domain families in each size class remains constant and follows a unique distribution depending on the type and parameters of the BDIM. In particular, power asymptotics emerges when and only when a BDIM is second-order balanced, i.e. the rates of domain birth and death are asymptotically equal. Since we showed that the and distribution (f i ). Taking a broader biological perspective, this result might indicate that evolving lineages go through lengthy periods of relative stasis when the level of genomic complexity remains more or less the same. Under this view, the stasis epochs are punctuated by relatively short periods of dramatic changes when the complexity either greatly increases (the emergence of eukaryotes is the most obvious case in point) or decreases (e.g. evolution of parasites). These bursts of evolution might be described as transitions between different BDIMs in the parameter space, with some of the trajectories potentially involving non-balanced BDIMs. The analogy between this emerging picture of genome evolution and the punctuated equilibrium concept of species evolution, which has been developed through analysis of the paleontological record [40], is obvious.
Second, we showed that the simplest model that adequately describes the observed domain family size distributions is the second-order balanced linear BDIM; in contrast, simple BDIMs do not show a good fit to the observations. This has potentially important implications for the mode of domain family evolution. Simple BDIMs are based on the notion that the likelihood of duplication (birth) or elimination (death) of a domain is uniform across the genome and does not depend on the size or other characteristics of domain families (the independence assumption). Clearly, under the independence assumption, a duplication (birth) as well as elimination (death) of a domain is more likely to occur in a large family than in a small one, but only because the overall probability of such an event is proportional to the number of family members, whereas the birth (death) rate per domain remains the same. The key observation of this work, that the actual domain frequency distributions are well described by a linear but not by a simple BDIM, suggests that the independence assumption is an oversimplification. Instead, the linear BDIM includes parameters that describe the dependence of the per domain birth (death)   rate on the family size. The asymptotics of the theoretical distribution that is the best fit for the actual data is a power law, with the power equal to a-b-1, where a and b are the parameters of a linear BDIM. We observed that, for all analyzed genomes, a-b-1 < -1 (a <b), which corresponds to "synergy" between domains in a family. Both the domain birth rate and the death rate drop with the increase of the size of a domain family, but the death rate decreases faster (Fig. 5). In general terms, this suggests that small families are more dynamic during evolution than large families. In particular, under the BDIM formalism, innovation contributes only to single-member families (class 1), which have the greatest evolutionary mobility, and either quickly proliferate and are stabilized or perish. An implication of these observations is that, in general, large families are older than small ones. Exceptions to this generalization, i.e. the existence of small, ancient families, probably point to selection for a specific family size; for example, it seems likely that selection acts against proliferation of certain essential proteins, e.g. ribosomal proteins, which typically form single-member families [41]. Another pertinent observation is that the linear BDIM seems to adequately accommodate even the largest of the identified domain families. Lineage-specific expansion of paralogous families appears to be one of the principal modes of organismic adaptation during evolution [13,14,42]. Thus, quantitatively, adaptive family expansion appears to fit within the BDIM framework, although these models do not explicitly incorporate the notion of selection. Of course, for BDIMs, it is irrelevant which families expand, and this choice is determined by selection.
Third, the BDIM equilibrium condition with respect to the total number of domain families, ν = δ 1 f 1 (ν is the innovation rate, δ 1 is the domain death rate for class 1 families, and f 1 is the number of domain families in class 1) allows us to estimate the ratio between domain innovation rate and the domain death and birth rates. Indeed, according to the above and the definition of a second-order linear BDIM, which is the best fit for the actual data, λ = δ = ν/ f 1 (1+b). Since the number of domain families in class 1 (families with only one member) is in the hundreds for each genome, this translates into an innovation rate that is much greater than the duplication or elimination rate per domain (Table 1). Such high innovation rates might appear counter-intuitive, but let us note that the duplication rate over all domain families is a number that tends to be nearly identical to ν for small prokaryotic genomes and several-fold greater than ν for large eukaryotic genomes (Table 1). Thus, under the second-order balanced linear BDIM, the likelihood of appearance of a new domain in a genome is close to or several times less than the likelihood of a duplication or elimination of an existing domain. Nevertheless, the finding that the innovation rate is comparable to the overall duplication/elimination rate seems surprising. If the linear BDIM is indeed a realistic evolutionary model, this emphasizes the critical role of innovation in maintaining the balance (steady state) in genome evolution. In prokaryotes, innovation via horizontal gene transfer appears to be particularly extensive [32,39], which might underlie the greater relative innovation rate in these organisms (Table 1).
As already indicated, BDIMs do not explicitly incorporate selection. However, the present analysis shows that only models with precisely balanced domain birth, death and innovation rates can account for the observed distribution of domain family size in each of the analyzed genomes. It seems likely that the balance between these rates is itself a product of selection. There is little doubt that BDIMs will be eventually replaced by more sophisticated formalisms that will more realistically capture the mechanisms of genome evolution. Nevertheless, even the crude modeling described here seems to reveal several potentially interesting and non-trivial aspects of the evolutionary process.

Mathematical Appendix. Proofs of some statements
Proof of Proposition 1 When system (3.1) is solved consecutively from the last equation to the second one, it becomes obvious that the solution is unique up to a constant multiplier.
Next The following theorem (see [43]) gives the desired criterion: the real part of all the characteristic values of a real ma-   To apply this theorem, let us consider the n × n matrix, n ≤ N It is easy to see that det A n = -δ n det B n-1 -λ n-1 δ n det B n-2 .
Further, it is easy to see that for any n det B n = det A n -λ n det B n-1 .
Taking into account that B 1 = -(λ 1 + δ 1 ) < 0 and that the sign of det A n coincides with (-1) n , it is easy to prove that det J n > det A n if det A n > 0 and det J n < det A n if det A n < 0.
Thus, the sign of det B n coincides with the sign of det A n and so (-1) n B n > 0 for all n = 1,..N. According to the aforementioned theorem, the real parts of all the characteristic values of a real matrix A N are negative and so the single equilibrium is asymptotically stable, QED.
According to Proposition 1, Applying the main asymptotic property of Γ-function, i.e. Γ (i+c)/Γ(i)~i c at large i for any c, we have Γ (i+a+1)/ Γ (i+1) ~ i a , and so p i ~ Γ (i) s θ i i a .

Proofs of Corollaries 1-3
If a discrete random variable ξ has the Pascal distribution, then   and it becomes evident that, for a > -1, equilibrium frequencies p j of the first-order balanced BDIM follow the Pascal distribution with parameters (a+1,θ).
If a = -1, then p i ~ θ i /i and so p i follows the truncated logarithmic distribution with parameter θ. If a = 0, then p j θ i and p i follows the geometric distribution.
Further, p i ~ i a , that is the sequence p i follows the power distribution with the power a, if and only if θ = 1, that is, if the BDIM is second-order balanced.
Finally, if λ i-1 /δ i = 1 + O(1/i 2 ), that is, if θ = 1 and a = 0, then p i ~ const; in particular, if λ i-1 = δ i for all i, then, according to Proposition 1, f i = ν for all i and p i = 1/N.

Contributions of individual authors
GPK developed most of the mathematical formalism and wrote the draft of the mathematical part of the manuscript; YIW performed the identification of domain in sequenced genomes and the statistical analysis of the resulting distributions and wrote the draft of the corresponding part of the manuscript; FSB proved some of the theorems; AYR largely incepted the work and contributed to the formulation of the models; EVK contributed to the inception of the work and the formulation of the models, gave the biological interpretation of the results, wrote the background and discussion sections and extensively edited the entire manuscript.