 Research article
 Open Access
 Published:
Birth and death of protein domains: A simple model of evolution explains power law behavior
BMC Evolutionary Biology volume 2, Article number: 18 (2002)
Abstract
Background
Power distributions appear in numerous biological, physical and other contexts, which appear to be fundamentally different. In biology, power laws have been claimed to describe the distributions of the connections of enzymes and metabolites in metabolic networks, the number of interactions partners of a given protein, the number of members in paralogous families, and other quantities. In network analysis, power laws imply evolution of the network with preferential attachment, i.e. a greater likelihood of nodes being added to preexisting hubs. Exploration of different types of evolutionary models in an attempt to determine which of them lead to power law distributions has the potential of revealing nontrivial aspects of genome evolution.
Results
A simple model of evolution of the domain composition of proteomes was developed, with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from noncoding or nonglobular sequences or acquisition via horizontal gene transfer). This formalism can be described as a b irth, d eath and i nnovation m odel (BDIM). The formulas for equilibrium frequencies of domain families of different size and the total number of families at equilibrium are derived for a general BDIM. All asymptotics of equilibrium frequencies of domain families possible for the given type of models are found and their appearance depending on model parameters is investigated. It is proved that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order. It is further proved that any power asymptotic with the degree not equal to 1 can appear only if the hypothesis of independence of the duplication/deletion rates on the size of a domain family is rejected. Specific cases of BDIMs, namely simple, linear, polynomial and rational models, are considered in details and the distributions of the equilibrium frequencies of domain families of different size are determined for each case. We apply the BDIM formalism to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show an excellent fit between these empirical data and a particular form of the model, the secondorder balanced linear BDIM. Calculation of the parameters of these models suggests surprisingly high innovation rates, comparable to the total domain birth (duplication) and elimination rates, particularly for prokaryotic genomes.
Conclusions
We show that a straightforward model of genome evolution, which does not explicitly include selection, is sufficient to explain the observed distributions of domain family sizes, in which power laws appear as asymptotic. However, for the model to be compatible with the data, there has to be a precise balance between domain birth, death and innovation rates, and this is likely to be maintained by selection. The developed approach is oriented at a mathematical description of evolution of domain composition of proteomes, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.
Background
Sequencing of numerous genomes from all walks of life, including multiple representatives of diverse lineages of bacteria, archaea and eukaryotes, creates unprecedented opportunities for comparativegenomic studies [1–3]. One of the mainstream approaches of genomics is comparative analysis of protein or domain composition of predicted proteomes [2, 4, 5]. These studies often concentrate on domains rather than entire proteins because many proteins have variable multidomain architectures, particularly in complex eukaryotes (throughout this work, we use the term domain to designate a distinct evolutionary unit of proteins, which can occur either in the standalone form or as part of multidomain architectures; often but not necessarily, such a unit corresponds to a structural domain). As soon as genome sequences of bacteria became available, it has been shown that a substantial fraction of the genome of each species, from approximately one third in bacteria with very small genomes, to a significant majority in species with larger genomes, consists of families of paralogs, genes that evolved via gene duplication at different stages of evolution [6–9]. Again, a comprehensive analysis of paralogous relationships between genes is probably best performed at the level of individual protein domains, first, because many proteins share only a subset of common domains, and second, because domains can be conveniently and with a reasonable accuracy detected using available collections of domainspecific sequence profiles [10–12]. Comparisons of domain repertoires revealed both substantial similarities between different species, particularly with respect to the relative abundance of housekeeping domains, and major differences [4, 5]. The most notable manifestation of such differences is lineagespecific expansion of protein/domain families, which probably points to unique adaptations [13, 14]. Furthermore, it has been demonstrated that more complex organisms, e.g. vertebrates, have a greater variety of domains and, in general, more complex domain architectures of proteins than simpler life forms [1, 2].
Lineagespecific expansions and gene loss events detected as the result of comparative analysis of the domain compositions of different proteomes have been examined mostly at a qualitative level, in terms of the underlying biological phenomena, such as adaptation associated with expansion or coordinated loss of functionally linked sets of genes [15]. A complementary approach involves quantitative comparative analysis of the frequency distributions of proteins or domains in different proteomes. Several studies pointed out that these distributions appeared to fit the power law: P(i) ≈ ci^{γ} where P(i) is the frequency of domain families including exactly i members, c is a normalization constant and γ is a parameter, which typically assumes values between 1 and 3 [16–19]. Obviously, in doublelogarithmic coordinates, the plot of P as a function of i is a straight line with a negative slope. Power laws appear in numerous biological, physical and other contexts, which seem to be fundamentally different, e.g. distribution of the number of links between documents in the Internet, the population of towns or the number of species that become extinct within a year. The famous Pareto law in economics describing the distribution of people by their income and the Zipf law in linguistics describing the frequency distribution of words in texts belong in the same category [20–29]. Recent studies suggested that power laws apply to the distributions of a remarkably wide range of genomeassociated quantities, including the number of transcripts per gene, the number of interactions per protein, the number of genes or pseudogenes in paralogous families and others [30].
Power law distributions are scalefree, i.e. the shape of the distribution remains the same regardless of scaling of the analyzed variable. In particular, scalefree behavior has been described for networks of diverse nature, e.g. the metabolic pathways of an organism or infectious contacts during an epidemic spread [20, 25–27]. The principal pattern of network evolution that ensures the emergence of power distributions (and, accordingly, scalefree properties) is preferential attachment, whereby the probability of a node acquiring a new connection increases with the number of connections this node already has.
However, a recent thorough study suggested that many biological quantities claimed to follow power laws, in fact, are better described by the socalled generalized Pareto function: P(i) = c(i+a)^{γ} where a is an additional parameter [31]. Obviously, although at i >>a, a generalized Pareto distribution becomes indistinguishable from a power law, at small i, it deviates significantly, the magnitude of the deviation depending on the value of a. Furthermore, unlike power law distributions, generalized Pareto distributions do not show scalefree properties.
The importance of the analysis of frequency distributions of domains or proteins lies in the fact that distinct forms of such distributions can be linked to specific models of evolution. Therefore, by exploring the distributions, inferences potentially can be made on the mode and parameters of genome evolution. For this purpose, the connections between domain frequency distributions and evolutionary models need to be explored theoretically within a maximally general class of models. In this work, we undertake such a mathematical analysis using simple models of evolution, which include duplication (birth), elimination (death) and de novo emergence (innovation) of domains as elementary processes (hereinafter BDIM, birth death innovation models). All asymptotics of equilibrium frequencies of domain families of different size possible for BDIM are identified and their dependence on the parameters of the model is investigated. In particular, analytical conditions on birth and death rates that produce power asymptotics are determined. We prove that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order, and that any power asymptotic with the degree not equal to 1 can appear only if the assumption of independence of the duplication/deletion rates on the size of a domain family is rejected. We apply the developed formalism to the analysis of the frequency distributions of domains in individual prokaryotic and eukaryotic genomes and show a good fit of these data to a particular version of the model, the secondorder balanced linear BDIM.
Results and Discussion
Mathematical theory and model
Fundamental definitions and assumptions
A genome is treated as a "bag" of coding sequence for protein domains, which we simply call domains for brevity. Domains are treated as independently evolving units disregarding the dependence between domains that tend to belong to the same multidomain protein. Each domain is considered to be a member of a family (including singlemember families). We consider three types of elementary evolutionary events: i) domain birth, which generates a new member within a family; the principal mechanism of birth is duplication with divergence but additional mechanisms may be considered, including acquisition of a family member from a different species via horizontal gene transfer [32], ii) domain death, which results from domain inactivation and/or deletion, and c) domain innovation, which generates a new family with one member. Innovation may occur via horizontal gene transfer from another species, via domain evolution from a noncoding sequence or a sequence of a nonglobular protein, or via major change of a domain from a preexisting family after a duplication, which makes the relationship between the given domain and its family of origin undetectable (this latter process formally combines domain birth, death and innovation in a single event). The innovation rate (ν), is considered constant for a given genome. The rates of elementary events are considered to be independent of time (i. e. only homogeneous models are considered) and of the nature (structure, biological function etc.) of individual families.
In a finite genome, the maximal number of domains in a family cannot exceed the total number of domains and, in reality, is probably much smaller; let N be the maximal possible number of domain family members. We consider classes of domain families, which have only one common feature, namely the number of members (Fig. 1). Let f_{ i } be the number of domain families in ith class, i.e. families that are represented by exactly i domains in the given genome, i = 1,2,...N. Birth of a domain in a family of class i results in the relocation of this family from class i to class i+1 (decrease of f_{ i } and increase of f_{i+1} by 1). Conversely, death of a domain in a family of class i relocates the family to class i1; death of a domain in class 1 results in the elimination of the corresponding family from the given genome, this being the only considered mechanism of family death. We consider time to be continuous and suppose it very unlikely that more than one elementary event occur during a short time interval; formally, the probability that more than one event occurs during an interval Δt is o(Δt).
The formulation of the model
The simple BDIM
Let us formulate the following independence assumption: i) all elementary events are independent of each other; ii) the rates of individual domain birth (λ) and death (δ) do not depend on i (number of domains in a family). Under this assumption, the instantaneous rate, at which a domain family leaves class i, is proportional to i and the following simple BDIM describes the evolution of such a system of domain family classes:
df_{1}(t)/dt = (λ + δ) f_{1}(t) + 2δf_{2}(t) + ν
df_{ i }(t)/dt = (i  1)λf_{i1}(t)  i(λ + δ)f_{ i }(t) + (i + 1) δ f_{i+1}(t) for 1<i<N, (2.1)
df_{ N }(t)/dt = (N  1)λ f_{N1}(t)  N δ f_{ N }(t).
Similar models have been considered previously in several different contexts [33 v. 1, ch. 17, 34]. We will see in 3.2 that the solution of model (2.1) evolves to equilibrium, with a unique distribution of domain family sizes, f_{ i }~(λ/δ)^{i}/i; in particular, if λ = δ, then f_{ i }~1/i. Thus, under the simple BDIM, if the birth rate equals the death rate, the abundance of a domain class is inversely proportional to the size of the families in this class. When the observations do not fit this particular asymptotic (as observed in several studies on distributions of protein family sizes), a different, more general model needs to be developed.
The Master BDIM
A more general BDIM emerges when the independence assumption is abandoned. Instead of constructing specific hypotheses regarding the dependence between the elementary events, let us simply suppose that the domain birth and death rates for a family of class i do not necessarily show proportionality to i. For the general case, we designate these rates, respectively, λ_{ i } and δ_{ i }; in the specific case of the simple BDIM (2.1), λ_{ i } = λi and δ_{ i } = δi. Then we have the following master BDIM:
df_{1}(t)/dt = (λ_{1} + δ_{1})f_{1}(t) + δ_{2}f_{2}(t) + ν
df_{ i }(t)/dt = λ_{i1}f_{i1}(t)  (λ_{ i } + δ_{ i })f_{ i }(t) + δ_{i+1}f_{i+1}(t) for 1<i<N, (2.2)
df_{ N }(t)/dt = λ_{N1}f_{N1}(t)  δ_{ N }f_{ N } (t).
Let F(t)= f_{ i }(t) be the total number of domain families at instant t; it follows from (2.2) that
dF(t)/ dt = ν  δ_{1}f_{1}(t) (2.3)
The system (2.2) has an equilibrium solution f_{1},...f_{ N } defined by the equality df_{ i }(t)/dt = 0 for all i; this solution is described below under Proposition 1. Accordingly, there exists an equilibrium solution of equation (2.3), which we will designate F_{ eq } (the total number of domain families at equilibrium). At equilibrium, ν = δ_{1}f_{1}, i.e. the processes of innovation and death of single domains (more precisely, the death of domain families of class 1, i.e. singletons) are balanced.
We can rewrite the model (2.2) in terms of the frequency of a domain family of class i p_{ i }(t) = f_{ i }(t)/F(t). Let x(t) = y(t)/Y(t); then
dx/dt = [dy/dt /y  dY/dt /Y] x.
Applying this identity to p_{ i }(t) and rewriting equation (2.3) in the form
[dF(t)/dt]/F(t) = ν/F(t)  δ_{1}p_{1}(t) (2.3')
we obtain the following model for frequencies of the domain family (master BDIM for frequencies), which is equivalent to (2.2):
dp_{1}(t)/dt = (λ_{1} + δ_{1})p_{1}(t) + δ_{2}p_{2}(t) + ν/F(t)  (ν/F(t)  δ_{1}p_{1}(t))p_{1}(t), (2.4)
dp_{ i }(t)/dt = λ_{i1}p_{i1}(t)  (λ_{ i } + δ_{ i })p_{ i }(t) + δ_{i+1}p_{i+1}(t)  (ν/F(t)  δ_{1}p_{1}(t)) p_{ i }(t) for 1<i<N,
dp_{ N }(t)/dt = λ_{N1}p_{N1}(t)  δ_{ N }p_{ N } (t)  (ν/F(t)  δ_{1}p_{1}(t))] p_{ N }(t).
System (2.4) should be solved together with equation (2.3).
The Master BDIM and Markov processes
Let us note that system (2.4) for frequencies is nonlinear, so it is not a system of Kolmogorov equations for state probabilities of any homogeneous Markov process. Let us further suppose that a genome had ample time to arrive at an equilibrium with respect to the total number of domain families, such that F(t) = F_{ eq }. This does not imply dp_{ i }(t)/dt = 0 or df_{ i }(t)/dt = 0; in other words, the system might rearrange the frequencies of individual families, although the total number of families remains stable. If F(t) = F_{ eq }, the master system (2.4) turns into
d p_{1}(t)/dt = (λ_{1} + δ_{1}) p_{1}(t) + δ_{2}p_{2}(t) + ν/F_{ eq } (2.5)
d p_{ i }(t)/dt = λ_{i1}p_{i1}(t)  (λ_{ i } + δ_{ i }) p_{ i }(t) + δ_{i+1}p_{i+1}(t) for 1<i<N,
d p_{ N }(t)/dt = λ_{N1}p_{N1}(t)  δ_{ N }p_{ N } (t).
System (2.5) can be rewritten as a matrix equation
d p(t)/dt = p(t)Q,
where p(t) = {p_{1}(t),...p_{ N }(t)} and the matrix Q = (q_{ ij }) is defined by equalities
q_{11} = (λ_{1} + δ_{1}) + ν/F_{ eq }, q_{21} = δ_{2} + ν/F_{ eq }, q_{s 1} = ν/F_{ eq } for all s > 2;
q_{i1,i} = λ_{i1}, q_{i,i} = (λ_{ i } + δ_{ i }), q_{i+1,i} = δ_{i+1}, q_{k,i} = 0 for all k, ik > 1, i = 2,...N1,
q_{N1,N} = λ_{N1}, q_{N,N} = δ_{ N }.
It is easy to see that the sum of elements of each row (except for the first one) of the matrix Q is equal to ν/F_{ eq } > 0. Therefore the matrix Q cannot be a matrix of transition rates for any Markov process (the sum of elements of each row of a matrix of transition rates for Markov process with continuous time should be nonpositive [33 v. 1, ch.17, s. 8, 35 v. 2, ch. 3, s. 2]; in other words, there is no Markov process with continuous time and state space {1,2,...N} whose state probabilities satisfy system (2.5).
Thus, neither the initial BDIMs (2.1) or (2.2) nor the equilibrium model (2.5) can be described by any Markov process with continuous time.
Remark. If, in system (2.5), ν = 0, then this system turns into a system of state probabilities for a Markov birth and death process with continuous time.
Equilibrium in BDIMs
Equilibrium sizes and frequencies of the domain family system
Let us suppose that the genome had ample time to arrive at a complete equilibrium state, in which not only dF(t)/dt = 0, but also df_{ i }(t)/dt = 0 for all i. Thus, the equilibrium sizes of domain families f_{ i } satisfy the system

(λ_{1} + δ_{1}) f_{1} + δ_{2}f_{2} + ν = 0,
λ_{i1}f_{i1}  (λ_{ i } + δ_{ i })f_{ i } + δ_{i+1}f_{i+1} = 0 for 1<i<N, (3.1)
λ_{N1}f_{N1}  δ_{ N }f_{ N } = 0.
It should be emphasized that the master model does not assign a priori the value of F_{ eq }; this value has to be computed depending on the model parameters.
The following statement is central for further analysis.
Proposition 1. The master BDIM (2.2) has a unique equilibrium state (f_{1},...f_{ N }), which is the sole solution of system (3.1):
f_{1} = ν/δ_{1}
f_{ i } = ν λ_{ j } / δ_{ j } for all i = 2,...N. (3.2)
The unique equilibrium state (3.2) is globally asymptotically stable.
In addition (formally assuming λ_{ j } = 1 for i = 1),
F_{ eq } = ν ( λ_{ j } / δ_{ j } (3.3)
This proposition ascertains that all evolutionary trajectories of the system (2.2) exponentially (with respect to time) approach the equilibrium state (3.2). The proof is given in the Mathematical Appendix.
Remark. Let us denote the ratio of the birth rate to the innovation rate
G(N) ≡ λ_{ i }f_{ i }/ν,
and the ratio of the death rate to the innovation rate
I(N) ≡ δ_{ i }f_{ i }/ν.
Then, according to Proposition 1, for any BDIM in equilibrium,
G(N)  I(N) = λ_{ j } / δ_{ j }  λ_{ j }/δ_{ j }  1 = 1.
The principal goal of the treatment that follows is the analysis of the asymptotic behavior of equilibrium frequencies and sizes of domain families (f_{1},...f_{ N }) at large N. We will differentiate two cases of asymptotic behavior according to the following
Definition. Let {q_{ i }}, {s_{ i }} be sequences of real numbers; let us denote q_{ i }s_{ i } if lim q_{ i }/s_{ i } = 1 and q_{ i } ~ s_{ i } if lim q_{ i }/s_{ i } = c = const and 0<c<∞. We will also use this notation for finite but sufficiently long sequences.
Equilibrium frequencies for the simple BDIM
Let us apply Proposition 1 to the simple BDIM (2.1) with λ_{ i } = λi, δ_{ i } = δi.
Definition. A simple BDIM is balanced if θ = λ/δ = 1, i.e. if the rates of individual domain birth and death are equal.
Let us recall that a random discrete variable ξ has the logarithmic distribution with parameter θ < 1 if
P(ξ = i) = θ^{i}/i [ln(1θ)]^{1}, i = 1,2,...
A random variable ξ has the truncated logarithmic distribution with parameter θ if
P (ξ = i) = C_{ n } θ^{i} / i, i = 1,2,...n, C_{ n } = 1/ θ^{j}/j.
Then, we have
Proposition 2.Para>1) For any simple BDIM (2.1)
f_{ i } = (ν/δ)θ^{i1}/i = (ν/λ)θ^{i}/i, (3.4)
F_{ eq } = f_{ i } = ν/δ θ^{j1}/j, (3.5)
and
p_{ i } = (1/F_{ eq })(ν/δ)θ^{i1}/i = (θ^{i}/i) / θ^{j}/j (3.6)
that is, the equilibrium frequencies have the truncated logarithmic distribution if θ < 1.
2) If a simple BDIM is balanced, then
F_{ eq } = ν/δ 1/j, (3.7)
and for all i = 1,2,...N
p_{ i } = ν/δF_{ eq }/i = ( 1/j)^{1} / i. (3.8)
The proof is given in the Mathematical Appendix.
Thus, a simple BDIM can have equilibrium frequencies only of the form p_{ i } = C θ^{i}/i, where C = const and θ is the distribution parameter. In particular, the equilibrium frequencies for a balanced simple BDIM have the power distribution with the degree equal to 1.
Simple methods exist for preliminary graphical estimation of the single distribution parameter θ [36 ch. 7, s. 7]. We will prove in the following section that, if we observe a power asymptotic for empirically observed equilibrium frequencies, then (assuming that the system can be described by a BDIM), the rates λ_{ i } and δ_{ i } should be asymptotically equal at large i. If, additionally, the degree of the asymptotic is not equal to 1, then the system dynamics cannot be described by a simple BDIM. In this case, it is necessary to consider more general models, such as the Master BDIM (2.2).
Asymptotic behavior of equilibrium frequencies of a Master BDIM: Main Theorems
Let us consider the master BDIM (2.2); we showed in 3.1 that its equilibrium frequencies are the solution of the system
 (λ_{1} + δ_{1})p_{1} + δ_{2}p_{2} + ν/F_{ eq } = 0, (3.9)
λ_{i1}p_{i1}  (λ_{ i } + δ_{ i })p_{ i } + δ_{i+1}p_{i+1} = 0 for 1<i<N,
λ_{N1}p_{N1}  δ_{ N }p_{ N } = 0.
The following theorem gives all possible types of asymptotic behavior of the equilibrium frequencies and defines the connections between these asymptotics depending on model parameters. In particular, if there is no information on the exact form of dependence of the rates of birth and death of domains on the size of a domain family, the theorem can be used to qualitatively describe the dynamics of the asymptotic behavior of the equilibrium frequencies.
We will prove that the asymptotic behavior of a solution of system (3.9) is completely defined by the asymptotic relation between λ_{ i } and δ_{ i }. More precisely, let us define a function χ (i)= λ_{i1}/δ_{ i }; we consider only functions of power growth, i.e. χ (i) ~ i^{s} at i→∞ for a real s. We will see that this is not a serious restriction because the most realistic situations correspond to the case of s = 0. So, let us suppose that, for large i, the following expansion is valid:
χ (i) ≡ λ_{i1}/δ_{ i } = i^{s} θ (1+a/i + O(1/i^{2})) (3.10)
where s, a are real numbers and θ > 0. Evidently, if s ≠ 0, χ (i) tends either to 0 (s < 0) or to ∞ (s > 0) with the increase of i.
Definition. Let us refer to a BDIM (2.2), (3.10) as
i. nonbalanced, if s ≠ 0;
ii. firstorder balanced, if s = 0 and θ ≠ 1, i.e.
λ_{i1}/δ_{ i } = θ (1+a/i + O(1/i^{2})) at large i; (3.11)
iii. secondorder balanced, if s = 0, θ = 1 and a ≠ 0, i.e.
λ_{i1}/δ_{ i } = 1 + a/i + O(1/i^{2})) for large i; (3.12)
iv. highorder balanced, if s = 0, θ = 1 and a = 0, i.e.
λ_{i1}/δ_{ i } = 1 + O(1/i^{2})) for large i.
We will show that the first three coefficients, s, θ and a, of asymptotic expansion (3.10) for χ (i) = λ_{i1}/δ_{ I } exactly specify all possible asymptotic behaviors of BDIM equilibrium frequencies.
Theorem 1. The equilibrium frequencies p _{ i } of BDIM (2.2) have the following asymptotics
i. if the model is nonbalanced, then
p_{ i } ~ Γ (i)^{s}θ^{i}i^{a}, where Γ (i) is the Γfunction;
ii. if the model is firstorder balanced, then
p_{ i } ~ θ^{i}i^{a};
iii. if the model is secondorder balanced, then
p_{ i } ~ i^{a};
iv. if the model is highorder balanced, then
p_{ i } ~ 1
The proof is given in the Mathematical Appendix. The classification of BDIM according to the order of balance is illustrated in Fig. 2 and the asymptotics for different types of BDIMs are shown in Fig. 3.
It follows from this theorem that, if a BDIM is nonbalanced, then its equilibrium frequencies p_{ i } (and equilibrium family sizes f_{ i }) increase or decrease extremely fast (hyperexponentially) with the increase of i. In contrast, if a BDIM has a nonzero order of balance, asymptotic behavior is observed.
Let us recall that a random discrete variable ξ has the Pascal (or negative binomial) distribution with parameters (r,q), r > 0, 0 <q < 1, if P(ξ = k) = Γ(r+k)/[Γ(r) Γ(1+k)] (1q)^{r}q^{k}[36]. We will say that sequence {p_{ i }} follows (or asymptotically has) a discrete probabilistic distribution {q_{ i }} if p_{ i } ~ q_{ i } for large enough i.
Corollary 1.For a firstorder balanced BDIM with θ < 1,
i. if a > 1, the equilibrium frequencies p_{ i } follow Pascal distribution with parameters (a+1,θ);
ii. if a = 1, the equilibrium frequencies follow truncated logarithmic distribution with parameter θ;
iii. if a = 0, the equilibrium frequencies follow geometric distribution with parameter θ.
The following implication of Theorem 1 is of principal interest.
Corollary 2. Equilibrium frequencies of a BDIM have a power asymptotic behavior if and only if the BDIM is secondorder balanced.
Corollary 3. For highorder balanced BDIM, if λ_{i1}/δ_{ i } = 1 for all i, the only possible distribution of equilibrium frequencies is uniform, p_{ i } = const for all i. Moreover, even if λ_{i1}/δ_{ i } = 1 + O(1/i^{2}), the equilibrium frequencies asymptotically tend to the uniform distribution.
Rational BDIM
Rational models comprise a general class of BDIM (Fig. 4), for which the asymptotic behavior of the equilibrium frequencies and equilibrium sizes of domain families can be completely investigated.
Let us suppose that the birth and death rates are of the form
λ_{ i } = λ P(i) = λ (i + a_{ k })^α_{ k }, (4.1)
δ_{ i } = δ Q(i) = δ (i + b_{ k })^β_{ k }
for i > 0, where λ, δ are positive constants, α_{ k }, β_{ k } are real and a_{ k }, b_{ k } are nonnegative for all k = 1,...N.
We will refer to BDIM (2.2.), (4.1) as rational BDIM.
It is known that a wide class of mathematical functions can be well approximated by rational functions of the form (4.1) (see, e.g. [37]).
Specific cases of the rational BDIM are simple BDIM with P(i) = i, Q(i) = i, linear BDIM with P(i) = i + a_{1}, Q(i) = i + b_{1}, where a_{1}, b_{1} are constants, and polynomial BDIM, if P(i) and Q(i) are polynomials on i.
The following theorem describes all possible asymptotic behaviors of the equilibrium frequencies of a rational BDIM. Let us denote
θ =λ/δ,
η =
α_{ k }  β_{ k },
ρ =
a_{ k }α_{ k }  b_{ k }β_{ k },
β =
β_{ k }.
Theorem 2. The equilibrium sizes of domain families of a rational BDIM have the following asymptotics
f_{ i }C ν/λ Γ(i)^{η} θ^{i}i^{ρβ} (4.2)
where the constant C = (Γ(1 + b_{ k })^β_{ k }/ Γ(1 + a_{ k })^α_{ k }. (4.3)
The proof is given in the Mathematical Appendix.
Corollary 1. If η = 0, then the rational BDIM is firstorder balanced and the sequence of equilibrium numbers of domain families {f_{ i }} has a powerexponential asymptotics
f_{ i }C ν/λ θ^{i}i^{ρβ}. (4.4)
In particular, if ρ  β > 1, the equilibrium frequencies p_{ i } follow the Pascal distribution with parameters (ρ  β + 1, θ);
if ρ  β = 1, then frequencies p_{ i } follow the truncated logarithmic distribution;
if ρ  β = 0, then frequencies p_{ i } follow the geometric distribution.
Corollary 2. The equilibrium sizes of domain families f_{ i } and equilibrium frequencies p_{ i } for a rational BMID have the power asymptotics if and only if η = 0 and λ = δ, i.e. the BDIM is secondorder balanced, in which case
f_{ i }C ν/λ i^{ρβ}. (4.5)
Formula (4.4) gives the asymptotics for the equilibrium sizes of domain families f_{ i } and, accordingly, for the total number of families F_{ eq }. The exact expressions for these quantities are given in the proofs of Theorem 2 and Lemma (see Mathematical Appendix).
Proposition 3.
i. The equilibrium sizes of domain families f _{ i } of a balanced (first or higher order) rational BDIM are
f_{ i } = C ν/δθ^{i1} [(Γ(i + a_{ k }))^α_{ k }]/ [(Γ(i + 1 + b_{ k }))^β_{ k }] for all i = 1,2,...
where
C = [(Γ(1 + b_{ k }))^β_{ k }]/ (Γ(1 + a_{ k })^α_{ k }].
ii. The total number of domain families at equilibrium is
F_{ eq } = C ν/δ( θ^{j1} (Γ(j + a_{ k }))^α_{ k }/ (Γ(j + 1 + b_{ k }))^β_{ k }).
For the rational, secondorder balanced BDIM, the ratio of the birth rate to the innovation rate is
G(N) = θ^{i} [Γ(i + 1 + a_{ k })/Γ(1 + a_{ k })]^α_{ k } / [Γ(i + 1 + b_{ k }) / Γ(1 + b_{ k })]^β_{ k }.
The asymptotic formulas for equilibrium frequencies of rational BDIM could be considered as particular cases of the corresponding formulas of general theorem 1. Proposition 3 allows one to calculate the constants in the corresponding asymptotic formulas for the sizes of domain families for a rational BDIM. If only equilibrium frequencies are analyzed, the values of these constants become irrelevant because they contract. However, if the actual values of f_{ i } and F_{ eq } are of interest, the values of the constants are required.
Properties of the main types of rational BDIM
Simple BDIM
As shown above, a simple BDIM can have equilibrium frequencies only of the form p_{ i } = C θ^{i}/i, C = const;in particular, if the distribution parameter θ < 1, we get the (truncated) logarithmic distribution. Logarithmic distributions are seen in many biological contexts, e.g., the distribution of species by the number of individuals in populations or, what is more relevant, the distribution of protein folds by the number of families per fold [38]. Thus, a simple BDIM could be potentially used for modeling the dynamics of biological systems with a logarithmic distribution of equilibrium densities. We examine this possibility in greater detail starting with the case λ = δ (secondorder balanced simple BDIM).
We can extract from Proposition 2 some additional information, which could be helpful for estimating the model parameters. It is known that
1/i = lnN + C_{E} + O (1/N), where C_{E} is the Euler constant, C_{E} = 0.5772157...
More precisely, the approximation
1/i = lnN + C_{E} + N^{1}/2  N^{2}/12 has an error less then 10^{6} for N > 10. Thus, from (3.7), we obtain an interesting formula
F_{ eq } (ν/δ) [lnN + C_{E}] (5.1)
This means that, in the equilibrium state of the system, the total number of domain families grows only slowly (~ln N) with the increase of the maximal number (N) of domains in a family (which is equal to the maximal possible number of domain family size classes).
Furthermore, according to equation (2.3), in the equilibrium state of a simple BDIM ν/δ = f_{1}, so we have
F_{ eq } / f_{1} lnN + C_{E} (5.2)
Formula (5.1) can be used for estimating the model parameters on the basis of empirical data.
In the more general case λ ≠ δ, we can also obtain an estimate of the rate of innovation ν. If λ < δ (θ < 1), then the series in the right part of (3.5) quickly converges,
θ^{i1}/i → ln(1θ)/θ,
so ln(1θ)/θ is a good approximation for the sum
θ^{i1}/i for large N. Then
F_{ eq } = (ν/δ) θ^{i1}/i = (ν/λ) θ^{i}/i ν/λ (ln(1θ)),
and
ν/δ = F_{ eq } θ/(ln(1θ)). (5.3)
Taking into account that ν/δ = f_{1} (2.3), we have a relation
F_{ eq }/f_{1} ln(1θ)/θ, (5.4)
which allows the parameter θ to be estimated on the basis of empirical data.
If N can be estimated independently and is not very large, we can use more exact relations:
θ^{i}/i ln(1θ) + Ei(N(1θ))  N^{1}/2 + N^{2}/12.
where the function
.
Further, if (1θ)N is small (i.e., θ is very close to 1), then the approximation
θ^{i}/iC_{ E }  N(1θ)
has an error less then [N(1θ)]^{2}/4 and, in this case,
F_{ eq }/f_{1} (C_{ E }  N(1  θ))/θ. (5.5)
If (1  θ)N is large, then the following inequalities provide simple bounds for F_{ eq }/f_{1} = θ^{i1}/i:
 (ln(1θ)/θθ^{N}/[(N+1)(1θ)] < θ^{i1}/i < ln(1θ)/θθ^{N}[1/(N+1)θ/(N+2)]. (5.6)
For the simple BDIM, the ratio of the rate of duplications to the innovation rate is
G(N) = λ_{ i }f_{ i }/ν = θ^{i} = θ(1θ^{N1})/(1θ),
so G(N) → ∞ if θ > 1 and G(N) → 1/(1θ) if θ < 1 at N→∞.
If the simple BDIM is the 2^{nd} order balanced, θ = 1, then G(N) = N  1.
Thus, for the simple, secondorder balanced BDIM, the number of duplications per time unit is N1 times greater than the number of innovations.
The total number of domains in the equilibrium state for the simple BDIM is
M(N) = if_{ i } = ν/λθ(1θ^{N})/(1θ).
If a simple BDIM is secondorder balanced, then G(N) = ν/λ N.
Linear BDIM
We saw that the assumption of independence of birth and death rates of individual domains on each other and on the size of domain families is incompatible with any power distribution of the equilibrium frequencies with the degree not equal to 1. The simplest case of a BDIM, which can have, depending on the parameters, three types of asymptotic behavior described by Theorem 1 (excluding the first one, hyperexponential, which corresponds to a nonbalanced BDIM; all linear BDIMs are balanced) and, in particular, any power asymptotics, is a model with linear birth and death rates of the form:
λ_{ i } = λ (i + a), δ_{ i } = δ (i + b), where a and b are constants. (5.7)
The parameters a and b account, in the simplest possible form, for the deviation of the domain birth and death rates from those under the independence assumption. More precisely, according to (5.7), the average birth rate per domain in a family of size i is λ_{ i }/i = λ + λa/i. So, for small i, the average birth rate is close to λ + λa, whereas, for large i, it tends to λ. Similarly, the average death rate changes from δ + δb in a small family to the limit value δ in a large family. Thus, if a and b are positive (which seems to be the case for the available data; see below), both the birth rate and the death rate per domain decrease with the increase of the class number (size of the respective domain families); conversely, if a and b are negative, these rates increase with the class number (Fig. 5).
Corollary 1 of Theorem 2 implies that equilibrium frequencies p_{ i } of a linear BDIM have asymptotics
p_{ i } ~ θ^{i}i^{ab1}, where θ = λ/δ. (5.8)
In particular, if λ ≠ δ and a = b, the linear BDIM is firstorder balanced and the equilibrium frequencies p_{ i } follow the logarithmic distribution (in this case, the linear BDIM is asymptotically equivalent to the simple BDIM). If λ = δ, the linear BDIM is secondorder balanced and the equilibrium frequencies p_{ i } follow the power distribution
p_{ i } ~ i^{ab1}. (5.9)
Thus, the dependence of the domain frequency on the family size is actually determined by the difference a  b. If a >b, the birth rate decreases faster than the death rate with the increase of family size, i. e. there seems to be a "competition" between domains in a family; in contrast, if a <b, the death rate drops faster, i.e. a "synergy" between domains appears to exist (Fig. 4).
More detailed information can be obtained using Proposition 4:
i) for a firstorder balanced linear BDIM, the equilibrium sizes f _{ i } of domain families are
f_{ i } = c ν/δθ^{i1}Γ(i + a)/(Γ(i + 1 + b)) for all i
where
c = Γ (1 + b)/Γ (1 + a)
and the total number of domain families at equilibrium is
F_{ eq } = c ν/δ[ θ^{j1}Γ(j + a) / (Γ(j + 1 + b)]. (5.10)
ii) for a secondorder balanced linear BDIM (θ = 1),
f_{ i } = c_{1}ν/δ Γ (i + a)/Γ (i + 1 + b)
and
According to (2.3), in the equilibrium state of a linear BDIM, f_{1} = ν/δ_{1} = ν/(δ(1 + b)) and so, for a secondorder balanced linear BDIM, we have the formula
Suppose that equilibrium frequencies obtained from empirical data follow the power distribution p_{ i } ~ i^{γ}; in this case, γ is the slope of the empirical curve (lnf_{ i } versus lni) and can be estimated from the data. Assuming that the system is well described by a linear BDIM, it follows from (5.9) that a  b = 1  γ and λ = δ. Thus,
f_{ i } = c ν/δ Γ (i + a)/Γ (i + a + γ), where c = Γ (γ + a)/Γ (1 + a), (5.12)
and
where a is the single free parameter.
For the linear secondorder balanced BDIM, the ratio of the birth rate to the innovation rate is
if 1 + a  b ≠ 0. As
if 1 + a  b < 0 and G(N)→∞ if 1 + a  b > 0 at N→∞.
The case 1 + a  b = 0 (slope of the asymptote in double logarithmic coordinates equal to a  b  1 = 2) is a critical one.
In this case,
G(N) = Γ(1 + b) / Γ(b) Γ (i + b) / (Γ (i + 1 + b) =
b_{1}/(i + b) = b [PolyGamma(0, b+N)  PolyGamma(0, b+1)].
Accordingly, G(N)→∞ at N→∞.
The total number of domains in the equilibrium state for a secondorder balanced linear BDIM is
If the slope of the asymptote γ = 1, the linear secondordered BDIM shows the same asymptotic behavior as a simple BDIM (2.1), but behaves differently at small i. If γ ≠ 1, the system cannot be described by a simple BDIM even asymptotically, but can be described by a linear BDIM. As indicated above, in this case, the average perdomain birth and death rates depend on the size of the domain family and the difference ab characterizes this dependence.
Quadratic BDIM
The linear BDIM takes into account the dependence of average birth and death rates of individual domains on the size of domain family, but does not imply a specific form of interaction between domains. Let us consider the simplest, pairwise interaction, which leads to λ_{ i } ~ i^{2} and/or δ_{ i } ~ i^{2}, i.e. one or both rates are polynomials on i of the second degree. If these degrees are different (i.e., λ_{ i } ~ i and δ_{ i } ~ i^{2}), then the corresponding BDIM is nonbalanced and equilibrium frequencies have hyperexponential asymptotics. Thus, let
λ_{ i } = λ (i^{2} + r_{1}i + r_{2}), δ_{ i } = δ (i^{2} + q_{1}i + q_{2}), (5.13)
where r_{ k }, q_{ k }, k = 1,2 are constants (such that λ_{ i }, δ_{ i } are positive for all i) or
λ_{ i } = λ (i + a_{1})(i + a_{2}),
δ_{ i } = δ (i + b_{1})(i + b_{2})
Then, r_{1} = a_{1} + a_{2}, q_{1} = b_{1} + b_{2}, and
χ (i) = λ_{i1}/δ_{ i } = θ (1 + (r_{1}q_{1}2)/i + O(1/i^{2})),
where θ = λ/δ.
According to theorem 3 and Proposition 3, the quadratic BDIM with rates (5.13) has equilibrium sizes of domain families
f_{ i } = c_{2} ν/δ θ^{i1} Γ (i + a_{1}) Γ (i + a_{2}) / (Γ (i + 1 + b_{1}) (Γ (i + 1 + b_{2})) c_{2}ν/δ θ^{i1}i^{ρ2} (5.14)
where ρ = r_{1}  q_{1} and the constant c_{2} = [(Γ (1+b_{1}) Γ (1+b_{2})] / [Γ (1+a_{1}) Γ (1+a_{2})], and the total number of domain families at equilibrium
F_{ eq } = c_{2}ν/δ ( θ^{j1} Γ(j+a_{1}) Γ(j+a_{2}) / (Γ(j+1+b_{1}) (Γ(j+1+b_{2})). (5.15)
Note that the asymptotic behavior of frequencies p_{ i } does not depend on free coefficients r_{2}, q_{2} in (5.13), but only on θ and r_{1}q_{1} (as follows from (5.14)), although the values of f_{ i } are proportional to the constant c_{2}, which could depend on the free coefficients r_{2}, q_{2}. Let us consider the case r_{2} = q_{2} = 0 in more detail.
If only square terms are present in the expressions for the birth and death rates, λ_{ i } = λi^{2}, δ_{ i } = δi^{2}, then a_{ k } = b_{ k } = 0, k = 1,2 and so c_{2} = 1, f_{ i } = ν/δ θ^{i1}/i^{2} and F_{ eq } = ν/δ θ^{j1}/j^{2}. So at N→∞
F_{ eq } ν/δ θ^{j1} / j^{2} = ν/λ Polylog(2,θ) (5.16)
where Polylog is a special function, Polylog(k,x) = x^{j}/j^{k}.
According to (3.2), f_{1} = ν/δ_{1}; for this particular case of quadratic BDIM, f_{1} = ν/δ and
F_{ eq }/f_{1} Polylog(2,θ). (5.17)
Formula (5.17) allows estimating parameter θ from empirical data if N is large enough.
More precisely, F_{ eq } = ν/λ θ^{j}/j^{2} = ν/λ (Polylog(2,θ)θ^{1+N} LerchPhi(θ,2,1+N)), where LerchPhi is a special function (these and other special functions used below can be computed using program packages Mathematika or Maple).
If, additionally, θ = 1 (the BDIM is secondorder balanced), then
f_{ i } = (ν/δ)/i^{2} = f_{1}/i^{2} (5.18)
and, at large N
F_{ eq } ν/δ π^{2}/6 1.645 ν/δ = 1.645f_{1}. (5.19)
From formulas (5.8), (5.15), we can extract some additional information, which could be helpful for estimating the model parameters at relatively small N. Let us recall definitions of some special functions.
The digamma function φ(z) is logarithmic derivative of Γ(z), φ(z) = Γ'(z)/Γ(z).
The function PolyGamma(n,z) is n^{th} derivative of φ(z), PolyGamma(n,z) = d^{n}φ(z)/dz^{n}, such that φ(z)= PolyGamma(0,z).
It is known that
1/i^{2} = π^{2}/6PolyGamma(1,1+N),
Thus we have
F_{ eq } = ν/δ 1/j^{2} = ν/δ [π^{2}/6PolyGamma(1,1+N)] (5.20)
F_{ eq }/f_{1} = π^{2}/6PolyGamma(1,1+N),
which can be used for estimating unknown parameters of the model.
The values of PolyGamma(1,x) are tabulated and can be computed using standard program packages; for a rough preliminary estimate, PolyGamma(1,x) = 1/x+1/2x^{2}+O(1/x^{4}).
If linear terms are also present in the quadratic BDIM, λ_{ i } = λ (i^{2}+a_{1}i), δ_{ i } = δ (i^{2}+b_{1}i), then
f_{ i } = c_{2}ν/δ θ^{i1}/i Γ (i+a_{1})/Γ (i+1+b_{1})
where c_{2} = Γ (1+b_{1})/Γ (1+a_{1}); F_{ eq } = Σf_{ i } can be computed using special functions. In particular, if the BDIM is secondorder balanced, θ = 1, then
f_{ i } = c_{2}ν/δ Γ (i+a_{1}) / (i Γ (i+1+b_{1})).
For this variant of the model, f_{1} = ν/δ_{1} = ν/(δ(1+b_{1})), and so
Polynomial BDIMs
The quadratic models take into account the dependence of birth and death rates of individual domains on the simplest, pairwise interactions. If interactions of higher orders are postulated, λ_{ i } ~ P_{ n }(i) and/or δ_{ i } ~ Q_{ m }(i), where P_{ n }(i), Q_{ m }(i) are polynomials on i of the nth and mth degrees. Again, if the degrees n and m are different, then the BDIM is nonbalanced and equilibrium frequencies have hyperexponential asymptotics. Thus, let n = m,
λ_{ i } = λR (i) = λ r_{ k }i^{mk}, δ_{ i } = δQ(i) = δ q_{ k }i^{mk} (5.21)
where r_{ k }, q_{ k } are constants and r_{0} = q_{0} = 1. We suppose, of course, that R(i), Q(i) are positive for all integer i. Note that, in this case, χ (i) ≡ λ_{i1}/δ_{ i } = θ (1+(r_{1}  q_{1}  m)/i+O(1/i^{2})), where θ = λ/δ. We will suppose that θ ≤ 1.
According to Theorem 3, the polynomial BDIM with rates (5.21) has equilibrium sizes of domain families with powerexponential asymptotics
f_{ i } ~ θ^{i}i^{ρm} (5.22)
where ρ = r_{1}  q_{1}.
In particular, if ρ  m > 1, the equilibrium frequencies p_{ i } follow the Pascal distribution with parameters (ρ  m + 1, θ);
if ρ  m = 1, the equilibrium frequencies p_{ i } follow the (truncated) logarithmic distribution;
if ρ  m = 0, the equilibrium frequencies p_{ i } follow the geometric distribution;
if λ = δ, the polynomial BDIM is secondorder balanced and the equilibrium frequencies p_{ i } follow the power distribution
p_{ i } ~ i^{ρm}. (5.23)
Note that the degree of the power distribution (5.23) depends only on m, the common degree of the polynomials (5.21), and on ρ, the difference between the coefficients r_{1} and q_{1}, and does not depend on other coefficients. In particular, if r_{1} = q_{1}, then p_{ i } ~ i^{m}. This relation could be interpreted as follows: if the first two coefficients of polynomial rates λ_{ i } and δ_{ i } are equal, then the degree of the power distribution (5.19) is equal to the "order of interactions" of domains.
Formula (5.22) can be refined. Let R(i) = (i+a_{ k }), Q(i) = (i+b_{ k }).
Then (see Proposition 3) the equilibrium numbers of domain families f_{ i } of the polynomial BDIM (5.18) are
f_{ i } = C ν/δθ^{i1} [Γ(i+a_{ k })/Γ(i+1+b_{ k })]
where C = [Γ(1+b_{ k })/Γ(1+a_{ k })], and the equilibrium total number of domain families
F_{ eq } = C ν/δ θ^{j1} [Γ(j+a_{ k })/Γ(j+1+b_{ k })].
For the polynomial model f_{1} = ν/δ_{1} = ν/(δ q_{ k }), so
F_{ eq }/f_{1} = C θ^{j1} (Γ(j+a_{ k })/Γ(j+1+b_{ k }))/q_{ k }.
This formula can be used for estimating the model parameters.
For the polynomial secondorder balanced BDIM, the ratio of the death rate to the innovation rate is
G(N) = λ_{ i }f_{ i }/ν = ( Γ(1+b_{ k })/Γ(1+a_{ k })) Γ(i+1+a_{ k })/Γ(i+1+b_{ k }) =
[Γ(i+1+a_{ k })/Γ(1+a_{ k })]/[Γ(i+1+b_{ k })/Γ(1+b_{ k })].
Approximation of the observed domain family size distributions in prokaryotic and eukaryotic genomes with different BDIMs
Having developed the mathematical theory of BDIMs, we sought to determine which of these models, if any, adequately described the empirical data on domain family size distribution. To identify the domain sets of domains encoded in each of the genomes, the CDD library of positionspecific scoring matrices (PSSMs), which includes the domains from the Pfam and SMART databases, was used in RPSBLAST searches [12] against the protein sequences from a set of completely sequenced eukaryotic and prokaryotic genomes http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome. The CDD domain library is partially redundant, so when the results obtained from individual PSSMs showed significant overlap (more than 50% of hits overlapped for more than 50% of their length), the corresponding domains were examined casebycase for redundancy. PSSMs representing structurally similar domains and producing overlapping lists of hits were joined into "synonymy clusters".
The results of RPSBLAST searches against the sets of protein sequences from individual genomes were interpreted as follows: nonoverlapping hits to the same protein were treated independently; among overlapping hits, only the strongest one (lowest Evalue) was recorded; all hits from a synonymy cluster were assigned to one representative domain family. The number of hits that a domain family had in a genome, with the cutoff of E = 0.001, was recorded as the number of domains of the given family encoded in the given genome. The CDD domain library certainly does not include all existing domains. In practice, domains from this collection were detected in >50% in each of the analyzed species, with the only exception of human, for which the analyzed protein set is likely to contain a substantial fraction of false predictions (Table 1).
To enable statistical analysis using the χ^{2}method for the entire range of the data, including the sparsely distributed classes corresponding to large families, the data needed to be combined. For each genome, the observed domain family frequencies were grouped into bins, each containing at least 10 families; typically, bins corresponding to families with small number of members included a single size class (e.g. all singlemember families, twomember families etc), whereas bins corresponding to large families may span hundreds of size classes, most of them empty. Theoretical distribution values for a bin combining observed data from mth to nth class were computed as , where f'_{ i } is the predicted number of families in the ith class and depends on the model parameters. Since the model displays only a weak dependence on the maximum number of domains in a family (N), instead of including N as a model parameter, the sum (where i_{ max } is the number of domains in the most abundant of the detected families), was normalized to equal the total number of families detected in the given genome (a requirement for the χ^{2} analysis). χ^{2} values were computed to measure the quality of fit between the observed and the theoretical distributions. The distribution parameters (θ for the simple BDIM, a and b for the secondorder balanced linear BDIM) were adjusted to minimize the χ^{2} value.
The simplest model that resulted in a good fit to the observed domain family size distributions was the secondorder balanced linear BDIM (Fig. 6,7,8,9,10,11,12,13,14,15). For all analyzed genomes, P(χ^{2}) for this model was >0.05, i.e. no significant difference between the model predictions and the observed data was detected. Considering the firstorder balanced linear BDIM, which involves varying the parameter θ, did not result in a significant improvement of fit for any of the analyzed genomes (data not shown). In contrast, a fit to a truncated logarithmic distribution (prediction of a simple BDIM) failed for all genomes (P(χ^{2}) < 10^{5}; Fig. 16, 17, and data not shown). An exact powerlaw distribution, which is often used to approximate protein family frequency distributions, similarly failed to adequately fit the observed data, even when the most deviant class 1 families were excluded (P(χ^{2}) = 0.0013 for T. maritima; P(χ^{2}) < 10^{5} for the rest of the genomes; Fig. 16, 17 and data not shown). Notably, secondorder balanced linear BDIM results in a correct prediction of the number of very large families, whereas simple BDIM systematically underestimates the number of families in the highest bins. Conversely, the powerlaw fit underestimates the slope of the bestfit line (in double logarithmic coordinates) compared to the asymptote of the linear BDIM prediction and, accordingly, significantly overestimates the number of families in the highest bins (Fig. 16, 17). These results are compatible with the recent observation that the domain family size distributions are better described by the generalized Pareto distribution than by power laws [31].
Fitting the observed domain family size distribution with the secondorder balanced linear BDIM resulted in positive values of the parameters a and b, with a <b, for all analyzed genomes (Table 1). Accordingly, domain family size distributions in all cases asymptotically tend to the power law with the power k < 1 and, for all species with the exception of C. elegans, k < 2 (Table 1 and Fig. 8). As discussed above, this seems to indicate the existence of "synergy" between domains in a family whereby the likelihood of survival is greater for a domain that belongs to a large family than for a domain from a small family (Fig. 5). For all species, we find that the innovation rate is approximately three orders of magnitude greater than the per domain birth (death) rate. Accordingly, the total per genome birth (duplication) rate is comparable to but, typically, several times greater than the innovation rate (Table 1). The ratio of the per genome birth rate to the innovation rate increases with the number of genes in a genome or the number of detected domains, with nearly identical rates seen for small prokaryotic genomes and values as high as 20 for the largest plant and animal genomes (Table 1).
The data used to fit the BDIM typically included 50–60% of the proteins encoded in a given genome (Table 1); the remaining proteins were not represented by sufficiently similar domains in the current CDD collection. It cannot be ruled out that the fit would be significantly affected as a result of including all proteins encoded in the genome, in case the proteins currently not recognized in CDD searches have a family size distribution substantially different from that of the recognized ones. However, secondorder balanced linear BDIM can accommodate considerable perturbations of the distribution through adjustment of the parameters, so we believe that this model is likely to approximate well also the size distribution of domain families for complete sets of proteins encoded in a genome. An alternative approach that at least partially circumvents the sampling problem involves analysis of all families of paralogs detectable using clustering by sequence similarity, with employing a predefined library of domains; this analysis is beyond the scope of the present work but may be a subject of further investigation.
General discussion and conclusions
Here, we presented a complete mathematical description of the size distribution of protein domain families encoded in genomes for simple but not unrealistic models of evolution, which include three types of events: domain duplication (birth), domain elimination (death), and domain innovation. In biological terms, innovation could involve gene acquisition via horizontal gene transfer, emergence of a new domain from a noncoding sequence or a nonglobular protein sequence, or major modification of a domain obliterating its connection with a preexisting family. Innovation via horizontal gene transfer appears to be particularly common in prokaryotes [32, 39], which might account for the apparent higher relative innovation rate in prokaryotic genomes observed in the present analysis (Table 1).
We showed that birthdeathinnovation models (BDIMs) with different levels of complexity lead to readily distinguishable predictions regarding the distribution of domain family sizes in genomes. In particular, we defined the exact analytic conditions that lead, exactly or asymptotically, to power law distributions, which have recently received ample attention, as they were uncovered in various biological and social contexts [20, 25]. In contrast to previous analyses [16, 17, 30] but in agreement with the results of a recent reexamination [31], we showed that the power law only asymptotically approximates the domain family size distributions.
Three groups of observations made in this work seem to have the greatest potential of enhancing our understanding of genome evolution and, perhaps, evolution of other complex systems. First, we proved that, within the BDIM framework, there is a unique equilibrium state of the system, which is approached exponentially, with respect to time, from any initial state. In this equilibrium state, the number of domain families in each size class remains constant and follows a unique distribution depending on the type and parameters of the BDIM. In particular, power asymptotics emerges when and only when a BDIM is secondorder balanced, i.e. the rates of domain birth and death are asymptotically equal. Since we showed that the observed size distributions of domain families in all analyzed genomes indeed tend to power law asymptotics, the results are compatible with the notion that the genomes are close to a steady state with respect to the domain diversity (F_{ eq }, the number of distinct domain families at equilibrium, under the using the BDIM convention) and distribution (f_{ i }). Taking a broader biological perspective, this result might indicate that evolving lineages go through lengthy periods of relative stasis when the level of genomic complexity remains more or less the same. Under this view, the stasis epochs are punctuated by relatively short periods of dramatic changes when the complexity either greatly increases (the emergence of eukaryotes is the most obvious case in point) or decreases (e.g. evolution of parasites). These bursts of evolution might be described as transitions between different BDIMs in the parameter space, with some of the trajectories potentially involving nonbalanced BDIMs. The analogy between this emerging picture of genome evolution and the punctuated equilibrium concept of species evolution, which has been developed through analysis of the paleontological record [40], is obvious.
Second, we showed that the simplest model that adequately describes the observed domain family size distributions is the secondorder balanced linear BDIM; in contrast, simple BDIMs do not show a good fit to the observations. This has potentially important implications for the mode of domain family evolution. Simple BDIMs are based on the notion that the likelihood of duplication (birth) or elimination (death) of a domain is uniform across the genome and does not depend on the size or other characteristics of domain families (the independence assumption). Clearly, under the independence assumption, a duplication (birth) as well as elimination (death) of a domain is more likely to occur in a large family than in a small one, but only because the overall probability of such an event is proportional to the number of family members, whereas the birth (death) rate per domain remains the same. The key observation of this work, that the actual domain frequency distributions are well described by a linear but not by a simple BDIM, suggests that the independence assumption is an oversimplification. Instead, the linear BDIM includes parameters that describe the dependence of the per domain birth (death) rate on the family size. The asymptotics of the theoretical distribution that is the best fit for the actual data is a power law, with the power equal to ab1, where a and b are the parameters of a linear BDIM. We observed that, for all analyzed genomes, ab1 < 1 (a <b), which corresponds to "synergy" between domains in a family. Both the domain birth rate and the death rate drop with the increase of the size of a domain family, but the death rate decreases faster (Fig. 5). In general terms, this suggests that small families are more dynamic during evolution than large families. In particular, under the BDIM formalism, innovation contributes only to singlemember families (class 1), which have the greatest evolutionary mobility, and either quickly proliferate and are stabilized or perish. An implication of these observations is that, in general, large families are older than small ones. Exceptions to this generalization, i.e. the existence of small, ancient families, probably point to selection for a specific family size; for example, it seems likely that selection acts against proliferation of certain essential proteins, e.g. ribosomal proteins, which typically form singlemember families [41]. Another pertinent observation is that the linear BDIM seems to adequately accommodate even the largest of the identified domain families. Lineagespecific expansion of paralogous families appears to be one of the principal modes of organismic adaptation during evolution [13, 14, 42]. Thus, quantitatively, adaptive family expansion appears to fit within the BDIM framework, although these models do not explicitly incorporate the notion of selection. Of course, for BDIMs, it is irrelevant which families expand, and this choice is determined by selection.
Third, the BDIM equilibrium condition with respect to the total number of domain families, ν = δ_{1}f_{1} (ν is the innovation rate, δ_{1} is the domain death rate for class 1 families, and f_{1} is the number of domain families in class 1) allows us to estimate the ratio between domain innovation rate and the domain death and birth rates. Indeed, according to the above and the definition of a secondorder linear BDIM, which is the best fit for the actual data, λ = δ = ν/f_{1}(1+b). Since the number of domain families in class 1 (families with only one member) is in the hundreds for each genome, this translates into an innovation rate that is much greater than the duplication or elimination rate per domain (Table 1). Such high innovation rates might appear counterintuitive, but let us note that the duplication rate over all domain families is a number that tends to be nearly identical to ν for small prokaryotic genomes and severalfold greater than ν for large eukaryotic genomes (Table 1). Thus, under the secondorder balanced linear BDIM, the likelihood of appearance of a new domain in a genome is close to or several times less than the likelihood of a duplication or elimination of an existing domain. Nevertheless, the finding that the innovation rate is comparable to the overall duplication/elimination rate seems surprising. If the linear BDIM is indeed a realistic evolutionary model, this emphasizes the critical role of innovation in maintaining the balance (steady state) in genome evolution. In prokaryotes, innovation via horizontal gene transfer appears to be particularly extensive [32, 39], which might underlie the greater relative innovation rate in these organisms (Table 1).
As already indicated, BDIMs do not explicitly incorporate selection. However, the present analysis shows that only models with precisely balanced domain birth, death and innovation rates can account for the observed distribution of domain family size in each of the analyzed genomes. It seems likely that the balance between these rates is itself a product of selection. There is little doubt that BDIMs will be eventually replaced by more sophisticated formalisms that will more realistically capture the mechanisms of genome evolution. Nevertheless, even the crude modeling described here seems to reveal several potentially interesting and nontrivial aspects of the evolutionary process.
Mathematical Appendix. Proofs of some statements
Proof of Proposition 1
When system (3.1) is solved consecutively from the last equation to the second one, it becomes obvious that the solution is unique up to a constant multiplier.
Next, if f_{ i } = f_{i1}λ_{i1}/δ_{ i }, f_{i+1} = f_{i1}λ_{i1}λ_{ i }/(δ_{ i }δ_{i+1}), then the substitution shows that (f_{i1},f_{ i },f_{i+1}) satisfy the ith equation of system (3.2). Substituting f_{2} = f_{1}λ_{1}/δ_{2} in the first equation, we get f_{1} = ν/δ_{1} and, consequently, for all i = 2,...N. By definition, F_{ eq } = f_{ i }, so we have (3.3).
Since system (2.2) is linear, the equilibrium state (f_{1},...f_{ N }) is asymptotically stable if the real parts of all characteristic values of the matrix
are negative.
The following theorem (see [43]) gives the desired criterion: the real part of all the characteristic values of a real matrix C = c_{ ij }, i,j = 1,..n with nonnegative nondiagonal elements are negative if and only if (1)^{k}D_{ k } > 0 for all k = 1,..n, where D_{ k } is the main minor of the matrix C of the kth order.
To apply this theorem, let us consider the n × n matrix, n ≤ N
It is easy to see that
det B_{ n } = (λ_{ n } + δ_{ n })det B_{n1}  λ_{n1}δ_{ n } det B_{n2}, (A1)
det A_{ n } = δ_{ n }det B_{n1}  λ_{n1}δ_{ n } det B_{n2}.
Using these equalities, we can prove that for any n
det A_{ n } = (1)^{n}δ_{ n }δ_{n1...} δ_{2}δ_{1}.
Indeed,
det A_{ n } = δ_{ n } det B_{n1}λ_{n1}δ_{ n } det B_{n2}=
δ_{ n }((λ_{n1}+δ_{n1}) det B_{n2} + λ_{n2}δ_{n1} det B_{n3})  λ_{n1}δ_{ n } det B_{n2}=
δ_{ n }δ_{n1} (det B_{n2} + λ_{n2} det B_{n3})= (subsequently using (A1))=
(1)^{n2}δ_{ n }δ_{n1...} δ_{3}(det B_{2} + λ_{2} det B_{1}) = (1)^{n}δ_{ n }δ_{n1...} δ_{2}δ_{1}.
Further, it is easy to see that for any n
det B_{ n } = det A_{ n }  λ_{ n } det B_{n1}.
Taking into account that B_{1} = (λ_{1} + δ_{1}) < 0 and that the sign of det A_{ n } coincides with (1)^{n}, it is easy to prove that
det J_{ n } > det A_{ n } if det A_{ n } > 0 and det J_{ n } < det A_{ n } if det A_{ n } < 0.
Thus, the sign of det B_{ n } coincides with the sign of det A_{ n } and so (1)^{n}B_{ n } > 0 for all n = 1,..N. According to the aforementioned theorem, the real parts of all the characteristic values of a real matrix A_{ N } are negative and so the single equilibrium is asymptotically stable, QED.
Proof of Proposition 2
For simple BDIM (2.1) f_{ i } = ν λ_{ k } / δ_{ k } = (ν/δ)θ^{i1}/i = (ν/λ)θ^{i}/i, so
F_{ eq } = f_{ i } = ν/λ θ^{i}/i, and
p_{ i } = f_{ i }/F_{ eq } = (θ^{i}/i)/ θ^{j}/j.
If a simple BDIM is balanced, then θ = 1 and so
F_{ eq } = ν/λ θ^{j}/j.
p_{ i } = ν/λ F_{ eq }/i = 1/i ( 1/j)^{1}.
Proof of Theorem 1
The condition (3.10) can be rewritten as λ_{i1}/δ_{ i } = i^{s}θ(1+a/i+O(1/i^{2})) = i^{s}θ (1+a/i)(1+O(1/i^{2})). Thus, we can choose S in such a way that (1 + O(1/s^{2})) converge, 0 < (1 + O(1/s^{2})) < ∞. It follows that
(λ_{s1}/δ_{ s }) ~ Γ(j)^{s} θ^{j} (1+a/s).
According to Proposition 1, p_{ i } = f_{ i }/F_{ eq } ~ λ_{ k } / δ_{ k }. So
p_{ i } ~ (λ_{s1}/δ_{ s }) ~ Γ(i)^{s} θ^{i} (1+a/s) = Γ(i)^{s}θ^{i}(i+a+1)/Γ(i+1).
Applying the main asymptotic property of Γfunction, i.e. Γ (i+c)/Γ(i)~i^{c} at large i for any c, we have
Γ (i+a+1)/ Γ (i+ 1) ~ i^{a}, and so p_{ i } ~ Γ (i)^{s} θ^{i}i^{a}.
Proofs of Corollaries 1–3
If a discrete random variable ξ has the Pascal distribution, then
P(ξ = i) 1 / Γ (r) (1q)^{r}i^{r1}q^{i} ~ q^{i}i^{r1} for large i,
and it becomes evident that, for a > 1, equilibrium frequencies p_{ j } of the firstorder balanced BDIM follow the Pascal distribution with parameters (a+1,θ).
If a = 1, then p_{ i } ~ θ^{i}/i and so p_{ i } follows the truncated logarithmic distribution with parameter θ. If a = 0, then p_{ j } ~ θ^{i} and p_{ i } follows the geometric distribution.
Further, p_{ i } ~ i^{a}, that is the sequence p_{ i } follows the power distribution with the power a, if and only if θ = 1, that is, if the BDIM is secondorder balanced.
Finally, if λ_{i1}/δ_{ i } = 1 + O(1/i^{2}), that is, if θ = 1 and a = 0, then p_{ i } ~ const; in particular, if λ_{i1} = δ_{ i } for all i, then, according to Proposition 1, f_{ i } = ν for all i and p_{ i } = 1/N.
Proof of Theorem 2
According to Proposition 1, system (3.1) has the unique solution:
f_{1} = νδ_{1}, f_{ i } = ν λ_{ s } / δ_{ s } for all i = 2,...N. So
f_{ i } = ν/λθ^{i}P(s) / Q(s), i > 1.
Applying the Lemma (see below), we get
f_{ i }C ν/λ θ^{i} Γ (i)^{η}i^{ρβ}, as i→∞,
where the constant C = [(Γ(1+b_{ k }))^β_{ k }] / [Γ(1+a_{ k })^α_{ k }].
Lemma. Let P(j) = (j+a_{ k })^α_{ k }, Q(j) = (j+b_{ k })^β_{ k }, where a_{ k }, b_{ k } are positive. Let us denote
η =
α_{ k }  β_{ k }, ρ = a_{ k } α_{ k }  b_{ k }β_{ k }, β = β_{ k },.
Then with fixed j
N(j) = P(s) / Q(s) C Γ(j)^{η}j^(ρβ)
as j→∞, where
C = [(Γ(1+b_{ k }))^β_{ k }] / [Γ(1+a_{ k })^α_{ k }].
Proof.
(s+a_{ k })^α_{ k } = [Γ(j+a_{ k }) / Γ(1+a_{ k })]^α_{ k },
(s+b_{ k })^β_{ k } = [Γ(j+1+b_{ k }) / Γ(1+b_{ k })]^β_{ k }, so
N(j) = { [Γ(j+a_{ k }) / Γ(1+a_{ k })]^α_{ k }}/{ [Γ(j+1+b_{ k })/Γ(1+b_{ k })]^β_{ k }}=
C [(Γ(j+a_{ k }))^α_{ k }]/ [(Γ(j+1+b_{ k }))^β_{ k }]
where
C = [(Γ(1+b_{ k }))^β_{ k }]/ [Γ(1+a_{ k })^α_{ k }].
Let us use the known asymptotic relation
Γ (t+a)/Γ (t) t^{a} with t→∞.
Thus we have
[(Γ(j+a_{ k }))^α_{ k }]/ [(Γ(j+1+b_{ k }))^β_{ k }]
(Γ(j))^{η} [(Γ(j+a_{ k }) / Γ(j))^α_{ k }] / [(Γ(j+1+b_{ k }) / Γ(j))^β_{ k }]
(Γ(j))^{η}j^[a_{ k } α_{ k }] / j^[ (b_{ k }+1)β_{ k }]=
(Γ(j))^{η}j^(ρβ),
and Lemma is proved.
Proof of Proposition 3
It follows from the proof of the Lemma that
f_{ i } = C ν/λ θ^{i} [(Γ(j+a_{ k }))^α_{ k }] / [(Γ(j+1+b_{ k }))^β_{ k }] for i > 1,
where C = [(Γ(1+b_{ k }))^β_{ k }] / [(Γ(1+a_{ k }))^α_{ k }].
Let us show that f_{1} can be expressed by the same formula if i = 1. Indeed,
C ν/δ [(Γ(1+a_{ k }))^α_{ k }] / [(Γ(1+1+b_{ k }))^β_{ k }=
ν/δ (
(Γ(1+b_{ k }))^β_{ k } / (Γ(1+a_{ k }))^α_{ k })) ( (Γ(1+a_{ k }))^α_{ k } / (Γ(2+b_{ k }))^β_{ k }=
ν/δ (
(Γ(1+b_{ k }))^β_{ k } / (Γ(2+b_{ k }))^β_{ k } = ν/δ ( (1+b_{ k }))^β_{ k } = f_{1}
Thus,
F_{ eq } = C ν/δ ( θ^{j1} (Γ(j+a_{ k }))^α_{ k }/ (Γ(j+1+b_{ k }))^β_{ k }).
QED.
Contributions of individual authors
GPK developed most of the mathematical formalism and wrote the draft of the mathematical part of the manuscript; YIW performed the identification of domain in sequenced genomes and the statistical analysis of the resulting distributions and wrote the draft of the corresponding part of the manuscript; FSB proved some of the theorems; AYR largely incepted the work and contributed to the formulation of the models; EVK contributed to the inception of the work and the formulation of the models, gave the biological interpretation of the results, wrote the background and discussion sections and extensively edited the entire manuscript.
References
 1.
Koonin EV, Aravind L, Kondrashov AS: The impact of comparative genomics on our understanding of evolution. Cell. 2000, 101: 573576.
 2.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, StangeThomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860921. 10.1038/35057062.
 3.
Dacks JB, Doolittle WF: Reconstructing/deconstructing the earliest eukaryotes: how comparative genomics can help. Cell. 2001, 107: 419425.
 4.
Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, Botstein D: Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science. 1998, 282: 20222028. 10.1126/science.282.5396.2022.
 5.
Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, Cherry JM, Henikoff S, Skupski MP, Misra S, Ashburner M, Birney E, Boguski MS, Brody T, Brokstein P, Celniker SE, Chervitz SA, Coates D, Cravchik A, Gabrielian A, Galle RF, Gelbart WM, George RA, Goldstein LS, Gong F, Guan P, Harris NL, Hay BA, Hoskins RA, Li J, Li Z, Hynes RO, Jones SJ, Kuehl PM, Lemaitre B, Littleton JT, Morrison DK, Mungall C, O'Farrell PH, Pickeral OK, Shue C, Vosshall LB, Zhang J, Zhao Q, Zheng XH, Zhong F, Zhong W, Gibbs R, Venter JC, Adams MD, Lewis S: Comparative genomics of the eukaryotes. Science. 2000, 287: 22042215. 10.1126/science.287.5461.2204.
 6.
Koonin EV, Tatusov RL, Rudd KE: Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications. Proc Natl Acad Sci U S A. 1995, 92: 1192111925.
 7.
Brenner SE, Hubbard T, Murzin A, Chothia C: Gene duplications in H. influenzae. Nature. 1995, 378: 14010.1038/378140a0.
 8.
Labedan B, Riley M: Widespread protein sequence similarities: origins of Escherichia coli genes. J Bacteriol. 1995, 177: 15851588.
 9.
Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, Borodovsky M, Rudd KE, Koonin EV: Metabolism and evolution of Haemophilus influenzae deduced from a wholegenome comparison with Escherichia coli. Curr Biol. 1996, 6: 279291.
 10.
Ponting CP, Schultz J, Milpetz F, Bork P: SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. 1999, 27: 229232. 10.1093/nar/27.1.229.
 11.
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, GriffithsJones S, Howe KL, Marshall M, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2002, 30: 276280. 10.1093/nar/30.1.276.
 12.
MarchlerBauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH: CDD: a database of conserved domain alignments with links to domain threedimensional structure. Nucleic Acids Res. 2002, 30: 281283. 10.1093/nar/30.1.281.
 13.
Jordan IK, Makarova KS, Spouge JL, Wolf YI, Koonin EV: Lineagespecific gene expansions in bacterial and archaeal genomes. Genome Res. 2001, 11: 555565. 10.1101/gr.GR1660R.
 14.
Lespinet O, Wolf YI, Koonin EV, Aravind L: The role of lineagespecific gene family expansion in the evolution of eukaryotes. Genome Res. 2002, 12: 10481059. 10.1101/gr.174302.
 15.
Aravind L, Watanabe H, Lipman DJ, Koonin EV: Lineagespecific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci U S A. 2000, 97: 1131911324. 10.1073/pnas.200346997.
 16.
Huynen MA, van Nimwegen E: The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol. 1998, 15: 583589.
 17.
Qian J, Luscombe NM, Gerstein M: Protein family and fold occurrence in genomes: powerlaw behaviour and evolutionary model. J Mol Biol. 2001, 313: 673681. 10.1006/jmbi.2001.5079.
 18.
Rzhetsky A, Gomez SM: Birth of scalefree molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics. 2001, 17: 988996. 10.1093/bioinformatics/17.10.988.
 19.
Wuchty S: Scalefree behavior in protein domain networks. Mol Biol Evol. 2001, 18: 16941702.
 20.
Barabasi AL: Linked: The New Science of Networks. 2002, New York: Perseus Pr
 21.
Barabasi AL, Albert R: Emergence of scaling in random networks. Science. 1999, 286: 509512. 10.1126/science.286.5439.509.
 22.
Albert R, Barabasi AL: Statistical mechanics of complex networks. Reviews of Modern Physics. 2002, 74: 4797. 10.1103/RevModPhys.74.47.
 23.
Albert R, Jeong H, Barabasi AL: Error and attack tolerance of complex networks. Nature. 2000, 406: 378382. 10.1038/35019019.
 24.
Amaral LA, Scala A, Barthelemy M, Stanley HE: Classes of smallworld networks. Proc Natl Acad Sci U S A. 2000, 97: 1114911152. 10.1073/pnas.200327197.
 25.
Gisiger T: Scale invariance in biology: coincidence or footprint of a universal mechanism?. Biol Rev Camb Philos Soc. 2001, 76: 161209. 10.1017/S1464793101005607.
 26.
Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411: 4142. 10.1038/35075138.
 27.
Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The largescale organization of metabolic networks. Nature. 2000, 407: 651654. 10.1038/35036627.
 28.
Dorogovtsev SN, Mendes JF: Scaling properties of scalefree evolving networks: Continuous approach. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 2001, 63: 05612510.1103/PhysRevE.63.056125.
 29.
Krapivsky PL, Redner S: Organization of growing random networks. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 2001, 63: 06612310.1103/PhysRevE.63.066123.
 30.
Luscombe N, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: powerlaw behaviour applies to a wide variety of genomic properties. Genome Biol. 2002, 3: research0040.00410040.0047. 10.1186/gb200238research0040.
 31.
Kuznetsov VA: Statistics of the numbers of transcripts and protein sequences encoded in the genome. In: 'Computational and Statistical Approaches to Genomics. Edited by: Zhang W, Shmulevich I. 2002, Boston: Kluwer, 125171.
 32.
Koonin EV, Makarova KS, Aravind L: Horizontal gene transfer in prokaryotes: quantification and classification. Annu Rev Microbiol. 2001, 55: 709742. 10.1146/annurev.micro.55.1.709.
 33.
Feller W: An introduction to probability theory and its application. New York: Wiley, 1967–1968
 34.
Ijuri Y, Simon HA: Skew distributions and the sizes of business firms. 1977, Amsterdam, New York, Oxford: NorthHolland Publishing Company
 35.
Gihman II, Skorohod AV: The theory of stochastic processes. 1975, NewYork, Heidelberg, Berlin: SpringerVerlag
 36.
Johnson NL, Kotz S, Kemp AW: Univariate discrete distributions. 1992, New York: Wiley
 37.
Henrici P: Applied and computational complex analysis. 1986, New York: Wiley
 38.
Wolf YI, Grishin NV, Koonin EV: Estimating the number of protein folds and families from complete genome data. J Mol Biol. 2000, 299: 897905. 10.1006/jmbi.2000.3786.
 39.
Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol. 1997, 44: 383397.
 40.
Gould SJ: The Structure of Evolutionary Theory. 2002, Cambridge, MA: Harvard Univ. Press
 41.
Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genomescale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28: 3336. 10.1093/nar/28.1.33.
 42.
Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and inparalogs from pairwise species comparisons. J Mol Biol. 2001, 314: 10411052. 10.1006/jmbi.2000.5197.
 43.
Gantmacher FR: The theory of matrices. 1989, New York: Chelsea Publishing Company
Acknowledgements
We thank Alexei Kondrashov, Alexei Ogurtzov, and Vladimir Ponomarev for critical reading of the manuscript and the Koonin group members for helpful discussions.
Author information
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Karev, G.P., Wolf, Y.I., Rzhetsky, A.Y. et al. Birth and death of protein domains: A simple model of evolution explains power law behavior. BMC Evol Biol 2, 18 (2002). https://doi.org/10.1186/14712148218
Received:
Accepted:
Published:
Keywords
 Horizontal Gene Transfer
 Domain Family
 Generalize Pareto Distribution
 Innovation Rate
 Equilibrium Size