The WRKY transcription factor superfamily: its origin in eukaryotes and expansion in plants

Background WRKY proteins are newly identified transcription factors involved in many plant processes including plant responses to biotic and abiotic stresses. To date, genes encoding WRKY proteins have been identified only from plants. Comprehensive search for WRKY genes in non-plant organisms and phylogenetic analysis would provide invaluable information about the origin and expansion of the WRKY family. Results We searched all publicly available sequence data for WRKY genes. A single copy of the WRKY gene encoding two WRKY domains was identified from Giardia lamblia, a primitive eukaryote, Dictyostelium discoideum, a slime mold closely related to the lineage of animals and fungi, and the green alga Chlamydomonas reinhardtii, an early branching of plants. This ancestral WRKY gene seems to have duplicated many times during the evolution of plants, resulting in a large family in evolutionarily advanced flowering plants. In rice, the WRKY gene family consists of over 100 members. Analyses suggest that the C-terminal domain of the two-WRKY-domain encoding gene appears to be the ancestor of the single-WRKY-domain encoding genes, and that the WRKY domains may be phylogenetically classified into five groups. We propose a model to explain the WRKY family's origin in eukaryotes and expansion in plants. Conclusions WRKY genes seem to have originated in early eukaryotes and greatly expanded in plants. The elucidation of the evolution and duplicative expansion of the WRKY genes should provide valuable information on their functions.


Background
Transcriptional control is a major mechanism whereby a cell or organism regulates its gene expression. Sequence-specific DNA-binding transcription regulators, one class of transcription factors [1], play an essential role in modulating the rate of transcription of specific target genes. In this way, they direct the temporal and spatial expressions necessary for normal development and proper response to physiological or environmental stimuli. Comparative genome analysis reveals that genes for transcription regulators are abundantly present in plant and animal genomes, and the evolution and diversity of eukaryotes seem to be related to the expansion of lineage-specific transcription regulator families [2].
The WRKY family proteins contain one or two highly conserved WRKY domains characterized by the hallmark heptapeptide WRKYGQK and a zinc-finger structure distinct from other known zinc-finger motifs [3]. To regulate gene expression, the WRKY domain binds to the W box in the promoter of the target gene to modulate transcription [5,7,16,24]. In addition to the W box, a recent study indicates that the WRKY domain can also bind to SURE, a sugar responsive cis element, as a transcription activator [9].
Based on the number of WRKY domains and the pattern of the zinc-finger motif, Eulgem et al. [3] classified members of the WRKY superfamily from the Arabidopsis genome into three groups. Members of Group 1 typically contain two WRKY domains, while most proteins with one WRKY domain belong to Group 2. Group 3 proteins also have a single WRKY domain, but the pattern of the zinc-finger motif is unique. Eulgem et al. [3] further divided Group 2 into five subgroups, according to the phylogenetic analysis of the WRKY domains.
Given the large family of WRKY genes with divergent regulatory functions in important plant processes, it would be desirable to understand the evolutionary origin and gene duplications leading to the multi-member WRKY family. The clarification of the phylogenetic relationships among WRKY genes in model plants will also assist understanding of the functions of these genes in important crops. We have comprehensively searched all currently available sequence data for the existence of WRKY genes outside the plant kingdom. Homologues of WRKY genes are found from two eukaryotic species: Giardia lamblia, a primitive protozoan, and Dictyostelium discoideum, a slime mold. The data indicate an early origin of WRKY genes in eukaryota and tremendous gene amplifications in the plant lineage. We then cataloged the WRKY genes from the rice genome and compared them with Arabidopsis WRKY genes. We also identified WRKY genes from expressed sequence tags (ESTs) and EST-assembled sequence contigs from nineteen plant species. The result suggests that WRKY gene duplication events correlate with the increasing structural and functional complexities in land plants. We propose a model for the evolution of WRKY genes.

WRKY genes in non-plant eukaryotes
We searched for WRKY genes in two comprehensive datasets, GenBank's non-redundant (nr) and dbEST of all species. Together these datasets contain over 13 million sequence records from more than 110,000 organisms [37]. Homologues of WRKY proteins are not found in the superkingdoms of archaea and eubacteria. In eukaryotes, no WRKY genes are identified from the lineages of fungi and animals.
Interestingly, two WRKY homologues were identified from non-plant eukaryotic species, and both have two WRKY domains [see Additional files 1 and 2]. The first protein (GenBank accession: EAA40901) is encoded by an intronless gene in the draft genome sequence of Giardia lamblia [38]. The unicellular protist Giardia is one of the most primitive organisms that represent the earliest branching among extant eukaryotes [39,40]. The second (accession AAO52331) is encoded by the genomic sequence of chromosome 2 of the slime mold Dictyostelium discoideum [41]. The genomic sequence for the WRKY domains was assembled from sequences generated from three libraries prepared by two groups [42], indicating that it is not from sequence contamination. The gene contains an intron, which interrupts the coding region between the two WRKY domains. For this species, about 150,000 EST sequences are currently available in GenBank. One EST (accession AU033476) aligns to the WRKY gene, indicating that the gene is expressed. D. discoideum belongs to the Mycetozoa, a lineage more closely related to animals and fungi than to green plants [41,43].

A WRKY gene in a green alga
Chlamydomonas reinhardtii is a unicellular green alga with a cell wall. It also has chloroplasts for photosynthesis. The evolutionary position of the species is located before the divergence of land plants [44,45]. The release 1.0 of its genome sequence has approximately 9 × whole genome shotgun coverage [46]. Since the gene annotation for the release is still at a preliminary stage, we predicted WRKY genes from the genome sequence (see Methods). The sequence similarity search between the genome sequence and Pfam's WRKY domain sequences indicated that the sequence 'Scaffold_1387' may encode WRKY domains. This sequence was then used for further WRKY domain and gene predictions. Despite minor differences in the gene structure prediction, both gene prediction programs FGENESH and GENSCAN agree on the major features of the protein, including the presence of two WRKY domains [see Additional files 1 and 2]. Moreover, the predicted peptide sequence of the WRKY domains is identical among all the gene and domain predictions. Sequence alignment by blastn indicates that six ESTs are from the predicted coding regions of the gene; the GenBank accessions for these ESTs are BI727288, AW772895, BM000804, BG846749, BE121978 and BQ821537.

A catalog of WRKY genes in rice
Rice, one of the most important crops for world agriculture, is recognized as a model monocot for the study of cereal crop genomes. A comprehensive catalog of rice WRKY genes would provide a basis for investigating the evolutionary patterns of the gene family and for transferring knowledge of the functions of these transcription factors from Arabidopsis to rice and from rice to other cereal crops.
We identified the members of the WRKY family in rice (Japonica variety) from its published genome sequence [47]. The WRKY gene identification procedure employed in this study (see Methods) was first tested with the Arabidopsis genome sequence. The procedure successfully identified all reported Arabidopsis WRKY genes [3,23]. The rice genome seems to encode 109 WRKY proteins, four of which have incomplete WRKY domains. The remaining 105 proteins with complete WRKY domains, listed in Additional file 3, were used for further analysis. The multiple sequence alignment of WRKY domains from rice, Arabidopsis, the green alga, the slime mold and Giardia lamblia, and the conserved WRKY domain patterns can be found in Additional file 2. Some rice genes encode identical WRKY domains. For example, OsWRKY34 and OsWRKY57 share identical amino acid sequences in the WRKY domains, but the nucleotide sequences for the domains are not identical and they are located on different chromosomes (1 and 4, respectively), indicating that they are distinct genes. Similarly, OsWRKY8 on Chromosome 6 and OsWRKY76 on Chromosome 2 also represent two genes. The genes in the following parentheses share identical WRKY domains and have high identity in the corresponding coding nucleotide sequences: (OsWRKY9, 101), (12, 98 and 99), (21, 97), (29, 96), (39, 105), (51, 103), (73, 104), (80, 102), and (82, 100). These highly similar genes may represent newly duplicated paralogues. The 105 genes are unevenly distributed across the 12 chromosomes, ranging from 25 genes (the highest number) on Chromosome 1 to two genes (the lowest) on Chromosome 10. Sequence alignment indicates that 60 WRKY genes have one or more matched rice ESTs from the dbEST database (data not shown). Out of the 105 proteins, 13 have two WRKY domains. We assigned the WRKY domains to subfamilies using phylogenetic analysis with already classified AtWRKY genes from A. thaliana [3] as the reference.
Eleven proteins with two WRKY domains are assigned to Group 1 because their C-terminal domains belong to this group. Since the N- and C-terminal domains form distinct clusters, we designated the two domains as 1N and 1C, respectively. Six proteins with a single domain also belong to Group 1. While OsWRKY15, 16, 73 and 104 have a single domain homologous to Group 1N, OsWRKY13 and 91 contain a single Group 1C domain. Interestingly, both the N- and C-terminal domains of the other two double-domain-containing proteins (OsWRKY66 and 67) are always clustered with Group 3 domains. Thirty-five single WRKY domain proteins are also assigned to this group. Altogether, there are 39 domains or 37 proteins in Group 3. We assigned 49 proteins to three new groups, Group 2_a + 2_b (13), Group 2_c (18), and Group 2_d + 2_e (18). These new groups are reorganized from the five subgroups IIa through IIe in Eulgem et al. [3] (see details of the classification in Discussion). Domains of OsWRKY25 and 95 cannot be consistently classified and therefore remain unassigned [see Additional file 3].
Interestingly, several variant patterns of the WRKY domains exist in the rice WRKY proteins. Although the WRKYGQK peptide is highly conserved, nine variants with one or two amino acids substituted are observed in 19 domains, most of which belong to Groups 3 and 2_c (Table 1). While WRKYGEK and WRKYGKK are two common variants shared by seven (all in Group 3) and five (all in Group 2_c) domains, respectively, each of the other seven different heptapeptides occurs in only one protein.
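The notion of a heptapeptide variant with one or two substituted amino acids can be made concrete with a short script. The following is only a minimal sketch under stated assumptions, not the procedure used in this study: it slides a seven-residue window along a protein sequence and reports windows that differ from WRKYGQK at no more than two positions. The function names, the `max_subs` parameter and the toy sequences are hypothetical, and on real proteins such a scan would still need the manual inspection of the surrounding domain described in the text.

```python
CANONICAL = "WRKYGQK"  # hallmark heptapeptide named in the text


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_heptapeptides(protein: str, max_subs: int = 2):
    """Return (position, peptide, substitutions) for every 7-residue window
    within max_subs substitutions of the canonical heptapeptide.
    Positions are 0-based. max_subs=2 is an assumption mirroring the
    'one or two amino acids substituted' wording."""
    hits = []
    k = len(CANONICAL)
    for i in range(len(protein) - k + 1):
        window = protein[i:i + k]
        d = hamming(window, CANONICAL)
        if d <= max_subs:
            hits.append((i, window, d))
    return hits


# Toy sequences (hypothetical, for illustration only; not real WRKY proteins)
toy = {
    "canonical": "MADDGYNWRKYGQKQVKGS",
    "WRKYGEK variant": "MADDGYNWRKYGEKQVKGS",
    "WKKYGQK variant": "MADDGYNWKKYGQKQVKGS",
}
for name, seq in toy.items():
    print(name, find_heptapeptides(seq))
```

Counting substitutions by Hamming distance keeps the sketch independent of any fixed list of variants, at the cost of possible false positives in unrelated sequence.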
The WRKY domains also contain patterns of zinc-finger motifs that have not been reported in the literature (Table 1). No variants are found in domains of Groups 1C and 2_a + 2_b. The WRKY genes encoding the variant domain patterns might be functional, because 10 genes with a total of seven heptapeptide variants and two zinc-finger motif variants have sequenced ESTs, although the DNA binding capacity may be reduced [48]. Furthermore, ESTs have been sequenced from the gene regions for the variants of WRKYGEK, WRKYGKK, WKKYGQK and C_X6_C_X28_H_X1_C, indicating that these patterns are not artifacts of the gene prediction (Table 1).
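A zinc-finger pattern written in the notation used here, such as C_X6_C_X28_H_X1_C, reads as a conserved residue followed by a fixed number of arbitrary residues, and so on. As a hedged illustration only, the sketch below translates that notation into a regular expression that can be matched against a peptide. The function name `motif_to_regex` is hypothetical, the optional range form (for example `X4-5`) is an assumption that does not appear in this notation in the text, and the toy peptide is artificial.

```python
import re


def motif_to_regex(motif: str) -> str:
    """Translate a motif such as 'C_X6_C_X28_H_X1_C' into a regex string.

    'Xn' means exactly n arbitrary residues; 'Xn-m' (an assumed extension)
    means between n and m residues; any single capital letter is a literal
    conserved residue.
    """
    parts = []
    for token in motif.split("_"):
        spacer = re.fullmatch(r"X(\d+)(?:-(\d+))?", token)
        if spacer:
            lo, hi = spacer.group(1), spacer.group(2)
            parts.append(f"[A-Z]{{{lo},{hi}}}" if hi else f"[A-Z]{{{lo}}}")
        elif re.fullmatch(r"[A-Z]", token):
            parts.append(token)
        else:
            raise ValueError(f"unrecognised token: {token!r}")
    return "".join(parts)


pattern = motif_to_regex("C_X6_C_X28_H_X1_C")
print(pattern)

# Artificial peptide built to fit the pattern (hypothetical, illustration only)
toy_peptide = "C" + "A" * 6 + "C" + "A" * 28 + "H" + "A" + "C"
print(bool(re.search(pattern, toy_peptide)))
```

Keeping the translation separate from the search makes it easy to apply the same routine to other patterns from Table 1 written in the same notation.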

Survey of WRKY genes in land plants
Since the genomes of rice and Arabidopsis have numerous WRKY genes whereas the green alga may have only a single copy, it would be interesting to investigate the gene duplication events of WRKY family during the course of evolution from unicellular plant organisms to flowering plants and the relationship between expansion of the WRKY family and the increased structural and functional complexities of the higher plants. Ideally, the complete set of WRKY genes should be identified from species representing different branches on the evolutionary tree of plants for further analysis. Unfortunately, genome sequence is currently not available for most plant species. However, a large number of EST sequences for many plants are publicly available and can be used to roughly estimate the minimum number of WRKY genes in these species.
We first surveyed GenBank's dbEST set and found that WRKY genes are widespread in land plants, as over 40 species have expressed WRKY genes (data not shown). We then estimated the number of unique WRKY genes for 17 species using their Gene Indices, which are assembled EST sequence contigs with the minimal redundancy, provided by The Institute for Genomic Research (TIGR) [49]. The analysis also included ESTs for the moss Physcomitrella patens and the fern Ceratopteris richardii whose Gene Indices are not available [see Additional file 4]. For the EST set, redundant ESTs for WRKY proteins were manually removed. Together these 19 species represent different branches on the evolutionary tree of the land plants. While the moss Physcomitrella is an early diverged land

Phylogeny of the WRKY domains
To examine the evolutionary relationships among the WRKY domains, we estimated the phylogeny by using the neighbor-joining program from PHYLIP 3.57 for the amino acid sequences of WRKY domains from G. lamblia, the slime mold, the green alga, Arabidopsis and rice. The phylogenetic relationships were also inferred with the programs of the least squares and parsimony from PAUP 4.0 for the corresponding nucleotide sequences. We also did the same analysis for the rice dataset alone. The topology of trees obtained from these analyses is essentially the same, and the neighbor-joining tree is shown in Figure 1. Group 2 domains designated by Eulgem et al. [3] are not monophyletic, but form three distinct clades. These include: 2_a + 2_b, 2_c, and 2_d + 2_e. Moreover, Group 2_a + 2_b and Group 2_c are closely related to Group 1C domains, while Group 3 is clustered with Group 2_d + 2_e. In addition, the rice and Arabidopsis WRKY trees (not shown) consistently clustered WRKY1N domains as a monophyletic subtree and all other domains as a natural clade, supporting the suggestion that Groups 2 and 3 domains are more closely related to the C-terminal domains of Group 1 genes than to the N-terminal domains [3].
In flowering plants, genes encoding WRKY domains appear to have been duplicated independently in monocots and dicots. For Group 3 domains, three subsets, each of which consists of five or more members only from rice, can be distinguished in the phylogram shown in Figure 2. Similarly, six WRKY domains, all from Arabidopsis, are clustered together. Independent domain clusters of either species are also found in other WRKY subfamilies (data not shown). These results suggest that numerous duplications and diversifications of WRKY genes, particularly Group 3 genes, have occurred after the divergence of the monocots and dicots. Indeed, all rice WRKY domains with the sequence WRKYGEK (Table 1) are classified as a sub-cluster of the largest rice domain cluster in Group 3 (Figure 2), implying that multiple duplication events led to this large cluster in rice.

Discussion
WRKY genes seem to be an innovation in eukaryota after the divergence of eubacteria-archaea-eukaryota. In eukaryotes, the WRKY genes are present in the green plants as well as in the ancient eukaryote G. lamblia and the mycetozoan D. discoideum, but not in fungi and animals. G. lamblia is a primitive unicellular eukaryote that diverged ~1,500 million years ago (mya) [51]. Originally thought to be plant-specific [2,3], the WRKY transcription factors therefore seem to have an early origin in eukaryotes. As the mycetozoa is closely related to the fungi-animal clade [41,43], the WRKY gene(s) may have been lost prior to the divergence of fungi and animals, but after the split of the slime mold and fungi-animal lineages.
Figure 1. Unrooted phylogenetic tree of the WRKY domains. The tree was reconstructed from the amino acid sequences using the neighbor-joining program from PHYLIP 3.57. Clades of WRKY domains are labelled according to the classifications of AtWRKY domains by Eulgem et al. [3], who proposed three groups and five subgroups in Group 2 (a, b, c, d and e). We suggest classifying WRKY domains into five groups modified from the old system. While Groups 1 and 3 are unchanged, the original subgroup 2_c is promoted to Group 2_c. Subgroups 2_a and 2_b, and subgroups 2_d and 2_e are combined to form two new groups, 2_a + 2_b, and 2_d + 2_e, respectively (see text for details). WRKY domains from G. lamblia are represented by thick and dark-green branches; the slime mold, thick and cyan; the green alga, thick and magenta; Arabidopsis, thin and blue; and rice, thin and red.
Based on the current data, we propose a model for the origin and evolution of the WRKY factor family (Figure 3). First, the ancestor of the descendant WRKY genes found in G. lamblia, the slime mold and the green alga seems to be a Group 1 gene encoding two WRKY domains. The conservation of the C- and N-terminal domains suggests that they are derived from a single domain by domain duplication. Therefore we hypothesize that the earliest WRKY factor had one WRKY domain and that the gene arose after the first appearance of eukaryotes ~2,500 mya [52] but prior to the divergence leading to the Giardia protist, ~1,500 mya. Second, our data and the previous results by Eulgem et al. [3] suggest that the WRKY domains of groups 2_a + 2_b, 2_c, 2_d + 2_e and 3 are evolutionarily close to the WRKY1C domain. It seems that Group 1 genes which contain only the C-terminal WRKY domain are ancestors of the descendant WRKY genes in other groups. The N-terminal domain in Group 1 genes may have been lost prior to the gene duplication. As the green alga may have only one WRKY gene which belongs to Group 1, the duplications and diversifications leading to other groups in plants probably occurred some time after the divergence of chlorophytes and streptophytes, ~800 mya [53]. Third, the domain structure conservation [see Additional file 2] and the phylogenetic analysis (Figure 1) suggest that the three distinct subsets, Groups 2_a + 2_b, 2_c, and 2_d + 2_e, may have evolved independently from the Group 1 genes which have only the C-terminal domain. In addition, Group 3 genes appear to share a common ancestor with the clade 2_d + 2_e. The identification of 2_c and 2_d + 2_e genes in the moss EST data [see Additional file 4] suggests that the duplications of the genes in these groups predate the diversification of bryophytes, ~420 mya [50]. Although the WRKY genes in Group 2_a + 2_b and Group 3 are identified only from flowering plants in the current data, the origin of these genes seems to have occurred prior to the divergence of monocots and dicots, because the characteristic features of the WRKY domains in Group 3 are highly conserved in Arabidopsis and rice. In addition, multiple copies of Group 3 genes may have existed in the common ancestor of monocots and dicots, since clusters with nested Arabidopsis and rice sequences are found in the group (Figure 2).
Figure 2. Phylogram of Group 3 WRKY domains from Arabidopsis (AtWRKY) and rice (OsWRKY). The amino acid sequences were analysed with the neighbor-joining and parsimony algorithms implemented in PHYLIP 3.57. Bootstrap values ≥ 50% are indicated above the nodes for distance analysis. The C-terminal domain AtWRKY1C was used as the outgroup. OsWRKY proteins with the variant WRKYGEK are marked by *.
The classification of the WRKY family in Arabidopsis by Eulgem et al. [3] is not completely based on phylogenetic analysis and therefore does not necessarily reflect the evolutionary relationships among the groups. This is even apparent for the tree of AtWRKY genes built by the authors (see their Figure 3). For example, their Group 2 is not monophyletic, but seems to have several ancestors. Obviously it is necessary to implement a new classification scheme for the WRKY family to reflect the evolution of the WRKY domains. Based on phylogenetic analysis (

Conclusions
Originally believed to be plant-specific, the WRKY transcription factor family has an early origin in eukaryotes and is also present in a slime mold, which is more closely related to the fungi-animal lineage than to plants. WRKY genes have been duplicated many times during evolution in plants, resulting in a large gene family for WRKY proteins in flowering plants. The elucidation of the evolutionary pathway of the WRKY family and the new classification system we propose, based on phylogenetic analysis, conserved WRKY domain structures and intron positions, should assist the functional characterization of WRKY genes.

WRKY gene identification
We searched the 'nr' and dbEST datasets for WRKY genes in species outside the plant phyla. The dbEST dataset was also used to survey the expressed WRKY genes in plant species. We aligned the sequences in the datasets with WRKY-seed using BLAST programs [61]. To determine the taxonomic distribution of WRKY genes from the BLAST output, we constructed a database where the BLAST results, the subject sequences and their associated taxonomy information from NCBI [58] were stored. The significant hits (E < 10⁻⁴) were parsed and manually checked for the presence of the characteristic features of the WRKY domain.
To systematically catalog the WRKY genes for rice and G. lamblia, we searched their genome sequences with blastp and PSI-BLAST [61] using WRKY-seed as the query. For PSI-BLAST, we used the default settings for three iterations. We also searched for WRKY genes with HMMER using the global profile of the WRKY domain [60]. HMMER, a sequence analysis tool based on profile hidden Markov models [62], is available at [63]. The search results with the threshold of E < 10⁻⁴ for blastp and PSI-BLAST and E < 0.1 for HMMER were manually compared to remove non-WRKY hits. We also used the same strategy to identify the set of WRKY genes from the Arabidopsis genome.
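The thresholding step that precedes the manual comparison can be sketched as follows. This is a minimal illustration, not the scripts used in this study: it assumes the hits from each program have already been parsed into (method, sequence identifier, E-value) tuples, and the function name `candidates`, the method labels and the toy hits are hypothetical. Only the E-value cut-offs are taken from the text.

```python
# E-value cut-offs stated in the text: 10^-4 for blastp and PSI-BLAST, 0.1 for HMMER
THRESHOLDS = {"blastp": 1e-4, "psiblast": 1e-4, "hmmer": 0.1}


def candidates(hits):
    """Collect sequences that pass the cut-off of at least one method.

    hits: iterable of (method, sequence_id, e_value) tuples (assumed format).
    Returns {sequence_id: set of methods that reported it below threshold}.
    """
    passed = {}
    for method, seq_id, e_value in hits:
        if e_value < THRESHOLDS[method]:
            passed.setdefault(seq_id, set()).add(method)
    return passed


# Toy hits (hypothetical identifiers and E-values, illustration only)
toy_hits = [
    ("blastp", "seqA", 1e-20),
    ("hmmer", "seqA", 1e-8),
    ("hmmer", "seqB", 0.05),
    ("blastp", "seqB", 0.5),
    ("psiblast", "seqC", 1e-3),
]
for seq_id, methods in sorted(candidates(toy_hits).items()):
    print(seq_id, sorted(methods))
```

Recording which methods support each candidate, rather than a simple yes or no, keeps the information a curator would want when comparing the result sets by hand.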
To identify WRKY genes from the green alga, we first searched its genome sequence against WRKY-seed with BLAST. The significantly aligned sequences (E < 10⁻⁴) were then subjected to WRKY domain and gene predictions. The WRKY domain was predicted with Pfam's DNA SEARCH [64], a web interface backed by the GeneWise algorithm [65]. The WRKY gene was predicted by FGENESH using the profile for monocots [66,67] and GENSCAN using the profile for maize [68,69].

Analysis of WRKY genes
The WRKY domain boundary was defined as in Eulgem et al. [3]. The peptide sequences of the domains were aligned with ClustalX (v1.81, with default settings) [70] and the alignment was adjusted based on the conserved features of the WRKY domains. The results were then used to guide the alignment of the corresponding nucleotide sequences. Phylogenetic trees were reconstructed with the neighbor-joining algorithm implemented in PHYLIP 3.573c [71] for amino acid sequences, with pairwise distances computed under the PAM model, and with the least-squares fit and maximum parsimony algorithms in PAUP* 4.0b10 [72] for nucleotide sequences.
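The step in which the peptide alignment guides the nucleotide alignment can be sketched as follows. This is a minimal illustration, not the procedure or software used in this study: the function name `thread_codons` and the toy sequences are hypothetical, and the sketch assumes that each coding sequence is in frame, contains no introns or stop codon, and is exactly three times as long as the number of aligned residues.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Place the codons of an unaligned coding sequence onto a gapped
    protein alignment row: each residue takes the next three nucleotides,
    and each protein gap becomes a three-nucleotide gap."""
    n_residues = sum(ch != gap for ch in aligned_protein)
    if len(cds) != 3 * n_residues:
        raise ValueError(
            f"CDS length {len(cds)} does not equal 3 x {n_residues} residues"
        )
    out = []
    pos = 0  # index of the next unused nucleotide in the CDS
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


# Toy example (hypothetical): peptide WRKY aligned with one gap
print(thread_codons("WR-KY", "TGGAGGAAATAT"))
```

Because gaps are inserted only in multiples of three, the resulting nucleotide alignment preserves the reading frame and the column correspondence of the peptide alignment.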