Genome-wide comparative phylogenetic analysis of the rice and Arabidopsis Dof gene families

Background Dof proteins are a family of plant-specific transcription factors that contain a particular class of zinc-finger DNA-binding domain. Members of this family have been found to play diverse roles in the regulation of processes restricted to plants. The completed genome sequences of rice and Arabidopsis constitute a valuable resource for comparative genomic analyses, since they are representatives of the two major evolutionary lineages within the angiosperms. In this framework, the identification of phylogenetic relationships among Dof proteins in these species is a fundamental step towards unravelling the function of new and as yet uncharacterised genes belonging to this group. Results We identified 30 different Dof genes in the rice Oryza sativa genome and performed a phylogenetic analysis of the complete collection of the 36 reported Arabidopsis thaliana Dof transcription factors and the rice Dof transcription factors identified herein. This analysis led to a classification into four major clusters of orthologous genes and revealed gene loss and duplication events in Arabidopsis and rice that occurred before and after the last common ancestor of the two species. Conclusions According to our analysis, the Dof gene family in angiosperms is organized into four major clusters of orthologous genes, or subfamilies. The proposed clusters of orthology and their further analysis suggest the existence of monocot-specific genes and invite exploration of their function in relation to the distinct physiological characteristics of these evolutionary groups.


Background
Detailed analyses of completely sequenced genomes reveal that a significant percentage of the encoded proteins correspond to transcription factors (TFs). These can be classified into several gene families according to the presence of particular DNA-binding domains [1][2][3][4][5][6][7][8]. However, the analysis of a particular transcription factor should be done in the context of the family to which it belongs, taking into account that functional redundancy is very frequent among eukaryotic TFs [9]. Moreover, transcription factors operate in complex networks based on protein-protein interactions and are often organized into regulatory cascades. Due to their crucial role in the regulation of gene expression, the study of TFs is of outstanding interest, and for the reasons stated above it should ideally be done from a genomic perspective [10,11].
The complete genomic sequence of Arabidopsis thaliana [3] and the shotgun-quality genomic sequence of Oryza sativa [12][13][14][15] have recently been obtained, providing model plants for dicotyledonous and monocotyledonous species, respectively. In both the Arabidopsis and rice genomes, several groups of plant-specific TFs have been described that are of great interest, since they may be involved in the regulation of events restricted to the plant kingdom [11]. One of these groups is the Dof (DNA binding with one finger) family, a particular class of zinc finger domain TFs [16,17] characterized by a conserved region of 50 amino acids with a C2-C2 finger structure, associated with a basic region, that binds specifically to DNA sequences with a 5'-T/AAAAG-3' core [18]. Dof proteins have been reported to participate in the regulation of gene expression in processes such as seed storage protein synthesis in developing endosperm [19,20], light regulation of genes involved in carbohydrate metabolism [21], plant defense mechanisms [22], seed germination [23][24][25], gibberellin response in post-germinating aleurone [26,27], auxin response [28][29][30] and stomata guard cell specific gene regulation [31].
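As an illustration of the binding core mentioned above, the following minimal sketch scans a DNA sequence for the 5'-(T/A)AAAG-3' core on both strands. It is not part of the original analysis; the function names and the example sequences are hypothetical, and only the core itself is taken from the text.

```python
import re

# Dof core site quoted in the text: 5'-(T/A)AAAG-3'.
# The lookahead lets overlapping occurrences be reported as well.
DOF_CORE = re.compile(r"(?=([TA]AAAG))")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_dof_cores(seq: str):
    """List (strand, start, site) for every core match.

    `start` is 0-based and counted 5'->3' on the strand that was scanned,
    so minus-strand positions refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    hits = []
    for strand, scanned in (("+", seq), ("-", reverse_complement(seq))):
        for match in DOF_CORE.finditer(scanned):
            hits.append((strand, match.start(1), match.group(1)))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real promoter.
    print(find_dof_cores("GGTAAAGCC"))
```

Because the core is only five nucleotides long, matches of this kind occur frequently by chance, so a scan like this can only suggest candidate sites.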
Because the hierarchical organization of genes reflects an ancient process of gene duplication and divergence, many of the theoretical and analytical tools of phylogenetic systematics can be used in comparative genomics [5]. Here, this analytical approach, successfully applied in Arabidopsis [32], was used to perform a phylogenetic characterization of all the Arabidopsis and rice Dof transcription factors. As a first step, we revised and annotated all the rice Dof genes and compared them with those from Arabidopsis. This phylogenetic analysis led to the definition of four clusters of orthologous rice and Arabidopsis genes and to the identification of the minimum complement of Dof genes present in the angiosperm common ancestor. We also discuss the relevance of recognizing groups of orthology as the basis for further characterization of Dof genes of unknown function.

Identification of a comprehensive set of Dof proteins from rice and Arabidopsis
Rice Dof genes
Oryza sativa ssp. japonica and O. sativa ssp. indica draft genome sequences were simultaneously released [13,14], followed more recently by the high-quality finished sequences of O. sativa ssp. japonica chromosomes 1 and 4 [33,34]. We analyzed both genomes in order to assemble a complete and non-redundant set of rice Dof genes. The nucleotide and deduced amino acid sequences of the Dof domains were used to perform independent BLAST searches [35] through several rice databases: Rice TIGR db, DDBJ and TMRI Rice Genome Database (for the japonica genome) and the NCBI O. sativa BLAST page (for the indica genome). A total of 30 non-redundant Dof transcription factors were identified in japonica and indica. Among them, twenty-seven sequences were almost identical in both subspecies, while two japonica genes were not clearly identified in indica and one indica gene was not clearly identified in japonica. To explore this discrepancy, we used the complete sequences of the japonica genes partially detected in indica to perform a BLAST search in the indica database, and vice versa. Portions of the two genes partially detected in indica (OsDof-5 and OsDof-20) were found in the corresponding database. In the case of OsDof-5, a DNA fragment 3' to the Dof domain was identified as a perfect match to the japonica gene used as the query sequence. This fragment corresponds to a terminal region of a short contig assembly (AAAA01006178.1). The other indica sequence, OsDof-20, was found within a misassembled contig (AAAA01062012.1) as a completely homologous match to the japonica sequence. For the indica gene not clearly identified in japonica (OsDof-30), a DNA fragment 5' to the Dof domain was identified as a perfect match to the indica gene used as the query sequence. This fragment corresponds to a terminal region of a 6 kb contig assembly at the TMRI rice genome project (CL037947.70).
Since none of the indica and TMRI japonica contigs have been mapped to a rice genome, we were unable to provide the chromosome location for OsDof-29 and OsDof-30 in Table 1. Gene structures and the corresponding deduced amino acid sequences for all the indica Dof TFs and eight from japonica were processed with the help of the RiceGAAS annotation system. The remaining japonica genes were obtained from the rice TIGR annotation database. According to the predicted structures, approximately half of the rice Dof TFs (16) contain one or more introns (Table 1).

Arabidopsis Dof genes
A non-redundant and complete compilation of the Arabidopsis Dof genes was obtained from the At TIGR db and MIPS MATDB databases. A total of 36 annotated TFs belonging to the Dof gene family were extracted from these sources (Table 2). In a previous publication, Riechmann [11] indicated the existence of 37 Dof-encoding genes in Arabidopsis. However, the presence of several stop codons within the ORF of At1g65935 suggests that it is most probably a pseudogene [36], and it was therefore excluded from our analysis. Structural examination of the remaining 36 genes revealed the presence of introns in half of the sequences, generally placed upstream of the Dof domain. Of those, 15 contained just one intron (Table 2).

Phylogenetic analysis and recognition of Dof families in rice and Arabidopsis
In order to evaluate the evolutionary relationships among the rice Dof proteins, we performed a phylogenetic analysis based on their DNA-binding domain sequences (Figure 1). Pair-wise amino acid similarities were higher than 50%, a threshold conventionally used to classify a group of genes as a gene family [5,37]. Consistent with the unrooted tree obtained by the neighbor-joining algorithm (Figure 2B), four groups were defined (a, b, c and d), two of which were further divided into subgroups supported by the presence and position of introns (Table 1), bootstrap values and the occurrence of common protein motifs outside the Dof domain (Figure 3 and Table 3).
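The 50% pair-wise threshold can be checked programmatically once the domain sequences are aligned. The sketch below is an assumption-laden simplification: it computes percent identity (a stricter measure than the similarity referred to in the text, which would need a substitution matrix), skips gapped columns, and uses hypothetical function names and toy sequences.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairs_below(aligned: dict, threshold: float = 50.0):
    """Return (name1, name2, identity) for every pair under the threshold."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
        pid = percent_identity(s1, s2)
        if pid < threshold:
            flagged.append((n1, n2, pid))
    return flagged


if __name__ == "__main__":
    # Hypothetical aligned fragments, not real Dof domains.
    toy = {"x": "AAAA", "y": "AAAT", "z": "TTTT"}
    print(pairs_below(toy))
```

An empty result from `pairs_below` would correspond to the situation described in the text, where every pair exceeds the threshold.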
An equivalent phylogenetic analysis of Dof domain sequences was done in Arabidopsis. The unrooted tree inferred from the neighbor-joining analysis is displayed in Figure 2C. Our results show that the Arabidopsis Dof gene family can be organized into four groups or subfamilies (A, B, C and D). Groups B, C and D were further subdivided into subgroups, according to the same criteria applied in the analysis of the rice Dof proteins. In order to detect putative duplicated genes in the Arabidopsis genome, we examined sequence redundancy between pairs of closely related Dof proteins. Using the Arabidopsis Redundancy Viewer (MATDB), we found ten pairs of genes (Table 1) in genomic regions associated with major genomic duplication events that occurred in Arabidopsis [38][39][40].

Comparison of the Arabidopsis and rice Dof proteins and determination of orthology relationships
To evaluate the evolutionary relationships within the Dof gene family, we performed a combined phylogenetic analysis of the 66 Arabidopsis and rice sequences to obtain a joint tree (Figure 2A). The tree topology, as well as the group and subgroup organization, resembled those of the individual rice and Arabidopsis trees (Figures 2B and 2C). The tree presented in Figure 2A identified putative [...] Figure 2A were already displayed as paralogs in the respective trees (Figures 2B and 2C). Additionally, nearly all the Arabidopsis paralogs correspond to regions described as genomic redundancies (Table 2). Just three genes were found in unexpected locations in the combined tree. A possible reason for this could be the presence of an apparent ortholog from the other species, generating better support for the new location (i.e. OsDof-3 and At1g07640). On the other hand, the relocation of At5g65590 seems to be an artifact of sequence similarities across the Dof domain, since according to the conserved motifs outside the Dof domain, a location within the major cluster of orthologous genes (MCOG) Bb is more consistent (Figures 2A and 3).
Figure 1. Dof domain sequence alignment of the annotated rice proteins. The four cysteine residues putatively responsible for the zinc-finger structure are indicated. Identical amino acids are highlighted in black. Gene names correspond to those listed in Table 1.
Comparative analyses of the complete amino acid sequences of the Dof proteins with the BLOCKS and MEME software (Figure 3) agree with the phylogenetic analysis presented, since several family- and subfamily-specific conserved motifs (Table 3) could be determined for each of the groups defined in Figure 2A.
Moreover, an additional tree derived from the MEME results, analyzing the Dof domain plus all the conserved motifs described in Table 3, presented the same group and subgroup organization as the tree displayed in Figure 2A (data not shown).
Schematic distribution of conserved motifs among the defined gene clusters in Figure 2A.

Comparative genomic analysis of the rice and Arabidopsis Dof gene families
The main objective of this phylogenetic study was to identify putative orthologous and paralogous Dof genes, orthologs being defined as genes in different genomes that have been created by the splitting of taxonomic lineages, and paralogs as genes in the same genome created by gene duplication events [5,43]. Paralogs usually display different functions, while orthologs may retain the same function [1]. Distinguishing orthologous from paralogous genes is essential to comparative genomics. Indeed, the fundamental activity of comparative genomics is to track the presence, structural characteristics, function, and map position of orthologs in multiple genomes [5].
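The distinction drawn above can be sketched with the species-overlap heuristic, which is used here purely for illustration and is not the reconciliation method of the study. The sketch assumes a rooted binary gene tree written as nested tuples and leaf names of the hypothetical form `species|gene`. A node whose two subtrees share a species is treated as a duplication, otherwise as a speciation. Cross-species pairs separated by a duplication are also labelled "paralog" here (some authors call these out-paralogs), which is broader than the same-genome definition given in the text.

```python
from itertools import product


def species_of(leaf: str) -> str:
    """Extract the species prefix from a leaf named 'species|gene'."""
    return leaf.split("|", 1)[0]


def classify_pairs(tree):
    """Label every pair of leaves by the node at which they diverge.

    Returns a dict mapping frozenset({leaf1, leaf2}) to 'ortholog'
    (speciation node) or 'paralog' (duplication node).
    """
    pairs = {}

    def walk(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_leaves, right_leaves = walk(left), walk(right)
        left_species = {species_of(x) for x in left_leaves}
        right_species = {species_of(x) for x in right_leaves}
        # Shared species on both sides implies a duplication at this node.
        kind = "paralog" if left_species & right_species else "ortholog"
        for a, b in product(left_leaves, right_leaves):
            pairs[frozenset((a, b))] = kind
        return left_leaves + right_leaves

    walk(tree)
    return pairs


if __name__ == "__main__":
    # Hypothetical tree: two Arabidopsis genes sister to one rice gene.
    toy_tree = (("At|a1", "At|a2"), "Os|o1")
    for pair, kind in classify_pairs(toy_tree).items():
        print(sorted(pair), kind)
```

In the toy tree, the two Arabidopsis genes come out as paralogs of each other and both as orthologs of the single rice gene. The trees in Figure 2 are unrooted, so a rooting would have to be chosen before a sketch like this could be applied.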
Considering the extensive annotation work done in Arabidopsis since the release of its sequenced genome [3], together with the analysis of rice sequences carried out in this study, we assume that most (or possibly all) of the Dof transcription factors from these species are represented in the 66 sequences documented (36 from Arabidopsis and 30 from rice). Our analyses of these sequences defined four MCOGs in rice and Arabidopsis (Figure 2A). Within each MCOG, particular clusters of paralogous and orthologous genes were identified, showing ancestral duplication and gene loss events. These results were also corroborated through the construction of a rice/Arabidopsis reconciled tree [5,44] (data not shown). The tree presented (Figure 2A) showed considerable bootstrap support for many of the defined groups and subgroups, but several clusters remained poorly supported. This was an expected consequence of basing the study on a sequence only 50 amino acids long, a constraint imposed by the lack of sequence conservation among Dof proteins outside this domain. However, it is worth mentioning that most of the groups and subgroups defined were supported by additional criteria, such as gene structure and the presence of common protein motifs outside the Dof domain detected in the MEME analysis (Figure 3 and Table 3).
a Numbers correspond to the motifs described in Figure 3. b Sequences obtained from the analysis of the 66 complete rice and Arabidopsis Dof proteins with the MEME system. c Dof consensus sequence (in italics). d Predicted nuclear localization signal, according to Park et al [50] (underlined).

The Dof family in angiosperms: Rice (monocot) and Arabidopsis (dicot) specific genes
Although Dof proteins are found exclusively in the plant kingdom, searching EST databases allows the occurrence of Dof-encoding sequences to be tracked from the unicellular alga Chlamydomonas to mosses and gymnosperms, indicating an ancient origin and the possibility of diversification throughout plant evolution. In this respect, comparisons of Dof repertoires from different organisms may give important insights into the evolutionary history of the family.
Assuming our compilation to be a complete catalog of the rice and Arabidopsis Dof transcription factors, we might postulate the existence of rice- and Arabidopsis-specific Dof genes and, by extension, putative monocot- and dicot-specific Dof genes. The genes belonging to Arabidopsis cluster C3 and rice cluster d3 (Figure 2A) might represent such a situation, since each group has no apparent counterpart in the other species. To characterize these events further, we performed global searches (against whole plant nucleotide and protein sequences) with all the members of the two subfamilies (C3, d3). For this purpose, the query sequences used in BLAST searches were selected outside the Dof domain, since this structure is highly conserved across the whole plant kingdom [36]. The sequence producing the lowest E-value with the C3 queries corresponds to the Pisum sativum ERDP gene (Accession BAA85655.1), followed by maize PBF and its orthologs from barley and wheat. PBF-like genes encode Dof proteins known to participate in important regulatory processes of gene expression in the seeds of monocots [19,20]. This suggests an orthologous relationship between the Arabidopsis (dicot) subfamily C3 and monocot PBF-like genes. This hypothesis is in agreement with our own experimental results, where At4g21080 shows seed-specific expression (unpublished results).
BLAST searches with the putative monocot-specific d3 group identified two maize genes as the only closely related sequences. The maize genes Dof1 and Dof2 [21] are likely to be orthologous to OsDof-22/OsDof-16. These findings suggest the possible existence of monocot-specific genes (i.e. d3 related), while no obvious dicot-specific genes were found based on the Arabidopsis sequences. Further analysis of Dof evolution will require the completion of genome projects currently underway and the isolation of Dof sequences, not available at present, from a broader spectrum of plant species.

Duplication events, gene function and phylogenetic relationships
When comparing multi-gene families between species, it is common to find several genes in one species that are collectively orthologs of a single gene in the other, indicating recent duplications exclusive to the former. In this situation, knowledge of the function of certain members allows the confirmation of paralogous and orthologous relationships that are otherwise difficult to infer from tree topologies alone. This is the case of DAG1 (At3g61850) and DAG2 (At2g46590), two closely related Arabidopsis genes (Figure 2A and Table 2). These two genes display a high degree of sequence similarity and show identical patterns of expression, indicating a potential case of functional redundancy. Nevertheless, a systematic analysis of mutant variants demonstrated that they perform opposite functions in the control of seed germination [23][24][25]. Thus, DAG1 and DAG2 are clearly non-redundant and paralogous genes produced after a recent duplication event.
Conversely, phylogenetic relationships could help in the identification of gene function. Considering the case described above, a third gene (At4g24060) seems to be a paralog of the DAG1-DAG2 branch, resulting in a cluster that appears to be orthologous to the rice gene cluster (OsDof-9/OsDof-18) present in MCOG Cc. Remarkably, OsDof-9 and OsDof-18 were first reported after their isolation from rice seed aleurone layers [27], and it will be interesting to investigate whether they have evolved antagonistic functions in germination, as the corresponding Arabidopsis paralogs have.
Considering the difference in genome size between Arabidopsis (115 Mb) and rice (420 Mb), it is worth mentioning that 36 Dof genes were identified in the former but only 30 in the latter, in agreement with important duplication events in the origin of the Arabidopsis genome. In this context, establishing phylogenetic relationships is of outstanding interest to unravel gene functionality.
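The contrast between genome size and Dof gene number can be made explicit as a density per megabase. The short sketch below uses only the four figures quoted above; the variable and function names are illustrative.

```python
# Figures quoted in the text: (genome size in Mb, number of Dof genes).
GENOMES = {"Arabidopsis": (115, 36), "rice": (420, 30)}


def dof_density(size_mb: float, n_genes: int) -> float:
    """Number of Dof genes per megabase of genome."""
    return n_genes / size_mb


densities = {name: dof_density(size, n) for name, (size, n) in GENOMES.items()}
ratio = densities["Arabidopsis"] / densities["rice"]

if __name__ == "__main__":
    for name, density in densities.items():
        print(f"{name}: {density:.3f} Dof genes per Mb")
    print(f"Arabidopsis/rice density ratio: {ratio:.2f}")
```

This gives roughly 0.31 Dof genes per Mb in Arabidopsis against about 0.07 in rice, a ratio of about 4.4.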

Conclusions
We identified the probable full complement of Dof genes in rice and Arabidopsis, which are representative of the major evolutionary lineages in the angiosperms: the monocotyledons and the dicotyledons. Phylogenetic analyses resulted in the identification of four major clusters of orthologous genes that contain members belonging to both species, and that must have been represented in their common ancestor before the taxonomic splitting of the angiosperms. Recognition of species-specific subgroups within these clusters led us to explore the existence of genes unique to monocots or dicots. We performed exhaustive searches in plant databases that allowed the detection of likely orthologs to dicot-specific genes, while no clear orthologs to monocot-specific genes could be identified.
In view of the important genome duplication events leading to gene redundancy in the history of plant diversification, a combination of available functional data with phylogenetically inferred relationships is essential to effectively establish conserved and diverged roles of present-day genes in evolutionarily distant plant species.

Methods
Our collection of non-redundant Arabidopsis Dof proteins was gathered from three different and interconnected sources: the Munich Information Center for Protein Sequences database (MIPS, MATDB; http://mips.gsf.de/proj/thal/db), the Institute for Genomic Research (At TIGR db, http://www.tigr.org/tdb/e2k1/ath1/index.shtml) and the Regulatory Gene Initiative on Arabidopsis (REGIA) European project. Information regarding the gene structure was obtained from the At TIGR db. Redundancy analyses of the Arabidopsis genomic regions comprising Dof genes were carried out with the Redundancy Viewer at the MATDB.
The non-redundant set of rice Dof proteins was compiled from the Oryza sativa ssp. japonica and Oryza sativa ssp. indica databases. Sequences for japonica were obtained from the International Rice Genome Sequencing Project, IRGSP, through the Rice TIGR db BLAST tool (http://www.tigr.org/tdb/e2k1/osa1/index.shtml). Newly released sequences from chromosomes 1 and 4 [33,34] were obtained from the DNA Data Bank of Japan (http://www.ddbj.nig.ac.jp/Welcome-e.html). Additional japonica sequences were obtained from the Syngenta project by browsing the Torrey Mesa Research Institute (TMRI) Rice Genome Database (http://www.tmri.org/). Sequences for indica were obtained from the Whole Genome Shotgun Sequencing Project of the Beijing Genomics Institute by means of the O. sativa BLAST page at the NCBI (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/riceWGS.html). Gene structure of previously annotated Dof genes was obtained from the Rice TIGR db. Unannotated Dof genes were annotated using the Rice Genome Automated Annotation System (RiceGAAS; http://ricegaas.dna.affrc.go.jp) at the National Institute of Agrobiological Science. Additional information was supplied from the MATDB.
Protein sequences were aligned with CLUSTALW [43,45,46] at the DNA Data Bank of Japan page (http://www.ddbj.nig.ac.jp/Welcome-e.html). Bootstrap analysis with PHYLIP-format tree output was carried out using the neighbor-joining method, and the trees were drawn with the TREEVIEW (v. 1.6.6) software [47]. Conserved motif analysis of the rice and Arabidopsis proteins within the determined Dof groups was performed by means of the RiceGAAS, MEME ([48]; http://meme.sdsc.edu/meme/website/intro.html) and BLOCKS ([49]; http://blocks.fhcrc.org/blocks) programs.
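For readers unfamiliar with bootstrapping, its resampling step can be sketched as follows. This is a generic illustration, not the CLUSTALW or PHYLIP implementation used in the study: alignment columns are drawn with replacement to build pseudo-replicate alignments of the original length. Building a neighbor-joining tree from each replicate and counting how often each group is recovered are omitted. Function names, the seed parameter and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(aligned: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping rows in register."""
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(aligned[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    # The same column indices are applied to every sequence.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(aligned[name][c] for c in columns) for name in names}


def bootstrap_replicates(aligned: dict, n: int, seed: int = 0) -> list:
    """Return n pseudo-replicate alignments from a seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(aligned, rng) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical toy alignment, not real Dof domain sequences.
    toy = {"seq1": "CPRCDS", "seq2": "CPRCNS", "seq3": "CPKCES"}
    for replicate in bootstrap_replicates(toy, 3):
        print(replicate)
```

The fraction of replicate trees in which a given group reappears is what is reported as its bootstrap support.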

Authors' contributions
DL carried out the annotation of the rice genes and the phylogenetic, bioinformatic and genomic analyses, and drafted and edited the manuscript. PC contributed background knowledge on the Dof gene family and edited the manuscript. JVC conceived of the study and participated in its design and coordination. All authors read and approved the final manuscript.