Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana
© Lin et al; licensee BioMed Central Ltd. 2010
Received: 8 May 2009
Accepted: 12 February 2010
Published: 12 February 2010
The availability of genome and transcriptome sequences for a number of species permits the identification and characterization of conserved as well as divergent genes such as lineage-specific genes which have no detectable sequence similarity to genes from other lineages. While genes conserved among taxa provide insight into the core processes among species, lineage-specific genes provide insights into evolutionary processes and biological functions that are likely clade or species specific.
Comparative analyses using the Arabidopsis thaliana genome and sequences from 178 other species within the Plant Kingdom enabled the identification of 24,624 A. thaliana genes (91.7%) that were termed Evolutionary Conserved (EC) as defined by sequence similarity to a database entry as well as two sets of lineage-specific genes within A. thaliana. One of the A. thaliana lineage-specific gene sets shares sequence similarity only with sequences from species within the Brassicaceae family and is termed Conserved Brassicaceae-Specific Genes (914, 3.4%, CBSG). The other set of A. thaliana lineage-specific genes, the Arabidopsis Lineage-Specific Genes (1,324, 4.9%, ALSG), lacks sequence similarity to any sequence outside A. thaliana. While many CBSGs (76.7%) and ALSGs (52.9%) are transcribed, the majority of the CBSGs (76.1%) and ALSGs (94.4%) have no annotated function. Co-expression analysis indicated significant enrichment of the CBSGs and ALSGs in multiple functional categories, suggesting their involvement in a wide range of biological functions. Subcellular localization prediction revealed that the CBSGs were significantly enriched in proteins targeted to the secretory pathway (412, 45.1%). Among the 107 putatively secreted CBSGs with known functions, 67 encode a putative pollen coat protein or cysteine-rich protein with sequence similarity to the S-locus cysteine-rich protein that is the pollen determinant controlling allele-specific pollen rejection in self-incompatible Brassicaceae species. Overall, the ALSGs and CBSGs were more highly methylated in floral tissue compared to the ECs. Single Nucleotide Polymorphism (SNP) analysis showed an elevated ratio of non-synonymous to synonymous SNPs within the ALSGs (1.99) and CBSGs (1.65) relative to the EC set (0.92), mainly caused by an elevated number of non-synonymous SNPs, indicating that they are fast-evolving at the protein sequence level.
Our analyses suggest that while a significant fraction of the A. thaliana proteome is conserved within the Plant Kingdom, evolutionarily distinct sets of genes that may function in defining biological processes unique to these lineages have arisen within the Brassicaceae and A. thaliana.
Lineage-specific genes are defined as genes in one taxonomic group that have no detectable sequence similarity to genes from other lineages. With the availability of complete or near-complete genome and transcriptome sequences from a wide range of species, lineage-specific genes have been extensively studied, especially in microbial species [1–4]. Several hypotheses regarding the origin of lineage-specific genes have been proposed. One model suggests that lateral gene transfer has an important role in generating lineage-specific genes [5, 6]. A second model proposes that lineage-specific genes may be generated by gene duplication followed by rapid sequence divergence [4, 7]. It has also been suggested that an accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes such that no sequence similarity to genes from other species can be detected. Other models include de novo emergence from non-genic sequences, which are more diverged between species, as well as artifacts from genome annotation. Although the origin and evolution of lineage-specific genes remain unresolved, the identification and characterization of putative lineage-specific genes can provide insight into species-specific functions and evolutionary processes such as speciation (divergence) and adaptation.
Within the Plant Kingdom, the identification and characterization of lineage-specific genes has been performed through comparative analysis of Expressed Sequence Tags (ESTs) and/or the finished genome sequences of Arabidopsis thaliana (Arabidopsis) and Oryza sativa (rice) [11–13], the model species for dicotyledonous and monocotyledonous plants, respectively. Through a comparative analysis between the Arabidopsis and rice predicted proteomes, 116 protein clusters comprised of at least two Arabidopsis sequences but lacking a rice protein were identified, suggesting they were encoded by Arabidopsis-specific genes [14, 15]. In a comparative analysis of legume with non-legume unigene datasets, GenBank's nonredundant and EST databases, and the genome sequences of Arabidopsis and rice, approximately 6% of the legume unigene sets were identified as legume-specific . In a more recent analysis, a set of 861 rice genes termed "Conserved Poaceae Specific Genes" that are evolutionarily conserved within the Poaceae family yet lack significant sequence similarity to non-Poaceae species was identified by searching the finished rice genome sequence against the genomic sequences from Arabidopsis, Medicago, poplar, and EST clusters from 184 plant species . This set of conserved Poaceae-specific genes provides a starting point for further research experiments to better understand the unique morphology, physiological and developmental characteristics of Poaceae species. With the recent availability of additional plant genome sequences, a recent study identified 165, 638, and 109 lineage-specific genes in Arabidopsis, rice, and poplar, respectively, by searching genes with expression evidence against EST assemblies, a non-redundant protein database, and plant genome sequences . In addition to lineage-specific genes, comparisons of multiple plant genomes can provide information on lineage-specific gene expansion of gene families .
In this study, we identified and characterized Conserved Brassicaceae-Specific Genes (CBSGs) and Arabidopsis Lineage-Specific Genes (ALSGs) using the completed and well-annotated A. thaliana genome, the genomes of Medicago truncatula (Medicago), Populus trichocarpa (poplar), Vitis vinifera (grapevine), Carica papaya (papaya), Sorghum bicolor (sorghum), Chlamydomonas reinhardtii (green alga), Physcomitrella patens (moss), and O. sativa (rice) [19–27], as well as EST clusters from 178 plant species. An earlier study on Arabidopsis lineage-specific genes  restricted their analysis to only genes with expression evidence and employed a relaxed criterion to define sequence conservation. As a consequence, only 165 lineage-specific genes were identified in Arabidopsis. In our study, we elected to limit false negatives and identify more potential lineage-specific genes in A. thaliana by using the entire A. thaliana predicted protein-coding gene complement in our analyses and by using more stringent searching criteria. Furthermore, we identified two types of lineage-specific genes, those restricted to A. thaliana and those restricted to the Brassicaceae. By our definition, CBSGs are A. thaliana genes that have significant sequence similarity only to sequences from species within the Brassicaceae family while ALSGs are A. thaliana genes that are unique to A. thaliana. As a large portion of the CBSGs and ALSGs have no known function, co-expression and subcellular localization analyses were performed to infer possible biological function. DNA methylation analysis was performed to investigate the epigenetic modification and effects. To assess evolutionary pressures within these two sets of lineage-specific genes, Single Nucleotide Polymorphisms (SNPs) within the coding regions were analyzed.
Identification of CBSGs and ALSGs
Characterization of the CBSGs and ALSGs
Genic features of the CBSGs, ALSGs, ECs, and TE-related genes (table; rows include Exon GC (%), Intron GC (%), Gene GC (%), and GC (%) at the 1st, 2nd, and 3rd codon positions)
Functional annotation of CBSGs, ALSGs, and ECs (table; rows include genes with no known function, genes with a known function, the genes in each group with no transcript support, and putative PCP or SCR genes)
Neither the CBSGs nor the ALSGs were distributed randomly within the A. thaliana genome (See Additional file 3). Large numbers of CBSGs, ALSGs, and ECs were located within segmentally duplicated blocks, consistent with the substantial segmental duplication that occurred in A. thaliana. However, the CBSGs and ALSGs were located more frequently in non-segmentally duplicated regions compared to the ECs. A total of 23.7% and 27.3% of the CBSGs and ALSGs, respectively, were located within non-segmentally duplicated regions, compared to 13.8% of the EC genes (χ2 test, P < 1e-5). This could be due to differential loss of lineage-specific genes (ALSGs, CBSGs) in segmentally duplicated versus non-segmentally duplicated regions. Alternatively, the ALSGs and CBSGs may be located in segmentally duplicated blocks that have evolved rapidly and thus are not detected using similarity-based segmental duplication methods.
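The comparison above can be illustrated with a small contingency-table calculation. The following is only a sketch: the counts are approximate values back-calculated from the reported percentages and set sizes (914 CBSGs, 1,324 ALSGs, 24,624 ECs), the three sets are tested jointly in one 3 x 2 table (an assumption, since the exact table layout of the original test is not stated), and the function names are hypothetical.

```python
import math

def chi_square_independence(table):
    """Return the Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

def chi_square_p_two_dof(stat):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

# Set sizes and non-duplicated fractions as reported in the text.
SET_SIZES = {"CBSG": 914, "ALSG": 1324, "EC": 24624}
NON_SD_FRACTION = {"CBSG": 0.237, "ALSG": 0.273, "EC": 0.138}

# Approximate counts (assumption): rounded products of set size and reported fraction.
table = []
for name, size in SET_SIZES.items():
    non_sd = round(size * NON_SD_FRACTION[name])
    table.append([non_sd, size - non_sd])  # [non-duplicated, duplicated]

stat, dof = chi_square_independence(table)
p_value = chi_square_p_two_dof(stat)
print(f"chi-square = {stat:.1f}, dof = {dof}, p = {p_value:.3g}")
```

With these approximate counts the P value falls far below the 1e-5 level quoted in the text, although the exact statistic depends on the true underlying counts.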
We utilized a computational pipeline in which Pfam and novel BLASTP-based protein domains (see Methods) were used to classify paralogous families. As our pipeline involves identification of novel BLASTP-based domains, proteins without a Pfam domain can also be classified into paralogous families, thereby removing any bias associated with the lack of a characterized protein domain. At the whole genome level, 17,911 A. thaliana genes (66.7%) were classified into 3,051 paralogous families. For the lineage-specific gene sets, 389 CBSGs (42.6%) and 65 ALSGs (4.9%) were classified into paralogous families, substantially lower than that of the EC set (70.9%). This is consistent with what was reported for lineage-specific genes within the Poaceae and with previous analyses in A. thaliana, which demonstrated that paralogous families were enriched in genes with known function whereas the single-copy gene complement was enriched in genes with no known function.
CBSGs are enriched with proteins targeted to the secretory pathway
Subcellular localization of the CBSGs, ALSGs, ECs, and TAIR8 non-TE protein-coding genes (Table 3; values given as No. of genes (%))
Although mitochondria and chloroplasts have conserved functions throughout the Plant Kingdom and conserved sets of nuclear-encoded proteins across taxa have been documented (for example, [43–45]), both computationally predicted and empirically determined proteomes of mitochondria and chloroplasts have shown the presence of lineage-specific proteins [46, 47]. For example, experimental analysis of the rice mitochondrial proteome suggested that approximately 20% of the rice mitochondrial proteome may be lineage-specific as no clear homolog was detected in the Arabidopsis mitochondrial proteome . Consistent with these reports, we observed a significant difference in the percentage of genes that encode proteins targeted to the mitochondrion between the CBSG, ALSG, and EC sets. The CBSGs had a significantly lower than expected percentage of genes encoding proteins targeted to the mitochondrion and the ALSGs had a significantly higher than expected percentage (χ2 test, P < 1e-5). With respect to targeting to the chloroplast, CBSGs and ALSGs were detected although at a significantly lower percentage compared to the EC set (Table 3, χ2 test, P < 1e-5). In sharp contrast to the EC set, the majority of these putative mitochondrial and chloroplast targeted ALSGs and CBSGs have no known function (Table 3), suggesting these lineage-specific genes may encode novel functions within these two organelles.
Functional inference by co-expression analyses
Given the lack of functional assignment for a large percentage of the ALSG and CBSG sets, we performed co-expression analyses to associate these lineage-specific genes with genes with annotated functions in Gene Ontology (GO) functional categories . To prevent ascertainment bias, GO annotation solely based on expression evidence was excluded. Using Arabidopsis ATH1 microarray expression data, we computed Pearson's Correlation Coefficients (PCC) for the ALSGs and CBSGs in comparison to all other genes on the microarray. Probes for 345 (26%) ALSGs and 314 (34%) CBSGs are present on the ATH1 array. Based on a simulation study, we found that 99% of randomly selected A. thaliana gene pairs have a PCC < 0.6. Using a 0.6 PCC value as the cutoff, 260 ALSGs (75%) and 250 CBSGs (80%) with microarray probes were regarded as co-expressed with ≥ 1 gene(s) with GO annotations.
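The thresholding idea described above can be sketched as follows: compute Pearson's Correlation Coefficient between expression profiles, estimate a cutoff from randomly drawn gene pairs, and keep partners at or above that cutoff. This is a minimal illustration on toy data, not the pipeline used in the study; the gene names, profiles, and function names are hypothetical, and a nearest-rank percentile is assumed.

```python
import math
import random

def pearson(x, y):
    """Pearson's Correlation Coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def percentile_cutoff(values, fraction):
    """Nearest-rank percentile: the value at the given fraction of the sorted list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]

def random_pair_cutoff(profiles, n_pairs, fraction=0.99, seed=0):
    """Estimate a PCC cutoff from randomly selected gene pairs."""
    rng = random.Random(seed)
    genes = list(profiles)
    pccs = []
    for _ in range(n_pairs):
        g1, g2 = rng.sample(genes, 2)
        pccs.append(pearson(profiles[g1], profiles[g2]))
    return percentile_cutoff(pccs, fraction)

def coexpressed_partners(query, profiles, cutoff):
    """Genes whose PCC with the query is at or above the cutoff."""
    return [g for g in profiles
            if g != query and pearson(profiles[query], profiles[g]) >= cutoff]

# Toy expression profiles (hypothetical values, five experiments each).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.5],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
    "geneD": [1.0, 1.0, 2.0, 1.0, 1.0],
}
print(coexpressed_partners("geneA", profiles, cutoff=0.6))
```

On a real compendium the cutoff would come from random_pair_cutoff rather than being fixed by hand; the text reports that this procedure gave a value of 0.6.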
With the co-expression and GO annotation information, we then determined if any GO categories were over-represented among genes co-expressed with the ALSGs or CBSGs. We found that 75 out of the 260 ALSGs (29%) and 138 out of the 250 CBSGs (55%) had ≥ 1 significantly enriched GO categories (See Methods and Additional files 4 and 5). Upon closer inspection, 35 out of 138 (25%) CBSGs were associated with GO categories related to pollen, meiosis and sexual reproduction while 34 CBSGs (25%) were associated with GO categories related to intra-cellular transport and secretory pathways (See Additional files 6, 7 and 8). Analysis using the Fisher Exact Test revealed a highly significant over-representation of genes in these categories in CBSG versus the non-CBSG datasets. These observations, despite being obtained from a subset of the lineage-specific genes, are consistent with the TargetP prediction of an enrichment of CBSGs targeted to the secretory pathway.
CBSGs and ALSGs have a higher density of cytosine methylation
CBSGs and ALSGs have a higher ratio of non-synonymous to synonymous SNPs
The 914 CBSGs and 1,324 ALSGs identified in this study are attractive targets for experimental discovery as they are lineage-specific and the majority (76.1% CBSGs and 94.4% ALSGs) encode functions yet to be determined. Both the CBSGs and ALSGs had shorter genes compared to the ECs, primarily due to fewer exons per gene and a higher percentage of single-exon genes. A total of 68.6% of the 26,862 A. thaliana genes used in our analyses are high confidence genes as the gene structure (including splice junctions) of at least one isoform has been confirmed with a single cDNA or multiple overlapping cDNAs [27, 52]. The percentages of high confidence genes within the ALSG, CBSG, and EC sets are 18.5%, 38.1%, and 72.4%, respectively. However, 54.0% and 83.9% of ALSGs and CBSGs, respectively, have transcript evidence from full-length cDNAs, ESTs or microarray data, or have a putative function assigned, which provides strong support that they are likely to be bona fide genes rather than false positive gene predictions from the ab initio gene prediction programs utilized in genome annotation processes.
One question regarding these lineage-specific genes is their origin(s). One possibility is horizontal gene transfer. In the final step of our pipeline (Fig. 1), we utilized UniProtKB to filter any remaining conserved genes and identified 53 A. thaliana proteins with sequence similarity to genes from bacteria, viruses, nematodes, fungi, animals, or other plant species not available in our other large-scale plant genome and transcriptome sequence datasets. This limited number, which includes matches to other plant species, suggests that lateral gene transfer is not a major source of lineage-specific genes in A. thaliana. Another potential source of lineage-specific genes is gene duplication followed by rapid evolution. Consistent with this, we found that a large number of ALSGs (73%) and CBSGs (76%) were located within segmentally duplicated regions, suggesting they may have been generated by segmental duplication followed by rapid sequence divergence due to relaxed selective pressure on duplicated genes.
We have identified more lineage-specific genes (1,324 ALSGs) than the 165 Arabidopsis species-specific genes (ASS) identified by Yang et al. Comparison of the ASS with our ALSGs and CBSGs revealed that 10 of the ASS were transposable element genes or miRNAs and thus were not included in our analysis. Of the remaining 155 ASS described in Yang et al., 129 were identified as either an ALSG or a CBSG. There are two major reasons for this difference. First, different BLAST E-value cutoffs were utilized in the lineage-specific gene identification pipelines. Our pipeline used a BLAST E-value cutoff of < 1e-5 while Yang et al. used a BLAST E-value cutoff of < 0.1. Because the cutoff in the earlier study is relaxed, it likely has a higher false negative rate in identifying truly lineage-specific genes compared to this study, while our lineage-specific gene sets will likely have a higher proportion of false positives. Second, Yang et al. restricted their analysis to genes with expression evidence from ESTs or full-length cDNAs. As a consequence, lineage-specific genes that were expressed under limited conditions or whose expression level was too low to be detected by traditional transcript profiling were excluded. This practice also leads to false negatives because there is ample evidence that genes annotated as hypothetical are bona fide genes. In contrast, we examined the entire A. thaliana predicted protein-coding gene complement and our approach likely has a significantly lower number of false negatives.
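The effect of the E-value cutoff on classification can be sketched with a simplified decision rule. This is an illustration under stated assumptions, not the multi-step pipeline of Fig. 1: each gene is assumed to arrive with a list of (species, E-value) hits against non-A. thaliana sequences, the Brassicaceae labels are taken from the species named in this article, and the function name is hypothetical.

```python
# Species labels drawn from the Brassicaceae taxa named in the text (abbreviated as written there).
BRASSICACEAE = {
    "B. rapa", "B. oleracea", "B. napus",
    "R. raphanistrum", "R. sativus", "T. halophila",
}

def classify_gene(hits, cutoff=1e-5, brassicaceae=BRASSICACEAE):
    """Classify one A. thaliana gene from its hits against other species.

    hits: iterable of (species, evalue) pairs, excluding A. thaliana itself.
    Returns "ALSG" when nothing passes the cutoff, "CBSG" when every significant
    hit is from a Brassicaceae species, and "EC" otherwise.
    """
    significant = [species for species, evalue in hits if evalue < cutoff]
    if not significant:
        return "ALSG"
    if all(species in brassicaceae for species in significant):
        return "CBSG"
    return "EC"

# A weak hit (E = 0.01) is ignored at 1e-5 but counts as similarity at 0.1.
example_hits = [("O. sativa", 0.01)]
print(classify_gene(example_hits, cutoff=1e-5))
print(classify_gene(example_hits, cutoff=0.1))
```

The example shows why the choice of cutoff changes how many genes are called lineage-specific: the same weak hit leaves the gene lineage-specific under the 1e-5 cutoff and conserved under the 0.1 cutoff.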
Both the ALSGs and CBSGs have more genetic variation among the 20 re-sequenced A. thaliana ecotypes than the EC genes, with the ALSGs and ECs having the most (0.41) and least (0.35) SNPs per 100 bp per gene, respectively. This was inversely correlated with the degree of evolutionary conservation of the ALSG, CBSG, and EC sets within the Plant Kingdom. However, the three sets have similar synonymous SNP density. As a consequence, ALSGs and CBSGs have higher ratios of non-synonymous to synonymous SNPs compared to the EC genes, indicating they are fast-evolving at the protein level.
The dramatic enrichment of secretory proteins in the CBSGs indicates there may be specific or highly evolved secretion processes within the Brassicaceae family as no significant sequence similarity could be detected in other dicot genomes including poplar, Medicago, papaya, and grapevine for which genome sequences are available. A majority of Brassicaceae species share a highly specialized self-incompatibility system that restricts self-pollination and promotes out-crossing [54, 55] through the ability to recognize and reject self-pollen or pollen from closely related plants. In our analysis, the majority of the Brassicaceae species used are self-incompatible (B. rapa, B. oleracea, B. oleracea var. alboglabra, R. raphanistrum, and R. sativus) while three are self-compatible (T. halophila, B. napus, and A. thaliana). Within each self-incompatible Brassica species, specificity of the self-incompatibility response is genetically determined by the alleles at the S (self-incompatibility) locus and involves the arrest of pollen development upon self pollination . SCR is the male determinant of the self-incompatibility response which is expressed specifically in the anther tapetum and microspores  and is predicted to interact with the female determinant S locus receptor kinase gene expressed in the papillar cells of the stigma . Out-crossing is thought to be the ancestral mode of mating in the Brassicaceae  and the S-locus specificity genes are predicted to be derived from common ancestors .
In addition to the self-recognition mechanisms required to avoid inbreeding within a single species, plants have mechanisms to control mating between different species. For example, B. napus pollen placed on B. oleracea stigmas became hydrated and germinated; however, the pollen produced short coiled tubes that failed to penetrate the papillar cell wall. This phenotype suggests that discrimination between desirable and undesirable pollen between species also involves biochemical interactions on the stigma. Therefore, it is possible that some of the CBSG PCP and SCR-like members may play a role in inter-specific mating by influencing critical aspects required for successful pollination. Our finding that 67 of the 107 CBSGs with an assigned function and putatively involved in secretory pathways are similar to SCR or PCP proteins, and that these genes are subjected to diversifying selection, a phenomenon almost always associated with genes involved in recognition events, supports this hypothesis. For example, SLR1 (for S locus glycoprotein-like receptor 1), a stigma-specific protein, interacts with members of the PCP. Whether any of the other members of the PCP or SCR-like genes contribute to pollination biology in the Brassicaceae remains to be determined.
In summary, we have identified two sets of A. thaliana lineage-specific genes, CBSGs and ALSGs, which are specific to the Brassicaceae family and A. thaliana, respectively. The CBSGs are especially enriched in proteins with binding function such as receptor binding that may play a role in the self-incompatibility response. The exact functions of a majority of these lineage-specific genes remain an enigma at this time. Further biological experiments will be necessary to fully understand their functions in A. thaliana and Brassicaceae species.
Data sources and preparation
The proteome of A. thaliana was obtained from the TAIR8 release ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR8_genome_release. Pseudogenes and TE genes were excluded from the original gene set based on the TAIR8 annotation, which resulted in 27,025 protein coding genes. Further screening against two in-house transposon databases identified an additional 163 putative TE-related genes, resulting in 26,862 A. thaliana genes for further analysis. The repeat-masked assembled scaffolds (v1.0) of poplar (P. trichocarpa) were downloaded from DOE Joint Genome Institute http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.download.ftp.html. The repeat-masked assembly of the grapevine (V. vinifera) genome was downloaded from Genoscope http://www.genoscope.cns.fr/spip/Vitis-vinifera-whole-genome.html. The release 2.0 assembly of the Medicago (M. truncatula) genome was downloaded from the Medicago Genome Sequence Consortium http://www.medicago.org/genome/downloads.php. Release 6 pseudomolecules of rice (O. sativa ssp. japonica) were downloaded from the Rice Genome Annotation Project http://rice.plantbiology.msu.edu/. The repeat-masked Sbi1 assembly of the sorghum (S. bicolor) genome was downloaded from DOE Joint Genome Institute ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v4.0/Sbicolor/assembly/Sbi1/. The repeat-masked assembly (v4.0) of C. reinhardtii was downloaded from DOE Joint Genome Institute http://genome.jgi-psf.org/Chlre4/Chlre4.download.ftp.html. The masked assembly (v1.1) of the moss (P. patens ssp. patens) genome was downloaded from DOE Joint Genome Institute http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.download.ftp.html. The papaya (C. papaya) genome was downloaded from NCBI http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide&cmd=Search&term=DS981520:DS984726 [PACC]. The PUTs from 178 plant species (excluding A. thaliana in this analysis) were downloaded from PlantGDB on August 11, 2009 http://www.plantgdb.org/download/download.php?dir=/Sequence/ESTcontig. 
UniProtKB (Release 14.6) was downloaded from UniProt ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/.
The TE set comprised 3,900 TE genes from the TAIR8 release and 163 putative TE genes identified by screening against two in-house transposon databases. For the TE set, only three sequence files were created (gene, exon, and intron), as TE genes lack CDS or protein sequences. For each of the CBSG, ALSG, and EC sets, the sequences of gene, exon, CDS, intron, and protein were either downloaded directly from the TAIR8 release or extracted from the chromosome sequences according to the coordinates provided in the GFF3 file. Perl scripts were used to calculate the exon number; the lengths of the gene, CDS, exons, introns, and protein; and the GC content of the CDS, gene, and three codon positions.
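The GC calculations described above were performed with Perl scripts that are not reproduced here. As an illustration only, the same quantities can be sketched in Python; the function names are hypothetical, ambiguous bases are assumed to be skipped, and a trailing partial codon is assumed to be ignored.

```python
def gc_percent(seq):
    """GC content (%) of a nucleotide sequence, ignoring characters other than A, C, G, T."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)

def codon_position_gc(cds):
    """GC content (%) at the 1st, 2nd, and 3rd codon positions of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    return tuple(gc_percent(cds[offset:usable:3]) for offset in range(3))

# Toy CDS with two codons, ATG and GCC.
print(gc_percent("ATGGCC"))
print(codon_position_gc("ATGGCC"))
```

Slicing with a step of three keeps the per-position calculation short and avoids building an explicit codon list.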
Construction of paralogous protein families
A total of 26,862 non-TE A. thaliana proteins from the TAIR8 release were used to construct paralogous protein families in the A. thaliana proteome using a computational pipeline that utilized Pfam and novel BLASTP-based domains described previously. In brief, Pfam domains were identified using HMMER2 with scores above the trusted cutoff value. Peptide regions that were not covered by Pfam domains were clustered based on homology (>45% identity over 75 amino acids, E-value < 1e-3) derived from an all versus all BLASTP search (WU-BLASTP 2.0 MP-WashU [22-Mar-2006]). Clustered peptides were then aligned using CLUSTALW [63, 64] to develop BLASTP-based domains. Paralogous protein families were then classified based on the domain composition of each protein.
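The homology-based grouping step can be sketched as single-linkage clustering over filtered BLASTP hits. This is an assumption-laden illustration: the clustering algorithm used by the original pipeline is not specified here, "over 75 amino acids" is interpreted as an alignment length of at least 75 residues, and the hit tuple layout and function name are hypothetical.

```python
def cluster_by_similarity(hits, min_identity=45.0, min_length=75, max_evalue=1e-3):
    """Single-linkage clustering of peptides from pairwise hits.

    hits: iterable of (query, subject, percent_identity, alignment_length, evalue).
    Two peptides are linked when a hit exceeds the identity threshold, spans at
    least min_length residues, and has an E-value below max_evalue.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for query, subject, identity, length, evalue in hits:
        root_q, root_s = find(query), find(subject)
        passes = identity > min_identity and length >= min_length and evalue < max_evalue
        if passes and root_q != root_s:
            parent[root_q] = root_s

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical hits: p1-p2-p3 chain together; the other hits fail one criterion each.
example_hits = [
    ("p1", "p2", 60.0, 120, 1e-20),
    ("p2", "p3", 50.0, 80, 1e-6),
    ("p4", "p5", 40.0, 200, 1e-30),  # identity too low
    ("p6", "p7", 90.0, 30, 1e-4),    # alignment too short
]
print(cluster_by_similarity(example_hits))
```

Single linkage is the simplest choice and can chain distantly related peptides through intermediates; a stricter grouping rule would yield smaller clusters.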
Identification of segmental duplication
A total of 26,862 non-TE A. thaliana proteins from the TAIR8 release were used to identify segmental duplication in the A. thaliana genome using a method described previously. In brief, similar protein pairs were identified by an all versus all BLASTP search (WU-BLASTP 2.0 MP-WashU [22-Mar-2006], parameters "V = 5 B = 5 E = 1e-10"), which were then used to define segmental duplication using DAGChainer with parameters "-s -I -D 100000".
The ATH1 microarray compendium of 3,037 experiments (hereafter called "supercluster") was downloaded from the NASCArrays website http://affymetrix.arabidopsis.info/narrays/help/usefulfiles.html. Only the genes having probes on the ATH1 array, 345 of the 1,324 ALSGs and 314 of the 914 CBSGs, were used for further analysis. Pairwise Pearson's Correlation Coefficients were computed between all lineage-specific genes (ALSGs and CBSGs) with array data and all genes in the supercluster. The threshold value (r = 0.6) was defined as the 99th percentile of all pairwise correlation coefficients obtained during the above computation. Using this threshold, we obtained a set of co-expressed genes for each ALSG and CBSG tested. A total of 260 of the 345 ALSGs and 250 of the 314 CBSGs had ≥ 1 unique gene with a significantly correlated expression profile. To define the potential functions of the lineage-specific genes, GO annotations of A. thaliana genes co-expressed with ALSGs or CBSGs were used. The A. thaliana GO annotation was downloaded from the TAIR website, excluding annotations with the evidence codes IEP, IEA and RCA. For each ALSG/CBSG, we identified the enriched GO categories among the significantly co-expressed genes. The enrichment analysis was based on a Fisher Exact Test at a False Discovery Rate of 5% as defined by the Q-value program. After associating GO categories with each gene, we investigated whether there was an enrichment of genes in categories related to 1) Pollen, meiosis and sexual reproduction, 2) Intra-cellular transport and secretion, 3) Photosynthesis, 4) Defense responses, and 5) Development, cell cycle and differentiation. We manually assigned the GO categories to these five groups and determined whether these groups were over-represented among the lineage-specific genes versus the non-lineage-specific genes using a False Discovery Rate of 1% (See Additional files 6, 7 and 8).
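The enrichment test for a single GO category can be sketched with a one-sided Fisher Exact Test written as a hypergeometric tail sum. This is only an illustration: whether the original analysis was one- or two-sided is not stated, the False Discovery Rate step performed with the Q-value program is not reproduced, and the function name and example numbers are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher Exact Test P value for over-representation.

    N: genes in the background, K: background genes in the GO category,
    n: genes co-expressed with the query, k: co-expressed genes in the category.
    Returns P(X >= k) for X drawn from the hypergeometric distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    # comb() returns 0 for impossible draws, so no explicit lower bound is needed.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical example: 8 of 10 co-expressed genes fall in a category that
# holds only 20 of 1,000 background genes.
print(fisher_enrichment_p(8, 10, 20, 1000))
```

In practice one P value would be computed per GO category and per lineage-specific gene, and the resulting set of P values would then be passed to a multiple-testing procedure.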
Determination of subcellular localization
The subcellular localization of 32,419 protein sequences from the 26,862 A. thaliana protein-coding genes was predicted by the TargetP program using plant networks and default parameters. If different locations were predicted for the isoforms of a gene, the prediction with the best (lowest) Reliability Class was used to represent the subcellular localization of that gene. If none of the isoforms had a prediction of 'Chloroplast', 'Mitochondrion', or 'Secretory pathway', then the subcellular localization of the gene was assigned 'Other'. If multiple subcellular localizations with equal Reliability Class were predicted for the isoforms of a gene, the subcellular localization of that gene was assigned 'Uncertain'.
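The per-gene assignment rules above can be written as a short function. This is a sketch under stated assumptions: isoform predictions are assumed to be available as (location, Reliability Class) pairs already parsed from TargetP output, 'Other' predictions are assumed not to compete with organelle or secretory predictions when any of the latter exist, and the function name is hypothetical.

```python
TARGETED_LOCATIONS = {"Chloroplast", "Mitochondrion", "Secretory pathway"}

def gene_localization(isoform_predictions):
    """Collapse per-isoform predictions into a single call for a gene.

    isoform_predictions: list of (location, reliability_class) pairs, where a
    lower reliability_class means a more reliable prediction.
    """
    targeted = [(loc, rc) for loc, rc in isoform_predictions if loc in TARGETED_LOCATIONS]
    if not targeted:
        return "Other"
    best_rc = min(rc for _, rc in targeted)
    best_locations = {loc for loc, rc in targeted if rc == best_rc}
    if len(best_locations) > 1:
        return "Uncertain"  # different locations tie at the best Reliability Class
    return best_locations.pop()

print(gene_localization([("Chloroplast", 2), ("Mitochondrion", 1)]))
```

A gene with a single isoform simply keeps that isoform's prediction, since the tie and fallback branches are never reached.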
Analyses of DNA methylation
The cytosine methylomic sequence data from floral tissues of wild-type A. thaliana were generated by sequencing-by-synthesis technology and mapped to the A. thaliana genome as reported previously . The methylation data for the coding region, 500 bp upstream, and 500 bp downstream of all the 26,862 A. thaliana protein-coding genes were kindly provided by the Ecker group. The density of methylation of cytosines was defined as the number of 5-methylcytosines per 100 bp per gene. Only representative gene models were used in our analyses.
The SNP data from re-sequencing of 20 diverse A. thaliana accessions using high-density oligonucleotide arrays were downloaded from the TAIR8 release ftp://ftp.arabidopsis.org/Polymorphisms/Perlegen_Array_Resequencing_Data_2007/SNP_predictions/. The polymorphism GFF3 file that includes the mapping information of the SNP markers was also downloaded from the TAIR8 release ftp://ftp.arabidopsis.org/Polymorphisms/TAIR8_Variation_GFF/TAIR8_GFF3_polymorphisms.gff. Perl scripts were used to parse the data and calculate synonymous and non-synonymous SNPs within protein coding regions. A total of 249,344 SNPs were downloaded. Only base calls from the MBML2 dataset were used in our analyses. Base calls of 'N' were ignored. A total of 5,381 SNPs with more than two variations within all 20 accessions were excluded from our analyses. Representative models were used whenever alternative-splicing isoforms existed. SNPs that produce the same amino acid as the reference codon (Columbia-0 ecotype) were counted as synonymous SNPs, while SNPs that produce a different amino acid than the reference codon were counted as non-synonymous SNPs.
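The synonymous versus non-synonymous rule above can be illustrated with the standard genetic code. The original analysis used Perl scripts that are not shown, so this Python sketch is only an approximation of the idea: SNP positions are assumed to be 0-based offsets into an in-frame reference CDS, each SNP is evaluated independently against the reference codon, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered by T, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_snp(cds, position, alt_base):
    """Label a SNP by comparing the translated reference and alternative codons."""
    start = position - position % 3
    ref_codon = cds[start:start + 3].upper()
    offset = position - start
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"

def nonsyn_to_syn_ratio(cds, snps):
    """Ratio of non-synonymous to synonymous SNPs; None when no synonymous SNP exists."""
    labels = [classify_snp(cds, position, alt) for position, alt in snps]
    synonymous = labels.count("synonymous")
    if synonymous == 0:
        return None
    return labels.count("non-synonymous") / synonymous

# Toy reference CDS: ATG (M), GCT (A), TAA (stop).
reference_cds = "ATGGCTTAA"
toy_snps = [(5, "C"), (3, "A"), (0, "C")]
print([classify_snp(reference_cds, pos, alt) for pos, alt in toy_snps])
print(nonsyn_to_syn_ratio(reference_cds, toy_snps))
```

Here GCT to GCC keeps alanine and is synonymous, whereas GCT to ACT and ATG to CTG change the encoded amino acid.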
We are grateful to Dr. Kevin Childs for critical review of the manuscript and helpful comments. This work was supported by funds to CRB from Michigan State University.
- Amiri H, Davids W, Andersson SG: Birth and death of orphan genes in Rickettsia. Mol Biol Evol. 2003, 20 (10): 1575-1587. 10.1093/molbev/msg175.
- Ogata H, Audic S, Renesto-Audiffren P, Fournier PE, Barbe V, Samson D, Roux V, Cossart P, Weissenbach J, Claverie JM, et al: Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science. 2001, 293 (5537): 2093-2098. 10.1126/science.1061471.
- Siew N, Fischer D: Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins. 2003, 53 (2): 241-251. 10.1002/prot.10423.
- Domazet-Loso T, Tautz D: An evolutionary analysis of orphan genes in Drosophila. Genome Res. 2003, 13 (10): 2213-2219. 10.1101/gr.1311003.
- Daubin V, Lerat E, Perriere G: The source of laterally transferred genes in bacterial genomes. Genome Biol. 2003, 4 (9): R57. 10.1186/gb-2003-4-9-r57.
- Striepen B, Pruijssers AJ, Huang J, Li C, Gubbels MJ, Umejiego NN, Hedstrom L, Kissinger JC: Gene transfer in the evolution of parasite nucleotide biosynthesis. Proc Natl Acad Sci USA. 2004, 101 (9): 3154-3159. 10.1073/pnas.0304686101.
- Alba MM, Castresana J: Inverse relationship between evolutionary rate and age of mammalian genes. Mol Biol Evol. 2005, 22 (3): 598-606. 10.1093/molbev/msi045.
- Cai JJ, Woo PC, Lau SK, Smith DK, Yuen KY: Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes in ascomycota. J Mol Evol. 2006, 63 (1): 1-11. 10.1007/s00239-004-0372-5.
- Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ: Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci USA. 2006, 103 (26): 9935-9939. 10.1073/pnas.0509809103.
- Schmid KJ, Aquadro CF: The evolutionary analysis of "orphans" from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes. Genetics. 2001, 159 (2): 589-598.
- Rensink WA, Lee Y, Liu J, Iobst S, Ouyang S, Buell CR: Comparative analyses of six solanaceous transcriptomes reveal a high degree of sequence conservation and species-specific transcripts. BMC Genomics. 2005, 6: 124. 10.1186/1471-2164-6-124.
- Allen KD: Assaying gene content in Arabidopsis. Proc Natl Acad Sci USA. 2002, 99 (14): 9568-9572. 10.1073/pnas.142126599.
- Graham MA, Silverstein KA, Cannon SB, VandenBosch KA: Computational identification and characterization of novel genes from legumes. Plant Physiol. 2004, 135 (3): 1179-1197. 10.1104/pp.104.037531.
- Conte MG, Gaillard S, Droc G, Perin C: Phylogenomics of plant genomes: a methodology for genome-wide searches for orthologs in plants. BMC Genomics. 2008, 9: 183. 10.1186/1471-2164-9-183.
- Conte MG, Gaillard S, Lanau N, Rouard M, Perin C: GreenPhylDB: a database for plant comparative genomics. Nucleic Acids Res. 2008, 36 (Database issue): D991-998.
- Campbell MA, Zhu W, Jiang N, Lin H, Ouyang S, Childs KL, Haas BJ, Hamilton JP, Buell CR: Identification and characterization of lineage-specific genes within the Poaceae. Plant Physiol. 2007, 145 (4): 1311-1322. 10.1104/pp.107.104513.
- Yang X, Jawdy S, Tschaplinski TJ, Tuskan GA: Genome-wide identification of lineage-specific genes in Arabidopsis, Oryza and Populus. Genomics. 2009, 93 (5): 473-480. 10.1016/j.ygeno.2009.01.002.
- Hanada K, Zou C, Lehti-Shiu MD, Shinozaki K, Shiu SH: Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol. 2008, 148 (2): 993-1003. 10.1104/pp.108.122457.
- Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, Nishiyama T, Perroud PF, Lindquist EA, Kamisugi Y, et al: The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science. 2008, 319 (5859): 64-69. 10.1126/science.1150646.View ArticlePubMedGoogle Scholar
- Merchant SS, Prochnik SE, Vallon O, Harris EH, Karpowicz SJ, Witman GB, Terry A, Salamov A, Fritz-Laylin LK, Marechal-Drouard L, et al: The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science. 2007, 318 (5848): 245-250. 10.1126/science.1143609.PubMed CentralView ArticlePubMedGoogle Scholar
- Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, et al: The Sorghum bicolor genome and the diversification of grasses. Nature. 2009, 457 (7229): 551-556. 10.1038/nature07723.View ArticlePubMedGoogle Scholar
- Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KL, et al: The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature. 2008, 452 (7190): 991-996. 10.1038/nature06856.PubMed CentralView ArticlePubMedGoogle Scholar
- Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, et al: The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006, 313 (5793): 1596-1604. 10.1126/science.1128691.View ArticlePubMedGoogle Scholar
- Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007Google Scholar
- Young ND, Cannon SB, Sato S, Kim D, Cook DR, Town CD, Roe BA, Tabata S: Sequencing the genespaces of Medicago truncatula and Lotus japonicus. Plant Physiol. 2005, 137 (4): 1174-1181. 10.1104/pp.104.057034.PubMed CentralView ArticlePubMedGoogle Scholar
- Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L, et al: The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res. 2007, D883-887. 10.1093/nar/gkl976. 35 DatabaseGoogle Scholar
- Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al: The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008, D1009-1014. 36 DatabaseGoogle Scholar
- Gish W: 1996, [http://blast.wustl.edu]
- Dong Q, Lawrence CJ, Schlueter SD, Wilkerson MD, Kurtz S, Lushbough C, Brendel V: Comparative plant genomics resources at PlantGDB. Plant Physiol. 2005, 139 (2): 610-618. 10.1104/pp.104.059212.PubMed CentralView ArticlePubMedGoogle Scholar
- Guo WJ, Li P, Ling J, Ye SP: Significant Comparative Characteristics between Orphan and Nonorphan Genes in the Rice (Oryza sativa L.) Genome. Comp Funct Genomics. 2007, 21676-Google Scholar
- Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al: A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002, 296 (5565): 79-92. 10.1126/science.1068037.View ArticlePubMedGoogle Scholar
- Vanoosthuyse V, Miege C, Dumas C, Cock JM: Two large Arabidopsis thaliana gene families are homologous to the Brassica gene superfamily that encodes pollen coat proteins and the male component of the self-incompatibility response. Plant Mol Biol. 2001, 46 (1): 17-34. 10.1023/A:1010664704926.View ArticlePubMedGoogle Scholar
- Schopfer CR, Nasrallah ME, Nasrallah JB: The male determinant of self-incompatibility in Brassica. Science. 1999, 286 (5445): 1697-1700. 10.1126/science.286.5445.1697.View ArticlePubMedGoogle Scholar
- Takayama S, Shiba H, Iwano M, Shimosato H, Che FS, Kai N, Watanabe M, Suzuki G, Hinata K, Isogai A: The pollen determinant of self-incompatibility in Brassica campestris. Proc Natl Acad Sci USA. 2000, 97 (4): 1920-1925. 10.1073/pnas.040556397.PubMed CentralView ArticlePubMedGoogle Scholar
- Suzuki G, Kai N, Hirose T, Fukui K, Nishio T, Takayama S, Isogai A, Watanabe M, Hinata K: Genomic organization of the S locus: Identification and characterization of genes in SLG/SRK region of S(9) haplotype of Brassica campestris (syn. rapa). Genetics. 1999, 153 (1): 391-400.PubMed CentralPubMedGoogle Scholar
- Takasaki T, Hatakeyama K, Suzuki G, Watanabe M, Isogai A, Hinata K: The S receptor kinase determines self-incompatibility in Brassica stigma. Nature. 2000, 403 (6772): 913-916. 10.1038/35002628.View ArticlePubMedGoogle Scholar
- Kachroo A, Nasrallah ME, Nasrallah JB: Self-incompatibility in the Brassicaceae: receptor-ligand signaling and cell-to-cell communication. Plant Cell. 2002, 14 (Suppl): S227-238.PubMed CentralPubMedGoogle Scholar
- Takayama S, Shiba H, Iwano M, Asano K, Hara M, Che FS, Watanabe M, Hinata K, Isogai A: Isolation and characterization of pollen coat proteins of Brassica campestris that interact with S locus-related glycoprotein 1 involved in pollen-stigma adhesion. Proc Natl Acad Sci USA. 2000, 97 (7): 3765-3770. 10.1073/pnas.040580797.PubMed CentralView ArticlePubMedGoogle Scholar
- Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res. 2003, 13 (2): 137-144. 10.1101/gr.751803.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin H, Ouyang S, Egan A, Nobuta K, Haas BJ, Zhu W, Gu X, Silva JC, Meyers BC, Buell CR: Characterization of paralogous protein families in rice. BMC Plant Biol. 2008, 8: 18-10.1186/1471-2229-8-18.PubMed CentralView ArticlePubMedGoogle Scholar
- Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000, 300 (4): 1005-1016. 10.1006/jmbi.2000.3903.View ArticlePubMedGoogle Scholar
- Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408 (6814): 796-815. 10.1038/35048692.View ArticleGoogle Scholar
- Dangoor I, Peled-Zehavi H, Levitan A, Pasand O, Danon A: A small family of chloroplast atypical thioredoxins. Plant Physiol. 2009, 149 (3): 1240-1250. 10.1104/pp.108.128314.PubMed CentralView ArticlePubMedGoogle Scholar
- Glynn JM, Yang Y, Vitha S, Schmitz AJ, Hemmes M, Miyagishima SY, Osteryoung KW: PARC6, a novel chloroplast division factor, influences FtsZ assembly and is required for recruitment of PDV1 during chloroplast division in Arabidopsis. Plant J. 2009, 59 (5): 700-711. 10.1111/j.1365-313X.2009.03905.x.View ArticlePubMedGoogle Scholar
- Rensing SA, Kiessling J, Reski R, Decker EL: Diversification of ftsZ during early land plant evolution. J Mol Evol. 2004, 58 (2): 154-162. 10.1007/s00239-003-2535-1.View ArticlePubMedGoogle Scholar
- Huang S, Taylor NL, Narsai R, Eubel H, Whelan J, Millar AH: Experimental analysis of the rice mitochondrial proteome, its biogenesis, and heterogeneity. Plant Physiol. 2009, 149 (2): 719-734. 10.1104/pp.108.131300.PubMed CentralView ArticlePubMedGoogle Scholar
- Richly E, Leister D: An improved prediction of chloroplast proteins reveals diversities and commonalities in the chloroplast proteomes of Arabidopsis and rice. Gene. 2004, 329: 11-16. 10.1016/j.gene.2004.01.008.View ArticlePubMedGoogle Scholar
- Berardini TZ, Mundodi S, Reiser L, Huala E, Garcia-Hernandez M, Zhang P, Mueller LA, Yoon J, Doyle A, Lander G, et al: Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 2004, 135 (2): 745-755. 10.1104/pp.104.040071.PubMed CentralView ArticlePubMedGoogle Scholar
- Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. 2008, 133 (3): 523-536. 10.1016/j.cell.2008.03.029.PubMed CentralView ArticlePubMedGoogle Scholar
- Jost JP, Saluz HP: DNA Methylation: Molecular Biology and Biological Significance. 1993, Springer Science & BusinessView ArticleGoogle Scholar
- Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, et al: Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science. 2007, 317 (5836): 338-342. 10.1126/science.1138632.View ArticlePubMedGoogle Scholar
- TAIR: [http://www.arabidopsis.org]
- Xiao YL, Malik M, Whitelaw CA, Town CD: Cloning and sequencing of cDNAs for hypothetical genes from chromosome 2 of Arabidopsis. Plant Physiol. 2002, 130 (4): 2118-2128. 10.1104/pp.010207.PubMed CentralView ArticlePubMedGoogle Scholar
- Kusaba M, Dwyer K, Hendershot J, Vrebalov J, Nasrallah JB, Nasrallah ME: Self-incompatibility in the genus Arabidopsis: characterization of the S locus in the outcrossing A. lyrata and its autogamous relative A. thaliana. Plant Cell. 2001, 13 (3): 627-643. 10.1105/tpc.13.3.627.PubMed CentralView ArticlePubMedGoogle Scholar
- Bateman AJ: Self-incompatibility systems in angiosperms: III. Cruciferae. Heredity. 1955, 9: 52-68.Google Scholar
- Nasrallah JB: Cell-cell signaling in the self-incompatibility response. Curr Opin Plant Biol. 2000, 3 (5): 368-373. 10.1016/S1369-5266(00)00098-4.View ArticlePubMedGoogle Scholar
- Nasrallah ME, Liu P, Nasrallah JB: Generation of self-incompatible Arabidopsis thaliana by transfer of two S locus genes from A. lyrata. Science. 2002, 297 (5579): 247-249. 10.1126/science.1072205.View ArticlePubMedGoogle Scholar
- Fujimoto R, Okazaki K, Fukai E, Kusaba M, Nishio T: Comparison of the genome structure of the self-incompatibility (S) locus in interspecific pairs of S haplotypes. Genetics. 2006, 173 (2): 1157-1167. 10.1534/genetics.104.037267.PubMed CentralView ArticlePubMedGoogle Scholar
- Swanson R, Edlund AF, Preuss D: Species specificity in pollen-pistil interactions. Annu Rev Genet. 2004, 38: 793-818. 10.1146/annurev.genet.38.072902.092356.View ArticlePubMedGoogle Scholar
- Zuberi MI, Dickinson HG: Pollen-stigma interaction in Brassica. III. Hydration of the pollen grains. J Cell Sci. 1985, 76: 321-336.PubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al: The Pfam protein families database. Nucleic Acids Res. 2004, D138-141. 10.1093/nar/gkh121. 32 DatabaseGoogle Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.View ArticlePubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.PubMed CentralView ArticlePubMedGoogle Scholar
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003, 31 (13): 3497-3500. 10.1093/nar/gkg500.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin H, Zhu W, Silva JC, Gu X, Buell CR: Intron gain and loss in segmentally duplicated genes in rice. Genome Biol. 2006, 7 (5): R41-10.1186/gb-2006-7-5-r41.PubMed CentralView ArticlePubMedGoogle Scholar
- Haas BJ, Delcher AL, Wortman JR, Salzberg SL: DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 2004, 20 (18): 3643-3646. 10.1093/bioinformatics/bth397.View ArticlePubMedGoogle Scholar
- Storey JD: A direct approach to false discovery rates. Journal Of The Royal Statistical Society Series B. 2002, 64 (3): 479-498. 10.1111/1467-9868.00346.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.