Signature proteins for the major clades of Cyanobacteria

Background The phylogeny and taxonomy of cyanobacteria are currently poorly understood due to a paucity of reliable markers for identification and circumscription of its major clades. Results A combination of phylogenomic and protein signature based approaches was used to characterize the major clades of cyanobacteria. Phylogenetic trees were constructed for 44 cyanobacteria based on 44 conserved proteins. In parallel, Blastp searches were carried out on each ORF in the genomes of Synechococcus WH8102, Synechocystis PCC6803, Nostoc PCC7120, Synechococcus JA-3-3Ab, Prochlorococcus MIT9215 and Prochlorococcus marinus subsp. marinus CCMP1375 to identify proteins that are specific for various main clades of cyanobacteria. These studies have identified 39 proteins that are specific for all (or most) cyanobacteria and large numbers of proteins for other cyanobacterial clades. The identified signature proteins include: (i) 14 proteins for a deep branching clade (Clade A) of Gloeobacter violaceus and two diazotrophic Synechococcus strains (JA-3-3Ab and JA-2-3B'a); (ii) 5 proteins that are present in all other cyanobacteria except those from Clade A; (iii) 60 proteins that are specific for a clade (Clade C) consisting of various marine unicellular cyanobacteria (viz. Synechococcus and Prochlorococcus); (iv) 14 and 19 signature proteins that are specific for the Clade C Synechococcus and Prochlorococcus strains, respectively; (v) 67 proteins that are specific for the Low B/A ecotype Prochlorococcus strains, which contain a lower ratio of chl b/a2 and are adapted to growth at high light intensities; (vi) 65 and 8 proteins that are specific for the Nostocales and Chroococcales orders, respectively; and (vii) 22 and 9 proteins that are uniquely shared by various Nostocales and Oscillatoriales orders, or by these two orders and the Chroococcales, respectively. 
We also describe 3 conserved indels in flavoprotein, heme oxygenase and protochlorophyllide oxidoreductase proteins that are specific for either Clade C cyanobacteria or for various subclades of Prochlorococcus. Many other conserved indels for cyanobacterial clades have been described recently. Conclusions These signature proteins and indels provide novel means for circumscription of various cyanobacterial clades in clear molecular terms. Their functional studies should lead to discovery of novel properties that are unique to these groups of cyanobacteria.


Background
Cyanobacteria are the sole prokaryotic group that carries out oxygenic photosynthesis. The species from this phylum exhibit enormous diversity in terms of their morphology, physiology and other characteristics (e.g. motility, thermophily, cell division characteristics, nitrogen fixation ability, etc.) [1][2][3][4][5]. The taxonomy and evolutionary relationships among cyanobacteria are presently poorly understood. In the 16S rRNA trees, which provide the current basis for understanding microbial phylogeny, cyanobacterial species/strains form 14 unresolved clusters [6]. Although cyanobacteria is a large phylum with >4000 isolates [7], only a small number of species and higher taxonomic groups within this phylum have been validly described [8][9][10]. Except for 16S rRNA, sequence information for other cyanobacterial genes/proteins was until recently very limited. Hence, the availability of genome sequences has provided new opportunities for understanding cyanobacterial phylogeny and taxonomy. Based upon these sequences, several investigators have assembled phylogenetic trees for cyanobacteria from combined sequences for different large sets of proteins. These studies have included analyses of 14 cyanobacteria based upon 34 proteins by Sanchez-Baracaldo et al. [4], trees for 24 cyanobacteria based upon 583 orthologous proteins by Swingley et al. [11], and branching patterns of 13 cyanobacteria based upon 682 proteins by Shi and Falkowski [12]. Additionally, Zhaxybayeva et al. [13] have examined individual phylogenies of 1128 protein-coding genes from 11 cyanobacterial genomes to identify the phylogenetic signal exhibited by the plurality of these proteins and to recognize the incidence of lateral gene transfers. These studies have proven very useful in establishing the existence of certain important clades within the sequenced cyanobacteria and in clarifying their relative branching positions [4,11,12].
The studies of the above kind, although very useful, are limited to species whose genomes are sequenced. Further, as indicated by earlier work [4,11,12], integration of sequence information from any new genome by this approach requires reassembly of the entire phylogenomic tree(s). Based upon the phylogenomic approach it is also difficult to circumscribe various cyanobacterial clades in definitive biochemical or molecular terms, which is important for developing a stable taxonomy [14][15][16]. Hence, it is important to identify other reliable molecular markers that are consistent with the results of phylogenomic studies, but which can also be used to circumscribe different phylogenetic clades in more definitive (molecular) terms. One approach that has proven very useful in this regard consists of identifying molecular markers or synapomorphies that are specific for different phylogenetically defined clades. Two different kinds of molecular markers are proving very useful for these studies. The first of these consists of conserved inserts and deletions (indels) in widely distributed proteins that are distinctive characteristics of either a given phylum or its different main subgroups [17][18][19][20][21]. Our recent work has identified >40 conserved indels in important proteins that are exclusively present in either all cyanobacteria or many of its major clades that are observed in phylogenomic trees [22,23]. The presence of several of these indels in the plants/plastids homologs has also provided evidence for the derivation of plastids from cyanobacterial ancestors [22][23][24]. The second kind of molecular markers consists of whole proteins that are uniquely found in various species from a given phylogenetic clade [25][26][27][28]. Martin et al. [29] have earlier reported Blast analysis on 8 cyanobacterial genomes (6 finished and 2 unfinished) to identify 181 proteins that were uniquely found in at least 7 out of 8 of these cyanobacteria. 
A later study by Mulkidjanian et al. [30] on 15 cyanobacterial genomes identified 50 proteins that were uniquely present in at least 14 out of 15 cyanobacteria and 84 others that were exclusively present in plants/plastids and cyanobacteria.
These earlier studies primarily looked for proteins that were uniquely found in most cyanobacteria and no work was carried out on identifying proteins that are specific for various main clades of cyanobacteria, observed in phylogenetic trees. In the past 2-3 years, the number of sequenced cyanobacterial genomes has also more than doubled to a total of 36 genomes. Hence, it was of much interest to carry out both phylogenomic as well as gene content analyses on these genomes to identify signature proteins that are distinctive characteristics of either all cyanobacteria or its various main clades in the phylogenomic trees.

Phylogenomic/phylogenetic analyses on Cyanobacteria
Prior to undertaking studies on identifying proteins that are specific for different cyanobacterial clades, it was necessary to determine the branching pattern of sequenced cyanobacteria in phylogenetic trees. Although detailed phylogenetic studies have been previously reported for a limited number of cyanobacteria [4,11,12], sequence information for many other genomes has become available in the past 2-3 years (see Table 1). Hence, it was necessary to carry out phylogenetic studies on all of these cyanobacteria to determine their branching pattern. Phylogenetic trees are now commonly constructed based on concatenated sequences for a large number of proteins [4,11,12,31]. The main advantage of such trees is that, because they are based on large numbers of characters derived from many independent proteins, they are generally considered to provide a better reflection of organismal phylogeny than trees based on any single gene or protein, where the observed relationship could be affected by various factors including lateral gene transfer, differences in evolutionary rates among species, long branch attraction effect, etc. [32]. However, it should be recognized that trees based on concatenated sequences, due to the possibility of their lumping together gene sequences with discordant evolutionary histories, can sometimes result in unreliable inferences [32][33][34]. In the present work, phylogenetic trees were constructed based on a combined sequence alignment for 44 widely distributed proteins (see additional file 1) from 44 cyanobacterial species/isolates for which sequence information was available (see Materials and Methods). Most of these proteins carry out important housekeeping functions, and they are universally present in various species [35], making them a good choice for phylogenetic analysis.
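The idea of a combined (concatenated) alignment can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function name `concatenate_alignments`, the toy sequences and the choice to pad a missing protein with gap characters are all assumptions made for illustration.

```python
def concatenate_alignments(alignments, taxa, gap="-"):
    """Join per-protein alignments into one combined alignment.

    alignments: list of dicts mapping taxon name -> aligned sequence
                (all sequences within one dict must have equal length).
    taxa:       taxa to include; a taxon lacking a protein is padded
                with gap characters (an assumption of this sketch).
    """
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical toy data: two short "protein" alignments for three taxa.
toy = [
    {"taxon_1": "MK-L", "taxon_2": "MKAL"},
    {"taxon_1": "GG", "taxon_3": "GA"},
]
combined = concatenate_alignments(toy, ["taxon_1", "taxon_2", "taxon_3"])
```

Every taxon ends up with a sequence of the same total length, which is the form of input that tree-building programs expect. A real analysis would also remove poorly aligned regions before concatenation.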
A rooted maximum likelihood (ML) distance tree based on the combined sequences for these proteins is shown in Fig. 1 and a neighbour-joining (NJ) tree for the same dataset is provided as additional file 2. A number of distinct clades of cyanobacteria were observed in both these trees. Very similar branching patterns and the grouping of cyanobacterial species in various clades have been observed in earlier studies based on other large and independent datasets of protein sequences [4,11,12], giving confidence in the observed results. One of the observed clades, referred to here as Clade A, consists of Gloeobacter violaceus and Synechococcus sps. (JA-3-3Ab and JA-2-3B'a). The ML and NJ trees differ from each other in the branching position of this clade. In the ML tree, the Clade A species/strains formed the deepest branching lineage within cyanobacteria. In contrast, in the NJ tree, the cyanobacteria were divided into two main clades at the deepest level and Clade A formed the outermost branch of one of these clades, separated from all other species/strains by a long branch (additional file 2). However, the branching of Clade A in this position is not reliable, as in our recent studies based on the same dataset of protein sequences but with smaller numbers of cyanobacteria, the Clade A species/strains branched in the same position as seen here in the ML tree [23]. The deep branching of Clade A species/strains has also been observed in a number of earlier studies based on different datasets of protein sequences [4,6,11,12,23,[36][37][38][39]. Further strong and independent evidence that the Clade A species/strains constitute the earliest branching lineage within sequenced cyanobacteria is provided by our recent identification of several conserved indels in broadly distributed proteins (viz. an 18 aa insert in DNA polymerase I, a 4-5 aa insert in the tryptophan synthase beta chain, a 4 aa insert in tryptophanyl-tRNA synthetase and a 2 aa insert in the DNA polymerase III) [23]. The indicated conserved inserts in these proteins are commonly shared by all other sequenced cyanobacteria, but they are lacking in Clade A as well as all other phyla of bacteria [23]. The species distributions of these conserved indels strongly indicate that these synapomorphies were introduced in a common ancestor of various other cyanobacteria after the branching of Clade A. In a recent proposal for the classification of cyanobacteria, the thylakoid-lacking Gloeobacterales are placed into a separate subclass (Gloeobacterophycidae) [15]. It is unclear whether the Synechococcus sps. (JA-3-3Ab and JA-2-3B'a), which group with G. violaceus, also lack thylakoids or not.
Most other cyanobacteria could be grouped into two main clades in these trees. One of these clades (designated here as Clade B) is comprised of diverse cyanobacteria including Thermosynechococcus, Acaryochloris, as well as other cyanobacterial groups such as Chroococcales (Synechocystis/Crocosphaera/Microcystis/Cyanothece), Nostocales (Nostoc/Nodularia/Anabaena) and Oscillatoriales (Trichodesmium/Lyngbya) [15]. Within Clade B, a subclade comprising the Chroococcales, Nostocales and Oscillatoriales is also observed in both ML and NJ trees (Fig. 1 and additional file 2). The other main clade (Clade C) is composed entirely of different strains/isolates of marine unicellular Prochlorococcus and Synechococcus cyanobacteria. This latter clade has been referred to as the Syn/Pro clade [4] and it corresponds to the subclass Synechococcophycidae in the proposal by Hoffmann et al. [15]. Within Clade C, different Prochlorococcus and Synechococcus strains/isolates were not completely separated from each other. In particular, two of the Prochlorococcus strains, MIT 9303 and MIT 9313, branched within the Synechococcus strains/isolates, in both ML and NJ trees (Fig. 1 and additional file 2). Similar polyphyletic branching of these strains has been observed in earlier studies [12,23]. However, in both these trees, one subclade of Prochlorococcus strains, which is referred to as the low B/A ecotype subgroup [40,41], was separated from all other Prochlorococcus strains by a long branch. The branching position of the freshwater unicellular cyanobacterium Synechococcus elongatus (strains PCC 6301 and PCC 7942), although it appeared as a deep branching lineage of Clade C, was uncertain in these trees (discussed later).

Signature proteins for Cyanobacteria and its major subgroups
These phylogenetic trees provide a framework for identifying proteins that are specific for either all cyanobacteria or their different well-resolved clades. Based upon earlier studies, within any given group of bacteria or organisms, signature proteins are present at various phylogenetic depths [25,27,28,[42][43][44]. Hence, to identify proteins that are specific for different main clades of cyanobacteria, Blastp searches were carried out on each ORF in the genomes of the following 6 cyanobacteria: Synechococcus sp. WH8102, Synechocystis sp. PCC6803, Nostoc sp. PCC7120, Synechococcus sp. JA-3-3Ab, Prochlorococcus sp. MIT9215 and Pro. marinus subsp. marinus str. CCMP1375. These cyanobacteria are present at the tips of various clades in phylogenetic trees ( Fig. 1 and additional file 2). Hence, blast searches with the proteins in them should enable us to identify proteins that are specific for various main clades of cyanobacteria at different phylogenetic depths. The results of these studies are summarized below.
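The logic of such a search can be sketched in Python as follows. This is an illustrative outline only and not the procedure used in this work: the E-value cutoff, the `max_missing` allowance and the function names are assumptions, and the hit lists are invented.

```python
E_VALUE_CUTOFF = 1e-4  # assumed threshold, not a value taken from this paper


def genomes_with_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Return the genomes that have at least one significant Blastp hit.

    hits: iterable of (genome_name, e_value) pairs for one query protein.
    """
    return {genome for genome, e_value in hits if e_value <= cutoff}


def classify_query(hits, clade, max_missing=2):
    """Label a query protein by how its significant hits map onto a clade.

    clade:       set of genome names that make up the clade of interest.
    max_missing: how many clade members may lack the protein while it is
                 still counted as present in "most" members (assumption).
    """
    present = genomes_with_hits(hits)
    if present - clade:
        return "not clade-specific"  # significant hit outside the clade
    missing = clade - present
    if not missing:
        return "specific, all members"
    if len(missing) <= max_missing:
        return "specific, most members"
    return "limited to a subset of the clade"


# Hypothetical example using three Clade C genomes named in the text.
clade_c = {"WH8102", "MIT9215", "CCMP1375"}
example_hits = [("WH8102", 1e-50), ("MIT9215", 1e-30),
                ("CCMP1375", 1e-20), ("PCC6803", 0.5)]
label = classify_query(example_hits, clade_c)
```

A protein whose hits fall only within a subset of the clade may still be a signature for a subclade, which is why query genomes from several clades are needed to recover signatures at different phylogenetic depths.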

Signature proteins that are specific for Cyanobacteria
Blast searches on the above genomes have identified 39 proteins that are specific for cyanobacteria and which are present in virtually all of the sequenced genomes (Table 2a). Thirty-three of these proteins are present in all sequenced cyanobacteria (Table 2a) whereas the remaining 6 (marked with *) are missing in 1-2 isolated species/strains. The homologs of some of these proteins are also found in a few algae or plants. Because of their specific presence in practically all cyanobacteria, but generally no other bacteria, these proteins could be regarded as the cyanobacterial signature proteins. The number of cyanobacterial signature proteins identified in the present work is much smaller than those reported in earlier studies [29,30]. However, this difference is mainly due to the large increase in the number of sequenced cyanobacterial as well as other genomes in the past few years. In earlier work, we have also described 15 conserved indels in broadly distributed proteins that are distinctive characteristics of all available cyanobacteria and which are not found in any other bacterial groups/phyla [22,23].
These analyses have also identified 5 proteins whose homologs are present in all other cyanobacteria, except those from Clade A (Table 2b). Based upon solely the genomic distributions of these proteins, it is difficult to interpret whether the genes for these proteins first evolved in a common ancestor of all cyanobacteria followed by their loss in Clade A species/strains, or they originally evolved in a common ancestor of the Clade B and C cyanobacteria after the branching of Clade A. However, based upon the results of phylogenomic analyses, and more importantly the species distribution patterns of several conserved indels in widely distributed proteins that provide evidence that the Clade A is ancestral to other cyanobacteria [23], the most parsimonious explanation for the observed distribution of these genes is that they first evolved in a common ancestor of the Clade B and C cyanobacteria, as indicated in Fig. 2. Table 2c lists 13 other proteins for which high scoring homologs are present in all (or most) cyanobacteria from Clades A and B, but which are lacking in Clade C strains/isolates. Because of the deep branching of Clade A, it is likely that the genes for these proteins also first evolved in a common ancestor of cyanobacteria, followed by their loss in an ancestor of Clade C. The alternate possibility that the Clade A and B cyanobacteria shared a common ancestor exclusive of Clade C is not supported by the species distribution pattern of conserved indels in several proteins, as noted above. Blast searches with proteins in the genome of Synechococcus sp. JA-3-3Ab have also identified 14 proteins that are specific for the Clade A cyanobacteria (additional file 3). The Clade A species/strains can also be distinguished from other cyanobacteria based upon a 15 aa conserved insert in the protein synthesis elongation factor-G that is specific for this clade [23].
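The parsimony reasoning used above can be made concrete with a small Python sketch. It applies Dollo parsimony (a gene is gained once and may then be lost), which is a formalization chosen here for illustration and not a method stated in this work. The three-clade tree, written as nested tuples, encodes the assumption that Clade A branches first.

```python
def leaves(node):
    """Return the set of tip names below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def dollo_events(tree, present):
    """Place a single gain and count losses for one presence pattern.

    Returns (tips below the gain point, list of lost clades), where each
    lost clade is a frozenset of tip names.
    """
    present = set(present)
    if not present or not present <= leaves(tree):
        raise ValueError("present must be a non-empty subset of the tree tips")

    # The gain is placed at the smallest subtree containing all carriers.
    node = tree
    while not isinstance(node, str):
        containing = [child for child in node if present <= leaves(child)]
        if not containing:
            break
        node = containing[0]

    def losses_below(n):
        tips = leaves(n)
        if not tips & present:
            return [frozenset(tips)]  # whole subtree lacks the gene: one loss
        if isinstance(n, str):
            return []
        found = []
        for child in n:
            found.extend(losses_below(child))
        return found

    return leaves(node), losses_below(node)


# Assumed topology: Clade A is the outgroup of Clades B and C.
TREE = ("Clade A", ("Clade B", "Clade C"))
gain_bc, losses_bc = dollo_events(TREE, {"Clade B", "Clade C"})
gain_ab, losses_ab = dollo_events(TREE, {"Clade A", "Clade B"})
```

Under these assumptions, proteins present only in Clades B and C need a single gain in their common ancestor and no loss, whereas proteins present in Clades A and B need a gain in the common ancestor of all three clades followed by one loss in Clade C.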

Signature proteins for the Clade B cyanobacteria
Clade B comprises the majority of known cyanobacteria except the unicellular marine cyanobacteria (Clade C) and some deep branching cyanobacteria (see Fig. 1). This clade as defined in our work includes all of the species/strains from the orders Chroococcales, Nostocales and Oscillatoriales as well as the deeper branching cyanobacteria, A. marina and Thermosyn. elongatus. Of these latter cyanobacteria, Acaryochloris is unique in containing chlorophyll d as its primary photosynthetic pigment [45], whereas Thermosynechococcus is a unicellular thermophilic cyanobacterium [46]. Our analyses have identified 38 proteins that are uniquely shared by all or most of the species/strains from this clade. Two of the Synechococcus strains, viz. PCC7002 and PCC7335, also consistently appeared in this group and of these Synechococcus PCC7002, for which sequence information was available from various cyanobacteria, branched with the Chroococcales in phylogenetic trees (Fig. 1 and additional file 2).
The branching position of Syn. elongatus (strains PCC 6301 and PCC 7942) is not resolved in phylogenetic trees [4,11,12,37,47]. It generally branches in between the Clades B and C species/strains in phylogenetic trees (Fig. 1, additional file 2) [23]. Our analyses have identified 22 proteins which, in addition to various Clade B cyanobacteria, are also found in Syn. elongatus (Table 3b). The shared presence of the proteins listed in Table 3b suggests that Syn. elongatus and Clade B cyanobacteria probably shared a common ancestor exclusive of other cyanobacteria.
Within Clade B, the cyanobacterial species/strains belonging to the orders Nostocales, Oscillatoriales and Chroococcales form a distinct clade (NOC clade) in phylogenetic trees (Fig. 1 and additional file 2). This clade has been referred to as the SPM clade in earlier work [4,47]. We have recently described a number of conserved indels in important proteins (viz. a 19 aa insert in DnaE protein, a 13 aa deletion in GDP-mannose pyrophosphorylase and a 22 aa insert in NAD(P)H-quinone oxidoreductase subunit D) that are distinctive characteristics of this clade of cyanobacteria [23]. In the present work, we have identified 9 proteins (Table 4a) that are also uniquely present in all of the species/strains from the NOC clade of cyanobacteria. In addition, 33 other proteins listed in the additional file 4 are also specific for the NOC clade, but they are missing in some species/strains. Within the NOC clade, species/strains belonging to the orders Nostocales and Oscillatoriales exhibit a closer relationship in phylogenetic trees (Fig. 1 and additional file 2). A 4 aa deletion in the translation initiation factor IF-2 is also uniquely shared by various sequenced cyanobacterial species/strains from these two orders [23]. In this study, we have come across 22 proteins that are specifically present in various sequenced species/strains from these two orders of cyanobacteria (Table 4b), providing further support that these two groups are more closely related.
Figure 2 An interpretive cladogram indicating the evolutionary stages where genes for different signature proteins described in this work, which are specific for different groups of cyanobacteria, likely evolved. Many conserved indels that are specific for the same groups/clades of cyanobacteria have also been described in recent work [23].
Within Clade B, the heterocyst-forming cyanobacteria form a monophyletic group (subclass Nostocophycidae) [6,10,47,50]. We recently described two conserved indels (a 4 aa insert in the PetA protein, a precursor of the apocytochrome f, and a 5 aa insert in the ribosomal protein S3) that are specific for these bacteria [23]. In the present work, blast searches on the genome of Nostoc sp. PCC7120 have identified 65 proteins that are uniquely shared by all of the sequenced Nostocales species/strains (Nostoc, Anabaena and Nodularia) (Table 4d and additional file 5). Fifty-eight additional proteins listed in the additional file 5 are also specific for this order, but they are missing in 1-2 species/strains. These proteins provide potential molecular signatures for the Nostocales order (Nostocophycidae subclass).
[Table footnotes] # also found in one of the Clade A cyanobacteria; * missing in 1-2 species/strains; + additional proteins that are specific for Nostocales are listed in the additional file 5.
The cyanobacteria such as Synechocystis, Microcystis, Crocosphaera and Cyanothece, belonging to the order Chroococcales, form another well-defined clade in phylogenetic trees (see Fig. 1 and additional file 2) [4,11,12,37,47]. A 1 aa insert in a highly conserved region of the RecA protein is also specific for these cyanobacteria [23]. This insert is also present in Synechococcus sp. PCC7002, which branches with this clade in the phylogenetic trees (see Fig. 1 and additional file 2) [4,47]. In this work, we have identified 8 proteins that are uniquely present in various sequenced Chroococcales species/strains (Table 4c). The evolutionary stages where the genes for these proteins have likely evolved are indicated in the interpretive diagram (Fig. 2).
We have recently described a number of conserved indels in widely distributed proteins that are specific for all of the species/strains from Clade C [23]. These signatures include a 3 aa insert in the RNA polymerase beta subunit, a 2 aa insert in the protein KsgA, a 6 aa insert in tyrosyl-tRNA synthetase, a 2 aa insert in the tRNA (guanine-N1-)-methyltransferase, a 1 aa insert in the RNA polymerase β' subunit and a 12 aa insert in the DNA polymerase I [23]. These signature indels are not found in the Clades A or B cyanobacteria or other phyla of bacteria. Additionally, they are also absent in Syn. elongatus as well as Synechococcus sps. PCC7002 and PCC7335. Another example of a signature insert that is specific for Clade C species/strains is presented in Fig. 3. In this case, a 6 aa insert in a flavoprotein is commonly present in all Clade C species/strains, but absent from all other cyanobacteria as well as other bacteria. This latter observation indicates that this indel is an insert in the Clade C species/strains. Interestingly, this insert and also several of the other Clade C signature indels are also present in Cyanobium sp. PCC7001 (Fig. 3), supporting its placement within Clade C (Fig. 2) [4,15].
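How a clade-specific insert appears in a sequence alignment can be shown with a short Python sketch. It is a simplified illustration rather than the procedure used to find the indels in this work: the sequences are invented, the function name is hypothetical, and a real analysis would also require conserved flanking regions on both sides of the indel.

```python
def clade_specific_inserts(alignment, clade, gap="-"):
    """Find alignment regions where only the clade members carry residues.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    clade:     set of names treated as the ingroup.
    Returns a list of (start_column, length) tuples, 0-based.
    """
    ingroup = [name for name in alignment if name in clade]
    outgroup = [name for name in alignment if name not in clade]
    if not ingroup or not outgroup:
        raise ValueError("need at least one ingroup and one outgroup sequence")
    length = len(alignment[ingroup[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("aligned sequences must have equal length")

    runs, start = [], None
    for col in range(length):
        in_has_residue = all(alignment[n][col] != gap for n in ingroup)
        out_has_gap = all(alignment[n][col] == gap for n in outgroup)
        if in_has_residue and out_has_gap:
            if start is None:
                start = col  # a candidate insert begins here
        elif start is not None:
            runs.append((start, col - start))
            start = None
    if start is not None:
        runs.append((start, length - start))
    return runs


# Invented toy alignment: two ingroup and two outgroup sequences.
toy_alignment = {
    "ingroup_1": "MKAGSTWQLV",
    "ingroup_2": "MKAGSSWQLV",
    "outgroup_1": "MK------LV",
    "outgroup_2": "MR------LV",
}
inserts = clade_specific_inserts(toy_alignment, {"ingroup_1", "ingroup_2"})
```

Whether such a region is an insert in the ingroup or a deletion in the other sequences cannot be read from the alignment alone. As in the text above, it is inferred from the state found in the outgroup species, here other cyanobacteria and other bacteria.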
Our blast analyses on proteins from the genomes of Synechococcus sp. WH8102, Prochlorococcus sp. MIT9215 and Pro. marinus subsp. marinus str. CCMP1375 have identified 60 proteins that are uniquely shared by virtually all of the species/strains from Clade C cyanobacteria (Table 5a). These signature proteins provide further evidence and molecular markers indicating the distinctness of Clade C. Eight additional proteins in Table 5b are also specific for Clade C cyanobacteria, but they are absent in all of the low B/A ecotype Prochlorococcus strains, indicating that the genes for these proteins were lost from a common ancestor of the low B/A clade.
As noted earlier, in phylogenetic trees, the branching position of Syn. elongatus is not resolved. In our analyses, we have come across only 3 proteins (marked with + in Table 5a) that are uniquely found in Clade C species/strains as well as Syn. elongatus. This is in contrast to 22 proteins that are uniquely shared by Clade B cyanobacteria and Syn. elongatus (Table 3b). These observations in conjunction with the unique presence of split DnaE genes in Clade B cyanobacteria and Syn. elongatus make a strong case that Syn. elongatus is more closely related to the Clade B cyanobacteria than to the Clade C species/strains. The two genera, Prochlorococcus and Synechococcus, which make up most of the Clade C cyanobacteria, differ from each other in important respects, particularly with regard to the main pigments in their light harvesting systems [40,41]. In contrast to various Synechococcus strains/isolates and most other cyanobacteria, which contain chlorophyll a and phycobiliproteins as the major pigments in their photosynthetic systems, all Prochlorococcus strains/isolates utilize divinyl chlorophyll a and both mono- and divinyl chlorophyll b as the main pigments in their light-harvesting systems [40,41]. Further, while Synechococcus isolates are ubiquitous in different aquatic environments including estuarine, coastal and offshore waters [53], Prochlorococcus strains are mainly found in warm oligotrophic oceanic settings [40]. Among the sequenced cyanobacteria, Prochlorococcus strains/isolates have the smallest genomes (see Table 1). Although Prochlorococcus are indicated to be polyphyletic in phylogenetic analyses (with strains MIT 9303 and MIT 9313 branching within the Synechococcus strains/isolates; see Fig. 1 and additional file 2) [12,23,33], our blast searches have identified 19 proteins that are uniquely shared by all or most of the Prochlorococcus strains (Table 6b). 
These results indicate that despite their polyphyletic branching in phylogenetic trees, all Prochlorococcus strains/isolates form a monophyletic clade, which is in accordance with their distinctive photosynthetic pigment composition. In this work, we also describe a 2 aa conserved insert in the protein heme oxygenase that is also exclusively present in various Prochlorococcus strains (Fig. 4). The unique presence of this insert in various Prochlorococcus strains provides further evidence that this group is monophyletic. The enzyme heme oxygenase, which contains this conserved insert, plays an important role in the biosynthesis of the photosynthetic pigments phytochromobilin and phycobilins [54]. Because Prochlorococcus are unique in terms of their photosynthetic pigment composition, it is of much interest to determine the functional significance of this conserved indel.
If Prochlorococcus strains/isolates form a monophyletic lineage, then one would expect that the other cyanobacteria that are part of Clade C might also share many unique proteins in common. Indeed, our blast searches have identified 14 proteins that are uniquely present in various other cyanobacteria (mostly Synechococcus strains) that are part of Clade C (Table 6a). It should be mentioned that for several of these proteins, blast hits indicating significant similarity are also found for Cyanobium sp. PCC7001 and Paulinella chromatophora, indicating that these cyanobacteria are also part of Clade C. The grouping of Cyanobium sp. PCC7001 with Clade C is also supported by the conserved indel in the flavoprotein (see Fig. 3).
As noted above, in phylogenetic trees based on concatenated protein sequences Prochlorococcus str. MIT9303 and MIT9313 branch within the various Synechococcus strains/isolates (Fig. 1 and additional file 2). Earlier phylogenetic studies by Rocap et al. [41] based on the 16S-23S rDNA spacer region indicate that these two strains (high B/A clade IV) form the deepest branching isolates of this genus. Further, in contrast to other sequenced Prochlorococcus strains, whose G+C content ranges from 30-39%, the strains MIT9303 and MIT9313 have much higher G+C content (~50%) (see Table 1). Our blast analyses, in addition to identifying many proteins that are unique to various Synechococcus strains/isolates, have also identified 22 proteins that are specifically present in all of the Clade C Synechococcus strains as well as in Prochlorococcus MIT9303 and MIT9313 (additional file 6a). At the same time, we have come across 37 proteins that are uniquely found in all other sequenced Prochlorococcus strains, but which are missing in MIT9303 and MIT9313 (additional file 6b). In addition, we have also identified a 1 aa deletion in a conserved region of the protein protochlorophyllide oxidoreductase (POR) that is uniquely shared by all other Prochlorococcus strains except MIT9303 and MIT9313 (Fig. 5). The enzyme POR is responsible for catalyzing light driven reduction of protochlorophyllide to chlorophyllide, a key regulatory reaction in the chlorophyll biosynthetic pathway [55]. Hence, it is again of much interest to understand the functional significance of this conserved indel. The rare genetic change leading to this indel likely occurred in a common ancestor of various Prochlorococcus strains after the branching of MIT9303 and MIT9313 (Fig. 2). 
These observations, in conjunction with the branching pattern of these strains in phylogenetic trees, provide evidence that these two Prochlorococcus strains comprise the deepest branching group (high B/A clade IV) [41] within the Prochlorococcus genus, exhibiting the closest relationship to the Synechococcus strains/isolates. Earlier studies have led to the division of Prochlorococcus strains/isolates into two physiologically distinct groups (high B/A and low B/A ecotypes), based upon the ratios of chlorophyll b and a2 in their light-harvesting systems and their ability to grow at different light intensities [40,41,56]. Of these two groups, strains from the high B/A ecotype, which have a larger ratio of chlorophyll b/a2, are able to grow at extremely low irradiance, whereas those from the low B/A ecotype, containing a lower ratio of chlorophyll b/a2, are unable to grow under these conditions. The low B/A ecotype strains instead are adapted to growth at high light intensities, where the growth of high B/A ecotype strains is inhibited. The strains from these two ecotypes also differ in terms of their sensitivity to copper and their ability to use nitrite or nitrate as nitrogen sources [41,57]. In phylogenetic trees, the low B/A ecotype Prochlorococcus isolates (viz. MIT9515, CCMP1986, MIT9312, MIT9215, MIT9301 and AS9601) formed a distinct subclade that was well separated from all other Clade C species/strains by a long branch and 100% bootstrap score (Fig. 1 and additional file 2) [23,41]. We have also described two conserved indels (viz. a 5 aa deletion in leucyl-tRNA synthetase and a 1 aa insert in the Ffh protein) that are uniquely shared by all of the low B/A ecotype Prochlorococcus strains [23]. In the present work, we have identified 67 proteins that are exclusively found in all of the sequenced strains from the low B/A ecotype clade (additional file 7a). 
Seventy-two proteins listed in the additional file 7b are also specific for this clade, but they are missing in 1-2 of the strains/isolates. These signature proteins and indels together with the distinct branching of the low B/A strains in phylogenetic trees provide strong evidence that this group of Prochlorococcus strains is phylogenetically, physiologically and molecularly distinct from all other Prochlorococcus strains. Based upon species distribution patterns of various cyanobacteria-specific proteins, evolutionary stages where the genes for these proteins likely evolved are indicated in the interpretive diagram in Fig. 2.

[...] and JA-2-3B'a) and these proteins are absent in those strains. Besides Synechococcus, homologs of many of these proteins are also found in Cyanobium sp. PCC7001 as well as in Paulinella chromatophora, indicating that these species may also belong to the Clade C cyanobacteria.

Discussion and Conclusions
In this work, we have used a combination of phylogenomic and signature protein based approaches to elucidate the evolutionary relationships among cyanobacteria. Phylogenetic trees were initially constructed for 44 cyanobacteria based on concatenated sequences for 44 widely distributed proteins present in various cyanobacteria. The branching pattern of cyanobacteria in these trees was very similar to that observed in other recent studies based on different large sets of proteins for smaller numbers of cyanobacteria [4,11,12]. In all of these trees a number of distinct clades of cyanobacteria are consistently observed. However, the main focus of the present work was on comparative analyses of cyanobacterial genomes to identify unique sets of genes/proteins that are limited to particular groups of cyanobacteria, corresponding to various phylogenetically identified clades. This work complements our recent studies, where a comparative genomic approach was employed to identify >40 conserved indels in widely distributed proteins that are also specific for the same groups/clades of cyanobacteria [23]. Recent analyses of genomic sequences have revealed that whole proteins that are limited to different monophyletic clades are present at different phylogenetic depths [26-28,43,44,58,59]. Unlike ORFan proteins, which are unique to a given species or strain and are subject to rapid gene loss [44,60,61], these lineage-specific proteins are retained in a conserved state by all or most species/strains from a given clade, indicating that they confer a selective advantage on species from these clades [28,58,62]. Although the mechanism responsible for the evolution or acquisition of genes for these proteins is unclear [28,61], their specific presence in different clades indicates that the genes for these proteins first evolved (or were introduced) in a common ancestor of these clades, followed by their retention by various descendants of these clades. 
Because of their clade specificity, these lineage-specific proteins or conserved signature proteins (CSPs) provide valuable molecular markers for these clades [26-28,43,59]. Our recent analyses of CSPs from several major groups of bacteria (viz. alpha proteobacteria, epsilon proteobacteria, gamma proteobacteria, chlamydiae, Bacteroidetes-Chlorobi and Actinobacteria) provide evidence that the species distribution of most of these CSPs shows a high degree of concordance with different clades in the phylogenetic trees [25-27,42,63,64]. This inference is strongly reinforced by the results of the present study, where most of the identified CSPs correspond to well-defined clades in the phylogenetic trees.
It should be mentioned that in our analyses we have not come across significant numbers of CSPs that support alternate groupings, i.e. where the proteins are commonly shared by various species/strains from clades that are phylogenetically unrelated (e.g. Nostocales and Clade C, or Oscillatoriales and Clade C). However, one commonly observed pattern is that if two clades are close to each other in phylogenetic trees, but their branching is not clearly resolved (i.e. weakly supported by bootstrap scores), then in addition to many proteins that are unique to each of these two clades, several proteins that are commonly shared by them are also observed. This could be because the genes for many of these proteins evolved in a common ancestor of these clades before they became phylogenetically distinct, or because of lateral gene transfers among closely related taxa [13,65]. Nevertheless, our finding that most of these proteins are distinctive characteristics of phylogenetically well-defined monophyletic clades strongly suggests that their species distribution has not been significantly affected by lateral gene transfer, which is reported to be very common in cyanobacteria [13,66].
When a protein is confined to only a certain group of species/strains, it is difficult to determine from this information alone whether the group of species containing this protein forms a clade in the phylogenetic sense. To properly evaluate the results of such studies, it is necessary to carry them out in conjunction with phylogenetic as well as other forms of analyses (e.g. studies based on conserved indels), where it is possible to establish a rooted relationship among the different groups or taxa under consideration [23,26,59]. Based on these studies, if a given protein is uniquely found in all or most of the species from a well-defined monophyletic clade, and generally nowhere else, then the simplest and most parsimonious explanation is that the gene for this protein first appeared in a common ancestor of this group and was then passed on vertically to its various descendants [17,20,67]. We have interpreted the results of species distribution of various unique proteins based on this minimal assumption. Based on this interpretation, the various identified signature proteins or CSPs could be regarded as molecular synapomorphies that are specific for different clades of cyanobacteria.
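The reasoning above, that a protein's species distribution is informative only when it is checked against a rooted tree, can be illustrated with a minimal sketch. The nested-tuple tree encoding, the function names and the toy taxon labels are assumptions made for illustration; the toy topology is not the tree shown in Fig. 1 or Fig. 2.

```python
def clade_leaf_sets(tree):
    """Return the leaf set of every node in a rooted tree.

    The tree is encoded as nested tuples whose leaves are taxon names
    (a hypothetical encoding chosen for this sketch).
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def is_monophyletic(tree, taxa):
    """True if the taxa are exactly the descendants of one node of the tree."""
    return frozenset(taxa) in clade_leaf_sets(tree)


# Toy rooted tree with made-up taxon labels.
toy_tree = ("outgroup", (("A1", "A2"), (("B1", "B2"), ("C1", "C2"))))

# A protein found only in C1 and C2 maps onto a single node, so one gain in
# their common ancestor explains its distribution.
supports_single_gain = is_monophyletic(toy_tree, {"C1", "C2"})

# A protein found only in B1 and C1 does not map onto any node, so it would
# need gene loss or lateral transfer to explain.
needs_extra_events = not is_monophyletic(toy_tree, {"B1", "C1"})
```

Proteins missing in a few members of a clade would fail this strict test; a tolerance for such losses would be a separate design choice.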
The branching order and interrelationships among cyanobacteria that emerge from all of these different approaches are shown in Fig. 2. All of these approaches indicate that a clade consisting of Gloeobacter and the Synechococcus strains JA-3-3Ab and JA2-3-B'a (Clade A) forms the deepest branching lineage within cyanobacteria. A large number of sequenced cyanobacteria correspond to marine unicellular Synechococcus and Prochlorococcus strains (Clade C). We have identified numerous proteins and conserved indels that are specific for this clade. Although Synechococcus and Prochlorococcus strains do not form monophyletic clusters in phylogenetic trees, the shared presence of many novel proteins as well as some conserved indels by various Prochlorococcus strains provides evidence that this group is monophyletic. The unique pigments that are found in the light-harvesting system of Prochlorococcus also support their distinctness from other cyanobacteria. The monophyletic grouping of marine unicellular Synechococcus strains/isolates based upon these molecular and biochemical characteristics is at variance with their polyphyletic branching in different phylogenetic trees (see Fig. 1, additional file 2) [4,11,23]. This discordance could be explained either by lateral migration of the genes responsible for these characteristics [11,13,33,68], or by the inability of the phylogenetic trees to resolve the branching order among closely related species/strains. Among the Prochlorococcus strains, our analyses confirm that the strains corresponding to the low B/A ecotype are distinct not only in physiological and phylogenetic terms [40,41,56], but that they also share large numbers of proteins that are unique to them. Several conserved indels that are specific for the low B/A ecotype clade have also been identified [23]. A recent study by Zhaxybayeva et al. [33] also provides evidence that the high-light adapted low B/A ecotype Prochlorococcus strains form a monophyletic clade, in contrast to the paraphyletic grouping of the low-light adapted (i.e. high B/A ecotype) Prochlorococcus spp. [33]. All of these observations make a strong case for the recognition of low B/A ecotype Prochlorococcus strains as a distinct taxonomic entity.
Within Clade B, many CSPs were identified that are specific for the Nostocales and Chroococcales orders. In addition, several other CSPs are uniquely present in the Nostocales and Oscillatoriales orders, or in the Nostocales, Oscillatoriales and Chroococcales. In recent work, a number of conserved indels that are unique to these orders of cyanobacteria have also been identified [23]. Although the clade comprising these cyanobacterial orders is not clearly resolved in phylogenetic trees [4,11], the shared presence of large numbers of novel CSPs as well as some conserved indels in these cyanobacteria strongly suggests that species/strains from these groups shared a common ancestor exclusive of other cyanobacteria and that this clade represents a deeper branching grouping within cyanobacteria. The results presented here also suggest that Syn. elongatus is more closely related to Clade B than to either Clade A or Clade C of cyanobacteria.
The signature proteins and conserved indels for different cyanobacterial clades that are described in this work and in our recent studies [23] provide novel and powerful means for understanding cyanobacterial phylogeny and taxonomy. Based on these molecular markers, all of the main clades of cyanobacteria can now be identified and circumscribed in molecular terms. These signature proteins and indels should also prove useful for the identification and assignment of cyanobacterial species/strains to specific clades based upon the presence or absence of various signature indels or CSPs. Because many of these CSPs, or proteins containing the conserved indels, are highly conserved, degenerate PCR primers could be readily designed to sequence the corresponding genes/proteins from any given cyanobacterium. The assignment of any species/strain to a given clade by this approach is based upon several independent signatures that provide complementary information. Some of these signatures serve to exclude a given species/strain from particular groups or clades, whereas others point to its inclusion in more and more specific clades. Blast searches with these cyanobacteria-specific CSPs should also prove useful in determining the presence or absence of different groups of cyanobacteria in metagenomic sequences [69].
Most of the cyanobacterial signature proteins identified in this work are of unknown function. However, the retention of these genes by all cyanobacteria from the indicated clades strongly suggests that these proteins perform important functions in these groups of cyanobacteria [70-72]. Likewise, our recent work shows that the conserved indels in protein sequences are also essential for the group or clade of species where they are found [73]. Hence, further work on understanding the cellular functions of these cyanobacterial signature proteins and signature indels should be of great interest. 
These studies should provide valuable insights regarding biochemical and physiological characteristics that are unique to different clades of cyanobacteria [64,74-76].
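The assignment of a strain to clades from the presence or absence of several independent signatures, as outlined above, can be sketched as follows. This is only an illustration: the function names, the signature identifiers and the `min_fraction` cutoff are assumptions, not values or procedures taken from this work.

```python
def score_clades(detected, signatures):
    """Fraction of each clade's signature proteins detected in a genome.

    detected:   set of signature ids with a significant hit in the genome.
    signatures: dict mapping a clade name to its set of signature ids
                (hypothetical identifiers).
    """
    return {
        clade: len(detected & sigs) / len(sigs)
        for clade, sigs in signatures.items()
        if sigs
    }


def assign_clades(detected, signatures, min_fraction=0.8):
    """Clades whose signatures are mostly present; the cutoff is an assumption."""
    scores = score_clades(detected, signatures)
    return sorted(clade for clade, score in scores.items() if score >= min_fraction)


# Made-up signature sets for a broad clade, a nested subclade and an
# unrelated order.
toy_signatures = {
    "Clade C": {"c1", "c2", "c3", "c4", "c5"},
    "low B/A": {"l1", "l2"},
    "Nostocales": {"n1", "n2"},
}
toy_genome = {"c1", "c2", "c3", "c4", "l1", "l2", "unrelated"}
assigned = assign_clades(toy_genome, toy_signatures)
```

Here the toy genome is placed in both the broad clade and the nested subclade, while the absence of the Nostocales signatures excludes it from that order, which mirrors the use of some signatures for exclusion and others for inclusion in more specific clades.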

Phylogenetic/phylogenomic analyses
Phylogenetic analyses were carried out on a set of 44 proteins involved in important housekeeping functions that are present in most organisms (see Additional file 1) [35]. Blast searches with these proteins revealed that their homologs were present in all 34 sequenced cyanobacterial genomes (listed in Table 1), the two outgroup species (Bacillus subtilis and Staphylococcus aureus), as well as 10 other cyanobacteria (viz. Crocosphaera watsonii WH8501, Cyanothece sp. CCY0110, Lyngbya sp. PCC8106, Microcystis aeruginosa PCC7806, Nodularia spumigena CCY9414, Synechococcus sp. WH5701, Synechococcus sp. BL107, Synechococcus sp. RS9917, Synechococcus sp. RS9916 and Synechococcus sp. WH7805). Hence, sequence information for all of these cyanobacteria was included in our analyses. The multiple sequence alignments for these proteins were created using the ClustalX 1.83 program [77] and they were concatenated into a single large file. This unedited sequence alignment was imported into the Gblocks 0.91b program to remove poorly aligned regions [78]. This program was used with default settings, except that the allowed gap positions parameter was changed to half. The resulting final alignment of 16834 amino acid sites was used for phylogenetic analyses. A neighbour-joining (NJ) tree based on 1000 bootstrap replicates was constructed with the Kimura model [79] using the TREECON 1.3b program [80]. The maximum-likelihood (ML) analysis was carried out using the WAG+F model with a gamma distribution of evolutionary rates with four categories, using the TREE-PUZZLE program with 10000 puzzling steps [81].
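The concatenation and trimming steps described above can be illustrated with a minimal sketch. The column filter below is a deliberately simplified stand-in that only removes columns with gaps in more than half of the sequences; Gblocks applies additional conservation and block-length criteria that are not reproduced here. The function names and the toy sequences are hypothetical.

```python
def concatenate(alignments):
    """Join per-protein alignments into one alignment per taxon.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    Every alignment must contain the same taxa and have equal-length rows.
    """
    taxa = sorted(alignments[0])
    for aln in alignments:
        if set(aln) != set(taxa):
            raise ValueError("all alignments must contain the same taxa")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences within an alignment differ in length")
    return {taxon: "".join(aln[taxon] for aln in alignments) for taxon in taxa}


def drop_gappy_columns(matrix, max_gap_fraction=0.5):
    """Remove columns where more than max_gap_fraction of sequences have a gap."""
    taxa = list(matrix)
    length = len(matrix[taxa[0]])
    keep = [
        i
        for i in range(length)
        if sum(matrix[t][i] == "-" for t in taxa) / len(taxa) <= max_gap_fraction
    ]
    return {t: "".join(matrix[t][i] for i in keep) for t in taxa}


# Toy alignments for two proteins and three taxa.
protein_1 = {"taxon_A": "MK-L", "taxon_B": "MKTL", "taxon_C": "M--L"}
protein_2 = {"taxon_A": "GG", "taxon_B": "G-", "taxon_C": "GG"}

supermatrix = concatenate([protein_1, protein_2])
trimmed = drop_gappy_columns(supermatrix)
```

In the toy data, the third column has gaps in two of three sequences and is removed, while columns with a single gap are retained.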

Identification of proteins and conserved indels that are specific for Cyanobacteria
Blastp searches were carried out on each ORF in the genomes of Synechococcus sp. WH8102, Synechocystis sp. PCC6803, Nostoc sp. PCC7120, Synechococcus sp. JA-3-3Ab, Prochlorococcus sp. MIT9215 and Prochlorococcus marinus subsp. marinus str. CCMP1375 to identify proteins that are uniquely present in the various clades of cyanobacteria seen in the phylogenetic trees (Fig. 1). The Blast searches were performed against all organisms (i.e. the non-redundant (nr) database) using the default parameters, without the low complexity filter [82]. The proteins of interest were those where either all significant hits were from the indicated groups of cyanobacteria, or where there was a large increase in E values from the last hit belonging to a particular clade to the first hit from any other bacteria/cyanobacteria and the E values for the latter hits were >1e-04, indicating weak similarity that could occur by chance. Higher E values are often significant for smaller proteins, as the magnitude of the E value depends upon the length of the query sequence [82]. Hence, the lengths of the query proteins and those of the various hits were also taken into consideration when analyzing the results of these studies. In most cases, the lengths of the various significant hits were very similar to those of the query proteins. Some proteins, which in addition to cyanobacteria were also found in plants/plastids, or in an isolated species from some other group (noted appropriately), were also retained. Proteins that were uniquely found in a given species or strain were not examined in this work. For all cyanobacterial proteins that are specific for various clades or subgroups, their accession numbers, any information regarding cellular functions, and protein lengths were tabulated and are presented. Identification of new conserved indels that are specific for cyanobacterial clades was carried out as described in our earlier work [22,23].
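The selection criteria described above can be expressed as a minimal sketch. The `Hit` record, the function names and the toy hits are hypothetical; the 1e-04 cutoff comes from the text, whereas the size of a "large increase" in E values is not quantified there, so the second function only reports the jump and leaves its interpretation to the user.

```python
import math
from typing import NamedTuple


class Hit(NamedTuple):
    organism: str
    evalue: float


def is_clade_specific(hits, clade, cutoff=1e-4):
    """True if every hit at or below the cutoff comes from a clade member.

    Requires at least one significant in-clade hit; hits from outside the
    clade are tolerated only when their E value exceeds the cutoff.
    """
    inside = [h for h in hits if h.organism in clade and h.evalue <= cutoff]
    outside = [h for h in hits if h.organism not in clade]
    return bool(inside) and all(h.evalue > cutoff for h in outside)


def evalue_jump(hits, clade, floor=1e-180):
    """Orders of magnitude between the last in-clade hit and the first
    out-of-clade hit when hits are ranked by E value.

    Returns None if no out-of-clade hit follows an in-clade hit. E values
    reported as 0.0 are clamped to `floor` before taking logarithms.
    """
    last_inside = None
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.organism in clade:
            last_inside = hit.evalue
        elif last_inside is None:
            return None
        else:
            return math.log10(max(hit.evalue, floor)) - math.log10(
                max(last_inside, floor)
            )
    return None


# Toy hit list with made-up organism labels.
toy_clade = {"strain_1", "strain_2"}
toy_hits = [Hit("strain_1", 1e-80), Hit("strain_2", 1e-60), Hit("other", 0.5)]

specific = is_clade_specific(toy_hits, toy_clade)
jump = evalue_jump(toy_hits, toy_clade)
```

As noted above, the lengths of the query and hit proteins would still need to be inspected separately, since E values depend on query length.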