Escherichia coli is one of the most important model organisms in both biology and medicine. Many major findings have emerged from the study of E. coli, including bacterial conjugation, recombination and genetic regulation. More importantly, E. coli plays important roles in the intestinal tract of humans and other vertebrates, especially in the lower section. There are more than a billion E. coli cells in the intestines of a healthy human. Unfortunately, several E. coli strains can cause intestinal and extraintestinal diseases, such as diarrhea, urinary tract infection, septicemia, pneumonia and meningitis, in humans and animals. The availability of an increasing number of complete E. coli genomes has revealed that E. coli exhibits high diversity at the whole-genome level. Comparative genomic analyses have demonstrated that the diversity among natural isolates of E. coli is extraordinarily high, and the average genome-wide conservation across different strains is less than 50%. Therefore, E. coli is an ideal candidate for studying how the relationship between a bacterium and its host can fluctuate between commensalism and pathogenicity.
In general, at the whole-genome level, two main categories of methods are used to assess phylogenetic relationships among prokaryotes (i.e., phylogenomic analysis). The first category is based on the concept of orthology, in which sequence alignment is the core computational method. Many approaches, such as gene content, gene order, multilocus sequence typing (MLST) and super-tree or super-matrix methods, belong to this category [1, 4–6]. The second category is based on the frequencies of K-mer oligonucleotides and does not employ an alignment [6, 7]; this type of method emphasizes the importance of genome content and organization. Intuitively, for phylogenomic analysis, we are seeking one or a set of genomic features that can be used as indicators/markers to robustly and correctly reveal the evolutionary relationships among a group of organisms of interest. In addition, we are also interested in features that are functional units, which could act as a bridge between genomic diversity and phenotypic differences. Within bacterial systems, the concept of an operon satisfies these two criteria. Operons are groups of genes that exhibit physical clustering within the genome and are typically transcribed in a single mRNA. Genes within the same operon usually have related functions, and some of these genes may be employed in the same pathway. Regulatory genes are also commonly located in close proximity to the genes that are being regulated. Although certain operons may comprise genes with no clear functional relationship, these genes may be required under the same environmental conditions even though they are involved in different pathways. Unfortunately, many, if not all, operons predicted in databases to date consist only of structural genes and lack the regulatory elements that control their expression. It is well known that the correct expression of genes must remain faithful to the specific genetic background.
In addition, certain relatively large clusters of genes that have related functions, but do not belong to the same operon, have been described. Therefore, it is currently assumed that predicted operons may be difficult to use in practice as indicators/markers for phylogenomic studies. With the availability of an increasing number of closely related or intraspecific prokaryotic genomes, as well as the advent of whole-genome alignment algorithms [11, 12], there is an opportunity to implement phylogenomic analyses of the evolution and ecological adaptation of these organisms on the whole-genome scale. To this end, we chose one type of genomic feature, called locally collinear blocks (LCBs), to study the evolutionary relationships and potential ecological adaptations of E. coli on the whole-genome scale. In principle, LCBs from closely related organisms or within one species should contain useful phylogenomic signals regarding their evolutionary histories. Each LCB, also known as a collinear region, is a region of DNA sequence that is shared by two or more genomes that are being studied. Clearly, if an LCB is sufficiently large, it is likely to contain one or more consecutive genes with related functions in addition to their regulatory regions. Therefore, LCBs that are present or absent in either genome may satisfy both of the aforementioned criteria for feasible genomic markers; if these criteria are met, the analysis of LCBs should reveal a comprehensive history of the evolutionary and ecological adaptation of E. coli genomes.
To test our hypothesis, we studied the vertical and phenetic relationships of 34 strains of E. coli at the level of LCBs. First, we identified potential LCBs using the Mugsy program. Next, we divided the LCBs into two groups according to their occurrence among the strains: core and variable LCBs. The core LCBs are the set of collinear regions shared by all of the studied strains, whereas the variable LCBs are the set of collinear regions that were absent in at least one of the studied strains. Then, we constructed two phylogenies based on the LCBs from each of these two groups. The phylogeny based on core LCBs tends to reflect the vertical evolutionary history of the strains (i.e., the evolutionary phylogeny). In contrast, the second phylogeny, based on the variable LCBs, is likely to reveal the whole-genome similarities of extant strains (i.e., the similarity phylogeny). In the evolutionary phylogeny, the strains clustered into groups corresponding to the known phylogroups. Within each phylogroup, strains were grouped according to their respective pathotypes. These patterns indicate that it is feasible to use LCBs as indicators/markers to infer intraspecific phylogenies. We also found that the B2 phylogroup occurred at the base of the evolutionary phylogeny, thereby suggesting that the ancestor of E. coli / Shigella was an opportunistic pathogen. Such a pathogen may be harmless under certain environmental conditions and pathogenic in other settings. A comparison of the evolutionary and similarity phylogenies shows that Shigella may have at least three origins. We scrutinized the common properties of Shigella that were missing in other E. coli genomes and found that the common LCBs from their genomes were mainly influenced by mobile genetic elements. This finding implies that Shigella may have experienced a convergent evolution event via horizontal gene transfer (HGT) and acquired similar phenotypes during the course of evolution.
Interestingly, by inspecting specific branches of the similarity phylogeny and correlating the branch support of LCBs with key branches in the evolutionary phylogeny, we identified putative LCBs that may be relevant to the pathogenicity of certain strains. Moreover, analysis of the annotated genes within these regions revealed additional evidence associated with pathogenicity, which may provide clues for further experimental evaluation. We believe that such phylogenomic studies, which examine collinear regions of whole genomes, will help to better understand the evolution and adaptation of microbes, and of E. coli in particular.
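The division of LCBs into core and variable sets described above can be illustrated with a minimal Python sketch. It assumes that each LCB has already been reduced to the set of strains in which it occurs; the function names (partition_lcbs, presence_absence), the toy strain labels and the binary presence/absence encoding of the variable LCBs are hypothetical choices made for illustration and are not taken from Mugsy or from the analysis itself.

```python
from typing import Dict, List, Set, Tuple


def partition_lcbs(
    lcb_strains: Dict[str, Set[str]], all_strains: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split LCB ids into core (present in every strain) and
    variable (absent from at least one strain)."""
    core: List[str] = []
    variable: List[str] = []
    for lcb_id in sorted(lcb_strains):
        if all_strains <= lcb_strains[lcb_id]:
            core.append(lcb_id)
        else:
            variable.append(lcb_id)
    return core, variable


def presence_absence(
    lcb_strains: Dict[str, Set[str]], lcb_ids: List[str], strains: List[str]
) -> Dict[str, List[int]]:
    """Encode each strain as a 0/1 vector over the given LCBs.
    (Assumed encoding; one possible input for a similarity-based tree.)"""
    return {
        strain: [1 if strain in lcb_strains[lcb_id] else 0 for lcb_id in lcb_ids]
        for strain in strains
    }


# Toy input: three hypothetical strains and three hypothetical LCBs.
toy = {
    "lcb1": {"A", "B", "C"},  # shared by all strains
    "lcb2": {"A", "B"},       # missing from strain C
    "lcb3": {"C"},            # found only in strain C
}
core, variable = partition_lcbs(toy, {"A", "B", "C"})
matrix = presence_absence(toy, variable, ["A", "B", "C"])
```

In this sketch, the core list would feed an alignment-based phylogeny, while the presence/absence vectors are one plausible way to compare strains on their variable LCBs.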