Carbohydrates, in the form of mono-, di-, oligo-, and polysaccharides, as well as glycoconjugates, play fundamental roles in all forms of life. Beyond their role in energy storage, carbohydrates are central to diverse biological processes such as host-pathogen interactions, signal transduction, inflammation, intracellular trafficking, disease, and differentiation/development. Not least, as structural components of terrestrial biomass, carbohydrates comprise approximately 75% of the carbon fixed annually by primary production. Sugar-rich plant cell walls, seeds, and tubers thus represent a renewable material with significant potential to address energy and material needs.
A striking feature of carbohydrates is their remarkable structural complexity, due to a rich diversity of monosaccharide building blocks and the possibility of numerous stereo- and regiospecific linkages, which give rise to both simple linear and complex, highly branched molecules. A decade of investments in genomics and proteomics has greatly improved our interpretation of the molecular language of the cell, but deciphering the complex carbohydrate-based information in the biomolecular landscape is still in its infancy. Indeed, glycomics has been identified both as “the last frontier of molecular and cellular biology” and as an “emerging technology that will change the world”.
Functional analysis of glycans and glycoconjugates is complicated by the fact that they are not direct genetic products, but are instead synthesized, recognized, modified, and degraded by a plethora of carbohydrate-active enzymes (CAZymes) and binding proteins. In the synthetic direction, phosphosugar-dependent glycosyltransferases (GTs) catalyze the formation of glycosidic linkages, whereas their breakdown is mediated by glycoside hydrolases (GHs) and polysaccharide lyases (PLs), with the assistance of carbohydrate esterases (CEs). The structural diversity of carbohydrates is reflected in an abundance of CAZyme-encoding genes, which comprise 1-3% of the genome of most organisms. Expanding and harnessing knowledge of the complexity of the “CAZome” is thus essential to understanding the complexity of the glycome.
The protein sequence-based classification of CAZymes was initiated in 1991 as a complement to the long-standing Enzyme Commission (EC) number system, which is based solely on enzyme activities. Given the prevalence of convergent evolution of enzymes that cleave glycosidic bonds, as well as the demonstrable catalytic promiscuity of individual enzymes, sequence-based classification has proven to be a robust way to unify information on enzyme structure, specificity, and mechanism, which provides enormous predictive power. Initially motivated by a need to delineate cellulases (EC 3.2.1.4) into distinct structural families, the first incarnation of the GH family classification, as such, comprised 35 GH families. The number of families increased steadily with the growing interest in glycobiology so that, as of August 2012, 130 sequence-based families of GHs have been defined in the continuously updated CAZy database.
Presently, one of the largest GH families is GH5, historically known as “cellulase family A” as it was the first cellulase family described. GH5 exemplifies a family with a large variety of specificities: it currently contains close to 20 experimentally determined enzyme activities denoted with an EC number. The abundance of GH5 enzymes in different ecological niches has been highlighted by their frequent identification in metagenomes of diverse microbial communities [12–14], as well as in the genomes of individual organisms. As with other CAZyme families, GH5 members are commonly found to be encoded as parts of multi-modular polypeptide chains containing other catalytic, substrate-binding, and functionally unidentified or yet to be described modules.
Within the large GH5 family, a discernible diversity of sequences was observed soon after its creation. The first five subfamilies of GH5 (A1-A5) were identified as early as 1990. Subfamily A6 was introduced in 1997, and the following year eukaryotic and prokaryotic β-mannanases were assigned to A7 and A8, respectively. Subsequently, subfamily A9 was introduced in a study that notably also suggested the merger of A5 and A6. Finally, A10 is the most recently defined GH5 subfamily, while new subfamilies that presently lack a unique identifier have also been suggested [21, 22]. Family GH5 belongs to clan GH-A, which presently groups 19 GH families to form the largest set of evolutionarily related GH families described in CAZy thus far (a clan is a group of families that arise from a common but very distant ancestor; despite weak sequence similarity, clan members share a conserved protein fold and catalytic machinery).
Families such as GH5 were originally defined with a very small number of sequences. With the accumulation of an increasing body of sequence data, the relationship between the original families has sometimes changed enough to merit reexamination of family membership. Very recently, detailed three-dimensional structural analysis led to the reclassification of several GH5 sequences into family GH30 based on the organization of secondary structural elements around the conserved (β/α)8 fold of the catalytic module.
Given the continuing expansion in sequence numbers and the partial GH5/GH30 reclassification, it is clear that a global re-analysis of the subfamily division of GH5 is now needed. The rapid accumulation of genomic data in the past decade revealed a complex and varied sequence space, with the consequence that a substantial portion of GH5 family members are currently not assigned to any subfamily. This situation will only become worse as the rate of (meta)genomic sequencing continues to increase. Further, this flood of data will cause an increasing reliance on computer-based annotation, which necessarily requires a robust framework to produce meaningful functional predictions. The division of CAZyme families into subfamilies based on phylogenetic analysis has been applied as a successful approach to meet this challenge: subfamily classification of GH13, GH30, and all of the PL families has demonstrated that the majority of the defined subfamilies are monospecific, thus indicating a significantly better correlation of substrate specificity between sequences at the subfamily level than at the family level [23–25]. Significantly, the division into subfamilies allows the identification of currently uncharacterized subfamilies that can subsequently be analyzed biochemically and structurally to potentially unveil new activities.
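The notion of monospecificity used above can be made concrete with a small sketch. The snippet below is illustrative only and is not the analysis performed in this work: the function name `monospecific_fraction`, the group labels `sub_a` to `sub_c`, and the assignment of activities to those groups are hypothetical toy data. The EC numbers themselves are real (3.2.1.4, cellulase; 3.2.1.78, β-mannanase; 3.2.1.8, endo-β-1,4-xylanase). It counts the share of groups whose characterized members all carry a single EC number, first at a subfamily-like level and then with all sequences pooled into one family.

```python
from collections import defaultdict


def monospecific_fraction(annotations):
    """Return the fraction of groups with exactly one distinct EC number.

    `annotations` is an iterable of (group_label, ec_number) pairs, one pair
    per biochemically characterized enzyme. Returns 0.0 for empty input.
    """
    activities = defaultdict(set)
    for group, ec in annotations:
        activities[group].add(ec)
    if not activities:
        return 0.0
    mono = sum(1 for ecs in activities.values() if len(ecs) == 1)
    return mono / len(activities)


# Hypothetical toy data: labels and group memberships are invented for
# illustration and do not correspond to actual GH5 subfamilies.
toy = [
    ("sub_a", "3.2.1.4"), ("sub_a", "3.2.1.4"),   # one activity only
    ("sub_b", "3.2.1.78"),                        # one activity only
    ("sub_c", "3.2.1.4"), ("sub_c", "3.2.1.8"),   # two activities
]

# Pool every sequence under a single label to mimic the family level.
family_level = [("family", ec) for _, ec in toy]

print(monospecific_fraction(toy))           # two of three groups
print(monospecific_fraction(family_level))  # the pooled family is polyspecific
```

In this toy case, two of the three groups are monospecific whereas the pooled family is not, which is the kind of contrast between subfamily-level and family-level specificity that the cited subfamily studies report.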
Hence, we present here an improved, robust subfamily classification for GH5 by employing a large-scale analysis of all publicly available sequences. Our intention is that the introduction of this additional hierarchical level across this important GH family will serve to guide enzyme discovery, structure-function analysis, and biocatalyst improvement in post-genomic efforts. Not least, many enzyme activities relevant to biomass analysis and conversion are found in GH5 (e.g., cellulases, mannanases, xylanases, galactanases, and xyloglucanases), as are enzymes with biomedical applications. Significantly, the present analysis unveiled a large number of sparsely or incompletely characterized subfamilies that may still hide a number of unsuspected activities and singular structural features.