Computer vision applied to herbarium specimens of German trees: testing the future utility of the millions of herbarium specimen images for automated identification
© The Author(s). 2016
Received: 7 July 2016
Accepted: 10 November 2016
Published: 16 November 2016
Global Plants, a collaborative between JSTOR and some 300 herbaria, now contains about 2.48 million high-resolution images of plant specimens, a number that continues to grow, and collections that are digitizing their specimens at high resolution are allocating considerable recourses to the maintenance of computer hardware (e.g., servers) and to acquiring digital storage space. We here apply machine learning, specifically the training of a Support-Vector-Machine, to classify specimen images into categories, ideally at the species level, using the 26 most common tree species in Germany as a test case.
We designed an analysis pipeline and classification system consisting of segmentation, normalization, feature extraction, and classification steps and evaluated the system in two test sets, one with 26 species, the other with 17, in each case using 10 images per species of plants collected between 1820 and 1995, which simulates the empirical situation that most named species are represented in herbaria and databases, such as JSTOR, by few specimens. We achieved 73.21% accuracy of species assignments in the larger test set, and 84.88% in the smaller test set.
The results of this first application of a computer vision algorithm trained on images of herbarium specimens shows that despite the problem of overlapping leaves, leaf-architectural features can be used to categorize specimens to species with good accuracy. Computer vision is poised to play a significant role in future rapid identification at least for frequently collected genera or species in the European flora.
KeywordsAutomated identification Computer vision Herbarium specimens JSTOR Leaf shape Leaf venation
Global Plants, a collaborative between JSTOR and some 300 herbaria, is now the world’s largest database of digitized plant specimens (http://about.jstor.org/content/global-plants, accessed 20 April 2016). It currently contains images of 2,482,901 million herbarium sheets (Hannah Begley, Digital Librarian for Primary Sources, JSTOR, 4 May 2016). Each of the images is accompanied by metadata including the Latin (formal) species name and data transcribed from the label, such as name of collector (where known), collection date and location (where known), and the acronym of the herbarium that owns the specimen. Many of the images in JSTOR plants are of so-called type specimens, which have received priority in digitization projects because they are essential to biological nomenclature. As set out in the Code of Nomenclature for algae, fungi, and plants , type specimens fix the application of names, and by definition, a type specimen is, and always remains, correctly identified, no matter the changing taxonomic views and insights from new data. The plants shown in these images are not always perfectly preserved; for example, their leaves often overlap each other or are damaged.
Herbaria that are digitizing their type specimens are allocating considerable recourses to the maintenance of computer hardware and rental of storage space for growing numbers of images. It is therefore important to consider, and test, the future utility of the millions of high-resolution type specimens. An obvious desideratum in this context is the application of machine learning to quickly classify images into thousands of categories of interest to different user, such as “Rosaceae”, “Crataegus”, or “Fagus sylvatica.” So far, computer vision has not been applied to JSTOR plant specimen images. Instead, use of digitized plant images still relies exclusively on human pattern recognition and on the (excellent) memories of taxonomists who know under which Latin name to search for an image. Between June 2015 and April 2016, 282,403 unique visitors viewed 427,636 (17%) of the 2.48 million JSTOR plant images. During a slightly larger period of 1.5 years (Dec. 2014-April 2016), there were 419,822 unique visitors (Hannah Begley, Digital Librarian for Primary Sources, JSTOR, 4 May 2016). At this time, of course, only people who know the Latin name of a plant (or its synonymous names) can find images of specific species in JSTOR.
Machine learning applied to images of museum specimens offers the opportunity to identify (i.e., provide with a Latin name) specimens speedily, which would facilitate subsequent fine-scale analysis by taxonomic experts, for example, whether a specimens is an outlier in its traits or geographic range and may represent a new species. (Machines can never decide whether a specimen represents a new species because under the current Codes of Nomenclature species ranking is a matter of opinion). The application of machine learning and computer vision to museum specimens differs from applications of computer vision to living specimens where the primary goal is not find a Latin name for an unnamed specimen, but instead to cluster specimens for other purposes. So far, computer vision approaches in biology have been applied to cluster (images of) wings of Drosophila species , wings of common British moths , bee wings , and color spots of cichlid fish . In plants, computer vision has been applied to images of 1907 fresh leaves belonging to 32 different plant species , images of fresh leaves of a few tree species , and images of fresh leaves of three legume species . One study has applied machine learning to images of dead leaves, using (digitized images of) 7597 leaf clearings from 2001 genera of flowering plants to categorize leaf vein patterns . Leaf clearings are leaves that have been chemically treated and preserved to show the veins. The images used in all these studies show non-overlapping leaves. Specimens in herbarium image databases, by contrast, have many overlapping leaves, presenting a challenge for the application of computer vision.
Here, we test whether a standard computer vision algorithm can be trained on images of typical herbarium specimens to learn to identify the 26 tree species (Additional file 1: Table S1) most commonly encountered in Germany (http://www.baumkunde.de/haeufigste-baeume-deutschland.php) and surrounding countries of Central Europe. Algorithms that examine morphometric characters usually require large training databases of images per category (here species). For example, Wilf et al.  used at least 100 images per category (fossil leaf morphotype = fossil leaf species). To generate a training database, we used only 10 images per species to simulate the empirical fact that most species are represented in image databases by just a few images. The photographed specimens were collected between 1820 and 1995 and were typical of herbarium material in often having overlapping leaves. Our approach to deal with the problem of overlapping leaves consisted of first segmenting single, non-overlapping leaves for each species in a preprocessing step. A normalization routine was then used to counteract distortions and ensure comparability of the images. Next, features of three categories were extracted from the normalized images and fed into a Support Vector Machine to achieve the final classification. No prior study has adopted this combination of tools for leaf identification, using herbarium material.
Imaging setup for herbarium specimens
To obtain the training images, we used the Munich herbarium’s HerbScan unit, which consist of a flatbed scanner (Epson Expression model 10000XL), modified for inverted use. Specimens were photographed at a resolution of 5144 × 3599 pixels and 300 dpi. For each species, specimens were selected to cover a range of typical herbarium material, including broken over folded leaves, leaves damaged by herbivores, and overlapping leaves. The most common German trees include several species that have similar leaves, for example, Acer plantanoides, Acer pseudoplantanus, Populus tremula, Populus nigra, Quercus cerris, Quercus robus, Quercus petraea, Ulmus glabra, and Ulmus minor. A selection of our 260+ images is available as online supporting material. We initially tried using images from JSTOR but finding 10 per species proved extremely time-consuming and not possible for all 26 species.
To extract leaf characteristics from herbarium specimens, the first step is to locate and segment single leafs. The automatic segmentation routines had to distinguish leaves, fruits, flowers and stems, and to cope with overlapping and damaged leaves. We achieved this using the lazy snapping routine , which requires that the user exemplarily mark a few points on the leaf and on the background. Lazy Snapping is based on graph cuts and provides visual feedback to the user so that segmentation results can be corrected if necessary. Although fully automatic segmentation routines have been proposed , the semi-automatic approach allowed for efficient and flexible processing and was able to deal with overlapping and damaged leaves.
Three sets of descriptive features (FS1, FS2, FS3) were defined and served as input to a Support Vector Machine characterizing the leaf shape and leaf veins. Fourier descriptors can easily be modified to be invariant under translation, rotation, and scaling. The first feature set, FS1, consisted of Fourier descriptors characterizing the outline of the binary leaf segmentation. The second set, FS2, consisted of the descriptive leaf shape parameters compactness, convexity, solidity, rectangularity, circularity, perimeter-area ratio, slimness, position of maximum thickness, and dispersion [13, 14]. The third set, FS3, quantified the structure of the vein network. Because a pixel-wise identification of the vein network could not be performed in a robust manner, we focused on features representing the orientation of the vein structure by using weighted orientation histograms.
A Support Vector Machine (SVM) was trained in order to assign unknown feature vectors to one of the species classes. The SVM was configured with a linear kernel providing the best performances. Due to the low number of mages per species, a leave-one-out validation strategy was pursued. Multiclass classification was realized through the one-against-one strategy .
Two test sets were used for validation. Test set I consisted of all 26 species (Additional file 1: Table S1), a few of them in the same genus. In contrast, test set II included only one species per genus, leaving 17 species.
Taxonomists are increasingly relying on digitized images, either to achieve specimen identification via visual matching of features or to extract morphological features that can be coded and used for phylogenetic, morphometric, or other purposes [9, 11, 16, 17]. Because of the great number of databased images now available and the comparatively few taxonomic experts, there is a great need for computer vision to be applied to specimen images of which millions are being made available online at substantial costs (; cf. our Introduction with data on JSTOR images and usage numbers). Deep learning approaches  for computer vision in principle could allow automated plant specimen identification --meaning the suggestion of a Latin name for the respective image-- as long as the software could be trained on suitable subsets of the millions of Latin-named plant images already available online. Extracting such subset is not an easy task, however, and the first insight from this project was that we had to make new images (using the HerbScan setup applied in many herbaria for specimen digitization) to obtain 10 images for each of the tree species. It is an empirical fact that most of the estimated 340,000 species of higher plants are known from few collections and are so far represented by few images in public databases. Therefore the use of 10 images for training purposed sets a realistic bar.
Our success rate of 73 to 85% with the two test sets is comparable to that in the few other projects that have applied computer vision to name plants by clustering of similar leaf types, although not necessarily on finding formal [Latin] species names. For example, the Leafsnap application developed by Kumar et al. , which identifies common North American tree species in the Washington (District of Colombia) area, had success rates of 96.8%, but requires snapshots of fresh, non-overlapping leaves. A classification of different grapevine varieties (from photos of perfectly spread out fresh leaves) had success rates of up to 80% . Scans of herbarium specimens of four fern species, using 18 scans per species, gave classification accuracies of at least 96% . We found not data on identification success rates for image sharing and retrieval applications, such as Pl@ntNet, where users can upload photos of plants and identify them for others ( for a critique). For comparison to all these photo-based tools, a study in which 67 common British trees (which includes the common ‘German’ species) were identified by barcoding had species discrimination success rates of 65 to 86% .
This study represents the first application of computer vision to images of old herbarium specimens, similar to the 2.48 million specimen images in JSTOR. There is a need to efficiently use images of herbarium material (all with scientific [Latin] species names) that are stored in public databases by making them more useful for non-specialists who need names for their plants. The results demonstrate that computer vision can be used to classify specimens even when they have many overlapping leaves and even when few training images are available. Rapid identification even just to the genus level could help non-botanists who need a quick list of the tree species in a particular street or park.
We thank the editor, Nina Rønsted, and two anonymous reviewers for constructive comments and additional literature.
This work was supported by a German Research Foundation (DFG) grant to D. Merhof (ME3737/3-1).
Availability of data and materials
The code and images used in this study will be permanently available here http://www.lfb.rwth-aachen.de/bibtexupload/pdf/UNG17a.zip, Institute of Imaging & Computer Vision, RWTH Aachen University, Germany.
JU carried out the research; DM funded and supervised the research; SSR had the initial idea, provided conceptual input, and wrote the paper, which was modified and approved by all coauthors.
SSR wrote the manuscript, with technical contributions from JU and DM. JU developed the computational methods and implemented the software framework, supervised by DM. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent to publish
Ethics approval and consent to participate
This study included no data or analyses requiring special consent to conduct or to publish.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- McNeill J, et al., editors. International Code of Nomenclature for algae, fungi, and plants (Melbourne Code), adopted by the Eighteenth International Botanical Congress Melbourne, Australia, July 2011. Bratislava: International Association for Plant Taxonomy; 2012.Google Scholar
- Houle D, Mezey J, Galpern P, Carter A. Automated measurement of Drosophila wings. BMC Evol Biol. 2003;3:25.View ArticlePubMedPubMed CentralGoogle Scholar
- Watson AT, O’Neill MA, Kitching IJ. Automated identification of live moths (Macrolepidoptera) using DIgital Automated Identification SYstem (DAISY). Syst Biodivers. 2004;1:287–300.View ArticleGoogle Scholar
- Steinhage V, Schröder S, Roth V, Cremers AB, Drescher W, Wittmann D. The science of “fingerprinting” bees. Ger Res. 2006;28:19–21.View ArticleGoogle Scholar
- Joo D, Kwan Y, Song J, Pinho C, Hey J, Won Y-J. Identification of cichlid fishes from lake Malawi using computer vision. PLoS One. 2013;8:e77686.View ArticlePubMedPubMed CentralGoogle Scholar
- Laga H, Kurtek S, Srivastava A, Golzarian M, Miklavcic SJ. A Riemannian elastic metric for shape-based plant leaf classification. Fremantle: International Conference on Digital Image Computing Techniques and Applications; 2012. p. 1–7.Google Scholar
- Novotny P, Suk T. Leaf recognition of woody species in Central Europe. Biosystems Engineering 115:444–452.Google Scholar
- Larese MG, Namías R, Craviotto RM, Arango MR, Gallo C, Granitto PM. Automatic classification of legumes using leaf vein image features. Pattern Recognit. 2014;47:158–68.View ArticleGoogle Scholar
- Wilf P, Zhang S, Chikkerur S, Little SA, Wing SL, Serre T. Computer vision cracks the leaf code. Proc Natl Acad Sci U S A. 2016;113:3305–10.View ArticlePubMedPubMed CentralGoogle Scholar
- Li Y, Sun J, Tang CK, Shum HY. Lazy snapping. ACM Trans Graphics. 2004;23:303–8.View ArticleGoogle Scholar
- Corney D, Clark JY, Tang HL, Wilkin P. Automatic extraction of leaf characters from herbarium specimens. Taxon. 2012;61:231–44.Google Scholar
- Zwiggelaar R, Parr TC, Taylor CJ. Finding orientated line patterns in digital mammographic images. Edinburgh: Proceedings of the 7th British Machine Vision Conference; 1996. p. 715–24.Google Scholar
- Michels DL, Giesselbach SA, Werner T, Steinhage V. On feature extraction for fingerprinting grapevine leaves. Proc. Intern. Conf. on Image Processing, Computer Vision, and Pattern Recognition. Las Vegas: IPCV; 2013. p. 407–12.Google Scholar
- Yahiaoui I, Mzoughi O, Boujema N. Leaf shape descriptor for tree species identification. Multimedia Expo (ICME). 2012;2012:254–9.Google Scholar
- Hsu C, Lin C. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 2002;13:415–25.Google Scholar
- Belhumeur P, Chen D, Feiner S, Jacobs D, et al. Searching the world’s herbaria: A system for visual identification of plant species. Computer Vision. Berlin: ECCV; 2008. p. 116–29.Google Scholar
- Cope JS, Corney D, Clark JY, Remagnino P, Wilkin P. Plant species identification using digital morphometrics: A review. Expert Syst Appl. 2012;39:7562–73.View ArticleGoogle Scholar
- LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.View ArticlePubMedGoogle Scholar
- Kumar N, Belhumeur PN, Biswas A, Jacobs DW, Kress WJ, Lopez IC, Soares JVB. Leafsnap: A computer vision system for automatic plant species identification. Computer Vision. Berlin: ECCV; 2012. p. 506–16.Google Scholar
- Grimm J, Hoffmann M, Stöver BC, Müller KF, Steinhage V. Image-based identification of plant species using a model-free approach and active learning. Lect Notes Comput Sci. 2016;9904:169–76.View ArticleGoogle Scholar
- Alexis J, Bonnet P, Goëau H, Barbe J, Selmi S, Champ J, Dufour-Kowalski S, Affouard A, Carré J, Molino J-F, Boujemaa N, Barthélémy D. A look inside the Pl@ntNet experience. Good Bias Hope Multimedia Systems. 2016;22:751–66. doi:10.1007/s00530-015-0462-9.Google Scholar
- Tosh J, James K, Rumsey F, Crookshank A, Dyer R, Hopkins D. Is DNA barcoding child’s play? Science education and the utility of DNA barcoding for the discrimination of UK tree species. Bot J Linn Soc. 2016;181:711–22.View ArticleGoogle Scholar