Exploring historical trends using taxonomic name metadata
© Sarkar et al; licensee BioMed Central Ltd. 2008
- Received: 30 August 2007
- Accepted: 13 May 2008
- Published: 13 May 2008
Authority and year information have been attached to taxonomic names since Linnaean times. The systematic structure of taxonomic nomenclature facilitates the ability to develop tools that can be used to explore historical trends that may be associated with taxonomy.
Of the more than 10.7 million taxonomic names that are part of the uBio system, approximately 3 million were identified as having taxonomic authority information from the years 1750 to 2004. A pipe-delimited file was then generated, organized according to a Linnaean hierarchy and by years from 1750 to 2004, and imported into an Excel workbook. A series of macros was developed to create an Excel-based tool and a complementary Web site to explore the taxonomic data. A cursory and speculative analysis of the data reveals observable trends that may be attributable to significant events of both taxonomic (e.g., publishing of key monographs) and societal importance (e.g., world wars). The findings also help quantify the number of taxonomic descriptions that may be made available through digitization initiatives.
Temporal organization of taxonomic data can be used to identify interesting biological epochs relative to historically significant events and ongoing efforts. We have developed an Excel workbook and complementary Web site that enables one to explore taxonomic trends for Linnaean taxonomic groupings, from Kingdoms to Families.
- Literature Digitization
- Taxonomic Grouping
- Taxonomic Description
- Natural Language Processing Technique
- Natural Language Processing Tool
Taxonomic names represent one of the fundamental tokens that bridge biological knowledge, regardless of its form, across multiple resources [1, 2]. The standardized format of Genus species associated with a taxonomic name is universally accepted and identifiable across multiple levels of expertise. Furthermore, a well-formed taxonomic name includes a proper name and a four-digit number, which represent the taxonomic authorship and date of description. Efforts such as the Catalogue of Life Program (COLP), an international joint effort to create a single checklist for all known species, are a significant step towards developing common resources to keep track of and integrate organism-based biological information across disparate resources.
The Universal Biological Indexer and Organizer (uBio) is a complementary taxonomic information system designed to provide the core services that address name-based impediments to information retrieval [5, 6]. In addition to collecting taxonomic and vernacular names from centralized resources such as the aforementioned COLP, scientific names and authority information are also gathered from a range of sources that provide downloads or data transfers. Through the use of natural language processing techniques, organism name, authority, and date information are also extracted from primary literature sources (such as Nomenclator Zoologicus). The uBio database of organism names currently exceeds 10.7 million records. Of these records, approximately 3 million are associated with authority and year information.
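To make the "name, authority, year" structure concrete, the following is a minimal Ruby sketch of how a trailing authority and four-digit year might be pulled from a name string. It is not the natural language processing approach used by uBio; the pattern AUTHORITY_YEAR and the function parse_name are hypothetical names introduced only for illustration, and the pattern assumes the simple form "Genus species Authority, Year", optionally with the authority in parentheses.

```ruby
# Hypothetical pattern (an assumption, not uBio's actual method):
# a name, then an authority that starts with an uppercase letter,
# then a four-digit year, with optional parentheses around
# the authority and year.
AUTHORITY_YEAR =
  /\A(?<name>.+?)\s+\(?(?<authority>\p{Lu}[^,()\d]*?),?\s+(?<year>\d{4})\)?\z/

# Returns a hash with :name, :authority and :year, or nil when the
# string does not carry an authority and a year in the 1750-2004 range.
def parse_name(name_string)
  match = AUTHORITY_YEAR.match(name_string.strip)
  return nil unless match

  year = match[:year].to_i
  return nil unless (1750..2004).cover?(year)

  { name: match[:name], authority: match[:authority].strip, year: year }
end

p parse_name("Homo sapiens Linnaeus, 1758")
p parse_name("Panthera leo (Linnaeus, 1758)")
p parse_name("Tobacco mosaic virus") # no author/year suffix, so nil
```

A name that lacks the author and year suffix, as is typical for virus names, yields nil under this sketch and would be excluded from any year-based tabulation.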
A series of Ruby scripts was developed to identify and process the requisite data into a form that could be imported into Microsoft Excel. First, taxonomic names that contain authority and year information were identified from the uBio NameBank database. Next, the identified names were organized into a unified taxonomic hierarchy that consisted of the Kingdom, Phylum, Class, Order, and Family information inferred from NCBI Taxonomy, ITIS Taxonomy, and Catalogue of Life Taxonomy. The number of taxonomic names associated with each year from 1750 to 2004 was tabulated and output into a pipe-delimited format [see Additional file 1] that could be imported into Microsoft Excel [see Additional file 2]. Searching and graphing features were enabled for the data imported into Excel using a series of Visual Basic macros. The data were also loaded into a MySQL database, which was subsequently incorporated into a Ruby on Rails Web application.
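The tabulation step described above can be sketched in Ruby as follows. This is an illustrative sketch under stated assumptions, not the authors' scripts: the record layout (a hash with :kingdom, :phylum, :class, :order, :family and :year keys), the function names tabulate and to_pipe_delimited, and the column order of the output are all hypothetical, and the actual layout of Additional file 1 may differ.

```ruby
# Hypothetical layout: hierarchy ranks named in the text, then one
# column per year from 1750 to 2004.
RANKS = %i[kingdom phylum class order family].freeze
YEARS = (1750..2004).freeze

# Count names per lineage and year. Records whose year falls outside
# the 1750-2004 window are skipped.
def tabulate(records)
  counts = Hash.new { |hash, lineage| hash[lineage] = Hash.new(0) }
  records.each do |record|
    next unless YEARS.cover?(record[:year])

    lineage = RANKS.map { |rank| record[rank].to_s }
    counts[lineage][record[:year]] += 1
  end
  counts
end

# Render the counts as pipe-delimited text: a header row, then one row
# per lineage, with a zero for every year that has no names.
def to_pipe_delimited(counts)
  header = (RANKS.map(&:to_s) + YEARS.map(&:to_s)).join("|")
  rows = counts.sort_by { |lineage, _| lineage }.map do |lineage, by_year|
    (lineage + YEARS.map { |year| by_year[year].to_s }).join("|")
  end
  ([header] + rows).join("\n")
end
```

Writing one row per lineage with a fixed set of year columns keeps the file directly importable into a spreadsheet, at the cost of many zero-valued cells for sparsely described groups.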
Two major declines in taxonomic descriptions are observed during the periods of 1912–1919 and 1939–1945. Both of these declines are coincident with World Wars I and II. In these periods, the number of taxonomic descriptions falls significantly. Interestingly, while there are large declines in Animal and Plant descriptions, the number of taxonomic descriptions for Bacteria increases during the same epochs. A speculative hypothesis might be that bacteriology experienced a surge during the early 1900's that carried on through subsequent years. This might be reflective of the emergence of methods to study microbes. However, it is important to note that claims of the World Wars directly affecting the rate of taxonomic descriptions would require deeper discussion and historical inquiry into biological/taxonomic activity, which is outside the scope of the present discussion.
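One simple way to quantify such a decline, offered here only as a sketch and not as the analysis performed for this paper, is to compare the mean number of descriptions per year within a period to the mean over the equally long window immediately before it. The function names mean_per_year and relative_change are hypothetical, and the input is assumed to be a hash that maps each year to its number of descriptions.

```ruby
# Mean number of descriptions per year over an inclusive range of years.
# Years missing from the hash are treated as zero.
def mean_per_year(counts_by_year, years)
  years.sum { |year| counts_by_year.fetch(year, 0) } / years.size.to_f
end

# Ratio of the mean in `period` to the mean in the equally long window
# immediately before it. Values below 1.0 indicate a decline. Returns
# nil when the preceding window has no descriptions at all.
def relative_change(counts_by_year, period)
  baseline = (period.first - period.size)..(period.first - 1)
  baseline_mean = mean_per_year(counts_by_year, baseline)
  return nil if baseline_mean.zero?

  mean_per_year(counts_by_year, period) / baseline_mean
end
```

With per-year counts loaded from the tabulated data, a call such as relative_change(counts, 1939..1945) would compare that period against 1932 to 1938. The choice of an equally long preceding window as the baseline is an assumption; other baselines (for example, a long-run average) would give different ratios.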
Following the World Wars, the rate of taxonomic descriptions per year does not again approach the pre-World War I peak of 30,790 descriptions until 1994 (28,724 descriptions). Between 1945 and 1994, the number of all taxonomic descriptions per year remains around 18,000. Again, it is difficult to speculate on the actual cause of the last peak in 1994. Some of the cause may be rooted in the quality of the data in the uBio NameBank database, which is a reflection of the data sources from which the taxonomic names originate. For example, the analyzed data indicate that a significant portion of the 1994 peak is due to an increase in descriptions of the family Cerambycidae, a large group of beetles. This may very well be reflective of the sources that are used to populate uBio NameBank, which may be more complete with contemporary descriptions of particular taxonomic groups. Thus, we would like to emphasize that the last decade of the data analyzed (from 1994 to 2004) is likely incomplete, and thus potentially misleading, since there is a delay between taxonomic descriptions and their official vetting and appearance in centralized checklists. A caveat when exploring taxonomic trends and inferring conclusions from these data is that they depend only on the data currently organized within the uBio resource. Nonetheless, the data still likely reflect the general historical trends before the 1990's. Names within uBio are also dependent on the accuracy and completeness of both expert community checklists and centralized taxonomies. For example, when considering the taxonomic group for butterflies and moths (Lepidoptera), the spreadsheet is biased towards those names that were gathered from LepIndex, which is in turn based on a database of a paper-card catalogue that has had few additions since the 1980's (of the ~351,000 descriptions in uBio, ~290,000 are from LepIndex).
In addition to continuing to add new checklists and keeping them updated, we are developing natural language processing tools to identify new taxonomic name descriptions as they appear in the literature (e.g., uBioRSS). It is our hope that, by navigating the data with exploratory tools such as the Excel workbook presented here, additional checklists and other name sources can be identified and shared via resources such as uBio.
Variance in taxonomic nomenclature also accounts for underestimates of some taxa represented in these data, most notably viruses. Because virus nomenclature generally does not follow the Linnaean binomen, author, year format, it can be difficult to automatically infer the author and year for virus taxa. There are plans to identify year information using source data from the ICTVdb and incorporate it into a future version of the dataset.
The valuable insights that are possible from analyses such as the ones presented here are due in large part to the meticulous and systematic nomenclature that has been in place for scientific organism names since first suggested by Linnaeus in the 1700's. Here, we have leveraged the systematic annotation of dates associated with scientific names to develop a mechanism to explore taxonomic trends. The resulting Excel workbook and Web interface enable one to explore temporal taxonomic trends for major taxonomic groupings, from Kingdoms to Families.
Project name: TaxaToy
Project home page: http://sarkarlab.mbl.edu/taxatoy
Operating system(s): Excel workbook requires Microsoft Office (Microsoft Windows or Mac OS X); pipe-delimited file can be viewed on any platform; the Web application can be viewed in Web browsers in any operating system
Programming language: Visual Basic; Ruby on Rails
License: GNU GPL
Any restrictions to use by non-academics: None
The work presented here was funded in part by the MBLWHOI Library and the DAB Lindberg Research Fellowship from the Medical Library Association to INS. The authors also thank David Remsen and Patrick Leary for the development and encoding of uBio data in a form that made the analyses presented herein possible. Finally, the authors thank the reviewers of earlier versions of this manuscript for their insightful comments and suggestions.
- Polaszek A: A universal register for animal names. Nature. 2005, 437: 477. 10.1038/437477a.
- Sarkar IN: Biodiversity informatics: organizing and linking information across the spectrum of life. Brief Bioinform. 2007, 8: 347-357. 10.1093/bib/bbm037.
- Catalogue of Life. [http://www.species2000.org]
- The universal biological indexing and organization system. [http://www.ubio.org]
- Patterson DJ, Remsen DP, Marino WA, Norton C: Taxonomic indexing – extending the role of taxonomy. Syst Biol. 2006, 55: 367-373. 10.1080/10635150500541680.
- Remsen DP, Norton C, Patterson DJ: Taxonomic informatics tools for the electronic Nomenclator Zoologicus. Biol Bull. 2006, 210: 18-24. 10.2307/4134533.
- TaxaToy. [http://sarkarlab.mbl.edu/taxatoy]
- Lederberg J: Infectious history. Science. 2000, 288: 287-293. 10.1126/science.288.5464.287.
- The Global Lepidoptera Names Index (LepIndex). [http://www.nhm.ac.uk/entomology/lepindex]
- Leary PR, Remsen DP, Norton CN, Patterson DJ, Sarkar IN: uBioRSS: tracking taxonomic literature using RSS. Bioinformatics. 2007, 23: 1434-1436. 10.1093/bioinformatics/btm109.
- The universal virus database of the international committee on taxonomy of viruses. [http://www.ncbi.nlm.nih.gov/ICTVdb]
- Members of the biodiversity heritage library. [http://biodiversitylibrary.org/Members.aspx]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.