iPhylo: IRMNG

Roderic D. M. Page

Showing posts with label IRMNG. Show all posts

Wednesday, September 07, 2016

Guest post: Absorbing task or deranged quest: an attempt to track all genus names ever published

YtNkVT2U This guest post by Tony Rees describes his quest to track all genus names ever published (plus a subset of the species…).

A “holy grail” for biodiversity informatics is a suitably quality controlled, human- and machine-queryable list of all the world’s species, preferably arranged in a suitable taxonomic hierarchy such as kingdom-phylum-class-order-family-genus or other. To make it truly comprehensive we need fossils as well as extant taxa (dinosaurs as well as dinoflagellates) and to cover all groups from viruses to vertebrates (possibly prions as well, which are, well, virus-like). Linnaeus had some pretty good attempts in his day, and in the internet age the challenge has been taken up by a succession of projects such as the “NODC Taxonomic Code” (a precursor to ITIS, the Integrated Taxonomic Information System - currently 722,000 scientific names), the Species 2000 consortium, and the combined ITIS+SP2000 product “Catalogue of Life”, now in its 16th annual edition, with current holdings of 1,635,250 living and 5,719 extinct valid (“accepted”) species, plus an additional 1,460,644 synonyms (information from http://www.catalogueoflife.org/annual-checklist/2016/info/ac). This looks pretty good until one realises that as well as the estimated “target” of 1.9 million valid extant species there are probably a further 200,000-300,000 described fossils, all with maybe as many synonyms again, making a grand total of at least 5 million published species names to acquire into a central “quality assured” system, a task which will take some time yet.

Ten years ago, in 2006, the author participated in a regular meeting of the steering committee for OBIS, the Ocean Biogeographic Information System which, like GBIF, aggregates species distribution data (for marine species in this context) from multiple providers into a single central search point. OBIS was using the Catalogue of Life (CoL) as its “taxonomic backbone” (method for organising its data holdings) and, again like GBIF, had come up against the problem of what to do with names not recognised in the then-latest edition of CoL, which was at the time less than 50% complete (information on 884,552 species). A solution occurred to me that since genus names are maybe only 10% as numerous as species names, and every species name includes its containing genus as the first portion of its binomial name (thanks, Linnaeus!), an all-genera index might be a tractable task (within a reasonable time frame) where an all-species index was not, and still be useful for allocating incoming “not previously seen” species names to an appropriate position in the taxonomic hierarchy. OBIS, in particular, also wished to know if species (or more exactly, their parent genera) were marine (to be displayed) or nonmarine (hide), similar with extant versus fossil taxa. Sensing a challenge, I offered to produce such a list, in my mind estimating that it might require 3 months full-time, or the equivalent 6 months in part time effort to complete and supply back to OBIS for their use.

To cut a long story short… the project, which I christened the Interim Register of Marine and Nonmarine Genera or IRMNG (originally at CSIRO in Australia, now hosted on its own domain “www.irmng.org” and located at VLIZ in Belgium) has successfully acquired over 480,000 published genus names, including valid names, synonyms and a subset of published misspellings, all allocated to families (most) or higher ranks (remainder) in an internally coherent taxonomic structure, most with marine/nonmarine and extant/fossil flags, all with the source from which I acquired them, sources for the flags, and more; also for perhaps 50% of genera, lists of associated species from wherever it has been convenient to acquire them (Catalogue of Life 2006 being a major source, but many others also used). My estimated 6 months has turned into 10 years and counting, but I do figure that the bulk of the basic “names acquisition” has been done for all groups (my estimate: over 95% complete) and it is rare (although not completely unknown) for me to come across genus names not yet held, at least for the period 1753-2014 which is the present coverage of IRMNG; present effort is therefore concentrated on correcting internal errors and inconsistencies, and upgrading the taxonomic placement (to family) for the around 100,000 names where this is not yet held (also establishing the valid name/synonym status of a similar number of presently “unresolved” generic names).

With the move of the system from Australia to VLIZ, completed within the last couple of months, there is the facility to utilise all of the software and features presently developed at VLIZ that currently runs WoRMS, the World Register of Marine Species and its many associated subsidiary databases, as well as (potentially) look at forming a distributed editing network for IRMNG in the future, as already is the case for WoRMS, presuming that others are see a value in maintaining IRMNG as a useful resource e.g. for taxonomic name resolution, detection of potential homonyms both within and across kingdoms, and generally acting as a hierarchical view of “all life” to at least genus level. A recently implemented addition to IRMNG is to hold ION identifiers (also used in BioNames), for the subset of names where ION holds the original publication details, enabling “deep links” to both ION and BioNames wherein the original publication can often be displayed, as previously described elsewhere in this Blog. Similar identifiers for plants are not yet held in the system but could be, (for example Index Fungorum identifiers for fungi), for cases where the potential linked system adds value in giving, for example, original publication details and onward links to the primary literature.

All in all I feel that the exercise has been of value not only to OBIS (the original “client”) but also to other informatics systems such as GBIF, Encyclopedia of Life, Atlas of Living Australia, Open Tree of Life and others who have all taken advantage of IRMNG data to add to their systems, either for the marine/nonmarine and extant/fossil flags or as an addition to their primary taxonomic backbones, or both. In addition it has allowed myself, the founding IRMNG compiler, to “scratch the taxonomic itch” and finally flesh out what is meant by statements that a certain group contains x families or y genera, and what these actually might be. Finally, many users of the system via its web interface have commented over time on how useful it is to be able to input “any” name, known or unknown, with a good chance that IRMNG can tell them something about the genus (or genus possible options, in the case of homonyms) as well as the species, in many cases, as well as discriminate extant from fossil taxon names, something not yet offered to any significant extent by the current Catalogue of Life.

Readers of iPhylo are encouraged to try IRMNG as a “taxonomic name resolution service” by visiting www.irmng.org and of course, welcome to contact me with suggestions of missed names (concentrating at genus level at the present time) or any other ideas for improvement (which I can then submit for consideration to the team at VLIZ who now provide the technical support for the system).

Tuesday, December 01, 2015

Guest post: 10 years of global biodiversity databases: are we there yet?

YtNkVT2U This guest post by Tony Rees explores some of the themes from his recent talk 10 years of Global Biodiversity Databases: Are We There Yet?.

10 years of global biodiversity databases: are we there yet? from Tony Rees

A couple of months ago I received an invitation to address the upcoming 2015 meeting of the Malacological Society of Australasia (Australian and New Zealand mollusc persons for the uninitiated) on some topic associated with biodiversity databases, and I decided that a decadal review might be an interesting exercise, both for my potential audience (perhaps) and for my own interest (definitely). Well, the talk is delivered and the slides are available on the web for viewing if interested, and Rod has kindly invited me to present some of its findings here, and possibly stimulate some ongoing discussion since a lot of my interests overlap his own quite closely. I was also somewhat influenced in my choice of title by a previous presentation of Rod's from some 5 years back, "Why aren't we there yet?" which provides a slightly pessimistic counterpoint to my own perhaps more optimistic summary.

I decided to construct the talk around 5 areas: compilations of taxonomic names and associated valid/accepted taxa; links to the literature (original citations, descriptions, more); machine-addressable lists of taxon traits; compilations of georeferenced species data points such as OBIS and GBIF; and synoptic efforts in the environmental niche modelling area (all or many species so as to be able to produce global biodiversity as well as single-species maps). Without recapping the entire content of my talk (which you can find on SlideShare), I thought I would share with readers of this blog some of the more interesting conclusions, many of which are not readily available elsewhere, at least not with effort to chase down and/or make educated guesses.

In the area of taxonomic names, for animals (sensu lato) ION has gone up from 1.8m to 5.2m names (2.8m to 3.5m indexed documents) from all ranks (synonyms not distinguished) over the cited period 2005-2015, while Catalogue of Life has gone up from 0.5m species names + ?? synonyms to 1.6m species names + 1.3m synonyms over the same period; for fossils, BioNames database is making some progress in linking ION names to external resources on the web but, at less than 100k such links, is still relatively small scale and without more than a single-operator level of resourcing. A couple of other "open access" biological literature indexing activities are still at a modest level (e.g. 250k-350k citations, as against an estimated biological literature of perhaps 20m items) at present, and showing few signs of current active development (unless I have missed them of course).

Comprehensive databases of taxon traits (in machine addressable form) appear to have started with the author’s own "IRMNG" genus- and species- level compendium which was initially tailored to OBIS needs for simply differentiating organisms into extant vs. fossil, marine vs. nonmarine. More comprehensive indexes exist for specific groups and recently, Encyclopedia of Life has established "TraitBank" which is making some headway although some of the "traits" such as geographic distribution (a bounding box from either GBIF or OBIS) and "number of GenBank sequences" stretch the concept of trait a little (just my two cents' worth, of course), and the newly created linkage to Google searches is to be applauded.

With regard to aggregating georeferenced species data (specimens and observations), both OBIS (marine taxa only) and GBIF (all taxa) have made quite a lot of progress over the past ten years, OBIS increasing its data holdings ninefold from 5.6m to 44.9m (from 38 to 1,900+ data providers) and GBIF more than tenfold from 45m to 577m records over the same period, from 300+ to over 15k providers. While these figures look healthy there are still many data gaps in holdings e.g. by location sampled, year/season, ocean depth, distance to land etc. and it is probably a fair question to ask what is the real "end point" for such projects, i.e. somewhere between "a record for every species" and "a record for every individual of every species", perhaps...

Global / synoptic niche modelling projects known to the author basically comprise Lifemapper for terrestrial species and AquaMaps for marine taxa (plus some freshwater). Lifemapper claims "data for over 100,000 species" but it is unclear whether this corresponds to the number of completed range maps available at this time, while AquaMaps has maps for over 22,000 species (fishes, marine mammals and invertebrates, with an emphasis on fishes) each of which has a point data map, a native range map clipped to where the species is believed to occur, an "all suitable habitat map" (the same unclipped) and a "year 2100 map" showing projected range changes under one global warming scenario. Mapping parameters can also be adjusted by the user using an interactive "create your own map" function, and stacking all completed maps together produces plots of computed ocean biodiversity plus the ability to undertake web-based "what [probably] lives here" queries for either all species or for particular groups. Between these two projects (which admittedly use different modelling methodologies but both should produce useful results as a first pass) the state of synoptic taxon modelling actually appears quite good, especially since there are ongoing workshops e.g. the recent AMNH/GBIF workshop Biodiversity Informatics and Modelling Species Distributions at which further progress and stumbling blocks can be discussed.

So, some questions arising:

Who might produce the best "single source" compendium of expert-reviewed species lists, for all taxa, extant and fossil, and how might this happen (my guess: a consortium of Catalogue of Life + PaleoBioDB at some future point)
Will this contain links to the literature, at least citations but preferably as online digital documents where available? (CoL presently no, PaleoBioDB has human-readable citations only at present)
Will EOL increasingly claim the "TraitBank" space, and do a good enough job of it? (also bearing in mind that EOL is still an aggregator, not an original content creator, i.e. somebody still has to create it elsewhere)
Will OBIS and/or GBIF ever be "complete", and how will we know when we’ve got there (or, how complete is enough for what users might require)?
Same for niche modelling/predicted species maps: will all taxa eventually be covered, and will the results be (generally) reliable and useable (and at what scale); or, what more needs to be done to increase map quality and reliability.

Opinions, other insights welcome!