iPhylo: birds

Roderic D. M. Page

Thursday, May 14, 2015

The value of ION to GBIF

This is a quick writeup of an analysis I did to make the case that the list of names held by the Index of Organism Names (ION) (part of Thomson Reuters) would be very useful for GBIF. I must declare a bias, in that I've spent a good chunk of the last 3-4 years exploring the ION database and investigating ways to link the taxonomic names it contains to the primary taxonomic literature, culminating in building BioNames.

What makes ION special is its scope (it endeavours to have all names covered by the ICZN), and that many of its names have associated citation information (i.e., details on the publication that published the name). Like any name database it has duplications and errors, and some of the older content is a bit ropey, but it's a tremendous resource and from my perspective nothing else in zoology comes close.

But rather than rely on anecdote, I decided to do a quick analysis to see what ION could potentially add to GBIF. I've been doing some work on bird names recently, so as an exercise I searched GBIF for holotype specimens for birds. The search (13 May 2015) returned 11,664 records. I then filtered those on taxonomic names that GBIF could not match exactly (TAXON_MATCH_FUZZY) or names that GBIF could only match to a higher rank (TAXON_MATCH_HIGHERRANK). The query URL is:

http://www.gbif.org/occurrence/search?TAXON_KEY=212&TYPE_STATUS=HOLOTYPE&ISSUE=TAXON_MATCH_FUZZY&ISSUE=TAXON_MATCH_HIGHERRANK
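For anyone wanting to vary the query, here is a minimal Python sketch that rebuilds the URL above from its parts. The function name build_search_url is mine and purely illustrative; the parameter names and values are copied from the URL itself.

```python
from urllib.parse import urlencode

# Parameters exactly as they appear in the query URL above.
# A key that repeats (ISSUE) is given as a list.
params = {
    "TAXON_KEY": 212,
    "TYPE_STATUS": "HOLOTYPE",
    "ISSUE": ["TAXON_MATCH_FUZZY", "TAXON_MATCH_HIGHERRANK"],
}


def build_search_url(base, params):
    """Join a base URL and query parameters (hypothetical helper)."""
    # doseq=True expands list values into repeated key=value pairs.
    return base + "?" + urlencode(params, doseq=True)


url = build_search_url("http://www.gbif.org/occurrence/search", params)
print(url)
```

Keeping the two ISSUE values in a list makes it easy to add or drop an issue flag without editing the URL by hand.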

This query found 6,928 records, so over half the bird holotype specimens in GBIF do not match a taxonomic name in GBIF. What this means is that GBIF can't accurately place these names in its own taxonomic hierarchy. It also makes it hard to do meaningful analyses of things such as "how long does it take from when a bird specimen is collected to when it is described as a new species?" because if you can't match the name then you can't get the date the name was published.

To explore this further, I downloaded the results of the query (the download has DOI http://doi.org/10.15468/dl.vce3ay). I then wrote a script to parse the specimen records and extract the GBIF occurrence id, catalogue number, and scientific name. I then used the GBIF API to retrieve (where available) the verbatim record for each specimen (using the URL http://api.gbif.org/v1/occurrence/<id>/verbatim where <id> is the occurrence id). This gives us the original name on the specimen, which I then looked up in BioNames using its API. If I got a hit I extracted the identifier of the name (the LSID in the ION database) and the corresponding publication id in BioNames (if available). If there was a publication associated with the name I then generated a human-readable citation using BioNames’s citeproc API. The code for all this is on GitHub.
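The verbatim lookup step can be sketched in a few lines of Python. This is not the script on GitHub, just an illustration under two assumptions: that the occurrence id sits in the URL path between "occurrence/" and "/verbatim", and that the verbatim record is a JSON object whose fields are keyed by Darwin Core term URI. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: verbatim records key their fields by Darwin Core term URI.
DWC_SCIENTIFIC_NAME = "http://rs.tdwg.org/dwc/terms/scientificName"


def verbatim_url(occurrence_id):
    """Build the verbatim-record URL for one GBIF occurrence id."""
    return f"http://api.gbif.org/v1/occurrence/{occurrence_id}/verbatim"


def verbatim_name(record):
    """Return the name as supplied by the data provider, or None if absent."""
    name = record.get(DWC_SCIENTIFIC_NAME)
    return name.strip() if name else None


def fetch_verbatim(occurrence_id):
    """Fetch and decode one verbatim record (needs network access)."""
    with urlopen(verbatim_url(occurrence_id)) as response:
        return json.load(response)
```

With network access, verbatim_name(fetch_verbatim(883603238)) would return whatever name the provider supplied for the first specimen in the sample below, and that name is what gets looked up in BioNames.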

Here's a sample of the mapping:

Occurrence	Holotype	GBIF matched name	Verbatim name	ION	BioNames	Publication
883603238	USNM PAL378357.3368464	Porzana Vieillot, 1816	Porzana severnsi	879659	2c4f3...	Olson, S. L., & James, H. F. (1991). Descriptions of thirty-two new species of birds from the Hawaiian Islands: Part 1. Non-Passeriformes. Ornithological Monographs, 45, 1-88. doi:10.2307/40166794
858732312	AMNH Skin-245914	Otus choliba (Vieillot, 1817)	Otus choliba duidae	4307811	b3315...	Chapman, F. M., & History, T. D. E. of the A. M. of N. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. Retrieved from http://hdl.handle.net/2246/3988
858732345	AMNH Skin-245936	Atlapetes Wagler, 1831	Atlapetes duidae	4307791	b3315...	Chapman, F. M., & History, T. D. E. of the A. M. of N. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. Retrieved from http://hdl.handle.net/2246/3988
858733764	AMNH Skin-45339	Leptotila Swainson, 1837	Leptotila gaumeri Lawr.
858744126	AMNH Skin-218110	Zosterops Vigors & Horsfield, 1827	Zosterops alberti ablita

The complete result of this mapping can be viewed here. Of the 6,392 holotypes with names not recognised by GBIF, nearly half (3,165, 49.5%) exactly matched a name in ION. Many of these are also linked to the publication that published that name.
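To make the "exact match" step concrete, here is a small Python sketch of the tally. The helper exact_match_rate is hypothetical, and the toy ION list holds only the three sample names shown with an ION id in the table above; the real lookup went through the BioNames API.

```python
def exact_match_rate(verbatim_names, known_names):
    """Return the exactly matched names and the percentage matched."""
    known = set(known_names)
    matched = [name for name in verbatim_names if name in known]
    rate = 100 * len(matched) / len(verbatim_names) if verbatim_names else 0.0
    return matched, rate


# Verbatim names from the sample table above.
sample = [
    "Porzana severnsi",
    "Otus choliba duidae",
    "Atlapetes duidae",
    "Leptotila gaumeri Lawr.",
    "Zosterops alberti ablita",
]
# Toy stand-in for ION: the sample names that have an ION id in the table.
toy_ion = {"Porzana severnsi", "Otus choliba duidae", "Atlapetes duidae"}

matched, rate = exact_match_rate(sample, toy_ion)

# The headline figure reported in the post: 3,165 of 6,392 holotypes.
reported = round(100 * 3165 / 6392, 1)
```

Note that "Leptotila gaumeri Lawr." fails an exact match because of the trailing author abbreviation, which is the kind of case approximate matching would be needed for.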

So, adding ION helps us find half the missing holotype names. This is before doing anything more sophisticated, such as approximate string matching, resolving synonyms, etc. Hence, I'd argue that the names in ION would add a lot to GBIF's ability to interpret the occurrence records it receives from museums.

I've not had time for further analysis, but at first glance a lot of the missed names are subspecies, there are quite a few fossils, and many names are in the relatively older literature. However, there are also some recently described taxa, such as the hawk-owl Ninox rumseyi Rasmussen et al. 2012, and a bunting subspecies from Tristan da Cunha (Nesospiza acunhae fraseri Ryan, 2008) that are missing from GBIF.

Thursday, May 16, 2013

The impact of museum collections: one collection ≈ one Nobel Prize

iPhylo: GBIF specimens in BioStor: who are the top ten museums?(Unfortunately not @amnh) iphylo.blogspot.com/2012/02/gbif-s…
— Susan Perkins (@NYCuratrix) May 14, 2013

Ideas on measuring the "impact" of a natural history collection have been bubbling along, as reflected in recent comments on iPhylo, and some offline discussions I've been having with David Blackburn and Alan Resetar.

My focus has been at the specimen-level, with a view to motivating the adoption of persistent specimen-level identifiers so that we can track citations of specimens over time (e.g., in publications and databases such as GenBank). Not only does this provide a measure of the "impact" of a collection, it helps with provenance. If we sequence a specimen that is subsequently assigned to a different taxon and we have a way of tracking that specimen via its identifier, then we can transmit that new identification to other consumers of data based on that specimen. For example, we could automatically notify GenBank that what we thought was an x is actually a y.

So I made a simple "league table" of museum collections based on specimens cited in BioStor. There are all sorts of issues with this approach. Once you rank collections, people may use that to argue some can be axed and more resources funnelled into others. A more positive approach would be to identify collections that are underused, and try to figure out why. And in the same way that taxonomic papers may have a long citation life, specimens may sit in a museum for a long time before being cited (for example, when eventually recognised as a new species doi:10.1016/j.cub.2012.10.029). So, metrics can be a double-edged sword.

Citing specimens is a useful metric, but not all citations are equal, and not all citations are immediate. A specimen that yields DNA sequences that are published in, say, Nature, arguably has more weight than a specimen listed in a rarely cited paper. Likewise, subsequent citations of a paper that cites a specimen should confer more weight on the value of that specimen. Elsewhere (doi:10.1093/bib/bbn022, preprint here: hdl:10101/npre.2008.1760.1) I've argued for a Google PageRank-style way to measure the impact of a specimen that takes into account papers and other objects derived from a specimen (e.g., images, sequences).
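To give a flavour of what a PageRank-style measure means here, below is a minimal Python sketch of plain PageRank by power iteration. It is not the method from the paper cited above; the graph, the node names, and the damping value of 0.85 are all made up for illustration. Each link points from a citing object (paper, sequence) to the thing it cites or is derived from.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration over a dict of outgoing links."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared equally.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its rank on, split evenly among what it cites.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical citation graph: citing object -> objects it cites.
links = {
    "paper_A": ["specimen_1"],
    "paper_B": ["specimen_1", "specimen_2"],
    "sequence_X": ["specimen_1"],
    "paper_C": ["paper_A"],
}
ranks = pagerank(links)
```

In this toy graph specimen_1 ends up ranked above specimen_2, both because more objects point to it and because paper_A is itself cited by paper_C, which is the "citations of citations" effect described above.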

Meanwhile, Morgan Jackson alerted me to a quicker way to get a measure of the impact of the collection.

@rdmpage @nycuratrix Check Nature for a recent note about this. A bird museum calc. their collection's h-score from papers citing specimens
— Morgan Jackson (@BioInFocus) May 16, 2013

The "short note" Morgan refers to is by Kevin Winker and Jack J. Withrow:

Winker, K., & Withrow, J. J. (2013). Natural history: Small collections make a big impact. Nature, 493(7433), 480–480. doi:10.1038/493480b

They constructed a Google Scholar profile and collected papers that cite the University of Alaska Museum's bird collection (see here for full details). The h-score of this collection of papers is 42, which Winker and Withrow note is "equivalent to an average Nobel laureate in physics". Here's the graph of citations over time:

Chart 1

It's a neat trick, if a little time-consuming. But one advantage it has is that it puts collections on a similar footing to individual researchers. You could imagine asking the question "how much money would you spend supporting a researcher at this level?" How does this compare to the resources actually being spent?
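For reference, the h-score itself is simple to compute once you have the citation counts: it is the largest h such that h papers each have at least h citations. A minimal Python sketch, using made-up counts rather than the museum's actual data:

```python
def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


# Made-up citation counts for five papers citing a collection.
example = h_index([10, 8, 5, 4, 3])
```

The time-consuming part is assembling the list of papers that cite the collection, not the calculation.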

One thing I hope will emerge from discussions like this is a desire to make specimens first-class citizens of the web, with stable identifiers that enable them to be cited in the same way we cite papers and, increasingly, data sets.