iPhylo: nomenclators

Roderic D. M. Page

Monday, November 11, 2013

Names and nomenclators: just do it already!

Quick notes on taxonomic names (again). It's a continuing source of bafflement that the biodiversity community is making a dog's breakfast of names. It seems we are forever making it more complicated than it needs to be, forever minting new acronyms that pollute the landscape without actually contributing anything useful, and forever promising shiny new tools and services without ever actually delivering them. Meanwhile, people and projects that build upon names are left to deal with a mess.

It seems to me that it would be nice if we had a single place to go to get definitive information on a name, and that place would give us a unique identifier that we could use in our own databases as a way to clean up and reconcile our data. For example, if we have a bibliographic database we can map citations to DOIs and then use those to identify the articles. If we have a list of journal names, we can map those to ISSNs and clean up our data. Likewise, if we have a classification such as GBIF or NCBI, we should be able to map the names in those classifications onto standard identifiers for taxonomic names.

The frustrating thing is we already have standard identifiers for taxonomic names. Since around 2005 we have been serving LSIDs for plant and animal names. We have Index Fungorum, IPNI, ION, and ZooBank, all serving LSIDs, all serving RDF, all using the same TDWG vocabulary.
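Part of the appeal of LSIDs is that they have a fixed shape: `urn:lsid:` followed by an authority, a namespace, an object identifier, and an optional revision. As a rough sketch (the `LSID` tuple and `parse_lsid` function are my own illustrative names, not part of any nomenclator's API), splitting an identifier such as the IPNI one `urn:lsid:ipni.org:names:77096979-1` into its parts might look like this:

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The components of a Life Science Identifier (illustrative type)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.strip().split(":")
    # An LSID has five colon-separated fields, or six with a revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


if __name__ == "__main__":
    lsid = parse_lsid("urn:lsid:ipni.org:names:77096979-1")
    print(lsid.authority, lsid.namespace, lsid.object_id)
```

The authority (here `ipni.org`) tells you which nomenclator minted the name, which is all a classification database would need to store alongside its own records.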

The nomenclators vary in size and scope, but between them we have the three major groups of multicellular eukaryotes (plants, animals, and fungi) covered (circles proportional to number of names in each database):

There is some duplication, both within nomenclators (IPNI and ION I'm looking at you) and between nomenclators (ION and ZooBank have the same scope, although ZooBank is dwarfed by ION, anyone care to explain why we have both...?). All four databases are actively growing, partly through direct registration of new taxonomic names.

So, we're basically done, right? Surely all we need to do is harvest the LSIDs for all these names, put them into a single triple store, and wrap some basic services around them? If the nomenclators provide a list of recent changes (e.g., as an RSS feed) then we could continuously update the store with new names. Then any database or classification could reconcile its names with those in the nomenclators. They could also then augment their own records by making use of additional data the nomenclators have, such as objective synonymies and links to original descriptions. In other words, we could have a model like this:
[Figure: taxon model]

Classifications represent a view of how taxa are related; the names associated with those taxa are stored in nomenclators. This means that classification databases like GBIF and NCBI are not in the business of managing names, they simply link to the nomenclators (in the same way that a bibliographic database can link to DOIs, ISSNs, and author ids such as ORCID and VIAF).

We have almost all of this infrastructure in place already. In one of the unsung triumphs of TDWG we have all the nomenclators serving data in the same format using the same technology. And yet we have singularly failed to do anything useful with this extraordinary resource! Instead we seem more interested in contributing more projects to the acronym soup of biodiversity informatics. All around us projects to assign and link identifiers for publications (CrossRef), data (DataCite), and people (ORCID) are taking off. The infrastructure for taxonomic names has been in place since 2005; we could be doing the same sort of things CrossRef, DataCite and ORCID are doing in their domains. Why aren't we?

Thursday, March 03, 2011

Microcitations: linking nomenclators to BHL

One of the challenges of linking databases of taxonomic names to the primary literature is the minimal citation style used by nomenclators (see my earlier post Nomenclators + digitised literature = fail).

For example, consider Nomenclator Zoologicus. Volumes 1-10 of this list of generic names in zoology were digitised in 2004 and put online by uBio (for more details of this project see Taxonomic informatics tools for the electronic Nomenclator Zoologicus, pmid:16501061). In Nomenclator Zoologicus the citation for the genus Abana is:

Ann. Mag. nat. Hist., (8) 2, 72.

The challenge is to link this short citation to the digital version of the corresponding article. I've been sitting on a copy of the digitised Nomenclator Zoologicus kindly provided by Dave Remsen, and I've finally started to look at the problem of mining it for links to databases such as BHL.

You can see the first attempt at http://biostor.org/microcitation.php. This form takes a genus name and the short citation and attempts to locate the corresponding page in BHL. It then checks whether the name is present on that page. Locating a page in a journal can be a challenge given the often rather ropey metadata in BHL, but BioStor uses a combination of fuzzy string matching and crude kludges to find the best match. But a further complication is that OCR errors may mean the taxonomic name we are looking for might not be detected on the page.

For example, if we search for the citation for the genus Aethriscus ("Ann. Mag. nat. Hist., (7) 10, 329.") we find two candidate pages in the journal Ann. Mag. nat. Hist., but neither contains the string "Aethriscus". However, if we use approximate string matching we find the OCR text for one page has the string "thriscus". This differs by only two characters from "Aethriscus", and so is a possible match (shown in orange).

Looking at the scanned page we can see the likely source of the problem:

In the original publication the name Aethriscus was written as Æthriscus. The ligature Æ has been corrupted by the OCR engine, and in Nomenclator Zoologicus the name is written without the ligature, hence the failure to exactly match the name with the text. These are some of the challenges faced when trying to close the circle and link names to literature.
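To make the "differs by only two characters" idea concrete, here is a minimal sketch of approximate matching using plain Levenshtein edit distance. I'm not claiming this is the algorithm BioStor runs; the function names, the threshold of 2, and the sample OCR snippet are all assumptions made up for illustration.

```python
import re
from typing import Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def best_token_match(name: str, ocr_text: str,
                     max_distance: int = 2) -> Optional[Tuple[str, int]]:
    """Return the OCR word closest to `name`, if within `max_distance` edits."""
    best = None
    for token in re.findall(r"[^\W\d_]+", ocr_text):  # runs of letters only
        d = edit_distance(name.lower(), token.lower())
        if d <= max_distance and (best is None or d < best[1]):
            best = (token, d)
    return best


if __name__ == "__main__":
    # Invented OCR snippet in which the leading ligature has been lost.
    page_text = "new genus thriscus described here"
    print(best_token_match("Aethriscus", page_text))
```

Comparing whole words keeps the sketch short, but it means a name split across two OCR tokens would be missed; a sliding window over the raw text would be the obvious next step.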

The microcitation parser is still pretty crude, but usable. You can get results in either HTML or JSON, so the task of mapping microcitations to BHL pages can be automated. At present the name matching assumes you are looking at a single word (e.g., a genus); I need to extend it to handle binomials.
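For a sense of what parsing a microcitation involves, here is a rough sketch that handles only the "journal, (series) volume, page." pattern seen in the two Nomenclator Zoologicus examples above. The regular expression and the `Microcitation` class are my own assumptions for illustration, not the code behind the BioStor form, and real citations are far messier than this.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Microcitation:
    journal: str
    series: Optional[str]
    volume: str
    page: str


MICROCITATION = re.compile(
    r"""^(?P<journal>.+?),\s*        # abbreviated journal title, up to a comma
        (?:\((?P<series>\d+)\)\s*)?  # optional series number in parentheses
        (?P<volume>\d+),\s*          # volume
        (?P<page>\d+)\.?\s*$         # page, with optional trailing full stop
    """,
    re.VERBOSE,
)


def parse_microcitation(text: str) -> Optional[Microcitation]:
    """Return the parts of a short citation, or None if it doesn't fit."""
    m = MICROCITATION.match(text.strip())
    if m is None:
        return None
    return Microcitation(m["journal"], m["series"], m["volume"], m["page"])


if __name__ == "__main__":
    print(parse_microcitation("Ann. Mag. nat. Hist., (8) 2, 72."))
```

Returning `None` rather than raising makes it easy to count how many citations in a dump fall outside the pattern, which is probably the first thing you'd want to know.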

Thursday, May 07, 2009

Nomenclators + digitised literature = fail

Continuing with RSS feeds, I've now added wrappers around IPNI that will return for each plant family a list of names added to the IPNI database in the last 30 days. You can see the list here.

One thing which is a constant source of frustration for me is the disconnect between nomenclators (lists of published names for species) and scientific publishing. The unit of digitisation for a publisher is the scientific article, but nomenclators often cite not the article in which a name was published, but the page on which the name appears.

For example, consider IPNI record 77096979-1 (or, if you prefer LSIDs, urn:lsid:ipni.org:names:77096979-1). It is for the begonia Begonia ozotothrix, and the citation is:

Edinburgh J. Bot. 66(1): 105 (-110; figs. 1, 4-5, map). 2009 [Mar 2009]

Very detailed, and great if I have access to a physical library that has the Edinburgh Journal of Botany -- I just find volume 66 on the shelf and turn to page 105. But, I want this on my computer now ("library" - who they?). How do I find this reference on the web? The answer is: not easily. Tools such as OpenURL, which could be used, assume that I know at least the starting page of the article, but IPNI doesn't tell me that. Nor do I have an article title, which might help, but a Google search on "Begonia ozotothrix" finds the article:

TWO NEW SPECIES OF BEGONIA (BEGONIACEAE) FROM CENTRAL SULAWESI, INDONESIA
D C Thomas, W H Ardi and M Hughes
Edinburgh Journal of Botany 66, 103 (2009)
doi:10.1017/S0960428609005320

Note the DOI! This article exists on the web, so why can't IPNI give me the DOI? They've gone to a lot of trouble to describe the citation in great detail, but adding the DOI brings the record into the 21st century and the web (the DOI is even printed on the article!).

I think nomenclators need to make a concerted effort to integrate with the digital scientific literature, otherwise they will remain digital backwaters that make the implicit assumption that their users have access to libraries such as that at the Royal Botanic Gardens, Edinburgh (pictured).

For recently published articles there's absolutely no reason not to store the DOI. Finding these retrospectively is a pain, but I need these for my RSS feed (and other projects) so one thing I added a while ago to bioGUID's OpenURL resolver is the ability to search for an article given an arbitrary page. For example,

http://bioguid.info/openurl/?genre=article&title=Edinburgh J. Bot.&volume=66&pages=105

will search various sources (such as CrossRef) to find an article that includes page 105. Now, I just have to have a parser that can make sense of IPNI bibliographic citations...
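As a starting point for that parser, here is a minimal sketch that pulls the journal, volume, issue and page out of the one IPNI citation quoted above and builds the corresponding resolver URL. The regular expression is an assumption tuned to that single example, the function names are hypothetical, and the code only constructs the URL (it makes no request, so it says nothing about what the resolver returns).

```python
import re
from typing import Dict, Optional
from urllib.parse import urlencode

# Base address taken from the example URL above.
OPENURL_BASE = "http://bioguid.info/openurl/"

IPNI_CITATION = re.compile(
    r"""^(?P<journal>.+?)\s+         # abbreviated journal title
        (?P<volume>\d+)              # volume
        (?:\((?P<issue>[^)]+)\))?    # optional issue in parentheses
        :\s*(?P<page>\d+)\b          # page on which the name appears
    """,
    re.VERBOSE,
)


def parse_ipni_citation(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Extract journal, volume, issue and page, or None if no match."""
    m = IPNI_CITATION.match(text.strip())
    return m.groupdict() if m else None


def openurl_for(citation: Dict[str, Optional[str]]) -> str:
    """Build a query using the same parameters as the example URL."""
    params = {
        "genre": "article",
        "title": citation["journal"],
        "volume": citation["volume"],
        "pages": citation["page"],
    }
    return OPENURL_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    raw = "Edinburgh J. Bot. 66(1): 105 (-110; figs. 1, 4-5, map). 2009 [Mar 2009]"
    parsed = parse_ipni_citation(raw)
    if parsed is not None:
        print(openurl_for(parsed))
```

Note that `urlencode` escapes the spaces in the journal title (as `+`), which the hand-typed URL above leaves raw.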