Tuesday, November 29, 2011

Mapping names to literature: closing in on 250,000 names

Following on from my earlier post Linking taxonomic names to literature: beyond digitised 5×3 index cards I've been slowly updating my latest toy:


This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed identifiers, URLs) as well as links to freely available PDFs where they exist. There is lots still to do: about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps to fill, for example missing DOIs or PubMed identifiers, and a lot of the earlier names have only "microcitations" attached to them, which I'll need to handle (using code from my earlier project Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature).
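To give a flavour of what handling a microcitation involves, here is a minimal Python sketch. It is not the code behind the database: the "journal, volume, page" layout, the `MICROCITATION` pattern, the `parse_microcitation` function, and the sample string are all assumptions made up for illustration, and real microcitations come in many more shapes than this.

```python
import re
from typing import Optional

# Assumed layout (hypothetical): "<abbreviated journal>, <volume>, <page>."
MICROCITATION = re.compile(
    r"^(?P<journal>.+?),\s*(?P<volume>\d+)\s*,\s*(?P<page>\d+)\.?$"
)


def parse_microcitation(text: str) -> Optional[dict]:
    """Split a microcitation into journal, volume and page.

    Returns None when the string does not fit the assumed layout,
    so unparsed citations can be counted and reviewed separately.
    """
    match = MICROCITATION.match(text.strip())
    if match is None:
        return None
    return {
        "journal": match.group("journal").strip(),
        "volume": int(match.group("volume")),
        "page": int(match.group("page")),
    }


if __name__ == "__main__":
    # Invented example string, not taken from the database.
    print(parse_microcitation("Proc. zool. Soc. Lond., 12, 345."))
```

Returning None rather than raising an error keeps the failures visible, which matters when a sizeable fraction of the citations don't parse yet.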

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example primary taxonomic sources for taxa currently on the IUCN Red List, or for taxa that have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNii URLs all serve RDF (albeit not without a few problems).
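For the linked data crowd, here is a minimal sketch of asking a DOI for RDF using HTTP content negotiation. It assumes the `https://doi.org/` resolver and the `application/rdf+xml` media type; the function names `build_rdf_request` and `fetch_rdf` are made up for this example, and the DOI shown is just a placeholder, not one from the database. Whether a given DOI actually returns RDF depends on its registration agency.

```python
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"  # assumed resolver base URL


def build_rdf_request(doi: str) -> urllib.request.Request:
    """Build a request that asks the DOI resolver for RDF/XML."""
    # Keep the "/" between prefix and suffix, escape anything else unsafe.
    url = DOI_RESOLVER + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})


def fetch_rdf(doi: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(build_rdf_request(doi), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    request = build_rdf_request("10.1000/182")  # placeholder DOI
    print(request.full_url, request.get_header("Accept"))
```

Splitting the request construction from the network call makes it easy to check the headers without hitting the resolver, and to swap in a different media type if RDF/XML isn't what you want.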

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if its original description has been digitised.