Wednesday, July 17, 2013

Augmenting ZooKeys bibliographic data to flesh out the citation graph

In a previous post (Learning from eLife: GitHub as an article repository) I discussed the advantages of an Open Access journal putting its article XML in a version-controlled repository like GitHub. In response to that post Pensoft (the publisher of ZooKeys) did exactly that, and the XML is available at https://github.com/pensoft/ZooKeys-xml.

OK, "now what?" I hear you ask. Originally I'd used the example of incorrect bibliographic data for citations as the motivation, but there are other things we can do as well. For example, when reading a ZooKeys article (say, using my eLife Lens-inspired viewer) I notice references that should have a DOI but which don't. With the XML available I could add this. This adds another link in the citation graph (in this case connecting the ZooKeys paper with the article it cites). If Pensoft were to use that XML to regenerate the HTML version of the article on their web site then the reader will be able to click on the DOI and read the cited article (instead of the "cut-and-paste-and-Google-it" dance). Furthermore, Pensoft could update the metadata they've submitted to CrossRef, so that CrossRef knows that the reference with the newly added DOI has been cited by the ZooKeys paper.

To experiment with this I've written some scripts that take ZooKeys XML, extract each citation from the list of literature cited, and look up a DOI for each reference that lacks one (using the CrossRef metadata search API). If a DOI is found then I insert it into the original XML. I then push this XML to my fork of Pensoft's repository (https://github.com/rdmpage/ZooKeys-xml). I can then ask Pensoft to update their repository (by issuing a "pull request"), and if Pensoft like what they see, they can accept my edits.
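
To make this concrete, here's a minimal sketch of the kind of script involved (not the actual code behind my fork). It uses the current CrossRef REST API at api.crossref.org rather than the metadata search service mentioned above, and assumes the references follow the usual JATS conventions (ref, mixed-citation, and pub-id elements); the score threshold is just an illustrative heuristic.

# Minimal sketch: add missing DOIs to a JATS reference list.
import requests
from lxml import etree

def add_missing_dois(xml_path, out_path, min_score=75.0):
    tree = etree.parse(xml_path)
    for ref in tree.findall('.//ref'):
        # skip references that already carry a DOI
        if ref.find('.//pub-id[@pub-id-type="doi"]') is not None:
            continue
        citation = ref.find('mixed-citation')
        if citation is None:
            continue
        text = ' '.join(citation.itertext()).strip()
        r = requests.get('https://api.crossref.org/works',
                         params={'query.bibliographic': text, 'rows': 1},
                         timeout=30)
        items = r.json()['message']['items']
        if items and items[0].get('score', 0) >= min_score:
            # append the DOI as a pub-id element inside the citation
            pub_id = etree.SubElement(citation, 'pub-id')
            pub_id.set('pub-id-type', 'doi')
            pub_id.text = items[0]['DOI']
    tree.write(out_path, xml_declaration=True, encoding='UTF-8')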

Automating the process makes this much more scalable, although manual editing will still be useful in some cases, especially where the original references haven't been correctly atomised into title, journal, etc.

So that the output is visible independently of Pensoft deciding whether to accept it, I've updated my ZooKeys article viewer to fetch the XML not from the ZooKeys web site, but from my GitHub repository. This means you get the latest version of the XML, complete with additional DOIs (if any have been added).

Initial experiments are encouraging, but it's also apparent that lots of citations lack DOIs. However, this doesn't mean that they aren't online. Indeed, a growing number of articles are available through my BioStor repository, and through BioNames. Both of these sites have an API, so the next step is to add them to the script that augments the XML. This brings us a little closer to the ultimate goal of having every taxonomic paper online and linked to every paper that either cites, or is cited by, that paper.

Monday, July 15, 2013

Using @IFTTT to create a Twitter stream for @EvolDir

Since 2009 I've been running a service that takes posts to the EvolDir mailing list and sends them to a Twitter stream at @EvolDir. This service was running on a local machine which has died. Rather than wait until I rebuild that server (again), I looked around for other ways to recreate this service. I decided to use IFTTT ("if this then that"), which is a wonderful way to chain together web services.

IFTTT uses "recipes" to chain together two services (if something happens here, do this). For example, given an RSS feed of recent messages to EvolDir, I can create a recipe to send the links to those posts to Twitter:
[Screenshot: the RSS-to-Twitter recipe]
Great, only EvolDir is an old-fashioned mailing list distributed by email only. There is no web site for the list that has URLs to each post, and no RSS feed. IFTTT to the rescue again. I created a recipe that reads emails for a GMail account that is subscribed to EvolDir, and each time it gets an email from EvolDir it sends the contents of the email to a blog on Tumblr:
[Screenshot: the Gmail-to-Tumblr recipe]
Now each post has a web presence, and Tumblr generates an RSS feed for the blog, so the first recipe can take that feed and send each new post to Twitter. Simples.

Friday, July 12, 2013

Learning from eLife: GitHub as an article repository

Playing with my eLife Lens-inspired article viewer and some recent articles from ZooKeys, I regularly come across articles that are incorrectly marked up. As a quick reminder, my viewer takes the DOI for a ZooKeys article (just append it to http://bionames.org/labs/zookeys-viewer/?doi=, e.g. http://bionames.org/labs/zookeys-viewer/?doi=10.3897/zookeys.316.5132), fetches the corresponding XML and displays the article.

Taking the article above as an example, I was browsing the list of literature cited and trying to find those articles in BioNames or BioStor. Sometimes an article that should have been found wasn't, and on closer investigation the problem was that the ZooKeys XML had mangled the citation. To illustrate, take the following XML:

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten. Stuttgarter Beiträge zur Naturkunde.</article-title> <source>Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

Note the contents of the article-title (title) and source (journal) tags: the journal title has been swept up into the article title. The actual title and journal should look like this:

<ref id="B112"><mixed-citation xlink:type="simple"><person-group><name name-style="western"><surname>Tschorsnig</surname> <given-names>HP</given-names></name><name name-style="western"><surname>Herting</surname> <given-names>B</given-names></name></person-group> (<year>1994</year>) <article-title>Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: Bestimmungstabellen und Angaben zur Verbreitung und Ökologie der einzelnen Arten.</article-title> <source>Stuttgarter Beiträge zur Naturkunde, Serie A (Biologie)</source> <volume>506</volume>: <fpage>1</fpage>-<lpage>170</lpage>.</mixed-citation></ref>

Tools that rely on accurately parsed metadata to find articles, such as OpenURL resolvers, will fail in cases like this. Of course, we could use tools that don't have this requirement, but we could also fix the XML so that OpenURL resolution succeeds.
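
To see why the parsing matters, compare the OpenURL (key/value) queries that would be built from the mangled and from the corrected metadata. This is purely illustrative: the resolver base URL is a placeholder, not a real endpoint.

from urllib.parse import urlencode

BASE = 'http://example.org/openurl?'   # placeholder resolver

mangled = {
    'genre': 'article',
    'atitle': ('Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: '
               'Bestimmungstabellen und Angaben zur Verbreitung und Ökologie '
               'der einzelnen Arten. Stuttgarter Beiträge zur Naturkunde.'),
    'title': 'Serie A (Biologie)',   # journal title as parsed from the XML
    'volume': '506', 'spage': '1', 'epage': '170', 'date': '1994',
}

# the same reference with title and journal split correctly
corrected = dict(mangled,
    atitle=('Die Raupenfliegen (Diptera: Tachinidae) Mitteleuropas: '
            'Bestimmungstabellen und Angaben zur Verbreitung und Ökologie '
            'der einzelnen Arten'),
    title='Stuttgarter Beiträge zur Naturkunde, Serie A (Biologie)')

print(BASE + urlencode(mangled))    # a lookup keyed on "Serie A (Biologie)" fails
print(BASE + urlencode(corrected))  # a lookup on the real journal title can succeed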

This is where the example of the journal eLife comes in. They deposit article XML in GitHub where anyone can grab it and mess with it. Let's imagine we did the same for ZooKeys, created a GitHub repository for the XML, and then edited it in cases where the article metadata is clearly broken. A viewer like mine could then fetch the XML, not from ZooKeys, but from GitHub, and thus take advantage of any corrections made.

We could imagine this as part of a broader workflow that would also incorporate other sources of articles, such as BHL. We could envisage workflows that take BHL scans, convert them to editable XML, then repurpose that content (see BHL to PDF workflow for a sketch). I like the idea that there is considerable overlap between the most recent publishing ventures (such as eLife and ZooKeys) and the goal of bringing biodiversity legacy literature to life.

Thursday, July 11, 2013

Barcode Index Number (BIN) System in DNA barcoding explained

Quick note to highlight the following publication:
Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.) PLoS ONE, 8(7), e66213. doi:10.1371/journal.pone.0066213
This paper outlines the methods used by the BOLD project to cluster sequences into "BINs", and touches on the issue of dark taxa (taxa that are in GenBank but which lack formal scientific names). Might be time to revisit the dark taxa idea, especially now that I've got a better handle on the taxonomic literature (see BioNames) where the names of at least some dark taxa may lurk.

Wednesday, July 10, 2013

The challenge of semantically marking up articles (more thoughts on PLoS Hubs)

Here's a sketch of how to build something like the original PLoS Hubs vision (see The demise of the @PLoS Biodiversity Hub: what lessons can we learn? for background). In the blog post explaining the vision behind PLoS Hubs (Aggregating, tagging and connecting biodiversity studies), David Mindell wrote:

The Hub is a work in progress and its potential benefits are many. These include enhancing information exchange and databasing efforts by adding links to existing publications and adding semantic tagging of key terms such as species names, making that information easier to find in web searches. Increasing linkages and synthesizing of biodiversity data will allow better analyses of the causes and consequences of large scale biodiversity change, as well as better understanding of the ways in which humans can adapt to a changing world.

So, first up, some basic principles. The goal (only slightly tongue in cheek) is for the user to never want to leave the site. Put another way, I want something as engaging as Wikipedia, where almost all the links are to other Wikipedia pages (more on this below). This means every object (publication, taxon name, DNA sequence, specimen, person, place) gets a "page". Everything is a first class citizen.

It seems to me that the fundamental problem faced by journals that "semantically enhance" their content by adding links is that those links go somewhere else (e.g., to a paper on another journal's site, to an external database). So, why would you do this? Why actively send people away? Google can do this because you will always come back. Other sites don't have this luxury. In the case of citations (i.e., adding DOIs to the list of literature cited) I guess the tradeoff is that because all journals are in this together, you will receive traffic from your "rivals" if they publish papers that cite your content. So you are building a network that will (potentially) drive traffic to you (as well as send people away), a network across which traffic can flow (i.e., the "citation graph").

If we add other sorts of links (say to GenBank, taxon databases, specimen databases like GBIF, locality databases such as Protected Planet, etc.) then that traffic has no obvious way of coming back. This also has a deeper consequence: those databases don't "know" about these links, so we lose the reverse citation information. For example, a semantically marked-up paper may know that it cites a sequence, but that sequence doesn't know that it has been cited. We can't build the equivalent citation graphs for data.

One way to tackle this is to bring the data "in house", and this is what Pensoft are doing with their markup of ZooKeys. But they are basically creating mashups of external data sources (a bit like iSpecies). We need to do better.

Prototyping

One way to make this more concrete is to think how a hub could be prototyped. Let's imagine we start with a platform like Semantic MediaWiki. I've an on-again/off-again relationship with Semantic MediaWiki: it's powerful, but basically a huge, ugly hack on top of a huge piece of software that wasn't designed for this purpose. Still, it's a way to conceptualise what can be done. So, I'm not arguing that Semantic MediaWiki is how to do this for real (trust me, I'm really, really not), but that it's a way to explore the idea of a hub.

OK, let's say we start with a paper, say an Open Access paper from PLoS. We parse the XML, extract every entity that matters (authors, cited articles, GenBank accessions, taxon names, georeferenced localities, specimen codes, chemical compounds, etc.) and then create pages for each entity (including the article itself). Each page has a URL that uses an external id (where available) as a "slug" (e.g., the URL for the page for a GenBank sequence includes the accession number).
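
As a rough sketch of what "exploding" an article might look like (the regular expression and the slug scheme here are purely illustrative; real entity extraction needs much more care):

import re
from lxml import etree

def explode_article(xml_path):
    tree = etree.parse(xml_path)
    text = ' '.join(tree.getroot().itertext())
    pages = {}

    # the article itself, keyed by its DOI
    doi = tree.findtext('.//article-id[@pub-id-type="doi"]')
    if doi:
        pages['article/' + doi] = {'type': 'article'}

    # cited articles that already carry DOIs
    for ref in tree.findall('.//ref'):
        pid = ref.find('.//pub-id[@pub-id-type="doi"]')
        if pid is not None and pid.text:
            pages['article/' + pid.text] = {'type': 'cited-article'}

    # GenBank-style accession numbers mentioned anywhere in the text
    for acc in set(re.findall(r'\b[A-Z]{1,2}\d{5,6}\b', text)):
        pages['sequence/' + acc] = {'type': 'sequence'}

    # each key becomes the URL slug for a hub page
    return pages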

Now we have exploded the article into a set of pages, which are linked to the source article (we can use Semantic MediaWiki to specify the type of those links), and each entity "knows" that it has a relationship with that article.

Then we set about populating each page. The article page is trivial, just reformat the text. For other entities we construct pages using data from external databases wherever possible (e.g., metadata about a sequence from GenBank).

So far we have just one article. But that article is linked to a bunch of other articles (those that it cites), and there may be other less direct links (e.g., GenBank sequences are linked to the article that publishes them, taxonomic names may be linked to articles that publish the name, etc.). We could add all these articles to a queue and process each article in the same way. In a sense we are now crawling a graph that includes the citation graph, but it includes links the citation graph misses (such as articles that cite the same data, see Enhanced display of scientific articles using extended metadata for more on this).

The first hurdle we hit will be that many articles are not open access, in which case they can't be exploded into full text and associated entities. But that's OK, we can create article pages that simply display the article metadata, and link(s) to articles in the citation graph. Furthermore, we can get some data citation links for closed access articles, e.g. from GenBank.

So now we let the crawler loose. We could feed it a bunch of articles to start with (e.g., those in the original Hub, if that list still exists, or those from various RSS feeds such as PLoS, ZooKeys, and BMC).
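
The crawler itself need be nothing more elaborate than a queue, along these lines (explode_article_by_doi is a stand-in for the fetch-and-extract step sketched above, returning pages plus any cited DOIs it finds):

from collections import deque

def crawl(seed_dois, explode_article_by_doi, max_articles=1000):
    queue, seen, hub = deque(seed_dois), set(), {}
    while queue and len(seen) < max_articles:
        doi = queue.popleft()
        if doi in seen:
            continue
        seen.add(doi)
        pages, cited_dois = explode_article_by_doi(doi)
        hub.update(pages)
        # newly discovered articles join the queue
        queue.extend(d for d in cited_dois if d not in seen)
    return hub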

Entry points

Users can enter the hub in various ways. Via an article would be the traditional way (e.g., via a link from the article itself on the PLoS site). But let's imagine we are interested in a particular organism, such as Macaroni Penguins. PLoS has an interesting article on their feeding [Studying Seabird Diet through Genetic Analysis of Faeces: A Case Study on Macaroni Penguins (Eudyptes chrysolophus) doi:10.1371/journal.pone.0000831]. This article includes sequence data for prey items. If we enhance this article we build connections between the bird and its prey. At the simplest level, the hub pages for the crustaceans and fish it feeds on would include citations to this article, enabling people to follow that connection (with more sophistication the nature of the relationship could also be specified). This article refers to a specific locality and penguin colony, which is in a marine reserve (Heard Island and McDonald Islands Marine Reserve). Extract these entities, and other papers relevant to this area would be linked as they are added (e.g., using a geospatial query). Hence people interested in what we know about the biology of organisms in specific localities could dive in via the name (or URL) of the locality.

Summary

The core idea here is taking an article, exploding it, and treating every element as a first class citizen of the resulting database. It is emphatically not about a nice way to display a bunch of articles, it's not a "collection", and articles aren't privileged. Nor are articles from any particular publisher privileged. One consequence of this is that it may not appeal to an individual publisher because it's not about making a particular publisher's content look better than another's (although having Open Access content in XML makes it much easier to play this game).

The real goal of an approach like this is to end up with a database (or "knowledgebase") that is much richer than simply a database of articles (or sequences, or specimens), and which moves from being a platform for repurposing article text to a platform for facilitating discovery.

Tuesday, July 09, 2013

The problem with apps for journals

I'm a big fan of the work Nature Publishing Group is doing in experimenting with new methods of publishing, such as their iOS apps (which inspired me to "clone" their iPhone app) and the Encode app with the concept of "threads". But there's an aspect of the iPad app that puzzles me: Nature's app doesn't know that a linked article elsewhere in Nature is also available using the app. To illustrate, consider this piece on horse evolution (doi:10.1038/nature.2013.13261) which is available in the iPad app (with a subscription):

[Screenshot: the News & Comment piece in the iPad app]

This News & Comment piece links to the main article (doi:10.1038/nature12323), which is also available in the app:

[Screenshot: the research article in the iPad app]

But if I click on the link in the app, I get the article displayed in a web browser view (just as if I was in a browser on the iPad, or any other device):

[Screenshot: the same article rendered in the in-app web browser view]

This is odd, to say the least. If the article is available in the app (and much more attractively formatted than on the web) why doesn't the app "know" this and take me to that view?

In a sense this is part of a larger problem with publisher-specific apps. If the article isn't part of the publisher's catalogue then you are off to the web (or, indeed, another publisher's app), which makes for a jarring reading experience. Each web site (or app) has its own way of doing things. Part of this is because different publishers represent different silos, mostly locked behind paywalls.

We can extend this argument to links to other cited entities, such as sequences, specimens, chemicals, etc. In most cases these aren't linked, and if they are, we are whisked off to a web site somewhere else (say, GenBank), a web site that, furthermore, typically knows nothing about the article we came from (e.g., it doesn't know about the citation relationship we've just traversed). I think we can do better than this, but it will need us to treat links as more than simply jump-off points to the wider web. For example, if the Nature app not only knew about all the Nature articles that were available to it, but also stored information about DNA sequences, chemical compounds, taxa, and other entities, then we could have a richer reading experience with potentially denser cross links (e.g., the app could display a genome and list all the Nature articles that cite that genome). Of course, this is still limited by publisher, and ultimately we want to break that silo as well. This is the attraction of starting with Open Access publications so that we can link articles across publishers, but still navigate through their content in a consistent way.

The demise of the @PLoS Biodiversity Hub: what lessons can we learn?

Jonathan Eisen recently wrote that the PLOS Hub for Biodiversity is soon to be retired, and sure enough it's vanished from the web (the original URL hubs.plos.org/web/biodiversity/ now bounces you straight to http://www.plosone.org/; you can still see what it looked like in the Wayback Machine).

Like Jonathan, I was involved in the hub, which was described in the following paper:
Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. (S. A. Rands, Ed.) PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491

In retrospect PLoS's decision to pull the hub is not surprising. The original proposal imagined a web site looking like this, with the goal of building a "dynamic community".

[Screenshot: mock-up of the hub from the original proposal]

From my perspective the PLoS Hub failed for two reasons. The first is that PLoS weren't nearly as ambitious as they could have been. The second is that the biodiversity informatics community simply couldn't (and arguably still can't) provide the kind of services that PLoS would have needed to make the Hubs something worth nurturing.

After a meeting at the California Academy of Science in April 2010 to discuss the hub idea I wrote a ranty blog post (Biodiversity informatics = #fail (and what to do about it)) where I expressed my frustration that we had a group of people (i.e., PLoS) rock up and express serious interest in doing something with biodiversity data, and biodiversity informatics collectively failed them. We could have been aiming for a cool database of "semantically enhanced" publications that we could query taxonomically, geographically, phylogenetically, etc. (at least, that's what I was hoping PLoS were aiming for). Instead it became clear that most of the basic services were simply not available (we didn't have a simple tool to extract GenBank accession numbers, specimen codes, etc., we couldn't link specimen codes to anything online, and woe betide you if you asked what a taxon name was).

In fairness, it also became pretty clear that PLoS weren't going to go too far down the line of an all-singing portal to biodiversity data. They were really looking at a shiny web site that housed a collection of Open Access papers on biodiversity. But my point is it could have been so much more than that. We had a chance to build a platform, a knowledge base for biodiversity data that had an accessible front end (e.g., the traditional publication) but exploded that into its component parts so we could spin the data around and ask other questions.

Inspired by the possibilities I spent the next couple of months playing with some linked data demos (see here and here, the links in these demos have long since died). The idea was to explore how much of what I imagined the PLoS Hub could be was possible to build using RDF and SPARQL. It was fun, but RDF and SPARQL are awful things to "play" with, and the vast bulk of the data had to be wrapped in custom scripts I wrote because the original data providers didn't supply RDF. As I've written elsewhere, I think the cost of getting to a place where RDF enables you to do meaningful stuff is just too high. Our data are too messy, we lack agreed identifiers, and we either have too many or too few vocabularies (and those we do have invariably spark lengthy, philosophical debates - vocabularies are taxonomies of data, need I say more). The RDF approach is also doomed to fail because it assumes multiple decentralised data repositories are the way forward. In my experience, these cannot deliver the kinds of things we need. The data need to be brought together, cleaned, aligned, augmented, and finally linked together. This is much easier to do if all the data are in one place.

So where does this leave us? In many ways I'd like to attempt something like PLoS Hubs again, or perhaps more precisely, think about building a platform so that if a publisher came along and wanted to do something similar (but more ambitious) we would have the tools in place that could make it happen. What I'd like is a way more sophisticated version of this, where you could explore data in various dimensions (geography, taxonomy, phylogeny), track citation and provenance information (what papers cite this specimen, what sequences is it a voucher for, what trees are built on those sequences). If we had a platform that supported these sorts of queries, not only could we provide a great environment in which to embed scientific publications, we could also support the kinds of queries we can't do at the moment (e.g., give me all the molecular phylogenies for species in Madagascar, locate all the data - publications, taxonomic identifications, sequences - about a specimen, etc.).

I'll leave you with a great rant about platforms. It's long but it's fun, and I think it speaks to where we are now in biodiversity informatics (hint, we aren't Amazon).

Wednesday, June 19, 2013

A new way to view taxonomic publications

One of the goals of BioNames is to be more than simply another taxonomic database. In particular, I'm interested in the idea of having a platform for viewing taxonomic publications. One way to think about this is to consider the experience of viewing Wikipedia. For any given page in Wikipedia there will be links to other, related content in Wikipedia. Reading an article about a city, you can go and read about the country the city occurs in. Reading about a battle, you can discover more about the generals who fought it. The ability to discover all this interconnected information in one place is compelling.

I'd like something similar for taxonomy. Given that a taxonomic database is in essence a collection of taxonomic names and publications, and a taxonomic publication is in essence a collection of names and citations of taxonomic publications, why not embed the publication within the database and have the names and citations link to the corresponding entries in the database?

Based on some earlier efforts (e.g., Towards an interactive taxonomic article: displaying an article from ZooKeys) and inspired by the eLife Lens project, I've created a live demo of a way to view articles from the journal ZooKeys. Below is a screencast:



If you want to try this out, here are some live examples:

Note the pattern in the URL: just append the DOI for an article to http://bionames.org/labs/zookeys-viewer/?doi=

Everything is a bit rough, but it's working well enough for you to get the basic idea. Code is on GitHub. Essentially the viewer grabs the ZooKeys HTML, extracts the URL for the XML file, fetches that, then uses some XSLT style sheets to convert the XML into something viewable. There's a sprinkling of JavaScript to call the BioNames API. Much of the code could be tweaked to accept other NLM XML-based articles, such as content from PLoS and the BMC journals.
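
For the curious, the pipeline is roughly the following. This is a sketch only: how the XML link is located (here, the first href ending in ".xml") and the stylesheet name are assumptions, not the viewer's actual code.

import re
import requests
from lxml import etree

def render_article(doi, xslt_path='zookeys.xsl'):
    # resolve the DOI to the article's landing page
    html = requests.get('https://doi.org/' + doi, timeout=30).text
    # look for a link to the XML version of the article
    match = re.search(r'href="([^"]+\.xml)"', html)
    if not match:
        raise ValueError('No XML link found for ' + doi)
    xml = requests.get(match.group(1), timeout=30).content
    doc = etree.fromstring(xml)
    # apply the XSLT stylesheet to produce displayable HTML
    transform = etree.XSLT(etree.parse(xslt_path))
    return str(transform(doc))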

One direction this could go in is to make a viewer like this the default viewer in BioNames for ZooKeys articles, so that instead of being restricted to a PDF you can interactively navigate between the article and the cited literature. Indeed, the very action of locating cited references in BioNames builds citation links. We could imagine extending the approach to content that isn't in NLM XML, such as Zootaxa PDFs, or content from BHL. Eventually I'd like to have the taxonomic literature fully embedded in the database, not as PDF or image silos, but as documents linked to names and literature. The journal becomes a database.

More GBIF taxonomy fail

In browsing the GBIF classification in BioNames I keep coming across cases of wholesale duplication of taxa. I recently blogged about a single example, the White-browed Gibbon, but here's a larger example involving frogs.

Consider the frog genera Philautus and Raorchestes. The latter was described in 2010:

A ground-dwelling rhacophorid frog from the highest mountain peak of the Western Ghats of India (2010). Current Science (Bangalore) 98(8): 1119–1125. http://bionames.org/references/e0ab13cbb8bc8b3627bb53e88e7641a9
and contains a number of species previously in Philautus. The GBIF classification for Philautus still has these species, which means that these taxa appear twice in the GBIF data portal (associated with different occurrences).

To gauge the scale of the problem I've done a crude pairwise plot of species names in the two genera. In the diagram below a dot (●) appears if the species name in the corresponding row and column is identical. The diagonal corresponds to comparisons of a species name with itself.

Note the ●'s that appear off the diagonal. These are species in Philautus and Raorchestes that have the same species name (e.g., Philautus glandulosus and Raorchestes glandulosus); these off-diagonal dots indicate duplicated taxa. A sketch of how this comparison can be scripted appears after the list.


● ● Raorchestes anili
● ● Raorchestes annandalii
● ● Raorchestes beddomii
● ● Raorchestes bobingeri
● ● Raorchestes bombayensis
● ● Raorchestes chalazodes
● ● Raorchestes charius
● ● Raorchestes dubois
● ● Raorchestes flaviventris
● ● Raorchestes glandulosus
● ● Raorchestes graminirupes
● ● Raorchestes griet
● ● Raorchestes gryllus
● ● Raorchestes longchuanensis
● ● Raorchestes luteolus
● ● Raorchestes menglaensis
● ● Raorchestes munnarensis
● ● Raorchestes nerostagona
● ● Raorchestes ochlandrae
● ● Raorchestes parvulus
● ● Raorchestes ponmudi
● ● Raorchestes sahai
● ● Raorchestes shillongensis
● ● Raorchestes signatus
● ● Raorchestes terebrans
● ● Raorchestes tinniens
● ● Raorchestes travancoricus
● ● Raorchestes tuberohumerus
● ● Raorchestes viridis
● Philautus abditus
● Philautus abundus
● Philautus acutirostris
● Philautus acutus
● Philautus adspersus
● Philautus albopunctatus
● Philautus alto
● Philautus amboli
● Philautus amoenus
● Philautus andersoni
● ● Philautus anili
● ● Philautus annandalii
● Philautus asankai
● Philautus aurantium
● Philautus auratus
● Philautus aurifasciatus
● Philautus banaensis
● Philautus basilanensis
● ● Philautus beddomii
● ● Philautus bobingeri
● ● Philautus bombayensis
● Philautus bunitus
● Philautus caeruleus
● Philautus cardamonus
● Philautus carinensis
● Philautus cavirostris
● ● Philautus chalazodes
● ● Philautus charius
● Philautus cinerascens
● Philautus cornutus
● Philautus crnri
● Philautus cuspis
● Philautus decoris
● Philautus dimbullae
● Philautus disgregus
● Philautus dubius
● ● Philautus dubois
● Philautus duboisi
● Philautus erythrophthalmus
● Philautus eximius
● Philautus extirpo
● Philautus femoralis
● Philautus fergusonianus
● ● Philautus flaviventris
● Philautus folicola
● Philautus frankenbergi
● Philautus fulvus
● Philautus garo
● ● Philautus glandulosus
● Philautus gracilipes
● ● Philautus graminirupes
● ● Philautus griet
● ● Philautus gryllus
● Philautus gunungensis
● Philautus hainanus
● Philautus hallidayi
● Philautus halyi
● Philautus hazelae
● Philautus hoffmanni
● Philautus hoipolloi
● Philautus hosii
● Philautus hypomelas
● Philautus ingeri
● Philautus jacobsoni
● Philautus jerdonii
● Philautus jinxiuensis
● Philautus kempiae
● Philautus kempii
● Philautus kerangae
● Philautus leitensis
● Philautus leucorhinus
● Philautus limbus
● ● Philautus longchuanensis
● Philautus longicrus
● Philautus lunatus
● ● Philautus luteolus
● Philautus macropus
● Philautus maia
● Philautus malcolmsmithi
● Philautus maosonensis
● Philautus medogensis
● ● Philautus menglaensis
● Philautus microdiscus
● Philautus microtympanum
● Philautus mittermeieri
● Philautus mjobergi
● Philautus mooreorum
● ● Philautus munnarensis
● Philautus namdaphaensis
● Philautus nanus
● Philautus narainensis
● Philautus nasutus
● Philautus neelanethrus
● Philautus nemus
● ● Philautus nerostagona
● Philautus ocellatus
● ● Philautus ochlandrae
● Philautus ocularis
● Philautus odontotarsus
● Philautus oxyrhynchus
● Philautus pallidipes
● Philautus papillosus
● Philautus pardus
● Philautus parkeri
● ● Philautus parvulus
● Philautus petersi
● Philautus petilus
● Philautus pleurotaenia
● Philautus poecilius
● Philautus polillensis
● ● Philautus ponmudi
● Philautus poppiae
● Philautus popularis
● Philautus procax
● Philautus quyeti
● Philautus refugii
● Philautus regius
● Philautus reticulatus
● Philautus rhododiscus
● Philautus romeri
● Philautus rugatus
● Philautus rus
● ● Philautus sahai
● Philautus sanctipalustris
● Philautus sanctisilvaticus
● Philautus sarasinorum
● Philautus saueri
● Philautus schmackeri
● Philautus schmarda
● Philautus semiruber
● ● Philautus shillongensis
● ● Philautus signatus
● Philautus silus
● Philautus silvaticus
● Philautus simba
● Philautus similipalensis
● Philautus similis
● Philautus sordidus
● Philautus steineri
● Philautus stellatus
● Philautus stictomerus
● Philautus stuarti
● Philautus supercornutus
● Philautus surdus
● Philautus surrufus
● Philautus tectus
● Philautus temporalis
● ● Philautus terebrans
● ● Philautus tinniens
● ● Philautus travancoricus
● Philautus truongsonensis
● ● Philautus tuberohumerus
● Philautus tytthus
● Philautus umbra
● Philautus variabilis
● Philautus vermiculatus
● ● Philautus viridis
● Philautus vittiger
● Philautus williamsi
● Philautus woodi
● Philautus worcesteri
● Philautus wynaadensis
● Philautus zal
● Philautus zamboangensis
● Philautus zimmeri
● Philautus zorro
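
For what it's worth, here's a minimal sketch of the comparison behind the plot. In reality the name lists would come from the GBIF checklist data; here they are simply passed in as hypothetical lists.

def duplicated_epithets(philautus_names, raorchestes_names):
    # compare species epithets (the last word of each binomial)
    epithet = lambda name: name.split()[-1].lower()
    philautus = {epithet(n) for n in philautus_names}
    raorchestes = {epithet(n) for n in raorchestes_names}
    return sorted(philautus & raorchestes)

# e.g. duplicated_epithets(['Philautus glandulosus', 'Philautus abditus'],
#                          ['Raorchestes glandulosus'])
# returns ['glandulosus']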


Why does GBIF have duplicate frogs? As in the gibbon example, the names come from different sources, and GBIF doesn't have access to (or doesn't use) data that tells it that the names are synonyms. In this case there is a clash between the Catalogue of Life, which doesn't recognise Raorchestes, and the IUCN Red List, which does. The end result is a mess.

We clearly need better tools for catching these problems. We also need a decent database of taxonomic names and synonyms. The Catalogue of Life is, frankly, grossly inadequate in this respect, especially for vertebrate taxa. Increasingly it's becoming clear that the classification underlying the GBIF portal needs some serious work.

Thursday, June 13, 2013

BioNames - colourful phylogenies and downloadable SVG

My latest tweak to BioNames is to add colour to the phylogenies. Terminal nodes with the same name are labelled with the same background colour. For example, here is a tree for fiddler and ghost crabs:
[Screenshot: coloured phylogeny of fiddler and ghost crabs]
The colours make it easier to see that this tree has a mixture of a few sequences from divergent taxa, and a lot of sequences from the same taxa.

Note that you can now also download the SVG drawing of the tree. Click on the Download button and (in at least some browsers, such as Chrome) the SVG will download. Other browsers may open the SVG in a separate window, in which case simply save it to your computer.

Wednesday, June 12, 2013

The first five minutes are free - renting articles on DeepDyve

In 2011 I wrote a short post about DeepDyve, a service where you could rent access to an article. DeepDyve has launched a "5-Minute Freemium" service where you can view an article online for 5 minutes, for free. You have to log in, either with DeepDyve or using Facebook, but no actual money changes hands. If you want to read for longer, or download an article, then you have to get out your credit card.

I've added support for DeepDyve to BioNames. If an article is available in DeepDyve, BioNames displays a link (see http://bionames.org/references/6952b806f87de2106669b2412043a4ab for an example). DeepDyve makes it possible to quickly check a fact (for example, the spelling of a taxonomic name). It obviously doesn't tackle bigger issues such as access to text for data mining, but if you just need to check something, or follow a lead, then it's an interesting and useful wrinkle on publishing models.

Gibbons and GBIF: good grief what a mess

[Image: White-browed Gibbon, from EOL]
One reason I built BioNames (and the related digital archive BioStor) was to create tools to help make sense of taxonomic names. In exploring databases such as GBIF and the NCBI taxonomy every so often you come across cases where things have gone horribly wrong, and to make sense of them you have to drill down into the taxonomic literature.

It's becoming increasingly clear to me that large parts of the GBIF classification that underpins their data portal are, well, a mess. There are duplicate taxa, homonyms, orphan genera, and so on. Now, building a global taxonomy on the scale of GBIF is a tough problem. They are merging a lot of individual classifications into an overall synthesis. That would be a challenging problem in itself, but it's compounded by inconsistent use of names for the same taxon. In other words, synonymy. This is the greatest self-inflicted wound in taxonomy: the desire to have names be meaningful in terms of relationships (i.e., species in the same genus should be related). If you require that, then the consequence is a mess (unless you have a really good taxonomic database in place to track name changes, and we don't).

As an example, consider the White-browed Gibbon (shown here in an image from EOL). In GBIF this taxon occurs in at least three different places in the classification (each name has occurrence data associated with it):

GBIF id    Name                                 Source                                      Occurrences
5219549    Hylobates hoolock (Harlan, 1834)     The Catalogue of Life, 3rd January 2011     141
4267262    Bunopithecus hoolock Harlan, 1834    Mammal Species of the World, 3rd edition    2
5786121    Hoolock hoolock (Harlan, 1834)       IUCN Red List of Threatened Species         3


To keep things simple I've omitted the subspecies (such as Bunopithecus hoolock hoolock). Note that three key resources for names (the Catalogue of Life, Mammal Species of the World, and the IUCN) can't agree on what to call this ape. The names are also not entirely consistent. For example, as written, Bunopithecus hoolock Harlan, 1834 (from Mammal Species of the World, 3rd edition) would imply that this was the original name for this gibbon (because the authority [Harlan, 1834] is not in parentheses). This is incorrect: the original name of the White-browed Gibbon is Simia hoolock, and you can see the original description in BioStor:

Harlan R (1834) Description of a Species of Orang, from the north-eastern province of British East India, lately the kingdom of Assam. Transactions of the American Philosophical Society 4: 52–59. http://biostor.org/reference/127799
Since then it has been shuffled around various genera, including a genus (Hoolock) for which it is the type species:
Mootnick A, Groves C (2005) A new generic name for the hoolock gibbon (Hylobatidae). International Journal of Primatology 26(4): 971–976. doi: 10.1007/s10764-005-5332-4.
GBIF regards all three names as being different taxa, despite all being names for the same gibbon. The practical consequence of this is that anyone seeking a comprehensive summary of what GBIF knows about the White-browed Gibbon is going to get different data depending on which name they use. In my experience this is not an uncommon occurrence (bats are another case where the GBIF classification is a terrible hodgepodge).

My goal here is not to berate GBIF, they are trying to aggregate messy, inconsistent data on a massive scale. But we need tools to flag cases like this poor gibbon, and ways to ensure that once we've found a problem it is fixed once and for all.
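
One simple tool along these lines is to ask the GBIF species-match API what it makes of each name and flag names that come back as separate accepted usages. A sketch, assuming the current api.gbif.org/v1 interface (field names may differ in detail):

import requests

names = ['Hylobates hoolock', 'Bunopithecus hoolock', 'Hoolock hoolock']

usages = {}
for name in names:
    r = requests.get('https://api.gbif.org/v1/species/match',
                     params={'name': name}, timeout=30)
    data = r.json()
    usages[name] = (data.get('usageKey'), data.get('status'))

print(usages)
# if the three names map to different usageKeys, each with status ACCEPTED,
# then GBIF is treating one gibbon as three taxa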

Saturday, June 08, 2013

BioNames phylogenies screencast

I've created a short screencast showing some of the phylogenies in BioNames. If you want to see these for yourself here are the links:

Friday, June 07, 2013

BioNames - Phylogenies? Yes, phylogenies

One of the things that didn't make last week's deadline for launching BioNames was the inclusion of phylogenies. This was disappointing as one of the reasons I built BioNames was to help span what I see as the gulf between classical biodiversity informatics and its emphasis on taxonomic names and classification, and modern phylogenetics where the tree is the primary focus, not some arbitrary way to partition it up.

So, where to get lots of phylogenies? I use the wonderful PhyLoTA database built by Mike Sanderson and colleagues:
Sanderson, M., Boss, D., Chen, D., Cranston, K., & Wehe, A. (2008). The PhyLoTA Browser: Processing GenBank for Molecular Phylogenetics Research. Systematic Biology, 57(3), 335-346. doi:10.1080/10635150802158688

I grabbed a dump of the trees, matched them to sequences in GenBank (more accurately, the European version, EMBL), did some post-processing of those sequences, threw them into CouchDB, built an SVG viewer, and voilà.
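
The "threw them into CouchDB" step is as simple as it sounds: each tree becomes a JSON document. A sketch (the database name and document structure here are illustrative, not the actual BioNames schema):

import requests

COUCH = 'http://localhost:5984/phylota'   # illustrative database name

def store_tree(cluster_id, newick, taxon_ids):
    doc = {'cluster': cluster_id,
           'newick': newick,      # the tree itself
           'taxa': taxon_ids}     # NCBI taxon ids for the tips
    # PUT creates (or raises a conflict for) a document with cluster_id as its _id
    r = requests.put(f'{COUCH}/{cluster_id}', json=doc, timeout=30)
    r.raise_for_status()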

Here is a tree for the fig wasp family Agaonidae, showing the interactive zoomable tree viewer, and thumbnails for other trees for this taxon:

[Screenshot: interactive tree viewer and thumbnails for the fig wasp family Agaonidae]

Here is a phylogeny for a genus of deep-sea mussels (Bathymodiolus), showing a map based on those sequences that are georeferenced in GenBank:

[Screenshot: Bathymodiolus phylogeny with a map of georeferenced sequences]

Lastly, here is the page for the bat family Vespertilionidae in the NCBI classification. Click on the "Data" tab to see this view.

[Screenshot: trees for the bat family Vespertilionidae]

There's still lots to do on this, but the key parts are in place. Personally I can happily while away the day just browsing through the trees, looking for cases where taxa lack scientific names, obvious cases of synonymy (take a look at this tree for fiddler and ghost crabs, for example), and evidence that "species" have considerable internal phylogenetic structure.

Tuesday, June 04, 2013

BioNames and where taxonomy is published

I've added a simple "dashboard" to BioNames to display some basic data about what is in the database. Apart from a table of the number of bibliographic identifiers in the database (currently there are 54,422 publications with DOIs, for example), there are some graphic summaries. These are a bit slow to load as they are created on the fly.

Publishers

The first summarises the relative frequency of articles from different publishers (broadly defined to include digital repositories such as DSpace and JSTOR). For most of this information I'm using data returned when I resolve a DOI at CrossRef. The data is incomplete and likely to change as I add more articles, and as CouchDB finally catches up and indexes all the data.

[Bubble chart: articles per publisher]

The biggest blob is BioStor, which is my project to extract articles from BHL. Magnolia Press publish Zootaxa; then there are some well-known mainstream publishers such as Springer, Wiley, and Taylor & Francis (Informa UK). These publishers have digitised the back catalogues of a number of society journals, so their prominence here doesn't mean that they are actively publishing new taxonomic content. One use for a diagram like this is to think about what content to data mine. BioStor content is open access (via BHL) and so can be readily mined. Some articles in Zootaxa are open access and so could also be downloaded and processed. Then we have the big commercial publishers, who have a significant fraction of taxonomic content behind their paywalls. If the community was to think about mining this data, then this diagram suggests which publishers to start asking first.
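
For those wondering where the publisher information comes from: resolving a DOI with content negotiation returns CSL JSON that includes a publisher field, which can then be tallied for the chart. A rough sketch (the CrossRef REST API at api.crossref.org/works is an alternative route):

from collections import Counter
import requests

def publisher_for(doi):
    # DOI content negotiation: ask for CSL JSON, which includes "publisher"
    r = requests.get('https://doi.org/' + doi,
                     headers={'Accept': 'application/vnd.citationstyles.csl+json'},
                     timeout=30)
    return r.json().get('publisher') if r.ok else None

def publisher_counts(dois):
    return Counter(filter(None, (publisher_for(d) for d in dois)))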

Journals

The next diagram shows articles grouped by journal (using the journal's ISSN).

[Bubble chart: articles per journal]

The circles are too small to be labelled usefully. A couple of things strike me. The first is the sheer number of journals! The taxonomic literature is widely scattered across numerous different outlets, which is part of the challenge of indexing the literature (and this diagram includes only those journals that have ISSNs; many smaller or older ones don't). There is no one journal which dominates the landscape (the largest circle on the top right is Zootaxa). But this diagram spans the complete history of taxonomic publication, so includes large journals (such as Annals and Magazine of Natural History) that no longer exist (at least in their present form). It might be useful to slice this diagram by, say, decade to get a clearer picture of patterns of publication.

As the database builds I'll post some more summaries at BioNames.

Monday, June 03, 2013

BioNames and altmetrics

One consequence of having a database of literature with external identifiers such as DOIs is that we can plug into a bunch of external services to get additional information about a reference. For example, Altmetric can take a DOI and display some article-level metrics. As an experiment I've added code for Altmetric badges to the web page in BioNames that displays publications. For example, here is the ZooKeys paper "An extraordinary new family of spiders from caves in the Pacific Northwest (Araneae, Trogloraptoridae, new family)" in BioNames:

[Screenshot: Altmetric badge on a BioNames publication page]

The "About" tab displays the Altmetric badge with a bunch of metrics of engagement with this paper. If you click on the badge you get more details about what people have been saying about this paper.

It would be great to explore this across the complete set of taxonomic papers so that we could get a sense of the degree of engagement people have with the latest taxonomic literature.
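
Gathering these metrics in bulk is straightforward via the public Altmetric API, which returns a JSON summary per DOI (a 404 means no recorded activity). A sketch, using the BIN paper mentioned earlier as the example DOI; treat the field names as assumptions about the current public API.

import requests

def altmetric_summary(doi):
    # one JSON summary per DOI; 404 means Altmetric has seen no activity
    r = requests.get('https://api.altmetric.com/v1/doi/' + doi, timeout=30)
    return r.json() if r.status_code == 200 else None

info = altmetric_summary('10.1371/journal.pone.0066213')
if info:
    print(info.get('score'))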

Friday, May 31, 2013

BioNames now live - Report on project

BioNames (http://bionames.org) is live. Getting to this point was supported by funding from EOL as part of their Computable Data Challenge. The award from EOL is paying for Ryan Schenk to work on the interface and overall design of the web site, and over the last few weeks we've been working increasingly frantically to get things ready. "Ready" is a relative concept. The project is far from finished from my perspective; there is a mound of data (millions of names, hundreds of thousands of publications) that is being cleaned, cross-linked, and ultimately visualised. But the EOL funding came with a deadline and adult supervision (aka Cyndy Parr), so it was a great incentive to get something functional out the door.

What is BioNames?

Elsewhere I've argued that biodiversity informatics is fundamentally about linking stuff together, and BioNames tackles the link between a name and its publication. Ultimately I want each taxon name to be linked to its original description, and for that description to have a digital identifier (such as a DOI). It's a small step, but building those links, coupled with (where possible) bringing those publications together in one place, provides a platform to potentially do some cool stuff (more on this later). Since about 2009 I've been working on building a database of these links, and have been documenting progress (or its lack) along the way (e.g., search iPhylo for "itaxon").

Here are some screen shots (and links so you can see for yourself). It's a very early stage release, but you'll get the idea.

GBIF classification of Rousettus with Ryan's awesome taxon name timeline.

Viewing a paper: "A Tarzan yell for conservation: a new chameleon, Calumma tarzan sp. n., proposed as a flagship species for the creation of new nature reserves in Madagascar".

Coverage of articles in a journal (Proceedings of the Entomological Society of Washington).
What got built?

There is a bunch of code and documentation online:


There is also a Darwin Core Archive format dump, which was *cough* fun to create.

There have been progress reports on this blog (search for BioNames). You can also see what we got up to in the github logs.

What didn't happen

The original proposal (http://dx.doi.org/10.6084/m9.figshare.92091) was, of course, a tad ambitious, and a number of things haven't made it into this release. Phylogenies are the biggest casualty, but they are close (see Viewing phylogenies on the web: Javascript conversion of Newick tree to SVG for experiments on visualisation). It just wasn't possible to get them ready in time for the May 31 deadline. But this is on the to do list.

What's next

Now that there is a functioning web site there are several directions to explore. There is a lot of data cleaning to do, many missing references to add, taxon names to map to GBIF and NCBI, and more. I've completely glossed over the issue of reconciling author names; it's clear that the same author can appear multiple times because of variations in how their name has been recorded in various databases. There are various ways to tackle this, the most interesting being to use tools like Mendeley or ORCID to enable people to "claim" their identity.

Now that there is a mapping between the NCBI taxonomy and taxon names linked to literature, it would be great to add phylogenetic data to BioNames (which was part of the original plan). One way is by importing PhyLoTA, another is by adding support for BLAST searches that generate trees. For example, for a given taxon we could create a list of suitable sequences (e.g., DNA barcodes) and enable users to generate BLAST trees to get a sense of what the taxon is related to (and, in many cases, how much genetic differentiation there is within that taxon).

Given that BioNames has a lot of full text (from BioStor as well as numerous sources of PDFs and scans) there is huge scope for data mining. Obvious things to do are extract taxon names, geographic localities, and specimen codes (using tools I already have for BioStor). Then there is the challenge of extracting lists of literature cited and building citation networks. A small proportion of the taxonomic literature exists in XML (e.g., articles in PLoS, ZooKeys, and various SciELO journals), which makes this task a lot easier. Given that many of the cited papers will already be in BioNames, we could build a taxonomic literature reader that enabled you to treat the literature in BioNames as one big interlinked, browsable archive. I'm posting a list of ideas on Trello.