Monday, January 25, 2016

iSpecies is back: mashing up species data

A decade ago (OMG, that can't be right, an actual decade ago) I created "iSpecies", a simple little tool to mash up a variety of data from GBIF, NCBI, Yahoo, Wikipedia, and Google Scholar to create a search engine for species. It was written in PHP, relied on some degree of *cough* web scraping to get its data, and was a bit of a toy (although that didn't stop me complaining that it could do more than EOL at the time). Eventually I got sick of dealing with Google Scholar constantly changing its HTML and blocking IP addresses to stop people harvesting data (I once managed to get my entire campus blocked), and with services disappearing, such as Yahoo's image search, so I pulled the plug on it.

A short course I run on "phyloinformatics" starts this week and one of the examples I show is a crude Javascript-based mashup. It struck me that I could tweak that and recreate a simple version of iSpecies, and that's exactly what I've done:


It's nothing fancy: it just takes a species name, searches GBIF, EOL, CrossRef, and Open Tree of Life, grabs some data, and puts it together on a web page. There are lots of limitations (e.g., it only fetches the first 300 localities in GBIF, requires scientific names, and the tree viewer is pretty awful) but it was pretty simple to put together. It's entirely client-side: the code is all in the HTML file (plus a few Javascript libraries) and is on GitHub.
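To give a flavour of how little is involved, here is a minimal sketch of the kind of client-side code such a mashup needs: a function that builds the query URLs for each JSON web service from a species name. The endpoints shown are illustrative (based on the public GBIF, CrossRef, and EOL APIs as I understand them, and not taken from the actual iSpecies code), so check each service's documentation before relying on them.

```javascript
// Sketch: build the per-service query URLs a client-side mashup might use.
// Endpoint paths and parameters are assumptions; consult each API's docs.
function buildQueries(name) {
  const q = encodeURIComponent(name);
  return {
    // GBIF occurrence search, capped at 300 records as in iSpecies
    gbif: `https://api.gbif.org/v1/occurrence/search?scientificName=${q}&limit=300`,
    // CrossRef bibliographic search for papers mentioning the name
    crossref: `https://api.crossref.org/works?query=${q}`,
    // EOL page search (endpoint shape circa 2016; may have changed since)
    eol: `https://eol.org/api/search/1.0.json?q=${q}`
  };
}
```

Each URL would then be fetched (e.g., via XMLHttpRequest or JSONP at the time), and the JSON responses rendered into the page.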

Fun as this was, there's a bigger problem with iSpecies: it is a "mashup". I'm simply grabbing data from different sources and redisplaying it. What I really want is what has been described as a "meshup" (awful term, don't use it), that is, I want to combine the data so that it is more than the sum of its parts. For example, some of the data could be cross-linked (especially if we add a few more sources and drill down a bit). Some of the papers discovered by CrossRef may include original descriptions, or may be the source of some of the points plotted on the GBIF map. Some may include the phylogenies used to build the Open Tree of Life tree. In order to build a data meshup instead of a web mashup we need to operate at the level of data rather than just human-readable web pages. That is the next thing I'd like to work on, and in many ways it shouldn't be a big leap. The new iSpecies was fairly easy to create because we now have a bunch of web services that all speak JSON. It's a small step from JSON to JSON-LD (especially if the JSON-LD is constructed with reuse in mind). So while it's nice to see iSpecies back, there's a much more interesting next step to think about.
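The JSON-to-JSON-LD step can be sketched very simply: take a plain JSON record and add an `@context` that maps its keys onto shared vocabularies, plus an `@id` giving the record a resolvable identifier. The Darwin Core term URIs below are real, but the wrapper function and the field names it maps are my own illustration, not part of iSpecies.

```javascript
// Sketch: lift a plain JSON record into JSON-LD by adding a context that
// maps its keys to a shared vocabulary (Darwin Core here), so records from
// different services can be linked on common identifiers.
function toJsonLd(record, id) {
  return {
    "@context": {
      scientificName: "http://rs.tdwg.org/dwc/terms/scientificName",
      decimalLatitude: "http://rs.tdwg.org/dwc/terms/decimalLatitude",
      decimalLongitude: "http://rs.tdwg.org/dwc/terms/decimalLongitude"
    },
    "@id": id, // a resolvable identifier for this record (assumed supplied)
    ...record
  };
}
```

Once two services describe the same specimen or paper with the same `@id` (or the same vocabulary terms), linking their records becomes a join rather than a scrape.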

Wednesday, January 13, 2016

Surfacing the deep data of taxonomy

My paper "Surfacing the deep data of taxonomy" (based on a presentation I gave in 2011) has appeared in print as part of a special issue of ZooKeys:

Page, R. (2016, January 7). Surfacing the deep data of taxonomy. ZooKeys. Pensoft Publishers.
The manuscript was written shortly after the talk, but as is the nature of edited volumes it's taken a while to appear.

My tweet about the paper sparked some interesting comments from David Shorthouse.

This is an appealing vision, because it seems unlikely that having multiple, small communities clustered around taxa will ever have the impact that taxonomists might like to have. Perhaps if we switch to focussing on objects (sequences, specimens, papers), notions of identity (e.g., DOIs, ORCID), and alternative measures of impact we can increase the visibility and perceived importance of the field. In this context, the recent paper "Wikiometrics: A Wikipedia Based Ranking System" looks interesting. A big consideration will be how connected the network linking taxonomists, papers, sequences, specimens, and names turns out to be. If it's anything like the network of readers in Mendeley then we may face some challenges in community building around such a network.