Thursday, May 16, 2013

The impact of museum collections: one collection ≈ one Nobel Prize

Ideas on measuring the "impact" of a natural history collection have been bubbling along, as reflected in recent comments on iPhylo, and some offline discussions I've been having with David Blackburn and Alan Resetar.

My focus has been at the specimen-level, with a view to motivating the adoption of persistent specimen-level identifiers so that we can track citations of specimens over time (e.g., in publications and databases such as GenBank). Not only does this provide a measure of the "impact" of a collection, it helps with provenance. If we sequence a specimen that is subsequently assigned to a different taxon and we have a way of tracking that specimen via its identifier, then we can transmit that new identification to other consumers of data based on that specimen. For example, we could automatically notify GenBank that what we thought was an x is actually a y.

So I made a simple "league table" of museum collections based on specimens cited in BioStor. There are all sorts of issues with this approach. Once you rank collections, people may use that to argue some can be axed and more resources funnelled into others. A more positive approach would be to identify collections that are underused, and try to figure out why. And in the same way that taxonomic papers may have a long citation life, specimens may sit in a museum for a long time before being cited (for example, when eventually recognised as a new species doi:10.1016/j.cub.2012.10.029). So, metrics can be a double-edged sword.

Citing specimens is a useful metric, but not all citations are equal, and not all citations are immediate. A specimen that yields DNA sequences that are published in, say, Nature, arguably has more weight than a specimen listed in a rarely cited paper. Likewise, subsequent citations of a paper that cites a specimen should confer more weight on the value of that specimen. Elsewhere (doi:10.1093/bib/bbn022, preprint here: hdl:10101/npre.2008.1760.1) I've argued for a Google PageRank-style way to measure the impact of a specimen that takes into account papers and other objects derived from a specimen (e.g., images, sequences).
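To make the PageRank idea a little more concrete, here is a minimal sketch. It assumes a tiny, made-up graph in which papers and sequences link to the specimens they are derived from, and papers cite other papers. The node names, the damping factor of 0.85 and the number of iterations are illustrative assumptions, not the scheme worked out in the paper cited above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this node's weight on to everything it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Nodes with no outgoing links (e.g. specimens) spread
                # their weight evenly so the ranks still sum to 1.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: a link means "cites" or "is derived from".
links = {
    "paper:A": ["specimen:1"],
    "paper:B": ["paper:A", "specimen:1"],
    "paper:C": ["paper:A"],
    "sequence:X": ["specimen:1"],
    "paper:D": ["specimen:2"],
}

ranks = pagerank(links)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{node}\t{score:.3f}")
```

In this toy graph specimen:1 ends up ranked above specimen:2 because it is reached both directly and through a paper that is itself cited, which is the kind of indirect credit a simple citation count misses.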

Meanwhile, Morgan Jackson alerted me to a quicker way to get a measure of the impact of the collection.
The "short note" Morgan refers to is by Kevin Winker and Jack J. Withrow:
Winker, K., & Withrow, J. J. (2013). Natural history: Small collections make a big impact. Nature, 493(7433), 480–480. doi:10.1038/493480b

They constructed a Google Scholar profile and collected papers that cite the University of Alaska Museum's bird collection (see here for full details). The h-score of this collection of papers is 42, which Winker and Withrow note is "equivalent to an average Nobel laureate in physics". Here's the graph of citations over time:

Chart  1
It's a neat trick, if a little time consuming. But one advantage it has is that it puts collections on a similar footing to individual researchers. You could imagine asking the question "how much money would you spend supporting a researcher at this level?" How does this compare to the resources actually being spent?
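For anyone wanting to repeat the exercise for their own collection, the h-index (the "h-score" above) is easy to compute once you have the citation counts for the papers that cite the collection. A minimal sketch, where the counts are made up and are not the University of Alaska Museum data:

```python
def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(counts, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


# Hypothetical citation counts for six papers citing a collection.
example_counts = [10, 8, 5, 4, 3, 0]
print(h_index(example_counts))
```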

One thing I hope will emerge from discussions like this is a desire to make specimens first-class citizens of the web, with stable identifiers that enable them to be cited in the same way we cite papers and, increasingly, data sets.

Thursday, May 02, 2013

GBIF data quality: visualising Mesibov's millipedes

Bob Mesibov (who has been a guest author on this blog) recently published a paper on data quality in ZooKeys:

Mesibov, R. (2013). A specialist’s audit of aggregated occurrence records. ZooKeys, 293(0), 1–18. doi:10.3897/zookeys.293.5111

In this paper Bob documents some significant discrepancies between data in his Millipedes of Australia (MoA) database and the equivalent data in the Atlas of Living Australia and GBIF (disclosure, I was a reviewer of the paper, and also sit on GBIF's science committee). This paper spawned a thread on TAXACOM, and also came up at the GBIF meeting I was at earlier this week.

One thing lacking from the discussion is a clear sense of just how big the discrepancies between GBIF and MoA data are, so I grabbed the data provided by Bob (http://dx.doi.org/10.3897/zookeys.293.5111.app) and extracted the records where GBIF and MoA disagreed. I converted these to GeoJSON and threw them on Google Maps:

Mesibov2

You can see a live version here http://bl.ocks.org/rdmpage/raw/5501293/ (it can take a little while for the map to appear). I've connected the MoA and GBIF localities for the same occurrence by a straight line, and the MoA records are encircled by an estimate of their uncertainty (for many records the circle is invisible at this scale).
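The conversion to GeoJSON can be sketched roughly as follows. This is not the script behind the map: the record structure, field names and coordinates are invented for illustration, and the great-circle distance is an extra that makes the size of each displacement explicit. Note that GeoJSON stores coordinates as [longitude, latitude].

```python
import json
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km, assuming a spherical Earth (radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def discrepancy_feature(record):
    """One GeoJSON LineString joining the MoA and GBIF points for a record."""
    moa, gbif = record["moa"], record["gbif"]
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": [list(moa), list(gbif)],
        },
        "properties": {
            "id": record["id"],
            "distance_km": round(haversine_km(*moa, *gbif), 1),
        },
    }


# Hypothetical records: (longitude, latitude) pairs from each source.
records = [
    {"id": "example-1", "moa": (147.0, -42.0), "gbif": (147.0, -41.0)},
    {"id": "example-2", "moa": (146.5, -41.5), "gbif": (146.5, -41.5)},
]

# Keep only the records where the two sources disagree.
collection = {
    "type": "FeatureCollection",
    "features": [discrepancy_feature(r) for r in records if r["moa"] != r["gbif"]],
}
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Sorting the features by distance_km would be one quick way to separate the spectacular discrepancies from the small displacements.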

There are some fairly spectacular discrepancies, and a lot of relatively small scale displacements of records. Does this matter? The answer to this question will depend on what people want to do with the data. You may regard the discrepancies as serious (certainly it's interesting that there are so many differences between the two data sets), or minor given the geographic scale. But visualising them at least makes it possible to form a judgement.

Thursday, April 25, 2013

BioNames update - live mockup

Things are finally coming together, at least enough to have a functioning demo. It looks awful, but shows the main things I want BioNames to do. One thing I'm most concerned about at this stage is the possible confusion users might experience between taxon names and concepts. For example, there are two pages about Pteropus, one about the name Pteropus, the other about the bat that bears this name (as understood by GBIF).

The demo is live at http://bionames.org/bionames-api/mockup_index.php (note that this is a temporary URL so I can't guarantee it will be online when you read this).

BioNames live mockup from Roderic Page on Vimeo.

Monday, April 22, 2013

BioNames update - reconciliation strategies

Over on Google Plus (yeah, me neither) Donat Agosti is giving me a hard time regarding the quality of some data that I am using. I've responded to Donat directly, but here I just want to quickly outline two different approaches to cleaning and reconciling bibliographic metadata.

The problem addressed by Donat is the issue of multiple strings for the same journal (e.g., the plethora of different abbreviations and permutations people use to refer to the same journal). In trying to make sense of this mess there are a couple of strategies we can use. One is to cluster the strings into sets that we think refer to the same thing, e.g.:

R1
We could then synthesise the preferred journal name from this set. We could make some sort of consensus string, for example. There are also some quite nice Bayesian methods for combining contradictory metadata.
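As a rough illustration of the clustering strategy, here is a sketch that groups journal strings by a crude key: lower-case the string, drop a few stop words, and keep the first three letters of each remaining word. The key function, the stop word list and the "longest string wins" choice of preferred name are all assumptions made for this example. A real consensus or Bayesian method would be more careful, and the key only handles ASCII letters.

```python
import re
from collections import defaultdict

STOPWORDS = {"of", "the", "and", "for"}


def journal_key(name, prefix_len=3):
    """Crude clustering key: word prefixes, ignoring case, punctuation and stop words."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(w[:prefix_len] for w in words if w not in STOPWORDS)


def cluster_journals(names):
    """Group strings that share the same key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[journal_key(name)].append(name)
    return dict(clusters)


def preferred_name(members):
    """Pick the longest string as a stand-in for a proper consensus."""
    return max(members, key=len)


names = [
    "Proceedings of the Biological Society of Washington",
    "Proc. Biol. Soc. Wash.",
    "Proc Biol Soc Washington",
    "Zootaxa",
]

clusters = cluster_journals(names)
for key, members in clusters.items():
    print(key, "->", preferred_name(members), members)
```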

Another approach, which I use, is to map the strings to a third party identifier, in this case an ISSN:

R2
Once I've done this I can use the identifier to refer to the journal, hence ultimately I don't particularly care what string is best for the journal (indeed, I can defer to a third party for this decision).
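A minimal sketch of the identifier approach: a lookup table from normalised strings to ISSNs, plus the standard ISSN check-digit test to catch typos in the table. The table is hand-made for illustration (check the ISSNs against an authoritative source before relying on them), and the normalisation is an assumption rather than the matching BioNames actually does.

```python
import re


def normalise(name):
    """Lower-case and strip punctuation so trivial variants collide."""
    return " ".join(re.findall(r"[a-z]+", name.lower()))


def issn_is_valid(issn):
    """Check the ISSN check digit (weights 8..2, modulo 11, 'X' means 10)."""
    chars = issn.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", chars):
        return False
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


# Illustrative, hand-made mapping from journal strings to ISSNs.
ISSN_BY_STRING = {
    normalise("Proceedings of the Biological Society of Washington"): "0006-324X",
    normalise("Proc. Biol. Soc. Wash."): "0006-324X",
    normalise("Proc Biol Soc Washington"): "0006-324X",
    normalise("Zootaxa"): "1175-5326",
}


def journal_issn(name):
    """Return the ISSN for a journal string, or None if it is not in the table."""
    return ISSN_BY_STRING.get(normalise(name))


print(journal_issn("PROC. BIOL. SOC. WASH"))
print(all(issn_is_valid(issn) for issn in ISSN_BY_STRING.values()))
```

Once every string resolves to an ISSN, two references are to the same journal exactly when their ISSNs match, whatever the strings look like.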

The point is that obsessing over clean, "correct" bibliographic metadata is something of a fool's errand. Obviously, it's nice to have clean metadata if you can get it, but in many cases there is no exact answer to what is the correct metadata. Some journals have multiple names (e.g., in different languages), some run different volume numbering schemes in parallel, and date of publication can be rather problematic (see my Mendeley group on publication dates). If we can map a publication to a globally unique identifier, such as a DOI, then we can sidestep this issue and focus on what I think really matters - linking data together.

Thursday, April 18, 2013

Thoughts on GBIC 2012 and a vision of the future of biodiversity informatics

This seems to be the season for big, arm-wavy documents about the future of biodiversity informatics (see A decadal view of biodiversity informatics: challenges and priorities). An equivalent document is being drafted based on the Global Biodiversity Informatics Conference (GBIC 2012). Writing these documents is hard work: they have to balance a set of conflicting visions, predict the future, and communicate a coherent plan to people who either could help make this happen, or feel they have a stake in the outcome.

Leaving all those constraints behind, and waving arms wildly, here's one take on the future of biodiversity informatics. I see three themes.

1. Knowing what we know

We have a limited grasp of how much we actually know, and crap tools to summarise this knowledge. I want a Google Analytics for biodiversity data where I can see at a glance the current state of our knowledge (e.g., what is the rate of sequencing of environmental samples in the Mediterranean? How much of Indonesia's amphibian fauna is in protected areas?). These are fairly trivial queries. If Google can analyse web traffic from sites being hit over a million times per day (~365 million hits per year) we can do the same thing on GBIF-scale databases. There is huge scope here for cool visualisation of the growth of our knowledge, such as this:

If biologists were explorers (Mammalia)... from Andrew W Hill on Vimeo.


Imagine the GBIF classification like this:

filesystem visualisation from wonderful websolutions on Vimeo.

2. Life stream

Terrible title, but this is where we monitor change, both "organic" and anthropogenic. This is where we use data mining to do a sentiment analysis of the biosphere, looking to detect changes such as outbreaks of disease, invasive species, etc. This builds on 1 but focusses on change. Imagine a "news service" for biology along the lines of tools available to financial markets (e.g., Silobreaker):



This is where we interface with decision makers. If Braulio Dias's statement "I am convinced that the lack of adequate biodiversity monitoring is at the heart of our difficulties to make convincing arguments" is true, then this is the theme that tackles that problem.

3. Modelling the biosphere

Time to model all life on Earth (http://dx.doi.org/10.1038/493295a) is our equivalent of a moon shot (oh how I hate that analogy). Purves et al. have made the case: this is the task that will galvanise people outside the taxonomy/biodiversity community. This is real megascience (1. is data collection, 2. is data mining and analysis). Climate modellers and oceanographers get to do this:



Can we do the same?

Wednesday, April 17, 2013

Reconciling author names using Open Refine and VIAF

In an earlier post I discussed using Open Refine (formerly Google Refine) to clean and reconcile taxon names. I've added an additional service that can be used to reconcile author names that uses the Virtual International Authority File (VIAF) API. Using this service we can match authors to VIAF identifiers (you may have noticed these appearing on people's pages in Wikipedia, e.g. Mary J. Rathbun's Wikipedia page lists her VIAF as 61796012).

To use the service follow the instructions in the earlier post but add the service:

http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php

This service is fairly crude. In particular, I make no attempt to score the matches that VIAF returns because this would require parsing and normalising author names. This could be added if needed. If you want some example names to try, here are some taxonomists:


George A Boulenger
G A Boulenger
Wilhelm Michaelsen
W Michaelsen
Colin Campbell Sanborn
Suzanne Hand
Philip Hershkovitz
Yehudah Leopold Werner
W B Spencer
Norman Platnick
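If you would rather poke at the service from a script than from Open Refine, the sketch below shows the general shape of a reconciliation request. It assumes the service follows the standard Open Refine reconciliation protocol (a queries parameter holding a JSON object of numbered queries, with a list of candidate results returned under each key). It makes no network call, and the mock response is hand-written for illustration using the VIAF identifier mentioned above, not real output from the service.

```python
import json
from urllib.parse import urlencode

SERVICE_URL = "http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php"


def build_queries(names, limit=3):
    """Batch of reconciliation queries keyed q0, q1, ..."""
    return {f"q{i}": {"query": name, "limit": limit} for i, name in enumerate(names)}


def build_request_url(names):
    """URL for a GET request carrying the batch as a JSON 'queries' parameter."""
    return SERVICE_URL + "?" + urlencode({"queries": json.dumps(build_queries(names))})


def best_matches(response):
    """First candidate id for each query, or None if nothing came back."""
    best = {}
    for key, value in response.items():
        results = value.get("result", [])
        best[key] = results[0]["id"] if results else None
    return best


names = ["George A Boulenger", "G A Boulenger"]
print(build_request_url(names))

# Hand-written example of the response shape, not real service output.
mock_response = {
    "q0": {"result": [{"id": "61796012", "name": "Mary J. Rathbun",
                       "score": 1.0, "match": True}]},
    "q1": {"result": []},
}
print(best_matches(mock_response))
```

Because the service does not score its matches, taking the first candidate is only a convenience here. In practice you would want to eyeball the candidates in Open Refine.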

Tuesday, April 16, 2013

A decadal view of biodiversity informatics: challenges and priorities

BMC Ecology has published Alex Hardisty and Dave Roberts' white paper on biodiversity informatics:

Hardisty, A., & Roberts, D. (2013). A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology, 13(1), 16. doi:10.1186/1472-6785-13-16

Here are their 12 recommendations (with some comments of my own):

  1. Open Data, should be normal practice and should embody the principles of being accessible, assessable, intelligible and usable.

    Seems obvious, but data providers are often reluctant to open "their" data up for reuse.
  2. Data encoding should allow analysis across multiple scales, e.g. from nanometers to planet-wide and from fractions of a second to millions of years, and such encoding schemes need to be developed. Individual data sets will have application over a small fraction of these scales, but the encoding schema needs to facilitate the integration of various data sets in a single analytical structure.

    No I don't know what this means either, but I'm guessing that it's relevant if we want to attempt this: doi:10.1038/493295a
  3. Infrastructure projects should devote significant resources to market the service they develop, specifically to attract users from outside the project-funded community, and ideally in significant numbers. To make such an investment effective, projects should release their service early and update often, in response to user feedback.

    Put simply, make something that is both useful and easy to use. Simples.
  4. Build a complete list of currently used taxon names with a statement of their interrelationships (e.g. this is a spelling variation; this is a synonym; etc.). This is a much simpler challenge than building a list of valid names, and an essential pre-requisite.

    One of the simplest tasks, first tackled successfully by uBio, now moribund. The Global Names project seems stalled, intent on drowning in acronym soup (GNA, GNI, GNUB, GNITE).
  5. Attach a Persistent Identifier (PID) to every resource so that they can be linked to one another. Part of the PID should be a common syntactic structure, such as ‘DOI: ...’ so that any instance can be simply found in a free-text search.

    DOIs have won the identifier wars, and everything citable (publications, figures, datasets) is acquiring one. The mistake to avoid is forgetting that identifiers need services built on top of them (see http://labs.crossref.org/ for some DOI-related tools). The core service we need is reverse lookup: given this thing (publication, specimen, etc.) what is its identifier?
  6. Implement a system of author identifiers so that the individual contributing a resource can be identified. This, in combination with the PID (above), will allow the computation of the impact of any contribution and the provenance of any resource.

    This is a solved problem, assuming ORCID continues to gain momentum. For past authors VIAF has identifiers (which are being incorporated into Wikipedia).
  7. Make use of trusted third-party authentication measures so that users can easily work with multiple resources without having to log into each one separately.

    Again, a solved problem. People routinely use third parties such as Google and Facebook for this purpose.
  8. Build a repository for classifications (classification bank) that will allow, in combination with the list of taxonomic names, automatic construction of taxonomies to close gaps in coverage.

    Let's not, let's focus on the only two classifications that actually matter because they are linked to data, namely GBIF and NCBI. If we want one classification to coalesce around make it GBIF (NCBI will grow anyway).
  9. Develop a single portal for currently accepted names - one of the priority requirements for most users.

    Yup, still haven't got this, we clearly didn't get the memo about point 3.
  10. Standards and tools are needed to structure data into a linked format by using the potential of vocabularies and ontologies for all biodiversity facets, including: taxonomy, environmental factors, ecosystem functioning and services, and data streams like DNA (up to genomics).

    The most successful vocabulary we've come up with (Darwin Core) is essentially an agreed way to label columns in Excel spreadsheets. I've argued elsewhere that focussing on vocabularies and ontologies distracts from the real prerequisite for linking stuff together, namely reusable identifiers (see 5). No point developing labels for links if you don't have the links.
  11. Mechanisms to evaluate data quality and fitness-for-purpose are required.

    Our data is inaccurate and full of holes, and we lack decent tools for visualising and fixing this (hence my interest in putting the GBIF classification into GitHub).
  12. A next-generation infrastructure is needed to manage ever-increasing amounts of observational data.

    Not our problem, see doi:10.1038/nature11875 (by which I mean lots of people need massive storage, so it will be solved)

Food for thought. I suspect we will see the gaggle of biodiversity informatics projects seek to align themselves with some of these goals, carving up the territory. Sadly, we have yet to find a way to coalesce critical mass around tackling these challenges. It's a cliché, but I can't help thinking "what would Google do?" or, more precisely, "what would a Google of biodiversity look like?"