Tuesday, April 16, 2013

A decadal view of biodiversity informatics: challenges and priorities

BMC Ecology has published Alex Hardisty and Dave Roberts' white paper on biodiversity informatics:

Hardisty, A., & Roberts, D. (2013). A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology, 13(1), 16. doi:10.1186/1472-6785-13-16

Here are their 12 recommendations (with some comments of my own):

  1. Open Data should be normal practice and should embody the principles of being accessible, assessable, intelligible and usable.

    Seems obvious, but data providers are often reluctant to open "their" data up for reuse.
  2. Data encoding should allow analysis across multiple scales, e.g. from nanometers to planet-wide and from fractions of a second to millions of years, and such encoding schemes need to be developed. Individual data sets will have application over a small fraction of these scales, but the encoding schema needs to facilitate the integration of various data sets in a single analytical structure.

    No, I don't know what this means either, but I'm guessing it's relevant if we want to attempt this: doi:10.1038/493295a
  3. Infrastructure projects should devote significant resources to market the service they develop, specifically to attract users from outside the project-funded community, and ideally in significant numbers. To make such an investment effective, projects should release their service early and update often, in response to user feedback.

    Put simply, make something that is both useful and easy to use. Simples.
  4. Build a complete list of currently used taxon names with a statement of their interrelationships (e.g. this is a spelling variation; this is a synonym; etc.). This is a much simpler challenge than building a list of valid names, and an essential pre-requisite.

    One of the simplest tasks, first tackled successfully by uBio, now moribund. The Global Names project seems stalled, intent on drowning in acronym soup (GNA, GNI, GNUB, GNITE).
  5. Attach a Persistent Identifier (PID) to every resource so that they can be linked to one another. Part of the PID should be a common syntactic structure, such as ‘DOI: ...’ so that any instance can be simply found in a free-text search.

    DOIs have won the identifier wars, and everything citable (publications, figures, datasets) is acquiring one. The mistake to avoid is forgetting that identifiers need services built on top of them (see http://labs.crossref.org/ for some DOI-related tools). The core service we need is reverse lookup: given this thing (publication, specimen, etc.), what is its identifier?
  6. Implement a system of author identifiers so that the individual contributing a resource can be identified. This, in combination with the PID (above), will allow the computation of the impact of any contribution and the provenance of any resource.

    This is a solved problem, assuming ORCID continues to gain momentum. For past authors VIAF has identifiers (which are being incorporated into Wikipedia).
  7. Make use of trusted third-party authentication measures so that users can easily work with multiple resources without having to log into each one separately.

    Again, a solved problem. People routinely use third parties such as Google and Facebook for this purpose.
  8. Build a repository for classifications (classification bank) that will allow, in combination with the list of taxonomic names, automatic construction of taxonomies to close gaps in coverage.

    Let's not; let's focus on the only two classifications that actually matter because they are linked to data, namely GBIF and NCBI. If we want one classification to coalesce around, make it GBIF (NCBI will grow anyway).
  9. Develop a single portal for currently accepted names - one of the priority requirements for most users.

    Yup, we still haven't got this; we clearly didn't get the memo about point 3.
  10. Standards and tools are needed to structure data into a linked format by using the potential of vocabularies and ontologies for all biodiversity facets, including: taxonomy, environmental factors, ecosystem functioning and services, and data streams like DNA (up to genomics).

    The most successful vocabulary we've come up with (Darwin Core) is essentially an agreed way to label columns in Excel spreadsheets. I've argued elsewhere that focussing on vocabularies and ontologies distracts from the real prerequisite for linking stuff together, namely reusable identifiers (see 5). No point developing labels for links if you don't have the links.
  11. Mechanisms to evaluate data quality and fitness-for-purpose are required.

    Our data is inaccurate and full of holes, and we lack decent tools for visualising and fixing this (hence my interest in putting the GBIF classification into GitHub).
  12. A next-generation infrastructure is needed to manage ever-increasing amounts of observational data.

    Not our problem; see doi:10.1038/nature11875 (by which I mean lots of people need massive storage, so it will be solved anyway).
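To make point 4 concrete: a "list of names with a statement of their interrelationships" is just name strings plus typed links between them, no judgement about validity required. A minimal sketch (the names and class are invented for illustration, not from any real names index):

```python
# Sketch of point 4: taxon name strings with typed relationships
# between them (spelling variant, synonym, ...), NOT a list of
# "valid" names. All names below are made up.

from collections import defaultdict

SPELLING_VARIANT = "spelling_variant"
SYNONYM = "synonym"

class NameIndex:
    """Store name strings and typed relationships between them."""

    def __init__(self):
        # name -> list of (relationship, other name)
        self.relations = defaultdict(list)

    def add_relation(self, name, relation, other):
        # treat the statement as symmetric: both names know about the link
        self.relations[name].append((relation, other))
        self.relations[other].append((relation, name))

    def related(self, name):
        return self.relations[name]

index = NameIndex()
index.add_relation("Aus bus", SPELLING_VARIANT, "Aus buus")
index.add_relation("Aus bus", SYNONYM, "Xus bus")

print(index.related("Aus bus"))
# [('spelling_variant', 'Aus buus'), ('synonym', 'Xus bus')]
```

The point of the simplification is visible in the code: nothing here decides which name is "correct", it only records what has been asserted about pairs of strings.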
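The reverse lookup service I asked for in point 5 already half-exists for DOIs: the CrossRef REST API will take a free-text citation and return candidate DOIs. A sketch (the query URL is real CrossRef syntax; the JSON "response" below is a hand-written stand-in, not fetched live):

```python
# Sketch of DOI reverse lookup via the CrossRef REST API
# (api.crossref.org): free-text citation in, DOI out.
# The sample response is abbreviated by hand, not a live result.

import json
from urllib.parse import urlencode

def crossref_query_url(citation, rows=1):
    """Build a CrossRef works query for a free-text citation string."""
    params = urlencode({"query.bibliographic": citation, "rows": rows})
    return "https://api.crossref.org/works?" + params

def first_doi(response_text):
    """Pull the top-ranked DOI out of a CrossRef works response."""
    items = json.loads(response_text)["message"]["items"]
    return items[0]["DOI"] if items else None

url = crossref_query_url(
    "Hardisty & Roberts 2013 A decadal view of biodiversity informatics")
print(url)

# Hand-written stand-in for what the API would return:
sample = json.dumps(
    {"message": {"items": [{"DOI": "10.1186/1472-6785-13-16"}]}})
print(first_doi(sample))  # 10.1186/1472-6785-13-16
```

For publications this is workable today; the gap is the equivalent service for specimens and other biodiversity objects.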
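The "agreed way to label columns" jibe in point 10 is easy to illustrate: a Darwin Core occurrence record is essentially a spreadsheet row whose headers are terms from the Darwin Core vocabulary. The term names below are real Darwin Core terms; the record values are invented for illustration:

```python
# Darwin Core in practice: vocabulary terms as CSV column labels.
# Term names are genuine Darwin Core terms; the row is made up.

import csv
import io

fieldnames = ["occurrenceID", "scientificName", "decimalLatitude",
              "decimalLongitude", "eventDate", "basisOfRecord"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({
    "occurrenceID": "urn:example:occ:1",   # hypothetical identifier
    "scientificName": "Puma concolor",
    "decimalLatitude": "-3.4",
    "decimalLongitude": "-62.2",
    "eventDate": "2013-04-16",
    "basisOfRecord": "HumanObservation",
})
print(buffer.getvalue())
```

Which is exactly my point: the vocabulary tells you what the columns mean, but the `occurrenceID` value only becomes a link if it is a reusable identifier that something else actually resolves.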

Food for thought. I suspect the gaggle of biodiversity informatics projects will seek to align themselves with some of these goals, carving up the territory. Sadly, we have yet to find a way to coalesce critical mass around tackling these challenges. It's a cliché, but I can't help thinking "what would Google do?" or, more precisely, "what would a Google of biodiversity look like?"