iPhylo: April 2012

Roderic D. M. Page

Tuesday, April 24, 2012

Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)

Dark taxa have become even darker. NCBI has pulled the plug on large numbers of DNA barcode sequences that lack scientific names. For example, taxon Cyclopoida sp. BOLD:AAG9771 (tax_id 818059) now has a sparse page that has no associated sequences. From an earlier download of EMBL I know that this taxon is associated with at least 5 sequences, such as GU679674. But if you go to that sequence you get this:

So the sequence is hidden by default, although you can still retrieve it by clicking on the obsolete version link.

It's an extraordinary state of affairs that a huge slice of fundamental biodiversity data has been effectively "pulled" from view.

Update: Sujeevan Ratnasingham from iBOL has pointed out that the sequence I'd used above (GU679674) was not one of the ones hidden by NCBI; rather, it was suppressed at the request of the investigator (which I'd have realised if I'd paid more attention to the screenshot). HQ918317 is an example of a BOLD record that was suppressed:

Friday, April 20, 2012

Quick thoughts on specimen identifiers

Based on recent discussions my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: cheap, resolvable, and persistent. We get to pick two.

Cheap and resolvable means URLs, which everybody is nervous about because they break. They don't have to break, but for a bunch of reasons they do.

Cheap and persistent means things like Darwin Core triplets or URNs. You can write things on paper and they will persist (the Biodiversity Heritage Library shows us that), but how in the digital era do we do anything with this? If it's not resolvable, what, exactly, is the point? We tried URNs, even ones that were resolvable (LSIDs), and that was a disaster (we learnt a lot, but what a mess).

Resolvable and persistent. This is where technologies such as DOIs reside. If every specimen had a DOI would we still be having this discussion? We'd have a resolvable identifier that is resistant to change (including loss of museum domain names, specimens moving to new institutions, etc.), and one that is already in use by CrossRef and DataCite, and will also play ball with linked data folks.

In practical terms, what if we had a convention that each collection gets its own DOI prefix "10.nnnn", after which it appends whatever specimen identifier makes sense (and is unique within that collection)?

The bulk of specimen identifiers in the wild are of the form "Institution" "Catalogue number", e.g. ANSP 332467 (from the example I discussed in BHL and GBIF as biomedical databases).

If we wrote this as a DOI of the form <doi prefix>/Collection/InstitutionCatalogue number then we'd have identifiers that (in part) matched what most people would expect to see. In the example above we would have something like:

10.nnnnn/MAL/ANSP332467

where "MAL" is the acronym for the Malacology collection. This is pretty close to "ANSP 332467", is human friendly, but would also be resolvable. It also carries limited branding, so if the specimen was moved from its current collection to a new institution, people wouldn't get too upset by the presence of "ANSP". It would also help make the links between specimen codes and DOIs. We couldn't rely on 10.nnnnn/MAL/ANSP332467 being a specimen in the Academy of Natural Sciences's malacological collection, but it would be a good place to start looking.
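As a rough sketch of the convention described above, here is how such an identifier might be assembled from a specimen code. The function name `specimen_doi` is made up for illustration, "10.nnnnn" is just the placeholder prefix used in this post (not a registered prefix), and stripping the whitespace from "ANSP 332467" is my assumption about how the suffix would be normalised.

```python
import re


def specimen_doi(prefix: str, collection: str, specimen_code: str) -> str:
    """Build <doi prefix>/<collection>/<institution + catalogue number>.

    Hypothetical helper: it only illustrates the naming convention and
    does not register or validate anything.
    """
    # Remove whitespace so "ANSP 332467" becomes "ANSP332467".
    suffix = re.sub(r"\s+", "", specimen_code)
    return f"{prefix}/{collection}/{suffix}"


# "10.nnnnn" is the placeholder prefix from the text.
print(specimen_doi("10.nnnnn", "MAL", "ANSP 332467"))
# prints 10.nnnnn/MAL/ANSP332467
```

Nothing here guarantees uniqueness; that would still depend on the catalogue number being unique within the collection.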

As I've argued before, we could centralise the minting of these identifiers using GBIF, but do it in such a way that host institutions could assume responsibility for it if and when they are able (i.e., initially GBIF is responsible for managing the DOI prefixes for each institution, with the option for institutions to do this). The beauty of identifiers like DOIs is that from the user's perspective the identifier is unchanged.
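To make the point about indirection concrete, here is a toy sketch of the hand-over. It is an in-memory dictionary standing in for a resolver, not how the DOI system is actually implemented; the `registry` structure, the function names, and the example.org URLs are all invented for illustration.

```python
# Toy stand-in for a DOI resolver. Everything here is hypothetical:
# "10.nnnnn" is the placeholder prefix from the text and the
# example.org URLs are made up.
registry = {
    "10.nnnnn/MAL/ANSP332467": {
        "managed_by": "GBIF",
        "url": "https://gbif.example.org/specimen/ANSP332467",
    }
}


def resolve(doi: str) -> str:
    """Return the current landing page for a DOI."""
    return registry[doi]["url"]


def hand_over(doi: str, new_manager: str, new_url: str) -> None:
    """Transfer responsibility for a DOI; the DOI string itself never changes."""
    registry[doi] = {"managed_by": new_manager, "url": new_url}


doi = "10.nnnnn/MAL/ANSP332467"
before = resolve(doi)

# The institution takes over from GBIF; users keep citing the same DOI.
hand_over(doi, "ANSP", "https://ansp.example.org/malacology/332467")
after = resolve(doi)

print(doi, before, after, sep="\n")
```

The only thing that changes is where the identifier points and who maintains that mapping, which is the property that makes the GBIF-first arrangement workable.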

I'm hoping we'll make some progress on this in the coming months...

Thursday, April 05, 2012

EOL Computable Data Challenge community

Now we are awash in challenges! EOL has announced its Computable Data Challenge:

We invite ideas for scientific research projects that use EOL, including the Biodiversity Heritage Library (BHL), to answer questions in biology. The specific field of biological interest for the challenge is open; projects in ecology, evolution, behavior, conservation biology, developmental biology, or systematics may be most appropriate. Projects advancing informatics alone may be less competitive. EOL may be used as a source of biological information, to establish a sampling strategy, to assist the retrieval of computable data by mapping identifiers across sources (e.g. to accomplish name resolution), and/or in other innovative ways. Projects involving data or text or image mining of EOL or BHL content are encouraged. Current EOL data and API shall be used; suggestions for modification of content or the API could be a deliverable of the project. We encourage the use of data not yet in EOL for analyses. In all cases projects must honor terms of use and licensing as appropriate.

Some $US 50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more proposals (submitted by 22 May) that are accepted. So, it's essentially a grant competition (with a pleasingly minimal amount of administrivia). There is also a Computable Data Challenge community to discuss the challenge.

It's great to see EOL trying different strategies to engage with developers. Of the different challenges EOL is running, this one is perhaps the most appealing to me, because one of my biggest complaints about EOL is that it's hard to envisage "doing science" with it. For example, we can download GenBank and cluster sequences into gene families, or grab data from GBIF and model species distributions, but what could we do with EOL? This challenge will be a chance to explore the extent to which EOL can support science, which I would argue will be a key part of its long-term future.