Tuesday, March 27, 2012

BHL and GBIF as biomedical databases

When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or nice artwork, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research. Again, great stuff, but aren't museums simply full of dead stuff that people have collected and forgotten about?

But BHL has a lot more post-1923 content than I suspect most people realise (several museum or society journals have 21st century issues in BHL's archives, for example). Continuing the theme of linking BHL and GBIF content, as part of a forthcoming project on taxonomic names (to be made available "real soon now") I stumbled across this 1976 paper in BHL (now in BioStor):

Monograph on "Lithoglyphopsis" aperta, the snail host of Mekong River Schistosomiasis by Davis et al.

This paper has been indexed in PubMed (PMID:948206), but as far as I'm aware, BHL (and BioStor) has the only digital copy of this paper. (As a side note, wouldn't it be great if PubMed could link to BHL content?)

The article page in BioStor shows a map, derived from the OCR text, with two localities.


Below the map are the specimen codes I've automatically extracted from the OCR text, linked to the corresponding records in GBIF, which are georeferenced (e.g., ANSP Malacology 330925).
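As a rough illustration of what pulling specimen codes out of OCR text can look like, here is a minimal sketch. The pattern and the function name are my own assumptions for this example, not the code BioStor actually runs: it looks for an uppercase institution acronym, an optional collection word, and a catalogue number. The sample sentence is made up around the code mentioned above.

```python
import re

# Hypothetical pattern (an assumption, not BioStor's real one):
# acronym of 2-6 capitals, optional collection word, catalogue number.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})\s+(?:([A-Z][a-z]+)\s+)?(\d{3,})\b")


def extract_specimen_codes(ocr_text):
    """Return unique (acronym, collection, number) tuples in order of appearance."""
    found = []
    for match in SPECIMEN_CODE.finditer(ocr_text):
        code = (match.group(1), match.group(2) or "", match.group(3))
        if code not in found:
            found.append(code)
    return found


# Invented sample text for illustration only.
sample = "Holotype deposited as ANSP Malacology 330925 (see also ANSP 330925)."
print(extract_specimen_codes(sample))
```

A pattern this loose will also pick up things that are not specimens, so in practice each candidate would still need to be checked against GBIF before being linked.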

If we joined these things up just a little more, we could do some useful things. For example, what if a researcher searching in PubMed for schistosomiasis in South East Asia could find the Davis et al. paper, and then go to BHL or BioStor to read it? What if a researcher looking at gastropod distributions in the Mekong River in the GBIF portal could see that BHL had publications on diseases associated with these organisms (as well as their taxonomy and biology)? We could also traverse the link from GBIF to BHL to PubMed and provide a direct route from distribution maps to biomedical literature.

It seems there's scope for trying to connect BHL, GBIF, and PubMed, and that BHL and GBIF may have important roles to play in providing access to basic information about organisms that have a serious impact on human populations.

Wednesday, March 21, 2012

iEvoBio 2012 Challenge: Synthesizing phylogenies

The iEvoBio 2012 Challenge has been announced, and the topic is synthesizing phylogenies. The task:

Somewhere, buried in large sets of trees, lies a stunning new revelation, a baffling discovery, the answer to a longstanding controversy, or simply something not obvious to the naked eye. The mission of the 2012 iEvoBio challenge is to find those revelations, discoveries and answers within your own data and/or within one of the datasets provided by the challenge. What new scientifically interesting results can you pull from these trees, using any combination of techniques at your disposal?

The rules of this challenge are:
  1. The set of trees you use must have at least 10,000 leaves in total. Acceptable entries could be a set comprising 2,500 distinct trees covering the same four taxa, a single tree with 10,000+ leaves, or anything in between.
  2. Your results must be scientifically new.
  3. The data, or at least a description of the data, must be publicly available. If working with your own dataset, you must at least provide a summary of the data you used (see below for the minimum description that must be provided).
  4. The source code of any tool and/or method developed as part of your challenge submission must be publicly downloadable under an OSI-approved open-source license (or dedicated to the public domain) at the latest by the time of the conference.
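
Rule 1 boils down to a sum over trees. Here is a rough sketch of that check, assuming the trees are Newick strings with no commas hidden inside quoted labels or comments (the challenge doesn't prescribe a format, and the function names are just for illustration). Under that assumption a tree has one more leaf than it has commas.

```python
MIN_LEAVES = 10_000  # threshold stated in rule 1


def leaf_count(newick):
    """Leaves in one Newick tree: one more than the number of commas.

    Assumes no commas inside quoted labels or bracketed comments.
    """
    return newick.count(",") + 1


def total_leaves(trees):
    """Sum of leaf counts over a collection of Newick strings."""
    return sum(leaf_count(tree) for tree in trees)


def meets_rule_one(trees):
    """True if the tree set is large enough for the challenge."""
    return total_leaves(trees) >= MIN_LEAVES


# 2,500 four-taxon trees, the lower extreme given in rule 1.
four_taxon_trees = ["((A,B),(C,D));"] * 2500
print(total_leaves(four_taxon_trees), meets_rule_one(four_taxon_trees))
# prints: 10000 True
```

This only checks the size requirement; it says nothing about whether the trees are distinct or whether the results are new.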

For more details see the challenge site. Deadline for submission is June 25, 2012.

Thursday, March 01, 2012

Yet more reasons to have specimen identifiers: annotating GenBank sequences

One reason I'm pursuing the theme of specimen identifiers (and identifiers in general) is the central role they play in annotating databases. To give a concrete example, I (among others) have argued for a wiki-style annotation layer on top of GenBank to capture things such as sequencing errors, updated species names, etc. Annotation is a lot easier if we have consistent identifiers for the things being annotated. For example, every GenBank sequence has a unique accession number, so if you and I are discussing sequence DQ055738, we can both be sure we are talking about the same thing.

Sequence DQ055738 is interesting because Hua et al. A Revised Phylogeny of Holarctic Treefrogs (Genus Hyla) Based on Nuclear and Mitochondrial DNA Sequences (http://dx.doi.org/10.1655/08-058R1.1 - note the nice identifier we have for this article) have suggested this sequence (published in http://dx.doi.org/10.1554/05-284.1, another nice identifier) is misidentified. Given these identifiers we could construct various statements, such as:

DQ055738 -> published in -> doi:10.1554/05-284.1
DQ055738 -> annotated by -> doi:10.1655/08-058R1.1
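
A minimal sketch of how statements like these could be held in code, assuming nothing fancier than subject-predicate-object tuples (the `Statement` type and `about` function are names I've made up for illustration, not part of any existing annotation system):

```python
from collections import namedtuple

# Hypothetical container for one subject -> predicate -> object statement.
Statement = namedtuple("Statement", ["subject", "predicate", "obj"])

statements = [
    Statement("DQ055738", "published in", "doi:10.1554/05-284.1"),
    Statement("DQ055738", "annotated by", "doi:10.1655/08-058R1.1"),
]


def about(subject, store):
    """Return every statement in the store that has the given subject."""
    return [s for s in store if s.subject == subject]


for s in about("DQ055738", statements):
    print(s.subject, "->", s.predicate, "->", s.obj)
```

Because both ends of each statement are shared identifiers, anyone else's statements about DQ055738 could simply be appended to the same list.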

(I've omitted the http:// stuff to keep things legible.) Hua et al. state the following:

However, the tissue number of this specimen (LSUMZ H-19067) is similar to that of a specimen of H. versicolor (LSUMZ H-19077), which appears to have been processed at the same time (C. Austin, personal communication). Therefore, we hypothesize that the sequence data for H. gratiosa used by Smith et al. (2005) were actually from H. versicolor.

It would be nice if we had unique, resolvable identifiers for LSUMZ H-19067 and LSUMZ H-19077 so that we could construct statements linking the sequence, the publications, and the specimens. But we don't. Nor is it obvious how to find out anything more about LSUMZ H-19067 and LSUMZ H-19077. By contrast, for the DOI or the sequence accession I know how to get more information, in either human- or machine-readable form.

The acronym LSUMZ in this case is the Louisiana State University Museum of Natural Science Herpetology collection (http://biocol.org/urn:lsid:biocol.org:col:34806). Just to confuse matters, LSUMZ specimens in GBIF use LSU as the acronym for Louisiana State University Museum of Natural Science. Given that GBIF's data comes from LSU itself, it's odd (but not surprising) that there's a muddle about which acronym to use (it would be nice to clear this up, but then anybody building identifiers based on those acronyms is in for some heartbreak).

If I look at GBIF LSUMZ records there aren't specimens with the catalogue numbers H-19067 or H-19077. However, after a bit of poking around, and a helpful file from GBIF's Tim Robertson, I discovered that the LSUMZ herpetology tissue numbers (which is what the H-* codes actually are) are stored in GBIF, and that the corresponding specimens are http://data.gbif.org/occurrences/45716232 (LSU Herp 84850, LSUMZ HerpNet Tissue 19067) and http://data.gbif.org/occurrences/45710033 (LSU Herp 84862, LSUMZ HerpNet Tissue 19077). (Note that Hua et al. tell the reader that LSU 84850 = LSUMZ H-19067, but don't give the specimen code for LSUMZ H-19077.)
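
The detective work above amounts to a lookup from tissue number to GBIF occurrence. Here is a small sketch of that mapping, using only the two pairs given in this post; the dictionary layout, the function name, and the decision to accept either acronym are my own assumptions for illustration.

```python
import re

# Tissue number -> (catalogue number, GBIF occurrence URL).
# Values are the two pairs quoted in the post; the layout is illustrative.
TISSUE_TO_OCCURRENCE = {
    "19067": ("LSU Herp 84850", "http://data.gbif.org/occurrences/45716232"),
    "19077": ("LSU Herp 84862", "http://data.gbif.org/occurrences/45710033"),
}

# Accept either acronym, since the literature and GBIF disagree on it.
TISSUE_CODE = re.compile(r"\s*(?:LSUMZ|LSU)\s+H-?(\d+)\s*")


def resolve_tissue_code(code):
    """Map a code such as 'LSUMZ H-19067' to (catalogue number, URL), or None."""
    match = TISSUE_CODE.fullmatch(code)
    if match is None:
        return None
    return TISSUE_TO_OCCURRENCE.get(match.group(1))


print(resolve_tissue_code("LSUMZ H-19067"))
print(resolve_tissue_code("LSUMZ H-19077"))
```

A real resolver would query GBIF rather than a hard-coded table, but even this toy version shows why the tissue number, not the acronym, has to be the thing you match on.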

Now I have some resolvable identifiers, so I could construct statements like:

DQ055738 -> voucher -> occurrences/45716232
DQ055738 -> voucher -> occurrences/45710033
+-> according to -> doi:10.1655/08-058R1.1
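
To make the aggregation idea concrete, here is a sketch that adds a source to each statement and then collects everything said about one specimen URL. The tuple layout and function name are assumptions for this example, and the third statement is invented to stand in for "someone else" using the same GBIF URL.

```python
GBIF = "http://data.gbif.org/occurrences/"

# (subject, predicate, object, source) tuples. The first two follow the
# statements above (no source is given for the first). The third is an
# invented example of an unrelated database citing the same URL.
statements = [
    ("DQ055738", "voucher", GBIF + "45716232", None),
    ("DQ055738", "voucher", GBIF + "45710033", "doi:10.1655/08-058R1.1"),
    ("example:photo-1", "depicts", GBIF + "45710033", "example:another-database"),
]


def statements_about(specimen_url, store):
    """Return every statement whose object is the given specimen URL."""
    return [s for s in store if s[2] == specimen_url]


for subject, predicate, obj, source in statements_about(GBIF + "45710033", statements):
    print(subject, "->", predicate, "->", obj, "| according to:", source)
```

The join only works because both sources spell the specimen the same way, which is the whole argument for a shared, resolvable identifier.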

Let's skip over whether this is actually the best way to record the annotation; the point is that we can now start to construct statements that can be linked to the wider world. If someone else has made statements about these specimens, and they used the GBIF URL, then we could aggregate those and learn more about these specimens and their associated sequences. Without globally unique, stable, resolvable identifiers we are left to flounder around in the bowels of various databases searching for something that may or may not be the object being discussed. Isn't it time we did something about this?