Wednesday, December 24, 2014

I'm worried about the Biodiversity Heritage Library (BHL)

One of my guilty pleasures on a Sunday morning is browsing new content on the Biodiversity Heritage Library (BHL). Indeed, so addicted am I to this that I have an IFTTT.com feed set to forward the BHL RSS feed to my iPhone (via the Pushover app). So, when I wake most Sunday mornings I have a red badge on Pushover announcing fresh BHL content for me to browse, and potentially add to BioStor.

But lately, there has been less and less content that is suitable for BioStor, and this reflects two trends that bother me. The first, which I've blogged about before, is that an increasing amount of BHL content is not hosted by BHL itself. Instead, BHL has links to external providers. For the reasons I've given earlier, I find this to be a jarring user experience, and it greatly reduces the utility of BHL (for example, this external content is not taxonomically searchable).

The other trend that worries me is that recently BHL content has been dominated by a single provider, namely the U.S. Department of Agriculture. To give you a sense of how dominant the USDA now is, below is a chart of the contribution of different sources to BHL over time.

Chart annotated

I built this chart by querying the BHL API and extracting data on each item in BHL (source code and raw data available on GitHub). Unfortunately the API doesn't return information on when each item was scanned, but because the identifier for each item (its ItemID) is an increasing integer, if we order the items by their integer ID then we order them by the date they were added. I've binned the data into units of 1000 (in other words, every item with an ItemID < 1000 is in bin 0, ItemIDs 1000 to 1999 are in bin 1, and so on). The chart shows the top 20 contributors to BHL, with the Smithsonian as the number one contributor.
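The binning step can be sketched in a few lines of shell. This is a minimal illustration rather than the code used to build the chart: it assumes a hypothetical tab-separated input with the ItemID in the first column and the contributor name in the second, and the function names `bin_of` and `count_bins` are invented for the example.

```shell
# Hypothetical sketch: bin BHL items by ItemID in units of 1000.
# ASSUMED input format (not from the original data): ItemID<TAB>contributor

# bin_of ITEMID: print the bin index (ItemIDs 0-999 -> 0, 1000-1999 -> 1, ...)
bin_of() {
  echo $(( $1 / 1000 ))
}

# count_bins: read ItemID<TAB>contributor lines on stdin and print
# bin<TAB>contributor<TAB>count, sorted by bin and then by contributor.
count_bins() {
  awk -F"\t" '
    { n[int($1 / 1000) "\t" $2]++ }
    END { for (k in n) print k "\t" n[k] }
  ' | sort -t "$(printf '\t')" -k1,1n -k2,2
}
```

Piping a file of items through `count_bins` gives one row per bin and contributor, which is the shape of data a stacked chart like this one needs.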

The chart shows a number of interesting patterns, but there are a couple I want to highlight. The first is the noticeable spikes representing the addition of externally hosted material (from the American Museum of Natural History Library and the Biblioteca Digital del Real Jardin Botanico de Madrid). The second is the recent dominance of content from the USDA.

Now, to be fair, I must acknowledge that I have my own bias as to what BHL content is most valuable. My own focus is on the taxonomic literature, especially original descriptions, but also taxonomic revisions (useful for automatically extracting synonyms). Discovering these in BHL is what motivated me to build BioStor, and then BioNames, the latter being a database that aims to link every animal taxon name to its original description. BioNames would be much poorer if it weren't for BioStor (and hence BHL).

If, however, your interest is agriculture in the United States, then the USDA content is obviously a potential goldmine of information on topics such as past crop use, pricing policies, etc. But this is a topic that is both taxonomically narrow (economically important organisms are a tiny fraction of biodiversity), and, by definition, geographically narrow.

To be clear, I don't have any problem with BHL having USDA content as such; it's a tremendous resource. But I worry that lately BHL has been pretty much all USDA content. There is still a huge amount of literature that has yet to be scanned. I'd like to see BHL actively going after major museums and libraries that have yet to contribute. I especially want to see more post-1923 content. BHL has managed to get post-1923 content from some of its contributors, but it needs a lot more. One obvious target is those institutions that signed the Bouchout Declaration. If you've signed up to providing "free and open use of digital resources about biodiversity", then let's see something tangible from that: open up your libraries and your publications, scan them, and make them part of BHL. I'm especially looking at European institutions, which (with some notable exceptions) really should be doing a lot better.

It's possible that the current dominance of USDA content is a temporary phenomenon. Looking at the chart above, BHL acquires content in a fairly episodic manner, suggesting that it is largely at the mercy of what its contributors can provide, and when they can do so. Maybe in a few months there will be a bunch of content that is taxonomically and geographically rich, and I will be spending weekends furiously harvesting that content for BioStor. But until then, my Sundays are not nearly as much fun as they used to be.

Thursday, December 18, 2014

Linking data from the NHM portal with content in BHL

One reason I'm excited by the launch of the NHM data portal is that it opens up opportunities to link publications about specimens in the NHM to the record of the specimens themselves. For example, consider specimen 1977.3097, which is in the new portal as http://data.nhm.ac.uk/dataset/collection-specimens/resource/05ff2255-c38a-40c9-b657-4ccb55ab2feb/record/2336568 (possibly the ugliest URL ever).

1977 3097

This specimen is of the bat Pteralopex acrodonta, shown in the image to the right (by William N. Beckon, taken from the EOL page for this species). This species was described in the following paper:
Hill JE, Beckon WN (1978) A new species of Pteralopex Thomas, 1888 (Chiroptera: Pteropodidae) from the Fiji Islands. Bulletin of the British Museum (Natural History) Zoology 34(2): 65–82. http://biostor.org/reference/8
This paper is in my BioStor project, and if you visit BioStor you'll see that BioStor has extracted a specimen code (BM(NH) 77.3097) and also has a map of localities extracted from the paper.

Map
Looking at the paper we discover that BM(NH) 77.3097 is the type specimen of Pteralopex acrodonta:
HOLOTYPE. BM(NH) 77.3097. Adult . Ridge about 300 m NE of the Des Voeux Peak Radio Telephone Antenna Tower, Taveuni Island, Fiji Islands, 16° 50½' S, 179° 58' W, c. 3840ft (1170 m). Collected 3 May 1977 by W. N. Beckon, died 6-7 May 1977. Caught in mist net on ridge summit : bulldozed land with secondary scrubby growth, adjacent to primary forest. Original number 104. Skin and skull.
Note that the NHM data portal doesn't know that 1977.3097 is the holotype, nor does it have the latitude and longitude. Hence, if we can link 1977.3097 to BM(NH) 77.3097 we can augment the information in the NHM portal.

This specimen has also been cited in a subsequent paper:
Helgen, K. M. (2005, November). Systematics of the Pacific monkey‐faced bats (Chiroptera: Pteropodidae), with a new species of Pteralopex and a new Fijian genus. Systematics and Biodiversity. Informa UK Limited. doi:10.1017/s1477200005001702
You can read this paper in BioNames. In this paper Helgen creates a new genus, Mirimiri, for Pteralopex acrodonta, and cites the holotype (as BMNH 1977.3097). Hence, if we could extract that specimen code from the text and link it to the NHM record we could have two citations for this specimen, and note that the taxon the specimen belongs to is also known as Mirimiri acrodonta.
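Matching these citations to the portal record means reducing the different ways the specimen code is written to a single form. The following is a rough sketch, not BioStor's actual method: the function name `normalise_code` is hypothetical, it handles only the two variants quoted in this post ("BM(NH) 77.3097" and "BMNH 1977.3097"), and it assumes that a two-digit year in a registration number belongs to the 1900s, which will not hold for every specimen.

```shell
# normalise_code CODE: reduce a cited NHM specimen code to the
# "YYYY.NNNN" form used by the portal record in this example (1977.3097).
# ASSUMPTION: a two-digit year such as "77" means 1977, i.e. the 1900s.
normalise_code() {
  printf '%s\n' "$1" |
    sed -E -e 's/^BM\(?NH\)?[[:space:]]*//' \
           -e 's/^([0-9]{2})\./19\1./'
}
```

With this, `normalise_code 'BM(NH) 77.3097'` and `normalise_code 'BMNH 1977.3097'` both reduce to the same string, which could then be looked up against the portal's catalogue numbers.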

Imagine being able to do this across the whole NHM data portal. The original description of this bat was published in a journal published by the NHM (and part of a volume contributed by the NHM to the Biodiversity Heritage Library). With a *cough* little work we could join up these two NHM digital resources (specimen and paper) to provide a more detailed view of what we know about this specimen. From my perspective this cross-linking between the different digital assets of an institution such as the NHM (as well as linking to external data such as other publications, GenBank sequences, etc.) is where the real value of digitisation lies. It has the potential to be much more than simply moving paper catalogues and publications online.

Wednesday, December 17, 2014

The Natural History Museum launches their data portal

The Natural History Museum has released their data portal (http://data.nhm.ac.uk/). As of now it contains 2,439,827 of the Museum's 80 million specimens, so it's still early days. I gather that soon this data will also appear in GBIF, ending the unfortunate situation where data from one of the premier natural history collections in the world was conspicuous by its absence.

I've not had a chance to explore it in much detail, but one thing I'm keen to do is see whether I can link citations of NHM specimens in the literature (e.g., articles in BioStor) with records in the NHM portal. Being able to do this would enable all sorts of cool things, such as being able to track what researchers have said about particular specimens, as well as develop citation metrics for the collection.

Nhmportal

Is DNA barcoding dead?

On a recent trip to the Natural History Museum, London, the subject of DNA barcoding came up, and I got the clear impression that people at the NHM thought classical DNA barcoding was pretty much irrelevant, given recent developments in sequencing technology. For example, why sequence just COI when you can use shotgun sequencing to get the whole mitogenome? I was a little taken aback, although this is a view that's getting some traction, e.g. [1,2]. There is also the more radical view that focussing on phylogenetics is itself less useful than, say, "evolutionary gene networks" based on massive sequencing of multiple markers [3].

At the risk of seeming old-fashioned in liking DNA barcoding, I think there's a bigger issue at stake (see also [4]). DNA barcoding isn't simply a case of using a single, short marker to identify animal species. It's the fact that it's a globalised, standardised approach that makes it so powerful. In the wonderful book "A Vast Machine" [5], Paul Edwards talks about "global data" and "making data global". The idea is that not only do we want data that is global in coverage ("global data"), but we want data that can be integrated ("making data global"). In other words, not only do we want data from everywhere in the world, say, we also need an agreed coordinate system (e.g., latitude and longitude) in order to put each data item in a global context. DNA barcoding makes data global by standardising what a barcode is (a given fragment of COI), and what metadata needs to be associated with a sequence to be a barcode (e.g., latitude and longitude) (see, e.g. Guest post: response to "Putting GenBank Data on the Map"). By insisting on this standardisation, we potentially sacrifice the kinds of cool things that can be done with metagenomics, but the tradeoff is that we can do things like put a million barcodes on a map:

Bold
To regard barcoding as dead or outdated we'd need an equivalent effort to make metagenomic sequences of animals global in the same way that DNA barcoding is. Now, it may well be that the economics of sequencing is such that it is just as cheap to shotgun sequence mitogenomes, say, as to extract single markers such as COI. If that's the case, and we can get a standardised suite of markers across all taxa, and we can do this across museum collections (like Hebert et al.'s [6] DNA barcoding "blitz" of 41,650 specimens in a butterfly collection), then I'm all for it. But it's not clear to me that this is the case.

This also leaves aside the issue of standardising other things, such as the metadata. For instance, Dowton et al. [2] state that "recent developments make a barcoding approach that utilizes a single locus outdated" (see Collins and Cruickshank [4] for a response). Dowton et al. make use of data they published earlier [7,8]. Out of curiosity I looked at some of these sequences in GenBank, such as JN964715. This is a COI sequence, in other words, a classical DNA barcode. Unfortunately, it lacks a latitude and longitude. By leaving off latitude and longitude (despite the authors having this information, as it is in the supplemental material for [7]) the authors have missed an opportunity to make their data global.
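Checking a record for this metadata is easy to automate. As a small sketch, assuming a GenBank flat-file record has already been saved locally, the function below reports whether the record carries the `/lat_lon` source qualifier. The name `has_lat_lon` is made up for this example, and the check is deliberately crude: it only looks for the qualifier and does not validate the coordinates.

```shell
# has_lat_lon: read a GenBank flat-file record on stdin and succeed
# (exit status 0) only if it contains a /lat_lon source qualifier.
has_lat_lon() {
  grep -q '/lat_lon="'
}
```

For a record saved under a hypothetical file name, `has_lat_lon < JN964715.gb || echo "no latitude and longitude"` would flag sequences that are missing their coordinates.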

For me the take home message here is that whether you think DNA barcoding is outdated depends in part on what your goal is. Clearly barcoding as a sequencing technology has been superseded by more recent developments. But to dismiss it on those grounds is to miss the bigger picture of what is at stake, namely the chance to have comparable data for millions of samples across the globe.

References

  1. Taylor, H. R., & Harris, W. E. (2012, February 22). An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. Molecular Ecology Resources. Wiley-Blackwell. doi:10.1111/j.1755-0998.2012.03119.x
  2. Dowton, M., Meiklejohn, K., Cameron, S. L., & Wallman, J. (2014, March 28). A Preliminary Framework for DNA Barcoding, Incorporating the Multispecies Coalescent. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu028
  3. Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biol Direct. Springer Science + Business Media. doi:10.1186/1745-6150-5-47
  4. Collins, R. A., & Cruickshank, R. H. (2014, August 12). Known Knowns, Known Unknowns, Unknown Unknowns and Unknown Knowns in DNA Barcoding: A Comment on Dowton et al. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu060
  5. Edwards, Paul N. A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. MIT Press ISBN: 9780262013925
  6. Hebert, P. D. N., deWaard, J. R., Zakharov, E. V., Prosser, S. W. J., Sones, J. E., McKeown, J. T. A., Mantle, B., et al. (2013, July 10). A DNA “Barcode Blitz”: Rapid Digitization and Sequencing of a Natural History Collection. (S.-O. Kolokotronis, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0068535
  7. Meiklejohn, K. A., Wallman, J. F., Pape, T., Cameron, S. L., & Dowton, M. (2013, October). Utility of COI, CAD and morphological data for resolving relationships within the genus Sarcophaga (sensu lato) (Diptera: Sarcophagidae): A preliminary study. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2013.04.034
  8. Meiklejohn, K. A., Wallman, J. F., Cameron, S. L., & Dowton, M. (2012). Comprehensive evaluation of DNA barcoding for the molecular species identification of forensically important Australian Sarcophagidae (Diptera). Invertebrate Systematics. CSIRO Publishing. doi:10.1071/is12008

Tuesday, December 09, 2014

Guest post: Top 10 species names and what they mean

The following is a guest post by Bob Mesibov.

The i4Life project has very kindly liberated Catalogue of Life (CoL) data from its database, and you can now download the latest CoL as a set of plain text, tab-separated tables here.

One of the first things I did with my download was check the 'taxa.txt' table for species name popularity*. Here they are, the top 10 species names for animals and plants, with their frequencies in the CoL list and their usual meanings:

Animals


2732 gracilis = slender
2373 elegans = elegant
2231 bicolor = two-coloured
2066 similis = similar
1995 affinis = near
1937 australis = southern
1740 minor = lesser
1718 orientalis = eastern
1708 simplex = simple
1350 unicolor = one-coloured

Plants


1871 gracilis = slender
1545 angustifolia = narrow-leaved
1475 pubescens = hairy
1336 parviflora = small-flowered
1330 elegans = elegant
1324 grandiflora = large-flowered
1277 latifolia = broad-leaved
1155 montana = (of a) mountain
1124 longifolia = long-leaved
1102 acuminata = pointed

Take the numbers cum grano salis. The first thing I did with the CoL tables was check for duplicates, and they're there, unfortunately. It's interesting, though, that gracilis tops the taxonomists' poll for both the animal and plant kingdoms.

*With the GNU/Linux commands


awk -F"\t" '($11 == "Animalia") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head
awk -F"\t" '($11 == "Plantae") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head

Tuesday, December 02, 2014

GBIF Ebbe Nielsen Challenge

The GBIF Ebbe Nielsen Challenge is open! From the official announcement:
The GBIF Secretariat has launched the inaugural GBIF Ebbe Nielsen Challenge, hoping to inspire innovative applications of open-access biodiversity data by scientists, informaticians, data modelers, cartographers and other experts.
First prize is €20,000; full details on prizes and entry requirements are on the Challenge web site. To judge the entries GBIF has assembled a panel of judges comprising people both inside and outside GBIF and its advisory committees:

Lucas Joppa
 Scientist, Computational Ecology and Environmental Sciences Group / Microsoft Research
Mary Klein
 President & CEO / NatureServe
Tanya Abrahamse
 CEO / SANBI: South African National Biodiversity Institute
Arturo H. Ariño
 Professor of Ecology / University of Navarra
Roderic Page (that's me)
 Professor of Taxonomy / University of Glasgow

This is the first time we've run the challenge, so the topic is wide open. Below I've put together some ideas that are simply designed to get you thinking (and are in no way intended to limit the sort of things that could be entered).

Evolutionary trees
Increasingly DNA sequences from DNA barcoding and metabarcoding are being used to study biodiversity. How can we integrate that data into GBIF? Can we decorate GBIF maps with evolutionary trees?

Change over time
Global Forest Watch is an impressive example of how change in the biosphere can be monitored over time. Can we do something similar with GBIF data? Alternatively, if the level of temporal or spatial resolution in GBIF data isn't high enough, can we combine these sources in some way?

Dashboard
GBIF has started to provide graphical summaries of its data, and there is lots to be done in this area. Can we have a Google Analytics-style summary of GBIF data?


This merely scratches the surface of what could be done, and indeed one of the reasons for having the challenge is to start a conversation about what can be done with half a billion data records.