One of my guilty pleasures on a Sunday morning is browsing new content on the Biodiversity Heritage Library (BHL). Indeed, so addicted am I to this that I have an IFTTT.com feed set to forward the BHL RSS feed to my iPhone (via the Pushover app). So, when I wake most Sunday mornings I have a red badge on Pushover announcing fresh BHL content for me to browse, and potentially add to BioStor.
But lately, there has been less and less content that is suitable for BioStor, and this reflects two trends that bother me. The first, which I've blogged about before, is that an increasing amount of BHL content is not hosted by BHL itself. Instead, BHL has links to external providers. For the reasons I've given earlier, I find this to be a jarring user experience, and it greatly reduces the utility of BHL (for example, this external content is not taxonomically searchable).
The other trend that worries me is that recently BHL content has been dominated by a single provider, namely the U.S. Department of Agriculture. To give you a sense of how dominant the USDA now is, below is a chart of the contribution of different sources to BHL over time.
I built this chart by querying the BHL API and extracting data on each item in BHL (source code and raw data available on GitHub). Unfortunately the API doesn't return information on when each item was scanned, but because the identifier for each item (its ItemID) is an increasing integer, if we order the items by their integer ID then we order them by the date they were added. I've binned the data into units of 1000 (in other words, every item with an ItemID < 1000 is in bin 0, ItemIDs 1000 to 1999 are in bin 1, and so on). The chart shows the top 20 contributors to BHL, with the Smithsonian as the number one contributor.
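The binning step can be sketched roughly as follows. This is only an illustration, not the script used to build the chart: the `bin_items` helper and the `(item_id, contributor)` record shape are hypothetical, the sample records are made up, and only the bin width of 1000 comes from the description above.

```python
from collections import Counter, defaultdict

BIN_SIZE = 1000  # bin width described in the post


def bin_items(items, bin_size=BIN_SIZE):
    """Count items per contributor within each ItemID bin.

    `items` is an iterable of (item_id, contributor) pairs. That shape is
    an assumption for illustration, not the BHL API's response format.
    """
    bins = defaultdict(Counter)
    for item_id, contributor in items:
        # Integer division puts ItemIDs 0-999 in bin 0, 1000-1999 in bin 1, ...
        bins[item_id // bin_size][contributor] += 1
    return bins


# Made-up records, purely to show the shape of the result
sample = [
    (12, "Smithsonian"),
    (999, "Smithsonian"),
    (1000, "USDA"),
    (1999, "USDA"),
    (2000, "Smithsonian"),
]
counts = bin_items(sample)
```

Because ItemIDs increase over time, reading the bins in ascending order gives an approximate timeline of who contributed what.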
The chart shows a number of interesting patterns, but there are a couple I want to highlight. The first is the noticeable spikes representing the addition of externally hosted material (from the American Museum of Natural History Library and the Biblioteca Digital del Real Jardin Botanico de Madrid). The second is the recent dominance of content from the USDA.
Now, to be fair, I must acknowledge that I have my own bias as to what BHL content is most valuable. My own focus is on the taxonomic literature, especially original descriptions, but also taxonomic revisions (useful for automatically extracting synonyms). Discovering these in BHL is what motivated me to build BioStor, and then BioNames, the latter being a database that aims to link every animal taxon name to its original description. BioNames would be much poorer if it wasn't for BioStor (and hence BHL).
If, however, your interest is agriculture in the United States, then the USDA content is obviously a potential goldmine of information on topics such as past crop use, pricing policies, etc. But this is a topic that is both taxonomically narrow (economically important organisms are a tiny fraction of biodiversity), and, by definition, geographically narrow.
To be clear, I don't have any problem with BHL having USDA content as such; it's a tremendous resource. But I worry that lately BHL has been pretty much all USDA content. There is still a huge amount of literature that has yet to be scanned. I'd like to see BHL actively going after major museums and libraries that have yet to contribute. I especially want to see more post-1923 content. BHL has managed to get post-1923 content from some of its contributors, but it needs a lot more. One obvious target is those institutions that signed the Bouchout Declaration. If you've signed up to providing "free and open use of digital resources about biodiversity", then let's see something tangible from that - open up your libraries and your publications, scan them, and make them part of BHL. I'm especially looking at European institutions who (with some notable exceptions) really should be doing a lot better.
It's possible that the current dominance of USDA content is a temporary phenomenon. Looking at the chart above, BHL acquires content in a fairly episodic manner, suggesting that it is largely at the mercy of what its contributors can provide, and when they can do so. Maybe in a few months there will be a bunch of content that is taxonomically and geographically rich, and I will be spending weekends furiously harvesting that content for BioStor. But until then, my Sundays are not nearly as much fun as they used to be.
Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.
ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.
Wednesday, December 24, 2014
Thursday, December 18, 2014
Linking data from the NHM portal with content in BHL
One reason I'm excited by the launch of the NHM data portal is that it opens up opportunities to link publications about specimens in the NHM to the record of the specimens themselves. For example, consider specimen 1977.3097, which is in the new portal as http://data.nhm.ac.uk/dataset/collection-specimens/resource/05ff2255-c38a-40c9-b657-4ccb55ab2feb/record/2336568 (possibly the ugliest URL ever).
This specimen is of the bat Pteralopex acrodonta, shown in the image to the right (by William N. Beckon, taken from the EOL page for this species). This species was described in the following paper:
Hill JE, Beckon WN (1978) A new species of Pteralopex Thomas, 1888 (Chiroptera: Pteropodidae) from the Fiji Islands. Bulletin of the British Museum (Natural History) Zoology 34(2): 65–82. http://biostor.org/reference/8
This paper is in my BioStor project, and if you visit BioStor you'll see that BioStor has extracted a specimen code (BM(NH) 77.3097) and also has a map of localities extracted from the paper.
Looking at the paper we discover that BM(NH) 77.3097 is the type specimen of Pteralopex acrodonta:
HOLOTYPE. BM(NH) 77.3097. Adult . Ridge about 300 m NE of the Des Voeux Peak Radio Telephone Antenna Tower, Taveuni Island, Fiji Islands, 16° 50½' S, 179° 58' W, c. 3840ft (1170 m). Collected 3 May 1977 by W. N. Beckon, died 6-7 May 1977. Caught in mist net on ridge summit : bulldozed land with secondary scrubby growth, adjacent to primary forest. Original number 104. Skin and skull.
Note that the NHM data portal doesn't know that 1977.3097 is the holotype, nor does it have the latitude and longitude. Hence, if we can link 1977.3097 to BM(NH) 77.3097 we can augment the information in the NHM portal.
This specimen has also been cited in a subsequent paper:
Helgen, K. M. (2005, November). Systematics of the Pacific monkey‐faced bats (Chiroptera: Pteropodidae), with a new species of Pteralopex and a new Fijian genus. Systematics and Biodiversity. Informa UK Limited. doi:10.1017/s1477200005001702
You can read this paper in BioNames. In this paper Helgen creates a new genus, Mirimiri, for Pteralopex acrodonta, and cites the holotype (as BMNH 1977.3097). Hence, if we could extract that specimen code from the text and link it to the NHM record we could have two citations for this specimen, and note that the taxon the specimen belongs to is also known as Mirimiri acrodonta.
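As a rough sketch of what that linking might involve, the snippet below pulls specimen codes out of text and rewrites them in the portal's form (1977.3097). It is not how BioStor does it: the regular expression is based only on the two forms quoted in this post (BM(NH) 77.3097 and BMNH 1977.3097), the function names are hypothetical, and expanding a two-digit year by prefixing "19" is an assumption that will be wrong for some registration numbers.

```python
import re

# Matches forms such as "BM(NH) 77.3097" and "BMNH 1977.3097".
# Assumption: the pattern covers only the two variants seen in this post.
SPECIMEN_CODE = re.compile(r"\bBM\s*\(?NH\)?\s*(\d{2}|\d{4})\.(\d+)")


def normalise(year, number, century="19"):
    """Return a code in the portal's 'YYYY.NNNN' form.

    Assumption: a two-digit year belongs to the given century.
    """
    if len(year) == 2:
        year = century + year
    return f"{year}.{number}"


def extract_codes(text):
    """Find specimen codes in free text and normalise each one."""
    return [normalise(year, number) for year, number in SPECIMEN_CODE.findall(text)]


original = "HOLOTYPE. BM(NH) 77.3097. Adult."
later = "cites the holotype (as BMNH 1977.3097)"

codes_original = extract_codes(original)  # ['1977.3097']
codes_later = extract_codes(later)        # ['1977.3097']
```

Both citations reduce to the same string, which is the key we would need to join the papers to the portal record.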
Imagine being able to do this across the whole NHM data portal. The original description of this bat was published in a journal published by the NHM (and part of a volume contributed by the NHM to the Biodiversity Heritage Library). With a *cough* little work we could join up these two NHM digital resources (specimen and paper) to provide a more detailed view of what we know about this specimen. From my perspective this cross-linking between the different digital assets of an institution such as the NHM (as well as linking to external data such as other publications, GenBank sequences, etc.) is where the real value of digitisation lies. It has the potential to be much more than simply moving paper catalogues and publications online.
Wednesday, December 17, 2014
The Natural History Museum launches their data portal
The Natural History Museum has released their data portal (http://data.nhm.ac.uk/). As of now it contains 2,439,827 of the Museum's 80 million specimens, so it's still early days. I gather that soon this data will also appear in GBIF, ending the unfortunate situation where data from one of the premier natural history collections in the world was conspicuous by its absence.
I've not had a chance to explore it in much detail, but one thing I'm keen to do is see whether I can link citations of NHM specimens in the literature (e.g., articles in BioStor) with records in the NHM portal. Being able to do this would enable all sorts of cool things, such as being able to track what researchers have said about particular specimens, as well as develop citation metrics for the collection.
Is DNA barcoding dead?
On a recent trip to the Natural History Museum, London, the subject of DNA barcoding came up, and I got the clear impression that people at the NHM thought classical DNA barcoding was pretty much irrelevant, given recent developments in sequencing technology. For example, why sequence just COI when you can use shotgun sequencing to get the whole mitogenome? I was a little taken aback, although this is a view that's getting some traction, e.g. [1,2]. There is also the more radical view that focussing on phylogenetics is itself less useful than, say, "evolutionary gene networks" based on massive sequencing of multiple markers [3].
At the risk of seeming old-fashioned in liking DNA barcoding, I think there's a bigger issue at stake (see also [4]). DNA barcoding isn't simply a case of using a single, short marker to identify animal species. It's the fact that it's a globalised, standardised approach that makes it so powerful. In the wonderful book "A Vast Machine" [5], Paul Edwards talks about "global data" and "making data global". The idea is that not only do we want data that is global in coverage ("global data"), but we want data that can be integrated ("making data global"). In other words, not only do we want data from everywhere in the world, say, we also need an agreed coordinate system (e.g., latitude and longitude) in order to put each data item in a global context. DNA barcoding makes data global by standardising what a barcode is (a given fragment of COI), and what metadata needs to be associated with a sequence to be a barcode (e.g., latitude and longitude) (see, e.g. Guest post: response to "Putting GenBank Data on the Map"). By insisting on this standardisation, we potentially sacrifice the kinds of cool things that can be done with metagenomics, but the tradeoff is that we can do things like put a million barcodes on a map:
To regard barcoding as dead or outdated we'd need an equivalent effort to make metagenomic sequences of animals global in the same way that DNA barcoding is. Now, it may well be that the economics of sequencing is such that it is just as cheap to shotgun sequence mitogenomes, say, as to extract single markers such as COI. If that's the case, and we can get a standardised suite of markers across all taxa, and we can do this across museum collections (like Hebert et al.'s [6] DNA barcoding "blitz" of 41,650 specimens in a butterfly collection), then I'm all for it. But it's not clear to me that this is the case.
This also leaves aside the issue of standardising other things, such as the metadata. For instance, Dowton et al. [2] state that "recent developments make a barcoding approach that utilizes a single locus outdated" (see Collins and Cruickshank [4] for a response). Dowton et al. make use of data they published earlier [7,8]. Out of curiosity I looked at some of these sequences in GenBank, such as JN964715. This is a COI sequence, in other words, a classical DNA barcode. Unfortunately, it lacks a latitude and longitude. By leaving off latitude and longitude (despite the authors having this information, as it is in the supplemental material for [7]) the authors have missed an opportunity to make their data global.
For me the take-home message here is that whether you think DNA barcoding is outdated depends in part on what your goal is. Clearly barcoding as a sequencing technology has been superseded by more recent developments. But to dismiss it on those grounds is to miss the bigger picture of what is at stake, namely the chance to have comparable data for millions of samples across the globe.
References
1. Taylor, H. R., & Harris, W. E. (2012, February 22). An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. Molecular Ecology Resources. Wiley-Blackwell. doi:10.1111/j.1755-0998.2012.03119.x
2. Dowton, M., Meiklejohn, K., Cameron, S. L., & Wallman, J. (2014, March 28). A Preliminary Framework for DNA Barcoding, Incorporating the Multispecies Coalescent. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu028
3. Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biol Direct. Springer Science + Business Media. doi:10.1186/1745-6150-5-47
4. Collins, R. A., & Cruickshank, R. H. (2014, August 12). Known Knowns, Known Unknowns, Unknown Unknowns and Unknown Knowns in DNA Barcoding: A Comment on Dowton et al. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu060
5. Edwards, Paul N. A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. MIT Press. ISBN: 9780262013925
6. Hebert, P. D. N., deWaard, J. R., Zakharov, E. V., Prosser, S. W. J., Sones, J. E., McKeown, J. T. A., Mantle, B., et al. (2013, July 10). A DNA “Barcode Blitz”: Rapid Digitization and Sequencing of a Natural History Collection. (S.-O. Kolokotronis, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0068535
7. Meiklejohn, K. A., Wallman, J. F., Pape, T., Cameron, S. L., & Dowton, M. (2013, October). Utility of COI, CAD and morphological data for resolving relationships within the genus Sarcophaga (sensu lato) (Diptera: Sarcophagidae): A preliminary study. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2013.04.034
8. Meiklejohn, K. A., Wallman, J. F., Cameron, S. L., & Dowton, M. (2012). Comprehensive evaluation of DNA barcoding for the molecular species identification of forensically important Australian Sarcophagidae (Diptera). Invertebrate Systematics. CSIRO Publishing. doi:10.1071/is12008
Tuesday, December 09, 2014
Guest post: Top 10 species names and what they mean
The following is a guest post by Bob Mesibov.
The i4Life project has very kindly liberated Catalogue of Life (CoL) data from its database, and you can now download the latest CoL as a set of plain text, tab-separated tables here.
One of the first things I did with my download was check the 'taxa.txt' table for species name popularity*. Here they are, the top 10 species names for animals and plants, with their frequencies in the CoL list and their usual meanings:
Animals
2732 gracilis = slender
2373 elegans = elegant
2231 bicolor = two-coloured
2066 similis = similar
1995 affinis = near
1937 australis = southern
1740 minor = lesser
1718 orientalis = eastern
1708 simplex = simple
1350 unicolor = one-coloured
Plants
1871 gracilis = slender
1545 angustifolia = narrow-leaved
1475 pubescens = hairy
1336 parviflora = few-flowered
1330 elegans = elegant
1324 grandiflora = large-flowered
1277 latifolia = broad-leaved
1155 montana = (of a) mountain
1124 longifolia = long-leaved
1102 acuminata = pointed
Take the numbers cum grano salis. The first thing I did with the CoL tables was check for duplicates, and they're there, unfortunately. It's interesting, though, that gracilis tops the taxonomists' poll for both the animal and plant kingdoms.
*With the GNU/Linux commands
awk -F"\t" '($11 == "Animalia") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head
awk -F"\t" '($11 == "Plantae") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head
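For readers who would rather not use awk, roughly the same count can be sketched in Python. The column positions mirror the awk fields above ($8, $11 and $20 become zero-based indices 7, 10 and 19); the function name is hypothetical, and, unlike the awk one-liner, this version skips rows with fewer than 20 columns. Ties may also be ordered differently from `sort -nr`.

```python
from collections import Counter


def top_species_names(lines, kingdom, n=10):
    """Count the values in column 20 for species-rank rows of one kingdom.

    `lines` is any iterable of tab-separated text lines, such as an open
    file handle on taxa.txt.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 20:
            continue  # assumption: short rows are ignored rather than counted
        if fields[10] == kingdom and fields[7] == "species":
            counts[fields[19]] += 1
    return counts.most_common(n)


# Usage, assuming taxa.txt from the CoL download is in the working directory:
# with open("taxa.txt", encoding="utf-8") as handle:
#     for name, count in top_species_names(handle, "Animalia"):
#         print(count, name)
```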
Tuesday, December 02, 2014
GBIF Ebbe Nielsen Challenge
The GBIF Ebbe Nielsen Challenge is open! From the official announcement:
The GBIF Secretariat has launched the inaugural GBIF Ebbe Nielsen Challenge, hoping to inspire innovative applications of open-access biodiversity data by scientists, informaticians, data modelers, cartographers and other experts.
First prize is €20,000; full details on prizes and entry requirements are on the Challenge web site. To judge the entries GBIF has assembled a panel of judges comprising people both inside and outside GBIF and its advisory committees:
Lucas Joppa
Scientist, Computational Ecology and Environmental Sciences Group / Microsoft Research
Mary Klein
President & CEO / NatureServe
Tanya Abrahamse
CEO / SANBI: South African National Biodiversity Institute
Arturo H. Ariño
Professor of Ecology / University of Navarra
Roderic Page (that's me)
Professor of Taxonomy / University of Glasgow
This is the first time we've run the challenge, so the topic is wide open. Below I've put together some ideas that are simply designed to get you thinking (and are in no way intended to limit the sort of things that could be entered).
Evolutionary trees
Increasingly DNA sequences from DNA barcoding and metabarcoding are being used to study biodiversity. How can we integrate that data into GBIF? Can we decorate GBIF maps with evolutionary trees?
Change over time
Global Forest Watch is an impressive example of how change in the biosphere can be monitored over time. Can we do something similar with GBIF data? Alternatively, if the level of temporal or spatial resolution in GBIF data isn't high enough, can we combine these sources in some way?
Dashboard
GBIF has started to provide graphical summaries of its data, and there is lots to be done in this area. Can we have a Google Analytics-style summary of GBIF data?
This merely scratches the surface of what could be done, and indeed one of the reasons for having the challenge is to start a conversation about what can be done with half a billion data records.
Sunday, November 23, 2014
Automatically extracting possible taxonomic synonyms from the literature
Quick notes on an experimental feature I've added to BioNames. It attempts to identify possible taxonomic synonyms by extracting pairs of names with the same species name that appear together on the same page of text. The text could be full text for an open access article, OCR text from BHL, or the title and abstract for an article. For example, the following paper creates a new combination, Hadwenius tursionis, for a parasite of the bottlenose dolphin. This name is a synonym of Synthesium tursionis.
Fernández, M., Balbuena, J. A., & Raga, J. A. (1994, July). Hadwenius tursionis (Marchi, 1873) n. comb. (Digenea, Campulidae) from the bottlenose dolphin Tursiops truncatus (Montagu, 1821) in the western Mediterranean. Syst Parasitol. Springer Science + Business Media. doi:10.1007/bf00009519
The taxonomic position of Synthesium tursionis (Marchi, 1873) (Digenea, Campulidae) is revised, based on material from 147 worms from four bottlenose dolphins Tursiops truncatus stranded off the Comunidad Valenciana (Spanish western Mediterranean). The species is transferred to Hadwenius, as H. tursionis n. comb., and characterised by a high length/width ratio of the body, spinose cirrus and unarmed metraterm. Synthesium, a monotypic genus, becomes a synonym of Hadwenius. The intraspecific variation of some morphological traits is briefly discussed.
If we extract taxonomic names from the title and abstract we have the pair (Synthesium tursionis, Hadwenius tursionis). If we do this across all the text currently in BioNames then we discover other pairs of names that include Synthesium tursionis. Joining these pairs together, we can create a graph of co-occurrence of names that may be synonyms (see Synthesium tursionis).
These graphs are computed automatically, and there is inevitably scope for error. Taxa that are not synonyms may have the same specific name (e.g., parasites and hosts may have the same specific name), and some of the names extracted from the text may be erroneous. At the same time, anecdotally it is a useful way to discover links between names. Even better, this approach means that we have the associated evidence for each pair of names. The interface in BioNames lists the references that contain the pairs of names, so you can evaluate the evidence for synonymy. It would be useful to try and evaluate the automatically detected synonyms by comparisons with existing lists of synonyms (e.g., from GBIF).
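The pairing idea can be sketched as below. This is a simplified illustration rather than the BioNames implementation: it assumes the names have already been extracted from each page, handles only plain binomials, compares epithets by exact string match (so gender variants of the same epithet would be missed), and uses hypothetical function names. Each edge keeps the identifiers of the pages where the pair co-occurred, which is the "evidence" mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names_on_page):
    """Return pairs of distinct binomials on one page that share an epithet."""
    by_epithet = defaultdict(set)
    for name in names_on_page:
        parts = name.split()
        if len(parts) == 2:  # assumption: only simple "Genus species" names
            by_epithet[parts[1]].add(name)
    pairs = set()
    for names in by_epithet.values():
        # Sorting makes each pair appear in one canonical order
        pairs.update(combinations(sorted(names), 2))
    return pairs


def build_graph(pages):
    """Map each candidate pair (an edge) to the set of pages supporting it.

    `pages` maps a page or reference identifier to the list of names found
    in its text.
    """
    graph = defaultdict(set)
    for page_id, names in pages.items():
        for pair in candidate_pairs(names):
            graph[pair].add(page_id)
    return graph


pages = {
    "doi:10.1007/bf00009519": [
        "Synthesium tursionis",
        "Hadwenius tursionis",
        "Tursiops truncatus",
    ],
}
graph = build_graph(pages)
```

Here the only edge links Hadwenius tursionis and Synthesium tursionis; the host, Tursiops truncatus, has a different epithet and so is not paired with either.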
Tuesday, October 21, 2014
On identifiers (again)
I'm going to the TDWG Identifier Workshop this weekend, so I thought I'd jot down a few notes. The biodiversity informatics community has been at this for a while, and we still haven't got identifiers sorted out.
From my perspective as both a data aggregator (e.g., BioNames) and a data provider (e.g., BioStor) there are four things I think we need to tackle in order to make significant progress.
A basic challenge is to go from strings, such as bibliographic citations, specimen codes, taxonomic names, etc., to digital identifiers for those things. Most of our data is not born digital, and so we spend a lot of time mapping strings to identifiers. For example, publishers do this a lot when they take the list of literature cited at the end of a manuscript and add DOIs. Hence, one of the first things CrossRef did was provide a discovery service for publishers. This has now morphed into a very slick search tool http://search.crossref.org. Without discoverability, nobody is going to find the identifiers in the first place.
Given an identifier it has to be resolvable (for both people and machines), and I'd argue that at least in the early days of getting that identifier accepted, there needs to be a single point of resolution. Some people are arguing that we should separate identifiers from their resolution, partly based on arguments that "hey, we can always Google the identifier". This argument strikes me as wrong-headed for several reasons.
Firstly, Google is not a resolution service. There's no API, so it's not scalable. Secondly, if you Google an identifier (e.g., 10.7717/peerj.190) you get a bunch of hits, so which one is the definitive source of information on the thing with that identifier? It's not at all obvious, and indeed this is one of the reasons publishers adopted DOIs in the first place. If you Google a paper you can get all sorts of hits and all sorts of versions (preprint, manuscripts, PDFs on multiple servers, etc.). In contrast the DOI gives you a way to access the definitive version.
Another way of thinking about this is in terms of trust. At some point down the road we might have tools that can assess the trustworthiness of a source, and we will need these if we develop decent tools to annotate data (see More on annotating biodiversity data: beyond sticky notes and wikis). But until then the simplest way to engender trust is to have a single point of resolution (like http://dx.doi.org for DOIs). Think about how people now trust DOIs. They've become a mark of respectability for journals (no DOIs, you're not a serious journal), and new ideas such as citing diagrams and data gained further credence once sites like figshare started using DOIs.
Another reason resolvability matters is that I think it's a litmus test of how serious we are. One reason LSIDs failed is that we made them too hard to resolve, and as a consequence people simply minted "fake" LSIDs, dumb strings that didn't resolve. Nobody complained (because, let's face it, nobody was using them), so LSIDs became devalued to the point of uselessness. Anybody can mint a string and call it an identifier; if it costs nothing, that's a good estimate of its actual value.
Resolvability leads to persistence. Sometimes we hear the cliche that "persistence is a social matter, not a technological one". This is a vacuous platitude. The kind of technology adopted can have a big impact on the sociology.
The easiest form of identifier is a simple HTTP URL. But let's think about what happens when we use them. If I spend a lot of time mapping my data to somebody else's URLs (e.g., links to papers or specimens) I am taking a big risk in assuming that the provider of those URLs will keep those "live". At the same time, in linking to those URLs, I constrain the provider - if they decide that their URL scheme isn't particularly good and want to change it (or their institution decides to move to new servers or a new domain), they will break resources like mine that link to them. So a decision they made about their URL structure - perhaps late one Friday afternoon in one of those meetings where everybody just wants to go to the pub - will come back to haunt them.
One way to tackle this is indirection, which is the idea behind DOIs and PURLs, for example. Instead of directly linking to a provider URL, we link to an intermediate identifier. This means that I have some confidence that all my hard work won't be undone (I have seen whole journals disappear because somebody redesigned an institutional web site), and the provider can mess with different technologies for serving their content, secure in the knowledge that external parties won't be affected (because they link to the intermediate identifier). Programmers will recognise this as encapsulation.
Some have argued that we can achieve persistence by simply insisting on it. For example, we fire off a memo to the IT folks saying "don't break these links!". Really? We have that degree of power over our institutional IT policies? This also misses the great opportunity that centralised indirection provides us with. In the case of DOIs for publications, CrossRef sits in the middle, managing the DOIs (in the sense that if a DOI breaks you have a single place to go and complain). Because they also aggregate all the bibliographic metadata, they are automatically able to support discoverability (they can easily map bibliographic metadata to DOIs). So by solving persistence we also solve discoverability.
Lastly, if we are serious about this we need to think about how to engineer the widespread adoption of the identifier. In other words, I think we need network effects. When you join a social networking site, one of the first things they do is ask permission to see your "contacts" (who you already know). If any of those people are already on the network, you can instantly see that ("hey, Jane is here, and so is Bob"). Likewise, the network can target those you know who aren't on the network and prompt them to join.
If we are going to promote the use of identifiers, then it's no use thinking about simply adding identifiers to things, we need to think about ways to grow the network, ideally by adding networks at a time (like a person's list of contacts), not single records. CrossRef does this with articles: when publishers submit an article to CrossRef, they are encouraged to submit not just that article and it's DOI, but the list of all references in the list of literature cited, identified where possible by DOIs. This means CrossRef is building a citation graph, so it can quickly demonstrate value to its members (through cited-by linking).
So, we need to think of ways of demonstrating value, and growing the network of identifiers more rapidling than one identifier at a time. Otherwise, it is hard to see how it would gain critical mass. In the context of, say, specimens, I think an obvious way to do this is have services that tell a natural history collection how many times its specimens have been cited in the primary literature, or have been used as vouchers for DNA seqences. We can then generate metrics of use (as well as start to trace the provenance of our data).
I've no idea what will come out of the TDWG Workshop, but my own view is that unless we tackle these issues, and have a clear sense of how they interrelate, then we won't make much progress. These things are intertwined, and locally optimal solutions ("hey, it's easy, I'll just slap a URL on everything") aren't enough ("OK, how exactly do I find your URL? What happens when it breaks?"). If we want to link stuff together as part of the infrastructure of biodiversity informatics, then we need to think strategically. The goal is not to solve the identifier problem, the goal is to build the biodiversity knowledge graph.
From my perspective as both a data aggregator (e.g., BioNames) and a data provider (e.g., BioStor) there are four things I think we need to tackle in order to make significant progress.
Discoverability (strings to things)
A basic challenge is to go from strings, such as bibliographic citations, specimen codes, taxonomic names, etc., to digital identifiers for those things. Most of our data is not born digital, and so we spend a lot of time mapping strings to identifiers. For example, publishers do this a lot when they take the list of literature cited at the end of a manuscript and add DOIs. Hence, one of the first things CrossRef did was provide a discovery service for publishers. This has now morphed into a very slick search tool http://search.crossref.org. Without discoverability, nobody is going to find the identifiers in the first place.
Resolvability
Given an identifier it has to be resolvable (for both people and machines), and I'd argue that at least in the early days of getting that identifier accepted, there needs to be a single point of resolution. Some people are arguing that we should separate identifiers from their resolution, partly based on arguments that "hey, we can always Google the identifier". This argument strikes me as wrong-headed for several reasons.
Firstly, Google is not a resolution service. There's no API, so it's not scalable. Secondly, if you Google an identifier (e.g., 10.7717/peerj.190) you get a bunch of hits. Which one is the definitive source of information on the thing with that identifier? It's not at all obvious, and indeed this is one of the reasons publishers adopted DOIs in the first place. If you Google a paper you can get all sorts of hits and all sorts of versions (preprints, manuscripts, PDFs on multiple servers, etc.). In contrast the DOI gives you a way to access the definitive version.
Another way of thinking about this is in terms of trust. At some point down the road we might have tools that can assess the trustworthiness of a source, and we will need these if we develop decent tools to annotate data (see More on annotating biodiversity data: beyond sticky notes and wikis). But until then the simplest way to engender trust is to have a single point of resolution (like http://dx.doi.org for DOIs). Think about how people now trust DOIs. They've become a mark of respectability for journals (no DOIs, you're not a serious journal), and new ideas such as citing diagrams and data gained further credence once sites like figshare started using DOIs.
Another reason resolvability matters is that I think it's a litmus test of how serious we are. One reason LSIDs failed is that we made them too hard to resolve, and as a consequence people simply minted "fake" LSIDs, dumb strings that didn't resolve. Nobody complained (because, let's face it, nobody was using them), so LSIDs became devalued to the point of uselessness. Anybody can mint a string and call it an identifier; if it costs nothing, that's a good estimate of its actual value.
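To make "resolvable for machines" a little more concrete, here is a minimal sketch of what a single point of resolution allows. It assumes the resolver lives at https://doi.org (the current address for what was http://dx.doi.org) and that the resolver honours content negotiation for CSL JSON metadata. The function name build_doi_request is made up for illustration, and the request is only constructed here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: the central resolver address; one place to ask about any DOI.
RESOLVER = "https://doi.org/"

def build_doi_request(doi, accept="application/vnd.citationstyles.csl+json"):
    """Build (but do not send) a request asking the resolver about a DOI.

    With an Accept header a machine can ask for metadata; without one a
    person following the same URL is redirected to the landing page.
    """
    # Keep "/" unescaped because it separates the DOI prefix from the suffix.
    url = RESOLVER + quote(doi.strip(), safe="/")
    return Request(url, headers={"Accept": accept})

req = build_doi_request("10.7717/peerj.190")
print(req.full_url)
print(req.get_header("Accept"))
```

The point of the sketch is that the caller needs to know only the identifier and one resolver address, not which publisher hosts the article.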
Persistence
Resolvability leads to persistence. Sometimes we hear the cliche that "persistence is a social matter, not a technological one". This is a vacuous platitude. The kind of technology adopted can have a big impact on the sociology.
The easiest form of identifier is a simple HTTP URL. But let's think about what happens when we use them. If I spend a lot of time mapping my data to somebody else's URLs (e.g., links to papers or specimens) I am taking a big risk in assuming that the provider of those URLs will keep those "live". At the same time, in linking to those URLs, I constrain the provider - if they decide that their URL scheme isn't particularly good and want to change it (or their institution decides to move to new servers or a new domain), they will break resources like mine that link to them. So a decision they made about their URL structure - perhaps late one Friday afternoon in one of those meetings where everybody just wants to go to the pub - will come back to haunt them.
One way to tackle this is indirection, which is the idea behind DOIs and PURLs, for example. Instead of directly linking to a provider URL, we link to an intermediate identifier. This means that I have some confidence that all my hard work won't be undone (I have seen whole journals disappear because somebody redesigned an institutional web site), and the provider can mess with different technologies for serving their content, secure in the knowledge that external parties won't be affected (because they link to the intermediate identifier). Programmers will recognise this as encapsulation.
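As a toy illustration of indirection (not how DOIs or PURLs are actually implemented), the sketch below keeps a table from stable identifiers to whatever URL the provider currently uses. The Resolver class, the "example:" identifiers and the example.org URLs are all hypothetical.

```python
class Resolver:
    """Toy intermediate resolver: stable identifier -> current provider URL."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        # Called by the provider, both to mint an identifier and to update its target.
        self._targets[identifier] = url

    def resolve(self, identifier):
        # Called by anyone who has linked to the identifier.
        return self._targets[identifier]


resolver = Resolver()
resolver.register("example:article/1", "http://old.example.org/journal?id=1")

# An aggregator stores only the identifier, never the provider's URL.
my_links = ["example:article/1"]

# Later the provider moves to a new domain and updates a single record.
resolver.register("example:article/1", "https://new.example.org/articles/1")

# The aggregator's links still work, and nothing on its side had to change.
resolved = [resolver.resolve(identifier) for identifier in my_links]
print(resolved)
```

The aggregator's data never mentions the provider's URL scheme, which is the encapsulation described above.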
Some have argued that we can achieve persistence by simply insisting on it. For example, we fire off a memo to the IT folks saying "don't break these links!". Really? We have that degree of power over our institutional IT policies? This also misses the great opportunity that centralised indirection provides us with. In the case of DOIs for publications, CrossRef sits in the middle, managing the DOIs (in the sense that if a DOI breaks you have a single place to go and complain). Because they also aggregate all the bibliographic metadata, they are automatically able to support discoverability (they can easily map bibliographic metadata to DOIs). So by solving persistence we also solve discoverability.
Network effects
Lastly, if we are serious about this we need to think about how to engineer the widespread adoption of the identifier. In other words, I think we need network effects. When you join a social networking site, one of the first things they do is ask permission to see your "contacts" (who you already know). If any of those people are already on the network, you can instantly see that ("hey, Jane is here, and so is Bob"). Likewise, the network can target those you know who aren't on the network and prompt them to join.
If we are going to promote the use of identifiers, then it's no use thinking about simply adding identifiers to things. We need to think about ways to grow the network, ideally by adding networks at a time (like a person's list of contacts), not single records. CrossRef does this with articles: when publishers submit an article to CrossRef, they are encouraged to submit not just that article and its DOI, but the list of all references in the list of literature cited, identified where possible by DOIs. This means CrossRef is building a citation graph, so it can quickly demonstrate value to its members (through cited-by linking).
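To see why depositing whole reference lists grows the network faster than depositing single records, here is a minimal sketch of inverting such deposits into a cited-by index. The data structure and the "doi:A" style identifiers are invented for illustration; this is not CrossRef's actual schema or service.

```python
from collections import defaultdict

def build_cited_by(deposits):
    """Invert {citing_id: [cited_id, ...]} into {cited_id: {citing_id, ...}}."""
    cited_by = defaultdict(set)
    for citing, references in deposits.items():
        for cited in references:
            # None stands for a reference that could not be matched to an identifier.
            if cited is not None:
                cited_by[cited].add(citing)
    return dict(cited_by)

# Each deposit is one article plus its whole reference list (hypothetical data).
deposits = {
    "doi:A": ["doi:B", "doi:C", None],
    "doi:B": ["doi:C"],
    "doi:D": ["doi:C", "doi:B"],
}

cited_by = build_cited_by(deposits)
counts = {cited: len(citing) for cited, citing in cited_by.items()}
print(counts)
```

Three deposits produce five edges in the graph, and every cited article immediately gets a cited-by count it did not have before.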
So, we need to think of ways of demonstrating value, and growing the network of identifiers more rapidly than one identifier at a time. Otherwise, it is hard to see how it would gain critical mass. In the context of, say, specimens, I think an obvious way to do this is to have services that tell a natural history collection how many times its specimens have been cited in the primary literature, or have been used as vouchers for DNA sequences. We can then generate metrics of use (as well as start to trace the provenance of our data).
Summary
I've no idea what will come out of the TDWG Workshop, but my own view is that unless we tackle these issues, and have a clear sense of how they interrelate, then we won't make much progress. These things are intertwined, and locally optimal solutions ("hey, it's easy, I'll just slap a URL on everything") aren't enough ("OK, how exactly do I find your URL? What happens when it breaks?"). If we want to link stuff together as part of the infrastructure of biodiversity informatics, then we need to think strategically. The goal is not to solve the identifier problem, the goal is to build the biodiversity knowledge graph.
Thursday, October 02, 2014
BioStor and JournalMap: a geographic interface to articles from the Biodiversity Heritage Library
Great to see @JournalMap jump from ~11000 to ~17000 articles: http://t.co/bVarqDGGVU
— Ken Mankoff (@mankoff) September 27, 2014
The recent jump from ~11000 to ~17000 articles in JournalMap is mostly due to JournalMap ingesting content from my BioStor database. BioStor extracts articles from the Biodiversity Heritage Library (BHL), and in turn these get fed back into BHL as "parts" (you can see these in the "Table of Contents" tab when viewing a scanned volume in BHL).
In addition to extracting articles, BioStor pulls out latitude and longitude pairs mentioned in the OCR text and creates little Google Maps for articles that have geotagged content. Working with Jason Karl (@jwkarl), JournalMap now talks to BioStor and grabs all its geotagged articles so that you can browse them in JournalMap. As a consequence, journals such as Proceedings of The Biological Society of Washington now appear on their map (this journal is the third most geotagged journal in JournalMap).
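The post doesn't say how BioStor finds coordinates, so the following is only a rough sketch of one way to pull latitude and longitude pairs out of text. It assumes coordinates are written as degrees and minutes with a hemisphere letter (e.g., 5°06'S, 38°38'E); real OCR text is far messier, and the pattern, function name and sample sentence are all illustrative.

```python
import re

# Assumed format: degrees, minutes, hemisphere letter, latitude before longitude.
COORD = re.compile(
    r"(\d{1,3})\s*°\s*(\d{1,2})\s*['′]\s*([NS])"  # latitude
    r"\s*[,;]?\s*"                                 # optional separator
    r"(\d{1,3})\s*°\s*(\d{1,2})\s*['′]\s*([EW])"  # longitude
)

def extract_points(text):
    """Return a list of (latitude, longitude) pairs in decimal degrees."""
    points = []
    for lat_d, lat_m, ns, lon_d, lon_m, ew in COORD.findall(text):
        lat = int(lat_d) + int(lat_m) / 60
        lon = int(lon_d) + int(lon_m) / 60
        if ns == "S":
            lat = -lat
        if ew == "W":
            lon = -lon
        # Discard matches that cannot be real coordinates.
        if abs(lat) <= 90 and abs(lon) <= 180:
            points.append((round(lat, 4), round(lon, 4)))
    return points

sample = "Collected at 5°06'S, 38°38'E, and again near 6°48'S 39°17'E."
points = extract_points(sample)
print(points)
```

Anything more ambitious (seconds, decimal degrees, OCR errors such as "O" for "0") would need additional patterns and some sanity checking against the article's stated localities.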
As an example of what you can do in JournalMap, here's a screenshot showing localities in Tanzania, and an article from BioStor being displayed:
JournalMap is an elegant interface to the biodiversity literature, and adding BioStor as a source is a nice example of how the Biodiversity Heritage Library's content is becoming more widely used. BioStor is only possible because BHL makes its content and metadata available for easy downloading. This is a lesson I wish other projects would learn. Instead of focussing on building flash-looking portals, make sure (a) you have lots of content, and (b) you make it easy for developers to get that content so they can do cool things with it. BHL does well in this regard; other projects, such as BHL-Europe, not so much.
Tuesday, September 23, 2014
Exploring the chameleon dataset: broken GBIF links and lack of georeferencing
Following on from the discussion of the African chameleon data, I've started to explore Angelique Hjarding's data in more detail. The data is available from figshare (doi:10.6084/m9.figshare.1141858), so I've grabbed a copy and put it in github. Several things are immediately apparent.
- There is a lot of ungeoreferenced data. With a little work this could be geotagged and hence placed on a map.
- There are some errors with the georeferenced data (chameleons in South America or off the coast, a locality in Tanzania that is now in Ethiopia, etc.).
- Rather alarmingly, most of the URLs to GBIF records that Angelique gives in the dataset no longer resolve.
The last point is worrying, and reflects the fact that at present you can't trust GBIF occurrence URLs to be stable over time. Most of the specimens in Angelique's data are probably still in GBIF, but the GBIF occurrenceID (and hence URL) will have changed. This pretty much kills any notion of reproducibility, and it will require some fussing to be able to find the new URLs for these records.
That the GBIF occurrenceIDs are no longer valid also makes it very difficult to make use of any data cleaning I or anyone else attempts with this data. If I georeference some of the specimens, I can't simply tell GBIF that I've got improved data. Nor is it obvious how I would give this information to the original providers using, say, VertNet's github repositories. All in all a mess, and a sad reflection on our inability to have persistent identifiers for occurrences.
To help explore the data I've created some GeoJSON files to get a sense of the distribution of the data. Here are the point localities; a few clearly have issues.
I also drew some polygons around points for the same taxon, to get a sense of their distributions.
Taxa represented by fewer than three distinct localities are shown as place markers, the rest as polygons.
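For anyone wanting to try something similar, here is a minimal sketch of the markers-versus-polygons rule. The post doesn't say how the polygons were drawn, so the sketch swaps in a convex hull (Andrew's monotone chain) as a stand-in. The function names and the "Taxon A" and "Taxon B" records are made up and are not taken from the chameleon dataset.

```python
import json
from collections import defaultdict

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def taxon_features(records):
    """records: iterable of (taxon, longitude, latitude) tuples."""
    by_taxon = defaultdict(set)
    for taxon, lon, lat in records:
        by_taxon[taxon].add((lon, lat))  # GeoJSON order is longitude, latitude
    features = []
    for taxon, pts in sorted(by_taxon.items()):
        hull = convex_hull(pts) if len(pts) >= 3 else []
        if len(hull) >= 3:
            # A polygon ring must be closed: the first vertex is repeated at the end.
            ring = [list(p) for p in hull] + [list(hull[0])]
            geometries = [{"type": "Polygon", "coordinates": [ring]}]
        else:
            # Fewer than three distinct (or only collinear) localities: markers.
            geometries = [{"type": "Point", "coordinates": list(p)} for p in sorted(pts)]
        for geometry in geometries:
            features.append({"type": "Feature",
                             "properties": {"taxon": taxon},
                             "geometry": geometry})
    return {"type": "FeatureCollection", "features": features}

# Hypothetical records, for illustration only.
records = [
    ("Taxon A", 38.0, -5.0), ("Taxon A", 39.0, -5.0),
    ("Taxon A", 38.5, -4.0), ("Taxon A", 38.5, -4.7),
    ("Taxon B", 36.0, -3.0), ("Taxon B", 36.5, -3.2),
]
collection = taxon_features(records)
print(json.dumps(collection, indent=2))
```

A convex hull will overstate the range of any taxon with an outlying (possibly wrong) locality, which is arguably useful here since it makes suspicious points easy to spot on a map.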
I'll keep playing with this data as time allows, and try to get a sense of how hard it would be to go from what GBIF provides to what is actually going to be useful.
Monday, September 22, 2014
GBIF Science Committee Report slides #gb21
Just back from GB21, the GBIF Governing Board meeting (the first such meeting I've attended). It was in New Delhi, and this was also my first time in India, which is an amazing place. At some point I may blog about the experience: the heat, the sheer number of people, the juxtaposition of wealth and poverty, the traffic (chaotic in a wonderfully self-organising sort of way), seeing birds of prey wheel overhead around a hotel in a major city, followed by fruit bats skimming the trees in the evening, the joys of haggling with tuk-tuk drivers, and the wonder that is the Taj Mahal.
Lots to also think about regarding the meeting. A somewhat unsatisfactory conversation about licensing started on Twitter, so at some point I need to revisit this.
#gb21 Fly in ointment of adopting/enforcing @creativecommons licenses for @GBIF is that most taxonomic databases would be excluded :O
— Roderic Page (@rdmpage) September 18, 2014
But for now, here are the slides from my summary of the GBIF Science Committee's activities. It discusses the forthcoming Ebbe Nielsen Challenge (details are still being worked on, so the slides are not the final word), the challenges of adding sequence data to GBIF, and the much-discussed case of the chameleons.
Thursday, August 28, 2014
BioNames database can be downloaded
My BioNames project has been going for over a year now, but I hadn't gotten around to providing bulk access to the data I've been collecting and cleaning. I've gone some way towards fixing this. You can now grab a snapshot of the BioNames database as a Darwin Core Archive here. This snapshot was generated on the 22nd August, so it is already a little out of date (BioNames is edited almost daily as I clean and annotate it when I should be doing other things).
The data dump doesn't capture all the information in BioNames, as I've tried to keep it simple and Darwin Core is a bit of a pain to deal with. The actual database is in CouchDB, which is (mostly) an absolute joy to work with. I replicate the database to Cloudant, which means there's a copy "in the cloud". A number of my other CouchDB projects are also in Cloudant; in the case of Australian Faunal Directory and BOL DNA Barcode Map the data is also served directly from Cloudant.
Monday, August 25, 2014
Geotagging stats for BioStor
Note to self for upcoming discussion with JournalMap.
As of Monday August 25th, BioStor has 106,617 articles comprising 1,484,050 BHL pages. From the full text for these articles, I have extracted 45,452 distinct localities (i.e., geotagged with latitude and longitude). 15,860 BHL pages in BioStor have at least one geotag; these pages belong to 5,675 BioStor articles.
In summary, BioStor has 5,675 full-text articles that are geotagged. The largest number of geotags for an article is 2,421, for Distribución geográfica de la fauna de anfibios del Uruguay (doi:10.5479/si.23317515.134.1).
The SQL for the queries is here.
Tuesday, August 19, 2014
Guest post: Response to the discussion on Red List assessments of East African chameleons
This is a guest post by Angelique Hjarding in response to discussion on this blog about the paper below.
Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427
Thank you for highlighting our recent publication and for the very interesting comments. We wanted to take the opportunity to address some of the issues brought up in both your review and from reader comments.
One of the most important issues that has been raised is the sharing of cleaned and vetted datasets. It has been suggested that the datasets used in our study be uploaded to a repository that can be cited and shared. This is possible for data that was downloaded from GBIF as they have already done the legwork to obtain data sharing agreements with the contributing organizations. So as long as credit is properly given to the source of the data, publicly sharing data accessed through GBIF should be acceptable. At the time the manuscript was submitted for publication, we were unaware of sites such as http://figshare.com where the data could be stored and shared with no additional cost to the contributor. The dataset used in the study that used GBIF data has now been made available in this way.
Angelique Hjarding. (2014). Endemic Chameleons of Kenya and Tanzania. Figshare. doi:10.6084/m9.figshare.1141858
It starts to get tricky with doing the same for the expert vetted data. This dataset consists primarily of data gathered by the expert from museum records and published literature. So in this case it is not a question of why the expert doesn’t share. The question is why the museum data and any additional literature records are not on GBIF already. As has been pointed out in our analysis (and confirmed by Rod) most of these museums do not currently have data sharing agreements with GBIF. Therefore, the expert who compiled the data does not have the permission of the museums to share their data second hand. Bottom line, all of the data used in this study that was not accessed through GBIF is currently available from the sources directly. That is, for anyone who wants to take the time to contact the museums for permission to use their data for research and to compile it. We also do not believe there is blame on museums that have not yet shared their data with forums such as GBIF. Mobilisation of data is an enormous task, and near impossible if funds and staff are not available. With regard to the particular comment regarding the lack of data sharing by NHML and other museums, we need to recognise what the task at hand would mean, and rather address ways such a monumental, and valuable, collection could be mobilised. A further issue should be raised around literature records that are not necessarily encapsulated in museum collections, but are buried in old and obscure manuscripts. To our knowledge, there is no way to mobilise those records either, because they are not attached to a specimen. Further, because there are no specimens, extreme care must be taken if such records were to be mobilised in order to ensure quality control. Again, assistance of expert knowledge would be highly beneficial, yet these things take time and require funds.
Another issue that was raised is why didn’t we go directly to GBIF to fix the records? The point of our research was not to clean and update GBIF/museum data but to evaluate the effect of expert vetting and museum data mobilization in an applied conservation setting. As it has been pointed out, the lead author was working at GBIF during the course of the research. An effort was made to provide a checklist of the updated taxonomy to GBIF at the time, but there was no GBIF mechanism for providing updates. This appears to still be the case. In addition, two GBIF staff provided comments on the paper and were acknowledged for their input. We are happy to provide an updated taxonomy to help improve the data quality, should some submission tool for updates be made available.
Finally we would like to address the question, why use GBIF data if we know it needs some work before it can be used? We believe this is a very important debate for at least two reasons. First, when data is made public, we believe there are many researchers who work under the assumption that the data is ready for use with minimal further work. We believe they assume that the taxonomy is up to date; that the records are in the right place; and that the records provided relate to the name that is attached to those records. Many of the papers that have used GBIF data have undertaken broad scale macroecological analyses where, perhaps, the errors we have shown matter little. But some of these synthetic studies have also proposed that their results can be used for decision making by companies, which starts to raise concerns especially if the company wants to know the exact species that its activities could impact. As we have shown, for chameleons at least, such advice would be hard to provide using the raw GBIF data.
Second, we are aware that there is another group of researchers using GBIF data who "know that to use GBIF's data you need to do a certain amount of previous work and run some tests, and if the data does not pass the tests, you don't use it." We are not sure of the tests that are run, and it would be useful to have these spelled out for broader debate and potentially the development of some agreed protocols for data cleaning for various uses.
Our underlying reason for writing the paper was not to enter into debate of which data are best between GBIF and an expert compiled dataset. We are extremely pleased that GBIF data exist, and are freely available for the use of all. This certainly has to be part of the future of 'better data for better decisions', but we are concerned that we should not just accept that the data is the best we can get, but should instead look for ways to improve it, for all kinds of purposes. As such, we would like to suggest that the discussion focuses some energy on ways to address the shortcomings of the present system, but also that the community who would benefit from the data address ways to assist the dataholders to mobilise their information in terms of accessing the resources required to digitise and make data available, and maintain updated taxonomy for their holdings. In an era of declining funding for Museum-based taxonomy in many parts of the world this is certainly a challenge that needs to be addressed.
We welcome further discussion as this is a very important topic, not only for conservation but also in terms of improved access to biodiversity knowledge, which is critical for many reasons.
Angelique Hjarding http://orcid.org/0000-0002-9279-4893
Krystal Tolley
Neil Burgess
Friday, August 15, 2014
Some design notes on modelling links between specimens and other kinds of data
If we view biodiversity data as part of the "biodiversity knowledge graph" then specimens are a fairly central feature of that graph. I'm looking at ways to link specimens to sequences, taxa, publications, etc., and doing this across multiple data providers. Here are some rough notes on trying to model this in a simple way.
For simplicity let's suppose that we have this basic model:
A specimen comes from a locality (ideally we have the latitude and longitude of that locality), it is assigned to a taxon, we have data derived from that specimen (e.g., one or more DNA sequences), and we have one or more publications about that specimen (e.g., a paper that publishes a taxon name for which the specimen is a type, or a paper that publishes a sequence for which the specimen is a voucher).
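To make the model concrete, here is a minimal sketch of it as Python data classes. The class and field names are my own shorthand for the boxes in the diagram (they are assumptions, not a schema used by any of the databases discussed below), and identifiers are stored as plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Locality:
    """Where a specimen was collected; coordinates are optional."""
    name: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None


@dataclass
class Specimen:
    """A specimen and the things the basic model links it to."""
    code: str                                   # e.g. a museum voucher code
    locality: Optional[Locality] = None
    taxon: Optional[str] = None                 # taxon name or identifier
    sequences: List[str] = field(default_factory=list)     # e.g. accession numbers
    publications: List[str] = field(default_factory=list)  # e.g. DOIs or PMIDs

    def is_georeferenced(self) -> bool:
        """True only if we have both latitude and longitude."""
        return (
            self.locality is not None
            and self.locality.latitude is not None
            and self.locality.longitude is not None
        )
```

Every link is optional or a possibly empty list, which reflects the reality described below: any given source typically only fills in some of these slots.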
NCBI
In GenBank we have sequences that have accession numbers, and these are linked to taxa (identified by NCBI tax ids). A nice feature of sequence databases is that taxa are explicitly defined by extension, that is, a taxon is the set of sequences assigned to a given taxon. Most (but not all, see Miller et al. doi:10.1186/1756-0500-2-101) sequences are also linked to a publication, which will usually have a PubMed id (PMID), and sometimes a DOI. Many sequences are also georeferenced (see Guest post: response to "Putting GenBank Data on the Map"). Most sequences aren't linked to a voucher specimen, but there is the implicit notion of a source (in RDF-speak, many specimens are "blank nodes", see Blank nodes for specimens without URI). Some sequences are associated with a specimen that has a museum code, and some are explicitly linked to the specimen by a URL.
DNA barcodes
Barcodes, as represented in BOLD, are similar to sequences in GenBank. We have explicit taxa ("BINs") each of which has a URL, some also having DOIs. Most barcodes are georeferenced. There's some ambiguity about whether the URL for a barcode record identifies the barcode sequence, the specimen, or both. There may be a voucher code for the specimen. Some barcodes are linked to publications, but not (as far as I can see) in the data obtained from the API. Some barcodes are linked to the corresponding record in GenBank (which may or may not be suppressed, see Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)).
GBIF
At its core GBIF has occurrence records (many of these are specimen-based, but the majority of data in GBIF is actually observation-based), each of which has a unique id, and which is linked to a taxon, also with a unique id. As with the sequence databases, a taxon is a set of occurrences that have been assigned to that taxon. Many records in GBIF are georeferenced. There are limited cross links to other databases - some occurrences list associated GenBank sequences. Some GBIF occurrences actually are sequences (e.g., the European Molecular Biology Laboratory Australian Mirror and the soon to be indexed Geographically tagged INSDC sequences), and barcodes are also making their way into GBIF (e.g., Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data). Links to publications are limited.
Museums and herbaria
Some individual natural history collections which are online provide specimen-level web pages and URLs (some even have DOIs, see DOIs for specimens are here, but we're not quite there yet), and some museums list associated GenBank sequences. In the diagram I've not linked the specimens to a taxon, because most specimens are tagged by a name, not an explicit taxon concept (unlike NCBI, BOLD, or GBIF).
Literature
Literature databases (represented here by BioStor, but could be other sources such as ZooKeys, for example) may contain articles that mention specimen codes. These articles may also mention taxon names, and geographic localities (including coordinates) (see, for example, Linking GBIF and the Biodiversity Heritage Library). Mining text for names, specimens, and localities is fairly easy, but linking these together is harder (i.e., this specimen is of this taxon, and was found at this locality).
Linking together
If we have these separate sources and this trivial model, then we can imagine trying to tie information about the same specimen together across the different databases. Why might we want to do this? Here are three reasons:
- Augmentation: Combining information can enhance our understanding of a specimen. Perhaps a specimen in GBIF is a geographic outlier. A publication that mentions the specimen includes it in a new taxon, perhaps discovered by sequencing DNA extracted from that specimen. Linking this information together resolves the problematic distribution.
- Provenance: What is the evidence that a particular specimen belongs to a particular taxon, or was collected at a particular locality? If we connect specimens to the literature we can review the evidence for ourselves. If we have sequences we can run BLAST, build a tree, and see if we should rethink our classification of that sequence. Imagine being able to browse GBIF and see the evidence for each dot on the map.
- Citation: Mentions in the literature, and use as vouchers for DNA barcoding or other forms of sequencing, can be thought of as a "citation" of that specimen. Museums hosting that material could use metrics based on this to demonstrate the value of their collection (see also The impact of museum collections: one collection ≈ one Nobel Prize).
Making the links
All this is well and good, the trick is to actually make the links. Here things get horribly messy very quickly. Museum specimens are cited in inconsistent ways, we don't have widely used unique, resolvable specimen identifiers, and even if we did have these identifiers we don't have a global discovery mechanism for matching voucher codes to identifiers. GBIF would be an obvious part of a "global discovery mechanism" (a bit like CrossRef but for specimens), but GBIF can have multiple records for the same specimen. Sometimes this is because GBIF not only aggregates data from primary sources (such as museums) but also other aggregations which may themselves already include specimens harvested from primary sources. GBIF can also have multiple records because museums keep messing with their databases, try new variants of the Darwin Core triple, etc., resulting in records that look "new" to GBIF. Whole collections can be duplicated in this way.
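As a rough illustration of the "cited in inconsistent ways" problem, here is a sketch that tries to reduce a few common spellings of a voucher code to one canonical (institution, collection, catalogue number) form, loosely in the spirit of the Darwin Core triple. The function name, the set of separators handled, and the example codes are all assumptions for illustration; real citations are far messier than this.

```python
import re
from typing import NamedTuple, Optional


class VoucherCode(NamedTuple):
    institution: str
    collection: Optional[str]
    catalog_number: str


def parse_voucher(text: str) -> Optional[VoucherCode]:
    """Parse codes such as 'MCZ 12345', 'MCZ-12345' or 'mcz:herp:012345'.

    Returns None when the text doesn't look like a two- or three-part code.
    """
    # Split on the separators commonly seen between the parts of a code.
    parts = [p for p in re.split(r"[\s:\-_/]+", text.strip()) if p]
    # 'MCZ12345' has no separator: split the letters from the trailing digits.
    if len(parts) == 1:
        m = re.fullmatch(r"([A-Za-z]+)(\d+)", parts[0])
        if not m:
            return None
        parts = [m.group(1), m.group(2)]
    if len(parts) not in (2, 3) or not parts[0].isalpha():
        return None
    institution = parts[0].upper()
    collection = parts[1].upper() if len(parts) == 3 else None
    # Drop leading zeros so '012345' and '12345' compare equal.
    catalog_number = parts[-1].upper().lstrip("0") or "0"
    return VoucherCode(institution, collection, catalog_number)
```

With this, `parse_voucher("MCZ 12345")`, `parse_voucher("MCZ-12345")` and `parse_voucher("MCZ12345")` all give the same value, which is the minimum needed before two records can even be considered candidates for being the same specimen.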
One way to tackle this multiplicity of specimen records is to think in terms of "clusters" of specimens that are, in some sense, the same thing across multiple databases. For example, clustering a set of duplicated GBIF records together with the sequences derived from those specimens, perhaps including a DNA barcode, and a list of papers that mention that specimen. This is represented by the yellow bar through the diagram; it connects all the different pieces of information about a specimen into a single cluster. More *cough* later.
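One simple way to sketch the clustering idea is with a union-find (disjoint set) structure: any two records that share a matching key, such as a normalised voucher code, end up in the same cluster. This is only an illustration under the assumption that the matching keys have already been computed; the record identifiers and keys in the usage example are made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def cluster_records(
    records: Iterable[Tuple[str, Set[Hashable]]]
) -> List[Set[str]]:
    """Group record ids into clusters; records sharing any key are merged."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Follow parent pointers to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen: Dict[Hashable, str] = {}  # key -> first record that used it
    for record_id, keys in records:
        parent.setdefault(record_id, record_id)
        for key in keys:
            if key in first_seen:
                union(first_seen[key], record_id)
            else:
                first_seen[key] = record_id

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for record_id in parent:
        clusters[find(record_id)].add(record_id)
    return list(clusters.values())


# Hypothetical records from different sources that mention voucher codes.
example = [
    ("gbif:1", {"MCZ 1"}),
    ("gbif:2", {"MCZ 1"}),       # a duplicate of gbif:1
    ("genbank:A", {"MCZ 1"}),    # a sequence vouchered by the same specimen
    ("gbif:3", {"MCZ 2"}),       # a different specimen
]
```

Here `cluster_records(example)` puts the first three records in one cluster and leaves `gbif:3` on its own. The hard part, of course, is deciding what counts as a shared key in the first place.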
Thursday, August 14, 2014
Seven percent of GBIF data is usable - quick thoughts on Hjarding et al. 2014
Update: Angelique Hjarding and her co-authors have responded in a guest post on iPhylo.
The quality and fitness for use of GBIF-mobilised data is a topic of interest to anyone that uses GBIF data. As an example, a recent paper on African chameleons comes to some rather alarming conclusions concerning the utility of GBIF data:
Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427
Here's the abstract (unfortunately the paper is behind a paywall):
The IUCN Red List of Threatened Species uses geographical distribution as a key criterion in assessing the conservation status of species. Accurate knowledge of a species’ distribution is therefore essential to ensure the correct categorization is applied. Here we compare the geographical distribution of 35 species of chameleons endemic to East Africa, using data from the Global Biodiversity Information Facility (GBIF) and data compiled by a taxonomic expert. Data screening showed 99.9% of GBIF records used outdated taxonomy and 20% had no locality coordinates. Conversely the expert dataset used 100% up-to-date taxonomy and only seven records (3%) had no coordinates. Both datasets were used to generate range maps for each species, which were then used in preliminary Red List categorization. There was disparity in the categories of 10 species, with eight being assigned a lower threat category based on GBIF data compared with expert data, and the other two assigned a higher category. Our results suggest that before conducting desktop assessments of the threatened status of species, aggregated museum locality data should be vetted against current taxonomy and localities should be verified. We conclude that available online databases are not an adequate substitute for taxonomic experts in assessing the threatened status of species and that Red List assessments may be compromised unless this extra step of verification is carried out.
The authors used two data sets, one from GBIF, the other provided by an expert to compute the conservation status for each chameleon species endemic to Kenya and/or Tanzania. After screening the GBIF data for taxonomic and geographic issues, a mere 7% of the data remained - 93% of the 2304 records downloaded from GBIF were discarded.
This study raises a number of questions, some of which I will touch on here. Before doing so, it's worth noting that it's unfortunate that neither of the two data sets used in this study (the data downloaded from GBIF, and the expert data set assembled by Colin Tilbury) are provided by the authors, so our ability to further explore the results is limited. This is a pity, especially now that citable data repositories such as Dryad and Figshare are available. The value of this paper would have been enhanced if both datasets were archived.
Below is Table 1 from the paper, "Museums from which locality records for East African chameleons were obtained for the expert and GBIF datasets":
Museum | Expert dataset | GBIF |
---|---|---|
Afrika Museum, The Netherlands | x | |
American Museum of Natural History, USA | x | |
Bishop Museum, USA | x | |
British Museum of Natural History, UK | x | |
Brussels Museum of Natural Sciences, Belgium | x | |
California Academy of Sciences, USA | x | |
Ditsong Museum, South Africa | x | x |
Los Angeles County Museum of Natural History, USA | x | |
Museum für Naturkunde, Germany | x | |
Museum of Comparative Zoology (Harvard University), USA | x | |
Naturhistorisches Museum Wien, Austria | x | |
Smithsonian Institution, USA | x | |
South African Museum, South Africa | x | |
Trento Museum of Natural Sciences, Italy | x | |
University of Dar es Salaam, Tanzania | x | |
Zoological Research Museum Alexander Koenig, Germany | x | |
It is striking that there is virtually no overlap in data sources available to GBIF and the sources used by the expert. Some of the museums have no presence in GBIF, including some major collections (I'm looking at you, The Natural History Museum), but some museums do contribute to GBIF, but not their herpetology specimens. So, GBIF has some work to do in mobilising more data (Why is this data not in GBIF? What are the impediments to that happening?). Then there are museums that have data in GBIF, but not in a form useful for this study. For example, the American Museum of Natural History has 327,622 herpetology specimens in GBIF, but not one of these is georeferenced! Given that there are records in GenBank for AMNH specimens that are georeferenced, I suspect that the AMNH collection has deliberately not made geographic coordinates available, which raises the obvious question - why?
GBIF coverage
I had a quick look at GBIF to get some idea of the geographic coverage of the relevant herpetology collections (or animal collections if herps weren't separated out). Below are maps for some of these collections. The AMNH is empty, as is the smaller Zoological Research Museum Alexander Koenig collection (which supplied some of the expert data).
American Museum of Natural History, USA
Bishop Museum, USA
California Academy of Sciences, USA
Ditsong Museum, South Africa
Los Angeles County Museum of Natural History, USA
Museum für Naturkunde, Germany
Museum of Comparative Zoology (Harvard University), USA
Smithsonian Institution, USA
Zoological Research Museum Alexander Koenig, Germany
Some collections are relevant, such as the California Academy of Sciences, but a number of the collections in GBIF simply don't have georeferenced data on chameleons. Then there are several museums that are listed as sources for the expert database and which contribute to GBIF, but haven't digitised their herp collections, or haven't made these available to GBIF.
Taxonomy
The other issue encountered by Hjarding et al. 2014 is that the GBIF taxonomy for chameleons is out of date (2302 of 2304 GBIF-sourced records needed to be updated). Chameleons are a fairly small group, and it's not like there are hundreds of new species being discovered each year (see timeline in BioNames); 2006 was a bumper year with 12 new taxonomic names added. But there has been a lot of recent phylogenetic work which has clarified relationships, and as a result species get shuffled around different genera, resulting in a plethora of synonyms. GBIF's taxonomy has lagged behind current research, and also manages to horribly mangle the chameleon taxonomy it does have. For example, the genus Trioceros is not even placed within the chameleon family Chamaeleonidae but is simply listed as a reptile, which means anyone searching for data on the family Chamaeleonidae will miss all the Trioceros species.
Summary
The use case for this study seems one of the most basic that GBIF should be able to meet - given some distributions of organisms, compute an assessment of their conservation status. That GBIF-mobilised data is so patently not up to the task in this case is cause for concern.
However, I don't see this as simply a case of expert data set versus GBIF data, I think it's more complicated than that. A big issue here is data availability, and also the extent of data release (assuming that the AMNH is actively withholding geographic coordinates for some, if not most of its specimens). GBIF should be asking those museums that provide data why they've not made georeferenced data available, and if it's because the museums simply haven't been able to do this, then how can it help this process? It should also be asking why museums which are part of GBIF haven't mobilised their herpetology data, and again, what can it do to help? Lastly, in an age of rapid taxonomic change driven by phylogenetic analysis, GBIF needs to overhaul the glacial pace at which it incorporates new taxonomic information.
Monday, August 04, 2014
Realizing Lessons of the Last 20 Years: A Manifesto for Data Provisioning & Aggregation Services for the Digital Humanities (A Position Paper)
I stumbled across this paper (found on the GBIF Public Library):
Oldman, D., de Doerr, M., de Jong, G., Norton, B., & Wikman, T. (2014, July). Realizing Lessons of the Last 20 Years: A Manifesto for Data Provisioning and Aggregation Services for the Digital Humanities (A Position Paper) System. D-Lib Magazine. CNRI Acct. doi:10.1045/july2014-oldman
The first sentence of the abstract makes the paper sound a bit of a slog to read, but actually it's great fun, full of pithy comments on the state of digital humanities. Almost all of this is highly relevant to mobilising natural history data. Here are the paper's main points (emphasis added):
- Cultural heritage data provided by different organisations cannot be properly integrated using data models based wholly or partly on a fixed set of data fields and values, and even less so on 'core metadata'. Additionally, integration based on artificial and/or overly generalised relationships (divorced from local practice and knowledge) simply create superficial aggregations of data that remain effectively siloed since all useful meaning is available only from the primary source. This approach creates highly limited resources unable to reveal the significance of the source information, support meaningful harmonisation of data or support more sophisticated use cases. It is restricted to simple query and retrieval by 'finding aids' criteria.
- The same level of quality in data representation is required for public engagement as it is for research and education. The proposition that general audiences do not need the same level of quality and the ability to travel through different datasets using semantic relationships is a fiction and is damaging to the establishment of new and enduring audiences.
- Thirdly, data provisioning for integrated systems must be based on a distributed system of processes in which data providers are an integral part, and not on a simple and mechanical view of information system aggregation, regardless of the complexity of the chosen data models. This more distributed approach requires a new reference model for the sector. This position contrasts with many past and existing systems that are largely centralised and where the expertise and practice of providers is divorced.
Recommended reading.
Wednesday, June 11, 2014
The vision thing - it's all about the links
@rdmpage @AlexHardisty @proibiosphere Well, take part in the process of clarification!
— Pensoft Publishers (@Pensoft) June 10, 2014
I've been involved in a few Twitter exchanges about the upcoming pro-iBiosphere meeting regarding the "Open Biodiversity Knowledge Management System (OBKMS)", which is the topic of the meeting. Because for the life of me I can't find an explanation of what "Open Biodiversity Knowledge Management System" is, other than vague generalities and appeals to the magic pixie dust that is "Linked Open Data" and "RDF", I've been grumbling away on Twitter.
So, here's my take on what needs to be done. Fundamentally, if we are going to link biodiversity information together we need to build a network. What we have (for the most part) at the moment is a bunch of nodes (which you can think of as data providers such as natural history collections, databases, etc., or different kinds of data, such as names, publications, sequences, specimens, etc.).
We'd like a network, so that we can link information together, perhaps to discover new knowledge, to serve as a pathway for analyses that combine different sorts of data, and so on:
A network has nodes and links. Without the links there's no network. The fundamental problem as I see it is that we have nodes that have clear stakeholders (e.g., individuals, museums, herbaria, publishers, database owners, etc.). They often build links, but they are typically incomplete (they don't link to everything that is relevant), and transitory (there's no mechanism to facilitate persistence of the links). There is no stakeholder for whom the links are most important. So, we have this:
This sucks. I think we need an entity, a project, an organisation, whatever you want to call it, for whom the network is everything. In other words, they see the world like this:
If this is how you view the world, then your aim is to build that network. You live or die based on the performance of that network. You make sure the links exist, they are discoverable, and that they persist. You don't have the same interests as the nodes, but clearly you need to provide value to them because they are the endpoints of your links. But you also have users who don't need the nodes per se, they need the network.
If you buy this, then you need to think about how to grow the network. Are there network effects that you can leverage, in the same way CrossRef has with publishers submitting lists of literature cited linked to DOIs, or in social media where you give access to your list of contacts to build your social graph?
If the network is the goal, you don't just think "let's just stick HTTP URLs on everything and it will all be good". You can think like that if you are a node, because if the links die you can still persist (you'll still have people visiting your own web site). But if you are a network and the links die, you are in big trouble. So you develop ways to make the network robust. This is one reason why CrossRef uses an identifier based on indirection: it makes it easier to ensure the network persists in the face of change in how the nodes serve their data. What is often missed is that this also frees up the nodes, because they don't need to commit to serving a given URL in perpetuity; indirection shields them from this.
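A toy sketch of what indirection buys you: links in the network store a stable identifier, and a resolver maps that identifier to wherever the node currently serves the data. This is not how CrossRef or the DOI system is implemented; the class, the identifier, and the example.org URLs are all made up for illustration.

```python
from typing import Dict


class Resolver:
    """Maps stable identifiers to the current location of the data."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, identifier: str, url: str) -> None:
        """Create or update the target for an identifier."""
        self._targets[identifier] = url

    def resolve(self, identifier: str) -> str:
        """Return the current URL; raises KeyError for unknown identifiers."""
        return self._targets[identifier]


resolver = Resolver()
resolver.register("specimen/0001", "http://old.example.org/specimens?id=1")

# The network only ever stores the identifier, never the URL.
link = "specimen/0001"

# The node reorganises its web site and updates a single entry in the resolver.
resolver.register("specimen/0001", "https://new.example.org/specimen/1")
```

After the move, `resolver.resolve(link)` returns the new URL without any of the links having been touched, which is the sense in which indirection protects both the network and the node.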
In order to serve users of the network, you want to ensure you can satisfy their needs rapidly. This leads to things like caching links and basic data about the end points of those links (think how Google caches the contents of web pages so if the site is offline you may still find what you are looking for).
If your business depends on the network, then you need to think how you can create incentives for nodes to join. For example, what services can you offer them that make you invaluable to the nodes? Once you crack that, then all sorts of things can happen. Take structured markup as an example. Google is driving this on the web using schema.org. If you want to be properly indexed by Google, and have Google display your content in a rich form (e.g., thumbnails, review ratings, location, etc.) you need to mark up your page in a way Google understands. Given that some businesses live or die based on their Google ranking, there's a strong incentive for web sites to adopt this markup. There's a strong incentive for Google to encourage markup so that it can provide informative results for its users (otherwise they might rely on "social search" via Facebook and mobile apps). This is the kind of thing you want the network to aim for.
In summary, this is my take on where we are at in biodiversity informatics. The challenge is that the organisations in the room discussing this are typically all nodes, and I'd argue that by definition they aren't in a position to solve the problem. You need to pivot (ghastly word) and think about it from the perspective of the network. Imagine you were to form a company whose mission was to build that network. How would you do it, how would you convince the nodes to engage, what value would you offer them, what value would you offer users of the network? If we start thinking along those lines, then I think we can make progress.