Friday, October 28, 2011

Sherborn presentation on Open Taxonomy

Here is my presentation from today's Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting.

All the presentations will be posted online, along with podcasts of the audio. Meantime, presentations by Dave Remsen and Chris Freeland are already online.

Thursday, October 27, 2011

Linking taxonomic names to literature: beyond digitised 5×3 index cards

Tomorrow is the Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting. It should be an interesting gathering, albeit overshadowed by the sudden death of Frank Bisby.

I'm giving a talk entitled "Open Taxonomy", in which I argue that most taxonomic databases are little more than digitised collections of 5×3 index cards, where literature is treated as dumb citation strings rather than as resources with digital identifiers. To make the discussion concrete I've created a mapping between the Index to Organism Names (ION) database and a range of bibliographic sources, such as CrossRef (for DOIs), BioStor, JSTOR, etc.

This mapping is online at

So far I've managed to link some 200,000 animal names to a literature identifier, and a good fraction of these articles are freely available, either as images in BioStor and Gallica (for which I've created a simple viewer) or as PDFs (which are displayed using Google Docs).

Some examples are:

The site is obviously a work in progress, and there's a lot to be done to the interface, but I hope it conveys the key point: a significant fraction of the primary taxonomic literature is online, and we should be linking to this. The days of digitised 5×3 index cards are past.

Friday, October 21, 2011

Final thoughts on TDWG RDF challenge

Quick final comment on the TDWG Challenge - what is RDF good for? As I noted in the previous post, Olivier Rovellotti (@orovellotti) and Javier de la Torre (@jatorre) have produced some nice visualisations of the frog data set:
Nice as these are, I can't help feeling that they actually help make my point about the current state of RDF in biodiversity informatics. The only responses to my challenge have been to use geography, where the shared coordinate system (latitude and longitude) facilitates integration. Having geographic coordinates means we don't need to have shared identifiers to do something useful, and I think it's no accident that GBIF is one of the most important resources we have. Geography is also the easiest way to integrate across other fields (e.g., climate).

But what of the other dimensions? What I'm really after are links across datasets that enable us to make new inferences, or address interesting questions. The challenge is still there...

Thursday, October 20, 2011

Reflections on the TDWG RDF "Challenge"

This is a follow up to my previous post TDWG Challenge - what is RDF good for? where I'm being, frankly, a pain in the arse, and asking why we bother with RDF? In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF. In other words, if you like RDF and buy the promise of large-scale data integration while still being decentralised ("the web as database"), then we're doing it wrong.

As an aside, my own perspective is one of data integration. I want to link all this stuff together so I can follow a path through multiple datasets and extract the information I want. In other words, "linked data" (little "l", little "d"). I'm interested in fairly lightweight integration, typically through shared identifiers. There is also integration via ontologies, which strikes me as a different, if related, problem, one that in many ways is closer to the original vision of the Semantic Web as a giant inference engine. I think the concerns (and experience) of these two communities are somewhat different. I don't particularly care about ontologies; I want key-value pairs and reusable identifiers so I can link stuff together. If, for example, you're working on something like Phenoscape, then I think you have a rather more circumscribed set of data, with potentially complicated interrelationships that you want to make inferences on, in which case ontologies are your friend.

So, I posted a "challenge". It wasn't a challenge so much as a set of RDF to play with. What I'm interested in is seeing how easily we can string this data together to learn stuff. For example, using the RDF I posted earlier here is a table listing the name, conservation status, publication DOI and date, and (where available) image from Wikipedia for frogs with sequences in GenBank.

Species                         Status  DOI                    Year described  Image
Atelopus nanay                  CR      [0229:TNSOAA]2.0.CO;2  2002
Eleutherodactylus mariposa      CR
Phrynopus kauneorum             CR
Eleutherodactylus eunaster      CR
Eleutherodactylus amadeus       CR
Eleutherodactylus lamprotes     CR
Churamiti maridadi              CR
Eleutherodactylus thorectes     CR
Eleutherodactylus apostates     CR
Leptodactylus silvanimbus       CR
Eleutherodactylus sciagraphus   CR
Bufo chavin                     CR      [0216:NSOBAB]2.0.CO;2  2001
Eleutherodactylus fowleri       CR
Ptychohyla hypomykter           CR
Hyla suweonensis                DD
Proceratophrys concavitympanum  DD
Phrynopus bufoides              DD
Boophis periegetes              DD
Phyllomedusa duellmani          DD
Boophis liami                   DD
Hyalinobatrachium ignioculus    DD      [0091:ANSOHA]2.0.CO;2  2003
Proceratophrys cururu           DD
Amolops bellulus                DD      [0536:ABANSO]2.0.CO;2  2000
Centrolene bacatum              DD
Litoria kumae                   DD
Phrynopus pesantesi             DD
Gastrotheca galeata             DD
Paratelmatobius cardosoi        DD
Rhacophorus catamitus           DD      [0046:NAPKPF]2.0.CO;2  2002
Huia melasma                    DD
Telmatobius vilamensis          DD      [0253:ANSOTA]2.0.CO;2  2003
Callulina kisiwamsitu           EN
Arthroleptis nikeae             EN
Eleutherodactylus amplinympha   EN
Eleutherodactylus glaphycompus  EN
Bufo tacanensis                 EN
Phrynopus bracki                EN
Telmatobius sibiricus           EN      [0127:ANSOTF]2.0.CO;2  2003
Cochranella mache               EN
Eleutherodactylus melacara      EN
Plectrohyla glandulosa          EN
Aglyptodactylus laticeps        EN
Eleutherodactylus glamyrus      EN
Gastrotheca trachyceps          EN
Eleutherodactylus grahami       EN
Litoria havina                  LC
Crinia riparia                  LC
Litoria longirostris            LC
Osteocephalus mutabor           LC
Leptobrachium nigrops           LC
Pseudis tocantins               LC
Mantidactylus argenteus         LC
Aglyptodactylus securifer       LC
Pseudis cardosoi                LC
Uperoleia inundata              LC
Litoria pronimia                LC
Litoria paraewingi              LC
Philautus aurifasciatus         LC
Proceratophrys avelinoi         LC
Osteocephalus deridens          LC
Gephyromantis boulengeri        LC
Crossodactylus caramaschii      LC
Rana yavapaiensis               LC
Boophis lichenoides             LC
Megistolotis lignarius          LC
Ansonia endauensis              NE      [466:ANSOAS]2.0.CO;2   2006
Ansonia kraensis                NE
Arthroleptella landdrosia       NT
Litoria jungguy                 NT
Phrynobatrachus phyllophilus    NT
Philautus ingeri                VU
Gastrotheca dendronastes        VU
Hyperolius cystocandicans       VU
Boophis sambirano               VU
Ansonia torrentis               VU
Telmatobufo australis           VU
Stefania coxi                   VU      [0327:EDOSAH]2.0.CO;2  2002
Oreolalax multipunctatus        VU
Eleutherodactylus guantanamera  VU
Spicospina flammocaerulea       VU
Cycloramphus acangatan          VU
Leiopelma pakeka                VU
Rana okaloosae                  VU
Phrynobatrachus uzungwensis     VU

This is a small fraction of the frog species actually in GenBank because I've filtered it down to those that have been linked to Wikipedia (from where we get the conservation status) and which were described in papers with DOIs (from which we get the date of description).

I generated this result using this SPARQL query on a triple store that had the primary data sources (Uniprot, Dbpedia, CrossRef, ION) loaded, together with the all-important "glue" datasets that link ION to CrossRef, and Uniprot to Dbpedia (see previous post for details):

PREFIX rdf: <>
PREFIX rdfs: <>
PREFIX dbpedia-owl: <>
PREFIX uniprot: <>
PREFIX tdwg_tn: <>
PREFIX tdwg_co: <>
PREFIX dcterms: <>

SELECT ?name ?status ?doi ?date ?thumbnail
WHERE {
  # Taxon in GenBank (via Uniprot) and its Dbpedia page
  ?ncbi uniprot:scientificName ?name .
  ?ncbi rdfs:seeAlso ?dbpedia .
  ?dbpedia dbpedia-owl:conservationStatus ?status .
  # The same name in ION, and the DOI of the publication
  ?ion tdwg_tn:nameComplete ?name .
  ?ion tdwg_co:publishedInCitation ?doi .
  ?doi dcterms:date ?date .

  # Image only where available
  OPTIONAL { ?dbpedia dbpedia-owl:thumbnail ?thumbnail }
}
ORDER BY ASC(?status)

This table doesn't tell us a great deal, but we could, for example, graph date of description against conservation status (CR=critically endangered, EN=endangered, VU=vulnerable, NT=near threatened, LC=least concern, DD=data deficient, NE=not evaluated):
In other words, is it the case that more recently described species are more likely to be endangered than taxa we've known about for some time (based on the assumption that we've found all the common species already)? We could imagine extending this query to retrieve sequences for a class of frog (e.g., critically endangered) so we could compute a measure of population genetic variation, etc. We shouldn't take the graph above too seriously because it's based on a small fraction of the data, but you get the idea. As more frog taxonomy goes online (there's a lot of stuff in BHL and BioStor, for example) we could add more dates and build a dataset worth analysing properly.
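To make the "date of description against conservation status" idea a bit more concrete, here's a minimal Python sketch that groups description years by status. It only uses the nine rows of the table above that carry a year, so it illustrates the bookkeeping rather than offering an analysis; the function names are my own invention.

```python
from collections import defaultdict
from statistics import mean

# (species, status code, year described), copied from the rows of the
# table above that have a description year. Far too few for real inference.
rows = [
    ("Atelopus nanay", "CR", 2002),
    ("Bufo chavin", "CR", 2001),
    ("Hyalinobatrachium ignioculus", "DD", 2003),
    ("Amolops bellulus", "DD", 2000),
    ("Rhacophorus catamitus", "DD", 2002),
    ("Telmatobius vilamensis", "DD", 2003),
    ("Telmatobius sibiricus", "EN", 2003),
    ("Ansonia endauensis", "NE", 2006),
    ("Stefania coxi", "VU", 2002),
]


def years_by_status(records):
    """Group description years by conservation status code."""
    grouped = defaultdict(list)
    for _species, status, year in records:
        grouped[status].append(year)
    return dict(grouped)


def mean_year_by_status(records):
    """Mean year of description for each status code."""
    return {status: mean(years) for status, years in years_by_status(records).items()}


if __name__ == "__main__":
    for status, average in sorted(mean_year_by_status(rows).items()):
        print(status, average)
```

With more dates (say, from BHL and BioStor) the same grouping could feed a proper plot or a statistical test.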

It seems to me that these should be fairly simple things to do, yet they are the sort of thing that if we attempt today it's a world of hurt involving scripts, Excel, data cleaning, etc. before we can do the science.

The thing is, without the "glue" files mapping identifiers across different databases even this simple query isn't possible. Obviously we have no say in how many organisations publish RDF, but within the biodiversity informatics community we should make every effort to use external identifiers wherever possible so that we can make these links. This is the core of my complaint. If we are using RDF to foster data integration so we can query across the diverse data sets that speak to biodiversity, then we are doing it wrong.
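Here's a toy Python sketch of why the "glue" matters. Each dictionary stands in for one data source, and every identifier in it (ncbi:1, ion:100, the DOI, and so on) is a made-up placeholder, not a real URI. The join mirrors the shape of the query above: NCBI and ION meet on the name string, while the links to Dbpedia and to DOIs come from the glue mappings.

```python
def join(ncbi_names, ion_names, dbpedia_status, doi_dates,
         ncbi_to_dbpedia, ion_to_doi):
    """Return (name, status, doi, date) rows that can be linked end to end."""
    # NCBI and ION share no identifier, so they are matched on the name string.
    ion_by_name = {name: ion_id for ion_id, name in ion_names.items()}
    rows = []
    for ncbi_id, name in ncbi_names.items():
        dbpedia_id = ncbi_to_dbpedia.get(ncbi_id)      # glue: NCBI -> Dbpedia
        doi = ion_to_doi.get(ion_by_name.get(name))    # glue: ION -> DOI
        if dbpedia_id not in dbpedia_status or doi not in doi_dates:
            continue  # a missing link anywhere drops the row
        rows.append((name, dbpedia_status[dbpedia_id], doi, doi_dates[doi]))
    return rows


# Hypothetical stand-ins for the primary data sources.
ncbi_names = {"ncbi:1": "Atelopus nanay"}
ion_names = {"ion:100": "Atelopus nanay"}
dbpedia_status = {"dbpedia:Atelopus_nanay": "CR"}
doi_dates = {"doi:10.9999/example": "2002"}

# Hypothetical stand-ins for the two glue files.
ncbi_to_dbpedia = {"ncbi:1": "dbpedia:Atelopus_nanay"}
ion_to_doi = {"ion:100": "doi:10.9999/example"}

if __name__ == "__main__":
    print(join(ncbi_names, ion_names, dbpedia_status, doi_dates,
               ncbi_to_dbpedia, ion_to_doi))
    # Without the glue mappings nothing links up.
    print(join(ncbi_names, ion_names, dbpedia_status, doi_dates, {}, {}))
```

The primary data are identical in both calls; only the presence of the mappings decides whether the query returns anything.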

Here is a nice visualisation of this dataset from @orovellotti (original here), made using ecoRelevé:


Wednesday, October 19, 2011

TDWG Challenge - what is RDF good for?

Last month, feeling particularly grumpy, I fired off an email to the TDWG-TAG mailing list with the subject Lobbing grenades: a challenge. Here's the email:
It's morning and the coffee hasn't quite kicked in yet, but reading through recent TDWG TAG posts, and mindful of the upcoming meeting in New Orleans (which sadly I won't be attending) I'm seeing a mismatch between the amount of effort being expended on discussions of vocabularies, ontologies, etc. and the concrete results we can point to.

Hence, a challenge:

"What new things have we learnt about biodiversity by converting biodiversity data into RDF?"

I'm not saying we can't learn new things, I'm simply asking what have we learnt so far?

Since around 2006 we have had literally millions of triples in the wild (uBio, ION, Index Fungorum, IPNI, Catalogue of Life, more recently Biodiversity Collections Index, Atlas of Living Australia, World Register of Marine Species, etc.), most of these using the same vocabulary. What new inferences have we made?

Let's make the challenge more concrete. Load all these data sources into a triple store (subchallenge - is this actually possible?). Perhaps add other RDF sources (DBpedia, Bio2RDF, CrossRef). What novel inferences can we make?

I may, of course, simply be in "grumpy old arse" mode, but we have millions of triples in the wild and nothing to show for it. I hope I'm not alone in wondering why...

In the context of the TDWG meeting (happening as we speak and which I'm following via Twitter, hashtag #tdwg) Joel Sachs asked me whether I had any specific data in mind that could form the basis of a discussion. So, here goes. I've assembled some small RDF data sets that it might be fun to play with. Each data set is for frogs, and I've divided them into two sets.

Primary data
These data sets are essentially unmodified RDF fetched from data providers:
  • uniprot.rdf Uniprot RDF for frogs in GenBank
  • ion.rdf Index of Organism Names (ION) RDF for taxonomic names for frogs (filtered to just those names that are also in GenBank, the RDF comes from ION LSIDs)
  • crossref.rdf CrossRef RDF for DOIs for publications that published new frog names (obtained using CrossRef's support for Linked Data for DOIs)
  • dbpedia.rdf Dbpedia RDF for frogs in GenBank (Update 2011-10-20: the dbpedia.rdf file is a bit big, so here is subset.rdf which has just the conservation status and thumbnail image)

These sources give us information on genomics (at least, they tell us which taxa have been sequenced), where and when the original taxonomic description was published, and by whom, as well as some information on conservation status and what the frog looks like (via Dbpedia). Ideally we just load these files into a triple store and then ask a bunch of questions, such as what is the conservation status of frogs sequenced in GenBank?, is there a correlation between the conservation status of a frog and the date it was discovered?, who has described the most frog species?, etc.

My contention is that actually we can't do any of this because the data is siloed due to the lack of shared identifiers and vocabularies (I suspect that there is not a single identifier any of these files share). The only way we can currently link these data sets together is by shared string literals (e.g., taxonomic names), in which case why bother with RDF? So my first challenge is to see whether any of the questions I've just listed can actually be tackled using this data.

In a slightly more constructive mode, to see if we can make progress I'm providing some additional RDF files, based on projects I'm working on to link data together. These files may help provide some of the missing "glue" to connect these data sets.

  • linkout.rdf The list of links between NCBI and Dbpedia (based on mapping in iPhylo LinkOut)
  • ion_doi.rdf A subset of publications listed in ION have DOIs, this file links the corresponding ION LSIDs to those DOIs (this file is from an ongoing project mapping names to primary literature)

The ion_doi.rdf file links the ION and CrossRef RDF, so we could start to ask questions about dates of discovery, who described what species, etc. The linkout.rdf file links NCBI taxon ids (in this case in the form of UniProt URIs) to Wikipedia (in the form of Dbpedia URIs). Dbpedia has information on conservation status, and some frogs will also have pictures, so we can start to join genomics to conservation, as well as make some visualisations.

I've now added another RDF file for 1000 georeferenced GenBank sequences for frogs. The file is genbank.rdf. This file is generated from a local, processed version of EMBL, and uses a mixture of Dublin Core and TDWG vocabularies. Here's an example of a single record:

<?xml version="1.0"?>
<rdf:RDF xmlns:dcterms=""
<uniprot:Molecule rdf:about="">
<dcterms:description>Xenopus borealis voucher MHNG:Herp:2644.64
cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial.</dcterms:description>
<dcterms:subject rdf:resource=""/>
<dcterms:relation rdf:parseType="Resource">
<rdf:type rdf:resource=""/>
<toccurrence:identifiedToString>Xenopus borealis</toccurrence:identifiedToString>
<toccurrence:verbatimCoordinates>0.66 N 37.5 E</toccurrence:verbatimCoordinates>

I've added this simply so one could do some geographical queries.
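A geographical query needs numbers rather than strings, so the first step would be something like the following minimal Python sketch. It assumes the verbatim coordinates always look like the example above ("0.66 N 37.5 E": decimal degrees, each followed by a hemisphere letter), which may well not hold for every record; the function name is mine.

```python
import re

# Decimal degrees followed by a hemisphere letter, latitude first.
_COORD = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_verbatim_coordinates(text):
    """Turn e.g. '0.66 N 37.5 E' into signed (latitude, longitude)."""
    match = _COORD.match(text)
    if match is None:
        raise ValueError(f"unrecognised coordinate string: {text!r}")
    lat, ns, lon, ew = match.groups()
    # South and west become negative values.
    latitude = float(lat) * (-1 if ns.upper() == "S" else 1)
    longitude = float(lon) * (-1 if ew.upper() == "W" else 1)
    if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
        raise ValueError(f"coordinates out of range: {text!r}")
    return latitude, longitude


if __name__ == "__main__":
    print(parse_verbatim_coordinates("0.66 N 37.5 E"))
```

Anything that doesn't match the pattern raises an error rather than being guessed at, which seems the safer default for verbatim fields.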

Missing links
There are still lots of missing links here (for example, there's no explicit link between NCBI and ION, so we'd need to create this using taxonomic names), and we could add further links to the literature via sequences for taxa. Then there's the lack of geographic data. We could get some of this via georeferenced sequences in GenBank, but there's no RDF for this (Bio2RDF does have RDF for sequences but it ignores the bulk of the organismal metadata such as voucher specimens and latitude and longitude).
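For what it's worth, creating the NCBI to ION link from taxonomic names might look like this minimal Python sketch. The identifiers are placeholders I've made up, and the normalisation (collapse whitespace, ignore case) is just one plausible assumption; real names bring homonyms, authorship strings and spelling variants that this ignores.

```python
from collections import defaultdict


def normalise(name):
    """Collapse runs of whitespace and ignore case."""
    return " ".join(name.split()).casefold()


def link_by_name(ncbi_names, ion_names):
    """Map each NCBI id to the ION ids that share its (normalised) name."""
    index = defaultdict(list)
    for ion_id, name in ion_names.items():
        index[normalise(name)].append(ion_id)
    return {
        ncbi_id: sorted(index[normalise(name)])
        for ncbi_id, name in ncbi_names.items()
        if normalise(name) in index
    }


if __name__ == "__main__":
    # Hypothetical identifiers; only the species names come from this post.
    ncbi = {"ncbi:1": "Xenopus borealis", "ncbi:2": "Bufo chavin"}
    ion = {"ion:10": "xenopus  borealis", "ion:11": "Rana okaloosae"}
    print(link_by_name(ncbi, ion))
```

A name that maps to more than one ION id is kept as a list, so any ambiguity stays visible instead of being silently resolved.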

In many ways it's this lack of links that was the point of my original email. The reality is that "linked data" isn't linked to anything like the extent that makes it useful. Simply pumping out RDF won't get us very far until we tackle this problem (see also my earlier post Linked data that isn't: the failings of RDF).

So, if you think RDF is the way to go, please tell me what you can learn from these data files.

Tuesday, October 11, 2011

DeepDyve - renting scientific articles

Bit late, but I stumbled across DeepDyve, which provides rental access to scientific papers for as little as $0.99. The pitch to publishers is:

Today, scholarly publisher sites receive over 2 billion visits per year from users who are unaffiliated with an institution yet convert less than 0.2% into a purchase or subscription. DeepDyve’s service is designed for these ‘unaffiliated users’ who need an easy and affordable access to authoritative information vital to their careers.

Renting a paper means you get to read it online, but you can't print or download it, and access is time limited (unless you purchase the article outright). You can also purchase monthly plans (think Spotify for papers).

It's an interesting model, and the interface looks nice. Here's a paper on Taxonomy and Diversity (

Leaving aside the issue of whether restricted access to the scientific literature is a good idea (even if it is relatively cheap) I'm curious about the business model and the long tail. One could imagine lots of people downloading a few high-visibility papers, and my sense (based on no actual data I should stress) is that DeepDyve's publishing partners are providing access to their first-tier journals.

Taxonomic literature is vast, but most individual papers will have few readers (describing a single new species is usually not big news, with obvious exceptions). But I wonder if in aggregate the potential taxonomic readership would be enough to make cheap access to that literature economic. Publishers such as Wiley, Taylor and Francis, and Springer have digitised some major taxonomic journals. How will they get a return on this? I suspect a price tag of, say, €34.95 for an article on seabird lice (e.g., "Neue Zangenläuse (Mallophaga, Philopteridae) von procellariiformen und charadriiformen Wirten") will be too high for many people, but the chance to rent it for 24 hours for, say, $0.99, would be appealing. If this is the case, then maybe this would encourage publishers to digitise more of their back catalogue. It would be nice if everything were digitised and free, but I could live with digitised and cheap.

Thursday, October 06, 2011

My favourite Apple moment

In light of today's news here's my favourite Mac, the original iBook.
In many ways, it wasn't the machine itself that grabbed me (cool as it was), it was the experience of unpacking it when it arrived in my office over a decade ago. In the box with the computer and the mains cord was a disc about the size of a hockey puck (on the right in the image above). I looked at it and wondered what on Earth it was. It looked like a giant yo-yo, with cable wrapped around instead of string. Then the penny dropped: it was the power supply. You plugged the mains cord into the yo-yo, then unwound just as much cord as you needed (oh, and when you connected it to your iBook the plug glowed orange if the battery needed charging, green if it was fully charged). The child inside me squealed with delight (being a grown up I laughed out loud, rather than actually squealing).

The iBook still works (the battery is long dead, but plug the yo-yo into the mains and it still works), and it manages to run an early version of Mac OS X.

If anybody has to ask why people love Apple products, it's not because of the "brand", or the "exclusivity", it's because of the joy they can evoke. Someone cared enough to make the most mundane task, plugging a laptop into the mains, into a thing of beauty.

Wednesday, October 05, 2011

Taxonomy - crisis, what crisis?

Following on from the last post How many species are there, and why do we get two very different answers from same data? another interesting paper has appeared in TREE:

Lucas N. Joppa, David L. Roberts, Stuart L. Pimm The population ecology and social behaviour of taxonomists Trends in Ecology & Evolution doi:10.1016/j.tree.2011.07.010

The paper analyses the "ecology and social habits of taxonomists" and concludes:

Conventional wisdom is highly prejudiced. It suggests that taxonomists were a formerly more numerous people, are in 'crisis', are becoming endangered and are generally asocial. We consider these hypotheses and reject them to varying degrees.

Cue flame war on TAXACOM, no doubt, but it's a refreshing conclusion, and it's based on actual data. Here I declare an interest. I was a reviewer, and in a fit of pique recommended rejection simply because the authors don't make the data available (they do, however, provide the R scripts used to do the analyses). As the authors patiently pointed out in their response to reviews, the various explicit or implicit licensing statements attached to taxonomic data mean they can't provide the data (and I'm assuming that in at least some cases the dark art of screen scraping was used to get the data).

There's an irony here. Taxonomic databases are becoming hot topics, generating estimates of the scale of the task facing taxonomy, and diagnosing the state of the discipline itself (according to Joppa et al. it's in rude health). This is the sort of thing that can have a major impact on how people perceive the discipline (and may influence how many resources are allocated to the subject). If taxonomists take issue with the analyses then they will find them difficult to repeat because the taxonomic data they've spent their careers gathering are under lock and key.

Tuesday, October 04, 2011

How many species are there, and why do we get two very different answers from same data?

Two papers estimating the total number of species have recently been published, one in the open access journal PLoS Biology:

Camilo Mora, Derek P. Tittensor, Sina Adl, Alastair G. B. Simpson, Boris Worm. How Many Species Are There on Earth and in the Ocean?. PLoS Biol 9(8): e1001127. doi:10.1371/journal.pbio.1001127
the second in Systematic Biology (which has an open access option but the authors didn't use it for this article):

Mark J. Costello, Simon Wilson and Brett Houlding. Predicting total global species richness using rates of species description and estimates of taxonomic effort. Syst Biol (2011) doi:10.1093/sysbio/syr080

The first paper has gained a lot of attention, in part because Jonathan Eisen (Bacteria & archaea don't get no respect from interesting but flawed #PLoSBio paper on # of species on the planet) was mightily pissed off about the estimates of the number:
Their estimates of ~ 10,000 or so bacteria and archaea on the planet are so completely out of touch in my opinion that this calls into question the validity of their method for bacteria and archaea at all.

The fuss over the number of bacteria and archaea seems to me to be largely a misunderstanding of how taxonomic databases count taxa. Databases like Catalogue of Life record described species, and most bacteria aren't formally described because they can't be cultured. Hence there will always be a disparity between the extent of diversity revealed by phylogenetics and by classical taxonomy.

The PLoS Biology paper has garnered a lot more reaction than the Systematic Biology paper (e.g., the commentary by Carl Zimmer in the New York Times, How Many Species? A Study Says 8.7 Million, but It’s Tricky), which arguably has the more dramatic conclusion.

How many species, 8.7 million, or 1.8 to 2.0 million?

Whereas Mora et al. in PLoS Biology concluded that there are some 8.7 million (±1.3 million SE) species on the planet, Costello et al. in Systematic Biology arrive at a much more conservative figure (1.8 to 2.0 million). The implications of these two studies are very different: one implies there's a lot of work to do, the other leads to headlines such as 'Every species on Earth could be discovered within 50 years'.

What is intriguing is that both studies use the same databases, Catalogue of Life and the World Register of Marine Species, and yet arrive at very different results.

So, the question is, how did we arrive at two very different answers from the same data?