Showing posts with label TDWG.

Wednesday, October 04, 2017

TDWG 2017: thoughts on day 3

Day three of TDWG 2017 highlighted some of the key obstacles facing biodiversity informatics.

After a fun series of "wild ideas" (nobody will easily forget David Bloom's "Kill your Darwin Core darlings") we had a wonderful keynote by Javier de la Torre (@jatorre) entitled "Everything happens somewhere, multiple times". Javier is CEO and founder of Carto, which provides tools for amazing geographic visualisations. Javier provided some pithy observations on standards, particularly the fate of official versus unofficial "community" standards (the community standards tend to be simpler, easier to use, and hence win out), and the potentially stifling effects standards can have on innovation, especially if conforming to standards becomes the goal rather than merely a feature.

The session Using Big Data Techniques to Cross Dataset Boundaries - Integration and Analysis of Multiple Datasets demonstrated the great range of things people want to do with data, but made little progress on integration. It still strikes me as bizarre that we haven't made much progress on minting and reusing identifiers for the same entities that we keep referring to. Channeling Steve Ballmer:

Identifiers, identifiers, identifiers, identifiers

It's also striking to compare Javier de la Torre's work with Carto where there is a clear customer-driven focus (we need these tools to deliver this to users so that they can do what they want to do) versus the much less focussed approach of our community. Many of the things we aspire to won't happen until we identify some clear benefits for actual users. There's a tendency to build stuff for our own purposes (e.g., pretty much everything I do) or build stuff that we think people might/should want, but very little building stuff that people actually need.

TDWG also has something of an institutional memory problem. Franck Michel gave an elegant talk entitled A Reference Thesaurus for Biodiversity on the Web of Linked Data which discussed how the Muséum national d'Histoire naturelle's taxonomic database could be modelled in RDF (see for example http://taxref.mnhn.fr/lod/taxon/60878/10.0). There's a more detailed description of this work here:


What struck me was how similar this was to the now deprecated TDWG LSID vocabulary, still used by most of the major taxonomic name databases (the nomenclators). This is an instance where TDWG had a nice, workable solution, it lapsed into oblivion, only to be subsequently reinvented. This isn't to take anything away from Franck's work, which has a thorough discussion of the issues, and has a nice way to handle the difference between asserting that two taxa are the same (owl:equivalentClass) and that a taxon/name hybrid (which is what many databases serve up because they don't distinguish between names and taxa) and a taxon might be the same (linking via the name they both share).

The fate of the RDF served by the nomenclators for the last decade illustrates a point I keep returning to (see also EOL Traitbank JSON-LD is broken). We tend to generate data and standards because it's the right thing to do, rather than because there's actually a demonstrable need for that data and those standards.

Bitcoin, biodiversity, and micropayments for open data

I gave a "wild ideas" talk at TDWG17 suggesting that the biodiversity community use Bitcoin to make micropayments to use data.

The argument runs like this:

  1. We like open data because it's free and it makes it easy to innovate, but we struggle to (a) get it funded and (b) demonstrate its value (hence pleas for credit/attribution, and begging for funding).
  2. The alternative of closed data, such as paying a subscription to access a database, limits access and hence use and innovation, but generates an income to support the database, and the value of the database is easy to measure (it's how much money it generates).
  3. What if we have a "third model" where we pay small amounts of money to access data (micropayments)?

Micropayments as a way to pay creators is an old idea (it was part of Ted Nelson's Xanadu vision). Now that we have cryptocurrencies such as Bitcoin, micropayments are feasible. So we could imagine something like this:

  1. Access to raw datasets is free (you get what you pay for)
  2. Access to cleaned data comes at a cost (you are paying someone else to do the hard, tedious work of making the data usable)
  3. Micropayments are made using Bitcoin
  4. To help generate funds any spare computational capacity in the biodiversity community is used to mine Bitcoins

After the talk Dmitry Mozzherin sent me a link to Steem, and then this article about Steemit appeared in my Twitter stream:

Clearly this is an idea that has been bubbling around for a while. I think there is scope for thinking about ways to combine a degree of openness (we don't want to cripple access and innovation) with a way to fund that openness (nobody seems interested in giving us money to be open).

Tuesday, October 03, 2017

TDWG 2017: thoughts on day 1

Some random notes on the first day of TDWG 2017. First off, great organisation with the first usable conference calendar app that I've seen (https://tdwg2017.sched.com).

I gave the day's keynote address in the morning (slides below).

It was something of a stream of consciousness brain dump, and tried to cover a lot of (maybe too much) stuff. Among the topics I covered were Holly Bik's appeal for better links between genomic and taxonomic data, my iSpecies tool, some snarky comments on the Semantic Web (and an assertion that the reason that GenBank succeeded was due more to network effects than journals requiring authors to submit sequences there), a brief discussion of Wikidata (including using d3sparql to display classifications, see here), and the use of Hexastore to query data from BBC Wildlife. I also talked about Ted Nelson, Xanadu, using hypothes.is to annotate scientific papers (see Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday), social factors in building knowledge graphs (touching on ORCID and some of the work by Nico Franz discussed here), and ended with some cautionary comments on the potential misuse of metrics based on knowledge graphs (using "league tables" of cited specimens, see GBIF specimens in BioStor: who are the top ten museums with citable specimens?).

TDWG is a great opportunity to find out what is going on in biodiversity informatics, and also to get a sense of where the problems are. For example, sitting through the Financial Models for Sustaining Biodiversity Informatics Products session you couldn't help being struck by (a) the number of different projects all essentially managing specimen data, and (b) the struggle they all face to obtain funding. If this was a commercial market there would be some pretty drastic consolidation happening. It also highlights the difficulty of providing services to a community that doesn't have much money.

I was also struck by Andrew Bentley's talk Interoperability, Attribution, and Value in the Web of Natural History Museum Data. In a series of slides Andrew outlined what he felt collections needed from aggregators, researchers, and publishers, e.g.:

Chatting to Andrew at the evening event at the Canadian Museum of Nature, I think there's a lot of potential for developing tools to provide collections with data on the use and impact of their collections. Text mining the biodiversity literature on a massive scale to extract (a) mentions of collections (e.g., their institutional acronyms) and (b) citations of specimens could generate metrics that would be helpful to collections. There's a great opportunity here for BHL to generate immediate value for natural history collections (many of which are also contributors to BHL).

Also had a chance to talk to Jorrit Poelen who works on Global Biotic Interactions (GloBI). He made some interesting comparisons between Hexastores (which I'd touched on in my keynote) and Linked Data Fragments.

The final session I attended was Towards robust interoperability in multi-omic approaches to biodiversity monitoring. The overwhelming impression was that there is a huge amount of genomic data, much of which does not easily fit into the classic, Linnean view of the world that characterises, say, GBIF. For most of the sequences we don't know what they are, and that might not be the most interesting question anyway (more interesting might be "what do they do?"). The extent to which these data can be shoehorned into GBIF is not clear to me, although doing so may result in some healthy rethinking of the scope of GBIF itself.

Tuesday, July 28, 2015

Modelling taxonomic names in databases

Quick notes on modelling taxonomic names in databases, as part of an ongoing discussion elsewhere about this topic.

Simple model

One model that is widely used (e.g., ITIS, WoRMS) and which is explicit in Darwin Core Archive is something like this:

Model1

We have a table for taxa and we don't distinguish between taxa and their names. The taxonomic hierarchy is represented by the parentID field, which points to your parent. If you don't have a (non NULL) value for parentID you are not an accepted taxon (i.e., you are a synonym), and the field acceptedID points to the accepted taxon. Simple, fits in a single database table (or, let's be honest, an Excel spreadsheet).

The tradeoff is that you conflate names and taxa, you can't easily describe name-only relationships (e.g., homonyms, nomenclatural synonyms) without inventing "taxa" for each name.
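A minimal sketch of this single-table model, using SQLite from Python. The column names follow the fields described above; the integer ids are invented for illustration, and the sketch treats a row as a synonym when acceptedID is set (an assumption made here because the root of the tree also has no parent).

```python
import sqlite3

# In-memory database; the ids below are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE taxa (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        parentID   INTEGER REFERENCES taxa(id),  -- NULL for synonyms (and the root)
        acceptedID INTEGER REFERENCES taxa(id)   -- set only for synonyms
    )
    """
)
conn.executemany(
    "INSERT INTO taxa VALUES (?, ?, ?, ?)",
    [
        (1, "Poissonia", None, None),          # root of this toy tree
        (2, "Poissonia heterantha", 1, None),  # accepted taxon
        (3, "Coursetia heterantha", None, 2),  # synonym stored as a "taxon"
        (4, "Tephrosia heterantha", None, 2),  # synonym stored as a "taxon"
    ],
)


def synonyms_of(conn, taxon_id):
    """Return the names of all rows whose acceptedID points at taxon_id."""
    rows = conn.execute(
        "SELECT name FROM taxa WHERE acceptedID = ? ORDER BY name", (taxon_id,)
    )
    return [name for (name,) in rows]


def accepted_name(conn, name):
    """Follow acceptedID (if any) from a name to the accepted taxon's name."""
    row = conn.execute(
        """
        SELECT a.name
        FROM taxa AS t
        JOIN taxa AS a ON a.id = COALESCE(t.acceptedID, t.id)
        WHERE t.name = ?
        """,
        (name,),
    ).fetchone()
    return row[0] if row else None
```

Note that every synonym needs its own row in the taxa table, which is exactly the conflation of names and taxa described above.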

Separating names and taxa

The next model, which I've drawn rather clunkily below as if you were doing this in a relational database, is based on the TDWG LSID vocabularies. One day someone will explain why the biodiversity informatics community basically ignored this work, despite the fact that all the key nomenclators use it.

Model2

In this model we separate out names as first-class objects with globally unique identifiers. The taxa table refers to the names table when it mentions a name. Any relationships between names are handled separately from taxa, so we can easily handle things like replacement names for homonyms, basionyms, etc. Note that we can also remove a lot of extraneous stuff from the taxa table. For example, if we decide that Poissonia heterantha is the accepted name for a taxon, we don't need to create taxa for Coursetia heterantha or Tephrosia heterantha, because by definition those names are synonyms of Poissonia heterantha.

The other great advantage of this model is that it enables us to take the work the nomenclators have done straight without having to first shoe-horn it into the Darwin Core format, which assumes that everything is a taxon.
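A rough sketch of the separated model, again in SQLite via Python. The table layout, the relationship label "synonym_of", and the identifiers such as "name:1" are placeholders assumed for illustration; they are not taken from the TDWG LSID vocabularies themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE names (
        id   TEXT PRIMARY KEY,  -- globally unique identifier (placeholder values)
        name TEXT NOT NULL
    );
    CREATE TABLE name_relationships (
        from_name TEXT NOT NULL REFERENCES names(id),
        kind      TEXT NOT NULL,  -- e.g. synonym, basionym, replacement name
        to_name   TEXT NOT NULL REFERENCES names(id)
    );
    CREATE TABLE taxa (
        id       INTEGER PRIMARY KEY,
        nameID   TEXT NOT NULL REFERENCES names(id),
        parentID INTEGER REFERENCES taxa(id)
    );
    """
)
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("name:1", "Poissonia heterantha"),
        ("name:2", "Coursetia heterantha"),
        ("name:3", "Tephrosia heterantha"),
    ],
)
conn.executemany(
    "INSERT INTO name_relationships VALUES (?, ?, ?)",
    [
        ("name:2", "synonym_of", "name:1"),
        ("name:3", "synonym_of", "name:1"),
    ],
)
# Only the accepted name needs a taxon; the synonyms live in the names tables.
conn.execute("INSERT INTO taxa VALUES (1, 'name:1', NULL)")


def names_for_taxon(conn, taxon_id):
    """Return the accepted name first, then the names recorded as its synonyms."""
    accepted = conn.execute(
        "SELECT n.id, n.name FROM taxa AS t JOIN names AS n ON n.id = t.nameID "
        "WHERE t.id = ?",
        (taxon_id,),
    ).fetchone()
    if accepted is None:
        return []
    name_id, name = accepted
    synonyms = conn.execute(
        "SELECT n.name FROM name_relationships AS r "
        "JOIN names AS n ON n.id = r.from_name "
        "WHERE r.to_name = ? AND r.kind = 'synonym_of' ORDER BY n.name",
        (name_id,),
    )
    return [name] + [s for (s,) in synonyms]
```

Here the taxa table holds a single row, while all three names remain available and linkable by their identifiers.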

Wednesday, July 22, 2015

Steve Baskauf on RDF and the "Rod Page Challenge"

Steve Baskauf has concluded a thoughtful series of blog posts on RDF and biodiversity informatics with http://baskauf.blogspot.co.uk/2015/07/confessions-of-rdf-agnostic-part-7.html. In this post he discussed the "Rod Page Challenge", which was a series of grumpy posts I wrote (starting with this one) where I claimed RDF basically sucked, and to illustrate this I issued a challenge for people to do something interesting with some RDF I provided. Since this RDF didn't have a stable home I've put it on GitHub and it has a DOI http://dx.doi.org/10.5281/zenodo.20990 courtesy of GitHub's integration with Zenodo.

I argued that the RDF typically available was basically useless because it wasn't adequately linked (see Reflections on the TDWG RDF "Challenge"). Two of the RDF files I provided were created specifically to tackle this problem (derived from my projects iPhylo Linkout http://dx.doi.org/10.1371/currents.RRN1228 and the precursor to BioNames http://dx.doi.org/10.7717/peerj.190). This marked pretty much the end of any interest I had in pursuing RDF.

Towards the end of Steve's post he writes:

At the close of my previous blog post, in addition to revisiting the Rod Page Challenge, I also promised to talk about what it would take to turn me from an RDF Agnostic into an RDF Believer. I will recap the main points about what I think it will take in order for the Rod Page Challenge to REALLY be met (i.e. for machines to make interesting inferences and provide humans with information about biodiversity that would not be obvious otherwise):

  1. Resource descriptions in RDF need to be rich in triples containing object properties that link to other IRI-identified resources.
  2. "Discovery" of IRI-identified resources is more likely to lead to interesting information when the linked IRIs are from Internet domains controlled by different providers.
  3. Materialized entailed triples do not necessarily lead to "learning" useful things. Materialized entailed triples are useful if they allow the construction of more clever or meaningful queries, or if they state relationships that would not be obvious to humans.

Steve's point 1 is essentially the point I was making with the challenge. At the time of the challenge, RDF from major biodiversity informatics projects was in silos, with few (if any) links to external resources (the kinds of things Steve refers to in his point 2). As a result, the promised benefits from RDF simply haven't materialised. The lesson I took from this is that we need rich, dense cross-links between data sources (the "biodiversity knowledge graph"), and that's one reason I've been obsessed with populating BioNames, which links animal names to the primary literature (I'm planning to extend this to plants as well). Turns out, creating lots of cross links is really hard work, much harder than simply pumping out a bunch of RDF and waiting for it to automagically coalesce into an all-connected knowledge graph.

I posed the challenge back in 2011, and since then I think the landscape has changed to the extent that I wonder if trying to "fix" RDF is really the way forward.

XML is dead

Anyone (sane) developing for the web and wanting to move data around is using JSON; XML is hideous and best avoided. Much of the early work on RDF used XML, which only made things even harder than they already were. JSON beats XML, to the extent that RDF itself now has a JSON serialisation, JSON-LD. But JSON-LD is about more than the semantic web (see JSON-LD and Why I Hate the Semantic Web), and has the great advantage that you can actually ignore all the RDF cruft (i.e., the namespaces) and simply treat the data as key-value pairs (yay!). Once you do that, then you can have fun with the data, especially with databases such as CouchDB ("fun" and "database" in the same sentence, I know!).
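To illustrate the "ignore the cruft" approach, here is a small Python sketch. The JSON-LD document is a made-up example (the choice of schema.org terms is an assumption), describing a paper cited elsewhere on this blog; the helper simply drops the "@" keywords and keeps the plain key-value pairs.

```python
import json

# A hypothetical JSON-LD record; the vocabulary choice is an assumption.
JSONLD = """
{
  "@context": "http://schema.org",
  "@type": "ScholarlyArticle",
  "@id": "http://dx.doi.org/10.1080/03014223.1983.10423904",
  "name": "Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae)",
  "datePublished": "1983"
}
"""


def as_key_values(doc):
    """Drop JSON-LD keywords (keys starting with '@') and keep the rest."""
    return {key: value for key, value in doc.items() if not key.startswith("@")}


record = as_key_values(json.loads(JSONLD))
```

This only works cleanly when the keys are simple terms rather than prefixed or full IRIs, which is part of the appeal of a well-chosen context.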

Key-value pairs, document stores, and graph databases

The NoSQL "movement" has thrown up all sorts of new ways to handle data and to think about databases. We can think of RDF as describing a graph, but it carries the burden of all the namespaces, vocabularies, and ontologies that come with it. Compare that with the fun (there's that word again) of graph databases such as Neo4J with its graph gists. The Neo4J folks have done a great job of publicising their approach, and making it easy and attractive to play with.

So, we're in an interesting time when there are a bunch of technologies available, and I think maybe it's time to ask whether the community's allegiance to RDF and the Semantic Web has been somewhat misplaced...

Tuesday, October 21, 2014

On identifiers (again)

I'm going to the TDWG Identifier Workshop this weekend, so I thought I'd jot down a few notes. The biodiversity informatics community has been at this for a while, and we still haven't got identifiers sorted out.

From my perspective as both a data aggregator (e.g., BioNames) and a data provider (e.g., BioStor) there are four things I think we need to tackle in order to make significant progress.

Discoverability (strings to things)


A basic challenge is to go from strings, such as bibliographic citations, specimen codes, taxonomic names, etc., to digital identifiers for those things. Most of our data is not born digital, and so we spend a lot of time mapping strings to identifiers. For example, publishers do this a lot when they take the list of literature cited at the end of a manuscript and add DOIs. Hence, one of the first things CrossRef did was provide a discovery service for publishers. This has now morphed into a very slick search tool http://search.crossref.org. Without discoverability, nobody is going to find the identifiers in the first place.
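As a hedged sketch of "strings to things" for citations, the Python below builds a query against the Crossref REST API (api.crossref.org, a service not mentioned in this post) using its `query.bibliographic` and `rows` parameters, and pulls the first DOI out of a response. The function names are invented for illustration, and no network request is made here.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def citation_query_url(citation, rows=1):
    """Build a Crossref works query URL for a free-text citation string."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def first_doi(response):
    """Return the DOI of the first hit in a decoded Crossref JSON response."""
    items = response.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None


url = citation_query_url("Page 1983 Pinnotheres")
```

The top hit is only a candidate match, so a real pipeline would still need to compare the returned metadata against the original string before accepting the DOI.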

Resolvability


Given an identifier it has to be resolvable (for both people and machines), and I'd argue that at least in the early days of getting that identifier accepted, there needs to be a single point of resolution. Some people are arguing that we should separate identifiers from their resolution, partly based on arguments that "hey, we can always Google the identifier". This argument strikes me as wrong-headed for several reasons.

Firstly, Google is not a resolution service. There's no API, so it's not scalable. Secondly, if you Google an identifier (e.g., 10.7717/peerj.190) you get a bunch of hits, but which one is the definitive source of information on the thing with that identifier? It's not at all obvious, and indeed this is one of the reasons publishers adopted DOIs in the first place. If you Google a paper you can get all sorts of hits and all sorts of versions (preprint, manuscripts, PDFs on multiple servers, etc.). In contrast the DOI gives you a way to access the definitive version.

Another way of thinking about this is in terms of trust. At some point down the road we might have tools that can assess the trustworthiness of a source, and we will need these if we develop decent tools to annotate data (see More on annotating biodiversity data: beyond sticky notes and wikis). But until then the simplest way to engender trust is to have a single point of resolution (like http://dx.doi.org for DOIs). Think about how people now trust DOIs. They've become a mark of respectability for journals (no DOIs, you're not a serious journal), and new ideas such as citing diagrams and data gained further credence once sites like figshare started using DOIs.

Another reason resolvability matters is that I think it's a litmus test of how serious we are. One reason LSIDs failed is that we made them too hard to resolve, and as a consequence people simply minted "fake" LSIDs, dumb strings that didn't resolve. Nobody complained (because, let's face it, nobody was using them), so LSIDs became devalued to the point of uselessness. Anybody can mint a string and call it an identifier; if it costs nothing, that's a good estimate of its actual value.

Persistence


Resolvability leads to persistence. Sometimes we hear the cliche that "persistence is a social matter, not a technological one". This is a vacuous platitude. The kind of technology adopted can have a big impact on the sociology.

The easiest form of identifier is a simple HTTP URL. But let's think about what happens when we use them. If I spend a lot of time mapping my data to somebody else's URLs (e.g., links to papers or specimens) I am taking a big risk in assuming that the provider of those URLs will keep those "live". At the same time, in linking to those URLs, I constrain the provider - if they decide that their URL scheme isn't particularly good and want to change it (or their institution decides to move to new servers or a new domain), they will break resources like mine that link to them. So a decision they made about their URL structure - perhaps late one Friday afternoon in one of those meetings where everybody just wants to go to the pub - will come back to haunt them.

One way to tackle this is indirection, which is the idea behind DOIs and PURLs, for example. Instead of directly linking to a provider URL, we link to an intermediate identifier. This means that I have some confidence that all my hard work won't be undone (I have seen whole journals disappear because somebody redesigned an institutional web site), and the provider can mess with different technologies for serving their content, secure in the knowledge that external parties won't be affected (because they link to the intermediate identifier). Programmers will recognise this as encapsulation.
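The indirection idea can be sketched in a few lines of Python. This is a toy in-memory resolver, not how DOIs or PURLs are actually implemented; the identifier "id:123" and the example.org URLs are invented.

```python
class Resolver:
    """Toy resolver mapping persistent identifiers to current provider URLs."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        """Create or update the URL that an identifier currently points to."""
        self._targets[identifier] = url

    def resolve(self, identifier):
        """Return the current URL for an identifier (KeyError if unknown)."""
        return self._targets[identifier]


resolver = Resolver()
resolver.register("id:123", "http://old.example.org/paper?id=123")

# An aggregator stores the identifier, never the provider's URL.
my_links = ["id:123"]

# The provider redesigns its site and updates the single central mapping.
resolver.register("id:123", "https://new.example.org/articles/123")

resolved = [resolver.resolve(link) for link in my_links]
```

The aggregator's stored links are untouched by the provider's change, which is the encapsulation point made above.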

Some have argued that we can achieve persistence by simply insisting on it. For example, we fire off a memo to the IT folks saying "don't break these links!". Really? We have that degree of power over our institutional IT policies? This also misses the great opportunity that centralised indirection provides us with. In the case of DOIs for publications, CrossRef sits in the middle, managing the DOIs (in the sense that if a DOI breaks you have a single place to go and complain). Because they also aggregate all the bibliographic metadata, they are automatically able to support discoverability (they can easily map bibliographic metadata to DOIs). So by solving persistence we also solve discoverability.

Network effects


Lastly, if we are serious about this we need to think about how to engineer the widespread adoption of the identifier. In other words, I think we need network effects. When you join a social networking site, one of the first things they do is ask permission to see your "contacts" (who you already know). If any of those people are already on the network, you can instantly see that ("hey, Jane is here, and so is Bob"). Likewise, the network can target those you know who aren't on the network and prompt them to join.

If we are going to promote the use of identifiers, then it's no use thinking about simply adding identifiers to things, we need to think about ways to grow the network, ideally by adding networks at a time (like a person's list of contacts), not single records. CrossRef does this with articles: when publishers submit an article to CrossRef, they are encouraged to submit not just that article and its DOI, but the list of all references in the list of literature cited, identified where possible by DOIs. This means CrossRef is building a citation graph, so it can quickly demonstrate value to its members (through cited-by linking).

So, we need to think of ways of demonstrating value, and growing the network of identifiers more rapidly than one identifier at a time. Otherwise, it is hard to see how it would gain critical mass. In the context of, say, specimens, I think an obvious way to do this is have services that tell a natural history collection how many times its specimens have been cited in the primary literature, or have been used as vouchers for DNA sequences. We can then generate metrics of use (as well as start to trace the provenance of our data).
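A minimal sketch of such a service in Python, assuming we already have, for each paper, the list of specimen codes it cites. The paper ids, the acronyms "MUSA" and "MUSB", and the convention that a specimen code starts with the institutional acronym are all invented for illustration.

```python
from collections import Counter

# Hypothetical input: each paper maps to the specimen codes it cites.
citations = {
    "paper-1": ["MUSA 1001", "MUSA 1002", "MUSB 77"],
    "paper-2": ["MUSA 1001"],
    "paper-3": ["MUSB 77", "MUSB 78"],
}


def collection_of(specimen_code):
    """Assume the institutional acronym is the first token of the code."""
    return specimen_code.split()[0]


def citation_counts(citations):
    """Count citations per specimen and per collection across all papers."""
    per_specimen = Counter()
    per_collection = Counter()
    for specimens in citations.values():
        for code in specimens:
            per_specimen[code] += 1
            per_collection[collection_of(code)] += 1
    return per_specimen, per_collection


per_specimen, per_collection = citation_counts(citations)
```

Because each paper contributes a whole list of specimens, the graph grows a network at a time rather than one record at a time.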


Summary


I've no idea what will come out of the TDWG Workshop, but my own view is that unless we tackle these issues, and have a clear sense of how they interrelate, then we won't make much progress. These things are intertwined, and locally optimal solutions ("hey, it's easy, I'll just slap a URL on everything") aren't enough ("OK, how exactly do I find your URL? What happens when it breaks?"). If we want to link stuff together as part of the infrastructure of biodiversity informatics, then we need to think strategically. The goal is not to solve the identifier problem, the goal is to build the biodiversity knowledge graph.

Thursday, June 14, 2012

Taxonomy and the nine billion names of God

In Arthur C. Clarke's short story The Nine Billion Names of God Tibetan monks hire two programmers to help them generate all the possible names of God. The monks believe that the purpose of the Universe is to generate those names; once that goal is achieved the Universe will end. As the understandably skeptical programmers leave having completed their task, they look up into the sky and notice that "overhead, without any fuss, the stars were going out."

Leaving aside the delicious irony that arises if we recast this story with the monks replaced by taxonomists, much of our work with taxonomic names seems to be enumerating endless permutations of the same names. Part of the problem is the way some databases store and provide access to names.


The simplest way to represent a taxonomic name is to just have the name (the "canonical name"), without additional bits such as the taxonomic authority. In my view, any taxonomic database that serves names should provide the canonical name. I'm not arguing that they shouldn't provide taxonomic authority information (ideally separately, but could also be as part of a canonical name + authority string), I just want them to also provide just the canonical name. For some reason this seems to upset people (e.g., this thread on the TDWG mailing lists), so let me explain why I think this matters.

Most people use taxonomic names without the authority (just Google a taxonomic name with and without its authority and compare the number of hits). So, if your goal is to be of service to your users, make sure you provide the canonical name.

Then there is the issue of integrating data from different sources. The more parts to the name the more scope there is for ambiguity. For example, my first ever publication was a description of a new species of pea crab, Pinnotheres atrinicola, published in:

Page, R. D. M. (1983). Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae). New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904

If we look for this name in ION we discover three records:

Pinnotheres atrinacola            urn:lsid:organismnames.com:name:1192320
Pinnotheres atrinicola            urn:lsid:organismnames.com:name:371872
Pinnotheres atrinicola Page 1983  urn:lsid:organismnames.com:name:371873


Two are duplicates of "Pinnotheres atrinicola", with and without the authority, one is a misspelling ("Pinnotheres atrinacola"). Given just the name we already see that it's easy for people to get the spelling wrong and generate lexical variants.

If we now add the authority we get more potential for variation. ION write the authority as "Page 1983" (no comma), but other databases such as WoRMS write it as Page, 1983 (with comma). So we now have two variations of the name, and two for the authority, so 4 possible strings if we include both name and authority. This combinatorial explosion means that we can rapidly generate lots of strings that are fundamentally the same.
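The arithmetic is easy to check with a short Python sketch that pairs every spelling of the name with every way of writing the authority (the variants are the ones mentioned above):

```python
from itertools import product

names = ["Pinnotheres atrinicola", "Pinnotheres atrinacola"]
authorities = ["Page 1983", "Page, 1983"]

# Every name spelling combined with every authority format.
variants = [f"{name} {authority}" for name, authority in product(names, authorities)]
```

Each extra spelling or formatting convention multiplies the total, which is why the number of strings grows so quickly.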

I'm not arguing that taxonomic authorities aren't useful, and I want them wherever they are known, but insisting that databases serve name + authority to the exclusion of just the canonical name is a recipe for disaster. One could argue that users can parse the string into name and authority components, but that's a headache (just take a look at taxon-name-processing for details). Why make users jump through hoops to get basic information?

Another reason I'm wary of taxonomic authority strings is that people don't always understand the conventions. For example, in my previous post I used the following example for names that differed in authority string:

  • Demansia torquata Günther 1862
  • Demansia torquata (Günther, 1862)

The use of parentheses seems a small difference, but (a) it means the strings are different, and (b) the presence or absence of parentheses changes the meaning of the authority. In this example, Demansia torquata Günther 1862 means that Günther is the original author of the name Demansia torquata, and so if I search Günther's publications from 1862 for "Demansia torquata" I will find that name. Demansia torquata (Günther, 1862), on the other hand, means that Günther originally described this species in 1862, but he placed it in a different genus, so my search for "Demansia torquata" in 1862 is likely to be fruitless. So, if the authority is actually (Günther, 1862) but a database tells me it's Günther, 1862 I'd be wasting my time looking for the name in 1862.
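A small Python sketch of how the parentheses convention might be read by a program. The function name is made up, and it only handles the simple zoological case described above, where the whole authority is wrapped in parentheses.

```python
def parse_authority(authority):
    """Return (author/year text, moved) for a zoological authority string.

    moved is True when the authority is wrapped in parentheses, meaning the
    species was originally described in a different genus.
    """
    text = authority.strip()
    moved = text.startswith("(") and text.endswith(")")
    if moved:
        text = text[1:-1].strip()
    return text, moved


original = parse_authority("Günther 1862")
recombined = parse_authority("(Günther, 1862)")
```

If a database silently drops or adds the parentheses, the second value flips and a search of the original publication will be looking for the wrong combination.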

As it turns out, this snake was originally described as Diemansia torquata (see "On new species of snakes in the collection of the British Museum" http://biostor.org/reference/50221). The genus name Diemansia differs from Demansia, hence (Günther, 1862) should be correct, but it looks like Diemansia and Demansia are just some of the variations of the same snake genus (see for example http://biodiversitylibrary.org/page/22393791). *Sigh*

Variation in taxonomic authority extends beyond parentheses. In a post on clustering strings I used examples of taxonomic authorities for the genus Helicella:

Ferrusac 1821
Bonavita 1965
Ferussa 1821
Fer.
Lamarck 1812
Ferussac 1821

There are six different strings here which correspond to three different authorities. In this example the name Helicella is a homonym (same name used for different taxa) so having the taxonomic authority can help decide which name is actually meant, but people can't seem to agree on how to spell the authority names, and in other cases they might not agree on dates of publication, hence we get variations such as those above. Even when authorities are useful, they come at a cost. And that's not even considering chresonyms where the authority isn't the original author, but instead is a form of citation of the use of a name.

All of this variation is a cause of ambiguity, and when we combine permutations of taxonomic names and taxonomic authorities, things start to get messy. Indeed, I'd argue that projects such as the Global Names Index (GNI) are essentially doing what Arthur C. Clarke's monks were doing, trying to capture near endless permutations of the same names. Given this, it seems crazy not to try and keep things as simple as possible. In the vast majority of cases I want the name, I don't want the rest of the cruft attached to it. Taxonomic authorities are really just proxies for citation, so let's focus on getting that information linked to names, and stop making life difficult for users.

Friday, October 21, 2011

Final thoughts on TDWG RDF challenge

Quick final comment on the TDWG Challenge - what is RDF good for? As I noted in the previous post, Olivier Rovellotti (@orovellotti) and Javier de la Torre (@jatorre) have produced some nice visualisations of the frog data set:
Cartodb
Nice as these are, I can't help feeling that they actually help make my point about the current state of RDF in biodiversity informatics. The only responses to my challenge have been to use geography, where the shared coordinate system (latitude and longitude) facilitates integration. Having geographic coordinates means we don't need to have shared identifiers to do something useful, and I think it's no accident that GBIF is one of the most important resources we have. Geography is also the easiest way to integrate across other fields (e.g., climate).

But what of the other dimensions? What I'm really after are links across datasets that enable us to make new inferences, or address interesting questions. The challenge is still there...

Thursday, October 20, 2011

Reflections on the TDWG RDF "Challenge"

This is a follow up to my previous post TDWG Challenge - what is RDF good for? where I'm being, frankly, a pain in the arse, and asking why we bother with RDF? In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF. In other words, if you like RDF and buy the promise of large-scale data integration while still being decentralised ("the web as database"), then we're doing it wrong.

As an aside, my own perspective is one of data integration. I want to link all this stuff together so I can follow a path through multiple datasets and extract the information I want. In other words, "linked data" (little "l", little "d"). I'm interested in fairly lightweight integration, typically through shared identifiers. There is also integration via ontologies, which strikes me as a different, if related, problem, that in many ways is closer to the original vision of the Semantic Web as a giant inference engine. I think the concerns (and experience) of these two communities are somewhat different. I don't particularly care about ontologies, I want key-value pairs and reusable identifiers so I can link stuff together. If, for example, you're working on something like Phenoscape, then I think you have a rather more circumscribed set of data, with potentially complicated interrelationships that you want to make inferences on, in which case ontologies are your friend.
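A sketch of what this lightweight integration looks like in Python, using plain dictionaries as stand-ins for separate datasets. The values come from the frog table in this post; how they are split across three "datasets" is an assumption made for illustration.

```python
# Three toy datasets that share identifiers (species names and DOIs).
status_by_name = {
    "Atelopus nanay": "CR",
    "Hyla suweonensis": "DD",
}
doi_by_name = {
    "Atelopus nanay": "http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2",
    "Hyla suweonensis": "http://dx.doi.org/10.2307/1444138",
}
year_by_doi = {
    "http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2": 2002,
    "http://dx.doi.org/10.2307/1444138": 1980,
}


def link(name):
    """Follow name -> status and name -> DOI -> year via shared identifiers."""
    doi = doi_by_name.get(name)
    return {
        "species": name,
        "status": status_by_name.get(name),
        "doi": doi,
        "year": year_by_doi.get(doi),
    }


row = link("Hyla suweonensis")
```

The join only works because the same strings are reused as keys in each dataset; any variation in the name or the DOI breaks the path.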

So, I posted a "challenge". It wasn't a challenge so much as a set of RDF to play with. What I'm interested in is seeing how easily we can string this data together to learn stuff. For example, using the RDF I posted earlier, here is a table listing the name, conservation status, publication DOI and date, and (where available) image from Wikipedia for frogs with sequences in GenBank.

| Species | Status | DOI | Year described | Image |
| Atelopus nanay | CR | http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2 | 2002 | |
| Eleutherodactylus mariposa | CR | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Phrynopus kauneorum | CR | http://dx.doi.org/10.2307/1565993 | 2002 | |
| Eleutherodactylus eunaster | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Eleutherodactylus amadeus | CR | http://dx.doi.org/10.2307/1445557 | 1987 | |
| Eleutherodactylus lamprotes | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Churamiti maridadi | CR | http://dx.doi.org/10.1080/21564574.2002.9635467 | 2002 | |
| Eleutherodactylus thorectes | CR | http://dx.doi.org/10.2307/1445381 | 1988 | |
| Eleutherodactylus apostates | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Leptodactylus silvanimbus | CR | http://dx.doi.org/10.2307/1563691 | 1980 | |
| Eleutherodactylus sciagraphus | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Bufo chavin | CR | http://dx.doi.org/10.1643/0045-8511(2001)001[0216:NSOBAB]2.0.CO;2 | 2001 | |
| Eleutherodactylus fowleri | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Ptychohyla hypomykter | CR | http://dx.doi.org/10.2307/3672060 | 1993 | |
| Hyla suweonensis | DD | http://dx.doi.org/10.2307/1444138 | 1980 | |
| Proceratophrys concavitympanum | DD | http://dx.doi.org/10.2307/1565412 | 2000 | |
| Phrynopus bufoides | DD | http://dx.doi.org/10.1643/CH-04-278R2 | 2005 | |
| Boophis periegetes | DD | http://dx.doi.org/10.1111/j.1096-3642.1995.tb01427.x | 1995 | |
| Phyllomedusa duellmani | DD | http://dx.doi.org/10.2307/1444649 | 1982 | |
| Boophis liami | DD | http://dx.doi.org/10.1163/156853803322440772 | 2003 | |
| Hyalinobatrachium ignioculus | DD | http://dx.doi.org/10.1670/0022-1511(2003)037[0091:ANSOHA]2.0.CO;2 | 2003 | |
| Proceratophrys cururu | DD | http://dx.doi.org/10.2307/1447712 | 1998 | |
| Amolops bellulus | DD | http://dx.doi.org/10.1643/0045-8511(2000)000[0536:ABANSO]2.0.CO;2 | 2000 | |
| Centrolene bacatum | DD | http://dx.doi.org/10.2307/1564528 | 1994 | |
| Litoria kumae | DD | http://dx.doi.org/10.1071/ZO03008 | 2004 | |
| Phrynopus pesantesi | DD | http://dx.doi.org/10.1643/CH-04-278R2 | 2005 | |
| Gastrotheca galeata | DD | http://dx.doi.org/10.2307/1443617 | 1978 | |
| Paratelmatobius cardosoi | DD | http://dx.doi.org/10.2307/1447976 | 1999 | |
| Rhacophorus catamitus | DD | http://dx.doi.org/10.1655/0733-1347(2002)016[0046:NAPKPF]2.0.CO;2 | 2002 | |
| Huia melasma | DD | http://dx.doi.org/10.1643/CH-04-137R3 | 2005 | |
| Telmatobius vilamensis | DD | http://dx.doi.org/10.1655/0018-0831(2003)059[0253:ANSOTA]2.0.CO;2 | 2003 | |
| Callulina kisiwamsitu | EN | http://dx.doi.org/10.1670/209-03A | 2004 | |
| Arthroleptis nikeae | EN | http://dx.doi.org/10.1080/21564574.2003.9635486 | 2003 | |
| Eleutherodactylus amplinympha | EN | http://dx.doi.org/10.1139/z94-297 | 1994 | |
| Eleutherodactylus glaphycompus | EN | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Bufo tacanensis | EN | http://dx.doi.org/10.2307/1439700 | 1952 | |
| Phrynopus bracki | EN | http://dx.doi.org/10.2307/1445826 | 1990 | |
| Telmatobius sibiricus | EN | http://dx.doi.org/10.1655/0018-0831(2003)059[0127:ANSOTF]2.0.CO;2 | 2003 | |
| Cochranella mache | EN | http://dx.doi.org/10.1655/03-74 | 2004 | |
| Eleutherodactylus melacara | EN | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Plectrohyla glandulosa | EN | http://dx.doi.org/10.2307/1441046 | 1964 | |
| Aglyptodactylus laticeps | EN | http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x | 1998 | |
| Eleutherodactylus glamyrus | EN | http://dx.doi.org/10.2307/1565664 | 1997 | |
| Gastrotheca trachyceps | EN | http://dx.doi.org/10.2307/1564375 | 1987 | |
| Eleutherodactylus grahami | EN | http://dx.doi.org/10.2307/1563929 | 1979 | |
| Litoria havina | LC | http://dx.doi.org/10.1071/ZO9930225 | 1993 | |
| Crinia riparia | LC | http://dx.doi.org/10.2307/1440794 | 1965 | |
| Litoria longirostris | LC | http://dx.doi.org/10.2307/1443159 | 1977 | |
| Osteocephalus mutabor | LC | http://dx.doi.org/10.1163/156853802320877609 | 2002 | |
| Leptobrachium nigrops | LC | http://dx.doi.org/10.2307/1440966 | 1963 | |
| Pseudis tocantins | LC | http://dx.doi.org/10.1590/S0101-81751998000400011 | 1998 | |
| Mantidactylus argenteus | LC | http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x | 1919 | |
| Aglyptodactylus securifer | LC | http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x | 1998 | |
| Pseudis cardosoi | LC | http://dx.doi.org/10.1163/156853800507264 | 2000 | |
| Uperoleia inundata | LC | http://dx.doi.org/10.1071/AJZS079 | 1981 | |
| Litoria pronimia | LC | http://dx.doi.org/10.1071/ZO9930225 | 1993 | |
| Litoria paraewingi | LC | http://dx.doi.org/10.1071/ZO9760283 | 1976 | |
| Philautus aurifasciatus | LC | http://dx.doi.org/10.1163/156853887X00036 | 1987 | |
| Proceratophrys avelinoi | LC | http://dx.doi.org/10.1163/156853893X00156 | 1993 | |
| Osteocephalus deridens | LC | http://dx.doi.org/10.1163/156853800507525 | 2000 | |
| Gephyromantis boulengeri | LC | http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x | 1919 | |
| Crossodactylus caramaschii | LC | http://dx.doi.org/10.2307/1446907 | 1995 | |
| Rana yavapaiensis | LC | http://dx.doi.org/10.2307/1445338 | 1984 | |
| Boophis lichenoides | LC | http://dx.doi.org/10.1163/156853898X00025 | 1998 | |
| Megistolotis lignarius | LC | http://dx.doi.org/10.1071/ZO9790135 | 1979 | |
| Ansonia endauensis | NE | http://dx.doi.org/10.1655/0018-0831(2006)62[466:ANSOAS]2.0.CO;2 | 2006 | |
| Ansonia kraensis | NE | http://dx.doi.org/10.2108/zsj.22.809 | 2005 | |
| Arthroleptella landdrosia | NT | http://dx.doi.org/10.2307/1565359 | 2000 | |
| Litoria jungguy | NT | http://dx.doi.org/10.1071/ZO02069 | 2004 | |
| Phrynobatrachus phyllophilus | NT | http://dx.doi.org/10.2307/1565925 | 2002 | |
| Philautus ingeri | VU | http://dx.doi.org/10.1163/156853887X00036 | 1987 | |
| Gastrotheca dendronastes | VU | http://dx.doi.org/10.2307/1445088 | 1983 | |
| Hyperolius cystocandicans | VU | http://dx.doi.org/10.2307/1443911 | 1977 | |
| Boophis sambirano | VU | http://dx.doi.org/10.1080/21564574.2005.9635520 | 2005 | |
| Ansonia torrentis | VU | http://dx.doi.org/10.1163/156853883X00021 | 1983 | |
| Telmatobufo australis | VU | http://dx.doi.org/10.2307/1563086 | 1972 | |
| Stefania coxi | VU | http://dx.doi.org/10.1655/0018-0831(2002)058[0327:EDOSAH]2.0.CO;2 | 2002 | |
| Oreolalax multipunctatus | VU | http://dx.doi.org/10.2307/1564828 | 1993 | |
| Eleutherodactylus guantanamera | VU | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Spicospina flammocaerulea | VU | http://dx.doi.org/10.2307/1447757 | 1997 | |
| Cycloramphus acangatan | VU | http://dx.doi.org/10.1655/02-78 | 2003 | |
| Leiopelma pakeka | VU | http://dx.doi.org/10.1080/03014223.1998.9517554 | 1998 | |
| Rana okaloosae | VU | http://dx.doi.org/10.2307/1444847 | 1985 | |
| Phrynobatrachus uzungwensis | VU | http://dx.doi.org/10.1163/156853883X00030 | 1983 | |


This is a small fraction of the frog species actually in GenBank because I've filtered it down to those that have been linked to Wikipedia (from where we get the conservation status) and which were described in papers with DOIs (from which we get the date of description).

I generated this result using this SPARQL query on a triple store that had the primary data sources (Uniprot, Dbpedia, CrossRef, ION) loaded, together with the all-important "glue" datasets that link ION to CrossRef, and Uniprot to Dbpedia (see previous post for details):


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tdwg_tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tdwg_co: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?name ?status ?doi ?date ?thumbnail
WHERE {
  # NCBI taxon (as a Uniprot URI), linked to its Dbpedia page
  ?ncbi uniprot:scientificName ?name .
  ?ncbi rdfs:seeAlso ?dbpedia .
  ?dbpedia dbpedia-owl:conservationStatus ?status .

  # ION name with the same name string, linked to the DOI of its publication
  ?ion tdwg_tn:nameComplete ?name .
  ?ion tdwg_co:publishedInCitation ?doi .
  ?doi dcterms:date ?date .

  # The thumbnail image is optional
  OPTIONAL {
    ?dbpedia dbpedia-owl:thumbnail ?thumbnail
  }
}
ORDER BY ASC(?status)
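To make the role of the "glue" concrete, here is a minimal Python sketch of the same join done by hand over a toy set of triples. It is not how a triple store evaluates SPARQL, just the same pattern-matching idea in miniature. The subject URIs (ex:ncbi/1, ex:ion/1 and so on) and the second species are made-up placeholders; only the predicate names and the Atelopus nanay values come from the query and table above.

```python
# Toy triples. Subject URIs such as "ex:ncbi/1" are hypothetical placeholders.
DOI = "http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2"
TRIPLES = [
    ("ex:ncbi/1", "uniprot:scientificName", "Atelopus nanay"),
    ("ex:ncbi/1", "rdfs:seeAlso", "ex:dbpedia/Atelopus_nanay"),   # glue: NCBI -> Dbpedia
    ("ex:dbpedia/Atelopus_nanay", "dbpedia-owl:conservationStatus", "CR"),
    ("ex:ion/1", "tdwg_tn:nameComplete", "Atelopus nanay"),
    ("ex:ion/1", "tdwg_co:publishedInCitation", DOI),             # glue: ION -> DOI
    (DOI, "dcterms:date", "2002"),
    # An invented species with no glue triples: it cannot appear in the result.
    ("ex:ncbi/2", "uniprot:scientificName", "Hyla example"),
    ("ex:ion/2", "tdwg_tn:nameComplete", "Hyla example"),
]

# The six required triple patterns from the query; "?x" marks a variable.
PATTERNS = [
    ("?ncbi", "uniprot:scientificName", "?name"),
    ("?ncbi", "rdfs:seeAlso", "?dbpedia"),
    ("?dbpedia", "dbpedia-owl:conservationStatus", "?status"),
    ("?ion", "tdwg_tn:nameComplete", "?name"),
    ("?ion", "tdwg_co:publishedInCitation", "?doi"),
    ("?doi", "dcterms:date", "?date"),
]


def match(pattern, binding, triples):
    """Yield every extension of `binding` that makes `pattern` match a triple."""
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it agrees with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield candidate


def query(patterns, triples):
    """Join all patterns in order, keeping only bindings that satisfy every one."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [new for old in bindings for new in match(pattern, old, triples)]
    return bindings


rows = [(b["?name"], b["?status"], b["?doi"], b["?date"])
        for b in query(PATTERNS, TRIPLES)]
print(rows)
```

Dropping the two glue predicates (rdfs:seeAlso and tdwg_co:publishedInCitation) from the toy data leaves the query with no results at all, which is the point of the complaint.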


This table doesn't tell us a great deal, but we could, for example, graph date of description against conservation status (CR=critically endangered, EN=endangered, VU=vulnerable, NT=near threatened, LC=least concern, DD=data deficient, NE=not evaluated):
[Chart: date of description against conservation status]
In other words, is it the case that more recently described species are more likely to be endangered than taxa we've known about for some time (based on the assumption that we've found all the common species already)? We could imagine extending this query to retrieve sequences for a class of frog (e.g., critically endangered) so we could compute a measure of population genetic variation, etc. We shouldn't take the graph above too seriously because it's based on a small fraction of the data, but you get the idea. As more frog taxonomy goes online (there's a lot of stuff in BHL and BioStor, for example) we could add more dates and build a dataset worth analysing properly.
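As a rough illustration of the kind of summary the chart is getting at, here is a small Python sketch that groups description years by conservation status and takes the median. The rows are a handful copied from the table above, chosen arbitrarily, so the numbers say nothing about frogs in general; the function name is my own.

```python
from collections import defaultdict
from statistics import median

# (species, status, year described): a few rows copied from the table above.
ROWS = [
    ("Atelopus nanay", "CR", 2002),
    ("Eleutherodactylus mariposa", "CR", 1992),
    ("Eleutherodactylus eunaster", "CR", 1973),
    ("Litoria havina", "LC", 1993),
    ("Crinia riparia", "LC", 1965),
    ("Mantidactylus argenteus", "LC", 1919),
    ("Philautus ingeri", "VU", 1987),
    ("Boophis sambirano", "VU", 2005),
]


def median_year_by_status(rows):
    """Return {status: median year of description} for the given rows."""
    years = defaultdict(list)
    for _species, status, year in rows:
        years[status].append(year)
    return {status: median(values) for status, values in years.items()}


summary = median_year_by_status(ROWS)
for status in sorted(summary):
    print(status, summary[status])
```

With the full query result in place of this sample, the same grouping would feed straight into a plot.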

It seems to me that these should be fairly simple things to do, yet attempting them today means a world of hurt involving scripts, Excel, data cleaning, etc., before we can do the science.

The thing is, without the "glue" files mapping identifiers across different databases even this simple query isn't possible. Obviously we have no say in how many organisations publish RDF, but within the biodiversity informatics community we should make every effort to use external identifiers wherever possible so that we can make these links. This is the core of my complaint. If we are using RDF to foster data integration so we can query across the diverse data sets that speak to biodiversity, then we are doing it wrong.

Update
Here is a nice visualisation of this dataset from @orovellotti (original here), made using ecoRelevé:

[Image: visualisation of this dataset by @orovellotti]

Wednesday, October 19, 2011

TDWG Challenge - what is RDF good for?

Last month, feeling particularly grumpy, I fired off an email to the TDWG-TAG mailing list with the subject Lobbing grenades: a challenge. Here's the email:
It's morning and the coffee hasn't quite kicked in yet, but reading through recent TDWG TAG posts, and mindful of the upcoming meeting in New Orleans (which sadly I won't be attending) I'm seeing a mismatch between the amount of effort being expended on discussions of vocabularies, ontologies, etc. and the concrete results we can point to.

Hence, a challenge:

"What new things have we learnt about biodiversity by converting biodiversity data into RDF?"

I'm not saying we can't learn new things, I'm simply asking what have we learnt so far?

Since around 2006 we have had literally millions of triples in the wild (uBio, ION, Index Fungorum, IPNI, Catalogue of Life, more recently Biodiversity Collections Index, Atlas of Living Australia, World Register of Marine Species, etc.), most of these using the same vocabulary. What new inferences have we made?

Let's make the challenge more concrete. Load all these data sources into a triple store (subchallenge - is this actually possible?). Perhaps add other RDF sources (DBpedia, Bio2RDF, CrossRef). What novel inferences can we make?

I may, of course, simply be in "grumpy old arse" mode, but we have millions of triples in the wild and nothing to show for it. I hope I'm not alone in wondering why...

In the context of the TDWG meeting (happening as we speak and which I'm following via Twitter, hashtag #tdwg) Joel Sachs asked me whether I had any specific data in mind that could form the basis of a discussion. So, here goes. I've assembled some small RDF data sets that it might be fun to play with. Each data set is for frogs, and I've divided them into two sets.

Primary data
These data sets are essentially unmodified RDF fetched from data providers:
  • uniprot.rdf Uniprot RDF for frogs in GenBank
  • ion.rdf Index of Organism Names (ION) RDF for taxonomic names for frogs (filtered to just those names that are also in GenBank, the RDF comes from ION LSIDs)
  • crossref.rdf CrossRef RDF for DOIs for publications that published new frog names (obtained using CrossRef's support for Linked Data for DOIs)
  • dbpedia.rdf Dbpedia RDF for frogs in GenBank (Update 2011-10-20: the dbpedia.rdf file is a bit big, so here is subset.rdf which has just the conservation status and thumbnail image)


These sources give us information on genomics (at least, they tell us which taxa have been sequenced), where and when the original taxonomic description was published, and by whom, as well as some information on conservation status and what the frog looks like (via Dbpedia). Ideally we just load these files into a triple store and then ask a bunch of questions, such as what is the conservation status of frogs sequenced in GenBank?, is there a correlation between the conservation status of a frog and the date it was discovered?, who has described the most frog species?, etc.

My contention is that actually we can't do any of this because the data is siloed due to the lack of shared identifiers and vocabularies (I suspect that there is not a single identifier any of these files share). The only way we can currently link these data sets together is by shared string literals (e.g., taxonomic names), in which case why bother with RDF? So my first challenge is to see whether any of the questions I've just listed can actually be tackled using this data.

Glue
In a slightly more constructive mode, to see if we can make progress I'm providing some additional RDF files, based on projects I'm working on to link data together. These files may help provide some of the missing "glue" to connect these data sets.

  • linkout.rdf The list of links between NCBI and Dbpedia (based on mapping in iPhylo LinkOut)
  • ion_doi.rdf A subset of publications listed in ION have DOIs, this file links the corresponding ION LSIDs to those DOIs (this file is from an ongoing project mapping names to primary literature)


The ion_doi.rdf file links the ION and CrossRef RDF, so we could start to ask questions about dates of discovery, who described what species, etc. The linkout.rdf file links NCBI taxon ids (in this case in the form of UniProt URIs) to Wikipedia (in the form of Dbpedia URIs). Dbpedia has information on conservation status, and some frogs will also have pictures, so we can start to join genomics to conservation, as well as make some visualisations.

Update
I've now added another RDF file for 1000 georeferenced GenBank sequences for frogs. The file is genbank.rdf. This file is generated from a local, processed version of EMBL, and uses a mixture of Dublin Core and TDWG vocabularies. Here's an example of a single record:

<?xml version="1.0"?>
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:tcommon="http://rs.tdwg.org/ontology/voc/Common#"
         xmlns:toccurrence="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#"
         xmlns:uniprot="http://purl.uniprot.org/core/">
  <uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
    <dcterms:created>2008-07-06</dcterms:created>
    <dcterms:modified>2010-12-23</dcterms:modified>
    <dcterms:title>EU566842</dcterms:title>
    <dcterms:description>Xenopus borealis voucher MHNG:Herp:2644.64 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial.</dcterms:description>
    <dcterms:subject rdf:resource="http://purl.uniprot.org/taxonomy/8354"/>
    <dcterms:relation rdf:parseType="Resource">
      <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#TaxonOccurrence"/>
      <toccurrence:identifiedToString>Xenopus borealis</toccurrence:identifiedToString>
      <toccurrence:decimalLatitude>0.66</toccurrence:decimalLatitude>
      <geo:lat>0.66</geo:lat>
      <toccurrence:decimalLongitude>37.5</toccurrence:decimalLongitude>
      <geo:long>37.5</geo:long>
      <toccurrence:verbatimCoordinates>0.66 N 37.5 E</toccurrence:verbatimCoordinates>
      <toccurrence:country>Kenya</toccurrence:country>
      <dcterms:identifier>MHNG:Herp:2644.64</dcterms:identifier>
    </dcterms:relation>
  </uniprot:Molecule>
</rdf:RDF>

I've added this simply so one could do some geographical queries.
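For instance, a bounding-box query over records like the one above might look something like this in Python, using only the standard library's XML parser rather than a triple store. The RECORD string is a trimmed copy of the example record; the function name and the box coordinates (a rough rectangle around Kenya) are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "uniprot": "http://purl.uniprot.org/core/",
}

# Trimmed copy of the example record (only the elements used below).
RECORD = """
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:uniprot="http://purl.uniprot.org/core/">
  <uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
    <dcterms:title>EU566842</dcterms:title>
    <dcterms:relation rdf:parseType="Resource">
      <geo:lat>0.66</geo:lat>
      <geo:long>37.5</geo:long>
    </dcterms:relation>
  </uniprot:Molecule>
</rdf:RDF>
"""


def sequences_in_box(xml_text, south, north, west, east):
    """Return (accession, lat, long) for sequences georeferenced inside the box."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for molecule in root.findall("uniprot:Molecule", NS):
        accession = molecule.findtext("dcterms:title", namespaces=NS)
        for occurrence in molecule.findall("dcterms:relation", NS):
            lat = occurrence.findtext("geo:lat", namespaces=NS)
            lon = occurrence.findtext("geo:long", namespaces=NS)
            if lat is None or lon is None:
                continue  # not georeferenced
            lat, lon = float(lat), float(lon)
            if south <= lat <= north and west <= lon <= east:
                hits.append((accession, lat, lon))
    return hits


hits = sequences_in_box(RECORD, south=-5, north=5, west=33, east=42)
print(hits)
```

Note that this works without any shared identifiers at all, because latitude and longitude are already a shared coordinate system.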

Missing links
There are still lots of missing links here (for example, there's no explicit link between NCBI and ION, so we'd need to create this using taxonomic names), and we could add further links to the literature via sequences for taxa. Then there's the lack of geographic data. We could get some of this via georeferenced sequences in GenBank, but there's no RDF for this (Bio2RDF does have RDF for sequences but it ignores the bulk of the organismal metadata such as voucher specimens and latitude and longitude).

In many ways it's this lack of links that was the point of my original email. The reality is that "linked data" isn't linked to anything like the extent that makes it useful. Simply pumping out RDF won't get us very far until we tackle this problem (see also my earlier post Linked data that isn't: the failings of RDF).

So, if you think RDF is the way to go, please tell me what you can learn from these data files.


Monday, November 09, 2009

iTaxon screencast

Sadly I won't be at TDWG 2009, at least not in person. However, there is a session on wikis, which may contain this brief screencast of my iTaxon experiments. The screencast was made in haste, but tries to convey some of the ideas behind these experiments, especially the idea that by linking data together we can generate more interesting and rich views of objects such as scientific publications. The screencast starts with the page for The amphibian tree of life.


Friday, July 10, 2009

NCBI taxonomy, TDWG vocabularies, and RDF


Lately I've been returning to playing with RDF and triple stores. This is a serious case of déjà vu, as two blogs I've now abandoned will testify (bioGUID and SemAnt). Basically, a combination of frustration with the tools, data cleaning, and the lack of identifiers got in the way of making much progress. I gave up on triple stores for a while, rolling my own Entity–Attribute–Value (EAV) database, which I used for the Elsevier Challenge (EAV databases are essentially key-value databases, CouchDB being a well-known example).

Now, I'm revisiting triple stores and SPARQL, partly because Linked Data is gaining momentum, and partly because we now have a few LSID providers, and some decent vocabularies from TDWG. Having created an LSID resolver that plays nicely with Linked Data (it also does the same thing for DOIs), it's time to dust off SPARQL and see what can be done.

One reason there's interest in having GUIDs and standard vocabularies is so that we can link different sources of information together. But more than just linking, we should be able to compute across these links and learn new things, or at least add annotations from one database to another.

To make this concrete, take the NCBI taxon 101855, Lulworthia uniseptata. If we visit the NCBI page we see links to other resources, such as Index Fungorum record 105488, which tells us that Lulworthia uniseptata was published in Trans. Mycol. Soc. Japan 25(4): 382 (1984), and that the current name is Lulwoana uniseptata, which was published in Mycol. Res. 109(5): 562 (2005).

Wouldn't it be nice to be able to automatically link these things together? And wouldn't it be nice to have identifiers for the literature, rather than only human-readable text strings? Using bioGUID, we can discover that Mycol. Res. 109(5): 562 (2005) has the DOI doi:10.1017/S0953756205002716 -- I haven't found Trans. Mycol. Soc. Japan 25(4): 382 (1984) online anywhere.

Now, given that we have LSIDs for Index Fungorum, I can resolve urn:lsid:indexfungorum.org:names:369395 and discover that

urn:lsid:indexfungorum.org:names:369395 tname:hasBasionym urn:lsid:indexfungorum.org:names:105488

and I can add the statement

urn:lsid:indexfungorum.org:names:369395 tcommon:publishedInCitation doi:10.1017/S0953756205002716

What I'd like to do is link this to the NCBI taxon, so that I can display this additional knowledge in one place (i.e., there is an additional name for this fungus, and where it is published). To do this, I need the NCBI taxonomy in RDF. Turns out that everyone and their dog has been generating RDF versions of the NCBI taxonomy, including Uniprot (source of the diagram above). The problem is, each effort creates its own project-specific vocabulary. For example, here is the record for NCBI taxon 101855 in Uniprot RDF (http://www.uniprot.org/taxonomy/101855):


<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
    <rank rdf:resource="http://purl.uniprot.org/core/Species"/>
    <scientificName>Lulworthia uniseptata</scientificName>
    <otherName>Zalerion maritimum</otherName>
    <rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
    <partOfLineage>false</partOfLineage>
  </rdf:Description>
</rdf:RDF>


Uniprot has its own vocabulary, http://purl.uniprot.org/core/. So, what I'd like to do is create a version of the NCBI taxonomy using TDWG's TaxonConcept vocabulary, so that it becomes straightforward to link NCBI to name databases such as Index Fungorum, IPNI, Zoobank, and ION that are serving taxon names.
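A first pass at that conversion could be sketched roughly as follows in Python. The Uniprot record is a trimmed copy of the one above. The TaxonConcept namespace and the two terms I map to (TaxonConcept and nameString) are assumptions on my part and would need checking against the published vocabulary; the function name is also my own. The parent link is simply carried over as rdfs:subClassOf rather than remodelled.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
CORE = "http://purl.uniprot.org/core/"
# Assumption: namespace (and the term names used below) of TDWG's TaxonConcept vocabulary.
TC = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"

# Trimmed copy of the Uniprot record for NCBI taxon 101855.
UNIPROT_RECORD = """
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
    <scientificName>Lulworthia uniseptata</scientificName>
    <otherName>Zalerion maritimum</otherName>
    <rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
  </rdf:Description>
</rdf:RDF>
"""


def to_tdwg_triples(xml_text):
    """Re-express Uniprot taxon records as (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for record in root.findall(f"{{{RDF}}}Description"):
        taxon = record.get(f"{{{RDF}}}about")
        triples.append((taxon, RDF + "type", TC + "TaxonConcept"))
        name = record.findtext(f"{{{CORE}}}scientificName")
        if name:
            # Assumed target term for the name string.
            triples.append((taxon, TC + "nameString", name))
        parent = record.find(f"{{{RDFS}}}subClassOf")
        if parent is not None:
            # Parent link carried over unchanged.
            triples.append((taxon, RDFS + "subClassOf", parent.get(f"{{{RDF}}}resource")))
    return triples


triples = to_tdwg_triples(UNIPROT_RECORD)
for triple in triples:
    print(triple)
```

This keeps the Uniprot taxon URI as the subject, so anything that already refers to that URI still lines up after the conversion.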

Thursday, June 25, 2009

EOL, Wikipedia, TDWG, LinkedData, and the Vision Thing

Time for more half-baked ideas. There's been a lot of discussion on Twitter about EOL, Linked Data (sometimes abbreviated LOD), and Wikipedia. Pete DeVries (@pjd) is keen on LOD, and has been asking why TDWG isn't playing in this space. I've been muttering dark thoughts about EOL, and singing the praises of Wikipedia. And so it goes on. So, here's one vision of where we could (?should) be going with this.

Let's imagine that we do indeed want to play in the Linked Data space. The concern that tends to be raised the most is that biodiversity informatics uses LSIDs as the standard GUID, and this doesn't play nice with Linked Data. This is true, but not life threatening. There are various hacks (like this and this) that deal with this.

But, the real concern (I think) is that we need a way to link our stuff to the rest of the Linked Data cloud. That is, wherever possible we need to reuse existing identifiers. In the LOD diagram below (for the latest version see here) DBpedia.org is key to linking much of this together, and major players (such as the BBC) are now using DBpedia.org to make connections.



DBpedia.org is based on Wikipedia, so I think you can see where this is going. There are some 120,000+ taxon pages in Wikipedia, so that's some 120,000+ identifiers in DBpedia.org that others interested in organisms can (and will) use to refer to taxa. Given the centrality of Wikipedia and DBpedia to LOD, why don't we adopt DBpedia.org URIs as the default GUID for our taxa? At present we have numerous, competing identifiers (e.g., NCBI tax ids, ITIS tsn's, Catalogue of Life LSIDs, uBio NameBankID's, plus LSIDs from various nomenclators). For users this is a mess -- which one do I use? Deciding requires dealing with issues (such as the difference between nomenclatural codes, and between taxonomic names and concepts, etc.) that, frankly, nobody outside our community cares about.

So, if we want to play with LOD, we need to make our identifiers play nice (straightforward), and we should think seriously about adopting DBpedia.org URIs as the default GUID for taxa.

Now, where does this leave EOL? Well, frankly, it should get out of the business of making web pages for taxa, because Wikipedia owns that space already. Their pages are fewer, but often much more detailed than the corresponding EOL page, and Wikipedia reacts faster to new discoveries. Wikipedia supports community editing, versioning, and quite sophisticated tools for handling bibliographic references.

There's plenty of scope for useful tools and services for EOL to develop, but I think the real game is elsewhere. Now, Wikipedia is far from perfect. It's basically semi-structured text with a God-awful template language, and it would benefit greatly from more structure (e.g., as could be provided by Semantic Mediawiki), but I think we should think about building upon it. We could build our own (and my experiments over at itaxon.org explore this), but the big challenge is getting a community around a project, and if David Shorthouse's pronouncement that The Community is Dead is correct, then maybe we should get on board with the community that already exists. Perhaps what EOL should be doing is talking to Wikipedia, improving the existing templates for taxon pages, and creating bots to automatically populate Wikipedia with more taxon pages.

Wednesday, April 15, 2009

LSIDs, to proxy or not to proxy?

The LSID discussion rumbles on (see my earlier post). One issue that has re-emerged is the use of HTTP proxies in RDF documents. In a recent email Greg Whitbread wrote:

The existing TDWG recommendation that "5. All references to LSIDs within RDF documents should use the proxified form", basically states that LSID will never appear in any way other than bundled into an http URI - if we are also to publish data as RDF.

That sounds as if it means that those wanting to use LSID resolution will first have to extract the LSID part from the http URI which will now appear everywhere we would expect to find our unique identifier.

Donald [Hobern] has presented a strong case for unique identifiers conforming to the LSID specification but we have now an equally strong case that in its http form our identifier must behave as a dereferenceable URN per W3C linked data recommendations.
My own view is that the RDF should always contain a canonical, un-proxied version of an identifier (whether LSID or DOI), because:
  1. having only the proxied version assumes that there is only one suitable proxy (there may be multiple ones)
  2. it assumes that the specified proxy will always exist (our track record in durable HTTP services is poor)
  3. it assumes the specified proxy will always conform to current standards
  4. it imposes an overhead on clients that want the canonical identifier (i.e., they have to strip away the proxy)
I predict that for any meaningful, successful (read "actually used") identifier there will be multiple services that will be capable of consuming that identifier, not just HTTP proxies. DOIs can be proxied (by several servers, including http://dx.doi.org/ and http://hdl.handle.net ), resolved using OpenURL resolvers, etc.

In order to play ball with Linked Data, there are several ways forward:
  1. always refer to LSIDs in their proxied form (see above for reasons why this might not be a good idea)
  2. ensure that at least one proxy exists which can resolve LSIDs in a linked data friendly way (see bioGUID as an example)
  3. use or develop linked data clients that understand LSIDs (e.g., http://linkeddata.uriburner.com/, see this view of urn:lsid:zoobank.org:pub:2C6BD020-B54A-4119-9693-3231C9FCEFA6)
2 and 3 already exist, so I'm not so keen on 1.

For me this is one of the biggest hurdles facing using HTTP URIs as identifiers -- I have to choose one. As an analogy, I can identify a book using an ISBN (say, 0226644677). How do I represent this in RDF? Well, I could use an HTTP URI, say http://www.amazon.com/Tangled-Trees-Phylogeny-Cospeciation-Coevolution/dp/0226644677/ , or maybe http://www.worldcat.org/isbn/0226644677. There are many, many I could choose from. However, so long as I know that the ISBN is 0226644677, I'm free to use whatever URI best suits my needs. So, what I really want is the ISBN by itself.
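The same idea for LSIDs and DOIs can be sketched in a few lines of Python: store the bare identifier, and wrap it in whichever proxy suits the moment. The two DOI proxies are the ones mentioned above; the LSID proxy URL (lsid.example.org) is a made-up placeholder, and the function names are mine.

```python
import re

# DOI proxies mentioned above.
DOI_PROXIES = ["http://dx.doi.org/", "http://hdl.handle.net/"]
# Hypothetical LSID proxy, used only for illustration.
EXAMPLE_LSID_PROXY = "http://lsid.example.org/"

# An LSID is recognisable wherever it sits inside a URI.
LSID_PATTERN = re.compile(r"urn:lsid:[^\s/?#]+", re.IGNORECASE)


def canonical(identifier):
    """Strip any HTTP proxy and return the bare LSID or doi: identifier."""
    lsid = LSID_PATTERN.search(identifier)
    if lsid:
        return lsid.group(0)
    for proxy in DOI_PROXIES:
        if identifier.startswith(proxy):
            return "doi:" + identifier[len(proxy):]
    return identifier


def proxied(identifier, proxy):
    """Wrap a bare identifier in the proxy of the caller's choosing."""
    if identifier.startswith("doi:"):
        identifier = identifier[len("doi:"):]
    return proxy + identifier


lsid = canonical(EXAMPLE_LSID_PROXY
                 + "urn:lsid:zoobank.org:pub:2C6BD020-B54A-4119-9693-3231C9FCEFA6")
doi = canonical("http://dx.doi.org/10.1371/journal.pone.0001787")
print(lsid)
print(doi)
print([proxied(doi, proxy) for proxy in DOI_PROXIES])
```

Going from the canonical form to a proxied URI is trivial string concatenation; going the other way requires knowing every proxy that might have been used, which is the overhead item 4 in the list above complains about.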

Imagine, for example, a publisher such as PLoS or Magnolia Press (publisher of Zootaxa), both of which have recently published taxonomic papers containing LSIDs (e.g., doi:10.1371/journal.pone.0001787). They might want to display LSIDs linked to their own LSID resolver that embellishes the metadata with information they have (e.g., they might wish to highlight links to other content that they host). In a sense this is much the same idea as supported by OpenURL COinS, where OpenURL-format metadata is embedded in an HTML document and the user chooses which resolver to use to resolve the links (including tools such as Zotero).

Having LSIDs prefixed with a HTTP proxy makes these tasks a little harder.

Friday, February 27, 2009

Something's missing from taxonomic name vocabularies

In the wiki examples I've been developing I've been trying to model names using the TDWG LSID vocabularies, particularly TaxonName. Roger Hyam has obviously put a huge amount of work into developing these, and they handle just about everything I need. However, I think that there's one thing missing, namely a way to express the logical relationship between the parts of a multinomial taxonomic name.

For example, consider the fish Chromis circumaurea Pyle, Earle, and Greene, 2008, described by Rich Pyle and colleagues (TED have recently posted a great video of Rich talking about discovering new species of fish). Chromis circumaurea is a species in the genus Chromis, and in the TaxonName vocabulary I can represent this relationship using the term "genusPart", which specifies the name of the genus. In a wiki page this could be a link to a page called "Chromis".

But, which "Chromis"? There are at least three:
  • Chromis Hübner 1819
  • Chromis Lacepède 1802
  • Chromis Cuvier, 1814
Only one of these is the fish (Chromis Cuvier, 1814). Cases of the same name being used for different organisms (homonymy) are not uncommon, so linking to strings isn't adequate to express the relationship between the two parts of the name Chromis circumaurea.

I'd alluded to this issue in my first major foray into RDF and taxonomic names (Taxonomic names, metadata, and the Semantic Web), where I proposed using the Dublin Core term "isPartOf" to link the specific epithet to the genus part. In this case, the link would be between URIs for the names Chromis circumaurea Pyle, Earle, and Greene, 2008 and Chromis Cuvier, 1814.

It's a small point, but without some means to link components of a name we're going to struggle to sensibly answer questions such as listing all the species in a given genus (or, perhaps more correctly, all the species names that have been published in a given genus).
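Here is a toy Python illustration of the difference. The "ex:" URIs are invented for the example (I'm not quoting real identifiers for these names), and the record layout is just a dictionary, not the TaxonName vocabulary itself; "genusPart" holds a string, while "isPartOf" holds a link to a specific genus name.

```python
# Hypothetical URIs; only the names and authorships come from the text above.
NAMES = {
    "ex:chromis-huebner-1819": {"nameComplete": "Chromis", "authorship": "Hübner 1819"},
    "ex:chromis-lacepede-1802": {"nameComplete": "Chromis", "authorship": "Lacepède 1802"},
    "ex:chromis-cuvier-1814": {"nameComplete": "Chromis", "authorship": "Cuvier, 1814"},
    "ex:chromis-circumaurea": {
        "nameComplete": "Chromis circumaurea",
        "authorship": "Pyle, Earle, and Greene, 2008",
        "genusPart": "Chromis",                # a string: ambiguous
        "isPartOf": "ex:chromis-cuvier-1814",  # a link: unambiguous
    },
}


def genera_matching_string(names, genus_part):
    """Every name record whose complete name equals the genusPart string."""
    return sorted(uri for uri, record in names.items()
                  if record["nameComplete"] == genus_part)


def species_in_genus(names, genus_uri):
    """Every name record that links to the given genus name via isPartOf."""
    return sorted(uri for uri, record in names.items()
                  if record.get("isPartOf") == genus_uri)


print(genera_matching_string(NAMES, "Chromis"))           # three candidates
print(species_in_genus(NAMES, "ex:chromis-cuvier-1814"))  # the fish genus only
```

Following the string leaves three homonyms to choose between; following the link gives a single answer, and lets us list the species names published in the fish genus without dragging in the other two.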