iPhylo: integration

Roderic D. M. Page


Wednesday, February 29, 2012

Making biodiversity data sticky: it's all about links

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is that I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier they will stick together. Given enough identifiers and enough data, we could rapidly assemble a "ball" of interconnected data. The diagram below, which I published as part of my Elsevier Challenge entry (preprint, published version), summarises some of the links between diverse kinds of biological data:
[Diagram: Model]

While in principle many of these links should be trivial to create, in practice they aren't. One major obstacle is the lack of globally unique identifiers, or, where such identifiers exist, the fact that they aren't being used. As a result, our data is anything but sticky. In the absence of identifiers, creating links between different data sets can be a significant undertaking. One way to tackle this is to focus on just one kind of link at a time and create a database of those links. The diagram below shows some of the links I've been working on:
[Diagram: Links]

For example, the iPhylo Linkout project creates links between taxon concepts in NCBI and Wikipedia. The iTaxon project is a mapping between taxonomic names and publications. I've briefly explored mapping host-parasite relationships using GenBank, and I'm currently exploring the links between publications and specimens. This list certainly doesn't exhaust the set of possible links, but it's a start. The challenge is to create sufficient links for biodiversity data to finally coalesce and for us to be able to ask questions that span multiple sources and types of data.
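
To make the burr metaphor a little more concrete, here is a minimal Python sketch of how shared identifiers let records coalesce. The records and identifier values are invented for illustration (they don't come from any of the projects above), and the clustering is a plain union-find, which is just one way of doing it.

```python
from collections import defaultdict

# Hypothetical records: each has a local name and the identifiers it mentions.
records = {
    "sequence": {"genbank:X1", "ncbi_taxon:T1", "doi:10.9999/example.1"},
    "paper": {"doi:10.9999/example.1"},
    "specimen": {"specimen:S1", "ncbi_taxon:T1"},
    "orphan": {"local:42"},  # shares no identifier, so it never sticks
}

def cluster(records):
    """Group records that share an identifier, directly or via a chain."""
    parent = {name: name for name in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # identifier -> first record that used it
    for name, ids in records.items():
        for ident in ids:
            if ident in first_seen:
                parent[find(name)] = find(first_seen[ident])
            else:
                first_seen[ident] = name

    groups = defaultdict(set)
    for name in records:
        groups[find(name)].add(name)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

clusters = cluster(records)
print(clusters)
```

The sequence, paper and specimen end up in one "ball" because of the shared DOI and taxon identifier, while the record with only a local identifier stays on its own.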

Wednesday, October 19, 2011

TDWG Challenge - what is RDF good for?

Last month, feeling particularly grumpy, I fired off an email to the TDWG-TAG mailing list with the subject Lobbing grenades: a challenge. Here's the email:

It's morning and the coffee hasn't quite kicked in yet, but reading through recent TDWG TAG posts, and mindful of the upcoming meeting in New Orleans (which sadly I won't be attending) I'm seeing a mismatch between the amount of effort being expended on discussions of vocabularies, ontologies, etc. and the concrete results we can point to.

Hence, a challenge:

"What new things have we learnt about biodiversity by converting biodiversity data into RDF?"

I'm not saying we can't learn new things, I'm simply asking what have we learnt so far?

Since around 2006 we have had literally millions of triples in the wild (uBio, ION, Index Fungorum, IPNI, Catalogue of Life, more recently Biodiversity Collections Index, Atlas of Living Australia, World Register of Marine Species, etc.), most of these using the same vocabulary. What new inferences have we made?

Let's make the challenge more concrete. Load all these data sources into a triple store (subchallenge - is this actually possible?). Perhaps add other RDF sources (DBpedia, Bio2RDF, CrossRef). What novel inferences can we make?

I may, of course, simply be in "grumpy old arse" mode, but we have millions of triples in the wild and nothing to show for it. I hope I'm not alone in wondering why...

In the context of the TDWG meeting (happening as we speak and which I'm following via Twitter, hashtag #tdwg) Joel Sachs asked me whether I had any specific data in mind that could form the basis of a discussion. So, here goes. I've assembled some small RDF data sets that it might be fun to play with. Each data set is for frogs, and I've divided them into two sets.

Primary data
These data sets are essentially unmodified RDF fetched from data providers:

uniprot.rdf Uniprot RDF for frogs in GenBank
ion.rdf Index of Organism Names (ION) RDF for taxonomic names for frogs (filtered to just those names that are also in GenBank, the RDF comes from ION LSIDs)
crossref.rdf CrossRef RDF for DOIs for publications that published new frog names (obtained using CrossRef's support for Linked Data for DOIs)
dbpedia.rdf DBpedia RDF for frogs in GenBank (Update 2011-10-20: the dbpedia.rdf file is a bit big, so here is subset.rdf which has just the conservation status and thumbnail image)

These sources give us information on genomics (at least, they tell us which taxa have been sequenced), where and when the original taxonomic description was published, and by whom, as well as some information on conservation status and what the frog looks like (via DBpedia). Ideally we just load these files into a triple store and then ask a bunch of questions, such as what is the conservation status of frogs sequenced in GenBank?, is there a correlation between the conservation status of a frog and the date it was discovered?, who has described the most frog species?, etc.

My contention is that actually we can't do any of this because the data is siloed due to the lack of shared identifiers and vocabularies (I suspect that there is not a single identifier any of these files share). The only way we can currently link these data sets together is by shared string literals (e.g., taxonomic names), in which case why bother with RDF? So my first challenge is to see whether any of the questions I've just listed can actually be tackled using this data.

Glue
In a slightly more constructive mode, to see if we can make progress I'm providing some additional RDF files, based on projects I'm working on to link data together. These files may help provide some of the missing "glue" to connect these data sets.

linkout.rdf The list of links between NCBI and DBpedia (based on the mapping in iPhylo LinkOut)
ion_doi.rdf A subset of publications listed in ION have DOIs; this file links the corresponding ION LSIDs to those DOIs (this file is from an ongoing project mapping names to primary literature)

The first file links NCBI taxon ids (in this case in the form of UniProt URIs) to Wikipedia (in the form of DBpedia URIs). DBpedia has information on conservation status, and some frogs will also have pictures, so we can start to join genomics to conservation, as well as make some visualisations. The second file links the ION and CrossRef RDF, so we could start to ask questions about dates of discovery, who described what species, etc.
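
As a rough illustration of what the glue buys us, the Python sketch below joins a "this taxon has been sequenced" triple to a conservation status by following a linkout-style link. The triples are hand-written stand-ins: the DBpedia resource, the predicate names (including the use of owl:sameAs for the link) and the status value are all assumptions, not copied from the files. Plain tuples stand in for a triple store.

```python
# Triples are (subject, predicate, object) tuples. All values are illustrative.
TAXON = "http://purl.uniprot.org/taxonomy/8354"
DBP = "http://dbpedia.org/resource/Example_frog"  # hypothetical DBpedia URI

primary = [
    (TAXON, "rdf:type", "uniprot:Taxon"),
    (DBP, "ex:conservationStatus", "LC"),  # hypothetical predicate and value
]
glue = [
    # Assumed shape of the kind of link a file like linkout.rdf could supply.
    (TAXON, "owl:sameAs", DBP),
]

def status_of_sequenced_taxa(triples):
    """Return {taxon URI: status} by following owl:sameAs from taxon to DBpedia."""
    same_as = {s: o for s, p, o in triples if p == "owl:sameAs"}
    status = {s: o for s, p, o in triples if p == "ex:conservationStatus"}
    taxa = [s for s, p, o in triples
            if p == "rdf:type" and o == "uniprot:Taxon"]
    return {t: status[same_as[t]] for t in taxa
            if t in same_as and same_as[t] in status}

without_glue = status_of_sequenced_taxa(primary)
with_glue = status_of_sequenced_taxa(primary + glue)
print(without_glue, with_glue)
```

Without the glue triple the query returns nothing, even though both facts are sitting in the store; with it, the taxon picks up its status.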

Update
I've now added another RDF file for 1000 georeferenced GenBank sequences for frogs. The file is genbank.rdf. This file is generated from a local, processed version of EMBL, and uses a mixture of Dublin Core and TDWG vocabularies. Here's an example of a single record:


<?xml version="1.0"?>
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/" 
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" 
xmlns:owl="http://www.w3.org/2002/07/owl#" 
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
xmlns:tcommon="http://rs.tdwg.org/ontology/voc/Common#" 
xmlns:toccurrence="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#" 
xmlns:uniprot="http://purl.uniprot.org/core/">
  <uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
    <dcterms:created>2008-07-06</dcterms:created>
    <dcterms:modified>2010-12-23</dcterms:modified>
    <dcterms:title>EU566842</dcterms:title>
    <dcterms:description>Xenopus borealis voucher MHNG:Herp:2644.64 
cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial.</dcterms:description>
    <dcterms:subject rdf:resource="http://purl.uniprot.org/taxonomy/8354"/>
    <dcterms:relation rdf:parseType="Resource">
      <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#TaxonOccurrence"/>
      <toccurrence:identifiedToString>Xenopus borealis</toccurrence:identifiedToString>
      <toccurrence:decimalLatitude>0.66</toccurrence:decimalLatitude>
      <geo:lat>0.66</geo:lat>
      <toccurrence:decimalLongitude>37.5</toccurrence:decimalLongitude>
      <geo:long>37.5</geo:long>
      <toccurrence:verbatimCoordinates>0.66 N 37.5 E</toccurrence:verbatimCoordinates>
      <toccurrence:country>Kenya</toccurrence:country>
      <dcterms:identifier>MHNG:Herp:2644.64</dcterms:identifier>
    </dcterms:relation>
  </uniprot:Molecule>
</rdf:RDF>

I've added this simply so one could do some geographical queries.
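
For instance, a bounding-box query over records shaped like the one above might look like the following Python sketch. It assumes the geo:lat and geo:long elements are present, treats the RDF/XML as ordinary XML rather than using a real RDF parser, and uses an approximate box that I've made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
DCTERMS = "http://purl.org/dc/terms/"
UNIPROT = "http://purl.uniprot.org/core/"

# Trimmed copy of the record above (only the fields this query needs).
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:geo="{GEO}"
    xmlns:dcterms="{DCTERMS}" xmlns:uniprot="{UNIPROT}">
  <uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
    <dcterms:relation rdf:parseType="Resource">
      <geo:lat>0.66</geo:lat>
      <geo:long>37.5</geo:long>
    </dcterms:relation>
  </uniprot:Molecule>
</rdf:RDF>"""

def sequences_in_box(rdf_xml, south, west, north, east):
    """Return URIs of sequences whose geo:lat/geo:long fall inside the box."""
    root = ET.fromstring(rdf_xml)
    hits = []
    for mol in root.iter(f"{{{UNIPROT}}}Molecule"):
        lat = mol.find(f".//{{{GEO}}}lat")
        lon = mol.find(f".//{{{GEO}}}long")
        if lat is None or lon is None:
            continue  # not georeferenced
        if (south <= float(lat.text) <= north
                and west <= float(lon.text) <= east):
            hits.append(mol.get(f"{{{RDF}}}about"))
    return hits

# A rough box around Kenya; the bounds are approximate, for illustration only.
in_box = sequences_in_box(SAMPLE, south=-5.0, west=33.5, north=5.5, east=42.0)
print(in_box)
```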

Missing links
There are still lots of missing links here (for example, there's no explicit link between NCBI and ION, so we'd need to create this using taxonomic names), and we could add further links to the literature via sequences for taxa. Then there's the lack of geographic data. We could get some of this via georeferenced sequences in GenBank, but there's no RDF for this (Bio2RDF does have RDF for sequences but it ignores the bulk of the organismal metadata such as voucher specimens and latitude and longitude).

In many ways it's this lack of links that was the point of my original email. The reality is that "linked data" isn't linked to anything like the extent that makes it useful. Simply pumping out RDF won't get us very far until we tackle this problem (see also my earlier post Linked data that isn't: the failings of RDF).

So, if you think RDF is the way to go, please tell me what you can learn from these data files.

Wednesday, May 06, 2009

Integrating and displaying data using RSS

Although I'd been thinking of getting the wiki project ready for e-Biosphere '09 as a challenge entry, lately I've been playing with RSS as a complementary, but quicker, way to achieve some simple integration.

I've been playing with RSS on and off for a while, but what reignited my interest was the swine flu timemap I made last week. The neatest thing about the timemap was how easy it was to make. Just take some RSS that is geotagged and you get the timemap (courtesy of Nick Rabinowitz's wonderful Timemap library).

So, I began to think about taking RSS feeds for, say, journals and taxonomic and genomic databases, adding them together, and displaying them using tools such as timemap (see here for an earlier mock up of some GenBank data). Two obstacles are in the way. The first is that not every data source of interest provides RSS feeds. To address this I've started to develop wrappers around some sources, the first of which is ZooBank.

The second obstacle is that integration requires shared content (e.g., tags, identifiers, or localities). Some integration will be possible geographically (for example, adding geotagged sequences and images to a map), but this won't work for everything. So, I need to spend some time trying to link stuff together. In the case of ZooBank there's some scope for this, as ZooBank metadata sometimes includes DOIs, which enable us to link to the original publication, as well as to bookmarking services such as Connotea. I'm aiming to include these links within the feed, as shown in this snippet (see the <link rel="related"...> element):

<entry>
<title>New Protocetid Whale from the Middle Eocene of Pakistan: Birth on Land, Precocial Development, and Sexual Dimorphism</title>
<link rel="alternate" type="text/html" href="http://zoobank.org/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
<updated>2009-05-06T18:37:34+01:00</updated>
<id>urn:uuid:c8f6be01-2359-1805-8bdb-02f271a95ab4</id>
<content type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith &amp; Iyad S. Zalmout&lt;br/&gt;&lt;a href="http://dx.doi.org/10.1371/journal.pone.0004366"&gt;doi:10.1371/journal.pone.0004366&lt;/a&gt;</content>
<summary type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith &amp; Iyad S. Zalmout&lt;br/&gt;&lt;a href="http://dx.doi.org/10.1371/journal.pone.0004366"&gt;doi:10.1371/journal.pone.0004366&lt;/a&gt;</summary>
<link rel="related" type="text/html" href="http://dx.doi.org/10.1371/journal.pone.0004366" title="doi:10.1371/journal.pone.0004366"/>
<link rel="related" type="text/html" href="http://bioguid.info/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C" title="urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
</entry>
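
A consumer of such a feed could pull out the related links along the lines of the Python sketch below. The sample is a cut-down version of the entry above, wrapped in a feed element so that it carries the Atom namespace; that wrapper is my assumption about what the full feed looks like, and related_links is just a name I've picked for the helper.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# Cut-down entry from above; the enclosing <feed> element is assumed.
FEED = f"""<feed xmlns="{ATOM}">
  <entry>
    <id>urn:uuid:c8f6be01-2359-1805-8bdb-02f271a95ab4</id>
    <link rel="alternate" type="text/html" href="http://zoobank.org/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
    <link rel="related" type="text/html" href="http://dx.doi.org/10.1371/journal.pone.0004366" title="doi:10.1371/journal.pone.0004366"/>
  </entry>
</feed>"""

def related_links(feed_xml):
    """Map each entry id to the titles of its rel="related" links."""
    root = ET.fromstring(feed_xml)
    out = {}
    for entry in root.iter(f"{{{ATOM}}}entry"):
        entry_id = entry.findtext(f"{{{ATOM}}}id")
        out[entry_id] = [
            link.get("title")
            for link in entry.findall(f"{{{ATOM}}}link")
            if link.get("rel") == "related"
        ]
    return out

links = related_links(FEED)
print(links)
```

Because the titles are identifiers such as DOIs, entries from different feeds that list the same identifier can then be matched up.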

What I'm hoping is that there will be enough links to create something rather like my Elsevier Challenge entry, but with a much more diverse set of sources.

Friday, August 15, 2008

DBpedia, and integrating taxonomy with the rest of the linked data world

While biodiversity informatics putters along, generating loads of globally unique identifiers that nobody else uses, perhaps it's time to take a look at the bigger picture. DBpedia is an effort to extract data from Wikipedia and make it available as linked data. At the heart of this effort is the use of HTTP URIs to identify resources, and reusing those URIs. Hence, for many concepts DBpedia URIs are the default option.

Interestingly, in addition to taxa, Wikipedia has pages on prominent (and not so prominent) taxonomists, such as Thomas Say and Henri Milne-Edwards. When it comes to assigning GUIDs to people, DBpedia URIs would be an obvious choice. For example, http://dbpedia.org/resource/Henri_Milne-Edwards is the URI for Henri Milne-Edwards.

This approach has several advantages. For one, it embeds taxonomic authorities in the broader ocean of linked data. It also makes use of Wikipedia to provide biographical details on taxonomic authorities (many of whom are sufficiently noteworthy to appear in Wikipedia). Until we start linking to other data sources, taxonomic data will remain in its own little ghetto.

Thursday, April 03, 2008

Biodiversity informatics: the challenge of linking data and the role of shared identifiers

The manuscript for Briefings in Bioinformatics that I alluded to earlier has been accepted for publication. I've put a preprint up at Nature Precedings (hdl:10101/npre.2008.1760.1). The final version will appear in print later this year.

Thursday, January 31, 2008

How Shall I Integrate Thee? Let Me Count the Ways...

Leigh Dodds has a nice post How Shall I Integrate Thee? Let Me Count the Ways... about different ways to integrate data.

The one where we share identifiers
The one where we're describing the same thing
The one where we're speaking different languages
The one where we're using different units
The one where we're speaking at different levels of abstraction

Apart from the suggestion that Leigh has been watching way too much Friends, there's much food for thought here. I suspect that "The one where we're describing the same thing" is the one I'll be making most use of.

In Rethinking LSIDs versus HTTP URI I argued that most applications will use HTTP URIs, which makes them accessible but not terribly useful as identifiers, because I think it is unlikely that people will reuse each other's HTTP URIs ("The one where we share identifiers"). A good example is Connotea, which has its own URIs for each paper its users bookmark. I won't use these URIs as identifiers in my database (if only because a user who resolves them is taken to Connotea's web site, not mine). However, I will store any PubMed and DOI identifiers, so that somebody aggregating information from Connotea (say, to retrieve user tags) and my database (say, to get links to sequences and specimens) can work out that the Connotea URI and my URI are talking about the same thing.
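
Here is a small Python sketch of that "describing the same thing" test. The record layout and the URIs are placeholders I've made up (they are not real Connotea or database URIs); the only real assumption about the world is that DOIs compare case-insensitively and turn up with a few different prefixes.

```python
def normalise_doi(doi):
    """Lower-case a DOI and strip common prefixes so variants compare equal."""
    doi = doi.strip().lower()
    for prefix in ("doi:", "http://dx.doi.org/", "https://doi.org/"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def same_work(a, b):
    """True if two records share a DOI or a PubMed id, whatever their own URIs."""
    dois_a = {normalise_doi(d) for d in a.get("doi", [])}
    dois_b = {normalise_doi(d) for d in b.get("doi", [])}
    pmids_a, pmids_b = set(a.get("pmid", [])), set(b.get("pmid", []))
    return bool(dois_a & dois_b or pmids_a & pmids_b)

# Hypothetical records with different URIs for the same paper.
bookmark = {"uri": "http://example.org/connotea/abc123",
            "doi": ["doi:10.1371/journal.pone.0004366"]}
mine = {"uri": "http://example.org/mydb/reference/1",
        "doi": ["http://dx.doi.org/10.1371/JOURNAL.PONE.0004366"]}
other = {"uri": "http://example.org/mydb/reference/2",
         "pmid": ["00000000"]}  # placeholder PubMed id

print(same_work(bookmark, mine), same_work(bookmark, other))
```

Neither record needs to know the other's URI; the stored DOI does the sticking.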